Machine Learning for
(Psycho-)Linguistics
Walter Daelemans
daelem@uia.ua.ac.be
http://cnts.uia.ac.be
CNTS, University of Antwerp
ILK, Tilburg University
QITL-02
Outline
• Machine Learning of Language
– Induction of rules and classes
– Learning by Analogy
• Case Studies
– Discovery of phonological categories and
morphological rules
– A single-route model of morphological processing
• Issues
– Probabilities versus symbolic structure induction
– Nativism versus empiricism
– Exemplar analogy versus rules
[Figure: general architecture of inductive machine learning — a performance component maps input to output using rules Ri, Rj, Rk, Rl; a learning component produces those rules by searching through experience, guided by a BIAS.]
Problems with Probabilities
• Explanation
– Also applies to neural networks
• Event relevance
– Especially in unsupervised learning (clustering)
• Incorporation of linguistic knowledge
• Smoothing zero-frequency events
(Symbolic) Machine Learning
• Rule induction (understandable induced theories)
• Inductive Logic Programming (incorporating
linguistic knowledge)
• Memory-based learning (similarity-based
smoothing of sparse data, feature weighting)
• …
Common Fallacies
• Rules = nativism
(and connections = empiricism)
• Generalization = abstraction
(and memory = table-lookup)
Rule-Based ≠ Innate
• Rules can be induced from primary
linguistic data as well
• Applications in Linguistics
– Evaluation and comparison of linguistic
hypotheses
– Discovery of linguistic generalizations and
categories
Allomorphy in the Dutch Diminutive
• “one of the more spectacular phenomena of modern Dutch morphophonemics” (Trommelen 1983)
• Base form of noun + [tje] (5 variants)
• Linguistic theory (from Te Winkel 1862): rime of the last syllable, stress, morphological structure, …
• Trommelen (1983): a local phenomenon; stress and morphological structure do not play a role
• CELEX data (3900 nouns), e.g.
  - b i = - z @ = + m A nt → je
Allomorphs
Allomorph   Example        Count   % types   % tokens
-tje        kikker-tje      1896    48.0      50.9
-etje       roman-etje       395    10.0      10.9
-pje        lichaam-pje      104     2.6       4.0
-kje        koning-kje        77     1.9       3.8
-je         wereld-je       1478    37.4      30.4
Decision Tree Learning
• Given a data set, construct a decision tree that
reflects the structure of the domain
• A decision tree is a tree where
– non-leaf nodes represent features (tests)
– branches leading out of a test represent possible values
for the feature
– leaf nodes represent outcomes (classes)
• Decision Tree can be translated into a set of IF-
THEN rules (with further optimization)
• Value grouping (merging feature values that behave alike)
Decision Tree Construction
Given a set of examples T:
– If T contains one or more cases all belonging to the same class C, then the decision tree for T is a leaf node with category C.
– If T contains different classes, then:
• Choose a feature, and partition T into subsets that have the same value for the chosen feature. The decision tree consists of a node containing the feature name, and a branch for each value leading to a subset.
• Apply the procedure recursively to the subsets created this way.
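A minimal sketch of this recursive procedure in Python (illustrative only; real learners such as C4.5 add gain-ratio feature selection, value grouping, and pruning):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())

def choose_feature(examples, features):
    # Lowest weighted entropy after splitting == highest information gain.
    def split_entropy(feat):
        by_value = {}
        for ex, c in examples:
            by_value.setdefault(ex[feat], []).append(c)
        return sum(len(ls) / len(examples) * entropy(ls) for ls in by_value.values())
    return min(features, key=split_entropy)

def build_tree(examples, features):
    """examples: list of (feature_dict, class_label) pairs."""
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:                 # all one class -> leaf node
        return classes[0]
    if not features:                           # no tests left -> majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    feat = choose_feature(examples, features)
    node = {"feature": feat, "branches": {}}
    for value in {ex[feat] for ex, _ in examples}:   # one branch per observed value
        subset = [(ex, c) for ex, c in examples if ex[feat] == value]
        rest = [f for f in features if f != feat]
        node["branches"][value] = build_tree(subset, rest)
    return node
```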
Induced rule set
Default class is -tje
1. IF coda last is /lm/ or /rm/ THEN -pje
2. IF nucleus last is [+bimoraic] AND coda last is /m/ THEN -pje
3. IF coda last is /N/ THEN
   IF nucleus penultimate is empty or schwa THEN -etje
   ELSE -kje
4. IF nucleus last is [+short] AND coda last is [+nas] or [+liq] THEN -etje
5. IF coda last is [+obstruent] THEN -je
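Read as an ordered classifier, the rule set transcribes almost directly into code. A hedged Python sketch — the phoneme-class sets below are illustrative stand-ins, not the induced CELEX categories themselves:

```python
# Illustrative phoneme classes (DISC-like symbols; NOT the full CELEX inventories).
NASALS     = {"m", "n", "N"}
LIQUIDS    = {"l", "r"}
OBSTRUENTS = {"p", "t", "k", "b", "d", "f", "s", "x", "v", "z"}
SHORT      = {"I", "E", "A", "O"}
BIMORAIC   = {"i", "e", "a", "o", "u", "@"}   # vowels, diphthongs, schwa (partial list)

def diminutive_suffix(coda_last, nucleus_last, nucleus_penult):
    """The five induced rules, tried in order; -tje is the default class."""
    if coda_last in ("lm", "rm"):                         # rule 1
        return "-pje"
    if nucleus_last in BIMORAIC and coda_last == "m":     # rule 2
        return "-pje"
    if coda_last == "N":                                  # rule 3 (/N/ = velar nasal)
        return "-etje" if nucleus_penult in ("", "@") else "-kje"
    if nucleus_last in SHORT and (coda_last in NASALS or coda_last in LIQUIDS):  # rule 4
        return "-etje"
    if coda_last and coda_last[-1] in OBSTRUENTS:         # rule 5
        return "-je"
    return "-tje"                                         # default

print(diminutive_suffix("rm", "A", "i"))   # -> '-pje' (e.g. Dutch 'arm' -> 'armpje')
```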
Results
• Problem is almost perfectly learnable (98.4%)
• More than last syllable is needed for a full solution
• Only rime of last syllable (not stress or onset) is
relevant
• Induced Categories
– Nasals, liquids, obstruents, short vowels, bimoraic vowels (comprising vowels, diphthongs, and schwa)
– Task-dependent categories? Category formation depends on the task to be learned; it is neither absolute nor language-independent
Conclusions: Rule Induction in
Linguistics
• Falsify existing linguistic theories
• Evaluate role of linguistic information sources
• (Re)discover interesting linguistic rules (=
supervised learning)
• (Re)discover interesting linguistic categories (=
unsupervised learning)
• Empiricist alternative to (mostly nativist) rule-based systems
There is one small problem …
• Current methodology for comparative machine
learning experiments is not reliable (especially
with small data)
– Different runs of the algorithm produce different rule sets
– The algorithm can be tweaked to reach high performance with any combination of information sources
– The algorithm is highly sensitive to training data, feature selection, parameter settings, …
• To be used only as a heuristic
– As with your own rule induction module
Word Sense Disambiguation (do)
Similar: experience, material, say, then, …

Accuracy (%)                 Local Context   + keywords
Default                          49.0           47.9
Optimized parameters (LC)        60.8           59.5
Optimized parameters             60.8           61.0
Generalisation ≠ Abstraction
• + generalisation, + abstraction: Rule Induction, Connectionism, Inductive Logic Programming, Statistics
• + generalisation, - abstraction: Memory-Based Learning, …
• - generalisation, + abstraction: Handcrafting (fill in your most hated linguist here)
• - generalisation, - abstraction: Table Lookup
This “rule of nearest neighbor” has considerable
elementary intuitive appeal and probably corresponds to
practice in many situations. For example, it is possible
that much medical diagnosis is influenced by the doctor's
recollection of the subsequent history of an earlier patient
whose symptoms resemble in some way those of the current
patient. (Fix and Hodges, 1952, p.43)
MBL: Use memory traces of experiences as a basis for
analogical reasoning, rather than using rules or other
abstractions extracted from experience and replacing the
experiences.
[Figure: -etje and -kje exemplars plotted in a two-feature space (coda of last syllable × nucleus of last syllable). Rule induction partitions the space into labelled regions; MBL keeps all exemplars and classifies a new item (?) by its nearest neighbours.]
Memory-Based Learning
• Basis: k nearest neighbor algorithm:
– store all examples in memory
– to classify a new instance X, look up the k examples in
memory with the smallest distance D(X,Y) to X
– let each nearest neighbor vote with its class
– classify instance X with the class that has the most
votes in the nearest neighbor set
• Choices:
– similarity metric
– number of nearest neighbors (k)
– voting weights
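A minimal Python sketch of this basic algorithm, using the weighted overlap distance discussed on the next slides (the exemplars and weights are toy values):

```python
from collections import Counter

def overlap_distance(x, y, weights):
    # Weighted overlap: sum the weights of mismatching features.
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

def knn_classify(memory, x, k=1, weights=None):
    """memory: list of (feature_tuple, class_label) exemplars.
    Simplification: takes the k closest exemplars; a full implementation
    would take all exemplars at the k smallest *distances*."""
    weights = weights or [1.0] * len(x)
    nearest = sorted(memory, key=lambda ex: overlap_distance(x, ex[0], weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy exemplars: (coda last syl, nucleus last syl) -> diminutive suffix
memory = [(("r", "E"), "-tje"), (("N", "I"), "-kje"), (("m", "a"), "-pje")]
print(knn_classify(memory, ("N", "O")))   # shares coda /N/ with the -kje exemplar -> '-kje'
```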
Metrics
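The formulas for this slide did not survive extraction. As a stand-in, here is the standard memory-based distance, a weighted overlap metric with information-gain feature weights, as used in Daelemans' TiMBL line of work (a reconstruction, not a copy of the slide):

$$\Delta(X,Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$$

$$w_i = H(C) - \sum_{v \in V_i} P(v)\, H(C \mid v) \qquad \text{(information gain of feature } i\text{)}$$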
Metrics (2)
Voting options:
• Equal weight for each nearest neighbor
• Distance-weighted voting:
– Inverse distance 1/D(X,Y) (Wettschereck, 1994)
– RBF-style Gaussian voting function (Shepard, 1987)
– Linear voting function (Dudani, 1976)
(NB: the weighted NN distribution can be used as a conditional probability)
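For reference, the cited voting schemes can be written as a weight $w_j$ for the $j$-th nearest neighbour at distance $d_j$ (reconstructed from the cited papers, not from the slide):

$$w_j = \frac{1}{d_j} \;\text{(inverse distance)}, \qquad w_j = \frac{d_k - d_j}{d_k - d_1} \;\text{(Dudani's linear rule, } d_k \neq d_1\text{)}, \qquad w_j = e^{-d_j^2} \;\text{(Gaussian)}$$

Normalising the per-class vote mass $\sum_{j:\, c(Y_j)=c} w_j$ over classes yields the conditional class distribution mentioned in the note above.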
MBL Acquisition
• Inflectional process is represented by a set
of exemplars in memory
– Exemplars act as models
– Learning is incremental storage of exemplars
– Compression and Metrics
• An exemplar consists of a set of (mostly symbolic) features
MBL Processing
• New instances of a performance process are
solved through
– Memory-lookup
– Analogical (Similarity-Based) Reasoning
• Similarity metric
– Language (faculty) - independent
– Adaptive (feature and exemplar weighting)
The properties of language
processing tasks …
• Language processing tasks are mappings between
linguistic representation levels that are
– context-sensitive (but mostly local!)
– complex (subregularities, irregularities), with pockets of exceptions
• Similar representations at one linguistic level
correspond to similar representations at the other
level
• Several information sources interact in (often)
unpredictable ways at the same level
• Data is sparse
… fit the bias of MBL
• Inference is based on Similarity-Based /
Analogical Reasoning
• Adaptive data fusion / relevance assignment
is available through feature weighting
• It is a non-parametric approach
• Similarity-based smoothing is implicit
• Regularities and subregularities / exceptions
can be modeled uniformly
German and Dutch plurals
Data & Representation
• Symbolic features
– segmental information (syllable structure)
– stress
– gender
• German Plural (~ 25,000 from CELEX)
Vorlesung (lecture) l e - z U N F en
Classes: e (e)n s er - U- Uer Ue
• Dutch Plural (~ 62,000 from CELEX)
ontruiming (evacuation) 0 - O nt 1 r L - 0 m I N en
Classes: (e)n s (-eren, -i, -a, …)
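A hedged sketch of how a noun could be coded as such a fixed-length instance; the syllable segmentation, padding symbol, and field layout are assumptions for illustration, not the exact CELEX-derived format:

```python
def encode(syllables, gender, suffix_class, n=3, pad="="):
    """syllables: (onset, nucleus, coda) triples for the word's last n syllables;
    shorter words are left-padded with a filler symbol."""
    padded = [(pad, pad, pad)] * (n - len(syllables)) + syllables[-n:]
    features = [part for syl in padded for part in syl]
    return tuple(features + [gender]), suffix_class

# Vorlesung (illustrative segmentation of its last two syllables):
instance, label = encode([("l", "e", ""), ("z", "U", "N")], gender="F", suffix_class="en")
print(instance, "->", label)
```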
Cognitive Architectures of
Inflectional Morphology
• Dual Route (Pinker, Clahsen, Marcus …)
– Rules for regular cases
• (over)generalization
• default behaviour
– Associative memory for exceptions
• irregularization / family effects
• Single Route (R&M, MacWhinney, Plunkett, Elman, …)
– Frequency-based regularity
[Figure: Dual Route architecture — input features feed both a rule component and an associative pattern memory; when memory lookup fails, the rule applies, yielding the suffix class.]
German Plural
• Notoriously complex but routinely acquired
(at age 5)
• Evidence for Dual Route?
– the -s suffix is the default/regular (novel words, surnames, acronyms, …)
– the -s suffix is infrequent (the least frequent of the five most important suffixes)
Class   Total   Umlaut   Frequency   Example
(e)n    11920   —        11920       Abart
e        6656   no        4646       Abbau
                yes       2010       Abdampf
-        4651   no        4402       Aasgeier
                yes        249       Abwasser
er        974   no         287       Abbild
                yes        687       Abgang
s         967   —          967       Abonnement
The default status of -s
• Similar item missing: Fnöhk-s
• Surname, product name: Mann-s
• Borrowings: Kiosk-s
• Acronyms: BMW-s
• Lexicalized phrases: Vergissmeinnicht-s
• Onomatopoeia, truncated roots, derived nouns, …
Discussion
• Three “classes” of plurals: ((-en -)(-e -er))(s)
– the first four suffixes seem “regular” and can be learned accurately from phonology and gender
– -s is learned reasonably well, but information is lacking
• Hypothesis: more “features” (syntactic, semantic, meta-linguistic, …) are needed to enrich the “lexical similarity space”
• No difference in accuracy or speed of learning with and without Umlaut
• Overall generalization accuracy is very high: 95%
• Schema-based learning (Köpcke), e.g. *,*,*,*,i,r,M → e
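To make “schema” concrete: a Köpcke-style schema can be seen as a partially specified exemplar with wildcards, matched positionwise. A sketch, with the reading of the fields assumed:

```python
def matches(schema, instance):
    """A schema is an exemplar with '*' wildcards; it matches positionwise."""
    return all(s == "*" or s == v for s, v in zip(schema, instance))

# The slide's schema, read (hypothetically) as: any noun whose last syllable has
# nucleus 'i' and coda 'r', with masculine gender -> plural class 'e'.
schema = ("*", "*", "*", "*", "i", "r", "M")
print(matches(schema, ("v", "e", "=", "z", "i", "r", "M")))   # True -> class 'e'
```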
Acquisition Data:
Summary of previous studies
• Existing nouns:
(Park 78; Veit 86; Mills 86; Schamer-Wolles 88; Clahsen et al. 93; Sedlak et al. 98)
– Children mainly overapply -e or -(e)n
– -s plurals are learned late
• Novel words:
(Mugdan 77; MacWhinney 78; Phillis & Bouma 80; Schöler & Kany 89)
– Children inflect novel words with -e or -(e)n
– More “irregular” plural forms produced than
“defaults”
MBL simulation
• model overapplies mainly -en
and -e
• -s is learned late and
imperfectly
• Mainly but not completely parallel to input frequency (more -s overgeneralization than -er overgeneralization)
Bartke, Marcus, Clahsen (1995)
• 37 children, aged 3;6 to 6;6
• pictures of imaginary things, presented as neologisms
– names or roots
– rhyming with existing words or not
– choice between -en and -s
• results:
– children are aware that unusual
sounding words require the default
– children are aware that names
require the default
MBL simulation
• sort CELEX data according to rhyme
• compare overgeneralization
– to -en versus to -s
– as a percentage of the total number of errors
• results:
– when new words don’t rhyme, more errors are made
– overgeneralization to -en drops below the level of overgeneralization to -s
Dutch Plural
• Suffixes -en and -s are both defaults, in complementary distribution
• Selection of -en or -s is governed by:
– phonological structure of the base noun (stressed vs. unstressed last syllable)
– morphological structure (suffix of the base noun)
– loan word status
– the semantic feature person vs. thing
– both are possible after // (Baayen et al. 2001)
Feature Relevance
Accuracy on CELEX
• Methodology
– “Leave-one-out”
• Results:
– MBL: 94.9 % accuracy

           Prec   Rec    F
  -(e)n    95.8   97.2   96.4
  -s       93.8   91.4   92.6
  -i       82.0   77.2   79.5

– without stress: 94.9 % accuracy
– last syllable with stress: 92.6 % accuracy
– last syllable without stress: 92.4 % accuracy
– rhyme of last syllable: 89.6 % accuracy
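A minimal sketch of the leave-one-out protocol used here (classify is a placeholder; celex_nouns is a hypothetical dataset in the (features, label) format of the kNN sketch above):

```python
def leave_one_out_accuracy(data, classify):
    """data: list of (features, label); classify(train, x) returns a label."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]     # hold out exactly one instance
        correct += classify(train, x) == label
    return correct / len(data)

# e.g., with the kNN sketch from earlier:
# acc = leave_one_out_accuracy(celex_nouns, lambda train, x: knn_classify(train, x, k=1))
```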
Accuracy on pseudo-words
• Methodology
– Train: Celex (all) and Celex (1000 most frequent types)
– Test: 8 * 10 pseudo-words (Baayen et al., 2001)
dreip - workel - bastus - bestroeting - kloertje
stape - stree - kadisme
• Results: accuracy = number of decisions equal to the subject majority for each item
– Subjects: 87.5 %
– MBL (all): 83.8 %
– MBL (top 1000): 90.0 %
Word (target)       Measure           -s     -en
dreip (-en)         subjects           4      96
                    mblp-decisions     0     100
                    mblp-support       7      93
workel (-s)         subjects          98       2
                    mblp-decisions   100       0
                    mblp-support     100       0
bastus (-en)        subjects           0     100
                    mblp-decisions     0      90
                    mblp-support       0     100
bestroeting (-en)   subjects           1      99
                    mblp-decisions     0     100
                    mblp-support       0     100
kloertje (-s)       subjects         100       0
                    mblp-decisions   100       0
                    mblp-support      81      19
stape (?)           subjects          30      70
                    mblp-decisions    90      10
                    mblp-support      86      14
stree (?)           subjects          30      70
                    mblp-decisions    80      20
                    mblp-support      31      69
kadisme (?)         subjects          25      75
                    mblp-decisions     0     100
                    mblp-support       6      94

Slide annotations: “muidus, muidi — nn: modus, modi” (a CELEX bias); “low frequency and loan word nearest neighbours”.
Conclusions: Memory-Based Single Route
• MBLP picks up the main “schemata” of Dutch and
German plural formation and their exceptions
without recourse to explicit rules or a dual route
architecture
• MBLP trained on (part of) CELEX matches subject behavior on pseudo-words and acquisition data
• Segmental information suffices to reliably predict the plural in Dutch and most plurals in German; additional information is needed for German -s
• Heterogeneity and density in the lexical exemplar space as a source of behavior predictions
Overall Conclusions
• Advantages of symbolic machine learning
methods over ‘pure statistics’
– As a methodology for inducing interpretable
linguistic generalizations and categories
– As a way of introducing an operationalisation
of analogy-based methods into
(psycho)linguistics