Processing short-message
communications in low-resource
languages
Robert Munro
PhD Defense
Stanford University
April 2012
Acknowledgments
• Dissertation Committee: Chris Manning (!), Dan
Jurafsky, and Tapan Parikh
• Oral committee: Chris Potts, Mike Frank
• Stanford Linguistics (esp the Shichi Fukujin)
• Stanford NLP Group
• Volunteers and workers who contributed to the
data used here
Acknowledgments
• Participants:
– Second Annual Symposium on Computing for Development (ACM DEV 2012), Atlanta.
– Global Health & Innovation Conference. Unite For Sight (2012), Yale.
– Microsoft Research Seminar Series, Redmond.
– Fieldwork Forum, University of California, Berkeley.
– Annual Workshop on Machine Translation (2011), EMNLP, Edinburgh.
– Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), Portland.
– The Human Computer Interaction Group, Seminar Series, Stanford.
– AMTA Workshop on Collaborative Crowdsourcing for Translation, Denver.
– 33rd Conference of the African Studies Association of Australasia and the Pacific,
Melbourne.
– International Conference on Crisis Mapping (ICCM 2011), Boston.
– Center for Information Technology Research in the Interest of Society Seminar Series, (2010)
University of California, Berkeley.
– IBM Almaden Research Center Invited Speakers Series, San Jose.
– Annual Conference of the North American Chapter of the Association for Computational
Linguistics (NAACL 2010), Los Angeles.
– Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010),
Los Angeles.
– The Research Centre for Linguistic Typology (RCLT) Seminar Series, Latrobe University,
Melbourne.
Daily potential language exposure
[Chart, repeated across four slides: # of languages (y-axis) by year (x-axis)]
We will never be so under-
resourced as right now
Motivation
• Text messages sent worldwide per year:
– 2000: 1 Trillion
– 2007 (start of PhD): 5 Trillion
– 2012 (estimate): 9 Trillion
• Text messaging
– Most popular form of remote communication in much of the world 1
– Especially in areas of linguistic diversity
– Little research
1 International Telecommunication Union (ITU), 2012. http://www.itu.int/ITU-D/ict/statistics/
ACM, IEEE and ACL publications
[Chart: actual usage (90% coverage) vs. recent research (35% coverage)]
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Publications from dissertation
(slide annotations: ‘as published’, ‘rewritten’, ‘extensions and novel research’)
Munro, R. and Manning, C. D. (2010). Subword variation in text message
classification. Proceedings of the Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL
2010), Los Angeles, CA.
Munro, R. (2011). Subword and spatiotemporal models for identifying
actionable information in Haitian Kreyol. Proceedings of the Fifteenth
Conference on Computational Natural Language Learning (CoNLL 2011), Portland, OR.
Munro, R. and Manning, C. D. (2012). Short message communications:
users, topics, and in-language processing. Proceedings of the Second
Annual Symposium on Computing for Development (ACM DEV 2012),
Atlanta, GA.
Munro, R. and Manning, C. D. (accepted). Accurate Unsupervised Joint
Named-Entity Extraction from Unaligned Parallel Text. Proceedings of the
Named Entities Workshop (NEWS 2012), Jeju, Korea.
Written dissertation,
April 2012
Data – short messages used here
• 600 text messages sent between health workers in
Malawi, in Chichewa
• 40,000 text messages sent from the Haitian
population, in Haitian Kreyol
• 500 text messages sent from the Pakistani
population, in Urdu
• Twitter messages from Haiti and Pakistan
• English translations
Chichewa, Malawi
• 600 text messages sent between
health workers, with translations
and 0-9 labels
Bantu
1. Patient-related
2. Clinic-admin
3. Technological
4. Response
5. Request for doctor
6. Medical advice
7. TB: tuberculosis
8. HIV
9. Death of patient
Haitian Kreyol
• 40,000 text messages sent from the
Haitian population to international
relief efforts
– ~40 labels (request for food, emergency,
logistics, etc)
– Translations
– Named-entities
• 60,000 tweets
Urdu, Pakistan
• 500 text messages sent from the Pakistani
population to international relief efforts
– ~40 labels
– Translations
• 1,000 tweets
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Most NLP research to date assumes the
standardization found in written English
English
• Generations of standardization in spelling and
simple morphology
– Whole words suitable as features for NLP systems
• Most other languages
– Relatively complex morphology
– Less (observed) standardized spellings
– More dialectal variation
• ‘Subword variation’ is used here to refer to any difference
in forms resulting from the above
The extent of the subword variation
• >30 spellings of odwala (‘patient’) in Chichewa
• >50% of the variants of ‘odwala’ occur only once in the
data used here:
– Affixes and incorporation
• ‘kwaodwala’ -> ‘kwa + odwala’
• ‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present)
– Phonological/Orthographic
• ‘odwara’ -> ‘odwala’
• ‘ndiwodwala’ -> ‘ndi (w) odwala’
Chichewa
The word odwala (‘patient’) in 600
text-messages in Chichewa and the
English translations
Chichewa
• Morphology: affixes and incorporation
ndi-ta-ma-mu-fun-a-nso
1PS-IMPLORE-PRESENT-2PS-want-VERB-also
“I am also currently wanting you very much”
a-ta-ma-ka-fun-a-nso
class2.PL-IMPLORE-PRESENT-class12.SG-want-VERB-also
“They are also currently wanting it very much”
• More than 30 forms for fun (‘want’), 80% novel
Haitian Krèyol
• More or less French spellings
• More or less phonetic spellings
• Frequent words (esp pronouns) are shortened and
compounded
• Regional slang / abbreviations
Haitian Krèyol
mèsi, mesi,
mèci, merci
C a p - H a ï t i e n
K a p a y i s y e n
Urdu
• The least variable of the three languages studied here
– Derivational morphology
• Zaroori / zaroorath
– Vowels and nonphonemic characters
• Zaruri / zaroorat
zaroori (‘need’)
If it follows patterns, we can model it
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Subword models
• Segmentation
– Separate into constituent morphemes:
nditamamufunanso -> ndi-ta-ma-mu-fun-a-nso
• Normalization
– Model phonological, orthographic, more or less phonetic
spellings:
odwela, edwala, odwara -> odwala
Language Specific
• Segmentation
– Hand-coded
morphological parser
(Mchombo, 2004; Paas,
2005) 1
• Normalization
– Rule-based
ph -> f, etc.
1 robertmunro.com/research/chichewa.php
Table 4.1: Morphological paradigms for Chichewa verbs
Language Independent
• Segmentation (Goldwater et al., 2009)
– Context-sensitive Hierarchical Dirichlet Process, with
morphemes m_i drawn from a distribution G generated
from a Dirichlet Process DP(α0, P0), and H_m = DP for a
specific morpheme (a hedged reconstruction follows this slide):
• Extension to morphology:
– Enforce existing spaces as morpheme boundaries
– Identify free morphemes as min P0, per word
ndi mafuna -> ndi-ma-funa manthwala
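The equation itself appeared as an image on the slide. Below is a hedged reconstruction of the bigram HDP segmentation model of Goldwater et al. (2009) that the bullet above paraphrases; the notation on the original slide may differ.

```latex
% Hedged reconstruction (not verbatim from the slide): bigram HDP model over
% morphemes, following Goldwater et al. (2009).
\begin{align*}
  G &\sim \mathrm{DP}(\alpha_0, P_0) && \text{base distribution over morphemes} \\
  H_m &\sim \mathrm{DP}(\alpha_1, G) && \text{one DP per conditioning morpheme } m \\
  m_i \mid m_{i-1} = m &\sim H_m && \text{each morpheme drawn given its predecessor}
\end{align*}
```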
Language Independent
Table 4.3: Phonetically,
phonologically & orthographically
motivated alternation candidates.
• Normalization
– Motivated from minimal pairs in
the corpus, C
– Substitution, H, applied to a word, w, producing w' iff w' ∈ C
ndiwodwala -> ndiodwala
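A minimal sketch of the constraint described above, assuming a list of alternation rules H (in the spirit of Table 4.3) and a corpus vocabulary C; the rule list and vocabulary here are illustrative only, not the dissertation's actual data.

```python
# Hedged sketch: apply candidate substitutions H to a word, but accept a
# normalized form w' only if w' is itself attested in the corpus C.
def normalize(word, rules, corpus_vocab):
    """Return the first normalized form of `word` attested in the corpus,
    or `word` unchanged if no constrained substitution applies."""
    for source, target in rules:
        if source in word:
            candidate = word.replace(source, target)
            if candidate in corpus_vocab:   # the w' ∈ C constraint
                return candidate
    return word

# Illustrative data (not the dissertation's actual rule table or corpus):
rules = [("wo", "o"), ("r", "l"), ("ph", "f")]
corpus_vocab = {"ndiodwala", "odwala", "manthwala"}

print(normalize("ndiwodwala", rules, corpus_vocab))  # -> ndiodwala
print(normalize("odwara", rules, corpus_vocab))      # -> odwala
```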
Evaluation – downstream accuracy
• Most morphological parsers are evaluated on gold
data and limited to prefixes or suffixes only:
– Linguistica (Goldsmith, 2001), Morfessor (Creutz, 2006)
• Classification accuracy (macro-f, all labels):
Chichewa          Language-specific   Language-independent
Segmentation:     0.476               0.425
Normalization:    0.396               0.443
Combined:         0.484               0.459
Other subword modeling results
• Stemming vs Segmentation
– Stemming can harm Chichewa 1
– Segmentation most accurate when modeling discontinuous
morphemes 1
• Hand-crafted parser
– Over-segments non-verbs (cf Porter Stemmer for English)
– Under-segments compounds
• Acronym identification
– Improves accuracy & can be broadly implemented 1
1 Munro and Manning, (2010)
Are subword models needed for
classification?
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Classification
• Stanford Classifier
– Maximum Entropy Classifier (Klein and Manning, 2003)
• Binary prediction of the labels associated with each
message
– Leave-one-out cross-validation
– Micro-f
• Comparison of methods with and without subword
models
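A rough sketch of this evaluation setup, using scikit-learn's LogisticRegression as a stand-in for the Stanford maximum entropy classifier; the messages, label, and feature extraction are toy placeholders, not the actual corpus or pipeline.

```python
# Hedged sketch: binary prediction of one label with leave-one-out
# cross-validation, scored with micro-averaged F1 (toy data only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import f1_score

# Toy segmented messages and one binary label (e.g. "request for aid").
messages = ["ndi fun man twala", "ndi fun madzi", "odwala ali bwino",
            "zikomo kwambiri", "ndi fun man twala msanga", "tili bwino"]
labels = np.array([1, 1, 0, 0, 1, 0])

X = CountVectorizer().fit_transform(messages)
predictions = np.zeros_like(labels)

# Leave-one-out: train on all messages but one, predict the held-out message.
for train_idx, test_idx in LeaveOneOut().split(messages):
    clf = LogisticRegression(max_iter=1000)  # stand-in for a MaxEnt classifier
    clf.fit(X[train_idx], labels[train_idx])
    predictions[test_idx] = clf.predict(X[test_idx])

print("micro-F1:", f1_score(labels, predictions, average="micro"))
```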
Strategy
ndimmafunamanthwala
(‘I currently need medicine’)
ndimafunamantwala
ndi-ma-fun-a man-twala
ndi-ma-fun-a man-twala
ndi -fun man-twala
(“I need medicine”)
Category = “Request for aid”
ndi kufuni mantwara
(‘my want of medicine’)
ndi kufuni mantwala
ndi-ku-fun-i man-twala
ndi-ku-fun-iman-twala
ndi -fun man-twala
(“I need medicine”)
Category = “Request for aid”
1) Normalize spellings
2) Segment
3) Identify predictors
1 in 5 classification
errors with raw
messages
1 in 20 classification errors after post-processing.
Improves with scale.
Comparison with English
[Chart: Micro-F accuracy (y-axis) vs. percent of training data (x-axis)]
Streaming architecture
• Potential accuracy in a live, constantly updating
system
– Time sensitive and time-changing
• Kreyol ‘is actionable’ category
– Any message that could be responded to
(request for water, medical assistance, clustered
requests for food, etc.)
Streaming architecture
• Build from initial items
• Predict (and evaluate) on incoming items
– (penalty for training)
• Repeat / retrain
[Diagram, repeated across several slides: the model is retrained over time as new messages arrive]
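A schematic sketch of the loop these slides describe; the classifier, feature hashing, and retraining cadence below are illustrative assumptions, not the dissertation's actual settings.

```python
# Hedged sketch of the streaming setup: seed a model on initial items, predict
# each incoming message before adding it to the training pool, retrain periodically.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def stream_classify(initial, incoming, retrain_every=100):
    """initial/incoming: iterables of (text, label); `initial` must contain
    both classes. Labels of incoming items become available after triage."""
    vec = HashingVectorizer(n_features=2**16)
    texts = [t for t, _ in initial]
    labels = [y for _, y in initial]
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)

    predictions = []
    for i, (text, label) in enumerate(incoming, 1):
        predictions.append(clf.predict(vec.transform([text]))[0])  # evaluate first
        texts.append(text)
        labels.append(label)
        if i % retrain_every == 0:   # periodic retraining ("penalty for training")
            clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
    return predictions
```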
Features
• G : Words and ngrams
• W : Subword patterns
• P : Source of the message
• T : Time received
• C : Categories (c0, ..., c47)
• L : Location (longitude and latitude)
• L : Has-location (a location is written in the
message)
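An illustrative sketch of how features of these kinds might be encoded as strings for a MaxEnt-style classifier; the prefixes, bucketing, and function name are assumptions, not the dissertation's exact feature templates.

```python
# Hedged sketch: encode one message as sparse string features of the kinds
# listed above (words/ngrams, subword patterns, source, time, location flag).
def extract_features(text, source, hour, has_location):
    words = text.lower().split()
    feats = [f"G_{w}" for w in words]                          # words
    feats += [f"G_{a}_{b}" for a, b in zip(words, words[1:])]  # bigrams
    for w in words:                                            # subword patterns
        feats += [f"W_{w[:k]}" for k in (3, 4) if len(w) > k]
    feats.append(f"P_{source}")                                # source of the message
    feats.append(f"T_hour_{hour // 6}")                        # coarse time bucket
    feats.append(f"L_has_location_{has_location}")             # location mentioned?
    return feats

print(extract_features("ndi fun man twala Milot", "sms", 14, True))
```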
Hierarchical prediction for ‘is actionable’
[Diagram: parallel streaming models over time, one predicting ‘category 1’ … ‘category n’, one predicting ‘has location’, and one predicting ‘is actionable’]
• The ‘is actionable’ model combines its features with predictions from the Category and Has-Location models
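A small sketch of the stacking idea, assuming the base models' probability outputs are simply appended to the feature matrix of the 'is actionable' model; this is an illustration of the combination, not the actual architecture.

```python
# Hedged sketch: feed predictions of the category and has-location models
# into the 'is actionable' model as additional features (stacking).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_actionable(X, category_probs, has_location_probs, y_actionable):
    """X: dense base feature matrix (n, d); category_probs: (n, k) outputs of
    the per-category models; has_location_probs: (n,) outputs; y: 0/1 labels."""
    stacked = np.hstack([X, category_probs, has_location_probs.reshape(-1, 1)])
    return LogisticRegression(max_iter=1000).fit(stacked, y_actionable)
```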
Results – subword models
• Also a gain in streaming models
Precision Recall F-value
Baseline 0.622 0.124 0.207
W Subword 0.548 0.233 0.326
Results – overall
• Gain of F > 0.6 for full hierarchical system, over
baseline of words/phrases only
Precision Recall F-value
Baseline 0.622 0.124 0.207
Final 0.872 0.840 0.855
Other classification results
• Urdu and English
– Subword models improve Urdu & English tweets 1
• Domain dependence
– Modeling the source improves accuracy 1
• Semi-supervised streaming models
– Lower F-value but consistent prioritization 2
• Hierarchical streaming predictions
– Outperforms oracle for ‘has location’ 2
• Extension with topic models
– Improves non-contiguous morphemes 3
1 Munro and Manning, (2012); 2 Munro, (2011); 3 Munro and Manning, (2010)
Can we move beyond classification to
information extraction?
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Named Entity Recognition
• Identifying mentions of People, Locations, and
Organizations
– Information extraction / parsing / Q+A
• Typically a high-resource task
– Tagged corpus (Finkel and Manning, 2010)
– Extensive hand-crafted rules (Chiticariu, 2010)
• How far can we get with loosely aligned text?
– One of the only resources for most languages
Example
Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre
pou li resevwa moun malad e lap mande pou moun ki malad yo
ale la.
Sacre-Coeur Hospital which located in this village Milot 14 km
south of Oakp is ready to receive those who are injured.
Therefore, we are asking those who are sick to report to that
hospital.
The intuition
Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre
pou li resevwa moun malad e lap mande pou moun ki malad yo
ale la.
Sacre-Coeur Hospital which located in this village Milot 14 km
south of Oakp is ready to receive those who are injured.
Therefore, we are asking those who are sick to report to that
hospital.
Do named entities have the least edit distance?
The complications
Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre
pou li resevwa moun malad e lap mande pou moun ki malad yo
ale la.
Sacre-Coeur Hospital which located in this village Milot 14 km
south of Oakp is ready to receive those who are injured.
Therefore, we are asking those who are sick to report to that
hospital.
Capitalization of entities
was not always consistent
Slang/abbreviations/alternate spellings for ‘Okap’ are
frequent: ‘Cap-Haitien’, ‘Cap Haitien’, ‘Kap’, ‘Kapayisyen’
3 Steps for Named Entity Recognition
1. Generate seeds by calculating the edit likelihood
deviation.
2. Learn context, word-shape and alignment models.
3. Learn weighted models incorporating supervised
predictions.
Step 1: Edit distance (Levenshtein)
• Number of substitutions, deletions or additions to
convert one string to another
– Minimum Edit Distance: min between parallel text
– String Similarity Estimate: normalized by length
– Edit Likelihood Deviation: similarity, relative to average
similarity in parallel text (z-score)
– Weighted Deviation Estimate: combination of Edit
Likelihood Deviation and String Similarity Estimate
Example
C a p - H a ï t i e n
K a p a y i s y e n
• Edit distance: 6
• String Similarity: ~0.45
“Voye manje medikaman pou moun kie nan lopital Kapayisyen”
“Send food and medicine for people in the Cap Haitian hospitals”
– Average & standard dev similarity: μ=0.12, σ=0.05
– Edit Likelihood Deviation: 6.5 (good candidate)
“Voye manje medikaman pou moun kie nan lopital Kapayisyen”
“They said to send manje medikaman for lopital Cap Haitian”
– Average & standard dev similarity: μ=0.21, σ=0.11
– Edit Likelihood Deviation: 2.2 (doubtful candidate)
Equations for edit-distance based metrics
• Given a string S in a message M and a string S′ in the translation M′:
Levenshtein distance LEV()
String Similarity Estimate SSE()
Average AV()
Standard Deviation SD()
Edit Likelihood Deviation ELD()
Normalizing Function N()
Weighted Deviation Estimate WDE()
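The equations on this slide were rendered as images. The sketch below reconstructs the metrics from the verbal definitions two slides earlier; the exact normalizations are assumptions (SSE here divides by the longer string's length, which matches the ~0.45 figure in the worked example, and the WDE combination using a squashing N() is a guess).

```python
# Hedged reconstruction of the edit-distance metrics described above.
import statistics

def lev(a, b):
    """Levenshtein distance LEV(): substitutions, deletions, additions (case-folded)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sse(a, b):
    """String Similarity Estimate SSE(): edit distance normalized by length."""
    return 1.0 - lev(a, b) / max(len(a), len(b))

def eld(a, b, pairs):
    """Edit Likelihood Deviation ELD(): z-score of SSE(a, b) relative to the
    similarities of the candidate pairs in the same message/translation."""
    sims = [sse(s, t) for s, t in pairs]
    mu, sd = statistics.mean(sims), statistics.pstdev(sims)
    return (sse(a, b) - mu) / sd if sd else 0.0

def wde(a, b, pairs):
    """Weighted Deviation Estimate WDE(): combination of ELD and SSE
    (assumed here: SSE times ELD squashed into (-1, 1) by N())."""
    z = eld(a, b, pairs)
    return (z / (1.0 + abs(z))) * sse(a, b)

print(lev("Cap-Haitien", "Kapayisyen"))            # 6, as in the worked example
print(round(sse("Cap-Haitien", "Kapayisyen"), 2))  # ~0.45
```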
Comparison of edit-distance based metrics
• Past research used global edit-distance metrics (Song and Strassel, 2008)
– This line of research was not pursued after the REFLEX workshop
• Novel to this research: local deviation in edit-distance
[Chart: precision (y-axis) vs. entity candidates ordered by confidence (x-axis)]
Step 2: Seeding a model
• Take the top 5% matches by WDE()
– Assign an ‘entity’ label
• Take the bottom 5% matches by WDE()
– Assign a ‘not-entity’ label
• Learn a model
• Note: the bottom 5% were still the best match for
the given message/translation
– Targeting the boundary conditions
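A minimal sketch of the seeding step described on this slide, assuming one best-match candidate pair per message/translation with a precomputed WDE score; the data representation is an assumption.

```python
# Hedged sketch: label the top 5% of best-match candidates (by WDE) as
# 'entity' seeds and the bottom 5% as 'not-entity' seeds.
def seed_labels(candidates, proportion=0.05):
    """candidates: list of (kreyol_phrase, english_phrase, wde_score),
    one best-match pair per message/translation pair."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    k = max(1, int(len(ranked) * proportion))
    seeds = [(kr, en, "entity") for kr, en, _ in ranked[:k]]
    seeds += [(kr, en, "not-entity") for kr, en, _ in ranked[-k:]]
    return seeds
```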
Features
… ki nan vil Milot, 14 km nan sid …
… located in this village Milot 14 km south of …
• Context: BEF_vil, AFT_14 / BEF_village, AFT_14
• Word Shape: SHP_Ccp / SHP_Cc
• Subword: SUB_<b>Mi, SUB_<b>Mil, SUB_il, …
• Alignment: ALN_8_words, ALN_4_perc
• Combinations: SHP_Cc_ALN_4_perc, …
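An illustrative sketch of how the feature strings above could be generated for one candidate token; the word-shape encoding and alignment buckets are inferred from the examples on the slide, and the function name is hypothetical.

```python
# Hedged sketch: context, word-shape, subword and alignment features for a
# candidate entity token, mirroring the examples above.
def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'Milot,' -> 'Ccp'."""
    shape = []
    for ch in token:
        code = "C" if ch.isupper() else "c" if ch.islower() else "d" if ch.isdigit() else "p"
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)

def candidate_features(tokens, i, aligned_offset_words, aligned_offset_perc):
    tok = tokens[i]
    feats = [f"BEF_{tokens[i - 1]}" if i > 0 else "BEF_<s>",
             f"AFT_{tokens[i + 1]}" if i + 1 < len(tokens) else "AFT_</s>",
             f"SHP_{word_shape(tok)}"]
    feats += [f"SUB_<b>{tok[:k]}" for k in (2, 3) if len(tok) >= k]   # prefix substrings
    feats += [f"ALN_{aligned_offset_words}_words", f"ALN_{aligned_offset_perc}_perc"]
    feats.append(f"SHP_{word_shape(tok)}_ALN_{aligned_offset_perc}_perc")  # combination
    return feats

print(candidate_features("ki nan vil Milot, 14 km nan sid".split(), 3, 8, 4))
```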
Strong results
• Joint-learning across both languages
Precision Recall F-value
Kreyol 0.904 0.794 0.846
English 0.915 0.813 0.861
• Language-specific:
Precision Recall F-value
Kreyol 0.907 0.687 0.781
English 0.932 0.766 0.840
Effective extension over edit-distance
[Chart comparing the joint-prediction model against the String Similarity Estimate baseline]
Domain adaptation
• Joint-learning across both languages
Precision Recall F-value
Kreyol 0.904 0.794 0.846
English 0.915 0.813 0.861
• Supervised (MUC/CoNLL-trained Stanford NER):
Precision Recall F-value
English 0.915 0.206 0.336
Completely unsupervised, using ~3,000
sentences loosely aligned with Kreyol
Fully supervised, trained over 10,000s of
manually tagged sentences in English
Step 3: Combined supervised model
… ki nan vil Milot, 14 km nan sid …
… located in this village Milot 14 km south of …
Step 3a: Tag English sequences from a model trained on English
corpora (Sang, 2002; Sang and De Meulder, 2003; Finkel and
Manning, 2010)
Step 3b: Propagate across the candidate alignments, in
combination with features (context, word-shape, etc)
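A rough sketch of one plausible reading of Step 3b, in which supervised English tags are copied to their aligned Kreyol candidates as extra features; the data structures and tag set here are assumptions, not the dissertation's implementation.

```python
# Hedged sketch: propagate supervised English NER tags to aligned Kreyol
# candidates (Step 3b), producing extra features for the joint model.
def propagate_tags(english_tags, alignments):
    """english_tags: {english_token: tag} from a supervised English NER model.
    alignments: list of (kreyol_token, english_token) candidate pairs."""
    propagated = {}
    for kreyol_tok, english_tok in alignments:
        tag = english_tags.get(english_tok, "O")
        propagated.setdefault(kreyol_tok, []).append(f"ENG_TAG_{tag}")
    return propagated

english_tags = {"Milot": "LOCATION", "km": "O"}
alignments = [("Milot,", "Milot"), ("km", "km")]
print(propagate_tags(english_tags, alignments))
```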
Combined supervised model
• Joint-learning across both languages
Precision Recall F-value
Kreyol 0.904 0.794 0.846
English 0.915 0.813 0.861
• Combined Supervised and Unsupervised
Precision Recall F-value
Kreyol 0.838 0.902 0.869
English 0.846 0.916 0.880
Other information extraction results
• Other edit-distance functions (eg: Jaro-Winkler)
– Make little difference in the seed step - the deviation
measure is the key feature 1
• Named entity discrimination
– Distinguishing People, Locations and Organizations is
reasonably accurate with little data 1
• Clustering contexts
– No clear gain – probably due to sparse data
1 Munro and Manning, (accepted)
Outline
• What do short message communications look like in
most languages?
• How can we model the inherent variation?
• Can we create accurate classification systems
despite the variation?
• Can we leverage loosely aligned translations for
information extraction?
Conclusions
• It is necessary to model the subword variation
found in many of the world’s short-message
communications
• Subword models can significantly improve
classification tasks in these languages
• The same subword variation, cross-linguistically, can
be leveraged for accurate named entity recognition
Conclusions
• More research is needed
2000: 1 Trillion
2007 (start of PhD): 5 Trillion
2012 (estimate): 9 Trillion
Thank you
Appendix
Noise reduction for normalization
• Too many long erroneous matches
• The constraints were necessary
ndimafuna manthwala (‘I currently want medicine‘)
ndimaluma manthwala (‘I am eating medicine‘)
~90% similarity
* ‘f’ -> ‘l’
Haitian Krèyol morphology
• Pronoun incorporation?
• Probably a frequency effect
– ‘m’ / ‘sv’ (‘thank you’ / ‘please’)
– Reductions in common verbs (esp copulas)
– Spoken?
Sub-morphemic patterns
• odwala / manthwala (‘patient’ / ‘medicine’)
– ‘wala’ is not morphemic in either word
– Homographic (only?) of wala (‘shine’)
– Etymological relatedness?
• Proto-Bantu (Greenberg, 1979)
-wala (‘count’) (no)
-ala (‘be rotten’) (maybe)
• Proto-Southern Bantu (Botne, 1992)
-ofwala (‘be blind’) (possible)
• Inherited suffix ‘-ia’ for geopolitical names
– ‘Tanzania’, ‘Australia’, etc
Supervised NER performance
• English translations of messages:
– 81.80% capitalized (including honorifics, etc)
– More capitalized entities missed than identified
– Oracle over re-capitalization: P=0.915, R=0.388, F=0.545
• Most frequent lower-case entities:
– ‘carrefour’, ‘delmas’ also missed when capitalized
– Not present in CoNLL/MUC training data
• Conclusion
– Some loss of accuracy to capitalization
– Most loss due to OOV & domain dependence
Analysis of segmentation
• Language independent
– over-segmented all words:
od-wal-a
• Language dependent:
– under-segmented compounds:
odwalamanthwal-a
– over-segmented non-verbs:
ma-nthwal-a
Analysis of normalization
• Language independent:
– most accurate when constrained (‘i’->’y’, ‘w’-> Ø)
• Language dependent
– under-normalized
• Conclusion
– difficult to hand-code normalization rules
– difficult to normalize with no linguistic knowledge
Chichewa morphological parser
[Diagram: the hand-coded parser’s template for the Chichewa verb, showing slots of subject/object and tense/aspect prefixes (e.g. ndi-, u-, a-, chi-, zi-, ka-, ku-, ma-, dza-, si-, ta-), the verb stem, derivational extensions (e.g. -its/-ets, -il/-el, -idw/-edw, -ul), final vowels, and enclitics such as -nso]
Named Entity seeds: Step 1 top/bottom
Highest Weighted Deviation Estimate for best-match cross-linguistic phrases:
English | Kreyol
Lambi Night Club LAMBI NIGHT CLUB
Marie Ange Dry Marie Ange Dry
Rue Lamarre RUE LAMARRE
Rue Lamarre RUE LAMARRE
Bright Frantz BRIGHT FRANTZ
o1-14-99-1966-o6-ooo18 o1-14-99-1966-o6-ooo18
ROCHEMA FLADA JEREMIE ROCHEMA FLADA JEREMIE
Centre Flore Poumar centre flore poumar
makes Boudotte makes boudotte
Pierreminesson Tresil Pierreminesson Tresil
Prosper Lucmane prosper Lucmane
Delma 29 DELMA 29
Santo 22H SANTO 22H
Delorme Frisnel DELORME FRISNEL
Elsaint Jean Bonheur I Elsaint jean Bonheur m
baby too BABY TOU
Rony Fortune RONY FORTUNE
Cote-Plage 24 Impasse cote-plage 24 impasse
Fontamara 27 Fontamara 27
Roland Joseph roland joseph
Promobank PROMOBANK
Johnny Toussaint johnny toussaint
Bertin Titus Route BERTIN TITUS ROUTE
Jean Patrick Guillaume I jean patrick Guillaume m
Lilavois 50 Lilavois 50
immacula chery i immacula chery j
Cajuste Denise cajuste denise
mahotiére 85 mahotiére 85
Pierre Richemond Pierre Richemond
Lowest Weighted Deviation Estimate for best-match cross-linguistic phrases:
English | Kreyol
gotten any Alo mwen abit
can NAP
in Kil
address A dr
an Adr
How nou
stephani lony fleurio depwi darbonne(l
us in svp
water i Mwen f
Saint-Antoine sentantw
de fer bezwen
We need BEZWEN
Laza PASE
Guilène Guil
WE NEED bezwen
in Mangonese nous sommes
yet te
mercy I Mwen s
we are Mezanmi
Lilavois 41 #2 lilavwa 41#2
Many planch
my wife Moins
Leogane Route de PHOTO) HABITER
are: ef
to --- GONA
information en prie
in Route Silvoupl
help i'm at MEZANMI W
send kijwenn
0/O ef
need grenier
english cite please
we are Mezanmi
reach Manj
Global cellphone stats
[Chart: global mobile phone statistics]
International Telecommunication Union (ITU), http://www.itu.int/ITU-D/ict/statistics/ accessed April 20, 2012
Speech and Speech Recognition
• (Outside the scope of the dissertation)
• Goldwater et al., 2009 – applied to word
segmentation in speech recognition
Chichewa – status of noun classes
• Pronouns most likely to be followed by a space
• Then the locative noun classes
• Then the other noun classes
• Subject noun classes more likely to be followed by a
space than Object noun classes
• Data too sparse to measure the combination of
noun-class and Subj/Obj
Contenu connexe

Tendances

Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 

Tendances (20)

Corpus study design
Corpus study designCorpus study design
Corpus study design
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Hidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala languageHidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala language
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
SSSLW 2017
SSSLW 2017SSSLW 2017
SSSLW 2017
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course Lectures
 
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course LecturesAutomatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
 
Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language Processing
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
ReseachPaper
ReseachPaperReseachPaper
ReseachPaper
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Nlp
NlpNlp
Nlp
 
The Usage of Because of-Words in British National Corpus
 The Usage of Because of-Words in British National Corpus The Usage of Because of-Words in British National Corpus
The Usage of Because of-Words in British National Corpus
 
Pedagogical applications of corpus data for English for General and Specific ...
Pedagogical applications of corpus data for English for General and Specific ...Pedagogical applications of corpus data for English for General and Specific ...
Pedagogical applications of corpus data for English for General and Specific ...
 
The Corpus In The Classroom
The Corpus In The ClassroomThe Corpus In The Classroom
The Corpus In The Classroom
 
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech TaggingA Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 

En vedette

Energy for Opportunity, Presentation for E-Discuss
Energy for Opportunity, Presentation for E-DiscussEnergy for Opportunity, Presentation for E-Discuss
Energy for Opportunity, Presentation for E-Discuss
Robert Munro
 
Realtime crowdsourced translation for emergency response and beyond
Realtime crowdsourced translation for emergency response and beyondRealtime crowdsourced translation for emergency response and beyond
Realtime crowdsourced translation for emergency response and beyond
Robert Munro
 

En vedette (9)

Crowdsourcing and Natural Language Processing for Humanitarian Response
Crowdsourcing and Natural Language Processing for Humanitarian Response Crowdsourcing and Natural Language Processing for Humanitarian Response
Crowdsourcing and Natural Language Processing for Humanitarian Response
 
Energy for Opportunity, Presentation for E-Discuss
Energy for Opportunity, Presentation for E-DiscussEnergy for Opportunity, Presentation for E-Discuss
Energy for Opportunity, Presentation for E-Discuss
 
Crowdring
CrowdringCrowdring
Crowdring
 
Subword and spatiotemporal models for identifying actionable information in ...
Subword and spatiotemporal models for identifying actionable information in ...Subword and spatiotemporal models for identifying actionable information in ...
Subword and spatiotemporal models for identifying actionable information in ...
 
Realtime crowdsourced translation for emergency response and beyond
Realtime crowdsourced translation for emergency response and beyondRealtime crowdsourced translation for emergency response and beyond
Realtime crowdsourced translation for emergency response and beyond
 
Talking to the crowd in 7,000 languages
Talking to the crowd in 7,000 languages �Talking to the crowd in 7,000 languages �
Talking to the crowd in 7,000 languages
 
Bringing Data Science to the Speakers of Every Language
Bringing Data Science to the Speakers of Every Language Bringing Data Science to the Speakers of Every Language
Bringing Data Science to the Speakers of Every Language
 
Why Sentiment Analysis is a Market for Lemons … and How to Fix it
Why Sentiment Analysis is a Market for Lemons … and How to Fix it Why Sentiment Analysis is a Market for Lemons … and How to Fix it
Why Sentiment Analysis is a Market for Lemons … and How to Fix it
 
Tracking Epidemics with Natural Language Processing and Crowdsourcing
Tracking Epidemics with Natural Language Processing and Crowdsourcing�Tracking Epidemics with Natural Language Processing and Crowdsourcing�
Tracking Epidemics with Natural Language Processing and Crowdsourcing
 

Similaire à Processing short-message communications in low-resource languages

1. level of language study.pptx
1. level of language study.pptx1. level of language study.pptx
1. level of language study.pptx
AlkadumiHamletto
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
tanishamahajan11
 

Similaire à Processing short-message communications in low-resource languages (20)

NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Speech-Recognition.pptx
Speech-Recognition.pptxSpeech-Recognition.pptx
Speech-Recognition.pptx
 
Intro
IntroIntro
Intro
 
Intro
IntroIntro
Intro
 
1. level of language study.pptx
1. level of language study.pptx1. level of language study.pptx
1. level of language study.pptx
 
Nicoutline
NicoutlineNicoutline
Nicoutline
 
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdfApplied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
SMT3
SMT3SMT3
SMT3
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
American Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech RecognitionAmerican Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech Recognition
 
Decalage is not a dirty word: Teaching simultaneous mode to healthcare interp...
Decalage is not a dirty word: Teaching simultaneous mode to healthcare interp...Decalage is not a dirty word: Teaching simultaneous mode to healthcare interp...
Decalage is not a dirty word: Teaching simultaneous mode to healthcare interp...
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2
 
Morphological Analyzer and Generator for Tamil Language
Morphological Analyzer and Generator for Tamil LanguageMorphological Analyzer and Generator for Tamil Language
Morphological Analyzer and Generator for Tamil Language
 
Natural language processing (NLP)
Natural language processing (NLP) Natural language processing (NLP)
Natural language processing (NLP)
 
Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
AI, don't f$# up my name.pdf
AI, don't f$# up my name.pdfAI, don't f$# up my name.pdf
AI, don't f$# up my name.pdf
 
NCIHC.HFT.45.Missing.the.plot.02-17-2021_vslideshare
NCIHC.HFT.45.Missing.the.plot.02-17-2021_vslideshareNCIHC.HFT.45.Missing.the.plot.02-17-2021_vslideshare
NCIHC.HFT.45.Missing.the.plot.02-17-2021_vslideshare
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Processing short-message communications in low-resource languages

  • 1. Processing short-message communications in low-resource languages Robert Munro PhD Defense Stanford University April 2012
  • 2. Acknowledgments • Dissertation Committee: Chris Manning (!), Dan Jurafsky and Tapan Parikh, • Oral committee: Chris Potts, Mike Frank • Stanford Linguistics (esp the Shichi Fukujin) • Stanford NLP Group • Volunteers and workers who contributed to the data used here
  • 3. Acknowledgments • Participants: – Second Annual Symposium on Computing for Development (ACM DEV 2012), Atlanta. – Global Health & Innovation Conference. Unite For Sight (2012), Yale. – Microsoft Research Seminar Series, Redmond. – Feldwork Forum, University Of California, Berkeley. – Annual Workshop on Machine Translation (2011), EMNLP, Edinburgh. – Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), Portland. – The Human Computer Interaction Group, Seminar Series, Stanford. – AMTA Workshop on Collaborative Crowdsourcing for Translation, Denver. – 33rd Conference of the African Studies Association of Australasia and the Pacific, Melbourne. – International Conference on Crisis Mapping (ICCM 2011), Boston. – Center for Information Technology Research in the Interest of Society Seminar Series, (2010) University of California, Berkeley. – IBM Almaden Research Center Invited Speakers Series, San Jose. – Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles. – Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010), Los Angeles. – The Research Centre for Linguistic Typology (RCLT) Seminar Series, Latrobe University, Melbourne.
  • 4. Daily potential language exposure 5 5 5 5 5 5 4.5 4 50 1500 5000 2000 1400 720 540 500 Year #oflanguages
  • 5. 5 5 5 5 5 5 4.5 4 50 1500 5000 2000 1400 720 540 500 Daily potential language exposure Year #oflanguages
  • 6. 5 5 5 5 5 5 4.5 4 50 1500 5000 2000 1400 720 540 500 Daily potential language exposure Year #oflanguages
  • 7. 5 5 5 5 5 5 4.5 4 50 1500 5000 2000 1400 720 540 500 Daily potential language exposure Year #oflanguages We will never be so under- resourced as right now
  • 8. Motivation 2000: 1 Trillion 2007 (start of PhD): 5 Trillion 2012 (estimate): 9 Trillion 1 International Telecommunication Union (ITU), 2012. http://www.itu.int/ITU-D/ict/statistics/ • Text messaging – Most popular form of remote communication in much of the world 1 – Especially in areas of linguistic diversity – Little research
  • 9. ACM, IEEE and ACL publications Actual Usage Recent Research (90% coverage) (35% coverage)
  • 10. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 11. Publications from dissertation as published extensions and novel research rewritten Munro, R. and Manning, C. D. (2010). Subword variation in text message classification. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles, CA. Munro, R. (2011). Subword and spatiotemporal models for identifying actionable information in Haitian Kreyol. Proceedings of the Fifteenth Conference on Natural Language Learning (CoNLL 2011), Portland, OR. Munro, R. and Manning, C. D. (2012). Short message communications: users, topics, and in-language processing. Proceedings of the Second Annual Symposium on Computing for Development (ACM DEV 2012), Atlanta, GA. Munro, R. and Manning, C. D. (accepted). Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text. Proceedings of the Named Entities Workshop (NEWS 2012), Jeju, Korea. Written dissertation, April 2012
  • 12. Data – short messages used here • 600 text messages sent between health workers in Malawi, in Chichewa • 40,000 text messages sent from the Haitian population, in Haitian Kreyol • 500 text messages sent from the Pakistani population, in Urdu • Twitter messages from Haiti and Pakistan • English translations
  • 13. Chichewa, Malawi • 600 text messages sent between health workers, with translations and 0-9 labels Bantu 1. Patient-related 2. Clinic-admin 3. Technological 4. Response 5. Request for doctor 6. Medical advice 7. TB: tuberculosis 8. HIV 9. Death of patient
  • 14. Haitian Kreyol • 40,000 text messages sent from the Haitian population to international relief efforts – ~40 labels (request for food, emergency, logistics, etc) – Translations – Named-entities • 60,000 tweets
  • 15. Urdu, Pakistan • 500 text messages sent from the Pakistan population to international relief efforts – ~40 labels – Translations • 1,000 tweets
  • 16. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 17. Most NLP research to date assumes the standardization found in written English
  • 18. English • Generations of standardization in spelling and simple morphology – Whole words suitable as features for NLP systems • Most other languages – Relatively complex morphology – Less (observed) standardized spellings – More dialectal variation • ‘Subword variation’ used to refer to any difference in forms resulting from the above
  • 19. The extent of the subword variation • >30 spellings of odwala (‘patient’) in Chichewa • > 50% variants of ‘odwala’ occur only once in the data used here: – Affixes and incorporation • ‘kwaodwala’ -> ‘kwa + odwala’ • ‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present) – Phonological/Orthographic • ‘odwara’ -> ‘odwala’ • ‘ndiwodwala’ -> ‘ndi (w) odwala’
  • 20. Chichewa The word odwala (‘patient’) in 600 text-messages in Chichewa and the English translations
  • 21. Chichewa • Morphology: affixes and incorporation ndi-ta-ma-mu-fun-a-nso 1PS-IMPLORE-PRESENT-2PS-want-VERB-also “I am also currently wanting you very much” a-ta-ma-ka-fun-a-nso class2.PL-IMPLORE-PRESENT-class12.SG-want-VERB-also “They are also currently wanting it very much” • More than 30 forms for fun (‘want’), 80% novel
  • 22. Haitian Krèyol • More or less French spellings • More or less phonetic spellings • Frequent words (esp pronouns) are shortened and compounded • Regional slang / abbreviations
  • 23. Haitian Krèyol mèsi, mesi, mèci, merci C a p - H a ï t i e n K a p a y i s y e n
  • 24. Urdu • The least variant of the three languages here – Derivational morphology • Zaroori / zaroorath – Vowels and nonphonemic characters • Zaruri / zaroorat zaroori (‘need’)
  • 25. If it follows patterns, we can model it
  • 26. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 27. Subword models • Segmentation – Separate into constituent morphemes: nditamamufunanso -> ndi-ta-ma-mu-fun-a-nso • Normalization – Model phonological, orthographic, more or less phonetic spellings: odwela, edwala, odwara -> odwala
  • 28. Language Specific • Segmentation – Hand-coded morphological parser (Mchombo, 2004; Paas, 2005) 1 • Normalization – Rule-based ph -> f, etc. 1 robertmunro.com/research/chichewa.php Table 4.1: Morphological paradigms for Chichewa verbs
  • 29. Language Independent • Segmentation (Goldwater et al., 2009) – Context Sensitive Hierarchical Dirichlet Process, with morphemes, mi drawn from distribution G generated from Dirichlet Process DP(α0, P0), with Hm = DP for a specific morpheme: • Extension to morphology: – Enforce existing spaces as morpheme boundaries – Identify free morphemes as min P0, per word ndi mafuna -> ndi-ma-funa manthwala
  • 30. Language Independent Table 4.3: Phonetically, phonologically & orthographically motivated alternation candidates. • Normalization – Motivated from minimal pairs in the corpus, C – Substitution, H, applied to a word, w, producing w' iff w'  C ndiwodwala -> ndiodwala
  • 31. Evaluation – downstream accuracy • Most morphological parsers are evaluated on gold data and limited to prefixes or suffixes only: – Linguistica (Goldsmith, 2001), Morphessor (Creutz, 2006) • Classification accuracy (macro-f, all labels): Chichewa Language Specific independent Segmentation: 0.476 0.425 Normalization: 0.396 0.443 Combined: 0.484 0.459
  • 32. Other subword modeling results • Stemming vs Segmentation – Stemming can harm Chichewa 1 – Segmentation most accurate when modeling discontinuous morphemes 1 • Hand-crafted parser – Over-segments non-verbs (cf Porter Stemmer for English) – Under-segments compounds • Acronym identification – Improves accuracy & can be broadly implemented 1 1 Munro and Manning, (2010)
  • 33. Are subword models needed for classification?
  • 34. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 35. Classification • Stanford Classifier – Maximum Entropy Classifier (Klein and Manning, 2003) • Binary prediction of the labels associated with each message – Leave-one-out cross-validation – Micro-f • Comparison of methods with and without subword models
  • 36. Strategy ndimmafunamanthwala (‘I currently need medicine’) ndimafunamantwala ndi-ma-fun-a man-twala ndi-ma-fun-a man-twala ndi -fun man-twala (“I need medicine”) Category = “Request for aid” ndi kufuni mantwara (‘mywant of medicine’) ndi kufuni mantwala ndi-ku-fun-i man-twala ndi-ku-fun-iman-twala ndi -fun man-twala (“I need medicine”) Category = “Request for aid” 1) Normalize spellings 2) Segment 3) Identify predictors 1 in 5 classification errors with raw messages 1 in 20 classification error post-processing. Improves with scale.
  • 37. Comparison with English • bcb Accuracy:Micro-f Percent of training data
  • 38. Streaming architecture • Potential accuracy in a live, constantly updating system – Time sensitive and time-changing • Kreyol ‘is actionable’ category – Any message that could be responded to (request for water, medical assistance, clustered requests for food, etc )
  • 39. Streaming architecture • Build from initial items time Model
  • 40. Streaming architecture • Predict (and evaluate) on incoming items – (penalty for training) time Model
  • 41. Streaming architecture • Repeat / retrain time Model
  • 42. Streaming architecture • Repeat / retrain time Model
  • 43. Streaming architecture • Repeat / retrain time Model
  • 44. Streaming architecture • Repeat / retrain time Model
  • 45. Streaming architecture • Repeat / retrain time Model
  • 46. Features • G : Words and ngrams • W : Subword patterns • P : Source of the message • T : Time received • C : Categories (c0,...,47) • L : Location (longitude and latitude) • L : Has-location (a location is written in the message)
  • 47. Hierarchical prediction for ‘is actionable’ time Model time Model time Model time Model … Combines features with predictions from Category and Has-Location models predicting ‘is actionable’ predicting ‘has location’ predicting ‘category 1’ predicting ‘category n’
  • 48. Results – subword models • Also a gain in streaming models Precision Recall F-value Baseline 0.622 0.124 0.207 W Subword 0.548 0.233 0.326
  • 49. Results – overall • Gain of F > 0.6 for full hierarchical system, over baseline of words/phrases only Precision Recall F-value Baseline 0.622 0.124 0.207 Final 0.872 0.840 0.855
  • 50. Other classification results • Urdu and English – Subword models improve Urdu & English tweets 1 • Domain dependence – Modeling the source improves accuracy 1 • Semi-supervised streaming models – Lower F-value but consistent prioritization 2 • Hierarchical streaming predictions – Outperforms oracle for ‘has location’ 2 • Extension with topic models – Improves non-contiguous morphemes 3 1 Munro and Manning, (2012); 2 Munro, (2011); 3 Munro and Manning, (2010)
  • 51. Can we move beyond classification to information extraction?
  • 52. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 53. Named Entity Recognition • Identifying mentions of People, Locations, and Organizations – Information extraction / parsing / Q+A • Typically a high-resource task – Tagged corpus (Finkel and Manning, 2010) – Extensive hand-crafted rules (Chiticarui, 2010) • How far can we get with loosely aligned text? – One of the only resources for most languages
  • 54. Example Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
  • 55. The intuition Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital. Do named entities have the least edit distance?
  • 56. The intuition Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
  • 57. The intuition Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
  • 58. The intuition Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
  • 59. The intuition Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
  • 60. The complications Lopital Sacre-Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital. Capitalization of entities was not always consistent Slang/abbreviations/alternate spellings for ‘Okap’ are frequent: ‘Cap-Haitien’, ‘Cap Haitien’, ‘Kap’, ‘Kapayisyen’
  • 61. 3 Steps for Named Entity Recognition 1. Generate seeds by calculating the edit likelihood deviation. 2. Learn context, word-shape and alignment models. 3. Learn weighted models incorporating supervised predictions.
  • 62. Step 1: Edit distance (Levenshtein) • Number of substitutions, deletions or additions to convert one string to another – Minimum Edit Distance: min between parallel text – String Similarity Estimate: normalized by length – Edit Likelihood Deviation: similarity, relative to average similarity in parallel text (z-score) – Weighted Deviation Estimate: combination of Edit Likelihood Deviation and String Similarity Estimate
  • 63. Example • Edit distance: 6 • String Similarity: ~0.45 “Voye manje medikaman pou moun kie nan lopital Kapayisyen” “Send food and medicine for people in the Cap Haitian hospitals” – Average & standard dev similarity: μ=0.12, σ=0.05 – Edit Likelihood Deviation: 6.5 (good candidate) “Voye manje medikaman pou moun kie nan lopital Kapayisyen” “They said to send manje medikaman for lopital Cap Haitian” – Average & standard dev similarity: μ=0.21, σ=0.11 – Edit Likelihood Deviation: 2.2 (doubtful candidate) C a p - H a ï t i e n K a p a y i s y e n
  • 64. Equations for edit-distance based metrics • Given a string in a message and translation MS, M'S' Levenshtein distance LEV() String Similarity Estimate SSE() Average AV() Standard Deviation SD() Edit Likelihood Deviation ELD() Normalizing Function N() Weighted Deviation Estimate WDE()
  • 65. Comparison of edit-distance based metrics Past research used global edit-distance metrics (Song and Strassel, 2008) This line of research not pursued after REFLEX workshop. Novel to this research: local deviation in edit- distance. Entity candidates, ordered by confidence Precision
  • 66. Step 2: Seeding a model • Take the top 5% matches by WDE() – Assign an ‘entity’ label • Take the bottom 5% matches by WDE() – Assign a ‘not-entity’ label • Learn a model • Note: the bottom 5% were still the best match for the given message/translation – Targeting the boundary conditions
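A minimal sketch of this seeding step, assuming `candidates` is a list of (kreyol_phrase, english_phrase, wde) triples where each triple is already the best match within its own message/translation pair. Names are illustrative, not taken from the dissertation's code.

```python
def seed_labels(candidates, fraction=0.05):
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    seeds  = [(k, e, "entity")     for k, e, _ in ranked[:cutoff]]   # top 5% by WDE
    seeds += [(k, e, "not-entity") for k, e, _ in ranked[-cutoff:]]  # bottom 5%
    return seeds   # training data for the Step 2 model
```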
  • 67. Features … ki nan vil Milot, 14 km nan sid … … located in this village Milot 14 km south of … • Context: BEF_vil, AFT_14 / BEF_village, AFT_14 • Word Shape: SHP_Ccp / SHP_Cc • Subword: SUB_<b>Mi, SUB_<b>Mil, SUB_il, … • Alignment: ALN_8_words, ALN_4_perc • Combinations: SHP_Cc_ALN_4_perc, …
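A rough sketch of these feature templates for one candidate token, its sentence context and an alignment offset. The prefixes mirror the slide (BEF_, AFT_, SHP_, SUB_, ALN_); the exact templates and n-gram lengths in the dissertation may differ.

```python
def word_shape(token):
    """Collapse characters to classes, merging repeats: 'Milot,' -> 'Ccp'."""
    shape = ""
    for ch in token:
        cls = ("C" if ch.isupper() else
               "c" if ch.islower() else
               "d" if ch.isdigit() else "p")
        if not shape.endswith(cls):
            shape += cls
    return shape

def candidate_features(tokens, idx, aln_words, aln_perc):
    tok = tokens[idx]
    feats = [
        "BEF_" + (tokens[idx - 1] if idx > 0 else "<s>"),
        "AFT_" + (tokens[idx + 1] if idx + 1 < len(tokens) else "</s>"),
        "SHP_" + word_shape(tok),
        "ALN_%d_words" % aln_words,
        "ALN_%d_perc" % aln_perc,
    ]
    # subword features: word-initial prefixes (marked '<b>') and internal bigrams
    feats += ["SUB_<b>" + tok[:n] for n in (2, 3)]
    feats += ["SUB_" + tok[i:i + 2] for i in range(1, len(tok) - 1)]
    # a combination feature, as on the slide
    feats.append("SHP_%s_ALN_%d_perc" % (word_shape(tok), aln_perc))
    return feats

# e.g. candidate_features("ki nan vil Milot, 14 km nan sid".split(), 3, 8, 4)
# yields BEF_vil, AFT_14, SHP_Ccp, SUB_<b>Mi, SUB_<b>Mil, SUB_il, ...
```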
• 68. Strong results. Joint learning across both languages: Kreyol P=0.904, R=0.794, F=0.846; English P=0.915, R=0.813, F=0.861. Language-specific models: Kreyol P=0.907, R=0.687, F=0.781; English P=0.932, R=0.766, F=0.840.
• 69. Effective extension over edit-distance. (Chart: the joint-prediction model compared against the String Similarity Estimate baseline.)
• 70. Domain adaptation. Joint learning across both languages, completely unsupervised, using ~3,000 sentences loosely aligned with Kreyol: Kreyol P=0.904, R=0.794, F=0.846; English P=0.915, R=0.813, F=0.861. Supervised (MUC/CoNLL-trained Stanford NER), fully supervised and trained over tens of thousands of manually tagged English sentences: English P=0.915, R=0.206, F=0.336.
  • 71. Step 3: Combined supervised model … ki nan vil Milot, 14 km nan sid … … located in this village Milot 14 km south of … Step 3a: Tag English sequences from a model trained on English corpora (Sang, 2002; Sang and De Meulder, 2003; Finkel and Manning, 2010) Step 3b: Propagate across the candidate alignments, in combination with features (context, word-shape, etc)
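An illustrative sketch only of the propagation in Step 3b: labels from a supervised English tagger (the slide cites CoNLL-trained models such as Stanford NER) are carried over to the aligned Kreyol candidates, so they can be combined with the unsupervised features above. The `english_tagger` here is a stand-in callable, not a real API.

```python
def propagate_tags(aligned_pairs, english_tagger):
    """aligned_pairs: (kreyol_phrase, english_phrase) candidates from Steps 1-2.
    Returns triples with a propagated entity/not-entity signal."""
    propagated = []
    for kreyol_phrase, english_phrase in aligned_pairs:
        tags = english_tagger(english_phrase)         # e.g. ['O', 'LOCATION', ...]
        is_entity = any(tag != "O" for tag in tags)   # any non-O tag counts
        propagated.append((kreyol_phrase, english_phrase, is_entity))
    return propagated
```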
• 72. Combined supervised model. Joint learning across both languages: Kreyol P=0.904, R=0.794, F=0.846; English P=0.915, R=0.813, F=0.861. Combined supervised and unsupervised: Kreyol P=0.838, R=0.902, F=0.869; English P=0.846, R=0.916, F=0.880.
• 73. Other information extraction results. Other edit-distance functions (e.g. Jaro-Winkler) make little difference in the seed step; the deviation measure is the key feature [1]. Named entity discrimination: distinguishing People, Locations and Organizations is reasonably accurate with little data [1]. Clustering contexts: no clear gain, probably due to sparse data [1]. [1] Munro and Manning (accepted).
  • 74. Outline • What do short message communications look like in most languages? • How can we model the inherent variation? • Can we create accurate classification systems despite the variation? • Can we leverage loosely aligned translations for information extraction?
  • 75. Conclusions • It is necessary to model the subword variation found in many of the world’s short-message communications • Subword models can significantly improve classification tasks in these languages • The same subword variation, cross-linguistically, can be leveraged for accurate named entity recognition
• 76. Conclusions. More research is needed. 2000: 1 trillion; 2007 (start of PhD): 5 trillion; 2012 (estimate): 9 trillion.
• 80. Noise reduction for normalization. Too many long erroneous matches; the constraints were necessary. Example: ndimafuna manthwala ('I currently want medicine') vs. ndimaluma manthwala ('I am eating medicine'): ~90% similarity (note the 'f' -> 'l' substitution).
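A sketch of the kind of whitelist constraint this implies, assuming the 'i'->'y' substitution and 'w' deletion mentioned on slide 85 as the permitted edits, and restricting to same-length strings for brevity. The whitelist and the simplification are assumptions, not the dissertation's actual rule set.

```python
ALLOWED_SUBSTITUTIONS = {("i", "y"), ("y", "i")}   # illustrative whitelist

def constrained_match(a, b):
    """Accept only whitelisted character substitutions (same-length case)."""
    if len(a) != len(b):
        return False
    return all(ca == cb or (ca, cb) in ALLOWED_SUBSTITUTIONS
               for ca, cb in zip(a, b))

print(constrained_match("ndimafuna", "ndimaluma"))   # False: 'f'->'l' is not allowed
```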
• 81. Haitian Kreyol morphology • Pronoun incorporation? • Probably a frequency effect – 'm' / 'sv' ('thank you' / 'please') – Reductions in common verbs (esp. copulas) – Spoken?
• 82. Sub-morphemic patterns • odwala / manthwala ('patient' / 'medicine') – 'wala' is not morphemic in either word – Homograph (only?) of wala ('shine') – Etymological relatedness? • Proto-Bantu (Greenberg, 1979): -wala ('count') (no); -ala ('be rotten') (maybe) • Proto-Southern Bantu (Botne, 1992): -ofwala ('be blind') (possible) • Inherited suffix '-ia' for geopolitical names – 'Tanzania', 'Australia', etc.
  • 83. Supervised NER performance • English translations of messages: – 81.80% capitalized (including honorifics, etc) – More capitalized entities missed than identified – Oracle over re-capitalization: P=0.915, R=0.388, F=0.545 • Most frequent lower-case entities: – ‘carrefour’, ‘delmas’ also missed when capitalized – Not present in CoNLL/MUC training data • Conclusion – Some loss of accuracy to capitalization – Most loss due to OOV & domain dependence
  • 84. Analysis of segmentation • Language independent – over segmented all words: od-wal-a • Language dependent: – under-segmented compounds: odwalamanthwal-a – over-segmented non-verbs: ma-nthwal-a
  • 85. Analysis of normalization • Language independent: – most accurate when constrained (‘i’->’y’, ‘w’-> Ø) • Language dependent – under-normalized • Conclusion – difficult to hand-code normalization rules – difficult to normalize with no linguistic knowledge
• 86. Chichewa morphological parser. (Diagram: template of the Chichewa verb, with prefix slots for subject, tense/aspect and object markers such as a, u, i, li, chi, zi, ka, ku, mu, ndi, ti, and suffix slots for extensions and final vowels such as -an, -its/-ets, -il/-el, -idw/-edw, -e, -a, -i, -o, arranged around the Verb-stem.)
• 87. Named Entity seeds: Step 1 top/bottom. (Table of best-match cross-linguistic English/Kreyol phrase pairs. Highest Weighted Deviation Estimate, used as 'entity' seeds, includes pairs such as Lambi Night Club / LAMBI NIGHT CLUB, Rue Lamarre / RUE LAMARRE, Delma 29 / DELMA 29, Promobank / PROMOBANK, Rony Fortune / RONY FORTUNE. Lowest Weighted Deviation Estimate, used as 'not-entity' seeds, includes pairs such as We need / BEZWEN, Saint-Antoine / sentantw, Lilavois 41 #2 / lilavwa 41#2.)
• 88. Global cellphone stats • International Telecommunication Union (ITU), http://www.itu.int/ITU-D/ict/statistics/ accessed April 20, 2012
  • 89. Speech and Speech Recognition • (Outside the scope of the dissertation) • Goldwater et al., 2009 – applied to word segmentation in speech recognition
  • 90. Chichewa – status of noun classes • Pronouns most likely to be followed by a space • Then the locative noun classes • Then the other noun classes • Subject noun classes more likely to be followed by a space than Object noun classes • Data too sparse to measure the combination of noun-class and Subj/Obj