Broad Twitter Corpus: A Diverse
Named Entity Recognition Resource
Leon Derczynski
Kalina Bontcheva
Ian Roberts
“I strongly recommend this paper”
“It is therefore a very useful resource”
“Impact of resources: 5
Overall recommendation: 5
Reviewer Confidence: 5”
wow
so review
very paper
much japan
Most of our language tech was trained on news
The bias is:
- middle class
- white
- working age
- educated
- male
- 1980s/1990s
- from the US
- journalist
- following AP guidelines
Your phone rewards you if you talk and write like: [photo © Michael Jang 1983]
(and that's ok.. sort of)
.. and punishes you when you don't.
(not cool!)
The REAL problem:
Our studies have centred on a
tiny, over-biased set of data
There is no variation!
(analyse some WSJ if you are not convinced..)
It's time to up our game;
social media is a cheap & unprecedented resource
e.g. Baldwin @ WNUT15; Hovy @ ACL15
Social media is incredibly powerful
- sample of all global discourse
- warns of earthquakes
- sends fire engines
- predicts virus outbreaks (e.g. WNV)
Traditional tools have awful performance
Stanford NER 40% F1
Single-topic recall 66%
.. cross-topic 33%
What kind of entities do we find in social media?
High variety – ages quickly
PER
- News: politicians, business leaders, journalists, celebrities
- Tweets: sportsmen, actors, TV personalities, celebrities, names of friends
LOC
- News: countries, cities, rivers, and other places related to current affairs
- Tweets: restaurants, bars, local landmarks/areas, cities, rarely countries
ORG
- News: public and private companies, government organisations
- Tweets: bands, internet companies, sports clubs
Why a new corpus?
Existing ones are tiny, and hyperfocused
Name | Tokens | Schema | Annotation | Notes
UMBC | 7K | PLO | Crowd | Low IAA
Ritter | 46K | Freebase | Expert, single | No IAA
Microsoft | 12K | PLO + Product | ? | Private
MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames
What kind of variance do we see?
Temporal:
- concept drift over time
- daily cycles (work, family, socialising)
- weekly cycles
- time of year (seasonal behaviours)
Spatial
- many different anglophone regions
- different surface forms in each
- different signifiers (LLC – Ltd. - DAC)
Social
- WSJ readers and writers
- net celebrities
- tv characters
Corpus design:
Temporal
- drawn over six years, from Twitter archive
- selected over multiple temporal cycles
Spatial
- spread over six anglophone regions:
UK, US, IE, CA, NZ, AU
Social
- general segment
- selection for news
- selection for commentary
Annotation problems
Workflow:
Crowdsourcing platform interfaces = pita
Not in USA, so no mturk access
Solution:
- GATE Crowdsourcing plugin
- Load corpus, set up task, add API key, launch job, done!
- Automatic result collection & alignment
- Even Java/Swing is prettier than mturk’s back end
Annotation problems
Task design
Lots of training required
Many entity types
Solution
Brief instructions
Clean interface
Annotate just one entity type at a time
- pricy but way better, and overall, quicker
Annotation problems
Annotator recall
Pretty serious problem
People have limited knowledge, limited world experience
Expert annotators actually not good – we’re desperately overfit
Don’t believe me? Who can explain this real document?
KKTNY in 45 min!!!!!
Solution:
Ignore traditional IAA
Pool the results - “max recall”
Rare knowledge ≠ Wrong knowledge
Post-solution:
Expert adjudication step
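The "max recall" pooling above can be sketched in a few lines of Python. This is a minimal illustration, not the GATE plugin's actual implementation: the `Span` type, the character offsets and the `pool_max_recall` name are all hypothetical. Every span that any worker marked is kept, so a mention spotted by a single annotator still survives to the expert adjudication step.

```python
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    """Hypothetical annotation record: character offsets plus entity type."""
    start: int
    end: int
    label: str


def pool_max_recall(annotations: Iterable[Iterable[Span]]) -> List[Span]:
    """Return the union of all workers' spans, deduplicated and sorted."""
    pooled = set()
    for worker_spans in annotations:
        pooled.update(worker_spans)
    return sorted(pooled)


# Three workers on one tweet: only the second one spots the ORG mention.
workers = [
    [Span(0, 5, "PER")],
    [Span(0, 5, "PER"), Span(10, 15, "ORG")],
    [],
]
pooled = pool_max_recall(workers)
```

Majority voting would have dropped the ORG span here; pooling keeps it and leaves the keep-or-reject decision to the adjudicator.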
Annotation problems
Crowd can be pretty dumb
Not its fault – we gave no education
People need precise idea of task
Solution 1
Ensure workers get good score on known data first
Lace the text with gold data, for monitoring & feedback
Solution 2
Keep task focused (just one entity type)
Give instructions & examples
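Solution 1 amounts to scoring each worker on items whose answer is already known. A rough sketch, assuming a hypothetical data layout (item id mapped to answer) and an assumed cut-off of 0.8; the slides do not state the threshold actually used:

```python
from typing import Dict, List

MIN_GOLD_ACCURACY = 0.8  # assumed cut-off; the slides give no figure


def gold_accuracy(answers: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of the gold items this worker saw that they got right."""
    seen = [item for item in gold if item in answers]
    if not seen:
        return 0.0
    return sum(answers[item] == gold[item] for item in seen) / len(seen)


def trusted_workers(all_answers: Dict[str, Dict[str, str]],
                    gold: Dict[str, str]) -> List[str]:
    """Workers whose accuracy on the hidden gold items meets the cut-off."""
    return sorted(
        worker for worker, answers in all_answers.items()
        if gold_accuracy(answers, gold) >= MIN_GOLD_ACCURACY
    )


gold = {"t1": "yes", "t2": "no"}  # hypothetical gold items laced into the job
all_answers = {
    "worker_a": {"t1": "yes", "t2": "no", "t3": "yes"},
    "worker_b": {"t1": "yes", "t2": "yes", "t3": "no"},
}
trusted = trusted_workers(all_answers, gold)
```

The same score can be shown back to workers as feedback while the job is running, which is the monitoring use mentioned above.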
Results – annotator quality
Experts are consistent, but don’t get far
Crowd is varied and inconsistent, but gets
superior recall performance
Remember, recall is the problem with soc med!
Group | Recall over final annotations | F1 IAA
Expert | 0.309 | 0.835
Crowd | 0.837 | 0.350
Results: size
Name | Tokens | Schema | Annotation | Notes
UMBC | 7K | PLO | Crowd | Low IAA
Ritter | 46K | Freebase | Expert, single | No IAA
Microsoft | 12K | PLO + Product | ? | Private
MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames
BTC (Broad Twitter Corpus) | 165K | PLO | Expert + Crowd | Source JSON available

Documents | 9 551
Tokens | 165 739
Person | 5 271
Location | 3 114
Organisation | 3 732
Total | 12 117
Results: diversity
Sorry Botswana,
Bahamas, South Africa,
Malta.. looking forward to
seeing you crowdsource!
Results: diversity
By year, and month
Results: diversity
By day of month, weekday, and time of day
Results: IAA
Adjudication is the agreement with max-recall
Naïve is micro-averaged lenient match
Note that max-recall performs very well
(according to expert..)
Level | Adjudication | Naïve
Whole doc | 0.839 | N/a
Person | 0.920 | 0.799
Location | 0.963 | 0.861
Organisation | 0.936 | 0.954
All | 0.940 | 0.877
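For readers unfamiliar with the "micro-averaged lenient match" used for the Naïve column, here is one possible reading as a sketch. It assumes that "lenient" means two spans match when they overlap at all and carry the same type, and that "micro-averaged" means counts are summed over all documents before computing F1. The slides do not spell out either definition, and the function names and tuple layout are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Lenient match (assumed): any character overlap and the same label."""
    return a[0] < b[1] and b[0] < a[1] and a[2] == b[2]


def micro_lenient_f1(docs: List[Tuple[List[Span], List[Span]]]) -> float:
    """F1 over (reference, response) pairs, with counts pooled across docs."""
    n_ref = n_resp = matched_ref = matched_resp = 0
    for ref, resp in docs:
        n_ref += len(ref)
        n_resp += len(resp)
        matched_resp += sum(any(overlaps(r, g) for g in ref) for r in resp)
        matched_ref += sum(any(overlaps(g, r) for r in resp) for g in ref)
    precision = matched_resp / n_resp if n_resp else 0.0
    recall = matched_ref / n_ref if n_ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


docs = [
    ([(0, 5, "PER")], [(0, 3, "PER")]),                    # partial overlap counts
    ([(2, 6, "LOC"), (8, 10, "ORG")], [(2, 6, "LOC")]),    # ORG is missed
]
f1 = micro_lenient_f1(docs)
```

In this toy case precision is 2/2 and recall is 2/3, giving an F1 of 0.8.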
Results: popular surface forms
CoNLL is:
* ancient
* US and int.rel. centric
* about cricket???
Results: long tail steepness
Tail vs. head tells us something about diversity
If a few forms make up many mentions, the corpus is more boring:
- less variety (qualitative)
- harder to generalise about (maths!)
We bisect at h-index point, and compare proportions
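The h-index bisection can be sketched as follows. This is an interpretation rather than the exact procedure behind the slides: it assumes surface forms are ranked by mention count, that h is the largest rank at which a form still has at least that many mentions, and that the "proportion" compared is the share of all mentions falling in the head (the top h forms). The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def h_index(counts: Iterable[int]) -> int:
    """Largest h such that at least h forms have h or more mentions each."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def head_proportion(mentions: List[str]) -> float:
    """Share of all mentions covered by the top-h surface forms."""
    ranked = sorted(Counter(mentions).values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[: h_index(ranked)]) / total


# Toy data: counts are 4, 3, 2, 1, so h = 2 and the head holds 7 of 10 mentions.
mentions = ["London"] * 4 + ["Paris"] * 3 + ["Leeds"] * 2 + ["Hull"]
share = head_proportion(mentions)
```

A higher head share means a few forms dominate, which is the "more boring" case above; a lower share means more of the mass sits in the long tail.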
Corpus distribution
Totally legal to give source; it’s under 50K tweets
- JSON
- GATE docs
- CoNLL
All intermediate crowdsourcing data included in the GATE docs
Available before Dec 16
To be extra sure, also available as “rehydratable standoff”
Thanks! And thank you everyone!

Broad Twitter Corpus Resource for Diverse NER

  • 1. Broad Twitter Corpus: A Diverse Named Entity Recognition Resource Leon Derczynski Kalina Bontcheva Ian Roberts
  • 2. Broad Twitter Corpus: A Diverse Named Entity Recognition Resource “I strongly recommend this paper” “It is therefore a very useful resource” “Impact of resources: 5 Overall recommendation: 5 Reviewer Confidence: 5” wow so review very paper much japan
  • 3. Most of our language tech was trained on news The bias is: - middle class - white -working age - educated - male - 1980s/1990s - from the US - journalist - following AP guidelines
  • 4. Your phone rewards you if you talk and write like (and that's ok.. sort of) Photo © Michael Jang 1983
  • 5. Your phone rewards you if you talk and write like (and that's ok.. sort of) .. and punishes you when you don't. (not cool!)
  • 6. The REAL problem: Our studies have centred on a tiny, over-biased set of data There is no variation! (analyse some WSJ if you are not convinced..) It's time to up our game; social media is a cheap & unprecedented resource e.g. Baldwin @ WNUT15; Hovy @ ACL15
  • 7. Social media is incredibly powerful - sample of all global discourse - warns of earthquakes - sends fire engines - predicts virus outbreaks (e.g. WNV) Traditional tools have awful performance Stanford NER 40% F1 Single-topic recall 66% .. cross-topic 33%
  • 8. What kind of entities do we find in social media? High variety – ages quickly
      PER  News:   Politicians, business leaders, journalists, celebrities
           Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends
      LOC  News:   Countries, cities, rivers, and other places related to current affairs
           Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries
      ORG  News:   Public and private companies, government organisations
           Tweets: Bands, internet companies, sports clubs
  • 9. Why a new corpus? Existing ones are tiny, and hyperfocused
      Name       Tokens  Schema         Annotation        Notes
      UMBC       7K      PLO            Crowd             Low IAA
      Ritter     46K     Freebase       Expert, single    No IAA
      Microsoft  12K     PLO + Product  ?                 Private
      MSM        29K     PLO + Misc     Expert, multiple  No hashtags / usernames
  • 10. What kind of variance do we see? Temporal: - concept drift over time - daily cycles (work, family, socialising) - weekly cycles - time of year (seasonal behaviours) Spatial - many different anglophone regions - different surface forms in each - different signifiers (LLC – Ltd. - DAC) Social - WSJ readers and writers - net celebrities - tv characters
  • 11. Corpus design: Temporal - drawn over six years, from twitter archive - selected over multiple temporal cycles Spatial - spread over six anglophone regions: UK, US, IE, CA, NZ, AU Social - general segment - selection for news - selection for commentary
  • 12. Annotation problems Workflow: Crowdsourcing platform interfaces = pita Not in USA, so no mturk access Solution: - GATE Crowdsourcing plugin - Load corpus, set up task, add API key, launch job, done! - Automatic result collection & alignment - Even Java/Swing is prettier than mturk’s back end
  • 13. Annotation problems Task design Lots of training required Many entity types Solution Brief instructions Clean interface Annotate just one entity type at a time - pricy but way better, and overall, quicker
  • 14. Annotation problems Annotator recall Pretty serious problem People have limited knowledge, limited world experience Expert annotators actually not good – we’re desperately overfit Don’t believe me? Who can explain this real document? KKTNY in 45 min!!!!!
  • 15.
  • 16. Annotation problems Annotator recall Pretty serious problem People have limited knowledge, limited world experience Expert annotators actually not good – we’re desperately overfit Don’t believe me? Who can explain this real document? KKTNY in 45 min!!!!! Solution: Ignore traditional IAA Pool the results - “max recall” Rare knowledge ≠ Wrong knowledge Post-solution: Expert adjudication step
  • 17. Annotation problems Crowd can be pretty dumb Not its fault – we gave no education People need precise idea of task Solution 1 Ensure workers get good score on known data first Lace the text with gold data, for monitoring & feedback Solution 2 Keep task focused (just one entity type) Give instructions & examples
  • 18.
  • 19. Results – annotator quality Experts are consistent, but don’t get far Crowd is varied and inconsistent, but gets superior recall performance Remember, recall is the problem with soc med!
      Group   Recall over final annotations  F1 IAA
      Expert  0.309                          0.835
      Crowd   0.837                          0.350
  • 20. Results: size
      Name                        Tokens  Schema         Annotation        Notes
      UMBC                        7K      PLO            Crowd             Low IAA
      Ritter                      46K     Freebase       Expert, single    No IAA
      Microsoft                   12K     PLO + Product  ?                 Private
      MSM                         29K     PLO + Misc     Expert, multiple  No hashtags / usernames
      BTC (Broad Twitter Corpus)  165K    PLO            Expert + Crowd    Source JSON available

      Documents       9 551
      Tokens        165 739
      Person          5 271
      Location        3 114
      Organisation    3 732
      Total          12 117
  • 21. Results: diversity Sorry Botswana, Bahamas, South Africa, Malta.. looking forward to seeing you crowdsource!
  • 23. Results: diversity By day of month, weekday, and time of day
  • 24. Results: IAA Adjudication is the agreement with max-recall Naïve is micro-averaged lenient match Note that max-recall performs very well (according to expert..)
      Level         Adjudication  Naïve
      Whole doc     0.839         N/a
      Person        0.920         0.799
      Location      0.963         0.861
      Organisation  0.936         0.954
      All           0.940         0.877
  • 25. Results: popular surface forms CONLL is: * ancient * US and int.rel. centric * about cricket???
  • 26. Results: long tail steepness Tail vs. head tells us something about diversity If a few forms make up many mentions, the corpus is more boring: - less variety (qualitative) - harder to generalise about (maths!) We bisect at h-index point, and compare proportions
  • 27. Corpus distribution Totally legal to give source; it’s under 50K tweets - JSON - GATE docs - CoNLL All intermediate crowdsourcing data included in the GATE docs Available before Dec 16 To be extra sure, also available as “rehydratable standoff”
  • 28. Thanks! And thank you everyone! Alonso & Lease, 2011 Bontcheva et al. 2014a Bontcheva et al. 2014b Callison-Burch & Dredze, 2010 Difallah et al. 2013 Finin et al. 2010 Hovy et al. 2013 Khanna et al. 2010 Morris et al. 2012 Sabou et al. 2014 Balog et al. 2012 Bollacker et al. 2008 Hovy 2010 Rowe et al. 2013 Ritter et al. 2011 Rose et al. 2002 Tjong Kim Sang et al. 2003 Coppersmith et al. 2014 De Choudhury et al. 2013 Kedzie et al. 2015 Neubig et al. 2011 Tumasjan et al. 2010 Eisenstein et al. 2010 Eisenstein 2013 Hu et al. 2013 Kergl et al. 2014 Mascaro & Goggins 2012 Tufekci 2014 Bontcheva et al. 2013 Liu et al. 2011 Lui & Baldwin 2012 Magdy & Elsayed 2016 Mostafa 2013 O’Connor et al. 2010 Fromreide et al. 2014 Masud et al. 2010
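
Slide 16 proposes pooling crowd annotations for "max recall" instead of filtering them by traditional agreement. A minimal sketch of that idea follows, assuming each worker's output is a set of (start, end, type) spans and that pooling means taking the union. The `Span` layout, the function names and the toy data are illustrative assumptions, not taken from the slides, and the expert adjudication step is not modelled.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span representation: (start_token, end_token, entity_type).
Span = Tuple[int, int, str]


def pool_max_recall(annotations: Iterable[Set[Span]]) -> Set[Span]:
    """Union of all workers' spans: anything marked by anyone is kept."""
    pooled: Set[Span] = set()
    for spans in annotations:
        pooled |= spans
    return pooled


def recall(found: Set[Span], reference: Set[Span]) -> float:
    """Fraction of reference spans that also appear in `found`."""
    if not reference:
        return 0.0
    return len(found & reference) / len(reference)


# Toy data: three workers, each knowing a different part of the tweet.
worker_a = {(0, 1, "PER")}
worker_b = {(0, 1, "PER"), (4, 5, "ORG")}
worker_c = {(7, 8, "LOC")}

pooled = pool_max_recall([worker_a, worker_b, worker_c])
print(sorted(pooled))
print(recall(worker_a, pooled))
```

A span that only one worker recognises survives pooling, which matches the slide's point that rare knowledge is not wrong knowledge. Precision is then left to the adjudication pass.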
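
Slide 26 describes bisecting the surface-form distribution at its h-index point and comparing proportions. The slides do not give the exact formula, so the sketch below is one plausible reading: h is the largest number such that h surface forms each have at least h mentions, the head is those top h forms, and the tail is everything else. The function names and the toy mentions are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def h_index(counts: List[int]) -> int:
    """Largest h such that at least h items have a count of h or more."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def head_tail_share(mentions: Iterable[str]) -> Tuple[int, float, float]:
    """Return (h, share of mentions in the head, share in the tail)."""
    ranked = sorted(Counter(mentions).values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0, 0.0, 0.0
    h = h_index(ranked)
    head = sum(ranked[:h])
    return h, head / total, (total - head) / total


# Toy data: location mentions with a short head and a longer tail.
mentions = ["London"] * 4 + ["Paris"] * 3 + ["Leeds"] * 2 + ["Hull", "Bath", "York"]
h, head_share, tail_share = head_tail_share(mentions)
print(h, round(head_share, 3), round(tail_share, 3))
```

Under this reading, a larger tail share means fewer mentions are concentrated in a handful of forms, which is the sense of diversity the slide describes.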