Twitter Part-of-Speech Tagging for All:
Overcoming Sparse and Noisy Data
Leon Derczynski
Alan Ritter
Sam Clark
Kalina Bontcheva
Streaming social media is powerful
● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy
● Sample of all human discourse - unprecedented
● Not only where people are & when, but also
what they are doing
● Interesting stuff - just ask the NSA!
Tweets are dirty
● You all know what Twitter is, so let's just look at
some difficult tweets
● Orthography: Kk its 22:48 friday nyt :D really
tired so imma go to sleep :) good nyt x god
bles xxxxx
● Fragments: Bonfire tonite. All are welcome,
joe included
● Capitalisation: Don't Have Time To Stop In???
Then, Check Out Our Quick Full Service Drive
Thru Window :)
● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx
*kisses your ass**sneezes after* Lol
Tough tweets: Do we even care?
● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When
Less is More - http://ow.ly/2FwyX
● just went on an unfollowing spree... there's no
point of following you if you haven't tweeted
in 10+ days. #justsaying ..
● The tweets we find most difficult are those that
seem to say the least
● So im in tha chi whts popping tonight?
● i just gave my momma some money 4 a bill.... she
smiled when i put it n her hand __AND__ said "i
wanna go out to eat"... -______- HELLA SCAN
We do care
● However, there is utility in trivia:
– Sadilek: Predict if you will get flu, using spatial co-location and friend network
– Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus
– Emerging events: tendency to describe briefly
''There's a dead crow
in my garden''
@mari: i think im sick ugh..
Problem representation
● Tweets into finite tokens (PTB + URLs, Smileys)
● Put tokens in categories, depending on linguistic function
● Discriminative
– cases one by one
– e.g. unigram tagger
● Sequence labelling
– order matters!
– consider neighbouring labels
● Goal: label the whole sequence correctly
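To make the contrast between case-by-case tagging and sequence labelling concrete, here is a minimal sketch of a unigram tagger. The function names and the toy training sentences are invented for illustration and do not come from any corpus named in these slides.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens, default="NN"):
    """Tag each token on its own; neighbouring labels are never consulted."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

# Toy, invented training data (hypothetical, for illustration only).
train = [
    [("i", "PRP"), ("book", "VBP"), ("flights", "NNS")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("a", "DT"), ("book", "NN")],
]
model = train_unigram(train)
tagged = tag_unigram(model, ["i", "book", "a", "flight"])
# "book" after "i" is tagged NN here, because the tagger ignores word order.
```

A sequence labeller would use the preceding PRP to prefer a verb reading for "book", which is the sense in which order matters.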
Word order still matters.. just
● Hard for tweets: exclamations and fragments
● Whole sequences a bit rare
● @FeeninforPretty making something to eat,
aint ate all day
● Peace green tea time!! Happyzone!!!! :)))))
● Sentence structure cues (e.g. caps) often:
– absent
– over-used
How do current tools do?
● Badly!
– Out of the box:
– Trained on Twitter,
IRC and WSJ data:
Where do they break?
● Continued work extending Stanford Tagger
● Terrible at doing whole sentences
– Best was 10% accuracy
– SotA on newswire about 55-60%
● Problems on unknown words: a good target
set for improving performance
– 1 in 5 words completely unseen
– 27% token accuracy on this group
What errors occur on unknowns?
● Gold standard errors (dank_UH je_UH → _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)
Do we have enough data?
● No, it's even worse than normal
– Ritter: 15K tokens, PTB, one annotator
– Foster: 14K tokens, PTB, low-noise
– CMU: 39K tokens, custom, narrow tagset
Tweet PoS-tagging issues
● From analysis, three big issues identified:
1. Many unseen words / orthographies
2. Uncertain sentence structure
3. Not enough annotated data
● Continued with Ritter dataset
Unseen words in tweets
● Two classes:
● Standard token, non-standard orthography;
– freinds
– KHAAAANNNNNNN!
● Non-standard token, standard orthography
– omg + bieber = omb
– Huntington
Unseen words in tweets
● Majority of non-standard orthographies can be
corrected with a gazetteer: typical Pareto
– vids → videos
– cussin → cursing
– hella → very
● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
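A gazetteer lookup of this kind can be sketched in a few lines. Only the three entries shown on the slide are included. The case-insensitive matching and the function name `normalise` are assumptions, not details taken from the released tool.

```python
# Three entries from the slide; the full gazetteer is reported to have 361.
GAZETTEER = {"vids": "videos", "cussin": "cursing", "hella": "very"}

def normalise(tokens, gazetteer=GAZETTEER):
    """Replace known non-standard spellings; leave every other token untouched."""
    # Lowercasing before lookup is an assumption made for this sketch.
    return [gazetteer.get(tok.lower(), tok) for tok in tokens]

normalised = normalise(["HELLA", "good", "vids"])
# normalised == ["very", "good", "videos"]
```

Normalising before tagging lets the tagger see a token it already knows, rather than treating the variant as an unknown word.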
Unseen words in tweets
● The rest can be handled reasonably with word
shape and contextual features
● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare
● Features include:
– word prefix and suffix shapes
– distribution of shape in corpus
– shapes of neighbouring words
● Corpus small, so adjust rare threshold
● +5.35% absolute token acc., +18.5% sentence
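The kind of word-shape feature listed above can be illustrated as follows. This is a simplified sketch and does not reproduce the Stanford `ExtractorFramesRare` feature set; the shape alphabet, the affix length and both function names are assumptions.

```python
def word_shape(token):
    """Collapse a token to a coarse shape: X upper, x lower, d digit.

    Runs of the same class are squeezed, so "KHAAAANNNNNNN!" becomes "X!".
    """
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch  # keep punctuation and symbols as they are
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)

def shape_features(tokens, i, affix_len=3):
    """Shape of the token, of its prefix and suffix, and of its neighbours."""
    tok = tokens[i]
    return {
        "shape": word_shape(tok),
        "prefix_shape": word_shape(tok[:affix_len]),
        "suffix_shape": word_shape(tok[-affix_len:]),
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
    }

features = shape_features(["see", "u", "2night", "Derek"], 2)
```

Because shapes generalise across spellings, an unseen token such as "2night" shares features with other digit-plus-letters tokens seen in training.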
Tweet “sentence” “structure”
● They are structured (sometimes)
● We still do better if we look at global features
– Unigram tagger accuracy: 66%
● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%
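The gap between the two unigram figures above follows from how the metrics are defined: a sentence counts as correct only if every one of its tokens is correct. A minimal sketch of both measures, with invented toy labels:

```python
def token_and_sentence_accuracy(gold, predicted):
    """Return (token accuracy, sentence accuracy) over parallel tag sequences."""
    total_tokens = correct_tokens = correct_sentences = 0
    for g, p in zip(gold, predicted):
        matches = sum(gt == pt for gt, pt in zip(g, p))
        total_tokens += len(g)
        correct_tokens += matches
        # A sentence is correct only when all of its tokens match.
        correct_sentences += (matches == len(g) == len(p))
    return correct_tokens / total_tokens, correct_sentences / len(gold)

# Toy, invented example: one wrong tag out of five tokens.
gold = [["PRP", "VBP", "NNS"], ["DT", "NN"]]
pred = [["PRP", "NN", "NNS"], ["DT", "NN"]]
token_acc, sentence_acc = token_and_sentence_accuracy(gold, pred)
# token_acc is 0.8 while sentence_acc is 0.5
```

A single token error costs one fifth of the token score here but half of the sentence score, which is why sentence-level accuracy falls so quickly.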
Tweet “sentence” “structure”
● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens
● Knowing P(c_i) constrains both P(c_{i-1} | c_i) and P(c_{i+1} | c_i)
Tweet “sentence” “structure”
● This allows us to prune the transition graph of
labels in the sequence
● Because the graph is read in both directions,
fixing any label point impacts whole tweet
● Setting label priors reduces token error 5.03%
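The effect of fixing labels can be sketched with a toy decoder. This sketch swaps in a plain HMM-style Viterbi rather than the Stanford tagger's bidirectional maximum-entropy model. The regular expressions, the tag names (URL, USR, HT, UH) and all probabilities are assumptions made for illustration.

```python
import re

# Patterns and tag names are assumptions for this sketch.
PRETAG_PATTERNS = [
    (re.compile(r"^https?://\S+$"), "URL"),       # links
    (re.compile(r"^@\w+$"), "USR"),               # user mentions
    (re.compile(r"^#\w+$"), "HT"),                # hashtags
    (re.compile(r"^[:;=][-']?[)(DPp]+$"), "UH"),  # a few simple smileys
]

def fixed_label(token):
    """Return the fixed tag for a constrained-form token, else None."""
    for pattern, tag in PRETAG_PATTERNS:
        if pattern.match(token):
            return tag
    return None

def allowed_tags(tokens, tagset):
    """Per position: one fixed tag if pre-taggable, otherwise the full tag set."""
    return [[fixed_label(t)] if fixed_label(t) else list(tagset) for t in tokens]

def constrained_viterbi(tokens, tagset, emit, trans, floor=1e-6):
    """Viterbi restricted to the allowed tags at each position."""
    allowed = allowed_tags(tokens, tagset)
    best = [{t: (emit.get((t, tokens[0]), floor), None) for t in allowed[0]}]
    for i in range(1, len(tokens)):
        column = {}
        for tag in allowed[i]:
            prev, score = max(
                ((p, best[i - 1][p][0] * trans.get((p, tag), floor))
                 for p in allowed[i - 1]),
                key=lambda pair: pair[1],
            )
            column[tag] = (score * emit.get((tag, tokens[i]), floor), prev)
        best.append(column)
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy, invented probabilities: "books" is ambiguous on emission alone.
emit = {("NNS", "books"): 0.5, ("VBZ", "books"): 0.5,
        ("NNS", "flights"): 0.9, ("VBZ", "flights"): 0.1}
trans = {("USR", "VBZ"): 0.8, ("USR", "NNS"): 0.2,
         ("VBZ", "NNS"): 0.7, ("NNS", "NNS"): 0.3,
         ("VBZ", "VBZ"): 0.1, ("NNS", "VBZ"): 0.3}
path = constrained_viterbi(["@bob", "books", "flights"], ["NNS", "VBZ"], emit, trans)
```

In this toy setting the fixed USR label on the mention is what tips the ambiguous "books" towards VBZ, since only transitions out of USR are considered at that position.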
Not enough data
● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case
● Problem: initial accuracy is too low ● •︵ _UH
● Solution: consensus with > 1 tagger ◕ ◡ ◕ _UH
● Problem: only one tagger using PTB tags ⋋〴 _⋌ 〵 _UH
● Solution: Vote-constrained Bootstrapping _⊙ ʘ _UH
Vote-constrained bootstrapping
● Not many taggers available for building
semi-supervised data
● We chose Ritter's plus the CMU tagger
● Where classes don't map 1:1, create equivalence
classes between tags
– CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS)
– CMU tag !(interjection) → PTB (UH)
● Coarser tag constrains set of fine-grained tags
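The equivalence classes can be represented as a mapping from each coarse CMU tag to the set of PTB tags it admits. The sketch below contains only the two mappings shown on the slide; the names `CMU_TO_PTB` and `compatible` are hypothetical.

```python
# Only the two mappings given on the slide; a full table would cover every CMU tag.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # adverb
    "!": {"UH"},                       # interjection
}

def compatible(cmu_tag, ptb_tag, mapping=CMU_TO_PTB):
    """True when the fine-grained PTB tag falls inside the coarse CMU tag's class."""
    return ptb_tag in mapping.get(cmu_tag, set())
```

An unmapped CMU tag yields an empty class in this sketch, so it is never compatible with any PTB tag.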
Vote-constrained bootstrapping
● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet
● Evaluated quality on development set
– Agreed on 17.8% of tweets
– Of those, 97.4% of tokens correctly PTB-labelled
– 71.3% of whole tweets correctly labelled
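The agreement test can be sketched as a per-token check that the PTB tag lies inside the class of the CMU tag, rejecting the whole tweet on the first disagreement. The `^` and `N` classes below are assumptions based on the usual reading of those CMU tags (proper noun, common noun). The sketch also assumes both taggers produce the same tokenisation.

```python
# The "^" and "N" rows are assumptions; "R" and "!" follow the mapping slide.
EQUIV = {
    "^": {"NNP", "NNPS"},
    "N": {"NN", "NNS"},
    "R": {"WRB", "RB", "RBR", "RBS"},
    "!": {"UH"},
}

def vote(cmu_tags, ptb_tags, equiv=EQUIV):
    """Return the PTB tags if the taggers agree on every token, else None."""
    if len(cmu_tags) != len(ptb_tags):
        return None  # tokenisations differ; treated as disagreement here
    for cmu_tag, ptb_tag in zip(cmu_tags, ptb_tags):
        if ptb_tag not in equiv.get(cmu_tag, ()):
            return None  # one disagreement rejects the whole tweet
    return list(ptb_tags)

accepted = vote(["^"], ["NNP"])   # Lebron_^ + Lebron_NNP
rejected = vote(["N"], ["VBZ"])   # books_N + books_VBZ
```

Accepting only fully agreed tweets trades coverage for label quality, which matches the low agreement rate and high token accuracy reported above.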
Vote-constrained bootstrapping
● Results:
– Use Trendminer lang ID + data
– Collected 1.5M agreed-upon tokens
● Adding this bootstrapped data reduced error by:
– Token-level: 13.7% Sentence-level: 4.5%
www.trendminer-project.eu
Final results
● Unknown accuracy rate: from 27.8% to 74.5%
                          Token %   Sentence %
Baseline: Ritter T-Pos     84.55       9.32
GATE: eval set             88.69      20.34
  - error reduction        26.80      12.15
GATE: dev set              90.54      28.81
  - error reduction        38.77      21.49
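The error-reduction rows are relative reductions against the Ritter T-Pos baseline. They can be checked from the accuracy figures; the function name is hypothetical.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

eval_token = error_reduction(84.55, 88.69)    # about 26.80
eval_sentence = error_reduction(9.32, 20.34)  # about 12.15
dev_token = error_reduction(84.55, 90.54)     # about 38.77
dev_sentence = error_reduction(9.32, 28.81)   # about 21.49
```

For example, token error on the eval set falls from 15.45 to 11.31 points, and 4.14 / 15.45 is roughly 26.80%.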
Where do we go next?
● Local tag sequence bounds?
● Better handling of hashtags
– I'm stressed at 9am, shopping on my lunch break...
can't deal w/ this today. #retailtherapy
– I'm so #bored today
● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated
● Parsing
Downloadable & Friendly
● As command-line tool; as GATE PR; as Stanford
Tagger model
● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available
● Updates since submission:
– Better handling of contractions
– Less sensitive to tokenisation scheme
● Please play!
Thank you for your time!
There is hope:
Jersey Shore is overrated. studying and
history homework then a fat night of sleep!
Do you have any questions?
Owoputi et al.
● NAACL'13 paper: 90.5% token accuracy with PTB tags
● Advancement of the Gimpel tagger, used for our bootstrapping
● Late discovery: Can be adapted to PTB tagset with good
results
● Our techniques are disjoint from Owoputi's; combining them could
give an even better result!
● Our model readily re-usable and integrated into existing NLP
tool sets
Capitalisation
● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml
● Lowercase model with lowercased data allows
us to ignore capitalisation noise
● Tried multiple approaches to classifying noisy
vs. well-formed capitalisation
● Gain from ignoring case in noisy tweets offset
by loss from mis-classified well-cased data

Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

  • 1. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data Leon Derczynski Alan Ritter Sam Clark Kalina Bontcheva
  • 2. Streaming social media is powerful ● It's Big Data! – Velocity: 500M tweets / day – Volume: 20M users / month – Variety: earthquakes, stocks, this guy ● Sample of all human discourse - unprecedented ● Not only where people are & when, but also what they are doing ● Interesting stuff - just ask the NSA!
  • 3. Tweets are dirty ● You all know what Twitter is, so let's just look at some difficult tweets ● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx ● Fragments: Bonfire tonite. All are welcome, joe included ● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :) ● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
  • 4. Tough tweets: Do we even care? ● Most tweets are linguistically fairly well-formed ● RT @DesignerDepot: Minimalist Web Design: When Less is More - http://ow.ly/2FwyX ● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying .. ● The tweets we find most difficult, are those that seem to say the least ● So im in tha chi whts popping tonight? ● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
  • 5. We do care ● However, there is utility in trivia: – Sadilek: Predict if you will get flu, using spatial co-location and friend network – Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus – Emerging events: tendency to describe briefly ''There's a dead crow in my garden'' @mari: i think im sick ugh..
  • 6. Problem representation ● Tweets into finite tokens (PTB + URLs, Smileys) ● Put tokens in categories, depending on linguistic function ● Discriminative – cases one by one – e.g. unigram tagger ● Sequence labelling – order matters! – consider neighbouring labels ● Goal: label the whole sequence correctly
  • 7. Word order still matters.. just ● Hard for tweets: exclamations and fragments ● Whole sequences a bit rare ● @FeeninforPretty making something to eat, aint ate all day ● Peace green tea time!! Happyzone!!!! :))))) ● Sentence structure cues (e.g. caps) often: – absent – over-used
  • 8. How do current tools do? ● Badly! – Out of the box: – Trained on Twitter, IRC and WSJ data:
  • 9. Where do they break? ● Continued work extending Stanford Tagger ● Terrible at doing whole sentences – Best was 10% accuracy – SotA on newswire about 55-60% ● Problems on unknown words – this is a good target set to get better performance on – 1 in 5 words completely unseen – 27% token accuracy on this group
  • 10. What errors occur on unknowns? ● Gold standard errors (dank_UH je_UH → _FW) ● Training lacks IV words (Internet, bake) ● Pre-taggables (URLs, mentions, retweets) ● NN vs. NNP (derek_NN, Bed_NNP) ● Slang (LUVZ, HELLA, 2night) ● Genre-specific (unfollowing) ● Tokenisation errors (ass**sneezes) ● Orthographic (suprising)
  • 11. Do we have enough data? ● No, it's even worse than normal – Ritter: 15K tokens, PTB, one annotator – Foster: 14K tokens, PTB, low-noise – CMU: 39K tokens, custom, narrow tagset
  • 12. Tweet PoS-tagging issues ● From analysis, three big issues identified: 1. Many unseen words / orthographies 2. Uncertain sentence structure 3. Not enough annotated data ● Continued with Ritter dataset
  • 13. Unseen words in tweets ● Two classes: ● Standard token, non-standard orthography; – freinds – KHAAAANNNNNNN! ● Non-standard token, standard orthography – omg + bieber = omb – Huntington
  • 14. Unseen words in tweets ● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto – vids → videos – cussin → cursing – hella → very ● No need to bother with e.g. Brown clustering ● 361 entries give 2.3% token error reduction
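The gazetteer lookup described on this slide can be sketched as a plain dictionary substitution. This is only an illustration: the three entries are the ones shown on the slide, while the function name `normalise` and the lowercase matching are assumptions, not the authors' implementation (their gazetteer has 361 entries).

```python
# Minimal sketch of gazetteer-based normalisation.
# Entries come from the slide; everything else is hypothetical.
GAZETTEER = {
    "vids": "videos",
    "cussin": "cursing",
    "hella": "very",
}


def normalise(tokens):
    """Replace tokens found in the gazetteer; leave the rest untouched.

    Matching is done on the lowercased token (an assumption), so the
    original casing of a replaced token is not preserved.
    """
    return [GAZETTEER.get(token.lower(), token) for token in tokens]


if __name__ == "__main__":
    print(normalise(["hella", "good", "vids"]))
```

A lookup table like this only covers the frequent head of the distribution; rarer spellings fall through unchanged and have to be handled by other features.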
  • 15. Unseen words in tweets ● The rest can be handled reasonably with word shape and contextual features ● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare ● Features include: – word prefix and suffix shapes – distribution of shape in corpus – shapes of neighbouring words ● Corpus small, so adjust rare threshold ● +5.35% absolute token acc., +18.5% sentence
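To make "word shape" concrete, here is one common way to compute it, as a rough sketch. It is not the Stanford tagger's `ExtractorFramesRare` code; the function names, the `X`/`x`/`d` alphabet, and the three-character prefix and suffix length are all assumptions chosen for illustration.

```python
def word_shape(token, collapse=True):
    """Map a token to a coarse shape string.

    Uppercase letters become 'X', lowercase 'x', digits 'd'; any other
    character is kept as-is. With collapse=True, runs of the same shape
    symbol are squeezed to a single symbol.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        if collapse and shape and shape[-1] == symbol:
            continue
        shape.append(symbol)
    return "".join(shape)


def shape_features(tokens, i, affix_len=3):
    """Shape features for tokens[i]: its own prefix/suffix shapes and
    the shapes of its neighbours (hypothetical feature names)."""
    token = tokens[i]
    return {
        "prefix_shape": word_shape(token[:affix_len]),
        "suffix_shape": word_shape(token[-affix_len:]),
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
    }
```

With collapsing enabled, `KHAAAANNNNNNN!` and a shorter shout such as `KHAN!` share the shape `X!`, which is the point: an unseen spelling can still look like something the model has seen.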
  • 16. Tweet “sentence” “structure” ● They are structured (sometimes) ● We still do better if we look at global features – Unigram tagger accuracy: 66% ● Sentence-level accuracy is important – Unigram tagger sentence accuracy: 2.3%
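The gap between token-level and sentence-level accuracy is easier to see when both are computed side by side. The sketch below assumes gold and predicted tags are given as parallel lists of tag sequences, one per tweet; the function name is hypothetical.

```python
def token_and_sentence_accuracy(gold, predicted):
    """Return (token accuracy, sentence accuracy) as fractions in [0, 1].

    gold and predicted are lists of tag sequences, one per tweet.
    A tweet counts as correct only if every one of its tags matches.
    """
    total_tokens = 0
    correct_tokens = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sequences differ in length")
        matches = sum(1 for g, p in zip(gold_tags, pred_tags) if g == p)
        total_tokens += len(gold_tags)
        correct_tokens += matches
        if matches == len(gold_tags):
            correct_sentences += 1
    return correct_tokens / total_tokens, correct_sentences / len(gold)
```

A single wrong tag anywhere in a tweet makes the whole tweet count as wrong, which is why a tagger with reasonable token accuracy can still score very low at sentence level.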
  • 17. Tweet “sentence” “structure” ● Tweets contain some constrained-form tokens ● Links, hashtags, user mentions, some smileys ● We can fix the label for these tokens ● Knowing P(ci) constrains both P(ci-1|ci) and P(ci+1|ci)
  • 18. Tweet “sentence” “structure” ● This allows us to prune the transition graph of labels in the sequence ● Because the graph is read in both directions, fixing any label point impacts whole tweet ● Setting label priors reduces token error 5.03%
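One way to fix labels for constrained-form tokens is to pre-tag them with regular expressions and hand the sequence model a reduced set of allowed tags at those positions. The following is a sketch under assumptions: the tag names `URL`, `USR`, `HT` and `RT`, the patterns, and the function names are illustrative, and the input is assumed to be already tokenised.

```python
import re

# (pattern, tag) pairs for tokens whose form determines their label.
# Tag names and patterns are assumptions made for this example.
FIXED_FORMS = [
    (re.compile(r"https?://\S+"), "URL"),
    (re.compile(r"@\w+"), "USR"),
    (re.compile(r"#\w+"), "HT"),
    (re.compile(r"RT"), "RT"),
]


def fixed_label(token):
    """Return the forced tag for a constrained-form token, else None."""
    for pattern, tag in FIXED_FORMS:
        if pattern.fullmatch(token):
            return tag
    return None


def allowed_tags(tokens, tagset):
    """Per-position sets of candidate tags for a decoder to search over.

    Positions with a fixed label get a singleton set, which prunes the
    transitions into and out of that position; all other positions keep
    the full tagset.
    """
    result = []
    for token in tokens:
        tag = fixed_label(token)
        result.append({tag} if tag is not None else set(tagset))
    return result
```

A decoder that only considers the tags in each position's set never has to score transitions through impossible labels, so a single fixed token narrows the choices for its neighbours on both sides.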
  • 19. Not enough data ● Big unlabelled data - 75 000 000 tweets / day (en) ● Bootstrapping sometimes helps in this case ● Problem: initial accuracy is too low ● •︵ _UH ● Solution: consensus with > 1 tagger ◕ ◡ ◕ _UH ● Problem: only one tagger using PTB tags ⋋〴 _⋌ 〵 _UH ● Solution: Vote-constrained Bootstrapping _⊙ ʘ _UH
  • 20. Vote-constrained bootstrapping ● Not many taggers available for building semi-supervised data ● We chose Ritter's plus the CMU tagger ● Where classes don't map 1:1 ● Create equivalence classes between tags – CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS) – CMU tag ! (interjection) → PTB (UH) ● Coarser tag constrains set of fine-grained tags
  • 21. Vote-constrained bootstrapping ● Ask both taggers to label the candidate input ● Add tweet to semi-supervised data if both agree ● Lebron_^ + Lebron_NNP → OK, Lebron_NNP ● books_N + books_VBZ → Fail, reject tweet ● Evaluated quality on development set – Agreed on 17.8% of tweets – Of those, 97.4% of tokens correctly PTB labelled – 71.3% of whole tweets correctly labelled
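The agreement check behind vote-constrained bootstrapping can be sketched as follows. The mappings for `R` and `!` are taken from the slides; the mappings for `^` and `N` are assumptions inferred from the Lebron and books examples, and the function name is hypothetical. The real equivalence classes cover the full tagsets.

```python
# Coarse (CMU) tag -> set of fine-grained (PTB) tags it is compatible with.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # from the slides
    "!": {"UH"},                       # from the slides
    "^": {"NNP", "NNPS"},              # assumed: proper nouns
    "N": {"NN", "NNS"},                # assumed: common nouns
}


def vote(coarse_tags, fine_tags):
    """Accept a tweet only if the two taggers agree on every token.

    Returns the fine-grained tags when each one falls inside the
    equivalence class of the corresponding coarse tag, otherwise None
    (the tweet is rejected and not added to the bootstrapped data).
    """
    if len(coarse_tags) != len(fine_tags):
        return None
    for coarse, fine in zip(coarse_tags, fine_tags):
        if fine not in CMU_TO_PTB.get(coarse, set()):
            return None
    return list(fine_tags)
```

Rejecting the whole tweet on a single disagreement trades coverage for precision, which matches the figures on the slide: few tweets pass, but the tokens in those that do are almost always labelled correctly.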
  • 22. Vote-constrained bootstrapping ● Results: – Use Trendminer lang ID + data – Collected 1.5M agreed-upon tokens ● Adding this bootstrapped data reduced error by: – Token-level: 13.7% Sentence-level: 4.5% www.trendminer-project.eu
  • 23. Final results ● Unknown accuracy rate: from 27.8% to 74.5% ● Accuracy (%):
                                  Token    Sentence
      Baseline: Ritter T-Pos      84.55      9.32
      GATE: eval set              88.69     20.34
      - error reduction           26.80     12.15
      GATE: dev set               90.54     28.81
      - error reduction           38.77     21.49
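The "error reduction" rows in the final results appear to be relative reductions in error rate against the Ritter T-Pos baseline. A small sketch of that arithmetic, with a hypothetical function name, reproduces the reported figures from the accuracy values:

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given accuracies in percent.

    Error is 100 - accuracy; the result is the share of the baseline
    error that the new system removes.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


if __name__ == "__main__":
    # Token-level and sentence-level, eval set then dev set.
    for base, new in [(84.55, 88.69), (9.32, 20.34), (84.55, 90.54), (9.32, 28.81)]:
        print(f"{base} -> {new}: {error_reduction(base, new):.2f}% error reduction")
```

For example, token accuracy moving from 84.55 to 88.69 takes the error from 15.45 to 11.31 points, which is about 26.8% of the baseline error.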
  • 24. Where do we go next? ● Local tag sequence bounds? ● Better handling of hashtags – I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy – I'm so #bored today ● More data – bootstrapped ● More data – part-bootstrapped (e.g. CMU GS) ● More data – human annotated ● Parsing
  • 25. Downloadable & Friendly ● As command-line tool; as GATE PR; as Stanford Tagger model ● Included in GATE's TwitIE toolkit (4pm, Europa) ● 1.5M token dataset available ● Updates since submission: – Better handling of contractions – Less sensitive to tokenisation scheme ● Please play!
  • 26. Thank you for your time! There is hope: Jersey Shore is overrated. studying and history homework then a fat night of sleep! Do you have any questions?
  • 27. Owoputi et al. ● NAACL'13 paper: 90.5% token perf w/ PTB accuracy ● Advancement of the Gimpel tagger, used for our bootstrapping ● Late discovery: Can be adapted to PTB tagset with good results ● We use disjoint techniques to Owoputi; combining them could give an even better result! ● Our model readily re-usable and integrated into existing NLP tool sets
  • 28. Capitalisation ● Noisy tweets have unusual capitalisation, right? – Buy Our Widgets Now – ugh I haet u all .. stupd ppl #fml ● Lowercase model with lowercased data allows us to ignore capitalisation noise ● Tried multiple approaches to classifying noisy vs. well-formed capitalisation ● Gain from ignoring case in noisy tweets offset by loss from mis-classified well-cased data