Paper presented at NIPS 2014 workshop on modern machine learning and natural language processing.
Many algorithms for natural language processing rely on manual feature engineering.
In this paper, we show that we can achieve state-of-the-art performance for part-of-speech tagging of Twitter microposts by relying solely on automatically inferred word embeddings as features for a neural network.
By pre-training the neural network with large amounts of automatically labeled Twitter microposts to initialize the weights, we achieve a state-of-the-art accuracy of 88.9% when tagging Twitter microposts with Penn Treebank tags.
Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations
1. http://multimedialab.elis.ugent.be
Ghent University – iMinds, ELIS Department/Multimedia Lab
Gaston Crommenlaan 8 bus 201
B-9050 Ledeberg – Ghent, Belgium
Fréderic Godin, Baptist Vandersmissen, Azarakhsh Jalalvand, Wesley De Neve and Rik Van de Walle
12/12/2014, Montreal, Canada
Research Question

Can we avoid manual feature engineering when developing a Part-of-Speech tagger for Twitter microposts?
Twitter: @frederic_godin, @BaptistV, @wmdeneve and @rvdwalle
Solution
Automatically learn features on 400 million raw Twitter microposts that capture syntactic and semantic patterns, and feed them to a neural network.
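The feature-learning step uses the Word2vec skip-gram objective, in which each word predicts its neighbors within a small window. A minimal sketch of the (target, context) pair extraction at the heart of skip-gram, with a hypothetical window size of 2 (the window size used in the paper is not stated here):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by the
    Word2vec skip-gram model: each word predicts every neighbor
    within `window` positions to its left and right."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs
```

Training a model on these pairs over 400 million microposts yields one dense vector per word, with no hand-crafted features.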
[Figure: Learn Features: a Word2vec Skip-gram model is trained on 400 million raw Twitter microposts, producing a 400D vector per word. Train the Part-of-Speech Tagger: the 400D vectors of a word and its context (e.g., "im doin good") are fed into a neural network with a 500D hidden layer and a 52D output layer, which predicts a tag such as VBG.]
Vote-Constrained Bootstrapping*

[Figure: both the ARK tagger and the GATE tagger tag the same raw micropost, e.g., "im doin good", where "doin" is tagged VBG by one tagger and V by the other. Agree?]

Automatically generate high-confidence labeled data from the taggings on which both taggers agree, and use this data to pre-train the neural network.

*Derczynski et al., 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data"
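The vote-constrained generation of pre-training data can be sketched as follows. This is a minimal sketch assuming both taggers emit tags from the same tagset; `tagger_a` and `tagger_b` are hypothetical stand-ins for the ARK and GATE taggers:

```python
def vote_constrained_bootstrap(sentences, tagger_a, tagger_b):
    """Keep only sentences on which two independent taggers fully
    agree, producing high-confidence (token, tag) training data."""
    high_confidence = []
    for tokens in sentences:
        tags_a = tagger_a(tokens)
        tags_b = tagger_b(tokens)
        if tags_a == tags_b:  # unanimous vote -> high confidence
            high_confidence.append(list(zip(tokens, tags_a)))
    return high_confidence
```

Disagreements are simply discarded, so the resulting dataset is smaller but far less noisy than the output of either tagger alone.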
Evaluation

Word2vec dataset | Pre-training dataset | Accuracy (validation set) | Accuracy (test set)
150M             | /                    | 87.95%                    | 87.46%
150M             | 50K                  | 89.64%                    | 88.82%
400M             | 50K                  | 89.73%                    | 88.95%
400M             | 125K                 | 90.09%                    | 88.90%

Baselines: Ritter et al. (2011): 84.55%; Derczynski et al. (2013): 88.69%