1. Tutorial on using and learning phrases from text
by Cassandra Jacobs
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
3. What are phrases?
• Word combinations
• Literal and idiomatic meanings
– “kick the bucket” – to die
– “strong coffee” – highly caffeinated, concentrated
– “data mining” – a particular concept in computer science
4. Why phrases?
• Phrases can express ideas not obvious from the individual words
– White House (an important building)
– red herring (an anomaly)
– syntactic parsing (a paper topic)
• Can disambiguate words “for free”
– (river) bank versus (financial) bank
5. Phrases versus words
• Difficult to extract from text
• n words, but n² possible bigrams, n³ trigrams, etc.
– Always rarer than individual words
– Simple measures like frequency can lead to bad phrases (e.g. “in the”, “is a”, “not our”); see the sketch below
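To see why raw frequency fails, here is a minimal sketch that ranks bigrams by count; the sample text is made up, and on any real corpus function-word pairs like “in the” dominate the top of the list.

# Minimal sketch: rank bigrams by raw frequency. The sample text is made up;
# on real corpora the top of the list is dominated by pairs like "in the".
from collections import Counter

text = ("data mining is a field in computer science and data mining "
        "is used in the industry and in the academy and in the press")
tokens = text.split()
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(3))
# [(('in', 'the'), 3), (('data', 'mining'), 2), (('mining', 'is'), 2)]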
6. Phrases versus words
• Some probabilistic measurements are good proxies for “phraseness”
• Mutual information identifies phrases that occur together more often than chance:
p(a, b) / (p(a) p(b))
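A minimal sketch of that ratio computed from corpus counts; all the counts below are made up for illustration.

# Minimal sketch: the ratio p(a,b) / (p(a) p(b)) from corpus counts.
# All counts below are made up for illustration.
N = 1_000_000                      # total tokens in the corpus
unigrams = {"data": 500, "mining": 200, "in": 40_000, "the": 60_000}
bigrams = {("data", "mining"): 150, ("in", "the"): 2_500}

def phrase_score(a, b):
    p_ab = bigrams[(a, b)] / N
    return p_ab / ((unigrams[a] / N) * (unigrams[b] / N))

print(phrase_score("data", "mining"))  # 1500.0 -- far above chance
print(phrase_score("in", "the"))       # ~1.04 -- about what chance predicts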
7. Phrases versus words
• Unsupervised methods like topic models over bigrams often produce strange results
– “I mean”
– “Well I”
• Distributional similarity/vector methods require supervision or feedback about phrase quality
8. Phrases versus words
• Low numbers of observations
– Huge domain differences in whether phrases are used
• E.g. ACL submissions are encouraged not to use idiomatic expressions
– Formal versus informal contexts
– Differences between writers’ language backgrounds
9. Tasks where phrases are useful
• Good phrases should improve or reflect
– Document classification tasks
– External knowledge (Wikipedia titles, dictionaries)
– Analogy solving
– Paraphrase identification
– Similarity ratings on Amazon Mechanical Turk
– Machine translation
10. Task 1: Named entity recognition
• Some studies use wiki phrases (article titles) by taking all the titles and using them in other tasks
• Can parse a sentence for entities by automatically labeling the entities that appear in Wikipedia
11. Identifying wiki phrases for named entity recognition
• Polls show Democrat[ORG] Hillary_Clinton[PER] and Republican[ORG] Donald_Trump[PER] ahead by double-digit margins
• Wiki phrases like Hillary_Clinton and Donald_Trump contain lots of clues that they are people
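A minimal sketch of that labeling step, assuming a tiny hand-made dictionary of Wikipedia titles with entity types; a real system would load millions of titles from a dump.

# Minimal sketch: label spans of a sentence that match Wikipedia titles.
# The title dictionary here is hand-made; a real system loads it from a dump.
wiki = {"hillary clinton": "PER", "donald trump": "PER",
        "democrat": "ORG", "republican": "ORG"}

def label_entities(tokens, max_len=3):
    labels, i = [], 0
    while i < len(tokens):
        # Greedily try the longest span first so "Hillary Clinton" wins over "Hillary".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in wiki:
                labels.append(("_".join(tokens[i:i + n]), wiki[span]))
                i += n
                break
        else:
            i += 1
    return labels

sent = "Polls show Democrat Hillary Clinton and Republican Donald Trump ahead".split()
print(label_entities(sent))
# [('Democrat', 'ORG'), ('Hillary_Clinton', 'PER'), ('Republican', 'ORG'), ('Donald_Trump', 'PER')]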
12. Identifying wiki phrases for named entity recognition
• Passos, Kumar, & McCallum (2014)
– Keep bigrams where p(a,b)/(p(a)p(b)) > 1000 (sketched below)
– Then take the top 1M phrases
– Create embeddings from these phrases
– Use the embeddings as features in named entity recognition (NER)
– Using phrase embeddings led to state-of-the-art NER
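A minimal sketch of the first two filtering steps under made-up counts; the real pipeline runs over a far larger corpus and vocabulary.

# Minimal sketch of the Passos et al. (2014) candidate filter: keep bigrams with
# p(a,b)/(p(a)p(b)) > 1000, then cap at the top 1M by score. Counts are made up.
N = 1_000_000
unigrams = {"named": 300, "entity": 250, "of": 50_000, "the": 60_000}
bigrams = {("named", "entity"): 120, ("of", "the"): 2_000}

def score(a, b):
    return (bigrams[(a, b)] / N) / ((unigrams[a] / N) * (unigrams[b] / N))

kept = {p: score(*p) for p in bigrams if score(*p) > 1000}
top = sorted(kept, key=kept.get, reverse=True)[:1_000_000]  # top 1M in the real setting
print(top)  # [('named', 'entity')] -- ('of', 'the') scores ~0.67 and is dropped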
13. Task 2: Using idioms in sentiment analysis
• Bag-of-words models over individual words would probably misclassify these two
– “not that bad” → ok
– “not that good” → probably bad
• Sometimes adding in phrase information increases noise and runtime
14. Using idioms in sentiment analysis
• Williams et al. (2015) annotated idioms in context as either positive or negative
– 580 idioms from a language learner textbook
– Regular expressions to identify variants (see the sketch below)
– “Not that bad” → neutral
– “A drop in the bucket” → good
• Sentiment classification accuracy increased from 45% to 60% with the addition of idioms
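A minimal sketch of the variant-matching idea with two hand-written patterns and labels; Williams et al.'s actual resource covers 580 idioms.

# Minimal sketch: regular expressions that tolerate idiom variants, in the spirit
# of Williams et al. (2015). Patterns and sentiment labels here are hand-made.
import re

idioms = [
    (re.compile(r"\bnot (all )?that bad\b", re.I), "neutral"),
    (re.compile(r"\ba drop in the (bucket|ocean)\b", re.I), "positive"),
]

def idiom_sentiments(text):
    return [label for pattern, label in idioms if pattern.search(text)]

print(idiom_sentiments("Honestly the hotel was not all that bad."))      # ['neutral']
print(idiom_sentiments("The refund was a drop in the ocean for them."))  # ['positive']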
15. Task 3: Using idioms in phrase analogies
Toronto : Toronto Maple Leafs :: Montreal : Montreal Canadiens
– Want to produce complex, multi-word output in an analogy task
16. Using idioms in phrase analogies
• Mikolov et al. (2013)
• In an analogy task, need to first identify phrases
– High mutual information score cutoff for phrase learning
– Train a neural network model to learn distributed phrase vector representations
17. Using idioms in phrase analogies
• In the neural network, phrases are pairs of words concatenated into single tokens
– “Toronto Maple Leafs” is treated like a single word by the model
– The model predicts the contexts given words and phrases as input
– “Toronto Maple Leafs” and “Montreal Canadiens” both predict a “hockey” context when the individual words do not
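A minimal sketch of this pipeline with gensim, where phrases are pre-merged into single tokens (gensim's Phrases class can perform that merging with a score cutoff); the toy corpus is made up and far too small for the analogy to come out reliably.

# Minimal sketch: phrases are pre-merged into single tokens (gensim's Phrases
# class can do this merging with a score cutoff), then Word2Vec learns one
# vector per word *and* per phrase. The toy corpus is made up and far too
# small for the analogy to come out reliably; real runs use billions of tokens.
from gensim.models import Word2Vec

sentences = [
    ["toronto_maple_leafs", "play", "hockey", "in", "toronto"],
    ["montreal_canadiens", "play", "hockey", "in", "montreal"],
] * 100

model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=25, seed=0)

# Analogy query: toronto : toronto_maple_leafs :: montreal : ?
print(model.wv.most_similar(positive=["montreal", "toronto_maple_leafs"],
                            negative=["toronto"], topn=1))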
19. Unsupervised learning of phrases
• Some papers focus on how to get good phrases beyond mutual information measures
– Shallow parsing with structural constraints (no “of the United”)
– If a phrase includes another phrase, the whole phrase must be included (“President of the United States”); see the sketch below
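A minimal sketch of that constraint as greedy longest-match segmentation over a hand-made phrase list.

# Minimal sketch of the structural constraint: segment greedily with the longest
# matching phrase, so "President of the United States" is never split into
# fragments like "of the United". The phrase list is hand-made.
phrases = {"president of the united states", "united states"}
MAX_LEN = 6

def segment(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in phrases:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(segment("the President of the United States spoke".split()))
# ['the', 'President_of_the_United_States', 'spoke']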
20. Unsupervised learning of phrases
• Cho et al. (2014) propose a model for machine translation that predicts words and phrases in a target language (a recurrent neural network encoder-decoder)
– Input: word and next word in the source language
– Output: word and next word in the target language
21. Unsupervised learning of phrases
• Predicting the next word of a word in a foreign language helps the model associate the past with potential future output
– Phrases learned in the Cho et al. (2014) model cluster “one to three months” near “for two months”; a minimal sketch of the architecture follows below
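A minimal sketch of an RNN encoder-decoder in PyTorch, with made-up sizes and toy data; this shows the general shape of the idea, not Cho et al.'s exact architecture.

# Minimal sketch of an RNN encoder-decoder in PyTorch. Sizes, data, and the
# single-layer GRU choice are made up; this shows the shape of the idea, not
# Cho et al.'s exact architecture.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # The encoder compresses the source phrase into one fixed-length vector;
        # that vector is the learned phrase representation the slide describes.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))           # h: (1, batch, hidden)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # unroll target words
        return self.out(dec_out)                             # next-word logits

model = EncoderDecoder(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 5))   # two toy source phrases, five tokens each
tgt = torch.randint(0, 1000, (2, 6))
logits = model(src, tgt[:, :-1])       # teacher forcing: feed the shifted target
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))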
22. Supervised learning of phrases
• Liu et al. (2015) define quality as a threshold on two properties
– Informativeness within a document (effectively term frequency/inverse document frequency; sketched below)
– Concordance (conventionality, judged by the difference between competing combinations, e.g. “powerful coffee” versus “strong coffee”)
– Together these act like TF-IDF for phrases
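A minimal sketch of the informativeness signal as phrase-level TF-IDF over made-up documents; Liu et al.'s full quality score also incorporates concordance and other signals.

# Minimal sketch of the informativeness signal: TF-IDF computed over candidate
# phrases instead of words. Documents and candidates here are made up.
import math

docs = [
    ["data_mining", "is", "fun", "data_mining", "works"],
    ["strong_coffee", "helps", "with", "data_mining"],
    ["strong_coffee", "is", "just", "coffee"],
]

def tf_idf(phrase, doc):
    tf = doc.count(phrase) / len(doc)
    df = sum(1 for d in docs if phrase in d)
    return tf * math.log(len(docs) / df)

print(tf_idf("data_mining", docs[0]))    # higher: frequent here, not everywhere
print(tf_idf("strong_coffee", docs[2]))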
23. Evaluation of learned phrases
• Perplexity of the data given the model
– Higher perplexity means the model explains the data less well
– When a model captures more dependencies in the data, the phrases it includes are good (El-Kishky et al., 2015)
– This metric works better for some domains than others (e.g. Yelp)
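A minimal sketch of the metric: perplexity is the exponential of the average negative log probability the model assigns to held-out text.

# Minimal sketch: perplexity = exp(-mean log-probability per token).
import math

def perplexity(log_probs):
    # log_probs: natural-log probabilities a language model assigns to each token.
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that explains held-out text better assigns higher probabilities,
# which yields lower perplexity.
print(perplexity([math.log(0.2)] * 10))   # 5.0
print(perplexity([math.log(0.05)] * 10))  # 20.0 -- worse fit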
24. Evaluations of phrases
• El-Kishky et al. (2015) also compared retrieved phrases against Wikipedia titles
– If a phrase is a Wikipedia title, it is a very good phrase
– If not, it is harder to evaluate
– Works for some domains but maybe not others (e.g. abstracts and papers)
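A minimal sketch of the comparison, with a hand-made title set standing in for the full list of Wikipedia titles.

# Minimal sketch: precision of mined phrases against Wikipedia titles.
# Both lists are hand-made; the real comparison loads the full title dump.
wiki_titles = {"data mining", "named entity recognition", "machine translation"}

mined = ["data mining", "of the united", "machine translation", "not our"]
hits = [p for p in mined if p in wiki_titles]
print(hits, len(hits) / len(mined))
# ['data mining', 'machine translation'] 0.5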
25. Current state of research
• No gold standard for evaluating whether a phrase is good or not
– Many available datasets and applications
– Less clear how to learn phrases in an unsupervised framework
– Many models implicitly or explicitly use mutual information and background language models as filters
26. References
El-Kishky, A., Song, Y., Wang, C., Voss, C. R., & Han, J. (2015). Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3).
Liu, J., Shang, J., Wang, C., Ren, X., & Han, J. (2015, May). Mining quality phrases from
massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (pp. 1729-1744). ACM.
Passos, A., Kumar, V., & McCallum, A. (2014). Lexicon infused phrase embeddings for named
entity resolution. arXiv preprint arXiv:1404.5367.
Williams, L., Bannister, C., Arribas-Ayllon, M., Preece, A., & Spasić, I. (2015). The role of idioms
in sentiment analysis. Expert Systems with Applications, 42, 7375-7385.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.