Tutorial on using and
learning phrases from text
by Cassandra Jacobs
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
Roadmap
•  What are phrases?
•  Why use phrases?
•  What NLP tasks do phrases help?
•  How do we mine phrases?
What are phrases?
•  Word combinations
•  Literal and idiomatic meanings
– “kick the bucket” – to die
– “strong coffee” – highly caffeinated,
concentrated
– “data mining” – a particular concept in
computer science
Why phrases?
•  Phrases can express ideas not obvious
from the individual words
– White House (an important building)
– red herring (a misleading clue)
– syntactic parsing (a paper topic)
•  Can disambiguate words “for free”
– (river) bank versus (financial) bank
Phrases versus words
•  Difficult to extract from text
•  n words, but n^2 possible bigrams, n^3
trigrams, etc.
– Always rarer than individual words
– Simple measures like frequency can lead to
bad phrases (e.g. “in the”, “is a”, “not our”)
Phrases versus words
•  Some probabilistic measurements are
good proxies for “phraseness”
•  Mutual information identifies phrases that
occur more often than chance:
$$\frac{p(a,b)}{p(a)\,p(b)}$$
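As a rough illustration of the ratio above, the following sketch estimates p(a,b) / (p(a) p(b)) for adjacent word pairs from raw counts. The function name `association_ratios` and the toy sentence are assumptions made for this example, not part of the tutorial. The ratio is the quantity inside the logarithm of pointwise mutual information, so values above 1 mean the pair co-occurs more often than chance would predict.

```python
from collections import Counter

def association_ratios(tokens):
    """Return p(a,b) / (p(a) p(b)) for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    ratios = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # maximum-likelihood bigram probability
        p_a = unigrams[a] / n_uni    # maximum-likelihood unigram probabilities
        p_b = unigrams[b] / n_uni
        ratios[(a, b)] = p_ab / (p_a * p_b)
    return ratios

# Toy corpus (hypothetical); real estimates need far more text.
tokens = "we like data mining and we like text mining in the lab".split()
ratios = association_ratios(tokens)
```

On a corpus this small the estimates are unreliable, and rare pairs receive inflated scores, which is one reason a frequency cutoff is usually combined with the ratio.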
Phrases versus words
•  Unsupervised methods like topic models
of bigrams often provide strange results
– “I mean”
– “Well I”
•  Distributional similarity/vector methods
require supervision or feedback about
phrase quality
Phrases versus words
•  Low numbers of observations
– Huge domain differences in whether phrases
are used
•  E.g. ACL submissions are encouraged not to use
idiomatic expressions
– Formal versus informal contexts
– Difference between writers’ language
backgrounds
Tasks where phrases are useful
•  Good phrases should improve or reflect
–  Document classification tasks
–  External knowledge (Wikipedia titles, dictionary)
–  Analogy solving
–  Paraphrase identification
–  Similarity ratings on Amazon Mechanical Turk
–  Machine translation
Task 1: Named entity
recognition
•  Some studies use wiki phrases (headlines)
by taking all the titles and using them in
other tasks
•  Can parse a sentence for entities by
automatically labeling some of the entities
that are in Wikipedia
Identifying wiki phrases for
named entity recognition
•  Polls show Democrat[ORG]
Hillary_Clinton[PER] and Republican[ORG]
Donald_Trump[PER] ahead by double-digit
margins
•  Wiki phrases like Hillary_Clinton and
Donald_Trump contain lots of clues that
they are people
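One way to picture this labeling step is a greedy longest-match lookup against a lexicon of titles. The sketch below is only an illustration: the function `tag_with_titles`, the four-entry `title_types` dictionary, and the entity types assigned in it are hypothetical stand-ins for a real list of Wikipedia titles.

```python
def tag_with_titles(tokens, title_types, max_len=3):
    """Greedy longest-match labeling of token spans found in a title lexicon."""
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in title_types:
                tagged.append((candidate, title_types[candidate]))
                i += n
                break
        else:
            # No span starting here is in the lexicon: label it "O" (outside).
            tagged.append((tokens[i], "O"))
            i += 1
    return tagged

# Hypothetical lexicon; a real one would be built from Wikipedia titles.
title_types = {
    "Hillary_Clinton": "PER",
    "Donald_Trump": "PER",
    "Democrat": "ORG",
    "Republican": "ORG",
}
sentence = "Polls show Democrat Hillary Clinton and Republican Donald Trump ahead".split()
tagged = tag_with_titles(sentence, title_types)
```

Exact string lookup misses spelling variants and ambiguous titles, so in practice such labels would more likely serve as features or noisy training data than as final output.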
Identifying wiki phrases for
named entity recognition
•  Passos, Kumar, & McCallum (2014)
– Bigrams where p(a,b)/(p(a)p(b)) > 1000
– Then top 1M phrases
– Create embeddings from these phrases
– Embeddings used as features in named entity
recognition (NER)
– Using phrase embeddings led to
state-of-the-art NER
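The first two steps of this pipeline (a ratio cutoff, then a cap on the number of phrases) might look like the sketch below. The function `select_phrases` and the scores in `toy_ratios` are invented for illustration. Ranking the survivors by the same ratio is also an assumption, since the slide does not say how the top 1M phrases are chosen.

```python
import heapq

def select_phrases(ratios, min_ratio=1000.0, top_k=1_000_000):
    """Keep bigrams whose ratio exceeds min_ratio, then return the top_k by ratio."""
    kept = {pair: r for pair, r in ratios.items() if r > min_ratio}
    # Assumption: survivors are ranked by the same association ratio.
    return heapq.nlargest(top_k, kept, key=kept.get)

# Invented scores, only to exercise the function.
toy_ratios = {
    ("new", "york"): 5200.0,
    ("in", "the"): 3.1,
    ("san", "francisco"): 8800.0,
    ("machine", "learning"): 1500.0,
}
phrases = select_phrases(toy_ratios, min_ratio=1000.0, top_k=2)
```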
Task 2: Using idioms in
sentiment analysis
•  Bag of individual words models would
probably misclassify these two
– “not that bad” -> ok
– “not that good” -> probably bad
•  Sometimes adding in phrase information
increases noise, runtime
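The misclassification can be seen with a toy lexicon-based scorer. Everything in this sketch is hypothetical: the two small polarity dictionaries, the scores in them, and both scoring functions are assumptions chosen to mirror the two examples above, not a method from the tutorial.

```python
# Hypothetical lexicons: +1 positive, -1 negative, 0 neutral.
WORD_POLARITY = {"good": 1, "bad": -1}
PHRASE_POLARITY = {"not that bad": 0, "not that good": -1}

def bag_of_words_score(text):
    """Sum word polarities, ignoring word order and context."""
    return sum(WORD_POLARITY.get(w, 0) for w in text.lower().split())

def phrase_aware_score(text):
    """Score known phrases as units first, then score the remaining words."""
    text = text.lower()
    score = 0
    for phrase, polarity in PHRASE_POLARITY.items():
        if phrase in text:  # simple substring match, good enough for a sketch
            score += polarity
            text = text.replace(phrase, " ")
    return score + bag_of_words_score(text)

word_scores = [bag_of_words_score(t) for t in ("not that bad", "not that good")]
phrase_scores = [phrase_aware_score(t) for t in ("not that bad", "not that good")]
```

The word-level scorer rates “not that bad” as negative and “not that good” as positive, while the phrase-aware scorer returns neutral and negative. The extra dictionary lookups also hint at the runtime cost the slide mentions.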
Using idioms in sentiment
analysis
•  Williams et al. (2015) annotated idioms in
context as either positive or negative
– 580 idioms from a language learner textbook
– Regular expressions to identify variants
– “Not that bad” -> neutral
– “A drop in the bucket” -> good
•  Sentiment classification increased from 45%
to 60% with the addition of idioms
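To give a sense of how a regular expression can catch inflected variants of an idiom, here is a minimal sketch for “kick the bucket”. The pattern and the helper `contains_idiom` are written for this example and are not taken from Williams et al. (2015).

```python
import re

# Matches kick / kicks / kicked / kicking + "the bucket", case-insensitively.
KICK_THE_BUCKET = re.compile(r"\bkick(?:s|ed|ing)?\s+the\s+bucket\b", re.IGNORECASE)

def contains_idiom(sentence):
    """Return True if the sentence contains a variant of the idiom."""
    return KICK_THE_BUCKET.search(sentence) is not None

examples = [
    "He kicked the bucket last year.",
    "Nobody wants to think about kicking the bucket.",
    "He kicked the ball into a bucket.",
]
matches = [contains_idiom(s) for s in examples]
```

A pattern like this still matches literal uses (someone actually kicking a bucket), which is why annotating idioms in context matters.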
Task 3: Using idioms in phrase
analogies
Toronto : Toronto Maple Leafs ::
Montreal : Montreal Canadiens
– Want to produce complex, non-word output in
an analogy task
Using idioms in phrase
analogies
•  Mikolov et al. (2013)
•  In an analogy task, need to first identify
phrases
– High mutual information score cutoff for
phrase learning
– Train a neural network model to learn
distributed phrase vector representations
Using idioms in phrase
analogies
•  In the neural network, the words of a phrase
are concatenated into a single token
– “Toronto Maple Leafs” is treated like a single
word for the model
– Model predicts the contexts given words and
phrases as input
– “Toronto Maple Leafs” and “Montreal
Canadiens” both predict a “hockey” context
when the individual words do not
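Treating a phrase as a single word amounts to rewriting the token stream before training. The sketch below joins selected adjacent pairs with an underscore. The function `merge_phrases` and the hand-picked phrase sets are assumptions for illustration; in practice the pairs would come from a scoring step such as a mutual information cutoff.

```python
def merge_phrases(tokens, phrases):
    """Join adjacent token pairs listed in phrases into single underscore tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Two passes build a three-word phrase out of pairwise merges.
tokens = "Toronto Maple Leafs play tonight".split()
pass1 = merge_phrases(tokens, {("Maple", "Leafs")})
pass2 = merge_phrases(pass1, {("Toronto", "Maple_Leafs")})
```

Repeating the merge over already-merged text is one way to reach phrases longer than two words while only ever scoring pairs.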
How to learn phrases?
•  Unsupervised methods
•  Supervised methods
Unsupervised learning of
phrases
•  Some papers focus on how to get good
phrases beyond mutual information
measures
– Shallow parsing with structural constraints (no
“of the United”)
– If a phrase includes another phrase, the whole
phrase must be included (“President of the
United States”)
Unsupervised learning of
phrases
•  Cho et al. (2014) propose a model for
machine translation that predicts words
and phrases in a target language
(recurrent neural network encoder-decoder)
– Input: Word and next word in source language
– Output: Word and next word in target
language
Unsupervised learning of
phrases
•  Predicting the next word in the target
language helps the model associate the
words seen so far with likely future
output
– Phrases learned in the Cho et al. (2014)
model cluster “one to three months” near “for
two months”
Supervised learning of phrases
•  Liu et al. (2015) define quality as a
threshold with two properties
– Informativeness within a document (effectively
term frequency/inverse document frequency)
– Concordance (conventionality, judged by
difference between some combinations – e.g.
powerful coffee, strong coffee)
– Like TF-IDF for phrases
Evaluation of learned phrases
•  Perplexity of the data given the model
– Higher perplexity means less data explained
– When a model captures more dependencies
in the data, phrases included are good (El-Kishky
et al., 2015)
– This metric works better for some domains
than others (e.g. Yelp)
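For readers who have not computed perplexity before, here is a minimal sketch using an add-one-smoothed unigram model. This is a deliberately simple stand-in, not the topic model of El-Kishky et al. (2015). The function name `unigram_perplexity` is an assumption, and including the test vocabulary in the smoothing is a simplification.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    counts = Counter(train_tokens)
    # Simplification: the vocabulary also includes the test tokens.
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log((counts[w] + 1) / total) for w in test_tokens)
    # Perplexity is exp of the negative average log-probability per token.
    return math.exp(-log_prob / len(test_tokens))

train = "a a b b".split()
seen = unigram_perplexity(train, "a b".split())    # tokens the model has seen
unseen = unigram_perplexity(train, "c c".split())  # tokens it has never seen
```

The unseen text gets the higher perplexity, which matches the reading above: higher perplexity means the model explains less of the data.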
Evaluations of phrases
•  El-Kishky et al. (2015) also compared
retrieved phrases against Wikipedia titles
– If in Wikipedia, then this is a very good phrase
– If not, harder to evaluate
– Works for some domains but maybe not
others (e.g. abstracts and papers)
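The comparison against titles can be sketched as a simple hit rate. The function `title_match_rate`, the case-insensitive matching, and both toy lists are assumptions for illustration; the source does not describe the exact matching procedure.

```python
def title_match_rate(mined_phrases, reference_titles):
    """Return the fraction of mined phrases that appear as titles, plus the hits."""
    if not mined_phrases:
        return 0.0, []
    normalized_titles = {t.lower() for t in reference_titles}
    hits = [p for p in mined_phrases if p.lower() in normalized_titles]
    return len(hits) / len(mined_phrases), hits

# Hypothetical inputs; a real run would load the full list of Wikipedia titles.
mined = ["data mining", "in the", "White House"]
titles = {"Data mining", "White House", "Machine translation"}
rate, hits = title_match_rate(mined, titles)
```

A miss is ambiguous: the phrase may be bad, or it may be a good domain-specific phrase that has no Wikipedia article.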
Current state of research
•  No gold standard for evaluating whether a
phrase is good or not
– Many available datasets and applications
– Less clear how to learn phrases in an
unsupervised framework
– Many models implicitly or explicitly use mutual
information and background language models
as filters
References
El-Kishky, A., Song, Y., Wang, C., Voss, C. R., & Han, J. (2015). Scalable topical phrase mining
from text corpora. Proceedings of the VLDB Endowment, 8. (Also in Proc. 2015 Int. Conf. on Very Large
Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.)
Liu, J., Shang, J., Wang, C., Ren, X., & Han, J. (2015, May). Mining quality phrases from
massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (pp. 1729-1744). ACM.
Passos, A., Kumar, V., & McCallum, A. (2014). Lexicon infused phrase embeddings for named
entity resolution. arXiv preprint arXiv:1404.5367.
Williams, L., Bannister, C., Arribas-Ayllon, M., Preece, A., & Spasić, I. (2015). The role of idioms
in sentiment analysis. Expert Systems with Applications, 42, 7375-7385.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.

Contenu connexe

Tendances

Email Classification
Email ClassificationEmail Classification
Email ClassificationXi Chen
 
Tag based recommender system
Tag based recommender systemTag based recommender system
Tag based recommender systemKaren Li
 
Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningLi Miao
 
Tutorial on Relationship Mining In Online Social Networks
Tutorial on Relationship Mining In Online Social NetworksTutorial on Relationship Mining In Online Social Networks
Tutorial on Relationship Mining In Online Social Networkspjing2
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social networkChanon Hongsirikulkit
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
 
Story generation-Sarah Saneei
Story generation-Sarah SaneeiStory generation-Sarah Saneei
Story generation-Sarah SaneeiSRah Sanei
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
Preference Elicitation in Recommender Systems
Preference Elicitation in Recommender SystemsPreference Elicitation in Recommender Systems
Preference Elicitation in Recommender SystemsAnish Shenoy
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommendergu wendong
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalDustin Smith
 
Word vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlmWord vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlmhyunsung lee
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Replicable Evaluation of Recommender Systems
Replicable Evaluation of Recommender SystemsReplicable Evaluation of Recommender Systems
Replicable Evaluation of Recommender SystemsAlejandro Bellogin
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineeringalessio_ferrari
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationArjen de Vries
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysisharit66
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information RetrievalHarsh Thakkar
 

Tendances (20)

Email Classification
Email ClassificationEmail Classification
Email Classification
 
Tag based recommender system
Tag based recommender systemTag based recommender system
Tag based recommender system
 
Final deck
Final deckFinal deck
Final deck
 
Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text Mining
 
Tutorial on Relationship Mining In Online Social Networks
Tutorial on Relationship Mining In Online Social NetworksTutorial on Relationship Mining In Online Social Networks
Tutorial on Relationship Mining In Online Social Networks
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social network
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
Story generation-Sarah Saneei
Story generation-Sarah SaneeiStory generation-Sarah Saneei
Story generation-Sarah Saneei
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Preference Elicitation in Recommender Systems
Preference Elicitation in Recommender SystemsPreference Elicitation in Recommender Systems
Preference Elicitation in Recommender Systems
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommender
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Word vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlmWord vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlm
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Replicable Evaluation of Recommender Systems
Replicable Evaluation of Recommender SystemsReplicable Evaluation of Recommender Systems
Replicable Evaluation of Recommender Systems
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineering
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and Recommendation
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Abstractive Review Summarization
Abstractive Review SummarizationAbstractive Review Summarization
Abstractive Review Summarization
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information Retrieval
 

Similaire à Using and learning phrases

Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Why Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspectiveWhy Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspectiveJames Hendler
 
20181106 survey on challenges of question answering in the semantic web saltlux
20181106 survey on challenges of question answering in the semantic web saltlux20181106 survey on challenges of question answering in the semantic web saltlux
20181106 survey on challenges of question answering in the semantic web saltluxDongGyun Hong
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 
Mdst3705 2013-02-05-databases
Mdst3705 2013-02-05-databasesMdst3705 2013-02-05-databases
Mdst3705 2013-02-05-databasesRafael Alvarado
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsAndre Freitas
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Seth Grimes
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...Seth Grimes
 
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Liz Rodrigues
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlpankit_ppt
 
introtonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdfintrotonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdfAdityaMishra178868
 

Similaire à Using and learning phrases (20)

semantic web & natural language
semantic web & natural languagesemantic web & natural language
semantic web & natural language
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Why Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspectiveWhy Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspective
 
20181106 survey on challenges of question answering in the semantic web saltlux
20181106 survey on challenges of question answering in the semantic web saltlux20181106 survey on challenges of question answering in the semantic web saltlux
20181106 survey on challenges of question answering in the semantic web saltlux
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
Mdst3705 2013-02-05-databases
Mdst3705 2013-02-05-databasesMdst3705 2013-02-05-databases
Mdst3705 2013-02-05-databases
 
1910 HCLT
1910 HCLT1910 HCLT
1910 HCLT
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP Systems
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
 
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
introtonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdfintrotonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdf
 

Dernier

ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptxAneriPatwari
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17Celine George
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 

Dernier (20)

ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 

Using and learning phrases

  • 1. Образец заголовка Tutorial on using and learning phrases from text by Cassandra Jacobs Prepared as an assignment for CS410: Text Information Systems in Spring 2016
  • 2. Образец заголовкаRoadmap •  What are phrases? •  Why use phrases? •  What NLP tasks do phrases help? •  How do we mine phrases?
  • 3. Образец заголовкаWhat are phrases? •  Word combinations •  Literal and idiomatic meanings – “kick the bucket” – to die – “strong coffee” – highly caffeinated, concentrated – “data mining” – a particular concept in computer science
  • 4. Образец заголовкаWhy phrases? •  Phrases can express ideas not obvious from the individual words – White House (an important building) – red herring (an anomaly) – syntactic parsing (a paper topic) •  Can disambiguate words “for free” – (river) bank versus (financial) bank
  • 5. Образец заголовкаPhrases versus words •  Difficult to extract from text •  n words, but n2 possible bigrams, n3 trigrams, etc. – Always rarer than individual words – Simple measures like frequency can lead to bad phrases (e.g. “in the”, “is a”, “not our”)
  • 6. Образец заголовкаPhrases versus words •  Some probabilistic measurements are good proxies for “phraseness” •  Mutual information identifies phrases that occur more often than chance: p(a,b) p(a)p(b)
  • 7. Образец заголовкаPhrases versus words •  Unsupervised methods like topic models of bigrams often provide strange results – “I mean” – “Well I” •  Distributional similarity/vector methods require supervision or feedback about phrase quality
  • 8. Образец заголовкаPhrases versus words •  Low numbers of observations – Huge domain differences in whether phrases are used •  E.g. ACL submissions encouraged to not use idiomatic expressions – Formal versus informal contexts – Difference between writers’ language backgrounds
  • 9. Tasks where phrases are useful
      • Good phrases should improve or reflect
          – Document classification tasks
          – External knowledge (Wikipedia titles, dictionary)
          – Analogy solving
          – Paraphrase identification
          – Similarity ratings on Amazon Mechanical Turk
          – Machine translation
  • 10. Task 1: Named entity recognition
      • Some studies take all Wikipedia article titles (“wiki phrases”) and use them in other tasks
      • Can parse a sentence for entities by automatically labeling the entities that appear in Wikipedia
  • 11. Identifying wiki phrases for named entity recognition
      • Polls show Democrat[ORG] Hillary_Clinton[PER] and Republican[ORG] Donald_Trump[PER] ahead by double-digit margins
      • Wiki phrases like Hillary_Clinton and Donald_Trump contain lots of clues that they are people
  • 12. Identifying wiki phrases for named entity recognition
      • Passos, Kumar, & McCallum (2014)
          – Bigrams where p(a,b)/(p(a)p(b)) > 1000
          – Then top 1M phrases
          – Create embeddings from these phrases
          – Embeddings used as features in named entity recognition (NER)
          – Using phrase embeddings led to state-of-the-art NER
  • 13. Task 2: Using idioms in sentiment analysis
      • Bag-of-words models over individual words would probably misclassify these two
          – “not that bad” → ok
          – “not that good” → probably bad
      • Sometimes adding phrase information increases noise and runtime
  • 14. Using idioms in sentiment analysis
      • Williams et al. (2015) annotated idioms in context as either positive or negative
          – 580 idioms from a language learner textbook
          – Regular expressions to identify variants
          – “Not that bad” → neutral
          – “A drop in the bucket” → good
      • Sentiment classification performance increased from 45% to 60% with the addition of idioms
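As a rough illustration of matching idiom variants with a regular expression, the sketch below covers inflected forms of one idiom. The pattern, the helper name `find_idiom`, and the choice of “kick the bucket” (taken from slide 3) are assumptions for illustration; they are not the expressions used by Williams et al. (2015).

```python
import re

# Hypothetical pattern: matches "kick", "kicks", "kicked" or "kicking",
# followed by "the bucket", ignoring case and extra whitespace.
KICK_THE_BUCKET = re.compile(
    r"\bkick(?:s|ed|ing)?\s+the\s+bucket\b", re.IGNORECASE
)

def find_idiom(text):
    """Return every surface form of the idiom found in text."""
    return [match.group(0) for match in KICK_THE_BUCKET.finditer(text)]

matches = find_idiom("He kicked the bucket last year.")
```

A full system would need one such pattern per idiom, plus handling for inserted words and changed determiners, which this sketch does not attempt.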
  • 15. Task 3: Using idioms in phrase analogies
      • Toronto : Toronto Maple Leafs :: Montreal : Montreal Canadiens
          – Want to produce complex, non-word output in an analogy task
  • 16. Using idioms in phrase analogies
      • Mikolov et al. (2013)
      • In an analogy task, need to first identify phrases
          – High mutual information score cutoff for phrase learning
          – Train a neural network model to learn distributed phrase vector representations
  • 17. Using idioms in phrase analogies
      • The words of a phrase are concatenated into a single token before training
          – “Toronto Maple Leafs” is treated like a single word by the model
          – Model predicts the contexts given words and phrases as input
          – “Toronto Maple Leafs” and “Montreal Canadiens” both predict a “hockey” context when the individual words do not
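The idea of treating a phrase as one token can be sketched as a preprocessing pass over the token stream. This is a minimal sketch, assuming the set of accepted bigrams has already been chosen (for example by a score cutoff); the function name `merge_phrases` and the underscore separator are illustrative choices. Running the pass more than once lets longer phrases form from already merged tokens.

```python
def merge_phrases(tokens, phrases):
    """Greedily join adjacent tokens whose pair appears in phrases."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            # Replace the pair with one underscore-joined token.
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "the Toronto Maple Leafs play hockey".split()
# First pass merges a two-word phrase; the phrase sets are hypothetical.
first_pass = merge_phrases(tokens, {("Maple", "Leafs")})
# Second pass can then attach "Toronto" to the merged token.
second_pass = merge_phrases(first_pass, {("Toronto", "Maple_Leafs")})
```

After the second pass the team name is a single vocabulary item, so an embedding model trained on this output learns one vector for it.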
  • 18. How to learn phrases?
      • Unsupervised methods
      • Supervised methods
  • 19. Unsupervised learning of phrases
      • Some papers focus on how to get good phrases beyond mutual information measures
          – Shallow parsing with structural constraints (no “of the United”)
          – If a phrase includes another phrase, the whole phrase must be included (“President of the United States”)
  • 20. Unsupervised learning of phrases
      • Cho et al. (2014) propose a model for machine translation that predicts words and phrases in a target language (recurrent neural network)
          – Input: word and next word in source language
          – Output: word and next word in target language
  • 21. Unsupervised learning of phrases
      • Predicting the next word in a foreign language helps the model associate past input with potential future output
          – Phrases learned in the Cho et al. (2014) model cluster “one to three months” near “for two months”
  • 22. Supervised learning of phrases
      • Liu et al. (2015) define quality as a threshold with two properties
          – Informativeness within a document (effectively term frequency/inverse document frequency)
          – Concordance (conventionality, judged by contrasting similar combinations, e.g. powerful coffee versus strong coffee)
          – Like TF-IDF for phrases
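To make the “TF-IDF for phrases” analogy concrete, here is a minimal sketch of a standard TF-IDF score applied to a whole phrase. It is not the feature set of Liu et al. (2015); the function name `phrase_tf_idf`, the substring-based counting, and the toy corpus are all simplifying assumptions.

```python
import math

def phrase_tf_idf(phrase, doc, corpus):
    """Score a phrase in one document with plain TF-IDF.

    Counting uses naive substring matching, which is an assumption
    made to keep the sketch short.
    """
    tf = doc.count(phrase)                       # occurrences in this document
    df = sum(1 for d in corpus if phrase in d)   # documents containing the phrase
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus (illustrative only).
corpus = ["strong coffee and more strong coffee", "weak tea", "green tea"]
score = phrase_tf_idf("strong coffee", corpus[0], corpus)
```

A phrase that appears in every document gets a score of zero, which matches the intuition that it carries little information about any one document.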
  • 23. Evaluation of learned phrases
      • Perplexity of the data given the model
          – Higher perplexity means less data explained
          – When a model captures more dependencies in the data, phrases included are good (El-Kishky et al., 2015)
          – This metric works better for some domains than others (e.g. Yelp)
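Perplexity can be computed from the probabilities a model assigns to each held-out token. The sketch below uses the standard definition, the exponential of the average negative log-probability; the function name `perplexity` and the example probabilities are assumptions for illustration.

```python
import math

def perplexity(token_probs):
    """Return exp of the mean negative log-probability of the tokens."""
    n = len(token_probs)
    total_log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-total_log_prob / n)

# A model that assigns probability 0.25 to each of four tokens is as
# uncertain as a uniform choice among four options.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model assigns higher probability to the data, which is why a model that captures phrase-level dependencies is expected to score lower.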
  • 24. Evaluations of phrases
      • El-Kishky et al. (2015) also compared retrieved phrases against Wikipedia titles
          – If in Wikipedia, then this is a very good phrase
          – If not, harder to evaluate
          – Works for some domains but maybe not others (e.g. abstracts and papers)
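A comparison against a title list can be sketched as a simple precision computation. This is an assumed simplification, not the evaluation protocol of El-Kishky et al. (2015); the function name `title_precision`, the case-insensitive matching, and the tiny title set are illustrative.

```python
def title_precision(mined_phrases, titles):
    """Fraction of mined phrases that exactly match a known title."""
    if not mined_phrases:
        return 0.0
    known = {title.lower() for title in titles}
    hits = sum(1 for phrase in mined_phrases if phrase.lower() in known)
    return hits / len(mined_phrases)

# Hypothetical inputs: two mined phrases and a small set of titles.
precision = title_precision(
    ["data mining", "in the"], {"Data mining", "White House"}
)
```

Phrases absent from the title list are counted as misses here, although, as the slide notes, they are really just harder to evaluate.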
  • 25. Current state of research
      • No gold standard for evaluating whether a phrase is good or not
          – Many available datasets and applications
          – Less clear how to learn phrases in an unsupervised framework
          – Many models implicitly or explicitly use mutual information and background language models as filters
  • 26. References
      • El-Kishky, A., Song, Y., Wang, C., Voss, C. R., & Han, J. (2015). Scalable topical phrase mining from text corpora. PVLDB, Vol. 8. (Also Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.)
      • Liu, J., Shang, J., Wang, C., Ren, X., & Han, J. (2015, May). Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.
      • Passos, A., Kumar, V., & McCallum, A. (2014). Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.
      • Williams, L., Bannister, C., Arribas-Ayllon, M., Preece, A., & Spasić, I. (2015). The role of idioms in sentiment analysis. Expert Systems with Applications, 42, 7375-7385.
      • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
      • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.