SlideShare une entreprise Scribd logo
1  sur  26
Bag-of-Words Methods
for
Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
NYU
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-
of-words models
– and introduce some of our mathematical approaches
1/16/14 NYU 2
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 3
Information Retrieval
• Task: given query = list of keywords, identify
and rank relevant documents from collection
• Basic idea:
find documents whose set of words most
closely matches words in query
1/16/14 NYU 4
Topic Vector
• Suppose the document collection has n distinct words,
w1, …, wn
• Each document is characterized by an n-dimensional
vector whose ith component is the frequency of word
wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
1/16/14 NYU 5
Weighting the components
• Unusual words like elephant determine the topic much more
than common words such as “the” or “have”
• can ignore words on a stop list or
• weight each term frequency tfi by its inverse document
frequency idfi
• where N = size of collection and ni = number of documents
containing term i
1/16/14 NYU 6

witfiidfi

idfilog(N
ni
)
Cosine similarity metric
Define a similarity metric between topic vectors
A common choice is cosine similarity (dot
product):
1/16/14 NYU 7

sim(A,B)
aibi
i

ai
2
i
  bi
2
i

Cosine similarity metric
• the cosine similarity metric is the cosine of the
angle between the term vectors:
1/16/14 NYU 8
w1
w2
Verdict: a success
• For heterogenous text collections, the vector
space model, tf-idf weighting, and cosine
similarity have been the basis for successful
document retrieval for over 50 years
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 10
Opinion Mining
– Task: judge whether a document expresses a positive
or negative opinion (or no opinion) about an object or
topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
• make lists of positive and negative words
• see which predominate in a given document
(and mark as ‘no opinion’ if there are few words of either
type
• problem: hard to make such lists
– lists will differ for different topics
1/16/14 11
NYU
Training a Model
– Instead, label the documents in a corpus and then
train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the
corpus
– probabilistic classifier
• identify most likely class
s = argmax P ( t | W)
t є {pos, neg}
1/16/14 12
NYU
Using Bayes’ Rule
• The last step is based on the naïve assumption
of independence of the word probabilities
1/16/14 NYU 13

argmaxt P(t|W)
argmaxt P(W |t)P(t)
P(W)
argmaxt P(W |t)P(t)
argmaxt P(w1,...,wn |t)P(t)
argmaxt P(wi |t)P(t)
i

Training
• We now estimate these probabilities from the
training corpus (of N documents) using maximum
likelihood estimators
• P ( t ) = count ( docs labeled t) / N
• P ( wi | t ) =
count ( docs labeled t containing wi )
count ( docs labeled t)
1/16/14 NYU 14
A Problem
• Suppose a glowing review GR (with lots of
positive words) includes one word,
“mathematical”, previously seen only in
negative reviews
• P ( positive | GR ) = ?
1/16/14 NYU 15
P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0
The maximum likelihood estimate is poor when
there is very little data
We need to ‘smooth’ the probabilities to avoid
this problem
1/16/14 NYU 16
Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities
we increase the denominator N by the number of outcomes
(values of t) (2 for ‘positive’ and ‘negative’)
– for the conditional probabilities P( w | t ) there are similarly two
outcomes (w is present or absent)
1/16/14 NYU 17
 
t
t
P 1
)
(
An NLTK Demo
Sentiment Analysis with Python NLTK Text
Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-
classification-sentiment-analysis-naive-bayes-classifier
1/16/14 NYU 18
• Is “low” a positive or a negative term?
1/16/14 NYU 19
Ambiguous terms
“low” can be positive
“low price”
or negative
“low quality”
1/16/14 NYU 20
Verdict: mixed
• A simple bag-of-words strategy with a NB
model works quite well for simple reviews
referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• the car looked great and handled well,
but the wheels kept falling off
1/16/14 NYU 21
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 22
Association Mining
• Goal: find interesting relationships among attributes
of an object in a large collection …
objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
1/16/14 NYU 23
Bag-of-words
• Simplest approach
• look for words A and B for which
frequency (A and B in same document) >>
frequency of A x frequency of B
• doesn’t work well
• want to find names (of companies, products, genes),
not individual words
• interested in specific types of terms
• want to learn from a few examples
– need contexts to avoid noise
1/16/14 NYU 24
Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com
1/16/14 NYU 25
Conclusion
• Some tasks can be handled effectively (and
very simply) by bag-of-words models,
• but most benefit from an analysis of language
structure
1/16/14 NYU 26

Contenu connexe

Similaire à Bag-of-Words Methods for Text Mining CSGI-GA.2590 Lecture

LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
M.saleem literature review
M.saleem  literature reviewM.saleem  literature review
M.saleem literature reviewMuhammad Saleem
 
Lesson 4 secondary research 2
Lesson 4   secondary research 2Lesson 4   secondary research 2
Lesson 4 secondary research 2Kavita Parwani
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3Jacqueline Thomas
 
Reading and writing at m level for scitt
Reading and writing at m level for scittReading and writing at m level for scitt
Reading and writing at m level for scittPhilwood
 
Literature Review(For Young Researchers)
Literature Review(For Young Researchers)Literature Review(For Young Researchers)
Literature Review(For Young Researchers)DrAmitPurushottam
 
STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University Darci the STEM Mom
 
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools CriticismTools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools CriticismMarijn Koolen
 
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGYChapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGYHazrina Haja
 
DOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptxDOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptxAljohnEspejo1
 
Getting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis ProposalGetting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis Proposalwwuwritingcenter
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval GESIS
 
Introduction and Tools of Research
Introduction and Tools of ResearchIntroduction and Tools of Research
Introduction and Tools of ResearchMyke Evans
 
Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1Kavita Parwani
 
Getting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research MethodologyGetting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research MethodologyRajkumar Tyata
 

Similaire à Bag-of-Words Methods for Text Mining CSGI-GA.2590 Lecture (20)

08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
M.saleem literature review
M.saleem  literature reviewM.saleem  literature review
M.saleem literature review
 
Lesson 4 secondary research 2
Lesson 4   secondary research 2Lesson 4   secondary research 2
Lesson 4 secondary research 2
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3
 
Reading and writing at m level for scitt
Reading and writing at m level for scittReading and writing at m level for scitt
Reading and writing at m level for scitt
 
Literature Review(For Young Researchers)
Literature Review(For Young Researchers)Literature Review(For Young Researchers)
Literature Review(For Young Researchers)
 
STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University
 
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools CriticismTools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
 
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGYChapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
 
DOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptxDOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptx
 
Literature Review PPT.pptx
Literature Review PPT.pptxLiterature Review PPT.pptx
Literature Review PPT.pptx
 
Xiaowei Zhou
Xiaowei ZhouXiaowei Zhou
Xiaowei Zhou
 
Getting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis ProposalGetting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis Proposal
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
 
Introduction and Tools of Research
Introduction and Tools of ResearchIntroduction and Tools of Research
Introduction and Tools of Research
 
Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1
 
Getting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research MethodologyGetting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research Methodology
 
Lecture-2.pdf
Lecture-2.pdfLecture-2.pdf
Lecture-2.pdf
 

Plus de nyomans1

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptnyomans1
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxnyomans1
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxnyomans1
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.pptnyomans1
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptnyomans1
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptxnyomans1
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxnyomans1
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.pptnyomans1
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...nyomans1
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptxnyomans1
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptxnyomans1
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptxnyomans1
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptxnyomans1
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptxnyomans1
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptxnyomans1
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxnyomans1
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxnyomans1
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxnyomans1
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptxnyomans1
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptxnyomans1
 

Plus de nyomans1 (20)

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptx
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptx
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.ppt
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptx
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptx
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.ppt
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptx
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
 

Dernier

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 

Dernier (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 

Bag-of-Words Methods for Text Mining CSGI-GA.2590 Lecture

  • 1. Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU
  • 2. Bag of Words Models • do we really need elaborate linguistic analysis? • look at text mining applications – document retrieval – opinion mining – association mining • see how far we can get with document-level bag- of-words models – and introduce some of our mathematical approaches 1/16/14 NYU 2
  • 3. • document retrieval • opinion mining • association mining 1/16/14 NYU 3
  • 4. Information Retrieval • Task: given query = list of keywords, identify and rank relevant documents from collection • Basic idea: find documents whose set of words most closely matches words in query 1/16/14 NYU 4
  • 5. Topic Vector • Suppose the document collection has n distinct words, w1, …, wn • Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document Example • D1 = [The cat chased the mouse.] • D2 = [The dog chased the cat.] • W = [The, chased, dog, cat, mouse] (n = 5) • V1 = [ 2 , 1 , 0 , 1 , 1 ] • V2 = [ 2 , 1 , 1 , 1 , 0 ] 1/16/14 NYU 5
  • 6. Weighting the components • Unusual words like elephant determine the topic much more than common words such as “the” or “have” • can ignore words on a stop list or • weight each term frequency tfi by its inverse document frequency idfi • where N = size of collection and ni = number of documents containing term i 1/16/14 NYU 6  witfiidfi  idfilog(N ni )
  • 7. Cosine similarity metric Define a similarity metric between topic vectors A common choice is cosine similarity (dot product): 1/16/14 NYU 7  sim(A,B) aibi i  ai 2 i   bi 2 i 
  • 8. Cosine similarity metric • the cosine similarity metric is the cosine of the angle between the term vectors: 1/16/14 NYU 8 w1 w2
  • 9. Verdict: a success • For heterogenous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
  • 10. • document retrieval • opinion mining • association mining 1/16/14 NYU 10
  • 11. Opinion Mining – Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic • classification task • valuable for producers/marketers of all sorts of products – Simple strategy: bag-of-words • make lists of positive and negative words • see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type • problem: hard to make such lists – lists will differ for different topics 1/16/14 11 NYU
  • 12. Training a Model – Instead, label the documents in a corpus and then train a classifier from the corpus • labeling is easier than thinking up words • in effect, learn positive and negative words from the corpus – probabilistic classifier • identify most likely class s = argmax P ( t | W) t є {pos, neg} 1/16/14 12 NYU
  • 13. Using Bayes’ Rule • The last step is based on the naïve assumption of independence of the word probabilities 1/16/14 NYU 13  argmaxt P(t|W) argmaxt P(W |t)P(t) P(W) argmaxt P(W |t)P(t) argmaxt P(w1,...,wn |t)P(t) argmaxt P(wi |t)P(t) i 
  • 14. Training • We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators • P ( t ) = count ( docs labeled t) / N • P ( wi | t ) = count ( docs labeled t containing wi ) count ( docs labeled t) 1/16/14 NYU 14
  • 15. A Problem • Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews • P ( positive | GR ) = ? 1/16/14 NYU 15
  • 16. P ( positive | GR ) = 0 because P ( “mathematical” | positive ) = 0 The maximum likelihood estimate is poor when there is very little data We need to ‘smooth’ the probabilities to avoid this problem 1/16/14 NYU 16
  • 17. Laplace Smoothing • A simple remedy is to add 1 to each count – to keep them as probabilities we increase the denominator N by the number of outcomes (values of t) (2 for ‘positive’ and ‘negative’) – for the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent) 1/16/14 NYU 17   t t P 1 ) (
  • 18. An NLTK Demo Sentiment Analysis with Python NLTK Text Classification • http://text-processing.com/demo/sentiment/ NLTK Code (simplified classifier) • http://streamhacker.com/2010/05/10/text- classification-sentiment-analysis-naive-bayes-classifier 1/16/14 NYU 18
  • 19. • Is “low” a positive or a negative term? 1/16/14 NYU 19
  • 20. Ambiguous terms “low” can be positive “low price” or negative “low quality” 1/16/14 NYU 20
  • 21. Verdict: mixed • A simple bag-of-words strategy with a NB model works quite well for simple reviews referring to a single item, but fails – for ambiguous terms – for comparative reviews – to reveal aspects of an opinion • the car looked great and handled well, but the wheels kept falling off 1/16/14 NYU 21
  • 22. • document retrieval • opinion mining • association mining 1/16/14 NYU 22
  • 23. Association Mining • Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B – e.g., “people who bought A also bought B” • For text: documents with term A also have term B – widely used in scientific and medical literature 1/16/14 NYU 23
  • 24. Bag-of-words • Simplest approach • look for words A and B for which frequency (A and B in same document) >> frequency of A x frequency of B • doesn’t work well • want to find names (of companies, products, genes), not individual words • interested in specific types of terms • want to learn from a few examples – need contexts to avoid noise 1/16/14 NYU 24
  • 25. Needed Ingredients • Effective text association mining needs – name recognition – term classification – preferably: ability to learn patterns – preferably: a good GUI – demo: www.coremine.com 1/16/14 NYU 25
  • 26. Conclusion • Some tasks can be handled effectively (and very simply) by bag-of-words models, • but most benefit from an analysis of language structure 1/16/14 NYU 26