SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
Text
Classification
Positive or negative movie
review?
• unbelievably disappointing
• Full of zany characters and richly applied satire,
and some great plot twists
• this is the greatest screwball comedy ever
filmed
• It was pathetic. The worst part about it was the
boxing scenes.
2
What is the subject of this
article?
• Management/mba
• admission
• arts
• exam preparation
• nursing
• technology
• …
3
Subject Category
?
Text Classification
• Assigning subject categories, topics, or
genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C
Classification Methods:
Hand-coded rules
• Rules based on combinations of words or other
features
• spam: black-list-address OR (“dollars” AND“have been
selected”)
• Accuracy can be high
• If rules carefully refined by expert
• But building and maintaining these rules is
expensive
Classification Methods:
Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• A training set of m hand-labeled documents
(d1,c1),....,(dm,cm)
• Output:
• a learned classifier γ:d  c
7
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• Maximum Entropy Model
• Generative Vs Discriminative
• …
Naïve Bayes Intuition
• Simple (“naïve”) classification method
based on Bayes rule
• Relies on very simple representation of
document
• Bag of words
The bag of words
representation
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It
manages to be whimsical and
romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
always happy to see it again
whenever I have a friend who
hasn't seen it yet.
γ
(
)=c
The bag of words
representation
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It
manages to be whimsical and
romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
always happy to see it again
whenever I have a friend who
hasn't seen it yet.
γ
(
)=c
The bag of words representation:
using a subset of words
x love xxxxxxxxxxxxxxxx sweet
xxxxxxx satirical xxxxxxxxxx
xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx
xxxxxxxxxxxxx whimsical xxxx
romantic xxxx laughing
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx recommend xxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xx several xxxxxxxxxxxxxxxxx
xxxxx happy xxxxxxxxx again
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
γ
(
)=c
Planning GUIGarbage
Collection
Machine 
Learning NLP
parser
tag
training
translation
language...
learning
training
algorithm
shrinkage
network...
garbage
collection
memory
optimization
region...
Test
document
parser
language
label
translation
…
Bag of words for document
classification
...planning
temporal
reasoning
plan
language...
?
Bayes’ Rule Applied to
Documents and Classes
• For a document d and a class c
P(c| d) =
P(d| c)P(c)
P(d)
Naïve Bayes Classifier (I)
MAP is “maximum a
posteriori” = most
likely class
Bayes Rule
Dropping the
denominator
cMAP = argmax
c∈C
P(c| d)
= argmax
c∈C
P(d| c)P(c)
P(d)
= argmax
c∈C
P(d| c)P(c)
Naïve Bayes Classifier (II)
Document d
represented
as features
x1..xn
cMAP = argmax
c∈C
P(d| c)P(c)
= argmax
c∈C
P(x1, x2,…, xn | c)P(c)
Naïve Bayes Classifier (IV)
How often does this
class occur?
O(|X|n•|C|) parameters
We can just count the
relative frequencies
in a corpus
Could only be estimated if
a very, very large number
of training examples was
available.
cMAP = argmax
c∈C
P(x1, x2,…, xn | c)P(c)
Multinomial Naïve Bayes
Independence Assumptions
• Bag of Words assumption: Assume position
doesn’t matter
• Conditional Independence: Assume the
feature probabilities P(xi|cj) are independent
given the class c.
P(x1, x2,…, xn | c)
P(x1,…, xn |c) = P(x1 |c)•P(x2 |c)•P(x3 |c)•...•P(xn | c)
Multinomial Naïve Bayes
Classifier
cMAP = argmax
c∈C
P(x1, x2,…, xn | c)P(c)
cNB = argmax
c∈C
P(cj ) P(x| c)
x∈X
∏
Learning the Multinomial Naïve
Bayes Model
• First attempt: maximum likelihood
estimates
• simply use the frequencies in the data
ˆP(wi | cj ) =
count(wi,cj )
count(w,cj )
w∈V
∑
ˆP(cj ) =
doccount(C = cj )
Ndoc
• Create mega-document for topic j by
concatenating all docs in this topic
• Use frequency of w in mega-document
Parameter estimation
fraction of times word wi
appears
among all words in documents
of topic cj
ˆP(wi | cj ) =
count(wi,cj )
count(w,cj )
w∈V
∑
Summary: Naive Bayes is Not
So Naive
• Very Fast, low storage requirements
• Robust to Irrelevant Features
Irrelevant Features cancel each other without affecting results
• Very good in domains with many equally important
features
Decision Trees suffer from fragmentation in such cases –
especially if little data
• Optimal if the independence assumptions hold: If
assumed independence is correct, then it is the Bayes Optimal
Classifier for problem
• A good dependable baseline for text classification
Real-world systems generally
combine:
• Automatic classification
• Manual review of
uncertain/difficult/"new” cases
23
24
The Real World
• Gee, I’m building a text classifier for real, now!
• What should I do?
25
The Real World
• Write your own classifier code.
• Tools:
●
Apache Mahout (java)
●
NLTK (python)
●
Lingpipe
●
Stanford Classifier …..
• APIs:
●
OpenCalais
●
AlchemiApi
●
UIUC CCG.....

Contenu connexe

Tendances

Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Simplilearn
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classificationsathish sak
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent methodSanghyuk Chun
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNNAshray Bhandare
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentationMarijn van Zelst
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA BoostAman Patel
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networksguestfee8698
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkKnoldus Inc.
 

Tendances (20)

Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Lstm
LstmLstm
Lstm
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
 
Word embedding
Word embedding Word embedding
Word embedding
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA Boost
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 

Similaire à Introduction to text classification using naive bayes

Topic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptxTopic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptxHassaanIbrahim2
 
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2tingyuansenastro
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysisSubhas Kumar Ghosh
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashingDmitriy Selivanov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Mail.ru Group
 
Significant scales in community structure
Significant scales in community structureSignificant scales in community structure
Significant scales in community structureVincent Traag
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 

Similaire à Introduction to text classification using naive bayes (11)

Topic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptxTopic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptx
 
Normalizing flow
Normalizing flowNormalizing flow
Normalizing flow
 
Text Classification.pdf
Text Classification.pdfText Classification.pdf
Text Classification.pdf
 
My7class
My7classMy7class
My7class
 
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2
ANU ASTR 4004 / 8004 Astronomical Computing : Lecture 2
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Significant scales in community structure
Significant scales in community structureSignificant scales in community structure
Significant scales in community structure
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 

Dernier

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Introduction to text classification using naive bayes

  • 2. Positive or negative movie review? • unbelievably disappointing • Full of zany characters and richly applied satire, and some great plot twists • this is the greatest screwball comedy ever filmed • It was pathetic. The worst part about it was the boxing scenes. 2
  • 3. What is the subject of this article? • Management/mba • admission • arts • exam preparation • nursing • technology • … 3 Subject Category ?
  • 4. Text Classification • Assigning subject categories, topics, or genres • Spam detection • Authorship identification • Age/gender identification • Language Identification • Sentiment analysis • …
  • 5. Text Classification: definition • Input: • a document d • a fixed set of classes C = {c1, c2,…, cJ} • Output: a predicted class c ∈ C
  • 6. Classification Methods: Hand-coded rules • Rules based on combinations of words or other features • spam: black-list-address OR (“dollars” AND“have been selected”) • Accuracy can be high • If rules carefully refined by expert • But building and maintaining these rules is expensive
  • 7. Classification Methods: Supervised Machine Learning • Input: • a document d • a fixed set of classes C = {c1, c2,…, cJ} • A training set of m hand-labeled documents (d1,c1),....,(dm,cm) • Output: • a learned classifier γ:d  c 7
  • 8. Classification Methods: Supervised Machine Learning • Any kind of classifier • Naïve Bayes • Logistic regression • Support-vector machines • Maximum Entropy Model • Generative Vs Discriminative • …
  • 9. Naïve Bayes Intuition • Simple (“naïve”) classification method based on Bayes rule • Relies on very simple representation of document • Bag of words
  • 10. The bag of words representation I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet. γ ( )=c
  • 11. The bag of words representation I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet. γ ( )=c
  • 12. The bag of words representation: using a subset of words x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx γ ( )=c
  • 14. Bayes’ Rule Applied to Documents and Classes • For a document d and a class c P(c| d) = P(d| c)P(c) P(d)
  • 15. Naïve Bayes Classifier (I) MAP is “maximum a posteriori” = most likely class Bayes Rule Dropping the denominator cMAP = argmax c∈C P(c| d) = argmax c∈C P(d| c)P(c) P(d) = argmax c∈C P(d| c)P(c)
  • 16. Naïve Bayes Classifier (II) Document d represented as features x1..xn cMAP = argmax c∈C P(d| c)P(c) = argmax c∈C P(x1, x2,…, xn | c)P(c)
  • 17. Naïve Bayes Classifier (IV) How often does this class occur? O(|X|n•|C|) parameters We can just count the relative frequencies in a corpus Could only be estimated if a very, very large number of training examples was available. cMAP = argmax c∈C P(x1, x2,…, xn | c)P(c)
  • 18. Multinomial Naïve Bayes Independence Assumptions • Bag of Words assumption: Assume position doesn’t matter • Conditional Independence: Assume the feature probabilities P(xi|cj) are independent given the class c. P(x1, x2,…, xn | c) P(x1,…, xn |c) = P(x1 |c)•P(x2 |c)•P(x3 |c)•...•P(xn | c)
  • 19. Multinomial Naïve Bayes Classifier cMAP = argmax c∈C P(x1, x2,…, xn | c)P(c) cNB = argmax c∈C P(cj ) P(x| c) x∈X ∏
  • 20. Learning the Multinomial Naïve Bayes Model • First attempt: maximum likelihood estimates • simply use the frequencies in the data ˆP(wi | cj ) = count(wi,cj ) count(w,cj ) w∈V ∑ ˆP(cj ) = doccount(C = cj ) Ndoc
  • 21. • Create mega-document for topic j by concatenating all docs in this topic • Use frequency of w in mega-document Parameter estimation fraction of times word wi appears among all words in documents of topic cj ˆP(wi | cj ) = count(wi,cj ) count(w,cj ) w∈V ∑
  • 22. Summary: Naive Bayes is Not So Naive • Very Fast, low storage requirements • Robust to Irrelevant Features Irrelevant Features cancel each other without affecting results • Very good in domains with many equally important features Decision Trees suffer from fragmentation in such cases – especially if little data • Optimal if the independence assumptions hold: If assumed independence is correct, then it is the Bayes Optimal Classifier for problem • A good dependable baseline for text classification
  • 23. Real-world systems generally combine: • Automatic classification • Manual review of uncertain/difficult/"new” cases 23
  • 24. 24 The Real World • Gee, I’m building a text classifier for real, now! • What should I do?
  • 25. 25 The Real World • Write your own classifier code. • Tools: ● Apache Mahout (java) ● NLTK (python) ● Lingpipe ● Stanford Classifier ….. • APIs: ● OpenCalais ● AlchemiApi ● UIUC CCG.....