Fun With Text
Hacking Text Analytics
Cohan Sujay Carlos
Aiaioo Labs
Bangalore
I shall chuck three buzz-words at you:
• One type of big data is unstructured data
• One type of unstructured data is text
• The analysis of text is text analysis
Is text analytics important?
You didn’t answer my question.
Is text analytics important?
Hell, I don’t know!
But what are you doing reading this if it isn’t?
What we’re going to cover
• One Machine Learning Tool
• How to reduce Text Analytics Tasks to
steps which you can solve with merely
this one Machine Learning Tool
The One ML Tool
Get ready for the one ML tool you’ll need to
hack text analytics
DRUM ROLL !!!!!!!
The Classifier
That ML Tool is the Classifier
What is a Classifier?
Something that performs classification.
Classification = categorizing
Classification = deciding
Classification = labelling
Classification = Deciding = Labelling
5’11”
5’ 8”
Classify these door heights as: Short or Tall ?
5’8”
5’11”
6’2”
6’6”
5’ 2”
6’8”
6’9”
6’10”
Classification = Deciding = Labelling
5’11” Short
5’ 8” Short
For classification, you always start with
some labelled data points.
5’8”
5’11”
6’2” Tall
6’6” Tall
5’ 2” Short
6’8”
6’9”
6’10” Tall
Classification = Deciding = Labelling
5’11” Short
5’ 8” Short
5’8” Short
5’11” Short
6’2” Tall
6’6” Tall
5’ 2” Short
6’8” Tall
6’9” Tall
6’10” Tall
If you were doing analysis, you’d ask a
human to come up with a rule like:
A door below 6’ is Short else it’s Tall
Classification = Deciding = Labelling
5’11” Short
5’ 8” Short
In ML, you just provide some examples.
The computer discovers the rule.
5’8”
5’11”
6’2” Tall
6’6” Tall
5’ 2” Short
6’8”
6’9”
6’10” Tall
Classification = Deciding = Labelling
5’11” Short
5’ 8” Short
5’8”
5’11”
6’2” Tall
6’6” Tall
5’ 2” Short
6’8”
6’9”
6’10” Tall
The ML algorithm learns something like:
A door below 6’ is Short else it’s Tall
Classification = Deciding = Labelling
5’11” Short
5’ 8” Short
5’8” Short
5’11” Short
6’2” Tall
6’6” Tall
5’ 2” Short
6’8” Tall
6’9” Tall
6’10” Tall
You will learn to create an ML algorithm
that learns something like this and that
works with text.
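As a rough illustration of how a computer could discover the door rule from the labelled examples, here is a minimal Python sketch. It assumes the heights have been converted to inches, and it simply tries the midpoints between neighbouring heights as candidate cut-offs. The names learn_threshold and classify are made up for this example, and this is only one simple way to learn such a rule.

```python
# Labelled door heights from the slides, converted to inches (5'11" = 71).
labelled = [
    (71, "Short"), (68, "Short"), (62, "Short"),
    (74, "Tall"), (78, "Tall"), (82, "Tall"),
]

def learn_threshold(examples):
    """Return the cut-off (in inches) that misclassifies the fewest examples."""
    heights = sorted(height for height, _ in examples)
    # Candidate cut-offs: midpoints between neighbouring heights.
    candidates = [(a + b) / 2 for a, b in zip(heights, heights[1:])]

    def errors(cut):
        # An example is wrong when "below the cut" disagrees with its label.
        return sum((h < cut) != (label == "Short") for h, label in examples)

    return min(candidates, key=errors)

def classify(height, cut):
    """Apply the learned rule: below the cut-off is Short, else Tall."""
    return "Short" if height < cut else "Tall"

cut = learn_threshold(labelled)
# The unlabelled doors from the slides: 5'8", 5'11", 6'8", 6'9".
for height in (68, 71, 80, 81):
    print(height, classify(height, cut))
```

On these six labelled doors the sketch settles on a cut-off of 72.5 inches, which is close to the 6’ rule a human would write.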
Topic Classification
Can you tell which is about Politics and which is
about Sports?
The United Nations
Security Council today
Manchester United
beat Barca to reach
Topic Classification
Can you train an ML algorithm to tell which is
about Politics and which is about Sports?
Manchester United
beat Barca to reach
Politics Sports
The United Nations
Security Council today
Topic Classification
Start by taking some samples of documents on
politics and some samples of sports documents.
We’re using really short documents so you can
do all the calculations manually & see that this
ML algorithm really works!
Politics:
The United Nations
The United States and
Sports:
Manchester United
Manchester and Barca
Step 1: Learn Multinomial Probabilities
Politics
The United Nations
The United States and
Sports
Manchester United
Manchester and Barca
P(The|Politics) = 2/7
P(United|Politics) = 2/7
P(Nations|Politics) = 1/7
P(States|Politics) = 1/7
P(and|Politics) = 1/7
P(Manchester|Sports) = 2/5
P(United|Sports) = 1/5
P(and|Sports) = 1/5
P(Barca|Sports) = 1/5
P(Politics) = 7/12 P(Sports) = 5/12
One ML Algorithm
Step 1 was easy!!!
Are you ready for Step 2 ?
Step 2: There’s no step 2
This is a Naïve Bayesian Classifier!!!
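The probabilities from Step 1 can be reproduced by counting words. The sketch below is one way to do it in Python, using exact fractions so the numbers match the slides. The function name train is an assumption for this example. Note that it follows the slides in taking P(Politics) and P(Sports) from word counts (7 of 12 words, 5 of 12 words); a more common choice is to estimate the priors from document counts.

```python
from collections import Counter
from fractions import Fraction

# The four tiny training documents from the slides.
training = {
    "Politics": ["The United Nations", "The United States and"],
    "Sports": ["Manchester United", "Manchester and Barca"],
}

def train(training):
    """Estimate priors and per-category word probabilities by counting."""
    word_counts = {
        category: Counter(word for doc in docs for word in doc.split())
        for category, docs in training.items()
    }
    totals = {c: sum(counts.values()) for c, counts in word_counts.items()}
    all_words = sum(totals.values())
    # Word-count based priors, as on the slides (an unusual but simple choice).
    prior = {c: Fraction(totals[c], all_words) for c in totals}
    likelihood = {
        c: {word: Fraction(n, totals[c]) for word, n in counts.items()}
        for c, counts in word_counts.items()
    }
    return prior, likelihood

prior, likelihood = train(training)
print(prior["Politics"], likelihood["Politics"]["United"])
print(prior["Sports"], likelihood["Sports"]["Manchester"])
```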
One ML Algorithm
Let’s put the classifier to work:
Let’s see if it can classify the following
documents :
1. United Nations
2. Manchester United
We are using deliberately short documents!
Running the Topic Classifier
United Nations
P(Politics|United Nations)
∝ P(United|Politics)*P(Nations|Politics)*P(Politics)
= (2/7)*(1/7)*(7/12) = 1/(7*6)
P(Sports|United Nations)
∝ P(United|Sports)*P(Nations|Sports)*P(Sports)
= (1/5)*(0)*(5/12) = 0
Running the Topic Classifier
United Nations
P(Politics|United Nations)
>
P(Sports|United Nations)
So, the classifier has returned the category POLITICS
Running the Topic Classifier
Manchester United
P(Politics|Manchester United)
∝ P(Manchester|Politics)*P(United|Politics)*P(Politics)
= (0)*(2/7)*(7/12) = 0
P(Sports|Manchester United)
∝ P(Manchester|Sports)*P(United|Sports)*P(Sports)
= (2/5)*(1/5)*(5/12) = 2/(5*12)
Running the Topic Classifier
P(Sports|Manchester United)
>
P(Politics|Manchester United)
So, the classifier has returned the category SPORTS
Manchester United
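The two hand calculations can be checked with a few lines of Python. This is a minimal sketch that hard-codes the probabilities from Step 1; the names score and classify are assumptions for this example. A word the category has never seen gets probability 0, exactly as in the slides. Real systems usually add smoothing so that one unseen word does not force the whole product to zero.

```python
from fractions import Fraction as F

# Probabilities learned in Step 1, copied from the slides.
prior = {"Politics": F(7, 12), "Sports": F(5, 12)}
likelihood = {
    "Politics": {"The": F(2, 7), "United": F(2, 7), "Nations": F(1, 7),
                 "States": F(1, 7), "and": F(1, 7)},
    "Sports": {"Manchester": F(2, 5), "United": F(1, 5),
               "and": F(1, 5), "Barca": F(1, 5)},
}

def score(document, category):
    """Prior times the product of the word probabilities (unnormalized)."""
    result = prior[category]
    for word in document.split():
        # Unseen words contribute probability 0 (no smoothing in this sketch).
        result *= likelihood[category].get(word, F(0))
    return result

def classify(document):
    """Return the category with the highest score."""
    return max(prior, key=lambda category: score(document, category))

for doc in ("United Nations", "Manchester United"):
    print(doc, "->", classify(doc))
```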
Topic Classification
United Nations: Politics
Manchester United: Sports
We have successfully used an ML algorithm to
tell us which document is about Politics and
which is about Sports !!!
Can We Solve Other Problems?
Now, we have an ML tool in our toolkit – a Naïve
Bayesian classifier.
So, if we can represent a text analysis problem
as a classification task, then we can solve it
using ML.
So, let us learn how to represent text analysis
problems as classification tasks.
Problem 1: Sentence Segmentation
The problem of identifying the end-points of
sentences is called sentence segmentation. It
can be reduced to a classification problem.
Sentence Segmentation
Yes, Mr. Anurag. You need a D.L. to drive.
Sentence segmentation
Classify each ‘.’, ‘!’ and ‘?’ into:
1) sentence terminator and
2) not a sentence terminator.
Sentence Segmentation
Hurray! We have turned the problem
of sentence segmentation into a
classification problem.
Once you have modeled a text
analytics problem as an ML problem,
there is one more step you need to
perform to get a working solution.
Sentence Segmentation
FIND THE FEATURES!
Features
Which country’s flag is this?
Features are the clues you need for
good decision making.
Features
Which country’s flag is this?
One important feature for solving this
decision problem is colour.
Features for Topic Classification
P(The|Politics) = 2/7
P(United|Politics) = 2/7
P(Nations|Politics) = 1/7
P(States|Politics) = 1/7
P(and|Politics) = 1/7
P(Manchester|Sports) = 2/5
P(United|Sports) = 1/5
P(and|Sports) = 1/5
P(Barca|Sports) = 1/5
P(Politics) = 7/12 P(Sports) = 5/12
For topic classification, the features
are the individual words in the text.
Sentence Segmentation
What are the features you would use?
Yes, Mr. Anurag. You need a D.L. to drive.
That’s when you start reading the
research papers!!!
Sentence Segmentation
What are the features you would use?
Is_This_Character_a_Dot
Is_This_Within_Quotes
Is_Next_Letter_Capitalized
Is_Prev_Letter_Capitalized
Number_of_Words_in_Sentence_so_Far
Is_Next_Word_a_Name
Yes, Mr. Anurag. You need a D.L. to drive.
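To make the feature idea concrete, here is a small Python sketch that computes four of the features listed above for every ‘.’, ‘!’ and ‘?’ in a text. The function name candidate_features is an assumption for this example. Number_of_Words_in_Sentence_so_Far and Is_Next_Word_a_Name are left out because they need extra bookkeeping and a list of names.

```python
def candidate_features(text):
    """Yield (index, features) for every '.', '!' or '?' in the text."""
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        # Nearest letters after and before the candidate ("" if there is none).
        next_letter = next((c for c in text[i + 1:] if c.isalpha()), "")
        prev_letter = next((c for c in reversed(text[:i]) if c.isalpha()), "")
        yield i, {
            "Is_This_Character_a_Dot": ch == ".",
            # Inside quotes if an odd number of double quotes came before.
            "Is_This_Within_Quotes": text[:i].count('"') % 2 == 1,
            "Is_Next_Letter_Capitalized": next_letter.isupper(),
            "Is_Prev_Letter_Capitalized": prev_letter.isupper(),
        }

sentence = "Yes, Mr. Anurag. You need a D.L. to drive."
candidates = list(candidate_features(sentence))
for index, features in candidates:
    print(index, features)
```

With only these four features, the dot after “Mr” and the dot after “Anurag” look identical, which is one reason the longer feature list matters.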
Sentence Segmentation
Train an NB classifier on some text
whose characters are marked as
sentence terminators or not.
It will learn to assign characters to the
categories sentence terminator and not a
sentence terminator.
Problem 2: Tokenization
Now, get a D.L., Mr. Anurag.
For almost all analyses, you have to identify the
individual words in the text.
How can you break a sentence into words?
Now , get a D.L. , Mr. Anurag .
Tokenization
Yes, Mr. Anurag. You need a D.L. to drive.
Tokenization
Classify each ‘.’, ‘!’, ‘,’, ‘ ’, ‘?’ etc. into:
1) word terminator and
2) not a word terminator.
Tokenization or Word Segmentation
We have turned the problem of word
segmentation into a classification
problem.
Train a classifier on text whose
characters are marked as word
terminators or not. It will learn to
assign characters to the categories
word terminator and not a word
terminator.
Problem 3: Part of Speech (POS) Tagging
Anurag needs a D.L.
Anurag/NNP needs/VBZ a/DT D.L./NN
How do you tag the words in a sentence?
POS Tagging
Run from the first to the last word in the
sentence, classifying each word in the sentence
into a part-of-speech category.
We have turned the problem of POS Tagging into a
Classification problem where you label each word as:
1. Noun
2. Verb
3. Adjective
4. Adverb
5. Interjection
6. Conjunction (and, or, if, neither … nor)
7. Pronoun (I, it, you, them)
8. Preposition (in, of, out)
Problem 4: Named Entity Recognition
Anurag is looking for a Hyundai car in Bangalore
Anurag = Person
Hyundai car = Vehicle
Bangalore = Place
How do you extract the named entities in a sentence
using a classifier?
Anurag/Person is/Other looking/Other
for/Other a/Other Hyundai/Vehicle car/Vehicle
in/Other Bangalore/Place
We have turned the problem of Named Entity
Recognition (NER) into a Tagging problem
where you label each word as:
1. Person
2. Vehicle
3. Place
But we know that we can turn the problem of
Tagging into a Classification problem over the
same labels.
Named Entity Recognition
Run from the first to the last word in the
sentence, classifying each word in the sentence
into a Named Entity Recognition category.
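Once every word has a label, the named entities can be read off by joining neighbouring words that share a label. The Python sketch below assumes the per-word labels have already been produced by the classifier; the function name extract_entities is made up for this example. One known limitation of this simple grouping is that two different entities of the same type standing next to each other would be merged into one.

```python
from itertools import groupby

# Per-word labels for the example sentence, as shown on the slide.
tagged = [
    ("Anurag", "Person"), ("is", "Other"), ("looking", "Other"),
    ("for", "Other"), ("a", "Other"), ("Hyundai", "Vehicle"),
    ("car", "Vehicle"), ("in", "Other"), ("Bangalore", "Place"),
]

def extract_entities(tagged_words):
    """Join runs of consecutive words that share a non-Other label."""
    entities = []
    for label, run in groupby(tagged_words, key=lambda pair: pair[1]):
        if label != "Other":
            entities.append((" ".join(word for word, _ in run), label))
    return entities

entities = extract_entities(tagged)
print(entities)
```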
Hey We Have One More ML Tool
We just built one more very useful ML tool!
The Extractor
Extraction
What is an Extractor?
Something that performs extraction.
Extraction = recognizing
Extraction = finding
Extraction = locating
Extraction = Finding = Locating
Problem 5: Relation Extraction
Tim Cook is the new CEO of Apple Computers
Relation: CEO_of
Tim Cook (Person) Apple Computers (Org)
How do you identify relations between entities in a
sentence using a classifier?
Relation Extraction
Text: Tim Cook is the new CEO of Apple Computers
Step 1:
Analysis: Tim/Person Cook/Person is the new CEO
of Apple/Org Computers/Org
Step 2:
{Tim/Person Cook/Person, Apple/Org Computers/Org} =>
CEO_of
Run through all pairs of named entities, classifying each
pair into CEO_of or Other.
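The loop over entity pairs might look like the Python sketch below. The function classify_pair is a hypothetical stand-in for a trained classifier: it uses a hand-written rule (a Person followed by an Org with the word CEO in between) only so that the example runs. The Entity type and the function names are assumptions for this example.

```python
from itertools import combinations
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    start: int  # index of the first word of the entity
    end: int    # index one past the last word of the entity

def classify_pair(words, first, second):
    """Stand-in for a trained classifier: returns CEO_of or Other."""
    between = words[first.end:second.start]
    if first.label == "Person" and second.label == "Org" and "CEO" in between:
        return "CEO_of"
    return "Other"

def extract_relations(words, entities):
    """Classify every pair of entities, keeping the pairs that are not Other."""
    relations = []
    # combinations() yields each pair once, in order of appearance.
    for first, second in combinations(entities, 2):
        label = classify_pair(words, first, second)
        if label != "Other":
            relations.append((first.text, label, second.text))
    return relations

words = "Tim Cook is the new CEO of Apple Computers".split()
entities = [Entity("Tim Cook", "Person", 0, 2),
            Entity("Apple Computers", "Org", 7, 9)]
relations = extract_relations(words, entities)
print(relations)
```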
But look at what we’re extracting!
Do you realize what we’re extracting?
Meaning !!!
Extracting Meaning
Text: I am looking for a Hyundai car in B’lore
Syntactic Analysis: I/Pronoun am/BE looking/V
for a/DT Hyundai/NP car/NN in/PP Bangalore/NP
Semantic Analysis: I/Person am looking for a
Hyundai/Vehicle car/Vehicle in Bangalore/Place
But is this comprehensive?
Are you telling me that this is all I have to do to
extract every possible sort of simple meaning?
Yeah!!!
Two Kinds of Utterances
All utterances fall into two main categories …
Intentional: I want to buy a computer.
Informational: There was heavy snowfall in Sikkim.
360 Degree Text Analysis
How do you deal with intentional utterances?
I want to buy a computer.
Intentional Utterance
Raw Text: Are you sad that Steve Jobs died?
Analysis: This person is inquiring about
someone’s emotions concerning Steve Jobs
Intention Analysis
Intention Holder: I
Intention: inquire
Subjective Utterance
Raw Text: I am sad that Steve Jobs died
Analysis: This person holds a positive opinion
on Steve Jobs
Sentiment Analysis
Sentiment Holder: I
Object of Sentiment: Steve Jobs
Polarity of Sentiment: positive
360 Degree Text Analysis
How do you deal with informational utterances?
There was heavy snowfall in Sikkim.
Approaches to
Extracting Meaning
Raw Text: There is heavy snowfall in Sikkim.
Analysis: Snowfall event
Event Analysis
Event: snowfall
Approaches to
Extracting Meaning
Raw Text: Bangalore is the capital of K’taka
Analysis: capital_of relation exists
Fact Analysis
Entity: Bangalore/Place
Karnataka/Place
Relation: Bangalore capital_of K’taka
Extracting Meaning
Text: I am looking for a Hyundai car in Bangalore
Semantic Analysis: I/Person am looking for a
Hyundai/Thing car/Thing in Bangalore/Place
Entity Extraction
Entities:
I Person
Hyundai car Thing
Bangalore Place
Fact Analysis
Extracting Meaning
Text: Tim Cook is the new CEO of Apple Computers
Analysis: Tim/Person Cook/Person is the new CEO
of Apple/Org Computers/Org
Relation Extraction
Relation: CEO_of
Tim Cook (Person) Apple Computers(Org)
Fact Analysis
So, I can do anything (badly?)
I got it! I got it! I can do anything.
But how do I know how well I am doing it?
Measurement
How do you measure the performance of a
classifier?
This measurement thing is very
important in any design process,
because it lets you compare two designs and
decide which is better.
Before we go on …
… promise me …
… that you will never forget what I am about to
tell you …
Measurement
Break up the data points into training
and test parts (usually an 80:20 split)
and never test on your training data.
Measurement
Why not test on the training data?
Break up the data points into training
and test – usually an 80:20 split.
Train on 80%
Test on the remaining 20%
Measurement
Why not develop on the test data?
Or if you are still developing features,
break up the data points into training,
development and test – usually a
70:10:20 split.
Train on 70%
Develop on 10%
Test on the remaining 20%
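A 70:10:20 split can be done in a few lines of Python. This is a minimal sketch; the function name split and its parameters are assumptions for this example. The data is shuffled first so that each part is a random sample, and a fixed seed keeps the split repeatable.

```python
import random

def split(data, train_pct=70, dev_pct=10, seed=0):
    """Shuffle the data and cut it into training, development and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_dev = len(items) * dev_pct // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever remains (20% by default)
    return train, dev, test

train, dev, test = split(range(100))
print(len(train), len(dev), len(test))
```

Setting dev_pct=0 and train_pct=80 gives the plain 80:20 split.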
That applies to college as well …
… you won’t get accurate measurements …
… if you test students using questions that
appeared in the question bank!
… and you will be encouraging students to learn
things by rote!
Classification Quality Metric - Accuracy
Accuracy = Correct Answers / Total Number of Questions
If your categories are Politics and Sports:
Accuracy = (Politics Documents classified as Politics
+ Sports Documents classified as Sports)
/ Total Number of Documents
Classification Quality Metric - Accuracy
Point of View = Sports = (+ve)
                      Gold – Politics       Gold – Sports
Observed – Politics   TN (True Negative)    FN (False Negative)
Observed – Sports     FP (False Positive)   TP (True Positive)
Accuracy
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TN = 990 FN = 100
Observed – Politics FP = 10 TP = 900
Point of View - Politics
= ?
Accuracy
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TN = 990 FN = 100
Observed – Politics FP = 10 TP = 900
Point of View - Politics
= 94.5%
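The 94.5% figure can be checked with a short Python sketch that plugs the four counts from the table into the accuracy formula. The function name accuracy is an assumption for this example.

```python
# Counts from the table above, point of view = Politics.
tp = 900  # Politics documents classified as Politics
fn = 100  # Politics documents classified as Sports
fp = 10   # Sports documents classified as Politics
tn = 990  # Sports documents classified as Sports

def accuracy(tp, tn, fp, fn):
    """Correct answers divided by the total number of documents."""
    return (tp + tn) / (tp + tn + fp + fn)

result = accuracy(tp, tn, fp, fn)
print(f"{result:.1%}")
```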
Classification Quality Metric - Recall
Recall = Politics documents that it found
/ Total number of Politics documents in the test data
Recall
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TN = 990 FN = 100
Observed – Politics FP = 10 TP = 900
Point of View - Politics
= ?
Recall
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TN = 990 FN = 100
Observed – Politics FP = 10 TP = 900
Point of View - Politics
= 90%
Recall
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TP = 990 FP = 100
Observed – Politics FN = 10 TN = 900
Point of View - Sports
= ?
Recall
Gold – Sports (1000) Gold – Politics (1000)
Observed – Sports TP = 990 FP = 100
Observed – Politics FN = 10 TN = 900
Point of View - Sports
= 99%
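Both recall figures come from the same table; only the point of view changes. A minimal Python sketch, with the function name recall chosen for this example:

```python
def recall(tp, fn):
    """Of all documents that truly belong to the class, how many were found."""
    return tp / (tp + fn)

# Point of view = Politics: 900 found, 100 missed.
recall_politics = recall(tp=900, fn=100)
# Point of view = Sports: 990 found, 10 missed.
recall_sports = recall(tp=990, fn=10)
print(recall_politics, recall_sports)
```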
Classification Quality Metric - Precision
Precision = Documents identified as Politics that really were Politics documents
/ Total number of documents the classifier identified as Politics
Precision
Point of View - Politics
= ?
Gold – Sports (1040) Gold – Politics (960)
Observed – Sports TN = 990 FN = 10
Observed – Politics FP = 50 TP = 950
Precision
Point of View - Politics
= 95%
Gold – Sports (1040) Gold – Politics (960)
Observed – Sports TN = 990 FN = 10
Observed – Politics FP = 50 TP = 950
Precision
Point of View - Sports
= ?
Gold – Sports (1040) Gold – Politics (960)
Observed – Sports TP = 990 FP = 10
Observed – Politics FN = 50 TN = 950
Precision
Point of View - Sports
= 99%
Gold – Sports (1040) Gold – Politics (960)
Observed – Sports TP = 990 FP = 10
Observed – Politics FN = 50 TN = 950
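The precision figures can be checked the same way. This minimal Python sketch uses the counts from the table above; the function name precision is an assumption for this example.

```python
def precision(tp, fp):
    """Of all documents the classifier assigned to the class, how many were right."""
    return tp / (tp + fp)

# Point of view = Politics: 950 correct out of 1000 labelled Politics.
precision_politics = precision(tp=950, fp=50)
# Point of view = Sports: 990 correct out of 1000 labelled Sports.
precision_sports = precision(tp=990, fp=10)
print(precision_politics, precision_sports)
```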
Are we done?
… kinda …
The more training you have, the better you will
get at text analysis … so keep learning!
How to Learn More?
Grab the UC Berkeley Natural Language
Processing Course’s slides. The course’s
name is CS 294.
Start reading the ACL conference’s
research papers (ACL = Association for
Computational Linguistics)
Now, are we done?
… kinda …
Have Fun!
… and I have an exercise for you …
Did you know
… that you can use a classifier …
… to do clustering?
The exercise is … think about it …
THE END
  • 16. Topic Classification Start by taking some samples of documents on politics and some samples of sports documents. We’re using really short documents so you can do all the calculations manually & see that this ML algorithm really works! Politics: “The United Nations”, “The United States and”. Sports: “Manchester United”, “Manchester and Barca”.
  • 17. Step 1: Learn Multinomial Probabilities Politics: “The United Nations”, “The United States and”. Sports: “Manchester United”, “Manchester and Barca”. P(The|Politics) = 2/7 P(United|Politics) = 2/7 P(Nations|Politics) = 1/7 P(States|Politics) = 1/7 P(and|Politics) = 1/7 P(Manchester|Sports) = 2/5 P(United|Sports) = 1/5 P(and|Sports) = 1/5 P(Barca|Sports) = 1/5 P(Politics) = 7/12 P(Sports) = 5/12
  • 18. One ML Algorithm Step 1 was easy!!! Are you ready for Step 2 ?
  • 19. Step 2: There’s no step 2 P(United|Politics) = 2/7 P(Nations|Politics) = 1/7 P(States|Politics) = 1/7 P(Manchester|Sports) = 2/5 P(United|Sports) = 1/5 P(Barca|Sports) = 1/5 P(Politics) = 7/12 P(Sports) = 5/12 P(The|Politics) = 2/7 P(and|Sports) = 1/5 P(and|Politics) = 1/7 This is a Naïve Bayesian Classifier!!!
  • 20. One ML Algorithm Let’s put the classifier to work: Let’s see if it can classify the following documents : 1. United Nations 2. Manchester United We are using deliberately short documents!
  • 21. Running the Topic Classifier United Nations P(Politics|United Nations) ∝ P(United|Politics)*P(Nations|Politics)*P(Politics) = (2/7)*(1/7)*(7/12) = 1/(7*6) P(Sports|United Nations) ∝ P(United|Sports)*P(Nations|Sports)*P(Sports) = (1/5)*(0)*(5/12) = 0
  • 22. Running the Topic Classifier United Nations P(Politics|United Nations) > P(Sports|United Nations) So, the classifier has returned the category POLITICS
  • 23. Running the Topic Classifier Manchester United P(Politics|Manchester United) ∝ P(Manchester|Politics)*P(United|Politics)*P(Politics) = (0)*(2/7)*(7/12) = 0 P(Sports|Manchester United) ∝ P(Manchester|Sports)*P(United|Sports)*P(Sports) = (2/5)*(1/5)*(5/12) = 2/(5*12)
  • 24. Running the Topic Classifier P(Sports|Manchester United) > P(Politics|Manchester United) So, the classifier has returned the category SPORTS Manchester United
  • 25. Topic Classification Manchester United Politics United Nations Sports We have successfully used an ML algorithm to tell us which document is about Politics and which is about Sports !!!
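The hand calculation on slides 17 to 24 can be reproduced in a few lines. This is a minimal sketch, not the author's code: the names `TRAINING`, `train`, `score` and `classify` are made up for illustration. It follows the slides in using word-count priors (7/12 and 5/12) and no smoothing, so a word never seen in a category gives that category a score of zero.

```python
from collections import Counter
from fractions import Fraction

# Training documents from slide 16, grouped by category.
TRAINING = {
    "Politics": ["The United Nations", "The United States and"],
    "Sports": ["Manchester United", "Manchester and Barca"],
}

def train(training):
    """Count words per category; no smoothing, to match the slides."""
    word_counts = {cat: Counter(w for doc in docs for w in doc.split())
                   for cat, docs in training.items()}
    totals = {cat: sum(c.values()) for cat, c in word_counts.items()}
    grand_total = sum(totals.values())
    # The slides use word-count priors: P(Politics) = 7/12, P(Sports) = 5/12.
    priors = {cat: Fraction(n, grand_total) for cat, n in totals.items()}
    return word_counts, totals, priors

def score(text, category, model):
    """Unnormalised score: P(category) times the product of P(word|category)."""
    word_counts, totals, priors = model
    result = priors[category]
    for word in text.split():
        result *= Fraction(word_counts[category][word], totals[category])
    return result

def classify(text, model):
    """Return the category with the highest score."""
    priors = model[2]
    return max(priors, key=lambda cat: score(text, cat, model))

model = train(TRAINING)
print(score("United Nations", "Politics", model))  # 1/42
print(classify("Manchester United", model))        # Sports
```

Exact fractions are used so the printed scores match the slide arithmetic; a practical classifier would normally add smoothing and work with log probabilities.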
  • 26. Can We Solve Other Problems? Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier.
  • 27. Can We Solve Other Problems? So, if we can represent a text analysis problem as a classification task, then we can solve it using ML. Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier.
  • 28. Can We Solve Other Problems? So, if we can represent a text analysis problem as a classification task, then we can solve it using ML. Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier. So, let us learn how to represent text analysis problems as classification tasks.
  • 29. Problem 1: Sentence Segmentation The problem of identifying the end-points of sentences is called sentence segmentation. It can be reduced to a classification problem.
  • 30. Sentence Segmentation Yes, Mr. Anurag. You need a D.L. to drive. Sentence segmentation Classify each ‘.’, ‘!’ and ‘?’ into: 1) sentence terminator and 2) not a sentence terminator.
  • 31. Sentence Segmentation Hurray! We have turned the problem of sentence segmentation into a classification problem. Once you have modeled a text analytics problem as an ML problem, there is one more step you need to perform to get a working solution.
  • 33. Features Which country’s flag is this? Features are the clues you need for good decision making.
  • 34. Features Which country’s flag is this? One important feature for solving this decision problem is colour.
  • 35. Features for Topic Classification P(United|Politics) = 2/7 P(Nations|Politics) = 1/7 P(States|Politics) = 1/7 P(Manchester|Sports) = 2/5 P(United|Sports) = 1/5 P(Barca|Sports) = 1/5 P(Politics) = 7/12 P(Sports) = 5/12 P(The|Politics) = 2/7 P(and|Sports) = 1/5 P(and|Politics) = 1/7 For topic classification, the features are the individual words in the text.
  • 36. Sentence Segmentation What are the features you would use? Yes, Mr. Anurag. You need a D.L. to drive. That’s when you start reading the research papers!!!
  • 37. Sentence Segmentation What are the features you would use? Is_This_Character_a_Dot Is_This_Within_Quotes Is_Next_Letter_Capitalized Is_Prev_Letter_Capitalized Number_of_Words_in_Sentence_so_Far Is_Next_Word_a_Name Yes, Mr. Anurag. You need a D.L. to drive.
  • 38. Sentence Segmentation Train an NB classifier on some text whose characters are marked as sentence terminators or not. It will learn to assign characters to the categories sentence terminator and not a sentence terminator.
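A few of the features from slide 37 can be computed as below. This is a sketch under assumptions: the function names are hypothetical, and the word count is taken over the whole text so far, a simplification of Number_of_Words_in_Sentence_so_Far. The feature dictionaries would be the input to a classifier; no classifier is trained here.

```python
def terminator_features(text, i):
    """Features for the candidate character text[i] (one of '.', '!', '?')."""
    prev_char = text[i - 1] if i > 0 else ""
    # Skip whitespace to find the first character of the next word.
    rest = text[i + 1:].lstrip()
    next_char = rest[0] if rest else ""
    return {
        "is_this_character_a_dot": text[i] == ".",
        "is_next_letter_capitalized": next_char.isupper(),
        "is_prev_letter_capitalized": prev_char.isupper(),
        # Simplification: counts words in the text so far, not the sentence.
        "number_of_words_so_far": len(text[:i].split()),
    }

def candidates(text):
    """Return (position, features) for every possible sentence terminator."""
    return [(i, terminator_features(text, i))
            for i, ch in enumerate(text) if ch in ".!?"]

text = "Yes, Mr. Anurag. You need a D.L. to drive."
for i, feats in candidates(text):
    print(i, repr(text[i]), feats)
```

On this sentence the dots after "Mr" and after "Anurag" get the same two capitalization features, which is why clues such as Is_Next_Word_a_Name are also on the slide's list.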
  • 39. Problem 2: Tokenization Now, get a D.L., Mr. Anurag. For almost all analyses, you have to identify the individual words in the text. How can you break a sentence into words? Now , get a D.L. , Mr. Anurag .
  • 40. Tokenization Yes, Mr. Anurag. You need a D.L. to drive. Tokenization Classify each ‘.’, ‘!’, ‘,’, ‘ ’, ‘?’ etc. into: 1) word terminator and 2) not a word terminator.
  • 41. Tokenization or Word Segmentation We have turned the problem of word segmentation into a classification problem. Train a classifier on text whose characters are marked as word terminators or not. It will learn to assign characters to the categories word terminator and not a word terminator.
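Once each character has a label, turning the labels into tokens is mechanical. In the sketch below, `is_word_terminator` is a hypothetical rule-based stand-in for the trained classifier, with a hard-coded abbreviation list chosen only to fit the example sentence; `tokenize` works with any function that has the same signature.

```python
ABBREVIATIONS = {"D.L.", "Mr."}  # hypothetical stand-in for a trained model

def is_word_terminator(text, i):
    """Stand-in classifier: decides whether text[i] ends a word."""
    ch = text[i]
    if ch == " " or ch == ",":
        return True
    if ch in ".!?":
        # Look at the whitespace-delimited chunk containing this character.
        start = text.rfind(" ", 0, i) + 1
        end = text.find(" ", i)
        chunk = text[start:end if end != -1 else len(text)]
        return chunk.rstrip(",") not in ABBREVIATIONS
    return False

def tokenize(text, classify=is_word_terminator):
    """Split text into tokens using a per-character classifier."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if classify(text, i):
            if current:
                tokens.append(current)
            if ch != " ":
                tokens.append(ch)  # punctuation becomes its own token
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("Now, get a D.L., Mr. Anurag."))
```

The output corresponds to the tokenized sentence on slide 39: Now , get a D.L. , Mr. Anurag .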
  • 42. Problem 3: Part of Speech (POS) Tagging Anurag needs a D.L. Anurag/NNP needs/VBZ a/DT D.L./NN How do you tag the words in a sentence?
  • 43. Problem 3: Part of Speech (POS) Tagging Anurag needs a D.L. Anurag/NNP needs/VBZ a/DT D.L./NN POS Tagging Run from the first to the last word in the sentence, classifying each word in the sentence into a part-of-speech category.
  • 44. We have turned the problem of POS Tagging into a Classification problem where you label each word as: 1. Noun 2. Verb 3. Adjective 4. Adverb 5. Interjection 6. Conjunction (and, or, if, neither … nor) 7. Pronoun (I, it, you, them) 8. Preposition (in, of, out) Problem 3: Part of Speech (POS) Tagging
  • 45. Problem 4: Named Entity Recognition Anurag is looking for a Hyundai car in Bangalore Anurag = Person Hyundai car = Vehicle Bangalore = Place How do you extract the named entities in a sentence using a classifier?
  • 46. Problem 4: Named Entity Recognition Anurag is looking for a Hyundai car in Bangalore Anurag = Person Hyundai car = Vehicle Bangalore = Place Anurag/Person is/Other looking/Other for/Other a/Other Hyundai/Vehicle car/Vehicle in/Other Bangalore/Place
  • 47. We have turned the problem of Named Entity Recognition (NER) into a Tagging problem where you label each word as: 1. Person 2. Vehicle 3. Place Problem 4: Named Entity Recognition
  • 48. But we know that we can turn the problem of Tagging into a Classification problem over the same labels: 1. Person 2. Vehicle 3. Place Problem 4: Named Entity Recognition
  • 49. Problem 4: Named Entity Recognition Anurag is looking for a Hyundai car in Bangalore Anurag/Person is/Other looking/Other for/Other a/Other Hyundai/Vehicle car/Vehicle in/Other Bangalore/Place Named Entity Recognition Run from the first to the last word in the sentence, classifying each word in the sentence into a Named Entity Recognition category.
  • 50. Hey We Have One More ML Tool We just built one more very useful ML tool! The Extractor 
  • 51. Extraction What is an Extractor? Something that performs extraction. Extraction = recognizing Extraction = finding Extraction = locating Extraction = Finding = Locating
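An extractor in this sense can be sketched as a per-word classifier followed by a grouping step. The lookup table `TOY_LABELS` below is a hypothetical stand-in for a trained classifier, and all function names are illustrative.

```python
from itertools import groupby

# Hypothetical stand-in for a trained per-word classifier.
TOY_LABELS = {"Anurag": "Person", "Hyundai": "Vehicle",
              "car": "Vehicle", "Bangalore": "Place"}

def classify_word(words, i):
    """Return a label for words[i]; a real classifier would use features."""
    return TOY_LABELS.get(words[i], "Other")

def tag(sentence):
    """Run from the first to the last word, classifying each one."""
    words = sentence.split()
    return [(w, classify_word(words, i)) for i, w in enumerate(words)]

def extract(tagged):
    """Merge runs of adjacent words that share a non-Other label."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != "Other":
            entities.append((" ".join(word for word, _ in run), label))
    return entities

tagged = tag("Anurag is looking for a Hyundai car in Bangalore")
print(extract(tagged))
```

Merging adjacent words with the same label is the simplest possible grouping rule; it would wrongly fuse two different entities of the same type that happen to sit next to each other.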
  • 52. Problem 5: Relation Extraction Tim Cook is the new CEO of Apple Computers Relation: CEO_of Tim Cook (Person) Apple Computers (Org) How do you identify relations between entities in a sentence using a classifier?
  • 53. Extracting Meaning Text: Tim Cook is the new CEO of Apple Computers Analysis: Tim/Person Cook/Person is the new CEO of Apple/Org Computers/Org Relation Extraction Step 1:
  • 54. Extracting Meaning Text: Tim Cook is the new CEO of Apple Computers Analysis: Tim/Person Cook/Person is the new CEO of Apple/Org Computers/Org Relation Extraction Step 1: Step 2: Relation Extraction {Tim/Person Cook/Person, Apple/Org Computers/Org} => CEO_of
  • 55. Extracting Meaning Text: Tim Cook is the new CEO of Apple Computers Analysis: Tim/Person Cook/Person is the new CEO of Apple/Org Computers/Org Relation Extraction Step 1: Step 2: Relation Extraction Run through all pairs of named entities, classifying each pair into CEO_of or Other.
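Step 2 can be sketched as a loop over ordered pairs of entities. The function `classify_pair` is a hypothetical rule-based stand-in for a trained pair classifier (it only checks for the words "CEO of" between a Person and an Org), and the entity list is assumed to come from Step 1.

```python
from itertools import permutations

def classify_pair(text, first, second):
    """Hypothetical stand-in for a trained pair classifier."""
    (name1, type1), (name2, type2) = first, second
    if type1 == "Person" and type2 == "Org":
        between = text[text.find(name1) + len(name1):text.find(name2)]
        if "CEO of" in between:
            return "CEO_of"
    return "Other"

def extract_relations(text, entities):
    """Classify every ordered pair of entities as CEO_of or Other."""
    relations = []
    for first, second in permutations(entities, 2):
        label = classify_pair(text, first, second)
        if label != "Other":
            relations.append((label, first[0], second[0]))
    return relations

text = "Tim Cook is the new CEO of Apple Computers"
entities = [("Tim Cook", "Person"), ("Apple Computers", "Org")]
print(extract_relations(text, entities))
```

Ordered pairs are used because CEO_of is directional: (Tim Cook, Apple Computers) is a candidate, (Apple Computers, Tim Cook) is not.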
  • 56. But look at what we’re extracting! Do you realize what we’re extracting? Meaning !!! 
  • 57. Extracting Meaning Text: I am looking for a Hyundai car in B’lore Syntactic Analysis: I/Pronoun am/BE looking/V for a/DT Hyundai/NP car/NN in/PP Bangalore/NP Semantic Analysis: I/Person am looking for a Hyundai/Vehicle car/Vehicle in Bangalore/Place
  • 58. But is this comprehensive? Are you telling me that this is all I have to do to extract every possible sort of simple meaning? Yeah!!!  All utterances fall into two main categories …
  • 59. Intentional: I want to buy a computer. Information: There was heavy snowfall in Sikkim. Two Kinds of Utterances 360 Degree Text Analysis
  • 60. 360 Degree Text Analysis How do you deal with intentional utterances? I want to buy a computer.
  • 61. Intentional Utterance Raw Text: Are you sad that Steve Jobs died? Analysis: This person is inquiring about someone’s emotions concerning Steve Jobs Intention Analysis Intention Holder: I Intention: inquire
  • 62. Subjective Utterance Raw Text: I am sad that Steve Jobs died Analysis: This person holds a positive opinion on Steve Jobs Sentiment Analysis Sentiment Holder: I Object of Sentiment: Steve Jobs Polarity of Sentiment: positive
  • 63. 360 Degree Text Analysis How do you deal with informational utterances? There was heavy snowfall in Sikkim.
  • 64. Approaches to Extracting Meaning Raw Text: There is heavy snowfall in Sikkim. Analysis: Snowfall event Event Analysis Event: snowfall
  • 65. Approaches to Extracting Meaning Raw Text: Bangalore is the capital of K’taka Analysis: capital_of relation exists Fact Analysis Entity: Bangalore/Place Karnataka/Place Relation: Bangalore capital_of K’taka
  • 66. Extracting Meaning Text: I am looking for a Hyundai car in Bangalore Semantic Analysis: I/Person am looking for a Hyundai/Thing car/Thing in Bangalore/Place Entity Extraction Entities: I Person Hyundai car Thing Bangalore Place Fact Analysis
  • 67. Extracting Meaning Text: Tim Cook is the new CEO of Apple Computers Analysis: Tim/Person Cook/Person is the new CEO of Apple/Org Computers/Org Relation Extraction Relation: CEO_of Tim Cook (Person) Apple Computers(Org) Fact Analysis
  • 68. So, I can do anything (badly?) I got it! I got it! I can do anything. But how do I know how well I am doing it ? 
  • 69. Measurement Measurement: This measurement thing is very important in any design process, because …
  • 70. Measurement How do you measure the performance of a classifier? Measurement: … it lets you compare two designs and decide which is better.
  • 71. Before we go on … … promise me … … that you will never forget what I am about to tell you … 
  • 72. Measurement Break up the data points into training and test parts (usually an 80:20 split) and never test on your training data.
  • 73. Measurement Why not test on the training data? Break up the data points into training and test – usually an 80:20 split. Train on 80% Test on the remaining 20%
  • 74. Measurement Why not develop on the test data? Or if you are still developing features, break up the data points into training, development and test – usually a 70:10:20 split. Train on 70% Develop on 10% Test on the remaining 20%
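The 70:10:20 split can be sketched as follows. The function name, the percentage parameters and the fixed seed are illustrative choices, not something the slides prescribe.

```python
import random

def split_data(examples, train_pct=70, dev_pct=10, seed=0):
    """Shuffle, then cut into training, development and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])  # the remainder is the test part

train_set, dev_set, test_set = split_data(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 70 10 20
```

Because the three parts are slices of one shuffled list, no example can end up in more than one of them.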
  • 75. That applies to college as well … … you won’t get accurate measurements … … if you test students using questions that appeared in the question bank! … and you will be encouraging students to learn things by rote!
  • 76. Classification Quality Metric - Accuracy Accuracy = Correct Answers / Total Number of Questions
  • 77. Classification Quality Metric - Accuracy If your categories are Politics and Sports: Accuracy = (Politics Documents classified as Politics + Sports Documents classified as Sports) / Total Number of Documents
  • 78. Classification Quality Metric - Accuracy Point of View = Sports = (+ve). Observed - Politics: Gold - Politics TN (True Negative), Gold - Sports FN (False Negative). Observed - Sports: Gold - Politics FP (False Positive), Gold - Sports TP (True Positive).
  • 79. Accuracy Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TN = 990 FN = 100 Observed – Politics FP = 10 TP = 900 Point of View - Politics = ?
  • 80. Accuracy Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TN = 990 FN = 100 Observed – Politics FP = 10 TP = 900 Point of View - Politics = (TP + TN) / Total = (900 + 990) / 2000 = 94.5%
  • 81. Classification Quality Metric - Recall Recall = Number of Politics documents it found / Total number of Politics documents in the test data
  • 82. Recall Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TN = 990 FN = 100 Observed – Politics FP = 10 TP = 900 Point of View - Politics = ?
  • 83. Recall Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TN = 990 FN = 100 Observed – Politics FP = 10 TP = 900 Point of View - Politics = TP / (TP + FN) = 900 / (900 + 100) = 90%
  • 84. Recall Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TP = 990 FP = 100 Observed – Politics FN = 10 TN = 900 Point of View - Sports = ?
  • 85. Recall Gold – Sports (1000) Gold – Politics (1000) Observed – Sports TP = 990 FP = 100 Observed – Politics FN = 10 TN = 900 Point of View - Sports = TP / (TP + FN) = 990 / (990 + 10) = 99%
  • 86. Classification Quality Metric - Precision Precision = Number that were really Politics documents / Total number of documents the classifier identified as Politics
  • 87. Precision Point of View - Politics = ? Gold – Sports (1040) Gold – Politics (960) Observed – Sports TN = 990 FN = 10 Observed – Politics FP = 50 TP = 950
  • 88. Precision Point of View - Politics = TP / (TP + FP) = 950 / (950 + 50) = 95% Gold – Sports (1040) Gold – Politics (960) Observed – Sports TN = 990 FN = 10 Observed – Politics FP = 50 TP = 950
  • 89. Precision Point of View - Sports = ? Gold – Sports (1040) Gold – Politics (960) Observed – Sports TP = 990 FP = 10 Observed – Politics FN = 50 TN = 950
  • 90. Precision Point of View - Sports = TP / (TP + FP) = 990 / (990 + 10) = 99% Gold – Sports (1040) Gold – Politics (960) Observed – Sports TP = 990 FP = 10 Observed – Politics FN = 50 TN = 950
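The three metrics reduce to short formulas over the four confusion-matrix counts. The sketch below uses illustrative function names and plugs in the counts from the slides, taking Politics as the positive class.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all documents that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Share of the truly positive documents that the classifier found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of the documents labelled positive that really are positive."""
    return tp / (tp + fp)

# Slides 79 to 83, point of view Politics: TP=900, TN=990, FP=10, FN=100.
print(accuracy(tp=900, tn=990, fp=10, fn=100))  # 0.945
print(recall(tp=900, fn=100))                   # 0.9
# Slides 87 and 88, point of view Politics: TP=950, FP=50.
print(precision(tp=950, fp=50))                 # 0.95
```

Switching the point of view to Sports only swaps which counts play the roles of TP, FP, FN and TN; the formulas stay the same.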
  • 91. Are we done? … kinda … The more training you have, the better you will get at text analysis … so keep learning!
  • 92. How to Learn More? Grab the UC Berkeley Natural Language Processing Course’s slides. The course’s name is CS 294. Start reading the ACL conference’s research papers (ACL = Association for Computational Linguistics)
  • 93. Now, are we done? … kinda … Have Fun! … and I have an exercise for you …
  • 94. Did you know … that you can use a classifier … … to do clustering? The exercise is … think about it …
  • 95. THE END Fun With Text Hacking Text Analytics Cohan Sujay Carlos Aiaioo Labs Bangalore The bottom