Getting Started with Text Mining
Mathangi Sri R
Let's look at some text
1. I love movies
2. I love icecream
3. I don’t like anything
4. I am not going to tell you anything
5. What are you guys doing
6. Where are you all going with it
7. I love her
8. doggie
When asked the question: what do you love?
The tokens:
['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', 'don’t', 'like',
'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything',
'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all',
'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
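The token list above can be reproduced with a plain whitespace split. This is a minimal sketch, not necessarily the tokenizer used for the slides; the names `sentences` and `tokenize` are chosen here for illustration.

```python
# The eight example sentences from the slide above.
sentences = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]


def tokenize(texts):
    """Split every sentence on whitespace and flatten into one token list."""
    return [token for text in texts for token in text.split()]


tokens = tokenize(sentences)
print(tokens)
print(len(tokens))  # 34 tokens in total
```

A whitespace split keeps case and punctuation as they are, which is why 'What' and 'don’t' appear unchanged in the list.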
Word frequency
[('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3),
('icecream', 1), ('I', 5), ('don’t', 1), ('like', 1),
('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going', 2),
('to', 1), ('tell', 1), ('you', 3), ('anything', 2),
('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing', 1),
('Where', 1), ('are', 2), ('you', 3), ('all', 1),
('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3),
('her', 1), ('doggie', 1)]
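One way to obtain these counts is `collections.Counter` from the standard library. The sketch below assumes whitespace tokens; the variable names are illustrative only.

```python
from collections import Counter

# All 34 whitespace-separated tokens of the example sentences, in order.
tokens = (
    "I love movies I love icecream I don’t like anything "
    "I am not going to tell you anything What are you guys doing "
    "Where are you all going with it I love her doggie"
).split()

counts = Counter(tokens)

# One (token, count) pair per token occurrence, mirroring the slide's layout.
word_frequency = [(token, counts[token]) for token in tokens]

print(word_frequency[:3])     # [('I', 5), ('love', 3), ('movies', 1)]
print(counts.most_common(3))  # [('I', 5), ('love', 3), ('you', 3)]
```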
Term Frequency
[('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09),
('icecream', 0.03), ('I', 0.15), ('don’t', 0.03), ('like', 0.03),
('anything', 0.06), ('I', 0.15), ('am', 0.03), ('not', 0.03), ('going', 0.06),
('to', 0.03), ('tell', 0.03), ('you', 0.09), ('anything', 0.06),
('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03), ('doing', 0.03),
('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03),
('going', 0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09),
('her', 0.03), ('doggie', 0.03)]
TF - IDF
• TF: Term Frequency, which measures how frequently a term occurs in a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
• IDF: Inverse Document Frequency, which measures how important a term is: the fewer documents contain the term, the higher its IDF.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
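The two formulas can be worked through on the eight example sentences. This is a sketch under stated assumptions: whitespace tokens, TF computed over all 34 tokens pooled together (as in the term-frequency list earlier), and IDF over the 8 sentences; the function names are illustrative.

```python
import math

documents = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]


def term_frequency(term, tokens):
    """TF(t) = occurrences of t / total number of tokens."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, docs):
    """IDF(t) = log_e(number of documents / documents containing t).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / containing)


all_tokens = [token for doc in documents for token in doc.split()]

tf_love = term_frequency("love", all_tokens)                     # 3 / 34
idf_love = inverse_document_frequency("love", documents)         # ln(8 / 3)
idf_doggie = inverse_document_frequency("doggie", documents)     # ln(8 / 1)

print(round(tf_love, 2), round(idf_love, 3), round(idf_doggie, 3))
```

A rare word such as 'doggie' gets a higher IDF than a common one such as 'love'. Note that scikit-learn's `TfidfVectorizer` by default smooths the IDF and L2-normalizes each row, so its output differs from these raw values.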
Tf-idf for our dataset
• 8 x 22 matrix (8 records x 22 unique words; 34 words in total)
Columns, in order:
u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
Rows:
I love movies: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I love icecream: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I don’t like anything: 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I am not going to tell you anything: 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
What are you guys doing: 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
Where are you all going with it: 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
Unigrams, Bi-grams and Tri-grams
• I love movies
-- Bi-grams: I love, love movies
In our dataset, the unigrams, bi-grams and tri-grams are:
[u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you',
u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going',
u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream',
u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not',
u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you',
u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it',
u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
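The vocabulary above can be rebuilt with a few lines of standard-library Python. This sketch assumes lowercasing and keeping only tokens of two or more word characters, which is what the listed terms suggest (no 'i', and 'don' instead of 'don’t'); the function name `word_ngrams` is illustrative.

```python
import re

sentences = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]


def word_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n found in the text."""
    # Assumption: lowercase, tokens of at least two word characters.
    tokens = re.findall(r"\w\w+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


vocabulary = sorted({gram for s in sentences for gram in word_ngrams(s)})

print(word_ngrams("I love movies"))  # ['love', 'movies', 'love movies']
print(len(vocabulary))               # 56
```

Because single-character tokens are dropped under this assumption, 'I love' does not appear as a bi-gram in the dataset vocabulary even though it is a valid bi-gram of the raw sentence.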
Python code to generate tf-idf matrix
Input dataset (list of strings):
tt1 = ['I love movies', 'I love icecream ', 'I don’t like anything', 'I am not going to tell you anything',
       'What are you guys doing', 'Where are you all going with it', 'I love her', 'doggie ']
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
# Word-level features covering 1- to 4-word n-grams, with no stop-word filtering
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer='word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
# toarray() gives a plain NumPy array; todense() gives np.matrix, which recent scikit-learn estimators reject
tf1 = tfidf_matrix.toarray()
Text Classification
Classifying text - Methods
• Supervised classification:
– Requires labelled data
– Classification algorithms – SVM, LR, Ensemble, RF, etc.
– Can measure accuracy precisely
– Needed for highly actionable applications
Classifying text - Methods
• Unsupervised
- No labels required
- Accuracy is a ‘loose’ measure
- Measuring homogeneity of clusters
- Useful for quick insights or where grouping is
required
Classifying text - Methods
• Semi-supervised learning is a class of
supervised learning tasks and techniques that
also make use of unlabeled data for training -
typically a small amount of labeled data with a
large amount of unlabeled data.
Supervised Learning – Case Study
Let's look at some text
line | text | class
20 | get me to check in | check in
21 | check in internet | check in
22 | what is free baggage allowance | baggage
23 | how much baggage | baggage
24 | I have 35 kg should I pay | baggage
25 | how much can I carry | baggage
26 | lots of bags I have | baggage
27 | till how much baggage is free | baggage
28 | how many bags are free | baggage
29 | upto what weight I can carry | baggage
30 | how much can I carry | baggage
31 | baggage carry | baggage
32 | baggage to carry | baggage
33 | number of bags | baggage
34 | carrying bags | baggage
35 | travelling with bags | baggage
36 | money for luggage | baggage
37 | how much luggage I can carry | baggage
38 | too much luggage | baggage
Class Distribution
[Bar chart: share of records per class, y-axis from 0% to 30%; classes: login, other, baggage, check in, greetings, thanks, cancel]
Preprocess the data
• Map equivalent words to a single word group (e.g. different place names can be replaced with one group name)
• Use regex to normalize dates, dollar values, etc.
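Both preprocessing steps can be sketched with the standard `re` module. The place list, the date and dollar patterns, and the placeholder tokens below are all assumptions made for illustration; a real project would tailor them to its own data.

```python
import re

# Hypothetical place names that should collapse into one group token.
PLACES = ("bangalore", "mumbai", "delhi")

PLACE_RE = re.compile(r"\b(?:%s)\b" % "|".join(PLACES), re.IGNORECASE)
# Dates such as 12/05/2018 or 1-2-18 (assumed formats).
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
# Dollar amounts such as $20, $1,250 or $1,250.50.
DOLLAR_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")


def normalize(text):
    """Replace dates, dollar values and place names with group tokens."""
    text = DATE_RE.sub("DATETOKEN", text)
    text = DOLLAR_RE.sub("MONEYTOKEN", text)
    text = PLACE_RE.sub("PLACETOKEN", text)
    return text


print(normalize("Flight to Mumbai on 12/05/2018 costs $1,250.50"))
# Flight to PLACETOKEN on DATETOKEN costs MONEYTOKEN
```

Collapsing many surface forms into one token reduces the vocabulary, so a classifier sees "a place" or "an amount" rather than hundreds of rare values.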
Stop Words
How do you generate stop words from a corpus?
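One possible answer is to treat words that occur in a large share of the documents as stop-word candidates. The sketch below is only one such heuristic; the 50% threshold and the function name `corpus_stop_words` are assumptions, and the sample lines are taken from the case study data.

```python
from collections import Counter


def corpus_stop_words(documents, min_doc_share=0.5):
    """Return words that appear in at least min_doc_share of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_share * len(documents)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)


docs = [
    "what is free baggage allowance",
    "how much baggage",
    "I have 35 kg should I pay",
    "how much can I carry",
    "till how much baggage is free",
    "how many bags are free",
]

print(corpus_stop_words(docs))  # ['baggage', 'free', 'how', 'much']
```

The output shows why the candidates need a manual review: 'how' and 'much' are reasonable stop words here, but 'baggage' is frequent precisely because it signals the class and should be kept.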
Stemming
• Stemming is the process of reducing a word
into its stem, i.e. its root form. The root form
is not necessarily a word by itself, but it can be
used to generate words by concatenating the
right suffix.
Stemmed words
fish, fishes and fishing --- fish
study, studies and studying --- studi
Difference between stemming and lemmatization:
Stemming – the result may not be a meaningful word
Lemmatization – the result is a meaningful word
Stemming and Lemmatizing
Code
from nltk.stem import PorterStemmer
# from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
ps.stem("having")  # Porter stemmer
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem("maximum")  # Lancaster stemmer, a more aggressive alternative
Spell checker
• https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst
• https://pypi.python.org/pypi/autocorrect/0.1.0
Sampling – Train and Validation
# sklearn.cross_validation was removed in scikit-learn 0.20; the class now lives in model_selection
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# tf1: tf-idf feature array; tgt3: NumPy array of class labels
for train_index, test_index in sss.split(tf1, tgt3):
    # print("TRAIN:", train_index, "TEST:", test_index)
    a_train_b, a_test_b = tf1[train_index], tf1[test_index]
    b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]
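Generate features or word tokens and vectorize
from sklearn.feature_extraction.text import TfidfVectorizer
# tt1: list of input strings
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer='word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
# toarray() gives a plain NumPy array; todense() gives np.matrix, which recent scikit-learn estimators reject
tf1 = tfidf_matrix.toarray()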
Feature Selection
from sklearn.feature_selection import SelectPercentile, f_classif
# percentile=100 keeps every feature; lower it to keep only the top-scoring ones
selector = SelectPercentile(f_classif, percentile=100)
# Fit on the training split only, then apply the same selection to the test split
a_train_b = selector.fit_transform(a_train_b, b_train_b)
a_test_b = selector.transform(a_test_b)
Build Model
• Logistic Regression
• Gradient Boosting Machine (GBM)
• Support Vector Machine (SVM)
• Random Forest (RF)
• Neural Nets
• Naive Bayes (NB)
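To make one of the listed options concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier trained on a few lines from the case study. It is not the model built for the slides: it uses raw word counts rather than the tf-idf features, add-one (Laplace) smoothing is an assumption, and the class name is illustrative. In practice a library implementation such as scikit-learn's `MultinomialNB` would normally be used.

```python
import math
from collections import Counter, defaultdict

# A few labelled lines from the case study table.
train = [
    ("get me to check in", "check in"),
    ("check in internet", "check in"),
    ("what is free baggage allowance", "baggage"),
    ("how much baggage", "baggage"),
    ("how much can I carry", "baggage"),
    ("number of bags", "baggage"),
    ("too much luggage", "baggage"),
]


class MultinomialNaiveBayes:
    """Word-count Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of every known word.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


texts, labels = zip(*train)
model = MultinomialNaiveBayes().fit(texts, labels)

print(model.predict("internet check in"))             # check in
print(model.predict("how much luggage can I carry"))  # baggage
```

Working in log space avoids multiplying many small probabilities, and the smoothing keeps a word that was never seen in one class from zeroing out that class entirely.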

➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 

Getting started with text mining By Mathangi Sri Head of Data Science at PhonePe at CYPHER 2018

  • 1. Getting Started with Text Mining Mathangi Sri R
  • 2. Let's look at some text: responses given when asked the question "What do you love?" 1. I love movies 2. I love icecream 3. I don’t like anything 4. I am not going to tell you anything 5. What are you guys doing 6. Where are you all going with it 7. I love her 8. doggie
  • 3. The tokens ['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', 'don’t', 'like', 'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything', 'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all', 'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
  • 4. Word frequency [('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3), ('icecream', 1), ('I', 5), ('don’t', 1), ('like', 1), ('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going', 2), ('to', 1), ('tell', 1), ('you', 3), ('anything', 2), ('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing', 1), ('Where', 1), ('are', 2), ('you', 3), ('all', 1), ('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3), ('her', 1), ('doggie', 1)]
  • 5. Term Frequency [('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09), ('icecream', 0.03), ('I', 0.15), ('don’t', 0.03), ('like', 0.03), ('anything', 0.06), ('I', 0.15), ('am', 0.03), ('not', 0.03), ('going', 0.06), ('to', 0.03), ('tell', 0.03), ('you', 0.09), ('anything', 0.06), ('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03), ('doing', 0.03), ('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03), ('going', 0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09), ('her', 0.03), ('doggie', 0.03)]
  • 6. TF-IDF • TF: Term Frequency, which measures how frequently a term occurs in a document: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). • IDF: Inverse Document Frequency, which measures how important a term is: IDF(t) = log_e(Total number of documents / Number of documents with term t in it). • The TF-IDF score of a term is the product TF(t) * IDF(t).
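The two formulas on this slide can be tried out directly on the eight example sentences. The following is a minimal sketch, assuming each sentence is one document and that tokens are obtained by lowercasing and splitting on whitespace; the helper names `tf`, `idf` and `tf_idf` are illustrative, not taken from the slides.

```python
import math

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]
# Assumption: lowercase + whitespace split is a good enough tokenizer here.
tokenized = [d.lower().split() for d in docs]

def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """log_e(total documents / documents containing `term`).
    Assumes the term occurs in at least one document."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Scores for the first document, "I love movies".
love_score = tf_idf("love", tokenized[0], tokenized)
movies_score = tf_idf("movies", tokenized[0], tokenized)
print(round(love_score, 3), round(movies_score, 3))
```

"movies" occurs in one document and "love" in three, so "movies" receives the higher score within the first sentence even though both words appear there once.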
  • 7. Tf-idf for our dataset • 8*22 matrix (8 records * 22 unique words; 34 words in total)
    Columns: u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
    Nonzero entries per record (every other entry in the row is 0.00):
    I love movies: love 0.59, movies 0.81
    I love icecream: icecream 0.81, love 0.59
    I don’t like anything: anything 0.51, don 0.61, like 0.61
    I am not going to tell you anything: am 0.41, anything 0.34, going 0.34, not 0.41, tell 0.41, to 0.41, you 0.30
    What are you guys doing: are 0.41, doing 0.49, guys 0.49, what 0.49, you 0.35
    Where are you all going with it: all 0.41, are 0.34, going 0.34, it 0.41, where 0.41, with 0.41, you 0.30
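The numbers in this matrix are not the plain TF * IDF products from the previous slide. They are consistent with scikit-learn's default `TfidfVectorizer` behaviour: single-character tokens such as "I" are dropped, the IDF is smoothed as ln((1 + N) / (1 + df)) + 1, and each row is scaled to unit L2 length. The pure-Python sketch below reproduces a few entries under those assumptions; it is an illustration of that weighting scheme, not the library's code, and the name `tfidf_row` is made up for the example.

```python
import math
import re
from collections import Counter

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

def tokenize(text):
    # Lowercase and keep only tokens of two or more word characters,
    # so "I" and the "t" of "don’t" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency and smoothed IDF for every vocabulary term.
df = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count * smoothed IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}

row1 = tfidf_row(tokenized[0])  # "I love movies"
row3 = tfidf_row(tokenized[2])  # "I don’t like anything"
print(len(vocab), round(row1["love"], 2), round(row1["movies"], 2))
# prints: 22 0.59 0.81
```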
  • 8. Unigrams, Bi-grams and Tri-grams • I love movies -- bi-grams: I love, love movies • In our dataset, [u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you', u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going', u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream', u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not', u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you', u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it', u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
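An n-gram is simply a run of n consecutive tokens, so the "I love movies" example can be generated with a sliding window. This is a small sketch with a hypothetical helper `ngrams`; it splits on whitespace and keeps "I", whereas the dataset list on the slide comes from a vectorizer that lowercases and drops single-character tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love movies".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(unigrams, bigrams, trigrams)
```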
  • 9. Python code to generate the tf-idf matrix
    Input dataset (list of strings):
    tt1 = [u'I love movies', u'I love icecream ', u'I don’t like anything', u'I am not going to tell you anything', u'What are you guys doing', u'Where are you all going with it', u'I love her', u'doggie ']
    Code:
    from sklearn.feature_extraction.text import TfidfVectorizer
    # word-level features, from unigrams up to 4-grams, with no stop-word removal
    tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
    tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)  # sparse matrix, one row per string
    tf1 = tfidf_matrix.toarray()  # dense NumPy array; recent scikit-learn versions reject the np.matrix returned by todense()
  • 11. Classifying text - Methods • Supervised classification: – Requires labelled data – Classification algorithms: SVM, LR, Ensemble, RF, etc. – Accuracy can be measured precisely – Needed for highly actionable applications
  • 12. Classifying text - Methods • Unsupervised - No labels required - Accuracy is a ‘loose’ measure - Measuring homogeneity of clusters - Useful for quick insights or where grouping is required
  • 13. Classifying text - Methods • Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data.
  • 15. Let's look at some text
    line | text | class
    20 | get me to check in | check in
    21 | check in internet | check in
    22 | what is free baggage allowance | baggage
    23 | how much baggage | baggage
    24 | I have 35 kg should I pay | baggage
    25 | how much can I carry | baggage
    26 | lots of bags I have | baggage
    27 | till how much baggage is free | baggage
    28 | how many bags are free | baggage
    29 | upto what weight I can carry | baggage
    30 | how much can I carry | baggage
    31 | baggage carry | baggage
    32 | baggage to carry | baggage
    33 | number of bags | baggage
    34 | carrying bags | baggage
    35 | travelling with bags | baggage
    36 | money for luggage | baggage
    37 | how much luggage I can carry | baggage
    38 | too much luggage | baggage
  • 16. Class Distribution [bar chart: share of records per class, axis from 0% to 30%; classes: login, other, baggage, check in, greetings, thanks, cancel]
  • 17. Preprocess the data • Map words of the same kind to a single word group (e.g. different place names can all be replaced by one group name) • Use regular expressions to normalize dates, dollar values, etc.
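Both preprocessing ideas can be sketched in a few lines. Everything specific below is an assumption made for illustration: the place list, the placeholder tokens `_place_`, `_date_` and `_amount_`, and the two regular expressions, which only cover simple numeric dates and dollar amounts.

```python
import re

# Hypothetical word group: any of these place names becomes "_place_".
PLACES = {"delhi", "mumbai", "bangalore"}

# Simple numeric dates such as 12/05/2018, 12-05-18 or 2018-05-12.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b")
# Dollar values such as $35, $1,200 or $1,200.50.
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")

def normalize(text):
    """Lowercase, then replace dates, dollar values and place names
    with one placeholder token each."""
    text = text.lower()
    text = DATE_RE.sub("_date_", text)
    text = AMOUNT_RE.sub("_amount_", text)
    tokens = ["_place_" if t in PLACES else t for t in text.split()]
    return " ".join(tokens)

print(normalize("Fly to Delhi on 12/05/2018 for $1,200.50"))
# prints: fly to _place_ on _date_ for _amount_
```

After this step, sentences that differ only in the city, date or amount share the same tokens, which keeps the vocabulary small.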
  • 18. Stop Words How do you generate stop words from a corpus?
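The slide leaves this as an open question. One common answer, shown here as a sketch and not as the speaker's method, is to treat words that occur in a large share of the documents as corpus-specific stop words. The function name `corpus_stop_words` and the 30% threshold are assumptions chosen for the small example corpus.

```python
from collections import Counter

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

def corpus_stop_words(docs, min_doc_fraction=0.3):
    """Words that appear in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a word is counted once per document
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if df / len(docs) >= min_doc_fraction}

stop_words = corpus_stop_words(docs)
print(sorted(stop_words))
# prints: ['i', 'love', 'you']
```

A frequency cut-off also flags "love", which carries the answer to the question asked in this dataset, so a generated list is best reviewed by hand before it is used.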
  • 19. Stemming • Stemming is the process of reducing a word into its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.
  • 20. Stemmed words • fish, fishes and fishing --- fish • study, studies and studying --- studi • Difference between stemming and lemmatization: stemming can produce meaningless words; lemmatization produces meaningful words
  • 21. Stemming and Lemmatizing Code
    from nltk.stem import PorterStemmer
    from nltk.stem.lancaster import LancasterStemmer
    # Porter stemmer
    ps = PorterStemmer()
    ps.stem("having")
    # Lancaster stemmer
    lancaster_stemmer = LancasterStemmer()
    lancaster_stemmer.stem("maximum")
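  • 23. Sampling – Train and Validation
    # sklearn.cross_validation was removed in scikit-learn 0.20; the class now lives in sklearn.model_selection
    from sklearn.model_selection import StratifiedShuffleSplit
    # one stratified split, holding out 20% of the rows for validation
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    # tf1 is the feature matrix, tgt3 the NumPy array of class labels
    for train_index, test_index in sss.split(tf1, tgt3):
        a_train_b, a_test_b = tf1[train_index], tf1[test_index]
        b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]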
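  • 24. Generate features or word tokens and vectorize
    from sklearn.feature_extraction.text import TfidfVectorizer
    # word-level features, from unigrams up to 4-grams, with no stop-word removal
    tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
    tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)  # tt1 is the list of input strings
    tf1 = tfidf_matrix.toarray()  # dense NumPy array; recent scikit-learn versions reject the np.matrix returned by todense()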
  • 25. Feature Selection
    from sklearn.feature_selection import SelectPercentile, f_classif
    # percentile=100 keeps every feature; lower it to keep only the top-scoring ones
    selector = SelectPercentile(f_classif, percentile=100)
    # fit on the training split only, then apply the same selection to the validation split
    a_train_b = selector.fit_transform(a_train_b, b_train_b)
    a_test_b = selector.transform(a_test_b)
  • 26. Build Model • Logistic Regression • GBM (gradient boosting) • SVM • RF (random forest) • Neural Nets • NB (naive Bayes)
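To make the modelling step concrete, here is a minimal sketch of one option from the list, a multinomial naive Bayes classifier with add-one smoothing, written in plain Python and trained on a handful of the "check in" and "baggage" rows from the earlier slide. It works on raw word counts, not the tf-idf features used in the deck, and the function names `train_nb` and `predict_nb` are illustrative. In practice a library implementation of any of the listed models would normally be used.

```python
import math
from collections import Counter, defaultdict

# A few labelled rows taken from the example table in the deck.
samples = [
    ("get me to check in", "check in"),
    ("check in internet", "check in"),
    ("what is free baggage allowance", "baggage"),
    ("how much baggage", "baggage"),
    ("how much can I carry", "baggage"),
    ("number of bags", "baggage"),
    ("too much luggage", "baggage"),
]

def train_nb(samples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    # Words never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in vocab]
    scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            # add-one (Laplace) smoothing over the vocabulary
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(samples)
print(predict_nb(model, "internet check in"))
print(predict_nb(model, "how much luggage can I carry"))
```

The first query is assigned to "check in" and the second to "baggage", because each shares most of its words with the training rows of that class.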