This workshop will get you started with analyzing text data: discovering patterns and the best ways to convert unstructured data into structured data. We will also build a quick classification model and cover techniques to improve model performance. Towards the end we will run a quick sentiment analysis on a text corpus and discuss next steps for improving model accuracy. Please come prepared with a working laptop with Jupyter Notebook and Python 2.7. Participants are encouraged to have a minimum working knowledge of supervised models.
2. Let's look at some text
1. I love movies
2. I love icecream
3. I don’t like anything
4. I am not going to tell you anything
5. What are you guys doing
6. Where are you all going with it
7. I love her
8. doggie
Now ask the question: what do you love?
6. TF-IDF
• TF: Term Frequency, which measures how frequently a term occurs in a document:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
• IDF: Inverse Document Frequency, which measures how important a term is:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
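The two formulas above can be computed directly; a minimal sketch on a toy corpus (the `tf`/`idf` function names and documents are our own):

```python
import math

# Toy corpus of three documents
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

def tf(term, doc):
    # (times term appears in doc) / (total terms in doc)
    words = doc.split()
    return words.count(term) / float(len(words))

def idf(term, docs):
    # log_e(total documents / documents containing term)
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / float(n_containing))

print(tf("the", docs[0]))                     # 2 of 6 words -> 0.333...
print(idf("sat", docs))                       # log(3/2) -> 0.405...
print(tf("sat", docs[0]) * idf("sat", docs))  # tf-idf weight of "sat" in doc 0
```

Note how "the", which appears in most documents, gets an IDF near zero, while rarer terms are weighted up.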
7. Tf-idf for our dataset
• 8 × 22 matrix (8 records × 22 unique words; 34 words in total)
Vocabulary (22 words): u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
I love movies: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I love icecream: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I don't like anything: 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I am not going to tell you anything: 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
What are you guys doing: 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
Where are you all going with it: 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
8. Unigrams, Bi-grams and Tri-grams
• I love movies
-- bigrams: I love, love movies
In our dataset,
[u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you',
u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going',
u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream',
u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not',
u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you',
u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it',
u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
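A list like the one above can be reproduced with scikit-learn's `CountVectorizer` and an `ngram_range`; a minimal sketch on a single sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts unigrams and bigrams; the default
# tokenizer lowercases and drops one-character tokens such as "I"
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["I love movies"])
print(sorted(vec.vocabulary_))  # ['love', 'love movies', 'movies']
```

Widening the range to `(1, 3)` on the full dataset yields the unigram/bigram/trigram vocabulary shown above.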
9. Python code to generate tf-idf matrix
Input dataset (List of strings)-
[u'I love movies', u'I love icecream', u"I don't like anything", u'I am not going to tell you anything', u'What are you guys doing', u'Where are you all going with it', u'I love her', u'doggie']
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)  # tt1 is the input list above
tf1 = tfidf_matrix.todense()
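To sanity-check against the 8 × 22 matrix on slide 7, restrict `ngram_range` to unigrams and inspect the shape (this sketch spells out the input list explicitly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tt1 = [u'I love movies', u'I love icecream', u"I don't like anything",
       u'I am not going to tell you anything', u'What are you guys doing',
       u'Where are you all going with it', u'I love her', u'doggie']

# Unigrams only should reproduce the 8 x 22 matrix from slide 7
tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
print(tfidf_matrix.shape)  # (8, 22)
```

"I" and the "t" in "don't" are dropped by the default tokenizer (single-character tokens), which is how 34 words reduce to 22 unique features.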
11. Classifying text - Methods
• Supervised classification:
– Requires labelled data
– Classification algorithms: SVM, Logistic Regression, Ensembles, Random Forest, etc.
– Accuracy can be measured precisely
– Needed for highly actionable applications
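A minimal sketch of the supervised route: tf-idf features plus one of the algorithms above (the tiny labelled set here is hypothetical, in the spirit of the airline examples later in the deck):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled examples: intent classification
texts = ["check in online", "get me to check in", "internet check in",
         "how much baggage is free", "money for luggage", "carrying bags"]
labels = ["check in", "check in", "check in",
          "baggage", "baggage", "baggage"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)   # labelled text -> tf-idf features
clf = LogisticRegression()
clf.fit(X, labels)             # supervised training

# Classify a new utterance
print(clf.predict(vec.transform(["too much luggage"])))
```

With labels available, accuracy, precision and recall can all be measured directly on a held-out set.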
12. Classifying text - Methods
• Unsupervised
- No labels required
- Accuracy is a ‘loose’ measure
- Measuring homogeneity of clusters
- Useful for quick insights or where grouping is required
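A minimal unsupervised sketch: cluster tf-idf vectors with k-means (the documents and the cluster count here are our own toy choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["I love movies", "I love icecream", "love her movies",
         "how much baggage", "free baggage allowance", "baggage to carry"]

X = TfidfVectorizer().fit_transform(texts)
km = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X)
print(cluster_ids)  # one cluster id per document, no labels needed
```

No labels are required; evaluating the result means inspecting how homogeneous each cluster is rather than computing a precise accuracy.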
13. Classifying text - Methods
• Semi-supervised learning is a class of
supervised learning tasks and techniques that
also make use of unlabeled data for training -
typically a small amount of labeled data with a
large amount of unlabeled data.
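scikit-learn implements this idea in `sklearn.semi_supervised`; a toy sketch with `LabelPropagation`, where `-1` marks the unlabeled points (the data here is our own synthetic example):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two well-separated groups; only two points per group are labelled
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]])
y = np.array([0, -1, -1, 0, 1, -1, -1, 1])  # -1 = unlabeled

model = LabelPropagation(kernel='knn', n_neighbors=2)
model.fit(X, y)
print(model.transduction_)  # inferred labels for every point
```

The small labelled set anchors the classes and the labels propagate through the unlabeled neighbours.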
15. Let's look at some text
line  text  class
20 get me to check in check in
21 check in internet check in
22 what is free baggage allowance baggage
23 how much baggage baggage
24 I have 35 kg should I pay baggage
25 how much can I carry baggage
26 lots of bags I have baggage
27 till how much baggage is free baggage
28 how many bags are free baggage
29 upto what weight I can carry baggage
30 how much can I carry baggage
31 baggage carry baggage
32 baggage to carry baggage
33 number of bags baggage
34 carrying bags baggage
35 travelling with bags baggage
36 money for luggage baggage
37 how much luggage I can carry baggage
38 too much luggage baggage
17. Preprocess the data
• Map words that mean the same thing to a single group name (e.g., different place names can be replaced with one group token)
• Use regex to normalize dates, dollar values, etc.
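The second bullet can be sketched with a couple of regular expressions (the placeholder tokens `_dollar_` and `_date_` are our own choice):

```python
import re

def normalize(text):
    # Replace dollar amounts with a placeholder token
    text = re.sub(r'\$\d+(?:\.\d+)?', '_dollar_', text)
    # Replace simple dd/mm/yyyy-style dates with a placeholder token
    text = re.sub(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', '_date_', text)
    return text

print(normalize("I paid $35.50 for extra baggage on 12/01/2017"))
# I paid _dollar_ for extra baggage on _date_
```

Collapsing every distinct amount and date into one token keeps the vocabulary small and lets the model generalise across specific values.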
19. Stemming
• Stemming is the process of reducing a word
into its stem, i.e. its root form. The root form
is not necessarily a word by itself, but it can be
used to generate words by concatenating the
right suffix.
20. Stemmed words
fish, fishes and fishing --- fish
study, studies and studying --- studi
Difference between stemming and lemmatization:
stemming: may produce non-words (e.g. studi)
lemmatization: produces real dictionary words
21. Stemming and Lemmatizing
Code
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("having")  # u'have'
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem("maximum")  # u'maxim'
23. Sampling – Train and Validation
from sklearn.cross_validation import StratifiedShuffleSplit
# note: in newer scikit-learn this lives in sklearn.model_selection
sss = StratifiedShuffleSplit(tgt3, 1, test_size=0.2, random_state=42)  # tgt3 holds the class labels
for train_index, test_index in sss:
    # print("TRAIN:", train_index, "TEST:", test_index)
    a_train_b, a_test_b = tf1[train_index], tf1[test_index]
    b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]
24. Generate features or word tokens and vectorize
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.todense()