A detailed document covering the definition of text mining along with its challenges, modeling techniques, word clouds, and much more.
Thanks for your time. If you enjoyed this short video, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
2. What is Text mining?
Text mining is the process of exploring and analyzing large amounts
of unstructured data that can be used to identify concepts, patterns,
topics, keywords and other attributes.
Common challenges of text mining:
• Each word and phrase can represent one of a very high number of possible dimensions.
• Data are in unstructured form, unlike other data mining techniques where data are found in a structured, tabular format.
• Words are not even statistically independent of each other.
• Ambiguity: "the quality of being open to more than one interpretation; inexactness."
Rupak Roy
3. Text mining applications
• Customer Relationship management (CRM)
• Market Analysis
• NLP (natural language processing)
• Personalization in E-Commerce
• Natural language processing (NLP) is a field of AI and a component of
text mining that performs linguistic analysis, essentially helping
machines understand and analyze the languages that humans are
naturally good at. NLP uses a variety of methodologies to decipher the
ambiguities in human language, such as automatic summarization,
part-of-speech tagging, entity extraction and relation extraction, as
well as disambiguation and natural language understanding and recognition.
4. Modeling Techniques
• Supervised Learning
• Unsupervised learning
Supervised learning: where we use labeled data to train our model to
classify new data; i.e., in supervised learning we direct (train) our
ML model using labeled data.
For example, sentiment analysis using classification methods like SVM.
Unsupervised learning: the opposite of supervised learning. It doesn't
require labeled data to train the model and validate it over test data;
instead it uses the available unlabeled data to develop a model that
discovers structure in the data.
For example: clustering, topic modeling.
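As a minimal sketch of the supervised case, the toy example below trains an SVM on a few hand-labeled sentences. The sentences, labels, and feature pipeline are invented for illustration, and the tm and e1071 packages are assumed to be installed:

```r
library(tm)
library(e1071)

# Hypothetical hand-labeled training sentences (invented for illustration)
texts  <- c("i love this product", "great service", "awful experience",
            "i hate the delays", "really good value", "terribly bad support")
labels <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"))

# Turn the texts into a document-term matrix (documents as rows)
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)))
x   <- as.matrix(dtm)

# Train a linear SVM on the labeled term counts
model <- svm(x, labels, kernel = "linear")

# Score new (or here, the training) documents
predict(model, x)
```

In practice the model would be validated on held-out test documents converted to the same term space.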
5. tm (text mining) package in R
tm is an R package (from CRAN) for preprocessing text data:
1. Remove unnecessary lines, then convert the text to a corpus (a structured
collection of text documents).
2. Then read and inspect the corpus to create a TDM (term document
matrix).
Corpus: a corpus or text corpus is a large and structured set of texts.
a) In a corpus we parse the data to extract words, remove
punctuation and extra spaces, and convert everything to lower case to
make it uniform.
b) Then remove words that have no meaning by themselves, like "was",
"as", "a", "it", etc., also called stop words.
c) Finally, apply stemming, which is the process of reducing
derived words to their word stem, base, or root form. E.g. Consult,
Consulting, Consultation, Consultants = Consult (same meaning)
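For instance, tm's stemDocument() (a wrapper around the Porter stemmer in the SnowballC package, assumed installed) reduces these derived words to a common stem:

```r
library(tm)

# Stem each derived word; per the Porter stemmer they share the stem "consult"
stemDocument(c("consult", "consulting", "consultation", "consultants"))

# The same step applied to a whole corpus during preprocessing
docs <- Corpus(VectorSource("consultants consulting on a consultation"))
docs <- tm_map(docs, stemDocument)
inspect(docs)
```
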
6. Term Document Matrix
Term Document Matrix (TDM) is a matrix that describes the frequency of
terms that occur in a collection of documents.
Rows = terms
Columns = documents

For example, take two short documents, D1 = "I like Data Science" and
D2 = "I hate Data Science":

Document Term Matrix (documents as rows):

        i   like   hate   data science
  D1    1    1      0         1
  D2    1    0      1         1

Term Document Matrix (terms as rows):

                 D1   D2
  i               1    1
  like            1    0
  hate            0    1
  data science    1    1

One of the functions widely used for cleaning the data (corpus), e.g.
removing whitespace, punctuation, and numbers, is the tm_map() function
from the tm package.
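The toy matrices above can be rebuilt with tm itself. Note that tm tokenizes single words, so "data" and "science" come out as two separate terms rather than one phrase, and the control argument is needed to keep one-letter words like "i" (tm drops words shorter than three characters by default):

```r
library(tm)

# The two toy documents from the tables above
docs <- Corpus(VectorSource(c("i like data science",
                              "i hate data science")))

# Terms as rows, documents as columns
tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))
as.matrix(tdm)

# Documents as rows, terms as columns
dtm <- DocumentTermMatrix(docs, control = list(wordLengths = c(1, Inf)))
as.matrix(dtm)
```
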
7. Term Document Matrix
Now what can we do with a Term Document Matrix (TDM)?
* We can easily find the frequent terms occurring in the documents, which
is helpful for understanding the keywords. For example, it is very
helpful for understanding Google search keywords.
* We can also find associations between words, i.e. which words are
correlated or similar, and how they are related to each other.
* Group the words that have the same or similar behavior using clustering
techniques.
* Sentiment analysis: the automated process of understanding an
opinion, whether negative, positive, or neutral, about a given subject from
written or spoken language, helping a business understand the social
sentiment of its brand, product, or service.
8. Example
#load the data
>star_wars_EPV<-read.csv("SW_EpisodeV.txt",h=TRUE,sep = " ")
>View(star_wars_EPV)
>str(star_wars_EPV)
>names(star_wars_EPV)
#Convert to a data frame (only the second column)
>dialogue<-data.frame(star_wars_EPV$dialogue)
#Renaming the column
>names(dialogue)<-"dialogue"
>str(dialogue)
9. Example
#data preprocessing using TM package
>library(tm)
#build text corpus
>dialogue.corpus<-Corpus(VectorSource(dialogue$dialogue))
>summary(dialogue.corpus)
>inspect(dialogue.corpus[1:5]) #Inspecting elements in Corpus
#clean the data
#Converting to lower case
>dialogue.corpus<-tm_map(dialogue.corpus,content_transformer(tolower))
#Removing extra white space
>dialogue.corpus<-tm_map(dialogue.corpus,stripWhitespace)
#Removing punctuations
>dialogue.corpus<-tm_map(dialogue.corpus,removePunctuation)
#Removing numbers
>dialogue.corpus<-tm_map(dialogue.corpus,removeNumbers)
10. Example
#Create a list of stop words, the words that have no meaning by themselves.
>my_stopwords<-c(stopwords('english'),'@','http*','url','www*')
#Remove the stop words
>dialogue.corpus<-
tm_map(dialogue.corpus,removeWords,my_stopwords)
#Build term document matrix
>dialogue.tdm<-TermDocumentMatrix(dialogue.corpus)
>dialogue.tdm
>dim(dialogue.tdm) #Dimensions of term document matrix
>inspect(dialogue.tdm[1:10,1:10])
#Remove sparse terms (words that occur infrequently)
#here 0.97 means drop terms that are absent from more than 97% of documents
>dialogue.imp<-removeSparseTerms(dialogue.tdm,0.97)
11. Example
#Finding words and their frequencies
>temp<-as.matrix(dialogue.imp) #convert the TDM to a plain matrix
>wordFreq<-data.frame(apply(temp, 1, sum))
>wordFreq<-data.frame(ST = row.names(wordFreq), Freq =
wordFreq[,1])
>head(wordFreq)
>wordFreq<-wordFreq[order(wordFreq$Freq, decreasing = T), ]
>View(wordFreq)
12. Example
##Basic Analysis
#Finding the most frequent terms/words
findFreqTerms(dialogue.tdm,10) #Occurring minimum of 10 times
findFreqTerms(dialogue.tdm,30) #Occurring minimum of 30 times
findFreqTerms(dialogue.tdm,50) #Occurring minimum of 50 times
findFreqTerms(dialogue.tdm,70) #Occurring minimum of 70 times
#Finding association between terms/words
findAssocs(dialogue.tdm,"dont",0.3)
findAssocs(dialogue.tdm,"get",0.2)
findAssocs(dialogue.tdm,"right",0.2)
findAssocs(dialogue.tdm,"will",0.3)
findAssocs(dialogue.tdm,"know",0.3)
findAssocs(dialogue.tdm,"good",0.3)
13. Building Word Cloud
#Visualization using WordCloud
>library("wordcloud")
>library("RColorBrewer")
#Word Cloud requires text corpus and not term document matrix
#How to choose colors?
?brewer.pal
display.brewer.all() #Gives you a chart
brewer.pal.info #Lists the available palettes and their groups
display.brewer.pal(8,"Dark2")
display.brewer.pal(8,"Purples")
display.brewer.pal(3,"Oranges")
set8<-brewer.pal(8,"Dark2")
14. Building Word Cloud
#plot the word cloud
wordcloud(dialogue.corpus,min.freq=10,
max.words=60,
random.order=T,colors=set8)
wordcloud(dialogue.corpus,min.freq=10,max.words=60,
random.order=T,
colors=set8,vfont=c("script","plain"))
15. Next
We will learn how to use regular expression tools to find and replace
text.