Introduction to the tm Package
What is Text Mining?
Text mining is the process of exploring and analyzing large amounts of unstructured text data in order to identify concepts, patterns, topics, keywords and other attributes.
Common challenges of text mining:
• High dimensionality: each distinct word or phrase can become its own dimension, so the number of possible dimensions is very large.
• The data are unstructured, whereas other data mining techniques usually work with data in a structured, tabular format.
• Words are not statistically independent of one another.
• Ambiguity: "the quality of being open to more than one interpretation; inexactness."
Rupak Roy
Text mining applications
• Customer relationship management (CRM)
• Market analysis
• NLP (natural language processing)
• Personalization in e-commerce
Natural language processing (NLP) is a field of AI and a component of text mining. It performs linguistic analysis that helps machines understand and analyze the languages that humans are naturally good at. NLP uses a variety of methodologies to decipher the ambiguities in human language, such as automatic summarization, part-of-speech tagging, entity extraction and relation extraction, as well as disambiguation and natural language understanding and recognition.
Modeling Techniques
• Supervised Learning
• Unsupervised Learning
Supervised Learning: we use labeled data to train our ML model, and the trained model is then used to classify new data.
For example: sentiment analysis using classification methods such as SVM.
Unsupervised Learning: the opposite of supervised learning. It does not require labeled data to train the model; instead it uses the available unlabeled data to discover structure, such as groups or topics, in the data.
For example: clustering, topic modeling.
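As a rough illustration of the unsupervised case, the sketch below groups four made-up documents by their word counts without using any labels. The counts, the document names and the choice of hierarchical clustering with two groups are assumptions made for this example only.

```r
# Hypothetical document-term counts: rows are documents, columns are terms
dtm <- rbind(
  D1 = c(good = 2, great = 1, bad = 0, awful = 0),
  D2 = c(good = 1, great = 1, bad = 0, awful = 0),
  D3 = c(good = 0, great = 0, bad = 1, awful = 1),
  D4 = c(good = 0, great = 0, bad = 2, awful = 1)
)

# Euclidean distance between documents, then hierarchical clustering
tree <- hclust(dist(dtm))

# Cut the tree into two groups; no labels were supplied at any point
groups <- cutree(tree, k = 2)
print(groups)
```

Documents that use similar words end up in the same group, which is the basic idea behind clustering text.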
tm (text mining) package in R
tm is an add-on R package (installed from CRAN, not part of base R) for preprocessing text data. A typical workflow:
1. Remove unnecessary lines, then convert the text to a corpus (a structured collection of text documents).
2. Read and inspect the corpus, then create a TDM (term document matrix).
Corpus: a corpus, or text corpus, is a large and structured set of texts.
a) In a corpus we parse the data to extract words, remove punctuation and extra spaces, and convert everything to one case so that the text is uniform.
b) Then we remove words that carry little meaning by themselves, such as "was", "as", "a" and "it". These are called stop words.
c) Finally we apply stemming, the process of reducing derived words to their word stem, base or root form. E.g. Consult, Consulting, Consultation, Consultants = Consult (same meaning).
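Steps a) to c) can be sketched in base R on a single sentence. This is a toy version only: the sentence, the tiny stop word list and the suffix-stripping rule are assumptions made for illustration, and the rule is far cruder than a real stemmer.

```r
text <- "The Consultants were consulting, as a consultation was needed!"

# a) lower-case the text, strip punctuation, split into words
clean <- gsub("[[:punct:]]", "", tolower(text))
words <- strsplit(clean, "\\s+")[[1]]

# b) drop stop words (hand-made list, assumed for this example)
stop_words <- c("the", "were", "as", "a", "was")
words <- words[!words %in% stop_words]

# c) crude suffix stripping standing in for a real stemming algorithm
stems <- sub("(ation|ants|ing)$", "", words)
print(stems)
```

The three "consult" variants collapse to the same stem, so they would be counted as one term later on.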
Term Document Matrix
A Term Document Matrix (TDM) is a matrix that describes the frequency of terms that occur in a collection of documents.
Rows = terms
Columns = documents
A Document Term Matrix (DTM) is its transpose: rows = documents, columns = terms.
One of the most widely used functions for cleaning the corpus (removing whitespace, punctuation, numbers and so on) is tm_map() from the tm package.
Rupak Roy's example with two documents, D1 and D2:

Document Term Matrix
      I   Like   Hate   Data Science
D1    1   1      0      1
D2    1   0      1      1

Term Document Matrix
               D1   D2
I              1    1
Like           1    0
Hate           0    1
Data Science   1    1
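The two matrices above can be rebuilt by hand in base R, which shows what a term document matrix actually stores. The two document strings are assumed from the table entries, and because this sketch splits on spaces, "data" and "science" become separate terms rather than the single "Data Science" entry used on the slide.

```r
# Two toy documents (wording assumed from the table above)
docs <- c(D1 = "i like data science", D2 = "i hate data science")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- unique(unlist(tokens, use.names = FALSE))

# Count how often each term occurs in each document: rows = terms, columns = documents
tdm <- sapply(tokens, function(tok) tabulate(match(tok, terms), nbins = length(terms)))
dimnames(tdm) <- list(Terms = terms, Docs = names(docs))
print(tdm)

# The document term matrix is simply the transpose
dtm <- t(tdm)
print(dtm)
```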
Term Document Matrix
What can we do with a Term Document Matrix (TDM)?
* Easily find the terms that occur most frequently in the documents, which helps to identify keywords. For example, this is very helpful for understanding Google search keywords.
* Find associations between words, i.e. which words are correlated with each other and how strongly.
* Group words that behave in the same or a similar way using clustering techniques.
* Sentiment analysis: the automated process of identifying an opinion (negative, positive or neutral) about a given subject from written or spoken language, which helps a business understand the social sentiment around its brand, product or service.
Example
# Load the data
star_wars_EPV <- read.csv("SW_EpisodeV.txt", header = TRUE, sep = " ")
View(star_wars_EPV)
str(star_wars_EPV)
names(star_wars_EPV)
# Keep only the dialogue column (the second column) as a data frame
dialogue <- data.frame(star_wars_EPV$dialogue)
# Rename the column
names(dialogue) <- "dialogue"
str(dialogue)
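If the file is not at hand, the loading step can be tried on inline text. The layout below (space-separated, quoted fields, a row number in front of each line) is only an assumption about what SW_EpisodeV.txt looks like, and the two dialogue lines are made up.

```r
# Hypothetical stand-in for the file contents
raw <- '"character" "dialogue"
"1" "LUKE" "Echo Three to Echo Seven."
"2" "HAN" "Loud and clear, kid."'

# The header has one field fewer than the data rows,
# so the first column is used as row names
sw <- read.table(text = raw, header = TRUE, sep = " ", stringsAsFactors = FALSE)
str(sw)

# Keep only the dialogue column, already under the right name
dialogue <- data.frame(dialogue = sw$dialogue, stringsAsFactors = FALSE)
print(dialogue)
```

Naming the column inside data.frame() avoids the separate renaming step.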
Example
# Data preprocessing using the tm package
library(tm)
# Build the text corpus
dialogue.corpus <- Corpus(VectorSource(dialogue$dialogue))
summary(dialogue.corpus)
# Inspect the first five documents before cleaning
inspect(dialogue.corpus[1:5])
# Clean the data
# Convert to lower case
dialogue.corpus <- tm_map(dialogue.corpus, content_transformer(tolower))
# Remove extra white space
dialogue.corpus <- tm_map(dialogue.corpus, stripWhitespace)
# Remove punctuation
dialogue.corpus <- tm_map(dialogue.corpus, removePunctuation)
# Remove numbers
dialogue.corpus <- tm_map(dialogue.corpus, removeNumbers)
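To see what each cleaning step does, it can help to run the same transformations on two short strings and look at the result. This sketch assumes the tm package is installed; the two strings are made up, and VCorpus is used here instead of Corpus purely as a choice for the example.

```r
library(tm)

# Two made-up documents with mixed case, punctuation, digits and extra spaces
texts <- c("Hello,   World! 123", "R is   FUN.")
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse runs of spaces

# Pull the cleaned text back out of the corpus
cleaned <- unname(sapply(corpus, as.character))
print(trimws(cleaned))
```

Whitespace is stripped last in this sketch because removing punctuation and numbers can leave gaps behind; stripWhitespace collapses repeated spaces but does not trim a trailing one, hence the trimws() call.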
Example
# Create a list of stop words, i.e. words that carry little meaning by themselves
my_stopwords <- c(stopwords("english"), "@", "http*", "url", "www*")
# Remove the stop words
dialogue.corpus <- tm_map(dialogue.corpus, removeWords, my_stopwords)
# Build the term document matrix
dialogue.tdm <- TermDocumentMatrix(dialogue.corpus)
dialogue.tdm
dim(dialogue.tdm)   # dimensions of the term document matrix
inspect(dialogue.tdm[1:10, 1:10])
# Remove sparse terms (words that occur in very few documents)
# sparse = 0.97: drop terms that are absent from roughly 97% or more of the documents
dialogue.imp <- removeSparseTerms(dialogue.tdm, 0.97)
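The meaning of the sparsity threshold is easier to see on a small matrix. The sketch below applies the rule in base R: keep a term only if it appears in more than (1 - sparse) of the documents. That rule is our reading of how removeSparseTerms() uses its argument and should be checked against the installed tm version; the matrix and the value 0.6 are made up.

```r
# Toy term-document matrix: rows are terms, columns are four documents
m <- rbind(
  force    = c(1, 2, 1, 1),  # present in 4 of 4 documents
  ship     = c(1, 0, 1, 2),  # present in 3 of 4 documents
  tauntaun = c(0, 0, 0, 1)   # present in 1 of 4 documents
)
colnames(m) <- paste0("D", 1:4)

sparse <- 0.6

# Number of documents in which each term occurs
doc_freq <- rowSums(m > 0)

# Keep terms that occur in more than 40% of the documents (4 * 0.4 = 1.6)
keep <- doc_freq > ncol(m) * (1 - sparse)
m_dense <- m[keep, , drop = FALSE]
print(m_dense)
```

A higher sparse value keeps more terms; a value such as 0.97 only removes terms that are missing from almost every document.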
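Example
# Finding words and their frequencies
# inspect() only displays the matrix, so convert it with as.matrix() to get the counts
temp <- as.matrix(dialogue.imp)
wordFreq <- data.frame(apply(temp, 1, sum))
wordFreq <- data.frame(ST = row.names(wordFreq), Freq = wordFreq[, 1])
head(wordFreq)
# Sort by frequency, most frequent first
wordFreq <- wordFreq[order(wordFreq$Freq, decreasing = TRUE), ]
View(wordFreq)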
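The same frequency table can be tried out on a small hand-made matrix, using rowSums() as a shorter equivalent of apply(temp, 1, sum). The terms and counts are invented for the example; the column names ST and Freq follow the code above.

```r
# Toy term-document matrix: rows are terms, columns are documents
m <- rbind(
  luke = c(2, 0, 1),
  han  = c(1, 1, 0),
  leia = c(0, 3, 2)
)
colnames(m) <- c("D1", "D2", "D3")

# Total count of each term across all documents
freq <- rowSums(m)

# Build the frequency table and sort it, most frequent term first
wordFreq <- data.frame(ST = names(freq), Freq = unname(freq), stringsAsFactors = FALSE)
wordFreq <- wordFreq[order(wordFreq$Freq, decreasing = TRUE), ]
print(wordFreq)
```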
Example
## Basic analysis
# Finding the most frequent terms/words
findFreqTerms(dialogue.tdm, 10)   # terms occurring at least 10 times
findFreqTerms(dialogue.tdm, 30)   # terms occurring at least 30 times
findFreqTerms(dialogue.tdm, 50)   # terms occurring at least 50 times
findFreqTerms(dialogue.tdm, 70)   # terms occurring at least 70 times
# Finding associations between terms/words
# The third argument is the minimum correlation a term must have to be reported
findAssocs(dialogue.tdm, "dont", 0.3)
findAssocs(dialogue.tdm, "get", 0.2)
findAssocs(dialogue.tdm, "right", 0.2)
findAssocs(dialogue.tdm, "will", 0.3)
findAssocs(dialogue.tdm, "know", 0.3)
findAssocs(dialogue.tdm, "good", 0.3)
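Both ideas can be reproduced in base R on a toy matrix, which makes the numbers easier to interpret: a frequency threshold on the row totals, and a correlation between the count vectors of two terms. The terms, counts and thresholds below are made up, and the sketch only approximates what findFreqTerms() and findAssocs() report.

```r
# Toy term-document matrix: rows are terms, columns are five documents
m <- rbind(
  hyperdrive = c(1, 0, 2, 0, 1),
  broken     = c(1, 0, 1, 0, 1),
  snow       = c(0, 3, 0, 1, 0)
)

# Terms whose total count is at least 4
frequent <- names(which(rowSums(m) >= 4))
print(frequent)

# Correlation of every term with "hyperdrive" across documents
cors <- cor(t(m))["hyperdrive", ]

# Keep the other terms whose correlation reaches the limit of 0.3
assoc <- cors[names(cors) != "hyperdrive" & cors >= 0.3]
print(round(assoc, 2))
```

"broken" tends to appear in the same documents as "hyperdrive", so it is reported as associated, while "snow" appears in the other documents and is not.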
Building Word Cloud
# Visualization using a word cloud
library("wordcloud")
library("RColorBrewer")
# The word cloud is built from the text corpus here, not from the term document matrix
# How to choose colors?
?brewer.pal
display.brewer.all()   # shows a chart of all available palettes
# display.brewer.pal() previews a single palette with the given number of colors
display.brewer.pal(8, "Dark2")
display.brewer.pal(8, "Purples")
display.brewer.pal(3, "Oranges")
set8 <- brewer.pal(8, "Dark2")   # store eight colors from the Dark2 palette
Building Word Cloud
# Plot the word cloud
wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
          random.order = TRUE, colors = set8)
# Same plot with a script-style font
wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
          random.order = TRUE, colors = set8,
          vfont = c("script", "plain"))
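wordcloud() can also be given a vector of words and a matching vector of frequencies instead of a corpus, which is handy when the counts have already been computed. The words and counts below are made up, and the plotting call is skipped if the wordcloud or RColorBrewer package is not installed.

```r
# Hypothetical words and their frequencies
words <- c("force", "ship", "rebel", "snow", "droid")
freq  <- c(12, 9, 7, 4, 2)

if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  set.seed(42)  # word placement is random; a fixed seed gives a repeatable layout
  wordcloud::wordcloud(words = words, freq = freq, min.freq = 3,
                       random.order = FALSE,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}

# Words at or above min.freq, i.e. the ones eligible to be drawn
kept <- words[freq >= 3]
print(kept)
```

With random.order = FALSE the most frequent words are placed first, near the centre of the plot.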
Next
We will learn how to use regular expressions to find and replace text.