
Text Analytics


  1. Presented by: Ajay Ram K P
  2. What is Text Analytics?  Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.  Text analytics can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated solutions.
  3. Why is Text Analytics important?  Emails, online reviews, tweets, call center agent notes, and the vast array of other written feedback all hold insight into customer wants and needs, but only if you can unlock it.  Text analytics is the way to extract meaning from this unstructured text and to uncover patterns and themes.
  4. Text Analytics in R  Text analytics in R is carried out with the tm package.  It is a framework for text mining applications within R.  It contains functions for actions such as content transformation, word removal, finding frequent terms, and more.
  5. The Case Study Data  The data used is a collection of game reviews in an Excel sheet.  Game reviews from 1000 gamers are recorded in the data set.  The objective is to analyze these reviews, treating all of them as one text, and find the most frequent words.
  6. Part 1: Reading the Data  The reviews are read into a variable docs using the functions VectorSource() and Corpus().  VectorSource() treats each element of a character vector as a document.  Corpus() builds the corpus, tm's container for the text.
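The reading step on slide 6 can be sketched as follows; the `reviews` vector is a placeholder for the column read from the Excel sheet (e.g. via readxl::read_excel()):

```r
# Minimal sketch of slide 6: building a tm corpus from review text.
library(tm)

# Placeholder data standing in for the 1000 game reviews.
reviews <- c("Great game, loved the graphics",
             "Too many bugs, would not recommend")

docs <- Corpus(VectorSource(reviews))  # one corpus document per review
inspect(docs)                          # summary of the corpus contents
```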
  7. Cleaning the Data  Data cleansing is required because most of the reviews contain punctuation, numbers, stop words, etc. that we don't need for the analysis.  Depending on what you are trying to achieve with your analysis, you may want to do the data cleaning step differently.  Data cleansing is done using the tm_map() function in R.
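The slide does not show the exact cleaning calls; a typical tm_map() pipeline covering the transformations it mentions (punctuation, numbers, stop words) might look like this:

```r
# Sketch of the slide-7 cleaning step with common tm transformations.
library(tm)

docs <- Corpus(VectorSource(c("Great GAME!!! 10/10, loved it.")))

docs <- tm_map(docs, content_transformer(tolower))       # lower-case text
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse spaces
```

Which steps to apply, and in what order, depends on the analysis; lower-casing before stop-word removal matters because stopwords("english") is lower-case.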
  8. Finding the Frequent Terms and Their Frequency  Converting the corpus into a document-term matrix.  A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose.  The tm package stores term-document matrices as sparse matrices for efficiency. Since we treat all 1000 reviews as one text, the matrix is small enough to convert into a normal matrix, which is easier to work with. Code: dtm <- TermDocumentMatrix(docs) m <- as.matrix(dtm)  We then take the row sums of this matrix (one row per term), which gives us a named vector.  And now we can sort this vector to see the most frequently used words. Code: v <- sort(rowSums(m), decreasing = TRUE) head(v)
  9. (Screenshot of the head(v) output; no text recoverable.)
  10. Plotting the Word Cloud  For plotting the word cloud, we use the wordcloud package.
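The slide names only the package; a plausible call using the named frequency vector `v` built on slide 8 would be:

```r
# Sketch of slide 10: word cloud from the term-frequency vector v.
# The min.freq / max.words values are illustrative, not from the deck.
library(wordcloud)

wordcloud(words = names(v), freq = v,
          min.freq = 2, max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))  # RColorBrewer palette
```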
  11. And voilà!
  12. Part 2: Creating the Network  For network creation, we take the help of the packages  igraph  sna  network
  13. Creating the Network  Finding the associations.  The findAssocs() function is used.
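findAssocs() reports the terms whose occurrence correlates with a given term above a threshold; the term "game" and the 0.3 limit below are illustrative, not from the deck:

```r
# Sketch of slide 13: term associations from the term-document matrix.
library(tm)

# dtm is the TermDocumentMatrix built on slide 8.
assocs <- findAssocs(dtm, terms = "game", corlimit = 0.3)
assocs  # named list: correlated terms and their correlations
```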
  14. Creating the Network  Plotting the graph.  Using the igraph package and the graph.data.frame() function.
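graph.data.frame() builds a graph from an edge list; the term pairs below are hypothetical stand-ins for the associations found on slide 13:

```r
# Sketch of slide 14: plotting a term network with igraph.
library(igraph)

# Hypothetical edge list derived from findAssocs() output.
edges <- data.frame(from = c("game", "game", "story"),
                    to   = c("graphics", "story", "ending"))

g <- graph.data.frame(edges, directed = FALSE)
plot(g, vertex.color = "lightblue")
```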
  15. And there it is!
  16. Another Graph…  A graph where the frequent terms are the nodes and the term frequencies give the interaction strength (edge weight).
  17. In Case of Large Networks  Say the network has more than 10K nodes; such networks become too complicated to read visually.  To quantify such networks, we turn to their statistical properties.  Random network, scale-free network, or hierarchical network models would fit such cases. (Figures: Random Network, Scale-free Network, Hierarchical Network)
  18. Where else can network approaches be powerful?  Biological science  Economics  Computer science
  19. Thank you!