Introduction to Text Mining

1. Class Outline
• Introduction: Unstructured Data Analysis
• Word-level Analysis
  – Vector Space Model
  – TF-IDF
• Beyond Word-level Analysis: Natural Language Processing (NLP)
• Text Mining Demonstration in R: Mining Twitter Data
2. Background: Text Mining – A New MR Tool!
• Text data is everywhere – books, news articles, financial analyses, blogs, social networks, etc.
• According to estimates, 80% of the world's data is in unstructured text format
• We need methods to extract, summarize, and analyze useful information from unstructured text data
• Text mining seeks to automatically discover useful knowledge from this massive amount of data
• Text mining is an active research area in both industry and academia
3. What is Text Mining?
• The use of computational techniques to extract high-quality information from text
• Extracting and discovering knowledge hidden in text, automatically
• KDD definition: "discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources"
4. Text Mining Tasks
• 1. Document categorization (supervised learning)
• 2. Document clustering/organization (unsupervised learning)
• 3. Summarization (keywords, indices, etc.)
• 4. Visualization (word clouds, maps)
• 5. Numeric prediction (e.g. stock market prediction based on news text)
5. Features of Text Data
• High dimensionality
• Large number of features
• Multiple ways to represent the same concept
• Highly redundant data
• Unstructured data
• Easy for humans, hard for machines
• Abstract ideas are hard to represent
• Huge amount of data to be processed
  – Automation is required
6. Acquiring Texts
• Existing digital corpora: e.g. XML (high-quality text and metadata)
  – http://www.hathitrust.org/htrc
• Other digital sources (e.g. the Web, Twitter, Amazon consumer reviews)
  – Through an API: e.g. tweets
  – Websites without APIs can be "scraped"
  – Generally requires custom programming (Perl, Python, etc.) or software tools (e.g. Web Extractor Pro)
• Undigitized text
  – Scanned and subjected to Optical Character Recognition (OCR)
  – Time- and labor-intensive
  – Error-prone
7. Word-level Analysis: Vector Space Model
• Documents are treated as a "bag" of words or terms
• Any document can be represented as a vector: a list of terms and their associated weights
  – D = {(t1, w1), (t2, w2), …, (tn, wn)}
  – ti: the i-th term
  – wi: the weight of the i-th term
• A weight is a measure of a term's importance or information content
8. Vector Space Model: Bag-of-Words Representation
• Each document: a sparse, high-dimensional vector! (see the sketch below)
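As a concrete illustration (not part of the original slides), here is a minimal R sketch using the tm package that the deck's later demonstrations rely on; the two toy documents are invented for the example. It turns each document into the sparse term-count vector described above:

    library(tm)

    # Two invented toy documents
    docs <- c("the cat chased the dog",
              "the dog chased the mouse")
    corpus <- VCorpus(VectorSource(docs))

    # Each column of the term-document matrix is one document's
    # sparse, high-dimensional term-count vector
    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)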
9. TF-IDF: Definition
• TF (term frequency): TF(t, d) = the number of times term t appears in document d
• IDF (inverse document frequency): IDF(t) = log10(N / n_t), where N is the total number of documents and n_t is the number of documents containing t
• TF-IDF(t, d) = TF(t, d) × IDF(t): high for terms that are frequent in a document but rare across the collection
10. TF-IDF: Example
• TF: Consider a document containing 100 words in which the word "cow" appears 3 times. Following the formulas defined above, what is the term frequency (TF) for "cow"?
  – TF(cow, d1) = 3
• IDF: Now assume we have 10 million documents and "cow" appears in one thousand of them. What is the inverse document frequency of the term "cow"?
  – IDF(cow) = log10(10,000,000 / 1,000) = 4
• TF-IDF score?
  – TF-IDF = 3 × 4 = 12 (the product of TF and IDF; checked in R below)
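The arithmetic above can be checked directly in R, the language used for the deck's demonstrations:

    # TF is a raw count; IDF uses log base 10, matching the example
    tf  <- 3                         # "cow" appears 3 times in the document
    idf <- log10(10000000 / 1000)    # 10 million docs, 1,000 contain "cow" -> 4
    tf * idf                         # TF-IDF = 3 * 4 = 12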
11. Application 1: Document Search with a Query

Document ID   Cat     Dog     Mouse   Fish    Horse   Cow     Matching Score
d1            0.397   0.397   0.000   0.475   0.000   0.000   1.268
d2            0.352   0.301   0.680   0.000   0.000   0.000   0.653
d3            0.301   0.363   0.000   0.000   0.669   0.741   0.664
d4            0.376   0.352   0.636   0.558   0.000   0.000   1.286
d5            0.301   0.301   0.000   0.426   0.544   0.544   1.028
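The slide does not state the query, but the matching scores above are reproduced (up to rounding) by the query {cat, dog, fish}, scored as the dot product of a binary query vector with each document's weight vector. A minimal R sketch under that assumption:

    # TF-IDF weights copied from the table above (rows = documents)
    weights <- matrix(c(0.397, 0.397, 0.000, 0.475, 0.000, 0.000,
                        0.352, 0.301, 0.680, 0.000, 0.000, 0.000,
                        0.301, 0.363, 0.000, 0.000, 0.669, 0.741,
                        0.376, 0.352, 0.636, 0.558, 0.000, 0.000,
                        0.301, 0.301, 0.000, 0.426, 0.544, 0.544),
                      nrow = 5, byrow = TRUE,
                      dimnames = list(paste0("d", 1:5),
                                      c("cat", "dog", "mouse", "fish", "horse", "cow")))

    # Assumed query {cat, dog, fish} as a binary vector
    query <- c(cat = 1, dog = 1, mouse = 0, fish = 1, horse = 0, cow = 0)
    weights %*% query   # matching score per document, e.g. d1 -> 1.269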
12. Application 2: Word Frequencies – Zipf's Law
• Idea: We use a few words very often and most words very rarely, because it is more effort to use a rare word
• Zipf's Law: The product of a word's frequency and its rank is [reasonably] constant
• Empirically demonstrable; holds up across different languages (see the toy check below)
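To make the claim concrete, here is a toy check in R with invented frequency counts; under Zipf's law the rank-frequency products stay roughly constant:

    # Invented counts for the five most common words, sorted by frequency
    freq <- c(the = 1000, of = 520, and = 340, to = 248, a = 205)
    rank <- seq_along(freq)
    rank * freq   # roughly constant: 1000, 1040, 1020, 992, 1025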
13. Application 2: Word Frequencies – Zipf's Law
14. Application 3: Word Cloud – Budweiser Example
• http://people.duke.edu/~el113/Visualizations.html
15. Problems with Word-level Analysis: Sentiment
• Sentiment is often expressed in a subtle manner, making it difficult to identify from any of a sentence's or document's terms considered in isolation
  – A positive or negative sentiment word may have opposite orientations in different application domains ("This camera sucks." -> negative; "This vacuum cleaner really sucks." -> positive)
  – A sentence containing sentiment words may not express any sentiment (e.g. "Can you tell me which Sony camera is good?")
  – Sarcastic sentences, with or without sentiment words, are hard to deal with (e.g. "What a great car! It stopped working in two days.")
  – Many sentences without sentiment words can also imply opinions (e.g. "This washer uses a lot of water." -> negative)
• We have to consider the overall context (the semantics of each sentence or document)
16. Natural Language Processing (NLP) to the Rescue!
• NLP is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
• Key idea: Use statistical machine learning to automatically learn the language from data!
• Major tasks in NLP:
  – Automatic summarization
  – Part-of-speech (POS) tagging
  – Relationship extraction
  – Sentiment analysis
  – Topic segmentation and recognition
  – Machine translation
17. Demonstration: POS Tagging – 1/2
• http://cogcomp.cs.illinois.edu/demo/pos/results.php
18. Demonstration: POS Tagging – 2/2
19. Demonstration: Sentence-level Sentiment – 1/3
• Stanford Sentiment Analyzer
  – http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
20. Demonstration: Sentence-level Sentiment – 2/3
• Review 1: "This movie doesn't care about cleverness, wit or any other kind of intelligent humor." -> Negative
21. Demonstration: Sentence-level Sentiment – 3/3
• "There are slow and repetitive parts, but it has just enough spice to keep it interesting." -> Positive
22. Text Mining Demonstration in R: Mining Twitter Data
23. Twitter Mining in R – 1/2
• Step 0) Install R and the required packages
  – R program: http://www.r-project.org/
  – Package: http://cran.r-project.org/web/packages/tm/index.html
  – Package: http://cran.r-project.org/web/packages/twitteR/index.html
  – Package: http://cran.r-project.org/web/packages/wordcloud/index.html
  – Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
• Step 1) Retrieving text from Twitter: the Twitter API, using twitteR (sketched below)
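A minimal sketch of Step 1 with the twitteR package named above, assuming you have registered a Twitter application; the OAuth keys below are placeholders, not real credentials:

    library(twitteR)

    # Register the app's OAuth credentials (placeholders, not real keys)
    setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                        consumer_secret = "YOUR_CONSUMER_SECRET",
                        access_token    = "YOUR_ACCESS_TOKEN",
                        access_secret   = "YOUR_ACCESS_SECRET")

    # Retrieve recent tweets matching a search term
    tweets <- searchTwitter("text mining", n = 200)
    tweet_text <- sapply(tweets, function(t) t$getText())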
24. Twitter Mining in R – 2/2 (all steps sketched below)
• Step 2) Transforming text
• Step 3) Stemming words
• Step 4) Building a term-document matrix
• Step 5) Frequent terms and associations
• Step 6) Word cloud
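A minimal sketch of Steps 2 through 6 with the tm and wordcloud packages; the tweet_text input and the example term "mining" carry over from the Step 1 sketch and are assumptions, not the deck's exact code:

    library(tm)
    library(SnowballC)   # stemming back end used by stemDocument
    library(wordcloud)

    corpus <- VCorpus(VectorSource(tweet_text))

    # Step 2) Transforming text: lower-case, drop punctuation, numbers, stopwords
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Step 3) Stemming words
    corpus <- tm_map(corpus, stemDocument)

    # Step 4) Build a term-document matrix
    tdm <- TermDocumentMatrix(corpus)

    # Step 5) Frequent terms and associations
    findFreqTerms(tdm, lowfreq = 10)     # terms appearing at least 10 times
    findAssocs(tdm, "mining", 0.25)      # terms correlated with "mining"

    # Step 6) Word cloud from overall term frequencies
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    wordcloud(names(freq), freq, min.freq = 5)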
25. Software for Text Mining
• A number of academic/commercial software packages are available:
  – 1. Open-source packages in R – e.g. tm
    • R program: http://www.r-project.org/
    • Package: http://cran.r-project.org/web/packages/tm/index.html
    • Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
  – 2. Stanford CoreNLP
    • http://nlp.stanford.edu/software/corenlp.shtml
  – 3. SAS Text Miner
  – 4. IBM SPSS
  – 5. BoosTexter
  – 6. StatSoft
  – 7. AeroText
• Text data is everywhere – you can mine it to gain insights!
