3. Difference between a Symposium & a Tutorial at HICSS Symposium Audience M:M Tutorial 1:M
4. Difference between a Symposium & a Tutorial at HICSS Wv(t + 1) = Wv(t) + Θ (v, t) α(t)(D(t) - Wv(t))
5. Agenda Part 1: Growing Interest in Analytics Overview of Text Mining and Analysis General Text Mining and Analysis Processes Part 2: Classification and Categorization Clustering Information Extraction Overview of Tools & Packages
6. This is the only note you’ll need to take Presentation can be found at: www.slideshare.net
7. Biography: Dave King Currently, EVP of Product Development and Management at JDA Software 28 years in enterprise package software business 15 years as university professor 12 years as Co-Chair of the Internet & Digital Economy Track (HICSS) Long time interest in various aspects of E-Commerce & Business Intelligence Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.
8. Personal Experiences with Analytics Taught applied statistics and math modeling In software R&D Optimization in the 80s Natural Language Frontends NLI Query & CMU Robotics Lab EIS Competitive Analysis Dow Jones and Reuters Verity Topics NewsAlert InXight’s Hyperbolic Tree Often the audience has been small, sometimes bewildered, and often fleeting
9. If I have seen further it is only by … plagiarizing the works of others.
15. Interest in Analytics:Growing Awareness Source: Google Trends Analytics – “Extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact-based management to drive decisions and actions…a subset of what has come to be called BI.” (Davenport and Harris, Competing on Analytics, HBS, 2007)
16. Interest in Analytics:Theory and Practice Data Mining Optimization In theory, there is no difference between theory and practice. But, in practice, there is.
18. Interest in Analytics:Potential Reasons for the Interest Next generation DSS: Progression of DSS->EIS->BI->PM->Analytics Increasing volumes of data requiring new approaches or modifications in existing approaches Focus on CRM and Supply Chains … General belief that more sophisticated analysis is required to compete in today’s environments …
19. Interest in Text Mining & Analytics: An old adage George Mallory: “Why did you want to climb Mount Everest?” (in 1923 interview). His reply: “Because it’s there.”
20. Interest in Text Mining & Analytics: The 80% Rule Unstructured (Textual) 80% Structured (Databases) 20% “It's a truism that 80 percent of business-relevant information originates in unstructured form, primarily text… The 80 percent unstructured figure comes from, well, everywhere.” Source: Seth Grimes, Unstructured Data and the 80 Percent Rule
21. Text Mining and Analytics:Definitions General: All types of text processing that deal with finding, organizing and analyzing textual (unstructured) information. Formal: Utilizing data mining techniques to create new information that is not obvious in a collection of documents (implies that Text Analytics ~ Text Mining ~ Text Data Mining)
22. Text Mining and Analytics: Types of Processing and Techniques Clustering. Grouping similar documents without having a predefined set of categories. Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes. Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching. Named-entity recognition. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons). Concept linking and topic tracking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods. Summarization. Summarizing a document to save time on the part of the reader.
23. Text Mining and Analytics:Sample Application Areas Seth Grimes Papers
24. Text Mining:A Common Issue George Herbert, Welsh Poet & Priest A great dowry is a bed full of brambles. Outlandish Proverbs, 1640 Structured data mining is a bed of roses when compared to unstructured, textual mining which is a bed of brambles
25. Data Mining: Simple Example (Affinity Analysis) Study of attributes or characteristics that “go together.” Seek to uncover “association rules” that quantify the relationship between two or more attributes. Rules take the form of “If antecedent, then consequent” Examples: Market basket analysis to determine which items are purchased together (in single transaction) Web analysis to determine which sequences of pages users visit Major issue is number of potential combinations as the number of attributes increases
26. Data Mining: Simple Example (Affinity Analysis) 1. Market Basket Analysis: Items for Sale: Apples Bananas Cherries Durians 2. Possible Transactions: With one item or a collection of items selected as the Driver or Independent Variable 3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.
27. Data Mining: Simple Example (Affinity Analysis) Standard Market Basket Measures: Support = N(X & Y) / N(T). Example: N(A & B) / N(T) = 2/7 = 29%. Confidence = N(X & Y) / N(X). Example: N(A & B) / N(A) = 2/4 = 50%. Where N(T) = number of transactions, N(X) = number of transactions containing X, and N(X & Y) = number of transactions containing both X and Y.
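The support and confidence measures can be sketched in a few lines of Python. The seven baskets below are hypothetical (the slide does not list them); they were chosen only so that the counts match the worked example: 7 transactions, 4 containing Apples, 2 containing both Apples and Bananas.

```python
# Hypothetical baskets (not from the slides), picked so the counts match
# the worked example: N(T) = 7, N(A) = 4, N(A & B) = 2.
transactions = [
    {"Apples", "Bananas"},
    {"Apples", "Bananas", "Cherries"},
    {"Apples", "Cherries"},
    {"Apples", "Durians"},
    {"Bananas", "Cherries"},
    {"Cherries", "Durians"},
    {"Bananas"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(x, y, baskets):
    """Support of the rule X -> Y: N(X & Y) / N(T)."""
    return count(x | y, baskets) / len(baskets)


def confidence(x, y, baskets):
    """Confidence of the rule X -> Y: N(X & Y) / N(X)."""
    return count(x | y, baskets) / count(x, baskets)


a, b = {"Apples"}, {"Bananas"}
print(f"support(A -> B)    = {support(a, b, transactions):.0%}")
print(f"confidence(A -> B) = {confidence(a, b, transactions):.0%}")
```

Representing each basket as a set keeps the "contains both X and Y" test to a single subset comparison; it also shows why the number of candidate rules grows quickly, since every subset of items is a potential antecedent.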
29. Data Mining: General Data Assumptions Requires structured data (numbers and categories well-defined) Transformed by data preparation or collected with a prior design in mind Typically housed and organized in a relational database, data mart or data warehouse
30. Data Mining: Simple Example But, what if the baskets were described in the following manner: Jane bought a handful of maraschinos and a couple of granny smiths. Harold purchased a bag of appls and 2 bananas. Bill paid for a pound of cherries but decided not to buy the three durians because of their odor. How could we automate the analysis?
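One naive way to start automating this is a lookup table plus fuzzy matching. The sketch below is only an illustration under stated assumptions: the `SYNONYMS` table, the `extract_items` helper, and the 0.85 similarity cutoff are all hypothetical, and the fuzzy matching uses Python's standard `difflib` module.

```python
import difflib
import re

# Hypothetical lookup table mapping surface forms to the four catalogue items.
SYNONYMS = {
    "apple": "Apples", "apples": "Apples", "granny smiths": "Apples",
    "banana": "Bananas", "bananas": "Bananas",
    "cherry": "Cherries", "cherries": "Cherries", "maraschinos": "Cherries",
    "durian": "Durians", "durians": "Durians",
}


def extract_items(sentence):
    """Return the catalogue items mentioned in a free-text sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    # Candidates are single words plus adjacent word pairs ("granny smiths").
    candidates = words + [" ".join(pair) for pair in zip(words, words[1:])]
    found = set()
    for candidate in candidates:
        # Fuzzy match tolerates small misspellings such as "appls".
        match = difflib.get_close_matches(
            candidate, list(SYNONYMS), n=1, cutoff=0.85
        )
        if match:
            found.add(SYNONYMS[match[0]])
    return found


for text in (
    "Jane bought a handful of maraschinos and a couple of granny smiths.",
    "Harold purchased a bag of appls and 2 bananas.",
    "Bill paid for a pound of cherries but decided not to buy the three "
    "durians because of their odor.",
):
    print(sorted(extract_items(text)))
```

The third sentence still yields Durians even though Bill did not buy them: the sketch ignores negation and quantities, which is exactly the kind of difficulty that separates text from well-formed transaction data.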
31. Data Mining: CRISP-DM (Cross-Industry Standard Process for Data Mining) [Diagram] Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Data Preparation turns Real-World Data into Well-Formed Data through Data Consolidation, Data Cleaning, Data Transformation, and Data Reduction.
32. Text Mining: CRISP-Like Processes [Diagram] Phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment. Document Preparation turns Real-World Text Data (Documents) into a Term-Doc-Matrix* through Document Consolidation (Establish the Corpus), Corpus Refinement (Token, Stem, Stop…), and Feature Selection & Weighting. * - Entity-Relationships
33. Text Mining Process: Establish the Corpus First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied. Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls … Convert into an organized set of texts, called a corpus, standardized and prepared for the purpose of knowledge discovery.
34. Text Mining Process: Establish the Corpus Brown Corpus – first million-word corpus, compiled in the 1960s at Brown U.; 500 samples across 15 genres, each ~2000 words, with POS tags. Linguistic Data Consortium Treebanks – collections of manually tagged and parsed sentences (tree structures) from a variety of sources (includes the well-known Penn Treebank collection). Reuters-21578, RCV1 & V2 – collections (1000s) of Reuters English & multilingual news stories classified into topics and grouped into training & test sets. Pang & Lee’s Sentiment Analysis – 1000 positive and 1000 negative movie reviews. MEDLINE – an extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications. WordNet® – large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations. Google Ngram – 500 billion words from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Russian, and Chinese.
36. Text Mining Process:Establishing the Corpus (Penn Treebank) .START Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Raw [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./. Tagged ( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) (VP will (VP join (NP the board) (PP-CLR as (NP a nonexecutive director)) (NP-TMP Nov. 29))) .)) Parsed
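The tagged form above is simple enough to read back into (word, tag) pairs. The following is a minimal sketch, assuming the `word/TAG` convention and the square chunk brackets shown on the slide; `parse_tagged` is a hypothetical helper, not part of any Treebank toolkit.

```python
# The tagged sentence exactly as shown on the slide.
tagged = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)


def parse_tagged(line):
    """Split 'word/TAG' tokens into (word, tag) pairs, skipping brackets."""
    pairs = []
    for chunk in line.split():
        if chunk in ("[", "]"):  # chunk brackets carry no token
            continue
        # Split on the LAST slash so that words containing '/' survive.
        word, _, tag = chunk.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(tagged)
proper_nouns = [word for word, tag in pairs if tag == "NNP"]
print(len(pairs), proper_nouns)
```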
37. Text Mining Process:Establishing the Corpus (Reuters) 14826 ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products. But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain. The U.S. Has said it will impose 300 mlndlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost. Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes.
40. Text Mining Process: Establish the Corpus (Google NGrams) 8,500 new words a year, 70% growth from 1950-2000, 50%+ of English lexicon is "dark matter." We’re forgetting our past faster with each passing year (tracking the references to the numerical years). Innovations spread faster than ever. Modern celebrities are younger and more famous than predecessors, but their fame is shorter-lived. Culturomics is a powerful tool for automatically identifying censorship and propaganda (e.g., Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold). "Freud" is more deeply engrained in our collective subconscious than "Galileo," "Darwin," or "Einstein." “Quantitative Analysis of Culture Using Millions of Digitized Books” Science Magazine, Dec. 18, 2010
41. Text Mining Process: Corpus Refinement Goal: a common representation of tokens within and between documents. Steps: Tokenization, Normalize, Eliminate Stop Words, Stemming. Tokenization: parse the text to generate terms; sophisticated analyzers can also extract phrases from the text. Normalize: convert terms to lowercase. Eliminate stop words: eliminate terms that appear very often (e.g. the, and, …). Stemming: convert terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved → achiev) [note: word about synonyms – WordNet Synset]
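The four refinement steps can be chained as a small pipeline. This is a sketch under loud assumptions: the stop-word list is a tiny illustrative sample, and `crude_stem` is a hypothetical suffix stripper written for this example; it is not the Porter algorithm or any other published stemmer, and real work would use an established one.

```python
import re

# Tiny illustrative stop list; real lists hold a few hundred words.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "was", "that"}


def tokenize(text):
    """Parse the text into word-like terms."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Convert every term to lowercase."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop terms that appear very often and carry little content."""
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix. Illustration only, NOT the Porter stemmer."""
    for suffix in ("es", "ed", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [crude_stem(token) for token in tokens]


print(refine("She achieves what the others achieved and hoped to achieve"))
```

Note that the crude rule over-stems ("hoped" becomes "hop"), which is a typical trade-off of suffix stripping: different word forms collapse to one feature at the cost of occasional collisions.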
42. Text Mining: Feature Extraction & Weighting Feature Extraction “Bag of Words, Terms or Tokens” Vector Representation: Word, Term or Token/Doc Matrix Words or Tokens are attributes and documents are examples
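A bag-of-words matrix can be built with nothing more than a counter per document. The three toy documents below are hypothetical; the sketch only shows the shape of the representation, with documents as rows (examples) and tokens as columns (attributes).

```python
from collections import Counter

# Hypothetical toy corpus, already tokenizable by whitespace.
docs = {
    "d1": "i feel happy today",
    "d2": "i feel sad and i feel tired",
    "d3": "happy happy day",
}

# One bag of words (token -> count) per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary defines the matrix columns.
vocab = sorted(set().union(*counts.values()))

# Rows are documents, columns are tokens; missing tokens count as 0.
matrix = [[counts[name][term] for term in vocab] for name in docs]

print(vocab)
for name, row in zip(docs, matrix):
    print(name, row)
```

Word order is discarded entirely, which is what "bag" means here; most cells are zero, so real term-document matrices are usually stored in a sparse format.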
43. Text Mining: Transforming Frequencies Binary frequencies: 1 for tf > 0; otherwise 0. Term frequencies: tf(i,j) divided by the sum of tf(i,j) over all terms i in document j. Log frequencies: 1 + log(tf) for tf > 0; otherwise 0. Normalized frequencies: divide each frequency by the square root of the sum of squares of the frequencies within the vector (column). Term Frequency–Inverse Document Frequency: TF * IDF. Inverse Document Frequency: log(N/(1+D)), where N is the total number of docs and D is the number of docs containing the term.
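Each transformation is a one-line function over raw counts. The sketch below follows the formulas on the slide; the function names are hypothetical, and the natural logarithm is an assumption, since the slide does not state a base.

```python
import math


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_frequency(tf):
    """1 + log(tf) for tf > 0, otherwise 0 (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def relative_frequencies(tfs):
    """Each count divided by the total count in the document."""
    total = sum(tfs)
    return [tf / total for tf in tfs] if total else [0.0] * len(tfs)


def l2_normalize(tfs):
    """Each count divided by the square root of the sum of squares."""
    norm = math.sqrt(sum(tf * tf for tf in tfs))
    return [tf / norm for tf in tfs] if norm else [0.0] * len(tfs)


def idf(n_docs, docs_with_term):
    """Inverse document frequency as on the slide: log(N / (1 + D))."""
    return math.log(n_docs / (1 + docs_with_term))


def tf_idf(tf, n_docs, docs_with_term):
    """Term frequency weighted by inverse document frequency."""
    return tf * idf(n_docs, docs_with_term)


print(l2_normalize([3, 4]))
print(tf_idf(tf=2, n_docs=3, docs_with_term=1))
```

With the 1 + D smoothing in the denominator, a term that appears in every document gets a slightly negative IDF (log of N/(N+1)), so such terms are pushed below zero rather than to exactly zero.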
44. Text Mining Processes: Simple Overview Example Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day). Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way. The URL format of many blog posts can be used to extract the username of the post's author, which is used to extract the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
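The sentence-scanning step described above can be sketched with two regular expressions. This is not the We Feel Fine implementation: the `FEELINGS` set is a five-word hypothetical stand-in for the list of about 5,000 feelings, and the sentence splitter is deliberately simple.

```python
import re

# Hypothetical stand-in for the ~5,000 pre-identified feelings.
FEELINGS = {"happy", "sad", "depressed", "asleep", "young"}

# Trigger phrases named on the slide.
TRIGGER = re.compile(r"\bi (?:feel|am feeling)\b", re.IGNORECASE)


def extract_feelings(post):
    """Return (feeling, sentence) pairs for sentences with a trigger phrase."""
    results = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", post):
        if not TRIGGER.search(sentence):
            continue
        words = re.findall(r"[a-z']+", sentence.lower())
        # Keep the first word that is a known feeling, if any.
        feeling = next((word for word in words if word in FEELINGS), None)
        if feeling:
            results.append((feeling, sentence.strip()))
    return results


post = (
    "Went to the store today. I feel too young to have her. "
    "It was sunny! I am feeling sad about it. I feel."
)
for feeling, sentence in extract_feelings(post):
    print(feeling, "|", sentence)
```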
45. Text Mining Processes: Simple Overview Example API Query from wefeelfine.org: http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=imageid,feeling,sentence,posttime,postdate,posturl,gender,born,country,state,city,lat,lon,conditions&limit=500 Result from Query: <?xml version="1.0" ?> <feelings> <feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" /> Source: www.wefeelfine.org/api.html
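The query URL and the XML result can both be handled with the Python standard library. The sketch below builds the URL shown on the slide and parses the sample record; it makes no network call, since the service may no longer be available. One assumption: a closing `</feelings>` tag is added to the sample so that it is well-formed, because the slide shows only the start of the response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://api.wefeelfine.org:8080/ShowFeelings"
fields = [
    "imageid", "feeling", "sentence", "posttime", "postdate", "posturl",
    "gender", "born", "country", "state", "city", "lat", "lon", "conditions",
]
params = {"display": "xml", "returnfields": ",".join(fields), "limit": 500}

# safe="," keeps the commas in returnfields unescaped, as on the slide.
url = BASE + "?" + urlencode(params, safe=",")
print(url)

# Sample record from the slide; the closing </feelings> tag is added here
# (assumption) so the fragment parses as a complete document.
sample = '''<?xml version="1.0" ?>
<feelings>
  <feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" />
</feelings>'''

root = ET.fromstring(sample)
# Each <feeling> element carries its data as attributes.
rows = [dict(element.attrib) for element in root.iter("feeling")]
for row in rows:
    print(row["feeling"], "|", row["postdate"], "|", row["state"])
```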
46. Text Mining Processes:Simple Overview Example i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one i went to mcd with an idiot which is having the same feeling as me now i feel asleep i feel about little red shoes and mittens i feel the sands of time moving so quickly in my life it seems i feel too young to have her this beauty across from me i feel like im waiting for something profound or inspirational to hit me …
47. Text Mining Processes:Simple Overview Example Input String (43743 chars; 8245 spaces) "i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better onei went to mcd with an idiot which is having the same feeling as me nowi'll feel bad bout it and soi feel asleep…” Tokenize (9019 tokens) ['i', "'m", 'blinded', 'to', 'other', 'santas', 'because', 'this', 'was', 'my', 'first', 'but', 'i', 'ca', "n't", 'help', 'feeling', 'that', 'there', 'ca', "n't", 'be', 'a', 'better', 'one', 'i', 'went', 'to', 'mcd', 'with', 'an', 'idiot', 'which', 'is', 'having', 'the', 'same', 'feeling', 'as', 'me', 'now', 'i', "'ll", 'feel', 'bad', 'bout', 'it', 'and', 'so', 'i', 'feel', 'asleep', …] Set of Tokens (1816 distinct tokens) ["'", "'bout", "'cleaner", "'d", "'http", "'i", "'ll", "'m", "'re", "'s", "'ve", '000', '039', '097', '1', '100', '101', '102', '104', '105', '108', '111', '114', '115', '116', '118', '11am', '12', '121', '15', '16', '180', '1998', '1st', '2', '2013', '23', '2nd', '3', '30', '78', '9', ':', 'a', 'ab', 'abit', 'able', 'about', 'above', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', …]
54. Text Mining Process: Overview Example 2 Twitter Statistics: ~106M registered users. New users 300K per day. 180 million unique visitors per month. 75% of traffic from 3rd Party Apps. Average 55 million tweets a day. 600 million search queries per day. 37% use their phone to tweet. 60% of tweets from 3rd Party Apps. Based on 1+B tweets generated by over 20 million Twitter users in 2010 (bio, web site, loc info). Source: huffingtonpost.com/2010/04/14/twitter-user-statistics-r_n_537992.html
55. Text Mining Process:Overview Example 2 Each tweet <= 140 characters (avg. 10-15 words/message) Heavy presence of non-alpha symb0-ols, abbrevs, misspellings and slang Tweets often include retweets (original tweet repeated) In spite of this – Tweets have proven to be an interesting text mining resource (e.g. see lifeanalytics.blogspot.com & mashable.com/author/dan-zarrella/)
56. Text Mining Process: Overview Example 2 Twitter gets a total of 3 billion requests a day via its API. API Calls for Public Tweets: http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1 http://api.twitter.com/1/trends/current.json?exclude=hashtags
57. Text Mining Process: Overview Example 2 u'iso_language_code': u'en', u'to_user_id_str': None, u'text': u"RT @EverSoSassy56 <--- I'm sportin' my glasses... I feel all sophisticated and stuff. :-) -- And the operative word is feeling...LOL", u'from_user_id_str': u'168852471', u'profile_image_url': u'http://a0.twimg.com/profile_images/1166685224/Jonise_normal.jpg', u'id': 16300313380130816L, u'source': u'<a href="http://twidroid.com" rel="nofollow">twidroid</a>', u'id_str': u'16300313380130816', u'from_user': u'XXXXXXXXXX', u'from_user_id': 168852471, u'to_user_id': None, u'geo': None, u'created_at': u'Sun, 19 Dec 2010 01:14:32 +0000', u'metadata': {u'result_type': u'recent'}
58. Text Mining Process:Establish the Corpus (2nd Example) Happy Face Sad Face Tokens = 14670 Set of Tokens= 2289 avg./Sent = 24 lex. div. = 6.4 Non-Stop words = 10406 Set Non-Stop = 2117 Stems = 5003 Set of Stems = 1052 w/o Feel = 3921 Set w/o Feel = 1051
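The "lex. div." figure appears to be the ratio of total tokens to distinct tokens, since 14670 / 2289 is about 6.4. A minimal sketch of that reading follows; the `lexical_diversity` name and the toy token list are assumptions for illustration.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


# Toy example: 6 tokens, 4 distinct.
toy = "i feel happy i feel sad".split()
print(lexical_diversity(toy))

# Check against the counts reported on the slide.
slide_value = 14670 / 2289
print(round(slide_value, 1))
```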
59. Text Mining Process:Overview Example 2 “Twitter Sentiment Classification using Distant Supervision” Utilizes presence of emoticons “ :)” & “ :( “ to serve as surrogates for classification as positive and negative sentiment statements To construct the term-document matrix relies on a list of positive and negative key words from Twittratr, counting number of key words that appear in each tweet. 180K tweets collected for training purposes between April and June 2009 80%+ accuracy in classification
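The distant-supervision idea can be sketched as follows: the emoticon supplies the label and is then removed so a classifier cannot simply memorize it. This is an illustration, not the paper's code; the keyword sets are hypothetical stand-ins for the Twittratr list, and skipping tweets with both or neither emoticon is an assumption of this sketch.

```python
POSITIVE, NEGATIVE = ":)", ":("

# Hypothetical stand-ins for the positive and negative keyword lists.
POS_WORDS = {"love", "great", "happy"}
NEG_WORDS = {"hate", "sad", "awful"}


def label_tweet(text):
    """Return (label, text without emoticons), or None if ambiguous."""
    has_pos, has_neg = POSITIVE in text, NEGATIVE in text
    if has_pos == has_neg:  # neither or both emoticons: no usable label
        return None
    label = "positive" if has_pos else "negative"
    cleaned = text.replace(POSITIVE, "").replace(NEGATIVE, "")
    return label, " ".join(cleaned.split())


def keyword_counts(text):
    """Count positive and negative keywords: a two-column feature row."""
    words = text.lower().split()
    positives = sum(word in POS_WORDS for word in words)
    negatives = sum(word in NEG_WORDS for word in words)
    return positives, negatives


for tweet in ("I love this great phone :)", "so sad today :(", "just landed"):
    labelled = label_tweet(tweet)
    if labelled:
        label, text = labelled
        print(label, keyword_counts(text), text)
```

The appeal of the approach is that no manual annotation is needed: the training labels come for free from the authors' own emoticons, at the cost of some label noise.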
60. Text Mining Processes: Overview Example 2 What is this? An area cartogram is a map in which some thematic mapping variable, such as travel time or GNP, is substituted for land area. The geometry or space of the map is distorted in order to convey the information of this alternate variable.
61. Text Mining Process:Overview Example 2 Pulse of the Nation: U.S. Mood throughout the Day Inferred from Twitter Analyzed 300M public tweets produced in the US from 9/2006-8/2009 and containing words from a psychological word-rating system (“Affective Norms for English Words”) Through a natural language processing algorithm called Sentiment Analysis, each tweet was assigned a mood score based on the number of positive or negative words it contained. Calculated the average mood score of all the users living in a state hour by hour which formed the basis of a series of time-varying mood maps.
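The scoring and hour-by-hour aggregation can be sketched as below. Everything here is a simplification under stated assumptions: the word sets are hypothetical stand-ins for the ANEW ratings, the score is a plain count of positive minus negative words, the four tweets are invented, and the average is taken over tweets rather than over users.

```python
from collections import defaultdict

# Hypothetical stand-ins for a rated word list such as ANEW.
POS = {"love", "happy", "great"}
NEG = {"hate", "sad", "angry"}


def mood_score(text):
    """Positive-word count minus negative-word count for one tweet."""
    words = text.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)


# Invented sample: (state, hour of day, tweet text).
tweets = [
    ("CA", 9, "love this great morning"),
    ("CA", 9, "sad commute"),
    ("NY", 9, "angry and sad"),
    ("NY", 21, "happy evening"),
]

# Group scores by (state, hour), then average each group.
scores = defaultdict(list)
for state, hour, text in tweets:
    scores[(state, hour)].append(mood_score(text))

average_mood = {key: sum(vals) / len(vals) for key, vals in scores.items()}
for (state, hour), mood in sorted(average_mood.items()):
    print(state, hour, mood)
```

Each (state, hour) average would then drive the colour of one state in one frame of a time-varying map.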
5.2 million digitized books - about 4% of all books ever printed; published during the past 200 years. All told, about 129 million books have been published since the invention of the printing press. In 2004, Google software engineers began making electronic copies of them, and have about 15 million so far, comprising more than two trillion words in 400 languages. They currently include Chinese, English, French, German, Russian and Spanish books dating back to the year 1500, about 4% of all books published. The database doesn't include periodicals, which might reflect popular culture from a different vantage. The resulting corpus contains over 500 billion words, in English (361 billion), French (45B), Spanish (45B), German (37B), Chinese (13B), Russian (35B), and Hebrew (2B). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 60 million words per year; by 1900, 1.4 billion; and by 2000, 8 billion.