Tutorial: Text Data Mining and Analytics HICSS 44 – January 2011 Dave King
Welcome to one of the HICSS SWoTs
Difference between a Symposium & a Tutorial at HICSS Symposium Audience M:M Tutorial 1:M
Difference between a Symposium & a Tutorial at HICSS Wv(t + 1) = Wv(t) + Θ (v, t) α(t)(D(t) - Wv(t))
Agenda Part 1: Growing Interest in Analytics Overview of Text Mining and Analysis General Text Mining and Analysis Processes Part 2: Classification and Categorization Clustering Information Extraction Overview of Tools & Packages
This is the only note you’ll need to take Presentation can be found at: www.slideshare.net
Biography: Dave King Currently, EVP of Product Development and Management at JDA Software 28 years in enterprise package software business 15 years as university professor 12 years as Co-Chair of the Internet & Digital Economy Track (HICSS) Long time interest in various aspects of E-Commerce & Business Intelligence Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.
Personal Experiences with Analytics Taught applied statistics and math modeling In software R&D Optimization in the 80s Natural Language Frontends NLI Query & CMU Robotics Lab EIS Competitive Analysis Dow Jones and Reuters Verity Topics NewsAlert InXight’s Hyperbolic Tree Often the audience has been small, sometimes bewildered, and often fleeting
If I have seen further it is only by … plagiarizing the works of others.
Text Mining & Analysis Resources: Books
Text Mining & Analysis Resources: Web Sites & Sources
TM/Blog -- blogs.sas.com/text-mining
TM/Blog -- texttechnologies.com
TM/Blog -- lingpipe-blog.com
TM & Analytics/Blog -- intelligent-enterprise.informationweek.com/movabletype/blog/sgrimes.html
TM/Wiki -- textanalytics.wikidot.com
TA/General -- social.textanalyticsnews.com
TA/General -- textanalysis.info
TA/General -- klariti.com/text-mining/index.shtml
TM & DM/Online Book -- statsoft.com/textbook/text-mining/
TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html
TA Tutorial -- slideshare.net/SethGrimes/text-analytics-for-dummies-2010
TM Tutorial -- www.esi.uem.es/~jmgomez/tutorials/ecmlpkdd02
TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial
Text Mining & Analysis Resources: Associated Web Sites & Sources
DM/Blog -- datamining.typepad.com
DM/Blog -- abbottanalytics.blogspot.com
DM/Blog -- bx.businessweek.com/data-mining/blogs
DM/Blog -- blog.data-miners.com
DM/Blog -- datawrangling.com
DM/Blog -- bytemining.com
DM/Blog -- marktab.net/datamining
DM/Blog -- dataminingblog.com
DM/Blog -- timmanns.blogspot.com
DM/General -- kdnuggets.com
DM/General -- mydatamine.com
DM/General -- the-data-mine.com
DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm
DM/Tutorial -- autonlab.org/tutorials/
Initial Question: What search terms are graphed?
Interest in Analytics: Growing Awareness (Source: Google Trends) Analytics: “Extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact-based management to drive decisions and actions…a subset of what has come to be called BI.” (Davenport and Harris, Competing on Analytics, HBS, 2007)
Interest in Analytics: Theory and Practice Data Mining Optimization In theory, there is no difference between theory and practice. But, in practice, there is.
Interest in Analytics: Popular Titles
Interest in Analytics: Potential Reasons for the Interest Next generation DSS: Progression of DSS->EIS->BI->PM->Analytics Increasing volumes of data requiring new approaches or modifications in existing approaches Focus on CRM and Supply Chains … General belief that more sophisticated analysis is required to compete in today’s environments …
Interest in Text Mining & Analytics: An Old Adage George Mallory, asked in a 1923 interview, “Why did you want to climb Mount Everest?”, replied, “Because it’s there.”
Interest in Text Mining & Analytics: The 80% Rule Unstructured (Textual) 80% Structured (Databases) 20% “It's a truism that 80 percent of business-relevant information originates in unstructured form, primarily text… The 80 percent unstructured figure comes from, well, everywhere.” Source: Seth Grimes, Unstructured Data and the 80 Percent Rule
Text Mining and Analytics: Definitions General: All types of text processing that deal with finding, organizing and analyzing textual (unstructured) information. Formal: Utilizing data mining techniques to create new information that is not obvious in a collection of documents (implies that Text Analytics ~ Text Mining ~ Text Data Mining)
Text Mining and Analytics: Types of Processing and Techniques
Clustering. Grouping similar documents without having a predefined set of categories.
Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences via pattern matching.
Named-entity recognition. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons).
Concept linking and topic tracking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
Summarization. Condensing a document to save the reader time.
Text Mining and Analytics: Sample Application Areas Seth Grimes Papers
Text Mining: A Common Issue George Herbert, Welsh Poet & Priest: “A great dowry is a bed full of brambles.” (Outlandish Proverbs, 1640) Structured data mining is a bed of roses when compared to unstructured, textual mining, which is a bed of brambles.
Data Mining: Simple Example (Affinity Analysis) Study of attributes or characteristics that “go together.” Seek to uncover “association rules” that quantify the relationship between two or more attributes. Rules take the form of “If antecedent, then consequent” Examples: Market basket analysis to determine which items are purchased together (in single transaction) Web analysis to determine which sequences of pages users visit Major issue is number of potential combinations as the number of attributes increases
Data Mining:  Simple Example (Affinity Analysis) 1. Market Basket Analysis: Items for Sale: Apples Bananas Cherries Durians 2. Possible Transactions:  With one item or a collection of items selected as the Driver or Independent Variable 3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.
Data Mining: Simple Example (Affinity Analysis) Standard Market Basket Measures:
Support = N(X & Y) / N(T). Example: N(A & B) / N(T) = 2/7 = 29%
Confidence = N(X & Y) / N(X). Example: N(A & B) / N(A) = 2/4 = 50%
where N(T) is the number of transactions, N(X) is the number of transactions containing X, and N(X & Y) is the number of transactions containing both X and Y.
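The two measures can be sketched in a few lines of Python. The baskets below are hypothetical (the deck does not list the actual transactions); they are chosen only so that the counts match the worked example: N(T) = 7, N(A) = 4 and N(A & B) = 2. The function names are illustrative.

```python
# Hypothetical baskets (not from the slides), chosen so that
# N(T) = 7, N(A) = 4 and N(A & B) = 2 as in the worked example.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "C"},
    {"C", "D"},
    {"D"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(x, y, baskets):
    """Support of the rule X -> Y: N(X & Y) / N(T)."""
    return count(x | y, baskets) / len(baskets)

def confidence(x, y, baskets):
    """Confidence of the rule X -> Y: N(X & Y) / N(X)."""
    return count(x | y, baskets) / count(x, baskets)

rule_support = support({"A"}, {"B"}, transactions)
rule_confidence = confidence({"A"}, {"B"}, transactions)
print(f"support(A -> B) = {rule_support:.0%}")
print(f"confidence(A -> B) = {rule_confidence:.0%}")
```

With these baskets the rule A -> B has support 2/7 and confidence 2/4, matching the figures on the slide.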
Data Mining:  Simple Example (Affinity Analysis)
Data Mining: General Data Assumptions Requires structured data (numbers and categories well-defined) Transformed by data preparation or collected with a prior design in mind Typically housed and organized in a relational database, data mart or data warehouse
Data Mining: Simple Example But, what if the baskets were described in the following  manner: Jane bought a handful of maraschinos and a couple of granny smiths. Harold purchased a bag of appls and 2 bananas. Bill paid for a pound of cherries but decided not to buy the three durians because of their odor. How could we automate the analysis?
Data Mining: CRISP-DM Real-World Data Data Consolidation Data Cleaning Business Understanding Data Understanding Data Preparation Deployment Data Transformation Data Reduction Modeling Evaluation Well-Formed Data Cross-Industry Standard Process for Data Mining
Text Mining: CRISP-Like Processes Real-World Text Data Document Consolidation Establish the Corpus Business Understanding Document Understanding Document Preparation Deployment Corpus Refinement (Token, Stem, Stop…) Feature Selection & Weighting Documents Modeling Evaluation Term-Doc-Matrix* * - Entity-Relationships
Text Mining Process: Establish the Corpus The first step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied. Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls … Convert these into an organized set of texts, called a corpus, standardized and prepared for the purpose of knowledge discovery.
Text Mining Process: Establish the Corpus
Brown Corpus -- first million-word corpus, compiled in the 1960s at Brown University: 500 samples across 15 genres, each ~2000 words, with POS tags
Linguistic Data Consortium treebanks -- collections of manually tagged and parsed sentences (tree structures) from a variety of sources (includes the well-known Penn Treebank collection)
Reuters 21578, RCV1 & V2 -- collections (1000s) of Reuters English and multilingual news stories classified into topics and grouped into training and test sets
Pang & Lee’s Sentiment Analysis -- 1000 positive and 1000 negative movie reviews
MEDLINE -- an extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications
WordNet® -- large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations
Google Ngram -- 500 billion words from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Russian, and Chinese
Text Mining Process: Establishing the Corpus (Brown) Sample Tagged Entry
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
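A tagged entry in this word/tag form is straightforward to read programmatically. The following is a minimal sketch in plain Python, assuming tokens are separated by whitespace and the tag follows the last slash; the helper name `parse_tagged` is illustrative and not part of any corpus toolkit.

```python
# The tagged sentence is copied from the Brown Corpus sample above.
tagged = (
    "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
    "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
    "primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' "
    "that/cs any/dti irregularities/nns took/vbd place/nn ./."
)

def parse_tagged(text):
    """Split 'word/tag' tokens on the last slash into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

pairs = parse_tagged(tagged)

# Keep the words whose tag starts with 'nn' (common nouns in this tag set).
nouns = [word for word, tag in pairs if tag.startswith("nn")]

print(pairs[:3])
print(nouns)
```

Splitting on the last slash rather than the first keeps tokens such as `./.` intact.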
Text Mining Process: Establishing the Corpus (Penn Treebank)
Raw: .START Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Tagged: [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
Parsed: ( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) (VP will (VP join (NP the board) (PP-CLR as (NP a nonexecutive director)) (NP-TMP Nov. 29))) .))
Text Mining Process: Establishing the Corpus (Reuters)
14826 ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said.
They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products.
But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain.
The U.S. Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost.
Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes.
Text Mining Process: Establish the Corpus (Google NGrams) http://ngrams.googlelabs.com
Text Mining Process: Establish the Corpus (Google NGrams) Source: Google NGram
Text Mining Process: Establish the Corpus (Google NGrams) 8,500 new words a year, 70% growth from 1950-2000, 50%+ of English lexicon is "dark matter." We’re forgetting our past faster with each passing year (tracking the references to the numerical years). Innovations spread faster than ever. Modern celebrities are younger and more famous than predecessors, but their fame is shorter-lived. Culturomics is a powerful tool for automatically identifying censorship and propaganda (e.g. the Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold). "Freud" is more deeply engrained in our collective subconscious than "Galileo," "Darwin," or "Einstein." “Quantitative Analysis of Culture Using Millions of Digitized Books” Science Magazine, Dec. 18, 2010
Text Mining Process: Corpus Refinement
Goal: a common representation of tokens within and between documents. Steps: Tokenization, Normalize, Eliminate Stop Words, Stemming.
Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
Normalize: Convert the terms to lowercase.
Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).
Stemming: Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved -> achiev).
[Note: a word about synonyms, see WordNet synsets]
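The four refinement steps can be chained as a small pipeline. The sketch below uses only the Python standard library and rests on loud assumptions: the stop list is a tiny illustrative one, and the stemmer is a toy suffix stripper rather than the Porter algorithm or whatever stemmer the tutorial actually used. All function names are hypothetical.

```python
import re

# Illustrative stop list only; real stop lists are considerably longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "i"}

# Toy suffix stripper, NOT the Porter algorithm: it only covers regular
# forms such as achieve / achieves / achieved.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def tokenize(text):
    """Split text into runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def refine(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [stem(t) for t in tokens]

print(refine("The team achieves what it achieved and hopes to achieve"))
```

Even this toy version shows the cost of stemming: "hopes" is reduced to "hop", which collides with an unrelated word. Production stemmers reduce, but do not eliminate, such collisions.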
Text Mining: Feature Extraction & Weighting Feature Extraction “Bag of Words, Terms  or Tokens” Vector Representation: Word, Term or Token/Doc Matrix Words or Tokens are attributes and documents are examples
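As a rough illustration of the bag-of-words representation, the sketch below builds a document-by-term count matrix with documents as rows (examples) and terms as columns (attributes). The three short documents are hypothetical and assumed to be already tokenized and stemmed.

```python
from collections import Counter

# Hypothetical, already-refined documents (not from the slides).
docs = {
    "d1": ["feel", "happi", "feel", "good"],
    "d2": ["feel", "sad"],
    "d3": ["good", "day"],
}

def term_document_matrix(documents):
    """Return (sorted vocabulary, {doc_id: row of raw term counts})."""
    vocabulary = sorted({term for tokens in documents.values() for term in tokens})
    rows = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)  # missing terms count as 0
        rows[doc_id] = [counts[term] for term in vocabulary]
    return vocabulary, rows

vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Word order is discarded entirely; only the counts per document survive, which is what "bag" refers to.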
Text Mining: Transforming Frequencies
Binary frequencies: 1 for tf > 0; otherwise 0
Term frequencies: tf(i,j) divided by the sum of tf(i,j) over all terms in the same document
Log frequencies: 1 + log(tf) for tf > 0; otherwise 0
Normalized frequencies: divide each frequency by the square root of the sum of squares of the frequencies within the vector (column)
Term frequency-inverse document frequency: TF * IDF
Inverse document frequency: log(N/(1+D)), where N is the total number of docs and D is the number of docs containing the term
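Each of these transformations is a one-line function. The sketch below follows the formulas as written on the slide, including the log(N/(1+D)) form of IDF; the function names and the sample counts are illustrative assumptions, and the natural logarithm is assumed because the slide does not state a base.

```python
import math

def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0

def log_frequency(tf):
    """1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def relative_frequencies(counts):
    """Each count divided by the total count in the same document."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def l2_normalize(counts):
    """Each count divided by the square root of the sum of squares."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency as on the slide: log(N / (1 + D))."""
    return math.log(n_docs / (1 + n_docs_with_term))

def tf_idf(tf, n_docs, n_docs_with_term):
    """Term frequency weighted by inverse document frequency."""
    return tf * idf(n_docs, n_docs_with_term)

# Hypothetical counts for one document over a four-term vocabulary.
counts = [3, 0, 4, 0]
print([binary(c) for c in counts])
print(relative_frequencies(counts))
print(l2_normalize(counts))
print(tf_idf(3, n_docs=100, n_docs_with_term=9))
```

One property of this IDF variant worth noting: a term that appears in every document gets a slightly negative weight, because N/(1+N) is below 1.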
Text Mining Processes: Simple Overview Example Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day). Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way. The URL format of many blog posts can be used to extract the username of the post's author, which is used to look up the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
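The sentence-scanning step described above can be approximated with two regular expressions. This is a sketch under stated assumptions, not the site's actual code: the feeling list is a six-word stand-in for the roughly 5,000 pre-identified feelings, the sentence splitter is naive, and the rule that the first matching feeling wins is a guess.

```python
import re

# Hypothetical stand-in for the roughly 5,000 pre-identified feelings.
FEELINGS = {"asleep", "bad", "depressed", "happy", "super", "young"}

# Matches the two trigger phrases named above.
TRIGGER = re.compile(r"\bi (?:feel|am feeling)\b", re.IGNORECASE)

def split_sentences(post):
    """Naive splitter on ., ! and ? (an assumption, not the site's method)."""
    return [s.strip() for s in re.split(r"[.!?]+", post) if s.strip()]

def extract_feelings(post):
    """Return (sentence, feeling) pairs for sentences that contain a
    trigger phrase and at least one known feeling word."""
    results = []
    for sentence in split_sentences(post):
        if not TRIGGER.search(sentence):
            continue
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word in FEELINGS:
                results.append((sentence, word))
                break  # one feeling per sentence; first match wins
    return results

post = "Went to the mall today. I feel too young to have her. I am feeling super depressed!"
for sentence, feeling in extract_feelings(post):
    print(feeling, "<-", sentence)
```

Under the first-match rule the last sentence is filed under "super" rather than "depressed", which shows how crude single-word matching can be.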
Text Mining Processes: Simple Overview Example
API query from wefeelfine.org:
http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=imageid,feeling,sentence,posttime,postdate,posturl,gender,born,country,state,city,lat,lon,conditions&limit=500
Result from query:
<?xml version="1.0" ?>
<feelings>
<feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" />
Source: www.wefeelfine.org/api.html
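A result in this XML shape can be read with the standard library's `xml.etree.ElementTree`. The sketch below parses the single record shown on the slide rather than calling the API, whose current availability is not assumed; the closing `</feelings>` tag is added so the fragment is well formed, and `parse_feelings` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the query result above; the closing </feelings>
# tag is added here so the fragment parses (the slide shows only one record).
SAMPLE = """<?xml version="1.0" ?>
<feelings>
  <feeling feeling="super"
           sentence="i've been feeling super depressed missing my ex"
           posttime="1292298985" postdate="2010-12-13"
           posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html"
           gender="0" country="united states" state="south carolina" />
</feelings>"""

def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(element.attrib) for element in root.iter("feeling")]

records = parse_feelings(SAMPLE)
for record in records:
    print(record["postdate"], record["state"], record["feeling"], "|", record["sentence"])
```

Every field arrives as a string attribute, so values such as `posttime` would need explicit conversion before any date arithmetic.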
Text Mining Processes:Simple Overview Example i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one i went to mcd with an idiot which is having the same feeling as me now i feel asleep i feel about little red shoes and mittens i feel the sands of time moving so quickly in my life it seems i feel too young to have her this beauty across from me i feel like im waiting for something profound or inspirational to hit me …
Text Mining Processes:Simple Overview Example Input String (43743 chars; 8245 spaces) "i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better onei went to mcd with an idiot which is having the same feeling as me nowi'll feel bad bout it and soi feel asleep…” Tokenize (9019 tokens) ['i', "'m", 'blinded', 'to', 'other', 'santas', 'because', 'this', 'was', 'my', 'first', 'but', 'i', 'ca', "n't", 'help', 'feeling', 'that', 'there', 'ca', "n't", 'be', 'a', 'better', 'one', 'i', 'went', 'to', 'mcd', 'with', 'an', 'idiot', 'which', 'is', 'having', 'the', 'same', 'feeling', 'as', 'me', 'now', 'i', "'ll", 'feel', 'bad', 'bout', 'it', 'and', 'so', 'i', 'feel', 'asleep', …] Set of Tokens (1816 distinct tokens)  ["'", "'bout", "'cleaner", "'d", "'http", "'i", "'ll", "'m", "'re", "'s", "'ve", '000', '039', '097', '1', '100', '101', '102', '104', '105', '108', '111', '114', '115', '116', '118', '11am', '12', '121', '15', '16', '180', '1998', '1st', '2', '2013', '23', '2nd', '3', '30', '78', '9', ':', 'a', 'ab', 'abit', 'able', 'about', 'above', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', …]
Text Mining Processes:Simple Overview Example
Text Mining Process: Simple Overview Example
Eliminate stopwords (175 words: 'a', 'about', 'above', 'after', …)
Content: 4390 tokens, or 49% of all tokens, are not stopwords; 4053 remain once tokens starting with apostrophes and numbers are eliminated
Set of tokens (1651) with stopwords eliminated: ['ab', 'abit', 'able', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', 'add', …]
Stemming
Stemmed tokens (4053): ['ab', 'abit', 'abl', 'ab', 'absolut', 'absolut', 'absorb', 'abus', 'accomplish', 'accomplish', 'achiev', 'achiev', 'across', 'act', 'action', 'activ', 'activ', 'actual', 'acura', 'add', …]
Set of tokens in stemmed content (1388): ['ab', 'abit', 'abl', 'absolut', 'absorb', 'abus', 'accomplish', 'achiev', 'across', 'act', 'action', 'activ', 'actual', 'acura', 'ad', 'add', …]
Text Mining Processes:Simple Overview Example
Text Mining Process: Simple Overview Example Document-Term Matrix
Text Mining Process: Establish the Corpus (Simple Example) Madness Murmurings Montage Mounds Metrics Mobs
Text Mining Processes:Overview Example 2 Question: What is this? Answer: This is Twitter on steroids.
Text Mining Process: Overview Example 2 Twitter Statistics: ~106M registered users. New users 300K per day. 180 million unique visitors per month. 75% of traffic from 3rd Party Apps. Average 55 million tweets a day. 600 million search queries per day. 37% use their phone to tweet. 60% of tweets from 3rd Party Apps. Based on 1+B tweets generated by over 20 million Twitter users in 2010 (bio, web site, loc info). Source: huffingtonpost.com/2010/04/14/twitter-user-statistics-r_n_537992.html
Text Mining Process: Overview Example 2 Each tweet <= 140 characters (avg. 10-15 words/message). Heavy presence of non-alphabetic symbols, abbreviations, misspellings and slang. Tweets often include retweets (original tweet repeated). In spite of this, tweets have proven to be an interesting text mining resource (e.g. see lifeanalytics.blogspot.com & mashable.com/author/dan-zarrella/)
Text Mining Process: Overview Example 2 Twitter gets a total of 3 billion requests a day via its API. API calls for public tweets:
http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1
http://api.twitter.com/1/trends/current.json?exclude=hashtags
Text Mining Process: Overview Example 2 u'iso_language_code': u'en', u'to_user_id_str': None, u'text': u"RT @EverSoSassy56 &lt;--- I'm sportin' my glasses... I feel all sophisticated and stuff. :-)  -- And the operative word is feeling...LOL", u'from_user_id_str': u'168852471', u'profile_image_url': u'http://a0.twimg.com/profile_images/1166685224/Jonise_normal.jpg', u'id': 16300313380130816L, u'source': u'&lt;ahref=&quot;http://twidroid.com&quot; rel=&quot;nofollow&quot;&gt;twidroid&lt;/a&gt;', u'id_str': u'16300313380130816', u'from_user': u'XXXXXXXXXX', u'from_user_id': 168852471, u'to_user_id': None, u'geo': None, u'created_at': u'Sun, 19 Dec 2010 01:14:32 +0000', u'metadata': {u'result_type': u'recent'}
Text Mining Process:Establish the Corpus (2nd Example) Happy Face  Sad Face  Tokens   = 14670  Set of Tokens= 2289  avg./Sent = 24 lex. div.     = 6.4   Non-Stop words = 10406 Set Non-Stop = 2117 Stems    =   5003  Set of Stems = 1052 w/o Feel =   3921  Set w/o Feel = 1051
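The "lex. div." figure above appears to be the ratio of total tokens to distinct tokens, since 14670 / 2289 rounds to the reported 6.4; that reading is an inference from the numbers, not something the slide states. A minimal sketch under that assumption, with a hypothetical sample sentence:

```python
def lexical_diversity(tokens):
    """Average number of uses per distinct token: len(tokens) / len(set(tokens))."""
    return len(tokens) / len(set(tokens))

# Counts reported above: 14670 tokens, 2289 distinct tokens.
reported = 14670 / 2289
print(round(reported, 1))

# Hypothetical sample: 9 tokens, 5 of them distinct.
sample = "i feel happy i feel sad i feel fine".split()
print(lexical_diversity(sample))
```

A higher value means more repetition, which is expected here because every collected tweet was selected for containing "feel" or "feeling".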
Text Mining Process: Overview Example 2 “Twitter Sentiment Classification using Distant Supervision” Utilizes the presence of the emoticons “:)” and “:(” as surrogates for classifying statements as positive or negative sentiment. To construct the term-document matrix, relies on a list of positive and negative key words from Twittratr, counting the number of key words that appear in each tweet. 180K tweets collected for training purposes between April and June 2009. 80%+ accuracy in classification.
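The idea of distant supervision can be shown on a toy scale: the emoticon supplies the label, so no hand annotation is needed, and a predictor built from keyword counts is scored against those labels. The sketch below is an illustration only. The keyword lists and tweets are hypothetical, and a simple keyword-count rule stands in for a trained classifier.

```python
# Hypothetical keyword lists standing in for the positive/negative word lists.
POSITIVE = {"love", "great", "happy", "awesome"}
NEGATIVE = {"hate", "sad", "awful", "sick"}

def emoticon_label(tweet):
    """Distant-supervision label: the emoticon stands in for a human annotation."""
    has_pos, has_neg = ":)" in tweet, ":(" in tweet
    if has_pos == has_neg:  # neither or both: ambiguous, leave unlabeled
        return None
    return "positive" if has_pos else "negative"

def keyword_features(tweet):
    """Counts of positive and negative key words, with emoticons removed."""
    words = tweet.lower().replace(":)", " ").replace(":(", " ").split()
    words = [w.strip(".,!?") for w in words]
    return (sum(w in POSITIVE for w in words), sum(w in NEGATIVE for w in words))

def keyword_classify(tweet):
    """Predict sentiment from whichever keyword count is larger."""
    pos, neg = keyword_features(tweet)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"

tweets = [
    "I love this great phone :)",
    "So sad, I hate mondays :(",
    "Feeling sick today :(",
    "Just landed in Honolulu",
]
labeled = [(t, emoticon_label(t)) for t in tweets if emoticon_label(t)]
correct = sum(keyword_classify(t) == label for t, label in labeled)
print(f"{correct}/{len(labeled)} keyword predictions agree with the emoticon labels")
```

The emoticons are stripped before the features are computed so the predictor cannot simply read off its own label.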
Text Mining Processes: Overview Example 2 What is this? An area cartogram is a map in which some thematic mapping variable, such as travel time or GNP, is substituted for land area. The geometry or space of the map is distorted in order to convey the information of this alternate variable.
Text Mining Process:Overview Example 2 Pulse of the Nation: U.S. Mood throughout the Day Inferred from Twitter Analyzed 300M public tweets produced in the US from 9/2006-8/2009 and containing words from a psychological word-rating system (“Affective Norms for English Words”)  Through a natural language processing algorithm called Sentiment Analysis, each tweet was assigned a mood score based on the number of positive or negative words it contained. Calculated the average mood score of all the users living in a state hour by hour which formed the basis of a series of time-varying mood maps.
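The scoring and aggregation described above reduce to a word count followed by a grouped average. The sketch below is a simplification under stated assumptions: the two word sets are hypothetical placeholders for the ANEW ratings, the score is positive count minus negative count, and averaging is done per tweet rather than per user.

```python
from collections import defaultdict

# Hypothetical word lists; the study used the ANEW ratings, not reproduced here.
POSITIVE = {"happy", "love", "sunny"}
NEGATIVE = {"tired", "hate", "rain"}

def mood_score(text):
    """Positive-word count minus negative-word count for one tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def average_mood(tweets):
    """Average mood score per (state, hour) from (state, hour, text) records."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of scores, tweet count]
    for state, hour, text in tweets:
        bucket = totals[(state, hour)]
        bucket[0] += mood_score(text)
        bucket[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Hypothetical records: (state, hour of day, tweet text).
tweets = [
    ("HI", 9, "happy sunny morning"),
    ("HI", 9, "love this beach"),
    ("NY", 9, "tired of the rain"),
    ("NY", 18, "love the city but hate the traffic"),
]
for (state, hour), score in sorted(average_mood(tweets).items()):
    print(state, hour, score)
```

Each (state, hour) average would then drive the size or color of that state in one frame of a time-varying mood map.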
Text Mining Process:Overview Example 2
Text Mining Process:Establish the Corpus (2nd Example)

Contenu connexe

Tendances

Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and MiningDaniel JACOB
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
Text Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel LingText Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel Linglucenerevolution
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantinimaxfalc
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Tovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio CostantiniTovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio Costantinimaxfalc
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 

Tendances (20)

Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and Mining
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Data mining
Data miningData mining
Data mining
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
Text mining
Text miningText mining
Text mining
 
Text Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel LingText Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel Ling
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Tesxt mining
Tesxt miningTesxt mining
Tesxt mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Text mining
Text miningText mining
Text mining
 
A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Tovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio CostantiniTovek Presentation 2 by Livio Costantini
Tovek Presentation 2 by Livio Costantini
 
02 Data Mining
02 Data Mining02 Data Mining
02 Data Mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 

En vedette

моите красавици,
моите  красавици,моите  красавици,
моите красавици,Zezka Rangelova
 
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010WestEnd Prepare
 
0.9پاورپینت فهرست وار کتاب برکت نسخه1
0.9پاورپینت فهرست وار کتاب برکت نسخه10.9پاورپینت فهرست وار کتاب برکت نسخه1
0.9پاورپینت فهرست وار کتاب برکت نسخه1javadrabbani
 
Accelerate Journey To The Cloud
Accelerate Journey To The CloudAccelerate Journey To The Cloud
Accelerate Journey To The CloudMark Treweeke
 
1569 W6th Ave, Technical/Design Comment, S. Bohus
1569 W6th Ave, Technical/Design Comment, S. Bohus1569 W6th Ave, Technical/Design Comment, S. Bohus
1569 W6th Ave, Technical/Design Comment, S. BohusWestEnd Prepare
 
Presentation congnghegiay.com final
Presentation congnghegiay.com   finalPresentation congnghegiay.com   final
Presentation congnghegiay.com finalNhan Vo Trong
 
Service Oriented Architecture
Service Oriented ArchitectureService Oriented Architecture
Service Oriented ArchitectureLuqman Shareef
 
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...Shirley Ayres
 
KB Gold Presentation
KB Gold PresentationKB Gold Presentation
KB Gold Presentationbaaustin
 
Sample Skull Casts
Sample Skull CastsSample Skull Casts
Sample Skull Castsstralow
 
2 verktyg för minskad stress
2 verktyg för minskad stress2 verktyg för minskad stress
2 verktyg för minskad stressmartingrunwald
 
Guidelines to avoid kulula Sky trademark infringement
Guidelines to avoid kulula Sky trademark infringementGuidelines to avoid kulula Sky trademark infringement
Guidelines to avoid kulula Sky trademark infringementBlogatize.net
 
структура эумк
структура эумкструктура эумк
структура эумкfarcrys
 
Bhajan Bhaktvatsal Bhagwan
Bhajan   Bhaktvatsal BhagwanBhajan   Bhaktvatsal Bhagwan
Bhajan Bhaktvatsal BhagwanMool Chand
 
стратегическая цель гос политики ( измен)
стратегическая цель гос политики ( измен)стратегическая цель гос политики ( измен)
стратегическая цель гос политики ( измен)farcrys
 
Mining and analyzing social media hicss 45 tutorial – part 2
Mining and analyzing social media hicss 45 tutorial – part 2Mining and analyzing social media hicss 45 tutorial – part 2
Mining and analyzing social media hicss 45 tutorial – part 2Dave King
 

En vedette (20)

моите красавици,
моите  красавици,моите  красавици,
моите красавици,
 
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010
Proposed rezoning 1569 w-6th, ian adam, 19-oct-2010
 
StartupGreece Patras
StartupGreece PatrasStartupGreece Patras
StartupGreece Patras
 
0.9پاورپینت فهرست وار کتاب برکت نسخه1
0.9پاورپینت فهرست وار کتاب برکت نسخه10.9پاورپینت فهرست وار کتاب برکت نسخه1
0.9پاورپینت فهرست وار کتاب برکت نسخه1
 
Accelerate Journey To The Cloud
Accelerate Journey To The CloudAccelerate Journey To The Cloud
Accelerate Journey To The Cloud
 
1569 W6th Ave, Technical/Design Comment, S. Bohus
1569 W6th Ave, Technical/Design Comment, S. Bohus1569 W6th Ave, Technical/Design Comment, S. Bohus
1569 W6th Ave, Technical/Design Comment, S. Bohus
 
Presentation congnghegiay.com final
Presentation congnghegiay.com   finalPresentation congnghegiay.com   final
Presentation congnghegiay.com final
 
Service Oriented Architecture
Service Oriented ArchitectureService Oriented Architecture
Service Oriented Architecture
 
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...
NCB London Seminar GoL Presentation The Health Of Looked after Children Febru...
 
KB Gold Presentation
KB Gold PresentationKB Gold Presentation
KB Gold Presentation
 
Sample Skull Casts
Sample Skull CastsSample Skull Casts
Sample Skull Casts
 
2 verktyg för minskad stress
2 verktyg för minskad stress2 verktyg för minskad stress
2 verktyg för minskad stress
 
Guidelines to avoid kulula Sky trademark infringement
Guidelines to avoid kulula Sky trademark infringementGuidelines to avoid kulula Sky trademark infringement
Guidelines to avoid kulula Sky trademark infringement
 
Untitled Presentation
Untitled PresentationUntitled Presentation
Untitled Presentation
 
структура эумк
структура эумкструктура эумк
структура эумк
 
Bhajan Bhaktvatsal Bhagwan
Bhajan   Bhaktvatsal BhagwanBhajan   Bhaktvatsal Bhagwan
Bhajan Bhaktvatsal Bhagwan
 
стратегическая цель гос политики ( измен)
стратегическая цель гос политики ( измен)стратегическая цель гос политики ( измен)
стратегическая цель гос политики ( измен)
 
Apply for Graduate Schemes - Strategies for success - Network Rail
Apply for Graduate Schemes - Strategies for success - Network Rail   Apply for Graduate Schemes - Strategies for success - Network Rail
Apply for Graduate Schemes - Strategies for success - Network Rail
 
Mining and analyzing social media hicss 45 tutorial – part 2
Mining and analyzing social media hicss 45 tutorial – part 2Mining and analyzing social media hicss 45 tutorial – part 2
Mining and analyzing social media hicss 45 tutorial – part 2
 
Pik's portfolio2011
Pik's portfolio2011Pik's portfolio2011
Pik's portfolio2011
 

Text mining and analytics v6 - p1

  • 1. Tutorial: Text Data Mining and Analytics HICSS 44 – January 2011 Dave King
  • 2. Welcome to one of the HICSS SWoTs
  • 3. Difference between a Symposium & a Tutorial at HICSS Symposium Audience M:M Tutorial 1:M
  • 4. Difference between a Symposium & a Tutorial at HICSS Wv(t + 1) = Wv(t) + Θ (v, t) α(t)(D(t) - Wv(t))
  • 5. Agenda Part 1: Growing Interest in Analytics Overview of Text Mining and Analysis General Text Mining and Analysis Processes Part 2: Classification and Categorization Clustering Information Extraction Overview of Tools & Packages
  • 6. This is the only note you’ll need to take Presentation can be found at: www.slideshare.net
  • 7. Biography: Dave King Currently, EVP of Product Development and Management at JDA Software 28 years in enterprise package software business 15 years as university professor 12 years as Co-Chair of the Internet & Digital Economy Track (HICSS) Long time interest in various aspects of E-Commerce & Business Intelligence Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.
  • 8. Personal Experiences with Analytics Taught applied statistics and math modeling In software R&D Optimization in the 80s Natural Language Frontends NLI Query & CMU Robotics Lab EIS Competitive Analysis Dow Jones and Reuters Verity Topics NewsAlert InXight’s Hyperbolic Tree Often the audience has been small, sometimes bewildered, and often fleeting
  • 9. If I have seen further it is only by … plagiarizing the works of others.
  • 10. Text Mining & Analysis Resources: Books
  • 11. Text Mining & Analysis Resources: Books
  • 12. Text Mining & Analysis Resources: Web Sites & Sources TM/Blog -- blogs.sas.com/text-mining TM/Blog -- texttechnologies.com TM/Blog -- lingpipe-blog.com TM & Analytics /Blog -- intelligent-enterprise.informationweek.com/movabletype/blog/sgrimes.html TM/Wiki -- textanalytics.wikidot.com TA/General -- social.textanalyticsnews.com TA/General -- textanalysis.info TA/General -- klariti.com/text-mining/index.shtml TM & DM/Online Book -- statsoft.com/textbook/text-mining/ TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html TA Tutorial -- slideshare.net/SethGrimes/text-analytics-for-dummies-2010 TM Tutorial -- www.esi.uem.es/~jmgomez/tutorials/ecmlpkdd02 TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial
  • 13. Text Mining & Analysis Resources: Associated Web Sites & Sources DM/Blog -- datamining.typepad.com DM/Blog -- abbottanalytics.blogspot.com DM/Blog -- bx.businessweek.com/data-mining/blogs DM/Blog -- blog.data-miners.com DM/Blog -- datawrangling.com DM/Blog -- bytemining.com DM/Blog -- marktab.net/datamining DM/Blog -- dataminingblog.com DM/Blog -- timmanns.blogspot.com DM/General -- kdnuggets.com DM/General -- mydatamine.com DM/General -- the-data-mine.com DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm DM/Tutorial -- autonlab.org/tutorials/
  • 14. Initial Question: What search terms are graphed?
  • 15. Interest in Analytics:Growing Awareness Source: Google Trends Analytics – “Extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact-based management to drive decisions and actions…a subset of what has come to be called BI.” (Davenport and Harris, Competing on Analytics, HBS, 2007)
  • 16. Interest in Analytics:Theory and Practice Data Mining Optimization In theory, there is no difference between theory and practice. But, in practice, there is.
  • 18. Interest in Analytics:Potential Reasons for the Interest Next generation DSS: Progression of DSS->EIS->BI->PM->Analytics Increasing volumes of data requiring new approaches or modifications in existing approaches Focus on CRM and Supply Chains … General belief that more sophisticated analysis is required to compete in today’s environments …
  • 19. Interest in Text Mining & Analytics: An old adage George Mallory, asked “Why did you want to climb Mount Everest?” (in a 1923 interview), replied, “Because it’s there.”
  • 20. Interest in Text Mining & Analytics: The 80% Rule Unstructured (Textual) 80% Structured (Databases) 20% “It's a truism that 80 percent of business-relevant information originates in unstructured form, primarily text… The 80 percent unstructured figure comes from, well, everywhere.” Source: Seth Grimes, Unstructured Data and the 80 Percent Rule
  • 21. Text Mining and Analytics: Definitions General: All types of text processing that deal with finding, organizing and analyzing textual (unstructured) information. Formal: Utilizing data mining techniques to create new information that is not obvious in a collection of documents (implies that Text Analytics ~ Text Mining ~ Text Data Mining)
  • 22. Text Mining and Analytics: Types of Processing and Techniques Clustering. Grouping similar documents without having a predefined set of categories. Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes. Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching. Named-entity recognition. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons). Concept linking and topic tracking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods. Summarization. Summarizing a document to save time on the part of the reader.
  • 23. Text Mining and Analytics:Sample Application Areas Seth Grimes Papers
  • 24. Text Mining:A Common Issue George Herbert, Welsh Poet & Priest A great dowry is a bed full of brambles. Outlandish Proverbs, 1640 Structured data mining is a bed of roses when compared to unstructured, textual mining which is a bed of brambles
  • 25. Data Mining: Simple Example (Affinity Analysis) Study of attributes or characteristics that “go together.” Seek to uncover “association rules” that quantify the relationship between two or more attributes. Rules take the form of “If antecedent, then consequent” Examples: Market basket analysis to determine which items are purchased together (in single transaction) Web analysis to determine which sequences of pages users visit Major issue is number of potential combinations as the number of attributes increases
  • 26. Data Mining: Simple Example (Affinity Analysis) 1. Market Basket Analysis: Items for Sale: Apples Bananas Cherries Durians 2. Possible Transactions: With one item or a collection of items selected as the Driver or Independent Variable 3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.
  • 27. Data Mining: Simple Example (Affinity Analysis) Standard Market Basket Measures: Support = N(X & Y) / N(T) Example: N(A & B) / N(T) = 2/7 = 29% Confidence = N(X & Y) / N(X) Example: N(A & B) / N(A) = 2/4 = 50% Where N(T) = number of transactions, N(X) = number of transactions containing X, and N(X & Y) = number of transactions containing both X and Y
  • 28. Data Mining: Simple Example (Affinity Analysis)
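The support and confidence measures from the affinity analysis slides can be checked with a few lines of Python. This is a minimal sketch: the seven transactions below are hypothetical (the slide's transaction table is not reproduced in this transcript) and were chosen only so that N(T) = 7, N(A) = 4 and N(A & B) = 2, matching the slide's arithmetic. The function names are illustrative.

```python
# Hypothetical transactions (not from the slides), chosen so that
# N(T) = 7, N(A) = 4 and N(A & B) = 2, as in the slide's example.
transactions = [
    {"apples", "bananas"},
    {"apples", "bananas", "cherries"},
    {"apples", "cherries"},
    {"apples"},
    {"bananas", "durians"},
    {"cherries"},
    {"cherries", "durians"},
]


def count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(x, y, transactions):
    """Support of the rule X -> Y: N(X & Y) / N(T)."""
    return count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: N(X & Y) / N(X)."""
    return count(x | y, transactions) / count(x, transactions)


a, b = {"apples"}, {"bananas"}
print(f"support(A -> B)    = {support(a, b, transactions):.0%}")
print(f"confidence(A -> B) = {confidence(a, b, transactions):.0%}")
```

With these transactions the rule "apples -> bananas" has support 2/7 and confidence 2/4, the same figures as on the slide.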
  • 29. Data Mining: General Data Assumptions Requires structured data (numbers and categories well-defined) Transformed by data preparation or collected with a prior design in mind Typically housed and organized in a relational database, data mart or data warehouse
  • 30. Data Mining: Simple Example But, what if the baskets were described in the following manner: Jane bought a handful of maraschinos and a couple of granny smiths. Harold purchased a bag of appls and 2 bananas. Bill paid for a pound of cherries but decided not to buy the three durians because of their odor. How could we automate the analysis?
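One naive way to approach the question on this slide is to map surface forms back to the four canonical items. The sketch below is an illustration only, not the tutorial's method: the lexicon is hypothetical, and fuzzy matching via the standard library's `difflib.get_close_matches` is an assumed stand-in for real spelling correction. It also shows why the problem is hard: the approach ignores negation.

```python
import difflib
import re

# Hypothetical lexicon: surface forms mapped to the four canonical items.
LEXICON = {
    "apple": "apples", "apples": "apples", "granny smiths": "apples",
    "banana": "bananas", "bananas": "bananas",
    "cherry": "cherries", "cherries": "cherries", "maraschinos": "cherries",
    "durian": "durians", "durians": "durians",
}


def extract_items(sentence, cutoff=0.8):
    """Return the canonical items mentioned in a sentence (naive matching)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    found = set()
    for phrase in words + bigrams:
        if phrase in LEXICON:
            found.add(LEXICON[phrase])
        elif " " not in phrase:
            # Fuzzy fallback for misspellings such as "appls".
            close = difflib.get_close_matches(phrase, list(LEXICON), n=1, cutoff=cutoff)
            if close:
                found.add(LEXICON[close[0]])
    return found


baskets = [
    "Jane bought a handful of maraschinos and a couple of granny smiths.",
    "Harold purchased a bag of appls and 2 bananas.",
    "Bill paid for a pound of cherries but decided not to buy the three "
    "durians because of their odor.",
]
for sentence in baskets:
    print(sorted(extract_items(sentence)))
```

The third basket comes back with durians even though Bill did not buy them. Synonyms and misspellings can be handled with a lexicon, but negation and quantities need deeper analysis of the sentence.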
  • 31. Data Mining: CRISP-DM Real-World Data Data Consolidation Data Cleaning Business Understanding Data Understanding Data Preparation Deployment Data Transformation Data Reduction Modeling Evaluation Well-Formed Data Cross-Industry Standard Process for Data Mining
  • 32. Text Mining: CRISP-Like Processes Real-World Text Data Document Consolidation Establish the Corpus Business Understanding Document Understanding Document Preparation Deployment Corpus Refinement (Token, Stem, Stop…) Feature Selection & Weighting Documents Modeling Evaluation Term-Doc-Matrix* * - Entity-Relationships
  • 33. Text Mining Process:Establish the Corpus First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied Range of possibilities: word documents, PDFs, emails, IM chat, Web pages, RSS Feeds, Blogs, Tweets, Open ended surveys, Transcripts of Helpline calls … Convert into organized set of texts – called a corpus – standardized and prepared for the purpose of knowledge discovery.
  • 34. Text Mining Process: Establish the Corpus Brown Corpus -- first million-word corpus, compiled in the 60s at Brown U., 500 samples across 15 genres, each ~2000 words with POS tags Linguistic Data Consortium Treebanks -- collections of manually tagged and parsed sentences (tree structures) from a variety of sources (includes the well-known Penn Treebank collection) Reuters 21578, RCV1 & V2 -- collections of thousands of Reuters English and multilingual news stories classified into topics and grouped into training & test sets Pang & Lee’s Sentiment Analysis -- 1000 positive and 1000 negative movie reviews MEDLINE -- an extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications WordNet® -- large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations. Google Ngram -- 500 billion words from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Russian, and Chinese.
  • 35. Text Mining Process:Establishing the Corpus (Brown) Sample Tagged Entry The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
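The word/tag format of the Brown sample is simple to read programmatically. The sketch below splits each token on its last slash and then picks out the common nouns, assuming that Brown tags beginning with "nn" (nn, nns, nn-tl) mark common nouns. The variable names are illustrative.

```python
# The tagged sentence from the slide, split across lines for readability.
tagged = (
    "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
    "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
    "primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' "
    "that/cs any/dti irregularities/nns took/vbd place/nn ./."
)

# Split on the LAST slash so that tokens such as "./." are handled correctly.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]

# Assumption: tags starting with "nn" are common nouns in the Brown tagset.
nouns = [word for word, tag in pairs if tag.startswith("nn")]

print(len(pairs), "tokens")
print(nouns)
```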
  • 36. Text Mining Process:Establishing the Corpus (Penn Treebank) .START Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Raw [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./. Tagged ( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) (VP will (VP join (NP the board) (PP-CLR as (NP a nonexecutive director)) (NP-TMP Nov. 29))) .)) Parsed
  • 37. Text Mining Process: Establishing the Corpus (Reuters) 14826 ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products. But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain. The U.S. Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost. Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes.
  • 38. Text Mining Process:Establish the Corpus (Google NGrams) http://ngrams.googlelabs.com
  • 39. Text Mining Process:Establish the Corpus (Google NGrams) Source: Google NGram
  • 40. Text Mining Process: Establish the Corpus (Google NGrams) 8,500 new words a year, 70% growth from 1950-2000, 50%+ of English lexicon is "dark matter." We’re forgetting our past faster with each passing year (tracking the references to the numerical years). Innovations spread faster than ever. Modern celebrities are younger and more famous than predecessors, but their fame is shorter-lived. Culturomics is a powerful tool for automatically identifying censorship and propaganda (e.g., Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold). "Freud" is more deeply engrained in our collective subconscious than "Galileo," "Darwin," or "Einstein." “Quantitative Analysis of Culture Using Millions of Digitized Books” Science Magazine, Dec. 18, 2010
  • 41. Text Mining Process: Corpus Refinement Goal: a common representation of tokens within and between documents. Steps: Tokenization, Normalize, Eliminate Stop Words, Stemming. Tokenization: parse the text to generate terms; sophisticated analyzers can also extract phrases from the text. Normalize: convert terms to lowercase. Eliminate stop words: remove terms that appear very often (e.g. the, and, …). Stemming: convert terms to their stemmed form by removing plurals and other word forms (e.g. achieve, achieves, achieved → achiev). [Note: a word about synonyms; see WordNet synsets.]
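The four refinement steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical stand-in, and `toy_stem` is a made-up suffix stripper that covers the slide's "achieve" example; it is not the Porter algorithm or any library's stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "that"}

# Toy suffix rules, NOT a real stemming algorithm.
SUFFIXES = ("es", "ed", "s", "e")


def tokenize(text):
    """Split text into word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [toy_stem(t) for t in tokens]


print(refine("The team achieves and achieved what others achieve"))
```

The order of the steps follows the slide: stop words are removed after lowercasing so that "The" and "the" are treated alike.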
  • 42. Text Mining: Feature Extraction & Weighting Feature Extraction “Bag of Words, Terms or Tokens” Vector Representation: Word, Term or Token/Doc Matrix Words or Tokens are attributes and documents are examples
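A bag-of-words representation can be sketched as a small document-term matrix: one row per document (the examples) and one column per term (the attributes). The three documents below are invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus; the names d1..d3 are placeholders.
docs = {
    "d1": "i feel happy today",
    "d2": "i feel sad and i feel tired",
    "d3": "happy happy day",
}


def document_term_matrix(docs):
    """Return (vocabulary, {doc_name: row of raw term counts})."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {name: [count[term] for term in vocab] for name, count in counts.items()}
    return vocab, rows


vocab, rows = document_term_matrix(docs)
print(vocab)
for name, row in rows.items():
    print(name, row)
```

Word order is discarded on purpose: each document is reduced to its term counts, which is what "bag of words" means.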
  • 43. Text Mining: Transforming Frequencies Binary Frequencies: 1 if tf > 0; otherwise 0. Term Frequencies: tf(i,j) divided by the sum of tf(i,j) over all terms i in document j. Log Frequencies: 1 + log(tf) if tf > 0; otherwise 0. Normalized Frequencies: divide each frequency by the square root of the sum of squares of the frequencies within the vector (column). Term Frequency–Inverse Document Frequency: TF * IDF, where Inverse Document Frequency = log(N/(1+D)), N is the total number of documents, and D is the number of documents containing the term.
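The transformations listed on this slide can be sketched as small functions. Assumptions in this sketch: the logarithm is the natural log (the slide does not name a base), and the matrix is stored with one row per document and one column per term.

```python
import math


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def relative_tf(row):
    """Each count divided by the total count in the same document."""
    total = sum(row)
    return [tf / total if total else 0.0 for tf in row]


def log_tf(tf):
    """1 + log(tf) for tf > 0, otherwise 0 (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def l2_normalize(vector):
    """Divide each entry by the square root of the sum of squares."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm if norm else 0.0 for x in vector]


def idf(n_docs, doc_freq):
    """Inverse document frequency as written on the slide: log(N / (1 + D))."""
    return math.log(n_docs / (1 + doc_freq))


def tf_idf(matrix):
    """Weight raw counts by IDF; rows are documents, columns are terms."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weights = [idf(n_docs, d) for d in doc_freq]
    return [[row[j] * weights[j] for j in range(n_terms)] for row in matrix]


# Hypothetical counts: 4 documents, 3 terms.
matrix = [[2, 0, 1], [0, 1, 1], [0, 0, 3], [1, 0, 0]]
weighted = tf_idf(matrix)
print(weighted)
```

With this form of IDF, a term that appears in every document gets a slightly negative weight, because 1 + D exceeds N.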
  • 44. Text Mining Processes: Simple Overview Example We Feel Fine (wefeelfine.org) scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day). It scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts each sentence, and checks whether it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way. The URL format of many blog posts can be used to extract the username of the post's author, which in turn is used to look up the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, the local weather conditions for that city at the time the post was written can then be retrieved. As much of this information as possible is extracted and saved along with the post.
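The sentence-scanning step described above can be sketched as follows. This is a guess at the general shape, not the site's actual code: the `FEELINGS` set is a hypothetical five-word stand-in for the roughly 5,000 pre-identified feelings, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical stand-in for the list of pre-identified feelings.
FEELINGS = {"happy", "sad", "asleep", "young", "depressed"}

# Matches the two trigger phrases named on the slide.
TRIGGER = re.compile(r"\bi (?:feel|am feeling)\b", re.IGNORECASE)


def split_sentences(post):
    """Naive sentence splitter on ., ! and ?."""
    return [s.strip() for s in re.split(r"[.!?]+", post) if s.strip()]


def extract_feelings(post):
    """Return (feeling, sentence) pairs for sentences with a trigger phrase."""
    results = []
    for sentence in split_sentences(post):
        if not TRIGGER.search(sentence):
            continue
        words = re.findall(r"[a-z']+", sentence.lower())
        feeling = next((w for w in words if w in FEELINGS), None)
        if feeling is not None:
            results.append((feeling, sentence))
    return results


post = "Went to the store. I feel too young to have her. I am feeling fine! I feel asleep"
print(extract_feelings(post))
```

Sentences with a trigger phrase but no listed feeling (here, "I am feeling fine") are dropped, mirroring the "valid feeling" check in the description.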
  • 45. Text Mining Processes: Simple Overview Example API Query from wefeelfine.org: http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=imageid,feeling,sentence,posttime,postdate,posturl,gender,born,country,state,city,lat,lon,conditions&limit=500 Result from Query: <?xml version="1.0" ?> <feelings> <feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" /> Source: www.wefeelfine.org/api.html
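A response in this shape can be read with Python's standard XML parser. The sketch below works on an offline copy of the single record shown on the slide; the closing `</feelings>` tag is an assumption added so the sample is well-formed, and no request is made to the API.

```python
import xml.etree.ElementTree as ET

# Offline sample based on the slide; the closing tag is assumed.
SAMPLE = """<?xml version="1.0" ?>
<feelings>
  <feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" />
</feelings>"""


def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("feeling")]


records = parse_feelings(SAMPLE)
for record in records:
    # Fields that were requested but not returned (e.g. city) are simply absent.
    print(record["feeling"], "|", record["sentence"], "|", record.get("city"))
```

Using `.get()` for optional fields matters here because the sample record carries only some of the requested return fields.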
  • 46. Text Mining Processes:Simple Overview Example i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one i went to mcd with an idiot which is having the same feeling as me now i feel asleep i feel about little red shoes and mittens i feel the sands of time moving so quickly in my life it seems i feel too young to have her this beauty across from me i feel like im waiting for something profound or inspirational to hit me …
  • 47. Text Mining Processes: Simple Overview Example Input String (43743 chars; 8245 spaces) "i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one i went to mcd with an idiot which is having the same feeling as me now i'll feel bad bout it and so i feel asleep…" Tokenize (9019 tokens) ['i', "'m", 'blinded', 'to', 'other', 'santas', 'because', 'this', 'was', 'my', 'first', 'but', 'i', 'ca', "n't", 'help', 'feeling', 'that', 'there', 'ca', "n't", 'be', 'a', 'better', 'one', 'i', 'went', 'to', 'mcd', 'with', 'an', 'idiot', 'which', 'is', 'having', 'the', 'same', 'feeling', 'as', 'me', 'now', 'i', "'ll", 'feel', 'bad', 'bout', 'it', 'and', 'so', 'i', 'feel', 'asleep', …] Set of Tokens (1816 distinct tokens) ["'", "'bout", "'cleaner", "'d", "'http", "'i", "'ll", "'m", "'re", "'s", "'ve", '000', '039', '097', '1', '100', '101', '102', '104', '105', '108', '111', '114', '115', '116', '118', '11am', '12', '121', '15', '16', '180', '1998', '1st', '2', '2013', '23', '2nd', '3', '30', '78', '9', ':', 'a', 'ab', 'abit', 'able', 'about', 'above', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', …]
  • 48. Text Mining Processes:Simple Overview Example
  • 49. Text Mining Process: Simple Overview Example Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …) Content (4390 or 49% of tokens not stopwords – 4053 with tokens starting with apostrophes and #s eliminated) Set of tokens (1651) with stopwords eliminated ['ab', 'abit', 'able', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', 'add', …] Stemming Stemmed tokens (4053) ['ab', 'abit', 'abl', 'ab', 'absolut', 'absolut', 'absorb', 'abus', 'accomplish', 'accomplish', 'achiev', 'achiev', 'across', 'act', 'action', 'activ', 'activ', 'actual', 'acura', 'add', …] Set of tokens in stemmed content (1388) ['ab', 'abit', 'abl', 'absolut', 'absorb', 'abus', 'accomplish', 'achiev', 'across', 'act', 'action', 'activ', 'actual', 'acura', 'ad', 'add', …]
  • 50. Text Mining Processes:Simple Overview Example
  • 51. Text Mining Process: Simple Overview Example Document-Term Matrix
  • 52. Text Mining Process:Establish the Corpus (Simple Example) Madness Murmerings Montage Mounds Metrics Mobs
  • 53. Text Mining Processes:Overview Example 2 Question: What is this? Answer: This is Twitter on steroids.
  • 54. Text Mining Process: Overview Example 2 Twitter Statistics: ~106M registered users. 300K new users per day. 180 million unique visitors per month. 75% of traffic from 3rd-party apps. Average of 55 million tweets a day. 600 million search queries per day. 37% use their phone to tweet. 60% of tweets from 3rd-party apps. Based on 1+B tweets generated by over 20 million Twitter users in 2010 (bio, web site, loc info). Source: huffingtonpost.com/2010/04/14/twitter-user-statistics-r_n_537992.html
  • 55. Text Mining Process: Overview Example 2 Each tweet <= 140 characters (avg. 10-15 words/message). Heavy presence of non-alpha symbols, abbreviations, misspellings and slang. Tweets often include retweets (original tweet repeated). In spite of this, tweets have proven to be an interesting text mining resource (e.g. see lifeanalytics.blogspot.com & mashable.com/author/dan-zarrella/).
  • 56. Text Mining Process: Overview Example 2 Twitter gets a total of 3 billion requests a day via its API. API Calls for Public Tweets: http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1 http://api.twitter.com/1/trends/current.json?exclude=hashtags
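The search URL on this slide is a percent-encoded query string, and building one programmatically avoids hand-encoding mistakes. The sketch below only constructs the string with the standard library; it does not call the endpoint, which is the 2010-era address from the slide and is not assumed to still respond. The helper name `build_search_url` is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as shown on the slide (historical; not contacted here).
BASE = "http://search.twitter.com/search.json"


def build_search_url(query, rpp=100, page=1):
    """Encode the search terms and paging parameters into a URL."""
    return BASE + "?" + urlencode({"q": query, "rpp": rpp, "page": page})


url = build_search_url(":) feel feeling")
print(url)

# Decoding the query string recovers the original search terms.
print(parse_qs(urlsplit(url).query)["q"])
```

`urlencode` also escapes the closing parenthesis as `%29`, so the result differs cosmetically from the slide's `%3A)` but decodes to the same query.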
  • 57. Text Mining Process: Overview Example 2 u'iso_language_code': u'en', u'to_user_id_str': None, u'text': u"RT @EverSoSassy56 &lt;--- I'm sportin' my glasses... I feel all sophisticated and stuff. :-) -- And the operative word is feeling...LOL", u'from_user_id_str': u'168852471', u'profile_image_url': u'http://a0.twimg.com/profile_images/1166685224/Jonise_normal.jpg', u'id': 16300313380130816L, u'source': u'&lt;a href=&quot;http://twidroid.com&quot; rel=&quot;nofollow&quot;&gt;twidroid&lt;/a&gt;', u'id_str': u'16300313380130816', u'from_user': u'XXXXXXXXXX', u'from_user_id': 168852471, u'to_user_id': None, u'geo': None, u'created_at': u'Sun, 19 Dec 2010 01:14:32 +0000', u'metadata': {u'result_type': u'recent'}
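Before mining a record like this one, the HTML entities need decoding and a few fields are worth normalizing. The sketch below uses a trimmed copy of the record from the slide; the `summarize` helper and the "starts with RT @" retweet test are illustrative assumptions, not part of any Twitter API.

```python
import html
import re
from email.utils import parsedate_to_datetime

# Trimmed copy of the search result shown on the slide.
tweet = {
    "iso_language_code": "en",
    "text": "RT @EverSoSassy56 &lt;--- I'm sportin' my glasses... I feel all "
            "sophisticated and stuff. :-) -- And the operative word is feeling...LOL",
    "source": "&lt;a href=&quot;http://twidroid.com&quot; "
              "rel=&quot;nofollow&quot;&gt;twidroid&lt;/a&gt;",
    "created_at": "Sun, 19 Dec 2010 01:14:32 +0000",
    "geo": None,
}


def summarize(record):
    """Decode entities, flag retweets, and parse the client name and timestamp."""
    text = html.unescape(record["text"])
    source_html = html.unescape(record["source"])
    client = re.sub(r"<[^>]+>", "", source_html)  # strip the <a ...> wrapper
    return {
        "text": text,
        "is_retweet": text.startswith("RT @"),  # simple heuristic
        "client": client,
        "created": parsedate_to_datetime(record["created_at"]),
    }


summary = summarize(tweet)
print(summary["is_retweet"], summary["client"], summary["created"].isoformat())
```

Flagging retweets early makes it easy to drop repeated text before counting tokens.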
  • 58. Text Mining Process:Establish the Corpus (2nd Example) Happy Face  Sad Face  Tokens = 14670 Set of Tokens= 2289 avg./Sent = 24 lex. div. = 6.4 Non-Stop words = 10406 Set Non-Stop = 2117 Stems = 5003 Set of Stems = 1052 w/o Feel = 3921 Set w/o Feel = 1051
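The "lex. div." figure appears to be lexical diversity computed as total tokens divided by distinct tokens; that reading is an assumption, but it matches the numbers on the slide (14670 / 2289). A minimal sketch:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


# Small invented example: 8 tokens, 4 distinct.
sample_tokens = "i feel happy i feel sad i feel".split()
print(lexical_diversity(sample_tokens))

# Reproducing the slide's figure from its reported counts.
slide_value = round(14670 / 2289, 1)
print(slide_value)  # 6.4
```

Under this definition a higher value means more repetition, so each distinct token in the tweet sample is used about 6.4 times on average.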
  • 59. Text Mining Process: Overview Example 2 “Twitter Sentiment Classification using Distant Supervision”: Uses the presence of the emoticons “:)” and “:(” as surrogate labels for positive and negative sentiment statements. To construct the term-document matrix, relies on a list of positive and negative key words from Twittratr, counting the number of key words that appear in each tweet. 180K tweets collected for training purposes between April and June 2009. 80%+ accuracy in classification.
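The distant-supervision idea can be sketched as: let the emoticon supply the label, remove it from the text, and count key words as features. This is not the paper's classifier; the key-word sets below are tiny hypothetical stand-ins for the Twittratr lists, and all function names are invented for illustration.

```python
import re

# Hypothetical stand-ins for the positive and negative key-word lists.
POSITIVE_WORDS = {"happy", "great", "love"}
NEGATIVE_WORDS = {"sad", "awful", "hate"}


def distant_label(tweet):
    """Use ':)' or ':(' as a noisy label; skip tweets with none or both."""
    has_pos, has_neg = ":)" in tweet, ":(" in tweet
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet):
    """Remove the label-bearing emoticons so they cannot leak into features."""
    return tweet.replace(":)", " ").replace(":(", " ")


def keyword_features(tweet):
    """Count positive and negative key words in the tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return (sum(w in POSITIVE_WORDS for w in words),
            sum(w in NEGATIVE_WORDS for w in words))


def build_training_set(tweets):
    """Return (features, label) rows for tweets that carry a usable emoticon."""
    rows = []
    for tweet in tweets:
        label = distant_label(tweet)
        if label is not None:
            rows.append((keyword_features(strip_emoticons(tweet)), label))
    return rows


tweets = ["I love this great day :)", "I hate mondays :(",
          "no emoticon here", "mixed :) :("]
training = build_training_set(tweets)
print(training)
```

Stripping the emoticon before feature extraction matters: otherwise a classifier could learn the label directly from the symbol it was labeled with.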
  • 60. Text Mining Processes: Overview Example 2 What is this? An area cartogram is a map in which some thematic mapping variable, such as travel time or GNP, is substituted for land area. The geometry or space of the map is distorted in order to convey the information of this alternate variable.
  • 61. Text Mining Process: Overview Example 2 Pulse of the Nation: U.S. Mood throughout the Day Inferred from Twitter. Analyzed 300M public tweets produced in the US from 9/2006-8/2009 and containing words from a psychological word-rating system (“Affective Norms for English Words”). Through a natural language processing algorithm called Sentiment Analysis, each tweet was assigned a mood score based on the number of positive or negative words it contained. Calculated the average mood score of all the users living in a state, hour by hour, which formed the basis of a series of time-varying mood maps.
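The scoring and aggregation described here can be sketched as a word-lookup score per tweet followed by an average per (state, hour) bucket. Assumptions in this sketch: the word scores are invented +1/-1 values rather than actual ANEW ratings, the score is positive words minus negative words, and averaging is done per tweet rather than per user.

```python
import re
from collections import defaultdict

# Hypothetical word scores; NOT values from the ANEW word list.
WORD_SCORES = {"happy": 1, "love": 1, "sad": -1, "hate": -1}


def mood_score(text):
    """Positive-word count minus negative-word count for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_SCORES.get(w, 0) for w in words)


def average_mood(tweets):
    """Average score per (state, hour); tweets are (state, hour, text) tuples."""
    totals = defaultdict(lambda: [0, 0])  # (state, hour) -> [score sum, count]
    for state, hour, text in tweets:
        bucket = totals[(state, hour)]
        bucket[0] += mood_score(text)
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


sample = [
    ("CA", 9, "I love this happy morning"),
    ("CA", 9, "so sad"),
    ("NY", 9, "hate traffic"),
]
moods = average_mood(sample)
print(moods)
```

Each (state, hour) average would be one cell in a time-varying mood map of the kind the slide describes.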
  • 63. Text Mining Process:Establish the Corpus (2nd Example)

Editor's notes

  1. 5.2 million digitized books, about 4% of all books ever printed, published during the past 200 years. All told, about 129 million books have been published since the invention of the printing press. In 2004, Google software engineers began making electronic copies of them, and have about 15 million so far, comprising more than two trillion words in 400 languages. They currently include Chinese, English, French, German, Russian and Spanish books dating back to the year 1500, about 4% of all books published. The database doesn't include periodicals, which might reflect popular culture from a different vantage. The resulting corpus contains over 500 billion words, in English (361 billion), French (45B), Spanish (45B), German (37B), Chinese (13B), Russian (35B), and Hebrew (2B). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 60 million words per year; by 1900, 1.4 billion; and by 2000, 8 billion.