Text Analytics and R
        Open Question: A Good Match?



                    Marina Santini
                     (LinkedIn)
Research Scientist at SICS East Swedish ICT AB (Santa Anna)



     R useR MeetUp: Text analytics using R
          R useR group (StockholmR)
                   R useR MeetUp, 14 March 2013, 18:00
                               Stockholm
My Quest or… Why do I attend this meetup?

  – The Quest: finding the optimal way to handle Big
    Textual Data for Information Discovery
  – The Question: is R convenient for text analytics of
    Big TEXTUAL Data?
  – Mission: identification of pros, cons, limits,
    benefits …


• Current Status: investigation in progress…

Outline
• Big Data vs. Big TEXTUAL Data
• Text Analytics & NLP (Natural Language Processing)
• Statistics for linguistics with R by Stefan Th. Gries
• From Information Discovery to Actionable
  TEXTUAL Intelligence
• The Enron Challenge: Predictions and Crisis
  Intelligence

Big Data
• BIG DATA [Wikipedia]:
   – Big data usually includes data sets with sizes beyond the ability of
     commonly used software tools to capture, curate, manage, and
     process the data within a tolerable elapsed time. Big data sizes are a
     constantly moving target, as of 2012 ranging from a few dozen
     terabytes to many petabytes of data in a single data set. With this
     difficulty, new platforms of "big data" tools are being developed to
     handle various aspects of large quantities of data.

   – Examples include Big Science, web logs, RFID, sensor networks, social
     networks, social data (due to the social data revolution), Internet
     text and documents, Internet search indexing, call detail records,
     astronomy, atmospheric science, genomics, biogeochemical,
     biological, and other complex and often interdisciplinary scientific
     research, military surveillance, medical records, photography archives,
     video archives, and large-scale e-commerce.


R, Strata, Hadoop…?



Apparently many solutions are available on
the market…

Uhm… Big Data is a vague label…


Big Unstructured TEXTUAL Data

 Merrill Lynch is one of the world's leading financial management and
 advisory companies, providing financial advice and investment
 banking services.

 “Merrill Lynch estimates that more than 85 percent of all
   business information exists as unstructured data –commonly
   appearing in e‐mails, memos, notes from call centers and
   support operations, news, user groups, chats, reports, letters,
   surveys, white papers, marketing material, research,
   presentations and web pages.” [DM Review Magazine,
   February 2003 Issue]

  ECONOMIC LOSS!


          A plethora of diverse document genres!
Simple search is not enough…
• Of course, it is possible to use simple search. But
  simple search is unrewarding, because it is based on
  single terms.
   – “a search is made on the term felony. In a simple search,
     the term felony is used, and everywhere there is a
     reference to felony, a hit to an unstructured document is
     made. But a simple search is crude. It does not find
     references to crime, arson, murder, embezzlement,
     vehicular homicide, and such, even though these crimes
     are types of felonies” [Source: Inmon, B. & A. Nesavich,
     "Unstructured Textual Data in the Organization" from
     "Managing Unstructured Data in the Organization",
     Prentice Hall 2008, pp. 1–13]
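
The felony example can be made concrete with a minimal sketch, written in Python for illustration. The taxonomy, the function names and the sample documents are all assumptions made up for this example; a real system would rely on a curated thesaurus or ontology rather than a hand-written dictionary.

```python
import re

# Hypothetical mini-taxonomy: the concept "felony" and some of the subtypes
# named in the quoted example. A real system would use a curated thesaurus.
TAXONOMY = {
    "felony": ["felony", "arson", "murder", "embezzlement", "vehicular homicide"],
}

def simple_search(term, documents):
    """Return indices of documents that contain the literal term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]

def expanded_search(term, documents, taxonomy=TAXONOMY):
    """Return indices of documents that contain the term or any listed subtype."""
    variants = taxonomy.get(term.lower(), [term])
    hits = set()
    for variant in variants:
        hits.update(simple_search(variant, documents))
    return sorted(hits)

# Invented sample documents.
documents = [
    "The suspect was charged with a felony.",
    "Investigators believe the fire was arson.",
    "The accountant was convicted of embezzlement.",
    "The meeting was moved to Friday.",
]

print(simple_search("felony", documents))    # [0]
print(expanded_search("felony", documents))  # [0, 1, 2]
```

The simple search only finds the document that uses the word "felony" itself; the expanded search also finds the arson and embezzlement documents.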


Textual Documents and Document Genres




Definition: Text Analytics
• A set of NLP techniques that provide some
  structure to textual documents and help
  identify and extract important information.




Set of NLP techniques
• Common components of a text analytic
  package are:
  – Tokenization
  – Morphological Analysis
  – Syntactic Analysis
  – Named Entity Recognition
  – Sentiment Analysis
  – Automatic Summarization
  – Etc.
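
As an illustration of the first component in the list, tokenization, here is a minimal sketch in Python. The token rule is an assumption chosen for this example, not a standard; tokenizers in real packages handle many more cases (URLs, abbreviations, emoticons and so on).

```python
import re

# One simple tokenization rule (an assumption, not a standard): a token is a run
# of word characters, optionally with internal apostrophes or hyphens, or a
# single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Simple search isn't enough."))
# ['Simple', 'search', "isn't", 'enough', '.']
```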

NLP at Coursera




NLP is pervasive
              Ex: spell-checkers

•   Google Search
•   Google Mail
•   Facebook
•   Office Word
•   […]


NLP is pervasive
        Ex: Named Entity Recognition
• Opinion
  mining
• Brand Trends
• Conversation
  clouds on web
  magazines and
  online
  newspapers…

Sentiment Analysis




Text Analytics Products and Frameworks
• Commercial Products:                 Open Source Frameworks:
  –   Attensity                        •    GATE
  –   Clarabridge                      •    NLTK
  –   Temis                            •    UIMA
  –   Lexalytics                       •    etc.
  –   Texify
  –   SAS
  –   SPSS
  –   IBM Cognos
  –   etc.
However… (I)
• NLP tools and applications (both commercial
  and open source) are not perfect. Research is
  still very active in all NLP subfields.




Ex: Syntactic Parser
• Connexor




• What about parsing a tweet?
• “My son, 6y/o, asked me for the first time today how
  my DAY was . . . I about melted. Told him that I had
  pizza for lunch. Response? No fair” (Twitter Tutorial 1:
  How to Tweet Well)

Why are NLP and Text Analytics
 important for Information Discovery?
• Why is it important to know that a word is a noun, or a
  verb or the name of a brand?

• Broadly speaking:
• Nouns and verbs: Nouns are important for topic
  detection; verbs are important if you want to identify
  actions or intentions.
• Adjectives = sentiment identification.
• Function words (a.k.a. stop words) are important for
  authorship attribution, plagiarism detection, etc.
• etc.
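
The point about function words can be sketched as follows, in Python for illustration. The word list is a tiny assumption made for this example (stylometric studies use far longer lists), and comparing the resulting profiles between authors is left out.

```python
import re
from collections import Counter

# A tiny illustrative list of English function words (assumption: a real
# study would use a much longer list).
FUNCTION_WORDS = ["the", "of", "and", "a", "in", "to", "is", "that"]

def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in function_words}
    return {word: counts[word] / total for word in function_words}

profile = function_word_profile("The cat sat in the hat and the dog sat in the sun.")
print(profile["the"])  # 4 of the 13 tokens are "the"
```

Two texts by the same author would be expected to show similar profiles, which is why these words are kept rather than discarded as stop words in authorship attribution.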

However… (II)
• At present, the main pitfall of many NLP applications is
  that they are not flexible enough to:
   – Completely disambiguate language
   – Identify how language is used in different types of
     documents (a.k.a. genres).
  For instance, language is used differently in tweets
  than in emails, and the language used in emails is
  different from the language used in academic papers,
  etc.
• Tweaking NLP tools for different types of text, or
  resolving language ambiguity in an ad-hoc manner, is often
  time-consuming, difficult and unrewarding…

How can R help?
• Can R help overcome NLP shortcomings and
  open a new direction in Text Analytics and
  Information Discovery in order to extract
  useful information from Big TEXTUAL Data?




Existing literature for linguists
• Stefan Th. Gries (2013) Statistics for Linguistics
  with R: A Practical Introduction. De Gruyter
  Mouton. New Edition.
• Stefan Th. Gries (2009) Quantitative Corpus
  Linguistics with R: A Practical Introduction.
  Routledge, Taylor & Francis Group (companion
  website).
• R. Harald Baayen (2008) Analyzing Linguistic Data:
  A Practical Introduction to Statistics using R.
  Cambridge.
• …
Companion website by Stefan Th. Gries
 • BNC=British National Corpus (PoS tagged)




BNC
• The British National Corpus (BNC) is a 100 million word collection of
  samples of written and spoken language from a wide range of sources,
  designed to represent a wide cross-section of British English from the later
  part of the 20th century, both spoken and written. The latest edition is
  the BNC XML Edition, released in 2007.

• The corpus is encoded according to the Guidelines of the Text Encoding
  Initiative (TEI) to represent both the output from CLAWS (automatic part-
  of-speech tagger) and a variety of other structural properties of texts (e.g.
  headings, paragraphs, lists etc.). Full classification, contextual and
  bibliographic information is also included with each text in the form of a
  TEI-conformant header.




R & the BNC: Excerpt from Google Books




R = Corpus-based Linguistic Analysis = OK
1.     Descriptive statistics
2.     Analytical statistics
3.     Multifactorial methods

What about Information Discovery?
• Non-standardized language
• Non-standard texts
• Electronic documents of all kinds, e.g. formal,
  informal, short, long, private, public, etc.




From Information Discovery to
         Actionable Textual Intelligence
• Business Intelligence (BI) + Customer Analytics +
  Social Network Analytics + Crisis Intelligence […] =
  Actionable Textual Intelligence
• Actionable Textual Intelligence is information that:
   1.   must be accurate and verifiable
   2.   must be timely
   3.   must be comprehensive
   4.   must be comprehensible
   5.   !!! must give the power to make decisions and to act straightaway !!!
   6.   !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!

From The Economist:
The Big Data scenario




Enron & Crisis Intelligence:
             The Enron Scandal
• The Enron scandal, revealed in October 2001, eventually
  led to the bankruptcy of the Enron Corporation, an
  American energy company based in Houston, Texas.

• “Enron's complex financial statements were confusing to
  shareholders and analysts. In addition, its complex business
  model and unethical practices required that the company
  use accounting limitations to misrepresent earnings and
  modify the balance sheet to indicate favorable
  performance. According to McLean and Elkind in their
  book The Smartest Guys in the Room, "The Enron scandal
  grew out of a steady accumulation of habits and values and
  actions that began years before and finally spiraled out of
  control."” [Wikipedia]
The Enron Dataset
    http://www.cs.cmu.edu/~enron/
• “This dataset was collected and prepared by the
  CALO Project (A Cognitive Assistant that Learns
  and Organizes). It contains data from about 150
  users, mostly senior management of Enron,
  organized into folders. The corpus contains a total
  of about 0.5M messages. This data was originally
  made public, and posted to the web, by the
  Federal Energy Regulatory Commission during its
  investigation.”
• Resource for researchers
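
Before any analysis, each message has to be split into header fields and body text. A minimal sketch in Python, using only the standard library, might look like this. The sample message is invented for the example and is not taken from the corpus, and the assumption that the messages are plain header-plus-body text files should be checked against the downloaded data.

```python
from email import message_from_string

# Invented sample message (not taken from the Enron corpus), in the plain
# header-plus-body layout that e-mail files commonly use.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly figures
Date: Mon, 15 Oct 2001 09:30:00 -0500

Please send me the updated figures before Friday.
"""

def parse_message(raw):
    """Extract a few header fields and the body from a raw e-mail string."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": msg.get_payload().strip(),
    }

record = parse_message(RAW_MESSAGE)
print(record["subject"])
print(record["body"])
```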
The Challenge: Crisis Intelligence
• Task:
  Can you suggest and implement, with R, a predictive model
  that would have told us that the Enron CRISIS (= scandal &
  collapse) was going to happen, by analysing and
  processing the raw textual data of the emails belonging
  to the Enron dataset?
 Some basic references:
 • Enron scandal at-a-glance, BBC
 • The Enron Dataset (corpus = dataset = document collection)
 • A subset of about 1700 labeled email messages (4.5M) [genre, topic,
   emotion]
 • Actionable Corpus & Actionable Intelligence (this post contains
   additional references in the comments)
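
One conceivable starting point, far short of a predictive model, is to track how often certain terms appear over time. The sketch below is in Python for illustration only, since the challenge itself asks for R. The watch list, the function name and the toy messages are all assumptions invented for this example; nothing here is derived from the Enron data.

```python
from collections import defaultdict

# Hypothetical watch list; choosing the terms is itself part of the challenge.
WATCH_TERMS = {"investigation", "bankruptcy", "shred"}

def monthly_watch_rate(messages, watch_terms=WATCH_TERMS):
    """Map each 'YYYY-MM' month to the share of its messages that mention a watch term.

    `messages` is an iterable of (month, body) pairs. Words are split on
    whitespace only, so punctuation attached to a word prevents a match.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for month, body in messages:
        totals[month] += 1
        words = set(body.lower().split())
        if words & watch_terms:
            flagged[month] += 1
    return {month: flagged[month] / totals[month] for month in sorted(totals)}

# Invented toy data, not taken from the Enron corpus.
messages = [
    ("2001-08", "lunch on friday"),
    ("2001-08", "see attached report"),
    ("2001-10", "the investigation starts monday"),
    ("2001-10", "meeting moved"),
]
print(monthly_watch_rate(messages))  # {'2001-08': 0.0, '2001-10': 0.5}
```

A rising rate would at best be a signal worth inspecting; whether such signals have any predictive value is exactly the open question of the challenge.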
Thank you for your attention

             Presentation available here:
http://www.slideshare.net/marinasantini1/text-analytics-and-r


            http://www.forum.santini.se/

Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 

Plus de Marina Santini (20)

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 

Text analytics and R - Open Question: is it a good match?

  • 1. Text Analytics and R Open Question: A Good Match? Marina Santini (LinkedIn) Research Scientist at SICS East Swedish ICT AB (Santa Anna) R useR MeetUp: Text analytics using R R useR group (StockholmR) R useR MeetUp, 14 March 2013, 18:00 Stockholm
  • 2. My Quest or… Why do I attend this meetup? – The Quest: finding the optimal way to handle Big Textual Data for Information Discovery – The Question: is R convenient for text analytics of Big TEXTUAL Data? – Mission: identification of pros, cons, limits, benefits … • Current Status: investigation in progress…
  • 3. Outline • Big Data vs. Big TEXTUAL Data • Text Analytics & NLP (Natural Language Processing) • Statistics for linguistics with R by Stefan Th. Gries • From Information Discovery to Actionable TEXTUAL Intelligence • The Enron Challenge: Predictions and Crisis Intelligence
  • 4. Big Data • BIG DATA [Wikipedia]: – Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data. – Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
  • 5. R, Strata, Hadoop…? Apparently many solutions are available on the market… Uhm… Big Data is a vague label…
  • 6. Big Unstructured TEXTUAL Data • Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services. “Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue] → ECONOMIC LOSS! A plethora of diverse document genres!
  • 7. Simple search is not enough… • Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms. – “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization", in "Managing Unstructured Data in the Organization", Prentice Hall 2008, pp. 1–13]
  • 8. Textual Documents and Document Genres
  • 9. Definition: Text Analytics • A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.
  • 10. Set of NLP techniques • Common components of a text analytic package are: – Tokenization – Morphological Analysis – Syntactic Analysis – Named Entity Recognition – Sentiment Analysis – Automatic Summarization – Etc.
  • 11. NLP at Coursera
  • 12. NLP is pervasive Ex: spell-checkers • Google Search • Google Mail • Facebook • Office Word • […]
  • 13. NLP is pervasive Ex: Named Entity Recognition • Opinion mining • Brand Trends • Conversation clouds on web magazines and online newspapers…
  • 14. Sentiment Analysis
  • 15. Text Analytics Products and Frameworks • Commercial Products: – Attensity – Clarabridge – Temis – Lexalytics – Texify – SAS – SPSS – IBM Cognos – etc. • Open Source Frameworks: – GATE – NLTK – UIMA – etc.
  • 16. However… (I) • NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP subfields.
  • 17. Ex: Syntactic Parser • Connexor • What about parsing a tweet? • “My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair” (Twitter Tutorial 1: How to Tweet Well)
  • 18. Why are NLP and Text Analytics important for Information Discovery? • Why is it important to know that a word is a noun, or a verb, or the name of a brand? • Broadly speaking: • Nouns and verbs: nouns are important for topic detection; verbs are important if you want to identify actions or intentions. • Adjectives = sentiment identification. • Function words (a.k.a. stop words) are important for authorship attribution, plagiarism detection, etc. • etc.
  • 19. However… (II) • At present, the main pitfall of many NLP applications is that they are not flexible enough to: – Completely disambiguate language – Identify how language is used in different types of documents (a.k.a. genres). For instance, language is used differently in tweets than in emails, the language of emails differs from the language of academic papers, etc. • Often, tweaking NLP tools for different types of text or resolving language ambiguity in an ad hoc manner is time-consuming, difficult and unrewarding…
  • 20. How can R help? • Can R help overcome NLP shortcomings and open a new direction in Text Analytics and Information Discovery in order to extract useful information from Big TEXTUAL Data?
  • 21. Existing literature for linguists • Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New Edition. • Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website). • R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge. • …
  • 22. Companion website by Stefan Th. Gries • BNC = British National Corpus (PoS tagged)
  • 23. BNC • The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007. • The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.
  • 24. R & the BNC: Excerpt from Google Books R = Corpus-based Linguistic Analysis = OK 1. Descriptive statistics 2. Analytical statistics 3. Multifactorial methods
  • 25. What about Information Discovery? • Non-standardized language • Non-standard texts • Electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.
  • 26. Information Discovery → Actionable Textual Intelligence • Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Textual Intelligence • Actionable Textual Intelligence is information that: 1. must be accurate and verifiable 2. must be timely 3. must be comprehensive 4. must be comprehensible 5. !!! must give the power to make decisions and to act straightaway !!! 6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!
  • 27. From The Economist: The Big Data scenario
  • 28. Enron & Crisis Intelligence: The Enron Scandal • The Enron scandal, revealed in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas. • “Enron's complex financial statements were confusing to shareholders and analysts. In addition, its complex business model and unethical practices required that the company use accounting limitations to misrepresent earnings and modify the balance sheet to indicate favorable performance. According to McLean and Elkind in their book The Smartest Guys in the Room, "The Enron scandal grew out of a steady accumulation of habits and values and actions that began years before and finally spiraled out of control."” [Wikipedia]
  • 29. The Enron Dataset http://www.cs.cmu.edu/~enron/ • “This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.” • Resource for researchers
  • 30. The Challenge: Crisis Intelligence • Task: Can you suggest and implement a predictive model that would have told us that the Enron CRISIS (= scandal & collapse) was going to happen, by analysing and processing the raw textual data of emails belonging to the Enron dataset with R? Some basic references: • Enron scandal at-a-glance, BBC • The Enron Dataset (corpus = dataset = document collection) • A subset of about 1700 labeled email messages (4.5M) [genre, topic, emotion] • Actionable Corpus & Actionable Intelligence (this post contains additional references in the comments)
  • 31. Thank you for your attention Presentation available here: http://www.slideshare.net/marinasantini1/text-analytics-and-r http://www.forum.santini.se/
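
The contrast drawn on slide 7 between a simple single-term search and a search that also covers related terms can be sketched in a few lines. This is only an illustration under stated assumptions: the term list is a hypothetical hand-built set taken from the crime types named in the quotation, the four documents are invented, and Python is used for convenience rather than R, which is the subject of the talk.

```python
import re

# Hypothetical term set, hand-built from the crime types named on slide 7.
FELONY_RELATED_TERMS = {
    "felony", "crime", "arson", "murder", "embezzlement", "vehicular homicide",
}

# Toy documents, invented for illustration only.
DOCS = {
    "memo_1": "The suspect was charged with a felony last week.",
    "memo_2": "Investigators suspect arson at the warehouse.",
    "memo_3": "The audit uncovered embezzlement of client funds.",
    "memo_4": "Quarterly earnings were in line with forecasts.",
}


def simple_search(term, docs):
    """Return the ids of documents that contain the literal term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


def expanded_search(terms, docs):
    """Return the ids of documents that contain any of the related terms."""
    hits = set()
    for term in terms:
        hits.update(simple_search(term, docs))
    return sorted(hits)


simple_hits = simple_search("felony", DOCS)
expanded_hits = expanded_search(FELONY_RELATED_TERMS, DOCS)
print(simple_hits)    # only the memo that literally says "felony"
print(expanded_hits)  # also the memos that mention arson and embezzlement
```

A hand-written term set does not scale, of course; in practice the related terms would have to come from a taxonomy or from the NLP components listed on slide 10.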

Editor's notes

  1. A problem of size + a problem of diverse data = heterogeneous data. RFID = radio-frequency identification.
  2. Strata: http://youtu.be/8vmGAV5Nx4Y
  3. Much effort has been allocated to improving big native numeric data: balance sheets, income reports, financial and business reports, etc. Merrill Lynch (financial management and advisory, www.ml.com/) is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services. E-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, etc. are different genres, i.e. different types of text. For example, emails and white papers are both textual genres but they differ a lot from each other. They might deal with the same topic, but in a completely different way. So the type of information related to the same topic can vary according to genre.
  4. We need tools to analyse this huge amount of textual data and extract the information we need.
  5. Orthographic check: is something written correctly or not? Vital for searching.
  6. What is a named entity?
  7. If you try with longer texts or with another genre, results are not reliable
  8. Business intelligence (BI) is the ability of an organization to collect, maintain, and organize data. This produces large amounts of information that can help develop new opportunities. Identifying these opportunities, and implementing an effective strategy, can provide a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Customer Experience Management (CEM) is the practice of actively listening to the Voice of the Customer through a variety of listening posts, analyzing customer feedback to create a basis for acting on better business decisions and then measuring the impact of those decisions to drive even greater operational performance and customer loyalty. Through this process, a company strategically organizes itself to manage a customer's entire experience with its product, service or company. Companies invest in CEM to improve customer retention.
  9. A tweet: “My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair”. Language is highly ambiguous. Fair = reasonable and acceptable // treating everyone equally. Fair = a form of outdoor entertainment, at which there are large machines to ride on and games in which you can win prizes // an event at which people or businesses show and sell their products. Play fair = to do something in a fair and honest way.
  10. Professor of Linguistics, Department of Linguistics, University of California, Santa Barbara
  11. N-grams; average sentence and word length; indexing; split infinitives
  12. Stockholm-Umeå Corpus (Joakim)
  13. Descriptive statistics; analytical statistics; multifactorial methods. Token/type ratio: the type-token ratio (TTR) is a measure of vocabulary variation within a written text or a person's speech. The type-token ratios of two real world examples are calculated and interpreted. The type-token ratio is shown to be a helpful measure of lexical variety within a text. It can be used to monitor changes in children and adults with vocabulary difficulties. Tokens are the number of words. Several of these tokens are repeated. For example, the token again occurs two times, the token are occurs three times, and the token and occurs five times. Of the total of 87 tokens in this text there are 62 so-called types. The relationship between the number of types and the number of tokens is known as the type-token ratio (TTR). For Text 1 above we can now calculate this as follows: Type-Token Ratio = (number of types / number of tokens) * 100 = (62/87) * 100 = 71.3%. The more types there are in comparison to the number of tokens, the more varied is the vocabulary, i.e. there is greater lexical variety. Source: http://www.speech-therapy-information-and-resources.com/type-token-ratio.html
  14. Information discovery is too vague
  15. http://youtu.be/qqfeUUjAIyQ
  16. http://en.wikipedia.org/wiki/Enron_scandal
  17. http://www.cs.cmu.edu/~enron/ A resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public
  18. Krishantering (Swedish for crisis management). They threw down the challenge that he couldn't wash 40 cars in one hour (= invited him to try to do it). It is not a contest yet… it might become a contest in the future, if I launch the same contest to other meetups or other groups like Strata, Hadoop, etc. Enron scandal at-a-glance: http://news.bbc.co.uk/2/hi/business/1780075.stm The Enron Dataset (corpus = dataset = document collection) = http://www.cs.cmu.edu/~enron/ A subset of about 1700 labeled email messages (4.5M) = http://bailando.sims.berkeley.edu/enron_email.html
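
The type-token ratio arithmetic quoted in note 13 can be reproduced with a short sketch. The tokenizer below is a deliberately naive assumption (lowercased runs of letters and apostrophes), the sample sentence is invented, and Python is used for illustration only; an R version would follow the same steps.

```python
import re


def type_token_ratio(text):
    """Return the type-token ratio of a text as a percentage.

    Tokens are lowercased runs of letters and apostrophes; this naive
    tokenization is an assumption made for the sake of the example.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    types = set(tokens)  # distinct word forms
    return 100.0 * len(types) / len(tokens)


# Figures quoted in note 13: 62 types out of 87 tokens.
ttr_from_note = 100.0 * 62 / 87
print(round(ttr_from_note, 1))

# Invented sample: 8 tokens, 5 distinct types.
sample = "the cat and the dog and the bird"
sample_ttr = type_token_ratio(sample)
print(sample_ttr)
```

Because longer texts repeat more words, the ratio tends to fall as text length grows, so comparisons are most meaningful between samples of similar size.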