Text analytics and R - Open Question: is it a good match?
1. Text Analytics and R
Open Question: A Good Match?
Marina Santini
(LinkedIn)
Research Scientist at SICS East Swedish ICT AB (Santa Anna)
R useR MeetUp: Text analytics using R
R useR group (StockholmR)
R useR MeetUp, 14 March 2013, 18:00
Stockholm
2. My Quest or… Why do I attend this meetup?
– The Quest: finding the optimal way to handle Big
Textual Data for Information Discovery
– The Question: is R convenient for text analytics of
Big TEXTUAL Data?
– Mission: identification of pros, cons, limits,
benefits …
• Current Status: investigation in progress…
3. Outline
• Big Data vs. Big TEXTUAL Data
• Text Analytics & NLP (Natural Language Processing)
• Statistics for linguistics with R by Stefan Th. Gries
• From Information Discovery to Actionable
TEXTUAL Intelligence
• The Enron Challenge: Predictions and Crisis
Intelligence
4. Big Data
• BIG DATA [Wikipedia]:
– Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and
process the data within a tolerable elapsed time. Big data sizes are a
constantly moving target, as of 2012 ranging from a few dozen
terabytes to many petabytes of data in a single data set. With this
difficulty, new platforms of "big data" tools are being developed to
handle various aspects of large quantities of data.
– Examples include Big Science, web logs, RFID, sensor networks, social
networks, social data (due to the social data revolution), Internet
text and documents, Internet search indexing, call detail records,
astronomy, atmospheric science, genomics, biogeochemical,
biological, and other complex and often interdisciplinary scientific
research, military surveillance, medical records, photography archives,
video archives, and large-scale e-commerce.
5. R, Strata, Hadoop…?
Apparently many solutions are available on
the market…
Uhm… Big Data is a vague label…
6. Big Unstructured TEXTUAL Data
Merrill Lynch is one of the world's leading financial
management and advisory companies, providing financial
advice and investment banking services.
“Merrill Lynch estimates that more than 85 percent of all
business information exists as unstructured data – commonly
appearing in e‐mails, memos, notes from call centers and
support operations, news, user groups, chats, reports, letters,
surveys, white papers, marketing material, research,
presentations and web pages.” [DM Review Magazine,
February 2003 Issue]
ECONOMIC LOSS!
A plethora of diverse document genres!
7. Simple search is not enough…
• Of course, it is possible to use simple search. But
simple search is unrewarding, because it is based on
single terms.
– “a search is made on the term felony. In a simple search,
the term felony is used, and everywhere there is a
reference to felony, a hit to an unstructured document is
made. But a simple search is crude. It does not find
references to crime, arson, murder, embezzlement,
vehicular homicide, and such, even though these crimes
are types of felonies” [Source: Inmon, B. & A. Nesavich,
"Unstructured Textual Data in the Organization" from
"Managing Unstructured data in the organization",
Prentice Hall 2008, pp. 1–13]
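The felony example can be made concrete with a small sketch. It is written in Python purely for illustration and is not taken from the talk: the four documents and the hand-made list of felony types are invented, and a real system would more plausibly draw such a list from a thesaurus or taxonomy.

```python
import re

# Toy documents, invented for illustration only.
DOCUMENTS = {
    "doc1": "The suspect was charged with a felony last year.",
    "doc2": "Investigators suspect arson at the warehouse.",
    "doc3": "The manager was convicted of embezzlement.",
    "doc4": "The quarterly report was delivered on time.",
}

# Hypothetical hand-made expansion list: terms treated as kinds of felony.
FELONY_TERMS = {"felony", "crime", "arson", "murder", "embezzlement"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def simple_search(term, documents):
    """Return the ids of documents that contain the exact term."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if term in tokenize(text))


def expanded_search(terms, documents):
    """Return the ids of documents that contain any of the given terms."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if terms & set(tokenize(text)))


simple_hits = simple_search("felony", DOCUMENTS)
expanded_hits = expanded_search(FELONY_TERMS, DOCUMENTS)
print(simple_hits)    # only the document that literally says "felony"
print(expanded_hits)  # also the arson and embezzlement documents
```

The simple search misses the arson and embezzlement documents because it matches a single surface term. The expanded search finds them only because someone supplied the related terms by hand.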
9. Definition: Text Analytics
• A set of NLP techniques that provide some
structure to textual documents and help
identify and extract important information.
10. Set of NLP techniques
• Common components of a text analytic
package are:
– Tokenization
– Morphological Analysis
– Syntactic Analysis
– Named Entity Recognition
– Sentiment Analysis
– Automatic Summarization
– Etc.
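Tokenization, the first component in the list above, can be sketched with a single regular expression. This Python snippet is only an illustration: the pattern is a simplifying assumption, and production tokenizers also deal with abbreviations, URLs, emoticons and similar cases.

```python
import re

# Simplified pattern (an assumption for this sketch): a word with optional
# internal apostrophes or hyphens, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Simple search isn't enough, is it?")
print(tokens)
```

Each later component (morphological analysis, parsing, named entity recognition) builds on the tokens produced at this step, so tokenization errors propagate downstream.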
11. NLP at Coursera
12. NLP is pervasive
Ex: spell-checkers
• Google Search
• Google Mail
• Facebook
• Office Word
• […]
13. NLP is pervasive
Ex: Named Entity Recognition
• Opinion mining
• Brand Trends
• Conversation clouds on web magazines and online
newspapers…
15. Text Analytics Products and Frameworks
• Commercial Products:
– Attensity
– Clarabridge
– Temis
– Lexalytics
– Texify
– SAS
– SPSS
– IBM Cognos
– etc.
• Open Source Frameworks:
– GATE
– NLTK
– UIMA
– etc.
16. However… (I)
• NLP tools and applications (both commercial
and open source) are not perfect. Research is
still very active in all NLP subfields.
17. Ex: Syntactic Parser
• Connexor
• What about parsing a tweet?
• “My son, 6y/o, asked me for the first time today how
my DAY was . . . I about melted. Told him that I had
pizza for lunch. Response? No fair” (Twitter Tutorial 1:
How to Tweet Well)
18. Why are NLP and Text Analytics
important for Information Discovery?
• Why is it important to know that a word is a noun, or a
verb or the name of a brand?
• Broadly speaking:
• Nouns and verbs: Nouns are important for topic
detection; verbs are important if you want to identify
actions or intentions.
• Adjectives = sentiment identification.
• Function words (a.k.a. stop words) are important for
authorship attribution, plagiarism detection, etc.
• etc.
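As an illustration of the function-word point above, the following Python sketch builds a relative-frequency profile of a few function words and compares two texts. Everything in it is an assumption made for the example: the six-word list, the two sample sentences and the simple distance measure. It is not a method proposed in the talk.

```python
import re
from collections import Counter

# Tiny illustrative list of English function words (an assumption for this
# sketch; real studies use much longer lists).
FUNCTION_WORDS = ("the", "of", "and", "a", "in", "to")


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in FUNCTION_WORDS}
    return {word: counts[word] / total for word in FUNCTION_WORDS}


def profile_distance(profile_a, profile_b):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(profile_a[w] - profile_b[w]) for w in FUNCTION_WORDS)


# Invented sample texts, for illustration only.
text_a = "The cat sat in the hat and the dog ran to a park"
text_b = "Prices of oil and of gas rose in the spring"

profile_a = function_word_profile(text_a)
profile_b = function_word_profile(text_b)
print(profile_a)
print(profile_distance(profile_a, profile_b))
```

Because function words carry little topical content, such profiles tend to reflect writing habits rather than subject matter, which is why removing them as stop words discards exactly the signal that authorship studies rely on.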
19. However… (II)
• At present, the main pitfall of many NLP applications is
that they are not flexible enough to:
– Completely disambiguate language
– Identify how language is used in different types of
documents (a.k.a. genres).
For instance, language is used differently in tweets
than in emails, and the language of emails is
different from the language of academic papers,
etc.
• Tweaking NLP tools for different types of text, or
resolving language ambiguity in an ad hoc manner, is often
time-consuming, difficult and unrewarding…
20. How can R help?
• Can R help overcome NLP shortcomings and
open a new direction in Text Analytics and
Information Discovery in order to extract
useful information from Big TEXTUAL Data?
21. Existing literature for linguists
• Stefan Th. Gries (2013) Statistics for linguistics
With R: A Practical Introduction. De Gruyter
Mouton. New Edition.
• Stefan Th. Gries (2009) Quantitative corpus
linguistics with R: a practical introduction.
Routledge, Taylor & Francis Group (companion
website).
• R. Harald Baayen (2008) Analyzing Linguistic Data:
A Practical Introduction to Statistics using R.
Cambridge University Press.
• ….
22. Companion website by Stefan Th. Gries
• BNC=British National Corpus (PoS tagged)
23. BNC
• The British National Corpus (BNC) is a 100 million word collection of
samples of written and spoken language from a wide range of sources,
designed to represent a wide cross-section of British English from the later
part of the 20th century, both spoken and written. The latest edition is
the BNC XML Edition, released in 2007.
• The corpus is encoded according to the Guidelines of the Text Encoding
Initiative (TEI) to represent both the output from CLAWS (automatic part-
of-speech tagger) and a variety of other structural properties of texts (e.g.
headings, paragraphs, lists etc.). Full classification, contextual and
bibliographic information is also included with each text in the form of a
TEI-conformant header.
24. R & the BNC: Excerpt from Google Books
R = Corpus-based Linguistic Analysis = OK
1. Descriptive statistics
2. Analytical statistics
3. Multifactorial methods
25. What about Information Discovery?
• Non standardized language
• Non standard texts
• Electronic documents of all kinds, e.g. formal,
informal, short, long, private, public, etc.
26. Information Discovery
Actionable Textual Intelligence
• Business Intelligence (BI) + Customer Analytics +
Social Network Analytics + Crisis Intelligence […] =
Actionable Textual Intelligence
• Actionable Textual Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! must give the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!
28. Enron & Crisis Intelligence:
The Enron Scandal
• The Enron scandal, revealed in October 2001, eventually
led to the bankruptcy of the Enron Corporation, an
American energy company based in Houston, Texas.
• “Enron's complex financial statements were confusing to
shareholders and analysts. In addition, its complex business
model and unethical practices required that the company
use accounting limitations to misrepresent earnings and
modify the balance sheet to indicate favorable
performance. According to McLean and Elkind in their
book The Smartest Guys in the Room, ‘The Enron scandal
grew out of a steady accumulation of habits and values and
actions that began years before and finally spiraled out of
control.’” [Wikipedia]
29. The Enron Dataset
http://www.cs.cmu.edu/~enron/
• “This dataset was collected and prepared by the
CALO Project (A Cognitive Assistant that Learns
and Organizes). It contains data from about 150
users, mostly senior management of Enron,
organized into folders. The corpus contains a total
of about 0.5M messages. This data was originally
made public, and posted to the web, by the
Federal Energy Regulatory Commission during its
investigation.”
• Resource for researchers
30. The Challenge: Crisis Intelligence
• Task:
Can you suggest and implement a predictive model
that would have warned us that the Enron CRISIS (= scandal &
collapse) was going to happen, by analysing and
processing with R the raw textual data of the emails
in the Enron dataset?
Some basic references:
•Enron scandal at-a-glance, BBC
•The Enron Dataset (corpus=dataset=document collection)
•A subset of about 1700 labeled email messages (4.5M) [genre, topic,
emotion]
•Actionable Corpus & Actionable Intelligence (this post contains
additional references in the comments)
31. Thank you for your attention
Presentation available here:
http://www.slideshare.net/marinasantini1/text-analytics-and-r
http://www.forum.santini.se/
Editor's notes
Problem of size + a problem of diverse data! = heterogeneous data
Radio-frequency identification (RFID)
Strata: http://youtu.be/8vmGAV5Nx4Y
Much effort has been allocated to improve big native data, i.e. numeric data: balance sheets, income reports, financial and business reports, etc.
Merrill Lynch – financial management and advisory, www.ml.com/
Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services.
E‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, etc. are different genres, i.e. different types of text. For example, emails and white papers are both textual genres but they differ a lot from each other. They might deal with the same topic, but in a completely different way. So the type of information related to the same topic can vary according to genre.
We need tools to analyse this huge amount of textual data and extract the information we need.
Orthographic check: is something written correctly or not? Vital for searching
What is a named entity?
If you try with longer texts or with another genre, results are not reliable
Business intelligence (BI) is the ability of an organization to collect, maintain, and organize data. This produces large amounts of information that can help develop new opportunities. Identifying these opportunities, and implementing an effective strategy, can provide a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations.
Customer Experience Management (CEM) is the practice of actively listening to the Voice of the Customer through a variety of listening posts, analyzing customer feedback to create a basis for acting on better business decisions and then measuring the impact of those decisions to drive even greater operational performance and customer loyalty. Through this process, a company strategically organizes itself to manage a customer's entire experience with its product, service or company. Companies invest in CEM to improve customer retention
A tweet: My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair
Language is highly ambiguous.
Fair = reasonable and acceptable // treating everyone equally
Fair = a form of outdoor entertainment, at which there are large machines to ride on and games in which you can win prizes // an event at which people or businesses show and sell their products
play fair: to do something in a fair and honest way
Professor of Linguistics, Department of Linguistics, University of California, Santa Barbara
N-grams
Average sentence and word length
Indexing
Split infinitives
Stockholm–Umeå Corpus (Joakim)
Descriptive statistics
Analytical statistics
Multifactorial methods
Token/type ratio = The type-token ratio (TTR) is a measure of vocabulary variation within a written text or a person’s speech. The type-token ratios of two real world examples are calculated and interpreted. The type-token ratio is shown to be a helpful measure of lexical variety within a text. It can be used to monitor changes in children and adults with vocabulary difficulties.
Tokens are the number of words. Several of these tokens are repeated. For example, the token again occurs two times, the token are occurs three times, and the token and occurs five times. Of the total of 87 tokens in this text there are 62 so-called types. The relationship between the number of types and the number of tokens is known as the type-token ratio (TTR). For Text 1 above we can now calculate this as follows:
Type-Token Ratio = (number of types / number of tokens) * 100 = (62/87) * 100 = 71.3%
The more types there are in comparison to the number of tokens, the more varied is the vocabulary, i.e. there is greater lexical variety.
http://www.speech-therapy-information-and-resources.com/type-token-ratio.html
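The type-token ratio described in this note is easy to compute. The Python sketch below is an illustration only: the regular-expression tokenizer is a simplifying assumption and the sample sentence is invented. The last lines reproduce the arithmetic quoted in the note (62 types, 87 tokens).

```python
import re


def type_token_ratio(text):
    """Return the type-token ratio of a text as a percentage."""
    # Simplified tokenizer (an assumption): lowercase alphabetic words,
    # optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)  # distinct word forms
    return len(types) / len(tokens) * 100


# Invented sample sentence: 8 tokens, 5 types.
sample = "the cat and the dog and the bird"
print(type_token_ratio(sample))

# The worked example from the note: 62 types out of 87 tokens.
ttr_from_note = 62 / 87 * 100
print(round(ttr_from_note, 1))
```

Note that TTR depends on text length, since longer texts repeat more words, so values are only comparable between texts of similar size.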
Information discovery is too vague
http://youtu.be/qqfeUUjAIyQ
http://en.wikipedia.org/wiki/Enron_scandal
http://www.cs.cmu.edu/~enron/
Resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public
Krishuntering
They threw down the challenge that he couldn't wash 40 cars in one hour (= invited him to try to do it)
It is not a contest yet… it might become a contest in the future, if I launch the same contest to other meetups or other groups like Strata, Hadoop, etc.
Enron scandal at-a-glance: http://news.bbc.co.uk/2/hi/business/1780075.stm
The Enron Dataset (corpus=dataset=document collection) = http://www.cs.cmu.edu/~enron/
A subset of about 1700 labeled email messages (4.5M) = http://bailando.sims.berkeley.edu/enron_email.html