Natural Language Processing and Machine Learning for Discovery
2. GOALS
Understand the BLACK BOX.
Natural language processing
Mathematical and linguistic concepts
Models of representation
Real-world application
Machine learning
Common pre-processing and learning algorithms
Real-world application
Communicate with software and service vendors!
© Bommarito Consulting
3. BLACK BOX
How do we characterize a black box?
Inputs Parameters Outputs
4. BLACK BOX
Secret: Most black boxes are very similar inside.
We're going to learn to identify the common parts.
5. NATURAL LANGUAGE PROCESSING
Definition: Dealing with real-world text in an automated,
reproducible way.
Often referred to as NLP.
Used somewhat interchangeably with computational
linguistics.
6. NATURAL LANGUAGE PROCESSING
Let's start with some text.
“Hurricane Sandy grounded 3,200 flights scheduled for today and
tomorrow, prompted New York to suspend subway and bus service and
forced the evacuation of the New Jersey shore as it headed toward land
with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its
path north, may be capable of inflicting as much as $18 billion in
damage when it barrels into New Jersey tomorrow and knock out power
to millions for a week or more, according to forecasters and risk
experts.”
(Bloomberg article on Sandy)
7. NATURAL LANGUAGE PROCESSING
What kind of questions can we ask?
Basic
What is the structure of the text?
Paragraphs
Sentences
Tokens/words
What are the words that appear in this text?
Nouns
Subjects
Direct objects
Verbs
Advanced
What are the concepts that appear in this text?
How does this text compare to other text?
8. NATURAL LANGUAGE PROCESSING
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
prompted New York to suspend subway and bus service and forced the evacuation of
the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north,
may be capable of inflicting as much as $18 billion in damage when it barrels into New
Jersey tomorrow and knock out power to millions for a week or more, according to
forecasters and risk experts.”
• Segment Types
• Paragraphs
• Sentences
• Tokens
9. NATURAL LANGUAGE PROCESSING
Segmentation and Tokenization
But how does it work?
Paragraphs
Two consecutive line breaks
A hard line break followed by an indent
Sentences
Period, except abbreviation, ellipsis within quotation, etc.
Tokens and Words
Whitespace
Punctuation
Remember what real-world text looks like – think text messages and email.
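The rules on this slide can be sketched with a few regular expressions. This is a minimal illustration, not how any particular product works: the function names and the exact patterns are assumptions, and the sentence rule deliberately ignores the exceptions the slide mentions (abbreviations, ellipses, quotations).

```python
import re

def split_paragraphs(text):
    # Paragraph rules from the slide: two consecutive line breaks,
    # or a hard line break followed by an indent.
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+\S)", text)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(paragraph):
    # Naive rule: break after ., ! or ? when whitespace and a capital follow.
    # Abbreviations such as "Dr. Smith" would be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [s.strip() for s in parts if s.strip()]

def tokenize(sentence):
    # Numbers keep internal commas/periods ("3,200", "$18"), words keep
    # internal hyphens/apostrophes, other punctuation becomes its own token.
    return re.findall(r"\$?\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Hypothetical sample text, shortened from the slide's example.
sample = ("Sandy grounded 3,200 flights. New York suspended subway service."
          "\n\nThe system may cause $18 billion in damage.")
for paragraph in split_paragraphs(sample):
    for sentence in split_sentences(paragraph):
        print(tokenize(sentence))
```

Messy real-world input (emails, text messages) is exactly where simple patterns like these start to fail, which is one reason tools differ in quality.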
10. NATURAL LANGUAGE PROCESSING
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
prompted New York to suspend subway and bus service and forced the evacuation of
the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north,
may be capable of inflicting as much as $18 billion in damage when it barrels into New
Jersey tomorrow and knock out power to millions for a week or more, according to
forecasters and risk experts.”
Paragraphs: 2
Sentences: 2
Words: 561
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
'today', 'and', 'tomorrow', …]
11. NATURAL LANGUAGE PROCESSING
What kind of questions can we ask?
We now have an ordered list of tokens.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
'today', 'and', 'tomorrow', …]
Does the phrase “quote stuffing” occur in the text?
How many times does “Sandy” occur?
How often does “outage” occur after “power”?
What percentage of tokens are numbers?
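Each of these questions is a short computation over the ordered token list. The sketch below uses the first ten tokens from the slide; the helper names are hypothetical, and "number" is assumed to mean digits with optional commas, periods, or a leading dollar sign.

```python
import re

tokens = ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights',
          'scheduled', 'for', 'today', 'and', 'tomorrow']

def contains_phrase(tokens, phrase):
    # True if the tokens of `phrase` appear consecutively, in order.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def count_following(tokens, first, second):
    # How often `second` appears immediately after `first`.
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def is_number(token):
    # Assumed definition: digits with optional separators and leading "$".
    return re.fullmatch(r"\$?\d+(?:[.,]\d+)*", token) is not None

def percent_numbers(tokens):
    if not tokens:
        return 0.0
    return 100.0 * sum(is_number(t) for t in tokens) / len(tokens)

print(contains_phrase(tokens, ['quote', 'stuffing']))
print(tokens.count('Sandy'))
print(count_following(tokens, 'power', 'outage'))
print(percent_numbers(tokens))
```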
12. NATURAL LANGUAGE PROCESSING
An Aside on Storage
Data: The word 'the' ten times and the word 'a' ten times.
Representation 1 - Ordered List:
['the', 'a', 'the', 'a', 'the', 'a', …]
Representation 2 - Term Frequency:
[('the', 10), ('a', 10)]
13. NATURAL LANGUAGE PROCESSING
An Aside on Storage
Representation 1 - Ordered List:
['the', 'a', 'the', 'a', 'the', 'a', …]
Representation 2 - Frequency Map:
[('the', 10), ('a', 10)]
Tradeoffs
Total space
Ease of answering certain questions
Information about context
Not all software makes the same choice!
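The tradeoffs can be seen directly in a small sketch. It assumes the alternating order shown in the slide's ordered list; the variable names are illustrative only.

```python
from collections import Counter

# Representation 1: ordered list (20 entries, keeps context).
ordered = ['the', 'a'] * 10

# Representation 2: frequency map (2 entries, loses order).
frequency = Counter(ordered)

# Total space: the list grows with the text, the map with the vocabulary.
print(len(ordered), len(frequency))

# Ease of answering: "how many times does 'the' occur?" is a lookup here.
print(frequency['the'])

# Information about context: "what follows 'the'?" needs the ordered list;
# it cannot be recovered from the frequency map alone.
follows_the = Counter(b for a, b in zip(ordered, ordered[1:]) if a == 'the')
print(follows_the)
```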
14. NATURAL LANGUAGE PROCESSING
Stopwording, Stemming, Parsing, and Tagging
Stopwording
Removing “filler” words like prepositions, auxiliary or infinitive verbs, and
conjunctions.
Stemming
Matching inflected nouns like dog/dogs.
Matching conjugated verbs like run/runs/running.
Irregular forms like child/children or run/ran usually need lemmatization, not just suffix stripping.
Parsing
Determining the “structure” of a sentence, typically as represented by a
grade school sentence diagram (requires grammar definition; we'll skip).
Tagging
Identifying the part of speech of each token in a sentence.
15. NATURAL LANGUAGE PROCESSING
Stopwording
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
prompted New York to suspend subway and bus service and forced the evacuation of
the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north,
may be capable of inflicting as much as $18 billion in damage when it barrels into New
Jersey tomorrow and knock out power to millions for a week or more, according to
forecasters and risk experts.
Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New
York suspend subway bus service forced evacuation New Jersey shore headed toward
land life-threatening wind rain.
System, killed many 65 people Caribbean path north, may capable inflicting much
$18 billion damage barrels New Jersey tomorrow knock power millions week, according
forecasters risk experts.
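Stopwording amounts to filtering tokens against a list. The sketch below is an assumption-laden illustration: the stopword set is a small hypothetical list chosen to cover this example, whereas real tools ship much longer lists that differ from product to product.

```python
# Hypothetical stopword list; real lists are longer and vary by tool.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with", "which",
             "in", "on", "its", "be", "when", "into", "out", "a", "or"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare case-insensitively so "The" is dropped along with "the".
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
print(" ".join(remove_stopwords(tokens)))
```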
16. NATURAL LANGUAGE PROCESSING
Stopwording + Stemming
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
prompted New York to suspend subway and bus service and forced the evacuation of
the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north,
may be capable of inflicting as much as $18 billion in damage when it barrels into New
Jersey tomorrow and knock out power to millions for a week or more, according to
forecasters and risk experts.
Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York
suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten
wind rain.
System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion
damag barrel New Jersey tomorrow knock power million week, accord forecast risk
expert.
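To show the idea behind stemmed output like the above, here is a toy suffix stripper. It is not the stemmer that produced the slide's text (the output resembles a Porter-style stemmer, which applies many more rules), and the suffix list, the minimum stem length, and the lowercasing are all assumptions made for illustration.

```python
# Toy rules only; a production stemmer has many more.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    w = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining letters so short words survive.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

for word in ["flights", "grounded", "scheduled", "forced", "inflicting"]:
    print(word, "->", toy_stem(word))

# Suffix stripping alone does not merge irregular forms.
print(toy_stem("ran") == toy_stem("run"))
```

Stems such as "schedul" or "forc" are not words; they only need to be consistent so that different forms of the same word match.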
17. NATURAL LANGUAGE PROCESSING
Tagging
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
prompted New York to suspend subway and bus service and forced the evacuation of
the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north,
may be capable of inflicting as much as $18 billion in damage when it barrels into New
Jersey tomorrow and knock out power to millions for a week or more, according to
forecasters and risk experts.
[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights',
'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
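Once tokens carry tags, part-of-speech questions become simple filters. The sketch below reuses the tagged tokens shown above; the tags follow the Penn Treebank convention, where noun tags start with "NN" and verb tags with "VB". The helper name is hypothetical.

```python
tagged = [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'),
          ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'),
          ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN')]

def words_with_tag_prefix(tagged, prefix):
    # "NN" selects NN, NNS, NNP, NNPS; "VB" selects VB, VBD, VBN, and so on.
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns)
print(verbs)
```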
19. NATURAL LANGUAGE PROCESSING
Let's say that we're investigating Enron for accounting fraud
related to its reserve reporting and transfers.
We want to look for any material that discusses reserves and
profits in the same sentence. However, we want cases where
these words are used as nouns; we're not interested in dinner
reservations.
Inputs            Parameters      Output
Memos             Stopword: No    Memos
Research          Stem: Yes       Research
Emails            Tag: Yes        Emails
Texts             Search: …       Texts
Transcriptions                    Transcriptions
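A minimal sketch of this search, using the parameters above (no stopwording, stemming on, tagging on). Everything here is illustrative: the sentences are invented and hand-tagged, the stemmer only strips a trailing "s", and the function names are hypothetical. A real system would run an actual tagger and stemmer over the documents.

```python
def simple_stem(word):
    # Assumption: stripping a trailing "s" is enough for this example.
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def mentions_both_as_nouns(tagged_sentence, targets=("reserve", "profit")):
    # Keep only tokens tagged as nouns (Penn Treebank tags starting "NN"),
    # stem them, and require every target stem to be present.
    noun_stems = {simple_stem(w) for w, tag in tagged_sentence if tag.startswith("NN")}
    return set(targets) <= noun_stems

# Hypothetical, hand-tagged sentences.
accounting = [('We', 'PRP'), ('moved', 'VBD'), ('reserves', 'NNS'),
              ('to', 'TO'), ('boost', 'VB'), ('profits', 'NNS')]
dinner = [('I', 'PRP'), ('reserved', 'VBD'), ('a', 'DT'),
          ('table', 'NN'), ('for', 'IN'), ('dinner', 'NN')]
verb_use = [('Please', 'UH'), ('reserve', 'VB'), ('the', 'DT'), ('profits', 'NNS')]

for sentence in (accounting, dinner, verb_use):
    print(mentions_both_as_nouns(sentence))
```

Tagging is what separates "reserves" the noun from "reserved" the verb; stemming is what lets "reserves" and "reserve" match the same search term.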
20. NATURAL LANGUAGE PROCESSING
In general, all document search and discovery software
combines the elements discussed above.
Segment
Tokenize
Stopword
Stem
Parse
Tag
Store
Search
Retrieve
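The store, search, and retrieve steps are often built around an inverted index, a map from each token to the documents containing it. The sketch below assumes that design for illustration; it covers only tokenize, store, search, and retrieve, and the document IDs and texts are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Simplified tokenizer: lowercase runs of letters, digits, underscores.
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    # Store: map each token to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    # Search: documents containing every query token (AND semantics).
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents.
documents = {
    "memo-1": "Reserves were transferred before the quarter closed.",
    "email-7": "Dinner reservations at eight.",
    "memo-2": "Profits and reserves both rose.",
}
index = build_index(documents)

# Retrieve: look the matching IDs back up in the document store.
for doc_id in sorted(search(index, "profits reserves")):
    print(doc_id, "->", documents[doc_id])
```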
21. NATURAL LANGUAGE PROCESSING
How do they differ?
Interface and ease-of-use
De-duplication and versioning
Supported languages
Optical character recognition (OCR)
File formats, e.g., Word, WordPerfect, PDF, HTML
Ability to scale to large databases.
22. MACHINE LEARNING
Definition: Automated classification and prediction on data.
Examples:
Product recommenders, a la Amazon
Computer vision – is it a cat?
Sentiment analysis
Topic classification
Document clustering
At least two stages to machine learning:
Training
Classification
23. MACHINE LEARNING
Learning
Machine learning requires “learning” or “training.”
There are two types of training:
Supervised
Unsupervised
The goal of training is to determine a mapping from input
features to a set of target classes.
24. MACHINE LEARNING
Learning
Imagine a student given a small list of organisms and
descriptions. The student is tasked to assign the organisms into
groups based on these descriptions. Where do the groups come
from?
Supervised: The teacher provides the answers.
Unsupervised: The teacher provides nothing.
When the student is done with the task, the teacher checks the
student's responses and decides if the student has learned.
In our example, the teacher will typically provide the “canonical” domains
and kingdoms of biology. However, most real-world problem domains are
not so well-studied.
25. MACHINE LEARNING
Learning
What if the teacher gave the student some of the answers?
This is semi-supervised learning.
Supervised: The teacher provides the answers.
Semi-supervised: The teacher provides some answers.
Unsupervised: The teacher provides nothing.
26. MACHINE LEARNING
Classification
The student has now learned to map from an organism's
description to a group.
Now, the student is sent out into the field to use their
knowledge to classify newly discovered organisms. They
observe the organisms and document the features they learned
to use. Then, they apply the learned rules to determine the
class of organism.
27. MACHINE LEARNING
This is exactly how predictive coding works!
Organisms : Documents
Descriptions : Natural language features or models
Semi-supervised : Sample coding
The goal of predictive coding in discovery is to learn to classify
documents based on natural language features, typically into
relevant/irrelevant or privileged/unprivileged.
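The two stages, training on coded samples and then classifying new documents, can be sketched with a small Naive Bayes classifier written from scratch. This is an illustration under stated assumptions, not a description of what any vendor's predictive coding product does: the sample documents and labels are invented, tokens are assumed to be already stemmed, and add-one smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # Training: count documents per class and tokens per class.
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def classify(model, tokens):
    # Classification: pick the class with the highest log-probability.
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            if token in vocab:  # tokens never seen in training are skipped
                count = token_counts[label][token]
                # Add-one smoothing avoids a zero probability.
                score += math.log((count + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical coded sample (the "answers" a reviewer provides).
coded_sample = [
    (["reserve", "transfer", "profit"], "relevant"),
    (["profit", "reserve", "report"], "relevant"),
    (["dinner", "reservation", "friday"], "irrelevant"),
    (["lunch", "friday", "menu"], "irrelevant"),
]
model = train(coded_sample)
print(classify(model, ["reserve", "profit", "quarter"]))
print(classify(model, ["friday", "dinner"]))
```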
28. MACHINE LEARNING
Some Machine Learning Algorithms
Supervised
Statistical models
Bayesian, e.g., Naïve Bayes Classification
Frequentist, e.g., Ordinary Least Squares.
Neural Networks (NN)
Support Vector Machines (SVM)
Random Forests (RF)
Genetic Algorithms (GA)
Semi/unsupervised
Neural Networks (NN)
Clustering
K-means
Hierarchical
Radial Basis (RBF)
Graph
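As one concrete unsupervised example from this list, here is a minimal K-means sketch on a single numeric feature. The data points, the starting centroids, and the fixed iteration count are assumptions made to keep the example short and deterministic; real implementations work in many dimensions and choose starting points more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical feature values, one per document.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the groups come entirely from the data, which is what distinguishes this from the supervised algorithms above.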
29. MACHINE LEARNING
Notes on Algorithm Diversity
Not all algorithms return scores; some are binary.
True, True, False
0.9, 0.7, 0.1
Not all algorithms support more than two classes.
Cat, Dog, Mouse
Cat, Not Cat
Not all algorithms scale similarly.
1M documents = 1 day
10M documents = {10 days, 100 days, 1000 days}
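Two of these notes lend themselves to a quick worked example. The 0.5 cutoff is an assumption, as is the reading of the three scaling figures as linear, quadratic, and cubic growth in running time; the function names are hypothetical.

```python
def to_binary(scores, threshold=0.5):
    # A score-returning algorithm can be reduced to a binary one by
    # choosing a cutoff; the reverse is not possible.
    return [s >= threshold for s in scores]

def projected_days(base_docs, base_days, new_docs, exponent):
    # If running time grows like n**exponent, scale the known timing.
    return base_days * (new_docs / base_docs) ** exponent

print(to_binary([0.9, 0.7, 0.1]))

# 1M documents = 1 day; project 10M documents under three growth rates.
for name, exponent in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    print(name, projected_days(1_000_000, 1, 10_000_000, exponent))
```

Under these assumptions a tenfold increase in documents costs 10, 100, or 1000 days, so the scaling behavior is worth asking a vendor about before the collection grows.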
30. THANKS!
You can get these slides on my blog – http://bommaritollc.com/blog/.
Michael J Bommarito II
CEO, Bommarito Consulting, LLC
Email: michael@bommaritollc.com
Web: http://bommaritollc.com/
31. REFERENCES
Books and Wiki Pages
A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau.
http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
The Elements of Statistical Learning.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Wiki – Machine Learning.
http://en.wikipedia.org/wiki/Machine_learning
Wiki – Machine Learning Algorithms.
http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
Software
Natural Language Toolkit (NLTK).
http://nltk.org/
Stanford NLP Group.
http://nlp.stanford.edu/software/
Weka.
http://www.cs.waikato.ac.nz/ml/weka/
R.
http://www.r-project.org/
SAS Predictive Analytics and Data Mining.
http://www.sas.com/technologies/analytics/datamining/index.html