NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING FOR DISCOVERY

MSU Law
Electronic Discovery
Fall 2012
Week 9
GOALS

                     Understand the BLACK BOX.
 Natural language processing
    Mathematical and linguistic concepts
    Models of representation
    Real-world application

 Machine learning
    Common pre-processing and learning algorithms
    Real-world application

 Communicate with software and service vendors!




© Bommarito Consulting
BLACK BOX

 How do we characterize a black box?




   [Diagram: Inputs → black box with Parameters (example values: 3, English, medium) → Outputs]
BLACK BOX




 Secret: Most black boxes are very similar inside.

 We're going to learn to identify the common parts.




NATURAL LANGUAGE PROCESSING

 Definition: Dealing with real-world text in an automated,
  reproducible way.

 Often referred to as NLP.

 Used somewhat interchangeably with computational
  linguistics.




NATURAL LANGUAGE PROCESSING

Let's start with some text.

   “Hurricane Sandy grounded 3,200 flights scheduled for today and
   tomorrow, prompted New York to suspend subway and bus service and
   forced the evacuation of the New Jersey shore as it headed toward land
   with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its
   path north, may be capable of inflicting as much as $18 billion in
   damage when it barrels into New Jersey tomorrow and knock out power
   to millions for a week or more, according to forecasters and risk
   experts.”

   (Bloomberg article on Sandy)




NATURAL LANGUAGE PROCESSING

What kind of questions can we ask?
 Basic
    What is the structure of the text?
        Paragraphs
        Sentences
        Tokens/words
    What are the words that appear in this text?
        Nouns
            Subjects
            Direct objects
        Verbs

 Advanced
    What are the concepts that appear in this text?
    How does this text compare to other text?




NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization

   “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.”



                 • Segment Types
                    • Paragraphs
                    • Sentences
                    • Tokens


NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization
But how does it work?

 Paragraphs
    Two consecutive line breaks
    A hard line break followed by an indent

 Sentences
    Period, except abbreviation, ellipsis within quotation, etc.

 Tokens and Words
    Whitespace
    Punctuation

Remember what real-world text looks like: think text messages and email.
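
The rules above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library's `re` module, not a description of how any particular vendor tool works. The function names, the tiny abbreviation list, and the sample text are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real tools use far larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "corp.", "e.g.", "i.e."}

def split_paragraphs(text):
    """Split on a blank line, or on a line break followed by an indent."""
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+\S)", text)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(paragraph):
    """End a sentence at a word ending in . ! or ? unless it is a known abbreviation."""
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text with no closing punctuation
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Split on whitespace and peel punctuation off into separate tokens.

    Numbers such as 3,200 or $18 and hyphenated words stay in one piece.
    """
    return re.findall(r"\$?\d+(?:[,.]\d+)*|\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Made-up sample text for the sketch.
sample = "Sandy grounded 3,200 flights. Dr. Smith left.\n\nPower may be out for a week."
paragraphs = split_paragraphs(sample)
sentences = [s for p in paragraphs for s in split_sentences(p)]
tokens = tokenize(sentences[0])
```

Even this small sketch needs a special case for "Dr." and for "3,200", which is why messy sources like text messages and email are hard to segment reliably.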


NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization
   “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.”



 Paragraphs: 2
 Sentences: 2
 Words: 561
    ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
     'today', 'and', 'tomorrow', …]


NATURAL LANGUAGE PROCESSING

What kind of questions can we ask?
We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
'today', 'and', 'tomorrow', …]

      Does the phrase “quote stuffing” occur in the text?
      How many times does “Sandy” occur?
      How often does “outage” occur after “power”?
      What percentage of tokens are numbers?
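
Each of these questions reduces to a short scan over the ordered token list. The sketch below is one way to write them in plain Python. The helper names and the sample token list are invented for illustration and are not taken from the Bloomberg text.

```python
import re

def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear consecutively in `tokens`."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def count_following(tokens, first, second):
    """How many times `second` appears immediately after `first`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def percent_numeric(tokens):
    """Share of tokens that look like numbers, e.g. 65, 3,200 or $18."""
    is_number = re.compile(r"^\$?\d+(?:[,.]\d+)*$")
    hits = sum(1 for t in tokens if is_number.match(t))
    return 100.0 * hits / len(tokens) if tokens else 0.0

# Made-up sample tokens for the sketch.
sample_tokens = ["Sandy", "grounded", "3,200", "flights", "and",
                 "Sandy", "caused", "a", "power", "outage"]

has_quote_stuffing = contains_phrase(sample_tokens, "quote stuffing")
sandy_count = sample_tokens.count("Sandy")
outage_after_power = count_following(sample_tokens, "power", "outage")
numeric_share = percent_numeric(sample_tokens)
```

All four answers depend on the tokens being kept in order, except the simple count of “Sandy”.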




NATURAL LANGUAGE PROCESSING

An Aside on Storage

Data: The word 'the' ten times and the word 'a' ten times.


 Representation 1 - Ordered List:
   ['the', 'a', 'the', 'a', 'the', 'a', …]

 Representation 2 - Term Frequency:
   [('the', 10), ('a', 10)]




NATURAL LANGUAGE PROCESSING

An Aside on Storage
 Representation 1 - Ordered List:
   ['the', 'a', 'the', 'a', 'the', 'a', …]

 Representation 2 - Frequency Map:
   [('the', 10), ('a', 10)]

 Tradeoffs
    Total space
    Ease of answering certain questions
    Information about context

 Not all software makes the same choice!
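
The tradeoffs can be made concrete with a small sketch. It assumes the alternating order 'the', 'a', 'the', 'a', … for the ordered list and uses Python's `collections.Counter` as one possible frequency map. The helper `count_pair` is a name invented for the example.

```python
from collections import Counter

# Representation 1: every token, in order (20 entries here).
ordered = ["the", "a"] * 10

# Representation 2: one (term, count) entry per distinct term.
frequency = Counter(ordered)

def count_pair(tokens, first, second):
    """Context question: how often does `second` directly follow `first`?"""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

space_ordered = len(ordered)        # 20 stored items
space_frequency = len(frequency)    # 2 stored items
the_count = frequency["the"]        # answered by a single lookup
the_then_a = count_pair(ordered, "the", "a")  # needs the ordered list
```

The frequency map is smaller and answers "how many?" in one lookup, but it cannot answer the context question at all, because the order was discarded when it was built.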


NATURAL LANGUAGE PROCESSING

Stopwording, Stemming, Parsing, and Tagging
       Stopwording
         Removing “filler” words like prepositions, auxiliary or infinitive verbs, and
          conjunctions.

       Stemming
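         Matching inflected nouns like dog/dogs (irregular pairs like child/children
          usually need a dictionary-based lemmatizer instead).
         Matching conjugated verbs like ground/grounded (irregular pairs like run/ran
          likewise need lemmatization).

       Parsing
         Determining the “structure” of a sentence, typically as represented by a
          grade school sentence diagram (requires grammar definition; we'll skip).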

       Tagging
         Identifying the part of speech of each token in a sentence.



NATURAL LANGUAGE PROCESSING

Stopwording
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.

     Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New
   York suspend subway bus service forced evacuation New Jersey shore headed toward
   land life-threatening wind rain.

    System, killed many 65 people Caribbean path north, may capable inflicting much
   $18 billion damage barrels New Jersey tomorrow knock power millions week, according
   forecasters risk experts.
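
Stopwording is essentially a filter against a fixed word list. The sketch below assumes a small hand-picked list that covers only this excerpt. Real stopword lists are much longer and differ from tool to tool, so the list and the function name here are illustrative only.

```python
# Hypothetical stopword list sized for this one example.
STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "as", "it",
             "its", "in", "on", "with", "which", "be", "when", "into",
             "out", "more"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lowercase form is not in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
filtered = remove_stopwords(tokens)
```

Here `filtered` drops "for" and "and" and keeps the other eight tokens, matching the start of the stopworded text above.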




NATURAL LANGUAGE PROCESSING

Stopwording + Stemming
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.

    Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York
   suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten
   wind rain.

    System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion
   damag barrel New Jersey tomorrow knock power million week, accord forecast risk
   expert.
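
Rule-based stemmers work by stripping suffixes. The sketch below is a deliberately crude version with four suffix rules, written only to show the idea. It is not the algorithm that produced the output above, and the suffix list and minimum stem length are assumptions.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip one common suffix, as long as at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stems = [crude_stem(w) for w in ["flights", "grounded", "inflicting", "forced", "bus"]]
irregular = [crude_stem(w) for w in ["children", "ran"]]
```

The regular forms reduce to "flight", "ground", "inflict" and "forc", and "bus" is left alone because too little would remain. The irregular forms "children" and "ran" pass through unchanged, which is the limitation of pure suffix stripping.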




NATURAL LANGUAGE PROCESSING

Tagging
   Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
  prompted New York to suspend subway and bus service and forced the evacuation of
  the New Jersey shore as it headed toward land with life-threatening wind and rain.

   The system, which killed as many as 65 people in the Caribbean on its path north,
  may be capable of inflicting as much as $18 billion in damage when it barrels into New
  Jersey tomorrow and knock out power to millions for a week or more, according to
  forecasters and risk experts.

    [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights',
   'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]




NATURAL LANGUAGE PROCESSING

Back to the black box.




   [Diagram: Inputs → black box with Parameters (example values: 3, English, medium) → Outputs]
NATURAL LANGUAGE PROCESSING

 Let's say that we're investigating Enron for accounting fraud
related to its reserve reporting and transfers.

 We want to look for any material that discusses reserves and
profits in the same sentence. However, we want cases where
these words are used as nouns; we're not interested in dinner
reservations.


             Inputs           Parameters     Output
             Memos            Stopword: No   Memos
             Research         Stem: Yes      Research
             Emails           Tag: Yes       Emails
             Texts            Search: …      Texts
             Transcriptions                  Transcriptions
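
With stemming and tagging switched on, the search itself can be a simple filter over tagged sentences. The sketch below assumes the sentences have already been tokenized and tagged with Penn Treebank style tags like those shown on the tagging slide. The three sample sentences and their tags are hand-written for illustration and are not real Enron material, and `simple_stem` is a crude stand-in for a real stemmer.

```python
def simple_stem(word):
    """Crude plural stripping, standing in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def noun_stems(tagged_sentence):
    """Stems of every token tagged as a noun (NN, NNS, NNP, NNPS)."""
    return {simple_stem(tok.lower()) for tok, tag in tagged_sentence
            if tag.startswith("NN")}

def mentions_both_as_nouns(tagged_sentence, targets=("reserve", "profit")):
    """True if every target stem occurs as a noun in the same sentence."""
    stems = noun_stems(tagged_sentence)
    return all(t in stems for t in targets)

# Hand-tagged, invented examples.
tagged_sentences = [
    [("We", "PRP"), ("moved", "VBD"), ("reserves", "NNS"), ("to", "TO"),
     ("boost", "VB"), ("profits", "NNS"), (".", ".")],
    [("Please", "UH"), ("reserve", "VB"), ("a", "DT"), ("table", "NN"),
     ("for", "IN"), ("dinner", "NN"), (".", ".")],
    [("Profits", "NNS"), ("rose", "VBD"), (".", ".")],
]

hits = [i for i, s in enumerate(tagged_sentences) if mentions_both_as_nouns(s)]
```

Only the first sentence is a hit. The second is skipped because "reserve" is tagged as a verb, and the third because it mentions profits alone.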

NATURAL LANGUAGE PROCESSING

 In general, all document search and discovery software
combines the elements discussed above.
      Segment
      Tokenize
      Stopword
      Stem
      Parse
      Tag
      Store
      Search
      Retrieve




NATURAL LANGUAGE PROCESSING

 How do they differ?
      Interface and ease-of-use
      De-duplication and versioning
      Supported languages
      Optical character recognition (OCR)
      File formats, e.g., Word, WordPerfect, PDF, HTML
      Ability to scale to large databases.




MACHINE LEARNING

 Definition: Automated classification and prediction on data.

 Examples:
      Product recommenders, a la Amazon
      Computer vision – is it a cat?
      Sentiment analysis
      Topic classification
      Document clustering


 At least two stages to machine learning:
    Training
    Classification



MACHINE LEARNING

Learning

 Machine learning requires “learning” or “training.”

 There are two types of training:
    Supervised
    Unsupervised


 The goal of training is to determine a mapping from input
  features to a set of target classes.




MACHINE LEARNING

Learning
  Imagine a student given a small list of organisms and
descriptions. The student is tasked with assigning the organisms to
groups based on these descriptions. Where do the groups come
from?

 Supervised: The teacher provides the answers.
 Unsupervised: The teacher provides nothing.

 When the student is done with the task, the teacher checks the
student's responses and decides if the student has learned.

 In our example, the teacher will typically provide the “canonical” domains
and kingdoms of biology. However, most real-world problem domains are
not so well-studied.



MACHINE LEARNING

Learning

 What if the teacher gave the student some of the answers?

 This is semi-supervised learning.

 Supervised: The teacher provides the answers.
 Semi-supervised: The teacher provides some answers.
 Unsupervised: The teacher provides nothing.




MACHINE LEARNING

Classification

 The student has now learned to map from an organism's
description to a group.

 Now, the student is sent out into the field to use their
knowledge to classify newly discovered organisms. They
observe the organisms and document the features they learned
to use. Then, they apply the learned rules to determine the
class of organism.




MACHINE LEARNING

This is exactly how predictive coding works!

 Organisms : Documents
 Descriptions : Natural language features or models
 Semi-supervised : Sample coding

 The goal of predictive coding in discovery is to learn to classify
documents based on natural language features, typically into
relevant/irrelevant or privileged/unprivileged.
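
The two stages, training on sample-coded documents and then classifying new ones, can be sketched with a small Naive Bayes classifier in plain Python. This is an illustration of the general idea only. Predictive coding products may use entirely different algorithms and features, and the class name, the add-one smoothing, and the four "coded" documents below are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        # Training: count documents per class and words per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def scores(self, tokens):
        # Log-probability score per class; words never seen in training are ignored.
        total_docs = sum(self.class_counts.values())
        result = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            log_p = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:
                    log_p += math.log((counts[t] + 1) / denom)
            result[label] = log_p
        return result

    def predict(self, tokens):
        # Classification: pick the class with the highest score.
        s = self.scores(tokens)
        return max(s, key=s.get)

# Invented sample coding: token lists a reviewer has already labeled.
coded_docs = [["reserve", "transfer", "profit"], ["profit", "reserve", "report"],
              ["lunch", "friday", "menu"], ["dinner", "reservation", "friday"]]
coded_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = NaiveBayes().fit(coded_docs, coded_labels)
first = model.predict(["reserve", "profit", "quarter"])
second = model.predict(["friday", "lunch"])
```

In the analogy, `fit` is the student studying the teacher's answers and `predict` is the student in the field. With this toy data, `first` comes out as "relevant" and `second` as "irrelevant".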




MACHINE LEARNING

Some Machine Learning Algorithms
 Supervised
    Statistical models
       Bayesian, e.g., Naïve Bayes Classification
       Frequentist, e.g., Ordinary Least Squares.
    Neural Networks (NN)
    Support Vector Machines (SVM)
    Random Forests (RF)
    Genetic Algorithms (GA)
 Semi/unsupervised
    Neural Networks (NN)
    Clustering
          K-means
          Hierarchical
          Radial Basis (RBF)
          Graph

MACHINE LEARNING

Notes on Algorithm Diversity

 Not all algorithms return scores; some are binary.
    True, True, False
    0.9, 0.7, 0.1
 Not all algorithms support more than two classes.
    Cat, Dog, Mouse
    Cat, Not Cat
 Not all algorithms scale similarly.
    1M documents = 1 day
    10M documents = {10 days, 100 days, 1000 days}
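
Two of these points lend themselves to quick arithmetic. The sketch below turns scores into binary decisions with a cutoff, and projects running time under different growth rates. The 0.5 threshold is an assumed value, and reading the three day counts as linear, quadratic, and cubic growth is one interpretation of the figures above, not something the slide states.

```python
def to_binary(scores, threshold=0.5):
    """Turn scores into yes/no decisions using an assumed cutoff."""
    return [s >= threshold for s in scores]

def projected_days(base_docs, base_days, new_docs, exponent):
    """Scale a known running time, assuming cost grows like n ** exponent."""
    return base_days * (new_docs / base_docs) ** exponent

decisions = to_binary([0.9, 0.7, 0.1])

# If 1M documents take 1 day, how long might 10M take?
estimates = {
    name: projected_days(1_000_000, 1, 10_000_000, k)
    for name, k in [("linear", 1), ("quadratic", 2), ("cubic", 3)]
}
```

With these inputs the decisions are True, True, False, and the estimates are 10, 100, and 1000 days, so the same tenfold increase in documents can mean very different waits depending on the algorithm.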




THANKS!

        You can get these slides on my blog – http://bommaritollc.com/blog/.




                              Michael J Bommarito II
                                 CEO, Bommarito Consulting, LLC
                                 Email: michael@bommaritollc.com
                                 Web: http://bommaritollc.com/




REFERENCES

 Books and Wiki Pages
     A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
         http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
     Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya,
      Zhang, Damerau.
         http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
     The Elements of Statistical Learning.
         http://www-stat.stanford.edu/~tibs/ElemStatLearn/
     Wiki – Machine Learning.
         http://en.wikipedia.org/wiki/Machine_learning
     Wiki – Machine Learning Algorithms.
         http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
 Software
     Natural Language Toolkit (NLTK).
         http://nltk.org/
     Stanford NLP Group.
         http://nlp.stanford.edu/software/
     Weka.
         http://www.cs.waikato.ac.nz/ml/weka/
     R.
         http://www.r-project.org/
     SAS Predictive Analytics and Data Mining.
         http://www.sas.com/technologies/analytics/datamining/index.html

Contenu connexe

En vedette

Bommarito Presentation for University of Houston Computational Law Conference
Bommarito Presentation for University of Houston Computational Law ConferenceBommarito Presentation for University of Houston Computational Law Conference
Bommarito Presentation for University of Houston Computational Law Conference
mjbommar
 
Preserve the Luxury Or Extend the Brand? HBR Case Study
Preserve the Luxury Or Extend the Brand? HBR Case StudyPreserve the Luxury Or Extend the Brand? HBR Case Study
Preserve the Luxury Or Extend the Brand? HBR Case Study
Sameer Mathur
 

En vedette (16)

Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
Bommarito Presentation for University of Houston Computational Law Conference
Bommarito Presentation for University of Houston Computational Law ConferenceBommarito Presentation for University of Houston Computational Law Conference
Bommarito Presentation for University of Houston Computational Law Conference
 
Natural Language Processing and Machine Learning
Natural Language Processing and Machine LearningNatural Language Processing and Machine Learning
Natural Language Processing and Machine Learning
 
Thinaire Accelerated Aire
Thinaire Accelerated AireThinaire Accelerated Aire
Thinaire Accelerated Aire
 
Magazine layout assignment
Magazine layout assignmentMagazine layout assignment
Magazine layout assignment
 
SBM x
SBM xSBM x
SBM x
 
Assignment 1 l'oreal
Assignment 1   l'orealAssignment 1   l'oreal
Assignment 1 l'oreal
 
Comparative Analysis
Comparative AnalysisComparative Analysis
Comparative Analysis
 
Lakme brand
Lakme brandLakme brand
Lakme brand
 
Preserve the Luxury Or Extend the Brand? HBR Case Study
Preserve the Luxury Or Extend the Brand? HBR Case StudyPreserve the Luxury Or Extend the Brand? HBR Case Study
Preserve the Luxury Or Extend the Brand? HBR Case Study
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Brand Image and Identity
Brand Image and IdentityBrand Image and Identity
Brand Image and Identity
 
Upgrade Your Business Skills
Upgrade Your Business SkillsUpgrade Your Business Skills
Upgrade Your Business Skills
 
Brand Audit on Loreal
Brand Audit on LorealBrand Audit on Loreal
Brand Audit on Loreal
 
Health n Wellness Marketing
Health n Wellness MarketingHealth n Wellness Marketing
Health n Wellness Marketing
 
Lakme Absolute Brand Extension Analysis
Lakme Absolute Brand Extension AnalysisLakme Absolute Brand Extension Analysis
Lakme Absolute Brand Extension Analysis
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Natural Language Processing and Machine Learning for Discovery

  • 1. NATURAL LANGUAGE PROCESSING MSU Law AND MACHINE LEARNING Electronic Discovery Fall 2 01 2 FOR DISCOVERY Week 9
  • 2. GOALS Understand the BLACK BOX.  Natural language processing  Mathematical and linguistic concepts  Models of representation  Real-world application  Machine learning  Common pre-processing and learning algorithms  Real-world application  Communicate with software and service vendors! © Bommarito Consulting
  • 3. BLACK BOX  How do we characterize a black box? 3 English medium Inputs Parameters Outputs © Bommarito Consulting
  • 4. BLACK BOX  Secret: Most black boxes are ? very similar inside.  We‟re going to learn to identify the common parts. © Bommarito Consulting
  • 5. NATURAL LANGUAGE PROCESSING  Definition: Dealing with real-world text in an automated, reproducible way.  Often referred to as NLP.  Used somewhat interchangeably with computational linguistics. © Bommarito Consulting
  • 6. NATURAL LANGUAGE PROCESSING Let‟s start with some text. “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy) © Bommarito Consulting
  • 7. NATURAL LANGUAGE PROCESSING What kind of questions can we ask?  Basic  What is the structure of the text?  Paragraphs  Sentences  Tokens/words  What are the words that appear in this text?  Nouns  Subjects  Direct objects  Verbs  Advanced  What are the concepts that appear in this text?  How does this text compare to other text? © Bommarito Consulting
  • 8. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” • Segments Types • Paragraphs • Sentences • Tokens © Bommarito Consulting
  • 9. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization But how does it work?  Paragraphs  Two consecutive line breaks  A hard line break followed by an indent  Sentences  Period, except abbreviation, ellipsis within quotation, etc.  Tokens and Words  Whitespace  Punctuation Remember what real -world text looks like – think text and email. © Bommarito Consulting
  • 10. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”  Paragraphs: 2  Sentences: 2  Words: 561 .  ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow„, …] © Bommarito Consulting
  • 11. NATURAL LANGUAGE PROCESSING What kind of questions can we ask? We now have an ordered list of tokens. ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow„, …]  Does the word phrase “quote stuffing” occur in the text?  How many times does “Sandy” occur?  How often does “outage” occur after “power?”  What percentage of tokens are numbers? © Bommarito Consulting
  • 12. NATURAL LANGUAGE PROCESSING An Aside on Storage D ata: The word „the‟ ten times and the word ‘a’ ten times.  Representation 1 - Ordered List:  [‘the‟, „a‟, „the‟, „a‟, „the‟, „a‟, …]  Representation 2 – Term Frequency:  [(„the‟, 10), („a‟, 10)] © Bommarito Consulting
  • 13. NATURAL LANGUAGE PROCESSING An Aside on Storage  Representation 1 - Ordered List:  [‘the‟, „a‟, „the‟, „a‟, „the‟, „a‟, …]  Representation 2 - Frequency Map:  [(„the‟, 10), („a‟, 10)]  Tradeoffs  Total space  Ease of answering certain questions  Information about context  Not all software make the same choice! © Bommarito Consulting
  • 14. NATURAL LANGUAGE PROCESSING Stopwording, Stemming, Parsing, and Tagging  Stopwording  Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.  Stemming  Matching declined nouns like dog/dogs or child/children.  Matching conjugated verbs like run/ran.  Parsing  Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we‟ll skip).  Tagging  Identifying the part of speech of each token in a sentence. © Bommarito Consulting
  • 15. NATURAL LANGUAGE PROCESSING Stopwording Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts. © Bommarito Consulting
  • 16. NATURAL LANGUAGE PROCESSING Stopwording + Stemming Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert. © Bommarito Consulting
  • 17. NATURAL LANGUAGE PROCESSING Tagging Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …] © Bommarito Consulting
  • 18. NATURAL LANGUAGE PROCESSING Back to the black box. 3 English medium Inputs Parameters Outputs © Bommarito Consulting
  • 19. NATURAL LANGUAGE PROCESSING Let‟s say that we‟re investigating Enron for accounting fraud related to its reserve reporting and transfers. We want to look for any material that discusses reserves and profits in the same sentence. However, we want cases where these words are used as nouns; we‟re not interested in dinner reservations. Inputs Parameters Output Memos Stopword: No Memos Research Stem: Yes Research Emails Tag: Yes Emails Texts Search: … Texts Transcriptions Transcriptions © Bommarito Consulting
  • 20. NATURAL LANGUAGE PROCESSING In general, all document search and discovery software combines the elements discussed above.  Segment  Tokenize  Stopword  Stem  Parse  Tag  Store  Search  Retrieve © Bommarito Consulting
  • 21. NATURAL LANGUAGE PROCESSING
      How do they differ?
      - Interface and ease of use
      - De-duplication and versioning
      - Supported languages
      - Optical character recognition (OCR)
      - File formats, e.g., Word, WordPerfect, PDF, HTML
      - Ability to scale to large databases
  • 22. MACHINE LEARNING
      Definition: Automated classification and prediction on data.
      Examples:
      - Product recommenders, a la Amazon
      - Computer vision: is it a cat?
      - Sentiment analysis
      - Topic classification
      - Document clustering
      Machine learning has at least two stages:
      - Training
      - Classification
  • 23. MACHINE LEARNING: Learning
      Machine learning requires “learning” or “training.”
      There are two types of training:
      - Supervised
      - Unsupervised
      The goal of training is to determine a mapping from input features to a set of target classes.
  • 24. MACHINE LEARNING: Learning
      Imagine a student given a small list of organisms and descriptions. The student is tasked with assigning the organisms to groups based on these descriptions. Where do the groups come from?
      - Supervised: The teacher provides the answers.
      - Unsupervised: The teacher provides nothing.
      When the student is done with the task, the teacher checks the student's responses and decides whether the student has learned.
      In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
  • 25. MACHINE LEARNING: Learning
      What if the teacher gave the student some of the answers? This is semi-supervised learning.
      - Supervised: The teacher provides the answers.
      - Semi-supervised: The teacher provides some answers.
      - Unsupervised: The teacher provides nothing.
  • 26. MACHINE LEARNING: Classification
      The student has now learned to map from an organism's description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
  • 27. MACHINE LEARNING
      This is exactly how predictive coding works!
      - Organisms : Documents
      - Descriptions : Natural language features or models
      - Semi-supervised : Sample coding
      The goal of predictive coding in discovery is to learn to classify documents based on natural language features, typically into relevant/irrelevant or privileged/unprivileged.
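As a rough picture of predictive coding, the sketch below trains a small Naive Bayes classifier on a handful of reviewer-coded documents and then classifies new text. It is a plain supervised example on the coded sample, not the semi-supervised workflow a vendor product might use, and the sample documents, labels, and function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def train(coded_docs):
    """coded_docs: list of (text, label) pairs coded by a reviewer."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in coded_docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(model, text):
    """Pick the label with the highest log prior plus log likelihood."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing so unseen words do not zero out a label.
        denominator = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical sample coding.
sample = [
    ("reserve estimates were revised to lift reported profits", "relevant"),
    ("transfer the reserves before the quarterly profit report", "relevant"),
    ("lunch menu for the holiday party", "irrelevant"),
    ("parking garage closed for the holiday", "irrelevant"),
]
model = train(sample)
print(classify(model, "revised reserves and profits"))
print(classify(model, "holiday party parking"))
```

The coded sample plays the role of the teacher's answers; every document outside the sample is then classified with the learned word statistics.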
  • 28. MACHINE LEARNING: Some Machine Learning Algorithms
      - Supervised
          - Statistical models
              - Bayesian, e.g., Naïve Bayes Classification
              - Frequentist, e.g., Ordinary Least Squares
          - Neural Networks (NN)
          - Support Vector Machines (SVM)
          - Random Forests (RF)
          - Genetic Algorithms (GA)
      - Semi/unsupervised
          - Neural Networks (NN)
          - Clustering
              - K-means
              - Hierarchical
              - Radial Basis (RBF)
              - Graph
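To give a feel for the unsupervised side of this list, here is a bare-bones k-means sketch. The points stand in for documents described by two hypothetical numeric features (say, counts of two terms), and the starting centroids are supplied by hand so the run is repeatable; real implementations choose starting points and stopping rules more carefully.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical two-feature document vectors and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are provided anywhere: the two groups emerge from the data alone, which is the "teacher provides nothing" case.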
  • 29. MACHINE LEARNING: Notes on Algorithm Diversity
      - Not all algorithms return scores; some are binary.
          - True, True, False
          - 0.9, 0.7, 0.1
      - Not all algorithms support more than two classes.
          - Cat, Dog, Mouse
          - Cat, Not Cat
      - Not all algorithms scale similarly.
          - 1M documents = 1 day
          - 10M documents = {10 days, 100 days, 1000 days}
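Two of these notes reduce to simple arithmetic, sketched below. The 0.5 cutoff is a hypothetical choice. The scaling function reflects one reading of the slide, namely that the three figures correspond to running time growing linearly, quadratically, or cubically with the number of documents; the slide itself does not name the growth rates.

```python
def to_binary(scores, threshold=0.5):
    """Turn scores into yes/no calls using a cutoff (0.5 is an assumption)."""
    return [score >= threshold for score in scores]


def projected_days(n_docs, base_docs=1_000_000, base_days=1.0, exponent=1):
    """Scale a measured running time to a new collection size.

    exponent=1 models linear growth, 2 quadratic, 3 cubic.
    """
    return base_days * (n_docs / base_docs) ** exponent


print(to_binary([0.9, 0.7, 0.1]))
for exponent in (1, 2, 3):
    print(exponent, projected_days(10_000_000, exponent=exponent))
```

A scoring algorithm can always be reduced to a binary one by picking a cutoff, but not the other way around, which is why it is worth asking a vendor which kind of output their tool produces.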
  • 30. THANKS!
      You can get these slides on my blog: http://bommaritollc.com/blog/.
      - Michael J Bommarito II
      - CEO, Bommarito Consulting, LLC
      - Email: michael@bommaritollc.com
      - Web: http://bommaritollc.com/