CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 5, 2 August 2007
WORDS: The Building Blocks of Language
Tokens, Types and Texts
Extracting text from the Web
Extracting text from NLTK Corpora
Brown Corpus
Corpus Linguistics
What’s a word?
Another example
Some Useful Empirical Observations
Common words in Tom Sawyer
but words in NL have an uneven distribution…
Text properties (formalized)
Sample word frequency data
Frequency of frequencies
Zipf’s Law
Zipf curve
Predicting Occurrence Frequencies
Fraction of words with frequency n is: 1/(n(n+1))
Fraction of words appearing only once is therefore ½.
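The fraction of words with frequency n can be derived in a few lines. This is a sketch under stated assumptions, not necessarily the derivation on the original slide: Zipf's law is taken in the form f = A/r, and the symbols A (a constant), r (rank), I_n (number of words occurring exactly n times) and D (number of distinct words) are introduced here for illustration.

```latex
\begin{align*}
f &= \frac{A}{r} && \text{(Zipf's law, assumed form)} \\
r_n &= \frac{A}{n} && \text{(last rank whose frequency is at least } n\text{)} \\
I_n &= r_n - r_{n+1} = \frac{A}{n} - \frac{A}{n+1} = \frac{A}{n(n+1)} \\
D &= r_1 = A && \text{(rank of the last word, which occurs once)} \\
\frac{I_n}{D} &= \frac{1}{n(n+1)}, \qquad \frac{I_1}{D} = \frac{1}{2}
\end{align*}
```

Under the same assumptions, a fraction 1/6 of the vocabulary would occur exactly twice and 1/12 exactly three times.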
Explanations for Zipf’s Law
Zipf’s First Law
Zipf’s Second Law
Zipf’s Third Law
Zipf’s Law Impact on Language Analysis
Vocabulary Growth
Heaps’ Law
Heaps’ Law Data
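Heaps' law is usually stated as V = K n^β, where n is the number of tokens read so far and V is the number of distinct types seen. The slide's own statement did not survive extraction, so that form is an assumption here. The sketch below measures vocabulary growth on a token list and fits K and β by least squares in log-log space; the function names `vocabulary_growth` and `fit_heaps` are hypothetical.

```python
import math


def vocabulary_growth(tokens):
    """Return V(n): the number of distinct types seen after each of the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth):
    """Least-squares fit of log V = log K + beta * log n. Returns (K, beta).

    Assumes the Heaps' law form V = K * n**beta and at least two data points.
    """
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the dog sat on the cat".split()
    growth = vocabulary_growth(tokens)
    print(growth)
    print(fit_heaps(growth))
```

A toy sentence is far too short to give meaningful constants; the fit only becomes informative on a corpus of many thousands of tokens.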
Word counts are interesting...
Zipf’s Law on Tom Sawyer
Plot of Zipf’s Law
Plot of Zipf’s Law (cont’d)
Zipf’s Law, so what?
N-Grams and Corpus Linguistics
A bad language model
N-grams & Language Modeling
Herman is reprinted with permission from LaughingStock Licensing Inc., Ottawa Canada. All rights reserved.
What’s a Language Model
What’s a language model for?
Next Word Prediction
Human Word Prediction
Claim
Applications
Overview
Simple N-Grams
N-grams
Computing the Probability of a Word Sequence
Bigram Model
Using N-Grams
The n-gram Approximation
n-grams, continued
N-grams for Language Generation
Unigram: 5. …Here words are chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
Bigram: 6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
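The two approximations quoted above can be imitated with a few lines of sampling code. This is a minimal sketch, not the procedure used to produce the quoted text: the toy corpus, the function names and the fallback to a unigram draw when a word has no observed successor are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train(tokens):
    """Count unigrams and bigrams (successor counts per word) in a token list."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1
    return unigrams, bigrams


def sample(counter, rng):
    """Draw one word with probability proportional to its count."""
    words = list(counter)
    weights = [counter[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate_unigram(unigrams, length, rng):
    """Words chosen independently, with their observed frequencies."""
    return [sample(unigrams, rng) for _ in range(length)]


def generate_bigram(unigrams, bigrams, length, rng):
    """Each word is drawn from the observed successors of the previous word."""
    word = sample(unigrams, rng)
    out = [word]
    while len(out) < length:
        followers = bigrams.get(word)
        if followers:
            word = sample(followers, rng)
        else:
            # Dead end (word never seen with a successor): restart from unigrams.
            word = sample(unigrams, rng)
        out.append(word)
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ate".split()
    uni, bi = train(corpus)
    rng = random.Random(0)
    print(" ".join(generate_unigram(uni, 10, rng)))
    print(" ".join(generate_bigram(uni, bi, 10, rng)))
```

The unigram generator ignores context entirely, while the bigram generator only ever emits word pairs that occurred in the training text (apart from the dead-end restarts), which is why its output reads as locally plausible.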
N-Gram Models of Language
Counting Words in Corpora
Terminology
Corpora
Simple N-Grams
Computing the Probability of a Word Sequence
Bigram Model
Using N-Grams
Training and Testing
A Simple Example
A Bigram Grammar Fragment from BERP

Bigram               Probability
Eat on               .16
Eat some             .06
Eat lunch            .06
Eat dinner           .05
Eat at               .04
Eat a                .04
Eat Indian           .04
Eat today            .03
Eat Thai             .03
Eat breakfast        .03
Eat in               .02
Eat Chinese          .02
Eat Mexican          .02
Eat tomorrow         .01
Eat dessert          .007
Eat British          .001
<start> I            .25
<start> I’d          .06
<start> Tell         .04
<start> I’m          .02
I want               .32
I would              .29
I don’t              .08
I have               .04
Want to              .65
Want a               .05
Want some            .04
Want Thai            .01
To eat               .26
To have              .14
To spend             .09
To be                .02
British food         .60
British restaurant   .15
British cuisine      .01
British lunch        .01
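With the fragment above, the probability of a sentence under a bigram model is the product of the conditional probabilities of each word given the previous one. The sketch below copies a handful of entries from the fragment (lowercased, except for proper names) and multiplies them; the sentence chosen, the function name and the treatment of unlisted bigrams as probability 0 are illustrative assumptions.

```python
# A few entries copied from the BERP bigram fragment: P(second word | first word).
BIGRAM_P = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("eat", "Chinese"): 0.02,
    ("British", "food"): 0.60,
}


def sentence_probability(words, table=BIGRAM_P):
    """Multiply P(w_i | w_{i-1}) over the sentence, starting from <start>."""
    prob = 1.0
    prev = "<start>"
    for w in words:
        # Bigrams missing from the table get probability 0 (no smoothing here).
        prob *= table.get((prev, w), 0.0)
        prev = w
    return prob


if __name__ == "__main__":
    p = sentence_probability("I want to eat British food".split())
    print(p)  # .25 * .32 * .65 * .26 * .001 * .60, about 8.1e-06
```

Products of many small probabilities underflow quickly, so practical implementations usually sum log probabilities instead of multiplying raw values.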
BERP Bigram Counts (rows: first word; columns: second word)

             I   Want     To    Eat  Chinese   Food  Lunch
I            8   1087      0     13        0      0      0
Want         3      0    786      0        6      8      6
To           3      0     10    860        3      0     12
Eat          0      0      2      0       19      2     52
Chinese      2      0      0      0        0    120      1
Food        19      0     17      0        0      0      0
Lunch        4      0      0      0        0      1      0
BERP Bigram Probabilities

Unigram counts:
   I   Want     To    Eat  Chinese   Food  Lunch
3437   1215   3256    938      213   1506    459
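The bigram probabilities follow from the counts by dividing each bigram count by the unigram count of its first word (the maximum-likelihood estimate). The sketch below uses the BERP bigram and unigram counts given on these slides; the variable and function names are hypothetical.

```python
WORDS = ["I", "want", "to", "eat", "Chinese", "food", "lunch"]

# BERP bigram counts: COUNTS[i][j] is the count of WORDS[i] followed by WORDS[j].
COUNTS = [
    [8, 1087, 0, 13, 0, 0, 0],    # I
    [3, 0, 786, 0, 6, 8, 6],      # want
    [3, 0, 10, 860, 3, 0, 12],    # to
    [0, 0, 2, 0, 19, 2, 52],      # eat
    [2, 0, 0, 0, 0, 120, 1],      # Chinese
    [19, 0, 17, 0, 0, 0, 0],      # food
    [4, 0, 0, 0, 0, 1, 0],        # lunch
]

# BERP unigram counts for the same seven words.
UNIGRAM = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
           "Chinese": 213, "food": 1506, "lunch": 459}


def bigram_probability(first, second):
    """Maximum-likelihood estimate P(second | first) = C(first second) / C(first)."""
    i = WORDS.index(first)
    j = WORDS.index(second)
    return COUNTS[i][j] / UNIGRAM[first]


if __name__ == "__main__":
    for first, second in [("I", "want"), ("want", "to"), ("to", "eat")]:
        print(first, second, round(bigram_probability(first, second), 2))
```

The rounded values for "I want", "want to" and "to eat" come out as .32, .65 and .26, which agrees with the bigram grammar fragment shown earlier.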
What do we learn about the language?
Approximating Shakespeare
N-Gram Training Sensitivity
Some Useful Empirical Observations
Smoothing Techniques
Add-one Smoothing
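Add-one (Laplace) smoothing is usually written as P*(w_n | w_n-1) = (C(w_n-1 w_n) + 1) / (C(w_n-1) + V), where V is the vocabulary size. The slide's own formula did not survive extraction, so this standard form is an assumption, as is the vocabulary size of 1616 used in the example (the figure commonly quoted for BERP). The function name is hypothetical.

```python
def add_one_probability(bigram_count, context_count, vocab_size):
    """Add-one estimate: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_count + 1) / (context_count + vocab_size)


if __name__ == "__main__":
    V = 1616  # assumed BERP vocabulary size, not stated on these slides

    # Seen bigram "I want": C = 1087, C(I) = 3437.
    print(add_one_probability(1087, 3437, V))

    # Unseen bigram "I to": C = 0, so the unsmoothed estimate would be 0.
    print(add_one_probability(0, 3437, V))

    # Sanity check on a toy 3-word vocabulary: the smoothed row still sums to 1.
    row = {"a": 2, "b": 0, "c": 1}
    total = sum(row.values())
    print(sum(add_one_probability(c, total, len(row)) for c in row.values()))
```

With these numbers the estimate for "I want" drops from about .32 to about .22, which shows how much probability mass add-one smoothing moves from seen to unseen bigrams when V is large relative to the counts.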
Witten-Bell Discounting
Good-Turing Discounting
Backoff methods (e.g. Katz ’87)
Summary
