CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 5, 2 August 2007
Words: The Building Blocks of Language
Tokens, Types and Texts
Extracting text from the Web
Extracting text from NLTK Corpora
Brown Corpus
Corpus Linguistics
What’s a word?
Another example
Some Useful Empirical Observations
Common words in Tom Sawyer, but words in natural language have an uneven distribution…
Text properties (formalized): sample word frequency data
Frequency of frequencies
Zipf’s Law
Zipf curve
Predicting Occurrence Frequencies
Fraction of words with frequency n is $\frac{1}{n(n+1)}$. The fraction of words appearing only once (n = 1) is therefore 1/2.
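The 1/(n(n+1)) form follows from the idealized Zipf relation f = k/r: a word occurs exactly n times when its rank lies between k/(n+1) and k/n, which covers k/(n(n+1)) ranks, and the highest rank (the vocabulary size) is about k. A minimal sketch of that arithmetic follows; the function name `fraction_with_frequency` is hypothetical and the result holds only under the idealized law, not for any particular corpus.

```python
from fractions import Fraction


def fraction_with_frequency(n: int) -> Fraction:
    """Fraction of distinct words occurring exactly n times under idealized Zipf."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return Fraction(1, n * (n + 1))


# Predicted share of the vocabulary for the first few frequencies.
fractions_by_n = {n: fraction_with_frequency(n) for n in range(1, 6)}
for n, share in fractions_by_n.items():
    print(n, share)
```

The shares telescope, so summing them over n = 1..N gives N/(N+1), which approaches 1 as N grows.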
Explanations for Zipf’s Law
Zipf’s First Law
Zipf’s Second Law
Zipf’s Third Law
Zipf’s Law Impact on Language Analysis
Vocabulary Growth
Heaps’ Law
Heaps’ Law Data
Word counts are interesting...
Zipf’s Law on Tom Sawyer
Plot of Zipf’s Law
Plot of Zipf’s Law (cont’d)
Zipf’s Law, so what?
N-Grams and Corpus Linguistics
N-grams & Language Modeling
A bad language model (Herman cartoon, reprinted with permission from LaughingStock Licensing Inc., Ottawa, Canada. All rights reserved.)
What’s a Language Model?
What’s a language model for?
Next Word Prediction
Human Word Prediction
Claim
Applications
Overview
Simple N-Grams
N-grams
Computing the Probability of a Word Sequence
Bigram Model
Using N-Grams
The n-gram Approximation
n-grams, continued
N-grams for Language Generation
Unigram: 5. …Here words are chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
Bigram: 6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
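A second-order (bigram) approximation of this kind can be sketched in a few lines: record which words follow each word in a training text, then repeatedly sample the next word from the followers of the current one. The toy training text and the names `train_bigrams` and `generate` below are hypothetical and only illustrate the idea; a real experiment would use a large corpus.

```python
import random
from collections import defaultdict


def train_bigrams(tokens):
    """Map each word to the list of words observed right after it."""
    followers = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev].append(nxt)
    return followers


def generate(followers, start, length, rng):
    """Sample up to `length` words; repeated followers are proportionally likelier."""
    words = [start]
    while len(words) < length:
        options = followers.get(words[-1])
        if not options:  # dead end: no observed continuation
            break
        words.append(rng.choice(options))
    return words


# Hypothetical toy corpus, not taken from the slides.
toy_text = "i want to eat lunch . i want to eat chinese food . i want some food ."
tokens = toy_text.split()
followers = train_bigrams(tokens)
sample = generate(followers, "i", 8, random.Random(0))
print(" ".join(sample))
```

Because every adjacent pair in the output was seen in training, the text is locally plausible, yet nothing constrains structure beyond one word of context.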
N-Gram Models of Language
Counting Words in Corpora
Terminology
Corpora
Simple N-Grams
Computing the Probability of a Word Sequence
Bigram Model
Using N-Grams
Training and Testing
A Simple Example
A Bigram Grammar Fragment from BERP

Bigram          Prob.    Bigram               Prob.
Eat on          .16      Eat Thai             .03
Eat some        .06      Eat breakfast        .03
Eat lunch       .06      Eat in               .02
Eat dinner      .05      Eat Chinese          .02
Eat at          .04      Eat Mexican          .02
Eat a           .04      Eat tomorrow         .01
Eat Indian      .04      Eat dessert          .007
Eat today       .03      Eat British          .001

<start> I       .25      Want some            .04
<start> I’d     .06      Want Thai            .01
<start> Tell    .04      To eat               .26
<start> I’m     .02      To have              .14
I want          .32      To spend             .09
I would         .29      To be                .02
I don’t         .08      British food         .60
I have          .04      British restaurant   .15
Want to         .65      British cuisine      .01
Want a          .05      British lunch        .01
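Under the bigram model, the probability of a sentence is the product of its bigram probabilities, starting from the <start> marker. The sketch below multiplies entries taken from the BERP fragment for the sentence "I want to eat British food"; the function name `sentence_probability`, the lowercasing, and the choice to give unlisted bigrams probability zero are assumptions made for illustration.

```python
# Bigram probabilities copied from the BERP fragment (lowercased for lookup).
bigram_prob = {
    ("<start>", "i"): 0.25,
    ("i", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "british"): 0.001,
    ("british", "food"): 0.60,
}


def sentence_probability(words, probs):
    """Multiply P(w_n | w_{n-1}) over the sentence; unlisted bigrams count as 0."""
    padded = ["<start>"] + [w.lower() for w in words]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)
    return p


p = sentence_probability("I want to eat British food".split(), bigram_prob)
print(f"{p:.2e}")  # roughly 8.1e-06
```

The product is tiny even for a short, sensible sentence, which is why implementations usually sum log probabilities instead of multiplying raw ones.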
BERP Bigram Counts (rows: first word; columns: second word)

           I     Want   To    Eat   Chinese   Food   Lunch
I          8     1087   0     13    0         0      0
Want       3     0      786   0     6         8      6
To         3     0      10    860   3         0      12
Eat        0     0      2     0     19        2      52
Chinese    2     0      0     0     0         120    1
Food       19    0      17    0     0         0      0
Lunch      4     0      0     0     0         1      0
BERP Bigram Probabilities

Unigram counts:
I       Want    To      Eat    Chinese   Food    Lunch
3437    1215    3256    938    213       1506    459
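The bigram probabilities are obtained by dividing each bigram count by the unigram count of its first word, for example P(want | I) = 1087 / 3437 ≈ .32. A minimal sketch using the BERP counts from the slides follows; the variable and function names are hypothetical.

```python
words = ["I", "Want", "To", "Eat", "Chinese", "Food", "Lunch"]

# Bigram counts: row = first word, column = second word.
counts = [
    [8, 1087, 0, 13, 0, 0, 0],    # I
    [3, 0, 786, 0, 6, 8, 6],      # Want
    [3, 0, 10, 860, 3, 0, 12],    # To
    [0, 0, 2, 0, 19, 2, 52],      # Eat
    [2, 0, 0, 0, 0, 120, 1],      # Chinese
    [19, 0, 17, 0, 0, 0, 0],      # Food
    [4, 0, 0, 0, 0, 1, 0],        # Lunch
]

unigram = dict(zip(words, [3437, 1215, 3256, 938, 213, 1506, 459]))


def bigram_mle(prev, cur):
    """Maximum-likelihood estimate P(cur | prev) = C(prev cur) / C(prev)."""
    return counts[words.index(prev)][words.index(cur)] / unigram[prev]


for prev, cur in [("I", "Want"), ("Want", "To"), ("To", "Eat")]:
    print(f"P({cur} | {prev}) = {bigram_mle(prev, cur):.2f}")
```

The rounded values agree with the entries for "I want", "Want to" and "To eat" in the BERP grammar fragment.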
What do we learn about the language?
Approximating Shakespeare
N-Gram Training Sensitivity
Some Useful Empirical Observations
Smoothing Techniques
Add-one Smoothing
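Add-one (Laplace) smoothing adds one to every bigram count before normalizing, so the estimate becomes (C(w1 w2) + 1) / (C(w1) + V), where V is the vocabulary size. The sketch below applies it to BERP counts from this deck; the function name is hypothetical, and V = 1616 is an assumption taken from the figure commonly quoted for BERP in the textbook, not from these slides.

```python
def add_one_probability(bigram_count, history_count, vocab_size):
    """Add-one estimate: (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_count + 1) / (history_count + vocab_size)


V = 1616  # assumed vocabulary size; not stated in the slides

# "I want": seen 1087 times, with "I" occurring 3437 times.
p_seen = add_one_probability(1087, 3437, V)
# "I to": never seen, yet it now receives a small non-zero probability.
p_unseen = add_one_probability(0, 3437, V)

print(f"smoothed P(want | I) = {p_seen:.3f}")
print(f"smoothed P(to | I)   = {p_unseen:.6f}")
```

The seen bigram drops from about .32 to about .22, which illustrates how much probability mass add-one moves to unseen events when V is large relative to the counts.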
Witten-Bell Discounting
Good-Turing Discounting
Backoff methods (e.g. Katz ’87)
Summary
