Hypothesis testing, MLE, language models
Kira Radinsky
Based on slides by Ilan Gronau, Ydo Wexler, Dan Geiger & Nir Fridman
Hypothesis Testing
• Find the best explanation for the observed data
• Helps predict behavior of similar data sets
An example: Binomial experiments
• Model: the unknown parameter θ = P(H)
• Data Set: series of experiment results, e.g.
D = H H T H T T T H H …
• Main Assumption: each experiment is independent of
others
[Coin diagram: the two outcomes, with probabilities P(H) = θ and P(T) = 1 - θ]
Parameter Estimation
Using Likelihood Functions
• The likelihood of a given value for θ:
L_D(θ) = p(D | θ)
• Maximum Likelihood Estimation (MLE) :
We wish to find a value for θ which maximizes the
likelihood
• For example: the likelihood of ‘HTTHH’ is:
L_HTTHH(θ) = p(HTTHH | θ) = θ·(1-θ)·(1-θ)·θ·θ = θ^3 · (1-θ)^2
• We only need to know N(H) (the number of Heads) and N(T)
(the number of Tails).
• These are sufficient statistics: L_D(θ) = θ^N(H) · (1-θ)^N(T)
Sufficient Statistics
• A sufficient statistic is a function of the data
that summarizes the relevant information for
the likelihood.
• s(D) is a sufficient statistic if for any two
datasets D and D′:
s(D) = s(D′) ⇒ L_D(θ) = L_D′(θ)
• Likelihood may be calculated on the statistics.
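To make the sufficiency point concrete, here is a minimal Python sketch (the toy datasets are assumed, not from the slides) showing that two coin-flip datasets with the same N(H) and N(T) yield identical likelihoods at every θ:

```python
# Minimal sketch: two different datasets with the same sufficient
# statistics N(H), N(T) have identical likelihood functions.

def likelihood(data, theta):
    """L_D(theta) = theta^N(H) * (1 - theta)^N(T)."""
    n_h = data.count("H")
    n_t = data.count("T")
    return theta ** n_h * (1 - theta) ** n_t

d1 = "HTTHH"   # N(H)=3, N(T)=2
d2 = "HHHTT"   # same counts, different order

for theta in (0.3, 0.5, 0.7):
    assert likelihood(d1, theta) == likelihood(d2, theta)
    print(f"theta={theta}: L={likelihood(d1, theta):.5f}")
```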
Maximum Likelihood Estimation
• Goal: Maximize the likelihood (or log-likelihood)
• In our example:
– Likelihood:
• L_D(θ) = θ^N(H) · (1-θ)^N(T)
– Log-likelihood:
• l_D(θ) = log(L_D(θ)) = N(H)·log(θ) + N(T)·log(1-θ)
– Maximization of the log-likelihood, by setting l_D′(θ) = 0:
• l_D′(θ) = N(H)/θ - N(T)/(1-θ) = 0 ⇒ θ̂ = N(H) / (N(H) + N(T))
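As a sanity check on the closed-form solution, the following sketch (with assumed toy counts) maximizes the log-likelihood numerically over a grid of θ values and compares the result with N(H) / (N(H) + N(T)):

```python
import numpy as np

# Hypothetical counts for a coin-flip experiment.
n_h, n_t = 3, 2

# Log-likelihood l_D(theta) = N(H)*log(theta) + N(T)*log(1-theta).
thetas = np.linspace(0.001, 0.999, 10_000)
log_lik = n_h * np.log(thetas) + n_t * np.log(1 - thetas)

theta_grid = thetas[np.argmax(log_lik)]   # numerical maximizer
theta_mle = n_h / (n_h + n_t)             # closed-form MLE

print(f"grid search: {theta_grid:.4f}, closed form: {theta_mle:.4f}")
```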
MLE with multiple parameters
• What if we have several parameters θ1, θ2,…, θK that we
wish to learn?
• Examples:
– die toss (K=6)
– Grades (K=100)
• Sufficient statistics [assumption: a series of independent experiments]:
– N1, N2, …, NK - the number of times each outcome was observed
• Likelihood:
L_D(θ_1, …, θ_K) = θ_1^N_1 · θ_2^N_2 · … · θ_K^N_K
• MLE:
θ̂_k = N_k / (N_1 + N_2 + … + N_K)
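A minimal sketch of the multi-parameter case, using hypothetical die-toss counts: the MLE for each outcome is simply its relative frequency.

```python
# Hypothetical die-toss counts N_1..N_6 (K = 6 outcomes).
counts = [12, 8, 10, 9, 11, 10]
total = sum(counts)

# MLE under the multinomial likelihood: theta_k = N_k / sum_j N_j.
theta_hat = [n / total for n in counts]
print(theta_hat)                           # relative frequencies
assert abs(sum(theta_hat) - 1.0) < 1e-12   # a valid distribution
```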
From MLE to Bayesian Inference
• Likelihood Goal: maximize p(D| θ)
• Our Goal: maximize p(θ|D)
• Following Bayes' rule:
p(θ | D) = p(D | θ) · p(θ) / p(D)
where p(θ | D) is the posterior probability, p(D | θ) is the likelihood, and p(θ) is the prior probability.
• Intuitively, the prior probability captures our prior
knowledge (prejudice) of the model parameters.
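To illustrate Bayes' rule numerically, here is a sketch (toy coin-flip counts and an assumed flat prior, discretized over a grid of θ values) that computes the posterior p(θ | D); the evidence p(D) is obtained by summing over the grid:

```python
import numpy as np

n_h, n_t = 3, 2                       # hypothetical coin-flip counts
thetas = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(thetas)          # assumed flat prior p(theta)
prior /= prior.sum()

likelihood = thetas**n_h * (1 - thetas)**n_t   # p(D | theta)

posterior = likelihood * prior        # numerator of Bayes' rule
posterior /= posterior.sum()          # divide by p(D) (the evidence)

# Under a flat prior, the posterior mode coincides with the MLE.
print("posterior mode:", thetas[np.argmax(posterior)])
```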
MLE in Natural
Language Processing (NLP)
• Goal: Evaluate the probability of the next word based on the
words prior to it:
P(w_i | w_1, …, w_{i-1})
• Importance: speech recognition, handwritten word recognition, part-of-speech tagging, language identification, spam detection, etc.
• Markov Assumption: the probability of a word w_i in a
sequence of words depends only on the n-1 words that precede it
in the sequence, where n is a constant.
N-Gram Model
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-n+1}, …, w_{i-1})
• Types of n-grams:
– Uni-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i)
– Bi-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-1})
– Tri-gram
• P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
MLE in NLP
• Problem:
How do we evaluate P(w_i), P(w_i | w_{i-1}), P(w_i | w_{i-2}, w_{i-1})?
• Proposal: MLE
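A minimal sketch of these MLE estimates on a hypothetical toy corpus: unigram probabilities are relative word frequencies, and bigram probabilities are conditional relative frequencies count(w_{i-1}, w_i) / count(w_{i-1}):

```python
from collections import Counter

# Hypothetical toy corpus.
corpus = "john drank milk mary drank tea john drank tea".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    """MLE: count(prev, w) / count(prev); breaks when count(prev) = 0."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_unigram("drank"))          # 3/9
print(p_bigram("milk", "drank"))   # 1/3
```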
Problems with MLE
• Many sequences of length n never appear in the dataset (but do appear in
the real world).
• Example:
– Task: speech recognition. We heard a word in a sentence and wish to decide
between two words: “Milk” and “Silk”
– P(Milk | John drank) >? P(Silk | John drank)
– The word “John” never appeared in the dataset, so both MLE counts are zero and we cannot decide
• Church and Gale (1991)
– Dataset: 44 million words from newspapers
– Vocabulary: 400,653 distinct words
– Therefore, ~1.6 × 10^11 possible bigrams
– Very few of them appeared in the dataset…
• Solutions:
Most solutions are based on some sort of smoothing:
– Laplace (add-one)
– Good–Turing
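As a sketch of the simplest of these fixes, Laplace (add-one) smoothing adds one pseudo-count to every possible bigram, so unseen contexts no longer receive zero probability (toy corpus and vocabulary assumed; Good–Turing is more involved and not shown):

```python
from collections import Counter

corpus = "john drank milk mary drank tea".split()  # hypothetical toy corpus
vocab = set(corpus)
V = len(vocab)

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_laplace(w, prev):
    """Add-one smoothing: (count(prev, w) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("milk", "drank"))  # seen bigram: (1 + 1) / (2 + 5)
print(p_laplace("milk", "tea"))    # unseen bigram: (0 + 1) / (1 + 5), no longer zero
```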
Evaluation
• The null hypothesis, denoted by H_0
• The alternative hypothesis, denoted by H_1.
• Should we reject the null hypothesis in favor of the alternative?
Input:
– a value from a certain distribution
– we don't know what the parameter of that distribution is.
Test:
– How likely is it that the value we were given could have come from the
distribution with this predicted parameter?
– If it's not very likely, we reject the null hypothesis in favor of the alternative.
• Critical Region
– But what exactly is "not very likely"?
– We choose a region known as the critical region. If the result of our
test lies in this region, then we reject the null hypothesis in favor of
the alternative.
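Tying this back to the coin example, here is a sketch (assumed data and an assumed 5% significance level) of a two-sided binomial test of H_0: θ = 0.5, using scipy's binomtest (available in SciPy 1.7+); we reject H_0 when the p-value falls in the critical region:

```python
from scipy.stats import binomtest

# Hypothetical data: 42 heads out of 50 flips.
n_heads, n_flips = 42, 50

# H0: theta = 0.5 (fair coin); H1: theta != 0.5.
result = binomtest(n_heads, n_flips, p=0.5, alternative="two-sided")

alpha = 0.05  # size of the critical region
print(f"p-value = {result.pvalue:.4g}")
if result.pvalue < alpha:
    print("Reject H0 in favor of H1.")
else:
    print("Fail to reject H0.")
```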
Empirical Evaluation Methods
• Split the data into train and test sets
– Leave-one-out
• Cross validation
– 10-fold cross validation
– 5×2 cross validation
• Never (never never!) perform evaluation on
the training data
Never!
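To make that discipline concrete, here is a minimal sketch of 10-fold cross-validation with scikit-learn (the model, data, and seeds are hypothetical); each fold is fit on the training split only and scored on its held-out split:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # fit on train only
    scores.append(model.score(X[test_idx], y[test_idx]))          # score on held-out fold

print(f"mean accuracy over 10 folds: {np.mean(scores):.3f}")
```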