Lecture 01: Machine Learning for Language Technology - Introduction

What Is Machine Learning? Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both. Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample. (Alpaydin, 2010)
In this lecture, we discuss supervised learning, starting from the simplest case. We introduce the concepts of margin, noise, and bias.

  1. Lecture 1: Introduction
     Machine Learning for Language Technology (Schedule)
     Last updated: 2 Sept 2013
     Uppsala University, Department of Linguistics and Philology, September 2013
  2. Practical Information
     Reading list, assignments, exams, etc.
     Most slides are from previous courses (i.e. from E. Alpaydin and J. Nivre), with adaptations.
  3. Course web pages:
     • http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
     • http://stp.lingfil.uu.se/~santinim/ml/MachineLearning_fall2013.htm
     Contact details:
     • santinim@stp.lingfil.uu.se
     • (marinasantini.ms@gmail.com)
  4. About the Course
     • Introduction to machine learning
     • Focus on methods used in Language Technology and NLP
     • Decision trees and nearest neighbor methods (Lecture 2)
     • Linear models – the Weka ML package (Lectures 3 and 4)
     • Ensemble methods – structured prediction (Lectures 5 and 6)
     • Text mining and big data – R and Rapid Miner (Lecture 7)
     • Unsupervised learning (clustering) (Lecture 8, Magnus Rosell)
     • Builds on Statistical Methods in NLP
  5. Digression: Generative vs. Discriminative Methods
     • A generative method only applies to probabilistic models. A model is generative if it models the joint distribution of x and y together, P(Y, X). It is called generative because it can be used to generate data points with the correct probability distribution.
     • Conditional methods model the conditional distribution of the output given the input: P(Y | X).
     • Discriminative methods do not model probabilities at all; they map the input to the output directly.
  6. Compulsory Reading List
     Main textbooks:
     1. Ethem Alpaydin. 2010. Introduction to Machine Learning. Second Edition. MIT Press (free online version)
     2. Hal Daumé III. 2012. A Course in Machine Learning (free online version)
     3. Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition (free online version)
     Additional material:
     1. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (Eds.), First International Workshop on Multiple Classifier Systems
     2. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing
     3. Hanna M. Wallach. 2004. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania
  7. Optional Reading
     • Hal Daumé III, John Langford, Daniel Marcu (2005). Search-Based Structured Prediction as Classification. NIPS Workshop on Advances in Structured Learning for Text and Speech Processing (ASLTSP)
  8. Assignments and Examination
     Three assignments:
     • Decision trees and nearest neighbors
     • Perceptron learning
     • Clustering
     General info:
     • No lab sessions; supervision by email
     • Reports for Assignments 1 & 2 must be submitted to santinim@stp.lingfil.uu.se
     • The report for Assignment 3 must be submitted to rosell@kth.se
     Examination:
     • Written report submitted for each assignment
     • All three assignments necessary to pass the course
     • Grade determined by majority grade on assignments
  9. Practical Organization
     • Marina Santini (Lectures 1–7); Magnus Rosell (Lecture 8)
     • 45 min + 15 min break
     • Lectures on the course webpage and SlideShare
     • Email all your questions to me: santinim@stp.lingfil.uu.se
     • Video recordings of the previous ML course: http://stp.lingfil.uu.se/~nivre/master/ml.html
     • Send me an email at santinim@stp.lingfil.uu.se, so I can make sure that I have all the correct email addresses
  10. Schedule: http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
  11. What is Machine Learning?
     Introduction to:
     • Classification
     • Regression
     • Supervised Learning
     • Unsupervised Learning
     • Reinforcement Learning
  12. What is Machine Learning
     • Machine learning is programming computers to optimize a performance criterion for some task using example data or past experience
     • Why learning?
       • No known exact method – vision, speech recognition, robotics, spam filters, etc.
       • Exact method too expensive – statistical physics
       • Task evolves over time – network routing
     • Compare: no need to use machine learning for computing payroll… we just need an algorithm
  13. Machine Learning – Data Mining – Artificial Intelligence – Statistics
     • Machine Learning: creation of a model that uses training data or past experience
     • Data Mining: application of learning methods to large datasets (e.g. physics, astronomy, biology, etc.)
     • Text mining = machine learning applied to unstructured textual data (e.g. sentiment analysis, social media monitoring; see Text Mining, Wikipedia)
     • Artificial Intelligence: a model that can adapt to a changing environment
     • Statistics: machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample
  14. The bio-cognitive analogy
     • Imagine a learning algorithm as a single neuron.
     • This neuron receives input from other neurons, one for each input feature.
     • The strengths of these inputs are the feature values.
     • Each input has a weight, and the neuron simply sums up all the weighted inputs.
     • Based on this sum, the neuron decides whether to “fire” or not. Firing is interpreted as a positive example; not firing is interpreted as a negative example.
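     As a rough illustration of this analogy (not part of the original slides), the Python sketch below implements such a threshold neuron; the feature values, weights, and threshold are invented.

     ```python
     # Hypothetical threshold neuron: weighted sum of inputs, then fire / don't fire.
     def neuron_fires(features, weights, threshold=0.0):
         """Return True (positive example) if the weighted sum exceeds the threshold."""
         total = sum(w * x for w, x in zip(weights, features))
         return total > threshold

     # Made-up feature values and weights, one weight per input feature.
     x = [0.5, 1.2, -0.3]
     w = [0.8, 0.1, 1.5]
     print(neuron_fires(x, w))   # True -> positive example, False -> negative example
     ```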
  15. Elements of Machine Learning
     1. Generalization:
        • Generalize from specific examples
        • Based on statistical inference
     2. Data:
        • Training data: specific examples to learn from
        • Test data: (new) specific examples to assess performance
     3. Models:
        • Theoretical assumptions about the task/domain
        • Parameters that can be inferred from data
     4. Algorithms:
        • Learning algorithm: infer model (parameters) from data
        • Inference algorithm: infer predictions from model
  16. Types of Machine Learning
     • Association
     • Supervised Learning
       • Classification
       • Regression
     • Unsupervised Learning
     • Reinforcement Learning
  17. Learning Associations
     • Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products/services
     • Example: P(chips | beer) = 0.7
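     To make the estimate concrete, here is a small Python sketch (not from the slides) that computes P(Y | X) from a toy list of market baskets; the transactions and product names are invented.

     ```python
     # Estimate P(Y | X): fraction of baskets containing X that also contain Y.
     def conditional_prob(baskets, x, y):
         with_x = [b for b in baskets if x in b]
         if not with_x:
             return 0.0
         return sum(1 for b in with_x if y in b) / len(with_x)

     # Toy transactions, invented for illustration.
     baskets = [
         {"beer", "chips"},
         {"beer", "chips", "salsa"},
         {"beer", "bread"},
         {"milk", "bread"},
     ]
     print(conditional_prob(baskets, "beer", "chips"))   # 2/3 ≈ 0.67
     ```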
  18. Classification
     • Example: credit scoring
     • Differentiating between low-risk and high-risk customers from their income and savings
     • Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
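     The discriminant above translates directly into a rule-based classifier. The sketch below is only an illustration; the threshold values θ1 and θ2 are made up, not taken from the lecture.

     ```python
     # Discriminant: IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
     THETA1 = 30_000   # income threshold (made-up value)
     THETA2 = 5_000    # savings threshold (made-up value)

     def credit_risk(income, savings):
         return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

     print(credit_risk(45_000, 12_000))   # low-risk
     print(credit_risk(25_000, 12_000))   # high-risk
     ```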
  19. Classification in NLP
     • Binary classification:
       • Spam filtering (spam vs. non-spam)
       • Spelling error detection (error vs. non-error)
     • Multiclass classification:
       • Text categorization (news, economy, culture, sport, ...)
       • Named entity classification (person, location, organization, ...)
     • Structured prediction:
       • Part-of-speech tagging (classes = tag sequences)
       • Syntactic parsing (classes = parse trees)
  20. Regression
     • Example: price of a used car
     • x: car attributes; y: price
     • y = g(x | θ), where g(·) is the model and θ the parameters
     • Linear model: y = wx + w0
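     As a toy illustration of fitting such a linear model (not part of the original slides), the following Python sketch estimates w and w0 by least squares on invented car data (age in years vs. price).

     ```python
     import numpy as np

     # Invented data: x = car age in years, y = price in arbitrary currency units.
     x = np.array([1, 2, 3, 5, 8])
     y = np.array([18000, 16500, 14000, 11000, 7000])

     # Fit y = w*x + w0 by least squares (degree-1 polynomial fit).
     w, w0 = np.polyfit(x, y, 1)
     print(f"w = {w:.1f}, w0 = {w0:.1f}")
     print("predicted price for a 4-year-old car:", w * 4 + w0)
     ```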
  21. Uses of Supervised Learning
     • Prediction of future cases: use the rule to predict the output for future inputs
     • Knowledge extraction: the rule is easy to understand
     • Compression: the rule is simpler than the data it explains
     • Outlier detection: exceptions that are not covered by the rule, e.g., fraud
  22. Unsupervised Learning
     • Finding regularities in data
     • No mapping to outputs
     • Clustering: grouping similar instances
     • Example applications:
       • Customer segmentation in CRM
       • Image compression: color quantization
       • NLP: unsupervised text categorization
  23. Reinforcement Learning
     • Learning a policy = sequence of outputs/actions
     • No supervised output, but delayed reward
     • Example applications:
       • Game playing
       • Robot in a maze
       • NLP: dialogue systems, for example:
         • NJFun: A Reinforcement Learning Spoken Dialogue System (http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
         • Reinforcement Learning for Spoken Dialogue Systems: Comparing Strengths and Weaknesses for Practical Deployment (http://research.microsoft.com/apps/pubs/default.aspx?id=70295)
  24. Supervised Learning
     Introduction to:
     • Margin
     • Noise
     • Bias
  25. Supervised Classification
     • Learning the class C of a “family car” from examples
     • Prediction: is car x a family car?
     • Knowledge extraction: what do people expect from a family car?
     • Output (labels): positive (+) and negative (–) examples
     • Input representation (features): x1: price, x2: engine power
  26. Training set X
     $\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$
     $r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$
     $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
  27. Hypothesis class H
     $h:\; (p_1 \le \text{price} \le p_2) \;\text{AND}\; (e_1 \le \text{engine power} \le e_2)$
  28. Empirical (training) error
     $h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$
     Empirical error of h on $\mathcal{X}$: $E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\!\left(h(\mathbf{x}^t) \ne r^t\right)$
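     To tie slides 26–28 together, here is a small Python sketch (with invented price and engine-power values) that encodes a training set {(x^t, r^t)}, an axis-aligned rectangle hypothesis h, and the empirical error E(h | X); the thresholds p1, p2, e1, e2 are made up.

     ```python
     # Toy training set: (price, engine power) pairs with labels r (1 = family car).
     # All numbers are invented for illustration.
     X = [(14000, 90), (18000, 120), (22000, 150), (9000, 60), (35000, 300)]
     r = [1, 1, 1, 0, 0]

     # Axis-aligned rectangle hypothesis: positive iff p1 <= price <= p2 AND e1 <= power <= e2.
     P1, P2, E1, E2 = 12000, 25000, 80, 200   # made-up hypothesis parameters

     def h(x):
         price, power = x
         return 1 if (P1 <= price <= P2 and E1 <= power <= E2) else 0

     # Empirical (training) error: number of training examples h gets wrong.
     E = sum(1 for xt, rt in zip(X, r) if h(xt) != rt)
     print("empirical error:", E)   # 0 for this toy set
     ```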
  29. S, G, and the Version Space
     • Most specific hypothesis, S
     • Most general hypothesis, G
     • Any h ∈ H between S and G is consistent [E(h | X) = 0]; such hypotheses make up the version space
  30. Margin
     • Choose h with the largest margin
  31. Noise
     Unwanted anomaly in the data:
     • Imprecision in input attributes
     • Errors in labeling data points
     • Hidden attributes (relative to H)
     Consequence:
     • No h in H may be consistent!
  32. Noise and Model Complexity
     Arguments for the simpler model (Occam’s razor principle):
     1. Easier to make predictions
     2. Easier to train (fewer parameters)
     3. Easier to understand
     4. Generalizes better (if data is noisy)
  33. Inductive Bias
     • Learning is an ill-posed problem
       • Training data is never sufficient to find a unique solution
       • There are always infinitely many consistent hypotheses
     • We need an inductive bias: assumptions that entail a unique h for a training set X
       1. Hypothesis class H – axis-aligned rectangles
       2. Learning algorithm – find a consistent hypothesis with maximum margin
       3. Hyperparameters – trade-off between training error and margin
  34. Model Selection and Generalization
     • Generalization – how well a model performs on new data
     • Overfitting: H more complex than C
     • Underfitting: H less complex than C
  35. Triple Trade-Off
     • Trade-off between three factors:
       1. Complexity of H, c(H)
       2. Training set size, N
       3. Generalization error, E, on new data
     • Dependencies:
       • As N increases, E decreases
       • As c(H) increases, E first decreases and then increases
  36. Model Selection and Generalization Error
     • To estimate generalization error, we need data unseen during training
     • Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error
     • Validation error: $\hat{E} = E(h \mid \mathcal{V}) = \sum_{t=1}^{M} \mathbf{1}\!\left(h(\mathbf{x}^t) \ne r^t\right)$, where $\mathcal{V} = \{\mathbf{x}^t, r^t\}_{t=1}^{M}$ is held out from the training data
  37. Model Assessment
     • To estimate the generalization error of the best model hi, we need data unseen during training and model selection
     • Standard setup:
       1. Training set X (50–80%)
       2. Validation (development) set V (10–25%)
       3. Test (publication) set T (10–25%)
     • Note:
       • Validation data can be added to the training set before testing
       • Resampling methods can be used if data is limited
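     The following Python sketch (not from the slides) illustrates the setup of slides 36–37: split invented labeled data 60/20/20 into training, validation, and test sets, select the candidate model with the smallest validation error, and assess it on the test set. The candidate "models" are hypothetical fixed threshold classifiers, so the training portion is not actually used here.

     ```python
     import random

     # Invented labeled data: x is a single numeric feature, r is the gold label.
     random.seed(0)
     data = [(x, 1 if x > 5 else 0) for x in (random.uniform(0, 10) for _ in range(100))]
     random.shuffle(data)

     # Standard setup: training (60%), validation (20%), test (20%).
     n = len(data)
     train = data[: int(0.6 * n)]              # unused here: candidate models are fixed by hand
     val = data[int(0.6 * n): int(0.8 * n)]
     test = data[int(0.8 * n):]

     def error(h, dataset):
         """Fraction of instances the hypothesis h misclassifies."""
         return sum(1 for x, r in dataset if h(x) != r) / len(dataset)

     # Hypothetical candidate models h_1, ..., h_k: threshold classifiers.
     thresholds = (3.0, 4.0, 5.0, 6.0)
     candidates = {t: (lambda x, t=t: 1 if x > t else 0) for t in thresholds}

     best_t = min(candidates, key=lambda t: error(candidates[t], val))        # selection on V
     print("selected threshold:", best_t)
     print("test error of selected model:", error(candidates[best_t], test))  # assessment on T
     ```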
  38. Cross-Validation
     • K-fold cross-validation: divide X into K parts X1, ..., XK
       $\mathcal{V}_1 = \mathcal{X}_1,\quad \mathcal{T}_1 = \mathcal{X}_2 \cup \mathcal{X}_3 \cup \cdots \cup \mathcal{X}_K$
       $\mathcal{V}_2 = \mathcal{X}_2,\quad \mathcal{T}_2 = \mathcal{X}_1 \cup \mathcal{X}_3 \cup \cdots \cup \mathcal{X}_K$
       $\;\;\vdots$
       $\mathcal{V}_K = \mathcal{X}_K,\quad \mathcal{T}_K = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots \cup \mathcal{X}_{K-1}$
     • Note:
       • Generalization error estimated by the mean across the K folds
       • Training sets for different folds share K–2 parts
       • A separate test set must be maintained for model assessment
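     A minimal Python sketch of K-fold index generation (not from the slides); it only produces the fold partitions described above, with an invented data size and choice of K.

     ```python
     def k_fold_splits(n_items, k):
         """Yield (validation_indices, training_indices) for each of the K folds."""
         indices = list(range(n_items))
         fold_size = n_items // k
         for i in range(k):
             start = i * fold_size
             stop = (i + 1) * fold_size if i < k - 1 else n_items   # last fold takes the remainder
             val = indices[start:stop]
             train = indices[:start] + indices[stop:]
             yield val, train

     # Example: N = 10 instances, K = 5 folds.
     for fold, (val, train) in enumerate(k_fold_splits(10, 5), start=1):
         print(f"fold {fold}: V = {val}, T = {train}")
     ```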
  39. Bootstrapping
     • Generate new training sets of size N from X by random sampling with replacement
     • Use the original training set as validation set (V = X)
     • Probability that we do not pick a given instance after N draws:
       $\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$
       that is, only 36.8% of the instances are new!
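     A small Python sketch (not from the slides) that draws one bootstrap sample and empirically checks the 36.8% figure; the value of N is an arbitrary choice.

     ```python
     import random

     random.seed(0)
     N = 1000
     X = list(range(N))

     # Bootstrap sample: N draws from X with replacement.
     sample = [random.choice(X) for _ in range(N)]

     # Fraction of original instances never picked; theory: (1 - 1/N)^N ≈ e^-1 ≈ 0.368.
     unseen = len(set(X) - set(sample)) / N
     print(f"fraction left out: {unseen:.3f}  (theory ≈ 0.368)")
     ```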
  40. Measuring Error
     • Error rate = # of errors / # of instances = (FP + FN) / N
     • Accuracy = # of correct / # of instances = (TP + TN) / N
     • Recall = # of found positives / # of positives = TP / (TP + FN)
     • Precision = # of found positives / # of found = TP / (TP + FP)
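     These definitions are straightforward to compute from predictions; below is a minimal Python sketch using invented gold labels and predictions.

     ```python
     # Invented gold labels and predictions (1 = positive, 0 = negative).
     gold = [1, 1, 1, 0, 0, 0, 1, 0]
     pred = [1, 0, 1, 0, 1, 0, 1, 0]

     TP = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
     TN = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
     FP = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
     FN = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
     N = len(gold)

     print("error rate:", (FP + FN) / N)
     print("accuracy:  ", (TP + TN) / N)
     print("recall:    ", TP / (TP + FN))
     print("precision: ", TP / (TP + FP))
     ```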
  41. Statistical Inference
     • Interval estimation to quantify the precision of our measurements:
       $m \pm 1.96\,\frac{\sigma}{\sqrt{N}}$
     • Hypothesis testing to assess whether differences between models are statistically significant, e.g. McNemar’s test:
       $\frac{\left(\lvert e_{01} - e_{10} \rvert - 1\right)^{2}}{e_{01} + e_{10}} \sim \chi^{2}_{1}$
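     As an illustration (not from the slides), here is a Python sketch that computes a 95% interval estimate for a mean error and the paired-test statistic above; the per-fold error rates and disagreement counts are invented.

     ```python
     import math

     # Invented per-fold error rates for one model.
     errors = [0.12, 0.15, 0.11, 0.14, 0.13]
     N = len(errors)
     m = sum(errors) / N
     sigma = math.sqrt(sum((e - m) ** 2 for e in errors) / (N - 1))

     # 95% interval estimate: m ± 1.96 * sigma / sqrt(N)
     half_width = 1.96 * sigma / math.sqrt(N)
     print(f"95% interval: [{m - half_width:.3f}, {m + half_width:.3f}]")

     # McNemar-style statistic from invented disagreement counts:
     # e01 = cases model 1 gets wrong and model 2 gets right; e10 = the reverse.
     e01, e10 = 12, 25
     chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
     print(f"test statistic: {chi2:.2f}  (compare to chi-square with 1 df, 3.84 at alpha = 0.05)")
     ```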
  42. Supervised Learning – Summary
     • Training data + learner → hypothesis
     • Learner incorporates inductive bias
     • Test data + hypothesis → estimated generalization
     • Test data must be unseen
     • Next lectures: different learners in LT with different inductive biases
  43. Anatomy of a Supervised Learner (dimensions of a supervised machine learning algorithm)
     • Model: $g(\mathbf{x} \mid \theta)$
     • Loss function: $E(\theta \mid \mathcal{X}) = \sum_t L\!\left(r^t, g(\mathbf{x}^t \mid \theta)\right)$
     • Optimization procedure: $\theta^{*} = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
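     To make the three components concrete, here is a small Python sketch (my own illustration, not the course's implementation) where the model is the linear g(x | θ) = w·x + w0 from slide 20, the loss is squared error, and the optimization procedure is plain gradient descent; the data points and learning rate are invented.

     ```python
     # Model: g(x | theta) = w * x + w0
     def g(x, theta):
         w, w0 = theta
         return w * x + w0

     # Loss function: E(theta | X) = sum over t of (r_t - g(x_t | theta))^2
     def E(theta, X, r):
         return sum((rt - g(xt, theta)) ** 2 for xt, rt in zip(X, r))

     # Optimization procedure: gradient descent on E to approximate theta* = argmin E.
     def fit(X, r, lr=0.01, steps=2000):
         w, w0 = 0.0, 0.0
         for _ in range(steps):
             dw = sum(-2 * (rt - g(xt, (w, w0))) * xt for xt, rt in zip(X, r))
             dw0 = sum(-2 * (rt - g(xt, (w, w0))) for xt, rt in zip(X, r))
             w, w0 = w - lr * dw, w0 - lr * dw0
         return w, w0

     # Invented one-dimensional data roughly following r = 2x + 1.
     X = [0.0, 1.0, 2.0, 3.0, 4.0]
     r = [1.1, 2.9, 5.2, 7.1, 8.8]
     theta = fit(X, r)
     print("theta* =", theta, " loss =", round(E(theta, X, r), 4))
     ```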
  44. Reading
     • Alpaydin (2010): Ch. 1–2; Ch. 19 (mathematical underpinnings)
     • Witten and Frank (2005): Ch. 1 (examples and domains of application)
  45. End of Lecture 1
     Thanks for your attention
