This document provides an introduction to a machine learning course being taught at Uppsala University. It outlines the schedule, reading list, assignments, and examination. The course covers topics like decision trees, linear models, ensemble methods, text mining, and unsupervised learning. It discusses the differences between supervised and unsupervised learning, as well as classification, regression, and other machine learning techniques. The goal is to introduce students to commonly used methods in natural language processing.
Lecture 01: Machine Learning for Language Technology - Introduction
1. Lecture 1: Introduction
Last updated: 2 Sept 2013
Uppsala University
Department of Linguistics and Philology, September 2013
Machine Learning for Language Technology
(Schedule)
2. Practical Information
Reading list, assignments, exams, etc.
Most slides are adapted from previous courses (by E. Alpaydin and J. Nivre).
4. About the Course
Introduction to machine learning
Focus on methods used in Language Technology and NLP
Decision trees and nearest neighbor methods (Lecture 2)
Linear models – the Weka ML package (Lectures 3 and 4)
Ensemble methods – structured prediction (Lectures 5 and 6)
Text mining and big data – R and RapidMiner (Lecture 7)
Unsupervised learning (clustering) (Lecture 8, Magnus Rosell)
Builds on Statistical Methods in NLP
5. Digression: Generative vs. Discriminative Methods
Generative methods apply only to probabilistic models. A model is generative if it models the joint distribution of the input and output, P(Y, X). It is called generative because it lets you generate data points with the correct probability distribution.
Conditional methods model the conditional distribution of the output given the input: P(Y | X).
Discriminative methods do not model probabilities at all; they map the input to the output directly.
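As a reminder of how these probabilistic views connect (a standard identity, not stated on the slide), the conditional distribution used by conditional methods can be obtained from the joint distribution of a generative model:
$P(Y \mid X) = \dfrac{P(Y, X)}{P(X)} = \dfrac{P(Y, X)}{\sum_{y} P(y, X)}$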
6. Compulsory Reading List
Main textbooks:
1. Ethem Alpaydin. 2010. Introduction to Machine Learning. Second Edition. MIT Press (free online version)
2. Hal Daumé III. 2012. A Course in Machine Learning (free online version)
3. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition (free online version)
Additional material:
1. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems
2. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing
3. Hanna M. Wallach. 2004. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania
7. Optional Reading
Hal Daumé III, John Langford, and Daniel Marcu (2005). Search-Based Structured Prediction as Classification. NIPS Workshop on Advances in Structured Learning for Text and Speech Processing (ASLTSP)
8. Assignments and Examination
Three Assignments:
Decision trees and nearest neighbors
Perceptron learning
Clustering
General Info:
No lab sessions, supervision by email
Reports and Assignments 1 & 2 must be submitted to santinim@stp.lingfil.uu.se
Report and Assignment 3 must be submitted to rosell@kth.se
Examination:
Written report submitted for each assignment
All three assignments necessary to pass the course
Grade determined by majority grade on assignments
9. Practical Organization
Marina Santini (Lectures 1-7); Magnus Rosell (Lecture 8)
45 min lecture + 15 min break
Lectures posted on the course webpage and on SlideShare
Email all your questions to me: santinim@stp.lingfil.uu.se
Video recordings of the previous ML course: http://stp.lingfil.uu.se/~nivre/master/ml.html
Send me an email at santinim@stp.lingfil.uu.se so that I can make sure I have all the correct email addresses
11. What is Machine Learning?
Introduction to:
• Classification
• Regression
• Supervised learning
• Unsupervised learning
• Reinforcement learning
12. What is Machine Learning
Machine learning is programming computers to optimize a performance criterion for some task using example data or past experience.
Why learning?
No known exact method – vision, speech recognition, robotics, spam filters, etc.
Exact method too expensive – statistical physics
Task evolves over time – network routing
Compare: no need to use machine learning for computing payroll… we just need an algorithm
13. Machine Learning – Data Mining – Artificial Intelligence – Statistics
Machine learning: creation of a model that uses training data or past experience
Data mining: application of learning methods to large datasets (e.g. physics, astronomy, biology, etc.)
Text mining = machine learning applied to unstructured textual data (e.g. sentiment analysis, social media monitoring; see Text Mining, Wikipedia)
Artificial intelligence: a model that can adapt to a changing environment
Statistics: machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample
14. The bio-cognitive analogy
Imagine a learning algorithm as a single neuron.
This neuron receives input from other neurons, one for each input feature.
The strengths of these inputs are the feature values.
Each input has a weight, and the neuron simply sums up all the weighted inputs.
Based on this sum, the neuron decides whether to “fire” or not. Firing is interpreted as a positive example and not firing as a negative example.
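To make the analogy concrete, here is a minimal Python sketch of such a neuron; the feature names, weights, and threshold are invented for illustration and are not part of the lecture:

```python
# A single artificial neuron as described above. Weights and threshold are hypothetical.
weights = {"price": -0.8, "engine_power": 0.5}   # one weight per input feature
threshold = 0.0

def fires(features):
    """Sum the weighted inputs and 'fire' (positive example) if the sum exceeds the threshold."""
    total = sum(weights[name] * value for name, value in features.items())
    return total > threshold

print(fires({"price": 0.2, "engine_power": 0.9}))  # True -> positive example
```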
15. Elements of Machine Learning
1. Generalization:
Generalize from specific examples
Based on statistical inference
2. Data:
Training data: specific examples to learn from
Test data: (new) specific examples to assess performance
3. Models:
Theoretical assumptions about the task/domain
Parameters that can be inferred from data
4. Algorithms:
Learning algorithm: infer model (parameters) from data
Inference algorithm: infer predictions from model
17. Learning Associations
Basket analysis:
P(Y | X): probability that somebody who buys X also buys Y, where X and Y are products/services
Example: P(chips | beer) = 0.7
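A toy sketch of estimating such a conditional probability from transaction data; the transactions below are made up for illustration:

```python
# Toy basket analysis: estimate P(chips | beer) from a list of transactions.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"milk", "chips"},
    {"beer", "chips"},
]

with_beer = [t for t in transactions if "beer" in t]
p_chips_given_beer = sum("chips" in t for t in with_beer) / len(with_beer)
print(p_chips_given_beer)  # 0.75 on this toy data
```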
18. Classification
Example: credit scoring
Differentiating between low-risk and high-risk customers from their income and savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
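A direct sketch of this discriminant rule in Python; the threshold values are placeholders, not figures from the lecture:

```python
# The discriminant rule from the slide, written out in Python.
THETA1 = 30000   # income threshold (hypothetical)
THETA2 = 10000   # savings threshold (hypothetical)

def credit_risk(income, savings):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 15000))  # low-risk
print(credit_risk(45000, 5000))   # high-risk
```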
19. Classification in NLP
Binary classification:
Spam filtering (spam vs. non-spam)
Spelling error detection (error vs. non-error)
Multiclass classification:
Text categorization (news, economy, culture, sport, ...)
Named entity classification (person, location, organization, ...)
Structured prediction:
Part-of-speech tagging (classes = tag sequences)
Syntactic parsing (classes = parse trees)
20. Regression
Example: price of a used car
x: car attributes
y: price
y = g(x | θ)
g(·): model, θ: parameters
y = wx + w₀
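A minimal sketch of fitting the linear model y = wx + w₀ by least squares; the car-age and price numbers below are invented for illustration:

```python
# Least-squares fit of y = w*x + w0 on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. age of the car in years (hypothetical)
y = np.array([9.0, 7.5, 6.2, 4.8, 3.5])   # e.g. price in thousands (hypothetical)

w, w0 = np.polyfit(x, y, deg=1)            # degree-1 polynomial fit gives w and w0
print(f"y = {w:.2f} * x + {w0:.2f}")
print(np.polyval([w, w0], 6.0))            # predict the price of a 6-year-old car
```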
21. Uses of Supervised Learning
Prediction of future cases:
Use the rule to predict the output for future inputs
Knowledge extraction:
The rule is easy to understand
Compression:
The rule is simpler than the data it explains
Outlier detection:
Exceptions that are not covered by the rule, e.g., fraud
22. Unsupervised Learning
Finding regularities in data
No mapping to outputs
Clustering:
Grouping similar instances
Example applications:
Customer segmentation in CRM
Image compression: Color quantization
NLP: Unsupervised text categorization
23. Reinforcement Learning
Learning a policy = sequence of outputs/actions
No supervised output but delayed reward
Example applications:
Game playing
Robot in a maze
NLP: Dialogue systems, for example:
NJFun: A Reinforcement Learning Spoken Dialogue System
(http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
Reinforcement Learning for Spoken Dialogue Systems: Comparing Strengths and Weaknesses for Practical Deployment (http://research.microsoft.com/apps/pubs/default.aspx?id=70295)
25. Supervised Classification
Learning the class C of a “family car” from examples
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output (labels): positive (+) and negative (–) examples
Input representation (features): x1: price, x2: engine power
26. Training set X
$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$
$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$
$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
28. Empirical (training) error
$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$
Empirical error of h on X:
$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\left(h(x^t) \neq r^t\right)$
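A small sketch of computing this empirical error for a toy hypothesis and a toy labelled set (both invented for illustration):

```python
# Count how often the hypothesis h disagrees with the label r^t.
def h(x):
    """Toy hypothesis: an axis-aligned rectangle on (price, engine power)."""
    price, power = x
    return 1 if 10 <= price <= 20 and 50 <= power <= 150 else 0

X = [(12, 80), (25, 60), (15, 200), (18, 100)]   # inputs x^t (invented)
r = [1, 0, 1, 1]                                  # labels r^t (invented)

E = sum(h(x_t) != r_t for x_t, r_t in zip(X, r))
print(E)  # 1: the rectangle misses the third example
```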
29. S, G, and the Version Space
Most specific hypothesis, S
Most general hypothesis, G
Any h ∈ H between S and G is consistent [E(h | X) = 0]; together they make up the version space
31. Noise
Unwanted anomaly in data
Imprecision in input attributes
Errors in labeling data points
Hidden attributes (relative to H)
Consequence:
No h in H may be consistent!
32. Noise and Model Complexity
Arguments for simpler model (Occam’s razor principle)
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if data is noisy)
33. Inductive Bias
Learning is an ill-posed problem
Training data is never sufficient to find a unique solution
There are always infinitely many consistent hypotheses
We need an inductive bias:
Assumptions that entail a unique h for a training set X
1. Hypothesis class H – axis-aligned rectangles
2. Learning algorithm – find a consistent hypothesis with maximum margin
3. Hyperparameters – trade-off between training error and margin
34. Model Selection and Generalization
Generalization – how well a model performs on new data
Overfitting: H more complex than C
Underfitting: H less complex than C
35. Triple Trade-Off
Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
Dependencies:
As N increases, E decreases
As c(H) increases, E first decreases and then increases
36. Model Selection Generalization Error
To estimate generalization error, we need data unseen during training:
Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error
$\hat{E} = E(h \mid \mathcal{V}) = \sum_{t=1}^{M} \mathbf{1}\left(h(x^t) \neq r^t\right)$
$\mathcal{V} = \{x^t, r^t\}_{t=1}^{M}$, a validation set held out from the training data
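A minimal sketch of this selection step, assuming a few candidate hypotheses and a small held-out validation set (both invented for illustration):

```python
# Pick the hypothesis with the lowest validation error E(h_i | V).
hypotheses = {
    "h1": lambda x: 1 if x > 0.3 else 0,
    "h2": lambda x: 1 if x > 0.5 else 0,
    "h3": lambda x: 1 if x > 0.7 else 0,
}
V = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]   # validation pairs (x^t, r^t), invented

def validation_error(h):
    return sum(h(x) != r for x, r in V)

best = min(hypotheses, key=lambda name: validation_error(hypotheses[name]))
print(best, validation_error(hypotheses[best]))  # "h2", 0 on this toy data
```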
37. Model Assessment
To estimate the generalization error of the best model hi, we need data unseen during training and model selection
Standard setup:
1. Training set X (50–80%)
2. Validation (development) set V (10–25%)
3. Test (publication) set T (10–25%)
Note:
Validation data can be added to training set before testing
Resampling methods can be used if data is limited
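A minimal sketch of such a split, using 60/20/20 proportions (one choice within the ranges above) on placeholder data:

```python
# Shuffle the data once, then carve out training, validation, and test portions.
import random

data = list(range(100))          # placeholder for the full dataset
random.seed(0)
random.shuffle(data)

n = len(data)
train = data[: int(0.6 * n)]                 # training set X
dev   = data[int(0.6 * n): int(0.8 * n)]     # validation (development) set V
test  = data[int(0.8 * n):]                  # test (publication) set T
print(len(train), len(dev), len(test))       # 60 20 20
```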
39. Bootstrapping
Generate new training sets of size N from X by random sampling with replacement
Use the original training set as validation set (V = X)
Probability that we do not pick an instance after N draws:
$\left(1 - \frac{1}{N}\right)^{N} \approx \frac{1}{e} \approx 0.368$
That is, only 36.8% of instances are new!
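A quick empirical check of the 36.8% figure (a sketch, using random indices as stand-ins for instances):

```python
# Draw a bootstrap sample of size N with replacement and measure the fraction
# of original instances that was never picked; it should be close to 1/e.
import random

random.seed(0)
N = 100_000
sample = [random.randrange(N) for _ in range(N)]   # bootstrap sample (indices)
never_picked = 1 - len(set(sample)) / N
print(f"{never_picked:.3f}")                        # close to 1/e ≈ 0.368
```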
40. Measuring Error
Error rate = # of errors / # of instances = (FP+FN) / N
Accuracy = # of correct / # of instances = (TP+TN) / N
Recall = # of found positives / # of positives = TP / (TP+FN)
Precision = # of found positives / # of found = TP / (TP+FP)
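A small sketch of these four measures computed from confusion-matrix counts; the counts below are invented:

```python
# Error rate, accuracy, recall, and precision from TP/FP/TN/FN counts.
def error_measures(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / n,      # errors / instances
        "accuracy":   (tp + tn) / n,      # correct / instances
        "recall":     tp / (tp + fn),     # found positives / all positives
        "precision":  tp / (tp + fp),     # found positives / all predicted positive
    }

print(error_measures(tp=40, fp=10, tn=45, fn=5))
```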
41. Statistical Inference
Interval estimation to quantify the precision of our measurements:
$m \pm 1.96\,\frac{\sigma}{\sqrt{N}}$ (95% confidence interval)
Hypothesis testing to assess whether differences between models are statistically significant, e.g. McNemar's test:
$\frac{\left(\,|e_{01} - e_{10}| - 1\,\right)^{2}}{e_{01} + e_{10}} \sim \chi^{2}_{1}$
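A sketch of both procedures with made-up numbers, following the formulas reconstructed above:

```python
import math

# 95% confidence interval for a mean error rate m measured over N test instances (toy values).
m, sigma, N = 0.12, 0.05, 200
half_width = 1.96 * sigma / math.sqrt(N)
print(f"{m:.3f} +/- {half_width:.3f}")

# McNemar-style statistic from the disagreement counts of two classifiers:
# e01 = instances classifier A gets wrong but B gets right, e10 = the reverse (toy values).
e01, e10 = 30, 18
chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
print(f"chi^2 = {chi2:.2f}")   # compare against the chi-square(1) critical value 3.84
```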
42. Supervised Learning – Summary
Training data + learner → hypothesis
Learner incorporates inductive bias
Test data + hypothesis → estimated generalization
Test data must be unseen
Next lectures:
Different learners in LT with different inductive biases
43. Anatomy of a Supervised Learner
(Dimensions of a supervised machine learning algorithm)
Model: $g(x \mid \theta)$
Loss function: $E(\theta \mid \mathcal{X}) = \sum_{t} L\left(r^t, g(x^t \mid \theta)\right)$
Optimization procedure: $\theta^{*} = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
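A compact sketch of these three components for the linear model from the regression slide, using squared loss and plain gradient descent; the data and learning rate are illustrative choices, not from the lecture:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.0, 2.9, 5.1, 7.0])           # roughly r = 2x + 1 with a little noise

def g(x, theta):                              # model: g(x | theta) = w*x + w0
    w, w0 = theta
    return w * x + w0

def E(theta):                                 # loss: sum of squared errors L(r^t, g(x^t | theta))
    return np.sum((r - g(x, theta)) ** 2)

theta = np.array([0.0, 0.0])
lr = 0.01
for _ in range(2000):                         # optimization: gradient descent on E(theta | X)
    residual = g(x, theta) - r
    grad = np.array([np.sum(2 * residual * x), np.sum(2 * residual)])
    theta = theta - lr * grad

print(theta, E(theta))                        # theta* is close to (2, 1)
```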
44. Reading
Alpaydin (2010): Ch. 1-2; Ch. 19 (mathematical underpinnings)
Witten and Frank (2005): Ch. 1 (examples and domains of application)
45. End of Lecture 1
Thanks for your attention