1. How to do Predictive Analytics with Limited Data
Ulrich Rueckert
© 2013 Datameer, Inc. All rights reserved.
3. Predictive Modeling

Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books.
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
• What happens if only a few customers rate a book?

[Table: training data with attributes Age and Income and the target label LikesBook (+ or -), e.g. Age 22 / Income 67K / LikesBook +. The test data lists new customers whose LikesBook label is still unknown (?); a model learned from the training data predicts these missing labels. A small code sketch of this train/predict loop follows below.]
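To make the supervised setup concrete, here is a minimal sketch (not from the original deck) that fits a classifier on a toy version of the bookseller table and predicts LikesBook for new customers. The feature values and the choice of scikit-learn's LogisticRegression are illustrative assumptions.

```python
# Minimal supervised-learning sketch on a toy bookseller table.
# Feature values and the choice of classifier are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: attributes (Age, Income in K) and target label LikesBook.
X_train = np.array([[22, 67], [39, 41], [24, 60], [65, 80]], dtype=float)
y_train = np.array(["+", "-", "+", "-"])

# Test data: new customers whose labels are unknown.
X_test = np.array([[25, 38], [43, 75]], dtype=float)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))   # predicted LikesBook labels for the new customers
```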
6. Supervised Learning

Classification
• For now: each data instance is a point in a 2D coordinate system
• Color denotes the target label
• Model is given as a decision boundary

What's the correct model?
• In theory: no way to tell
• All learning systems have underlying assumptions
• Smoothness assumption: similar instances have similar labels (see the nearest-neighbor sketch below)
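As a concrete illustration of the smoothness assumption, the following sketch (an addition, not from the deck) labels each test point like its closest training point, the simplest model that gives similar instances similar labels.

```python
# Illustrative sketch only: a 1-nearest-neighbor classifier is the simplest
# model built on the smoothness assumption (similar instances, similar labels).
import numpy as np

def predict_1nn(X_train, y_train, X_test):
    preds = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)  # similarity via Euclidean distance
        preds.append(y_train[np.argmin(distances)])      # copy the closest instance's label
    return np.array(preds)

X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array(["red", "red", "blue", "blue"])
print(predict_1nn(X_train, y_train, np.array([[1.5, 1.2], [8.5, 7.8]])))  # -> ['red' 'blue']
```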
11. Semi-Supervised Learning

Can we make use of the unlabeled data?
• In theory: no
• ... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
13. The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, DBSCAN, etc.
• Available e.g. in Mahout

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote (see the sketch below)
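A minimal sketch of "cluster, then majority vote", assuming scikit-learn's KMeans as the clustering algorithm; the toy data and function names are illustrative and not from the deck.

```python
# Hedged sketch: semi-supervised "cluster, then majority vote" under the
# clustering assumption. Uses scikit-learn's KMeans; data is made up.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_vote(X_labeled, y_labeled, X_unlabeled, n_clusters=2):
    """Cluster all instances, then label each cluster by majority vote
    over the labeled instances that fall into it."""
    X_all = np.vstack([X_labeled, X_unlabeled])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_all)

    labeled_clusters = km.labels_[: len(X_labeled)]
    unlabeled_clusters = km.labels_[len(X_labeled):]

    # Majority vote of known labels within each cluster.
    cluster_label = {}
    for c in range(n_clusters):
        votes = y_labeled[labeled_clusters == c]
        # Fall back to the global majority if a cluster got no labeled points.
        pool = votes if len(votes) > 0 else y_labeled
        values, counts = np.unique(pool, return_counts=True)
        cluster_label[c] = values[np.argmax(counts)]

    return np.array([cluster_label[c] for c in unlabeled_clusters])

# Toy usage: two labeled customers, several unlabeled ones.
X_lab = np.array([[22, 67], [39, 41]], dtype=float)   # e.g. Age, Income (in K)
y_lab = np.array(["+", "-"])
X_unl = np.array([[24, 60], [65, 80], [25, 38], [43, 75]], dtype=float)
print(cluster_then_vote(X_lab, y_lab, X_unl))
```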
17. Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization
• Two-step optimization procedure
• Each step is one MapReduce job
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum (an EM sketch follows below)
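A compact, illustrative Expectation-Maximization sketch for a two-component Gaussian mixture, showing the E step (cluster assignment probabilities) and the M step (re-estimating the clusters). A production version, e.g. Mahout's or one E/M pair per MapReduce job as on the slide, would add convergence checks and numerical safeguards.

```python
# Hedged sketch of Expectation-Maximization for a two-component Gaussian
# mixture (spherical covariances for brevity). Illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=2, replace=False)]   # initial means: two random points
    var = np.array([X.var(), X.var()])             # initial (spherical) variances
    pi = np.array([0.5, 0.5])                      # mixing weights

    for _ in range(n_iter):
        # E step: responsibilities = cluster assignment probabilities per instance.
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=var[k] * np.eye(d))
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for k in range(2):
            diff = X - mu[k]
            var[k] = (resp[:, k] * (diff ** 2).sum(axis=1)).sum() / (d * nk[k])
    return pi, mu, var, resp
```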
23. Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as the mixture model for text classification

Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence (see the sketch below)
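A minimal self-training sketch. Any base learner can be plugged in; scikit-learn's LogisticRegression is assumed here purely for illustration.

```python
# Hedged sketch of self-training: train on labeled data, pseudo-label the
# unlabeled data, retrain on everything, repeat until the labels settle.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unl, n_rounds=10):
    model = LogisticRegression().fit(X_lab, y_lab)   # learn on labeled instances only
    pseudo = model.predict(X_unl)                    # apply to unlabeled instances
    for _ in range(n_rounds):
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, pseudo])
        model = LogisticRegression().fit(X_all, y_all)   # learn new model on all instances
        new_pseudo = model.predict(X_unl)
        if np.array_equal(new_pseudo, pseudo):       # converged: pseudo-labels stopped changing
            break
        pseudo = new_pseudo
    return model
```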
24. The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
• Can be learned in one MapReduce step (stochastic gradient descent); a hinge-loss SGD sketch follows below
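A hedged sketch of learning a linear SVM with stochastic gradient descent on the hinge loss (a Pegasos-style update), in plain NumPy with the MapReduce distribution of the pass left out.

```python
# Sketch: linear SVM trained by stochastic gradient descent on the hinge loss.
import numpy as np

def svm_sgd(X, y, lam=0.01, n_epochs=10, seed=0):
    """X: (n, d) features, y: labels in {-1, +1}. Returns weights and bias."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # random order over the data
            t += 1
            eta = 1.0 / (lam * t)             # step size shrinks over time
            if y[i] * (X[i] @ w + b) < 1:     # margin violated: move the classifier a bit
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:
                w = (1 - eta * lam) * w       # otherwise only the regularization shrinkage
    return w, b
```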
30. The Low Density Assumption

Semi-Supervised Support Vector Machine
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct

Implementation
• Objective (regularizer, then the margins of the n labeled instances, then the margins of the m unlabeled instances):

  \min_{w} \; \|w\|^2 \;+\; C \sum_{i=1}^{n} \ell\big(y_i (w^\top x_i + b)\big) \;+\; C^{*} \sum_{i=n+1}^{n+m} \ell\big(|w^\top x_i + b|\big)

• Simple extension of the SVM
• But a non-convex optimization problem (a code sketch of the objective follows below)
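The objective above can be written directly as code. The following sketch evaluates it for given weights; the parameter names C and C_star are illustrative stand-ins for the two trade-off constants.

```python
# Hedged sketch of the semi-supervised SVM objective from the slide:
# hinge loss on labeled margins, "hat" loss on unlabeled margins, plus the
# regularizer. C and C_star trade off the two parts.
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.1):
    margins_lab = y_lab * (X_lab @ w + b)          # margins of labeled instances
    margins_unl = np.abs(X_unl @ w + b)            # margins of unlabeled instances
    hinge_lab = np.maximum(0.0, 1.0 - margins_lab)
    hinge_unl = np.maximum(0.0, 1.0 - margins_unl) # pushes the boundary away from unlabeled points
    return w @ w + C * hinge_lab.sum() + C_star * hinge_unl.sum()
```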
31. Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Steps get smaller over time
• Each misclassified or unlabeled instance moves the classifier a bit

Implementation on Hadoop
• Mapper: send the data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one (a sketch of the reducer-side update follows below)
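A plain-Python sketch of the reducer-side update: one stochastic gradient pass over the semi-supervised objective, with shrinking steps and a small move for every misclassified or unlabeled instance. This is an illustration under stated assumptions, not the deck's actual Hadoop code; the mapper is assumed to have shuffled the instances into random order.

```python
# Hedged sketch of one SGD pass over the S3VM objective sketched above.
import numpy as np

def s3vm_sgd_pass(stream, d, lam=0.01, C=1.0, C_star=0.1):
    """stream yields (x, y) pairs with y in {-1, +1}, or y = None for unlabeled."""
    w, b, t = np.zeros(d), 0.0, 0
    for x, y in stream:
        t += 1
        eta = 1.0 / (lam * t)                          # steps get smaller over time
        w *= (1 - eta * lam)                           # regularization shrinkage
        if y is None:
            y_hat = 1.0 if x @ w + b >= 0 else -1.0    # use the current prediction as the label
            if y_hat * (x @ w + b) < 1:                # unlabeled point inside the margin
                w += eta * C_star * y_hat * x
                b += eta * C_star * y_hat
        elif y * (x @ w + b) < 1:                      # labeled point misclassified / in margin
            w += eta * C * y * x
            b += eta * C * y
    return w, b
```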
36. The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected (a graph-construction sketch follows below)
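A brute-force sketch of building the k-nearest-neighbor similarity graph with Gaussian edge weights. The distance measure, k and sigma are illustrative choices; the nearest-neighbor join slides below describe how to scale this step out.

```python
# Hedged sketch: build a symmetric k-NN similarity graph (O(n^2) brute force).
import numpy as np

def knn_graph(X, k=3, sigma=1.0):
    """Return a symmetric (n, n) weight matrix connecting nearest neighbors."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)    # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]                   # skip the point itself
        W[i, neighbors] = np.exp(-d2[i, neighbors] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                                     # symmetrize the graph
```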
40. Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to a matrix inversion (an iterative sketch follows below)
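An iterative label-propagation sketch over the similarity graph: unlabeled nodes repeatedly take the weighted average of their neighbors' scores while labeled nodes stay clamped. Function and variable names are illustrative.

```python
# Hedged sketch of iterative label propagation on a similarity graph W.
import numpy as np

def label_propagation(W, y, labeled_mask, n_iter=100, tol=1e-6):
    """y: +1/-1 for labeled nodes, 0 elsewhere. Returns a score per node."""
    # Row-normalize the weights so each node averages over its neighbors.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f_new = P @ f                            # propagate labels to neighbors
        f_new[labeled_mask] = y[labeled_mask]    # clamp the known labels
        if np.abs(f_new - f).max() < tol:        # converged
            return f_new
        f = f_new
    return f

# Usage sketch: the predicted label is the sign of the propagated score.
# W = knn_graph(X); scores = label_propagation(W, y, labeled_mask)
# predictions = np.sign(scores)
```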
44. Nearest Neighbor Join

Block Nested Loop Join
• 1st MR job: partition the data into blocks, compute nearest neighbors between blocks
• 2nd MR job: filter out the overall nearest neighbors

Smarter Approaches
• Use spatial information to avoid unnecessary comparisons
• R-trees, space-filling curves, locality-sensitive hashing
• Example implementations are available online

[Figure: the data is split into blocks 1-4; reducers process every pair of blocks (1 1, 1 2, ..., 4 4), and the second job combines the per-pair candidates into the overall nearest-neighbor result. A simulated sketch of the two jobs follows below.]
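A plain-Python simulation of the two-job block nested loop join (no actual MapReduce): "job 1" computes candidate nearest neighbors for every pair of blocks, "job 2" filters them down to the overall nearest neighbors. The round-robin partitioning is a toy assumption.

```python
# Hedged sketch simulating the block nested loop k-NN join from the slide.
import numpy as np
from itertools import product

def blocked_knn_join(X, n_blocks=4, k=3):
    n = len(X)
    block_of = np.arange(n) % n_blocks              # toy round-robin partitioning into blocks
    blocks = {b: np.where(block_of == b)[0] for b in range(n_blocks)}

    # "Job 1": one reducer per block pair emits, for each point in block a,
    # its k nearest neighbors within block b.
    candidates = {i: [] for i in range(n)}
    for a, b in product(range(n_blocks), repeat=2):
        for i in blocks[a]:
            dists = sorted((np.linalg.norm(X[i] - X[j]), j) for j in blocks[b] if j != i)
            candidates[i].extend(dists[:k])

    # "Job 2": per point, filter the per-pair candidates down to the overall k nearest.
    return {i: [j for _, j in sorted(cands)[:k]] for i, cands in candidates.items()}
```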
52. Label Propagation on Hadoop

Iterative Procedure
• Mapper: emit neighbor-label pairs
• Reducer: collect the incoming labels per node and combine them into a new label
• Repeat until convergence (a mapper/reducer sketch of one iteration follows below)

Improvement
• Cluster the data and perform within-cluster propagation locally
• Sometimes it is more efficient to perform the matrix inversion instead
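A sketch of one label-propagation iteration expressed as a map step and a reduce step, simulated in plain Python; in the deck's setup each iteration would be one MapReduce job, repeated until convergence. Data structures and names are illustrative.

```python
# Hedged sketch: one label-propagation iteration as map and reduce steps.
# graph: {node: [(neighbor, weight), ...]}, labels: {node: score}.
from collections import defaultdict

def lp_map(graph, labels):
    """Mapper: for every edge (node -> neighbor), emit (neighbor, weighted label, weight)."""
    for node, score in labels.items():
        for neighbor, weight in graph[node]:
            yield neighbor, weight * score, weight

def lp_reduce(emitted, old_labels, clamped):
    """Reducer: per node, combine incoming labels into a new label (weighted average);
    nodes with known labels keep them."""
    sums, weights = defaultdict(float), defaultdict(float)
    for node, ws, w in emitted:
        sums[node] += ws
        weights[node] += w
    new_labels = dict(old_labels)
    for node in sums:
        if node not in clamped:
            new_labels[node] = sums[node] / weights[node]
    return new_labels

# One iteration; repeat until the labels stop changing:
# labels = lp_reduce(lp_map(graph, labels), labels, clamped_nodes)
```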
57. Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide a valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Code available: https://github.com/Datameer-Inc/lesel