This summary covers the key topics discussed in a morning machine learning class:
- The class covered the state of machine learning including common problems, tasks, features, and technology used. Decision trees and ensembles of decision trees were explained in detail.
- Data transformations and feature engineering techniques were also discussed, including discretization, normalization, and projections. Evaluating machine learning algorithms and handling imbalanced datasets were additional topics.
3. State of the Art in ML
• History
• Machine Learning problems and tasks
➔ Supervised learning: classification, regression, multi-label classification
➔ Unsupervised learning: clusters, anomaly detectors
➔ Semi-supervised learning: inference from partially labeled data
• Features: numeric, categorical, date-time, text
  Text analysis: frequency-weighted bag of words
Poul Petersen (BigML)
Explicit rules: difficult to find and re-train
Implicit rules (data rules): easy to re-train
4. State of the Art in ML
• Technology
  Storage: low prices, big data
  Cloud: computational power
  APIs: combination and accessibility; predictive APIs
• Teaching computers to learn:
  Too general vs. too specific (under-fitting vs. over-fitting)
  Missing values handling: new category, averages, multiple choices
5. Evaluating ML Algorithms
Cèsar Ferri (UPV)
• Supervised learning:
  Classification (output in a set of classes)
  Regression (output is a number)
• Unsupervised learning: no output info
• Training / test separation: partitioning data, bootstrap or cross-validation
• Classification: confusion matrix
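The training/test separation strategies listed above can be sketched in stdlib Python. This is a minimal illustration (function names and defaults are mine, not from the course):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Partition data into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def bootstrap_sample(data, seed=42):
    """Draw len(data) items with replacement: some repeat, some are left out."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

def k_folds(data, k=5):
    """Yield (train, test) pairs for k-fold cross-validation."""
    for i in range(k):
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test
```

Each fold's test set is disjoint from its training set, and every example appears in exactly one test fold.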
6. Evaluating ML Algorithms (cont.)
• Classification metrics: Accuracy, Precision, Recall, F-measure
  Extending to multi-class problems (averaging)
• Regression metrics:
  Mean Absolute Error
  Mean Squared Error (more sensitive to extreme errors)
  Root Mean Squared Error
  Normalized for comparing classifiers:
    Relative Mean Squared Error
    Relative Mean Absolute Error
    R²
• Unsupervised evaluation: no estimations; association rules, support
• Clustering: distance- and shape-based evaluation (borders, centers, distribution)
7. Decision Trees
Gonzalo Martinez (UAM)
• History
  Expert-based systems: human experts' rules (MYCIN: 600 rules, XCON: 2500 rules)
  Automatized knowledge acquisition: mining archives of cases (scalable); rule learners: CHAID, CART, ID3, C4.5
• Classification and Regression Trees
  Structure where data is repeatedly separated into groups according to attribute values, to minimize error / maximize information gain (split criterion: gini impurity)
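The gini impurity split criterion mentioned above can be computed directly from label counts. A minimal sketch (helper names are mine):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: probability that two random picks disagree in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity decrease of a candidate split (higher is better)."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted
```

A tree builder evaluates `split_gain` for every candidate predicate and keeps the split with the largest gain.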
8. Decision Trees
PROs
● Convertible to rules
● Handle categorical and numeric attributes
● Handle uninformative or redundant attributes
● Handle missing values
● Non-parametric (no predefined idea of the concept to learn)
● Easy to tune (small number of parameters)
CONs
● Complex feature interactions
● Replication problem
9. Decision Trees
Predicates
  Rules are based on the split predicates
  Missing values
  Oblique splits (compare features)
Stopping criteria
  All instances in one class
  No split found
  Small number of instances
  Gain below threshold
  Maximum depth
Pruning
  To avoid over-fitting
  CART is slower (more trees needed, avoids complexity)
  C4.5 is faster but has no confidence threshold (avoids small nodes)
Parameters
  Number of nodes, depth, pruning (on/off and confidence), minimum number of instances to split
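The stopping criteria listed above amount to a simple check a recursive tree builder runs before attempting a split. A sketch with illustrative parameter names and default thresholds of my choosing:

```python
def should_stop(labels, depth, gain, *, max_depth=5, min_instances=2, min_gain=1e-3):
    """Return True when node growth should stop, per the slide's criteria."""
    if len(set(labels)) == 1:        # all instances in one class
        return True
    if len(labels) < min_instances:  # small number of instances
        return True
    if gain < min_gain:              # best split's gain below threshold
        return True
    if depth >= max_depth:           # maximum depth reached
        return True
    return False
```

The "no split found" case corresponds to a best gain of zero, which the gain threshold also catches.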
10. Ensembles of Decision Trees
Gonzalo Martinez (UAM)
• Ensembles of models
  Randomizing to decrease errors and over-fitting: data, features or algorithms
  Combined with voting or non-voting strategies (aggregators)
[Diagram: each model predicts a class for a new instance x; the votes are aggregated]
• Best overall performance (together with SVMs)
• Almost parameter-less
• On trees, very fast to train and test
• Slower than a single classifier (mitigated with pruning)
11. Ensembles of Decision Trees: BAGGING
• Robust
• Improves error
• Parallelizable
[Diagram: bootstrap samples 1…T drawn from the original dataset, each with some examples repeated and others removed; one model is trained per sample]
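Bagging as diagrammed above: train one model per bootstrap sample, then combine by majority vote. A minimal sketch (the `fit` callback stands in for any base learner, typically a decision tree; names are illustrative):

```python
import random
from collections import Counter

def bagging_predict(train, x, fit, T=11, seed=0):
    """Train T models on bootstrap samples of `train`, then majority-vote on x.

    `fit` takes a training sample and returns a callable model.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(T):
        sample = [rng.choice(train) for _ in train]  # sample with replacement
        model = fit(sample)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]
```

Because every model trains on an independent bootstrap sample, the T fits can run in parallel, which is the "parallelizable" property on the slide.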
12. Ensembles of Decision Trees: BOOSTING
• Good average generalization error
• Not robust (noise)
• Can increment the error of the base classifier
• Not parallelizable
[Diagram: iterations 1…T over the original dataset, each re-weighting the examples the previous model misclassified]
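The iteration-to-iteration re-weighting is what makes boosting sequential (and sensitive to noise, since mislabeled examples keep gaining weight). A sketch of one AdaBoost-style round, as an illustration of the general scheme rather than the exact algorithm taught:

```python
import math

def adaboost_round(weights, correct):
    """One boosting iteration: compute the model's vote weight (alpha)
    from its weighted error, then up-weight the misclassified examples.

    `correct` is a list of booleans, one per training example.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return alpha, [w / total for w in new]
```

After the update, the misclassified examples carry half the total weight, forcing the next model to focus on them.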
13. Ensembles of Decision Trees: RANDOM FORESTS
• Robust
• Improves error
• Parallelizable
• Better than boosting
• Very fast to train
[Diagram: bootstrap samples 1…T as in bagging, plus a random feature subset for each tree]
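Random forests add feature-level randomization on top of bagging's bootstrap samples. A sketch of one tree's training view (the sqrt-of-features subset size is a common convention, not stated on the slide):

```python
import random

def random_forest_sample(train, n_features, seed=0, subset_size=None):
    """One tree's training view: a bootstrap sample of the rows plus a
    random subset of the feature indices."""
    rng = random.Random(seed)
    if subset_size is None:
        subset_size = max(1, int(n_features ** 0.5))  # common default
    rows = [rng.choice(train) for _ in train]         # with replacement
    features = rng.sample(range(n_features), subset_size)  # without replacement
    return rows, features
```

Each tree only ever sees its own rows and features, so all T trees can be built independently and in parallel.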
14. Ensembles of Decision Trees: CLASS SWITCHING
• Can improve results in cases where normal decision trees are not especially good
[Diagram: datasets 1…T generated by injecting random label noise (p = 30%) into the original dataset]
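Class switching randomizes the labels rather than the rows or features: each ensemble member trains on a copy of the data where a fraction p of labels has been flipped to a different class. A sketch (my own function name; p = 30% as on the slide):

```python
import random

def class_switch(labels, p=0.30, seed=0):
    """Flip each label to a different, randomly chosen class with probability p."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    out = []
    for y in labels:
        if rng.random() < p:
            out.append(rng.choice([c for c in classes if c != y]))
        else:
            out.append(y)
    return out
```

Calling this T times with different seeds yields the T noisy datasets in the diagram; the ensemble's majority vote averages the noise away.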
15. Data Transformations and FE
Charles Parker (BigML)
• Human knowledge used to compensate for data problems: broken data (remove corner cases, defaults), missing values (may carry meaning), reducing complexity (grouping classes), distances
• Discretization: significant bins instead of concrete values
• Delta: the difference or distance between features can be significant
• Standardization: mean of zero and standard deviation of one
• Normalizing: feature vectors with unit norm
• Windowing: previous points distributed in time
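Three of the transformations above are one-liners over a feature column or vector. A minimal stdlib sketch (helper names are mine):

```python
import statistics

def standardize(xs):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population std; assumes sigma > 0
    return [(x - mu) / sigma for x in xs]

def normalize(vec):
    """Scale a feature vector to unit Euclidean norm."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]

def discretize(x, edges):
    """Map a value to its bin index, given sorted bin edges."""
    return sum(x >= e for e in edges)
```

Standardization acts per feature across examples; normalization acts per example across features; discretization replaces a concrete value with its bin.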
16. Data Transformations and FE (cont.)
• Projections: combining features into a new feature basis (lowering dimensionality)
  New axes: Principal Component Analysis
  Keep neighbours: spectral embeddings; combination methods (Large Margin Nearest Neighbor, Xing's method)
• Sparsity: compressing sparse text and image data by sampling and grouping
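PCA's "new axis" idea can be shown without a linear-algebra library in the 2-D case, where the first principal component has a closed form from the covariance matrix. A sketch (function names are mine; real pipelines would use an eigensolver):

```python
import math

def first_pc_2d(points):
    """Direction of maximum variance for centered 2-D data, from the
    closed-form rotation angle of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # principal-axis rotation
    return math.cos(theta), math.sin(theta)

def project(points, axis):
    """Lower dimensionality: keep only each point's coordinate along `axis`."""
    ax, ay = axis
    return [x * ax + y * ay for x, y in points]
```

Projecting onto the first component keeps the direction that explains the most variance, which is the dimensionality-lowering the slide describes.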
17. Unbalanced Datasets
Poul Petersen (BigML)
• Sub-sampling and over-sampling: restore balance by removing examples from over-represented categories or giving higher weight to under-represented categories
• Evaluating unbalanced datasets
  Good accuracy is not enough; look at precision and recall
  Precision vs. recall trade-off: you must define the cost of each (letting out positives vs. letting in negatives)
[Bar chart: class counts, "Fraud" vs. "Not Fraud", y-axis 0–3750]
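Over-sampling as described above can be done by repeating minority-class examples until every class matches the majority count. A minimal sketch (function name is mine; the fraud/ok labels echo the chart):

```python
import random

def rebalance(data, seed=0):
    """Over-sample minority classes (with replacement) until every class
    is as large as the biggest one. `data` is a list of (x, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(v) for v in by_class.values())
    out = []
    for items in by_class.values():
        out.extend(items)                                       # originals
        out.extend(rng.choice(items) for _ in range(target - len(items)))  # repeats
    return out
```

Sub-sampling is the mirror image: trim each class down to the size of the smallest one instead.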
18. Unbalanced Datasets (cont.)
• Automatic balancing: equal representation per class
• Weighting: decide which instances are more important; adds new information to the dataset. Per class or per instance.
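Per-class weighting is often done with inverse-frequency weights, so that rare classes count more and every class contributes equally in total. A sketch of that common scheme (not necessarily the exact one taught):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight(c) = n / (k * count(c)),
    where n is the dataset size and k the number of classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}
```

Per-instance weighting generalizes this: instead of one weight per class, each example carries its own importance.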