MLHEP Lectures - day 3, basic track
1. Machine Learning in High Energy Physics
Lectures 5 & 6
Alex Rogozhnikov
Lund, MLHEP 2016
2. Linear models: linear regression
Minimizing MSE:
$d(x) = \langle w, x \rangle + w_0$
$\mathcal{L} = \frac{1}{N} \sum_i L_{mse}(x_i, y_i) \to \min$
$L_{mse}(x_i, y_i) = (d(x_i) - y_i)^2$
3. Linear models: logistic regression
Minimizing logistic loss:
$d(x) = \langle w, x \rangle + w_0$
$\mathcal{L} = \sum_i L_{logistic}(x_i, y_i) \to \min$
Penalty for a single observation with label $y_i = \pm 1$:
$L_{logistic}(x_i, y_i) = \ln(1 + e^{-y_i d(x_i)})$
4. Linear models: support vector machine (SVM)
$L_{hinge}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$
Margin: $y_i d(x_i) > 1 \to$ no penalty
5. Kernel trick
We can project the data into a higher-dimensional space, e.g. by adding new features.
Hopefully, the distributions are separable in the new space.
6. Kernel trick
$P$ is a projection operator:
$w = \sum_i \alpha_i P(x_i)$
$d(x) = \langle w, P(x) \rangle_{new}$
We only need the kernel:
$K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{new}$
$d(x) = \sum_i \alpha_i K(x_i, x)$
Popular choices: polynomial kernel and RBF kernel.
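A minimal scikit-learn illustration of the kernel trick (toy dataset and hyperparameters are only for demonstration): the RBF-kernel SVM separates data that is not linearly separable in the original space.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# toy data that is not linearly separable in the original 2D space
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
# RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2); the projection P is never computed explicitly
clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y))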
9. Decision trees
building an optimal tree is NP-hard
heuristic: use greedy optimization
optimization criteria (impurities): misclassification, Gini, entropy
10. Decision trees for regression
Optimizing MSE, prediction inside a leaf is constant.
11. Overfitting in decision tree
pre-stopping
post-pruning
unstable to changes in the training dataset
12. Random Forest
Many trees built independently
bagging of samples
subsampling of features
Simple voting is used to get the prediction of an ensemble
13. Random Forest
overfitted (in the sense that predictions for train and test are different)
doesn't overfit: increasing complexity (adding more trees) doesn't spoil a classifier
14. Random Forest
simple and parallelizable
doesn't require much tuning
hardly interpretable
but feature importances can be computed
doesn't fix samples poorly classified at previous stages
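A minimal scikit-learn sketch (toy dataset, illustrative parameters), including the feature importances mentioned above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# trees are built independently on bootstrap samples, with feature subsampling at each split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)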
16. Sample weights in ML
Can be used with many estimators. We now have triples $x_i, y_i, w_i$ ($i$ is the index of an event)
weight corresponds to frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
global normalization of weights doesn't matter
17. Sample weights in ML
Can be used with many estimators. We now have triples $x_i, y_i, w_i$ ($i$ is the index of an event)
weight corresponds to frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
global normalization of weights doesn't matter
Example for logistic regression:
$\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$
18. Weights (parameters) of a classifier ≠ sample weights
In code:
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)
Sample weights are a convenient way to regulate the importance of training events.
Only sample weights are meant when talking about AdaBoost.
19. AdaBoost [Freund, Schapire, 1995]
Bagging: information from previous trees not taken into account.
Adaptive Boosting is a weighted composition of weak learners:
$D(x) = \sum_j \alpha_j d_j(x)$
We assume $d_j(x) = \pm 1$, labels $y_i = \pm 1$,
the $j$-th weak learner misclassified the $i$-th event iff $y_i d_j(x_i) = -1$.
20. AdaBoost
Weak learners are built in sequence, each classifier is trained using different
weights
$D(x) = \sum_j \alpha_j d_j(x)$
initially $w_i = 1$ for each training sample
After building the $j$-th base classifier:
1. compute the total weight of correctly and wrongly classified events and take
$\alpha_j = \frac{1}{2} \ln\left(\frac{w_{correct}}{w_{wrong}}\right)$
2. increase weights of misclassified samples:
$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$
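A from-scratch sketch of this procedure with decision stumps as weak learners (toy dataset; the number of iterations and the depth are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                                   # labels to +-1, as on the slide
w = np.ones(len(y))                             # initially w_i = 1
learners, alphas = [], []
for j in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)                     # d_j(x_i) = +-1
    w_correct, w_wrong = w[pred == y].sum(), w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / max(w_wrong, 1e-12))
    w = w * np.exp(-alpha * y * pred)           # increase weights of misclassified samples
    learners.append(stump)
    alphas.append(alpha)
D = sum(a * l.predict(X) for a, l in zip(alphas, learners))   # D(x) = sum_j alpha_j d_j(x)
print((np.sign(D) == y).mean())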
25. AdaBoost secret
$D(x) = \sum_j \alpha_j d_j(x)$
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$
sample weight is equal to the penalty for an event:
$w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$
$\alpha_j$ is obtained as a result of analytical optimization
Exercise: prove the formula for $\alpha_j$
27. AdaBoost summary
is able to combine many weak learners
takes mistakes into account
simple, overhead for boosting is negligible
too sensitive to outliers
In scikit-learn, one can run AdaBoost over other algorithms.
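For instance (the base learner and parameters here are only an illustration; the base estimator must support sample weights):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
# the first argument is the base estimator (named base_estimator or estimator depending on the scikit-learn version)
ada = AdaBoostClassifier(LogisticRegression(), n_estimators=50)
ada.fit(X, y)
print(ada.score(X, y))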
30. Gradient boosting to minimize MSE
Say, we're trying to build an ensemble to minimize MSE:
$\sum_i (D(x_i) - y_i)^2 \to \min$
where the ensemble's prediction is obtained by taking the sum $D(x) = \sum_j d_j(x)$,
$D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$
Assuming that we have already built $j - 1$ estimators, how do we train the next one?
31. Natural solution is to greedily minimize MSE:
$\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$
Introduce the residual $R_j(x_i) = y_i - D_{j-1}(x_i)$; now we simply need to minimize MSE:
$\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$
So the $j$-th estimator (tree) is trained using the following data: $x_i, R_j(x_i)$
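A minimal sketch of this loop for MSE (toy regression data; depth and number of trees are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
prediction = np.zeros(len(y))                   # D_0(x) = 0
trees = []
for j in range(100):
    residual = y - prediction                   # R_j(x_i) = y_i - D_{j-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    prediction += tree.predict(X)               # D_j(x) = D_{j-1}(x) + d_j(x)
print(((prediction - y) ** 2).mean())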
35. Gradient Boosting [Friedman, 1999]
composition of weak regressors: $D(x) = \sum_j \alpha_j d_j(x)$
Borrow an approach to encode probabilities from logistic regression:
$p_{+1}(x) = \sigma(D(x)), \quad p_{-1}(x) = \sigma(-D(x))$
Optimization of log-likelihood ($y_i = \pm 1$):
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
36. Gradient Boosting
Optimization problem: find all $\alpha_j$ and weak learners $d_j$
$D(x) = \sum_j \alpha_j d_j(x)$
$\mathcal{L} = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
Mission impossible
37. Gradient Boosting
Optimization problem: find all $\alpha_j$ and weak learners $d_j$
$D(x) = \sum_j \alpha_j d_j(x)$
$\mathcal{L} = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
Mission impossible
Main point: greedy optimization of the loss function by training one more weak learner
Each new estimator $d_j$ follows the gradient of the loss function
38. Gradient Boosting
Gradient boosting ~ steepest gradient descent.
$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x), \qquad D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$
At the $j$-th iteration:
compute pseudo-residuals $R(x_i) = - \left. \frac{\partial \mathcal{L}}{\partial D(x_i)} \right|_{D(x) = D_{j-1}(x)}$
train the regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - R(x_i))^2 \to \min$
find the optimal $\alpha_j$
Important exercise: compute pseudo-residuals for MSE and logistic losses.
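For the exercise above, one possible sketch of the pseudo-residuals (constant factors can be absorbed into $\alpha_j$; labels are $\pm 1$ for LogLoss):

import numpy as np

def mse_pseudo_residual(y, D):
    # L = sum_i (D(x_i) - y_i)^2   =>   -dL/dD(x_i) ~ y_i - D(x_i)
    return y - D

def logloss_pseudo_residual(y, D):
    # L = sum_i ln(1 + exp(-y_i D(x_i))), y_i = +-1   =>   -dL/dD(x_i) = y_i * sigma(-y_i D(x_i))
    return y / (1.0 + np.exp(y * D))

y = np.array([1, -1, 1])
D = np.array([0.3, 0.1, -2.0])
print(mse_pseudo_residual(y, D), logloss_pseudo_residual(y, D))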
39. Additional GB tricks
to make training more stable, add a learning rate $\eta$:
$D(x) = \eta \sum_j \alpha_j d_j(x)$
randomization to fight noise and build different trees:
subsampling of features
subsampling of training samples
40. AdaBoost is a particular case of gradient boosting with different target loss
function*:
$\mathcal{L}_{ada} = \sum_i e^{-y_i D(x_i)} \to \min$
This loss function is called ExpLoss or AdaLoss.
*(also AdaBoost expects that $d_j(x_i) = \pm 1$)
41. Loss functions
Gradient boosting can optimize different smooth loss functions.
regression, $y \in \mathbb{R}$:
Mean Squared Error, $\sum_i (d(x_i) - y_i)^2$
Mean Absolute Error, $\sum_i |d(x_i) - y_i|$
binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss), $\sum_i e^{-y_i d(x_i)}$
LogLoss, $\sum_i \log(1 + e^{-y_i d(x_i)})$
44. Usage of second-order information
For additive loss functions, apart from gradients $g_i$, we can make use of second derivatives $h_i$:
$\mathcal{L} = \sum_i L(D_j(x_i), y_i) = \sum_i L(D_{j-1}(x_i) + d_j(x_i), y_i) \approx$
$\approx \sum_i \left[ L(D_{j-1}(x_i), y_i) + g_i d_j(x_i) + \frac{h_i}{2} d_j^2(x_i) \right]$
E.g. select leaf values using a second-order step:
$\mathcal{L} = \mathcal{L}_{j-1} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$
45. Using second-order information
Independent optimization. Explicit solution for optimal values in the leaves:
$\mathcal{L} \approx \mathcal{L}_{j-1} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$
where $g_{leaf} = \sum_{i \in leaf} g_i$, $h_{leaf} = \sum_{i \in leaf} h_i$.
$w_{leaf} = - \frac{g_{leaf}}{h_{leaf}}$
46. Using second-order information: recipe
On each iteration of gradient boosting
1. train a tree to follow the gradient (minimize MSE with the gradient as target)
2. change the values assigned in the leaves to:
$w_{leaf} \leftarrow - \frac{g_{leaf}}{h_{leaf}}$
3. update predictions (no weight for the estimator, $\alpha_j = 1$):
$D_j(x) = D_{j-1}(x) + \eta\, d_j(x)$
This improvement is quite cheap and allows smaller GBs to be more effective.
We can also use information about hessians at the tree-building step (step 1).
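A rough sketch of step 2 for a fitted sklearn tree (gradients g and hessians h are assumed to be precomputed for the current loss; the toy usage below takes MSE, where g_i = D(x_i) - y_i and h_i = 1):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def second_order_leaf_values(tree, X, g, h):
    # for each leaf, compute w_leaf = -g_leaf / h_leaf (sums over events falling into that leaf)
    leaf_ids = tree.apply(X)
    return {leaf: -g[leaf_ids == leaf].sum() / h[leaf_ids == leaf].sum()
            for leaf in np.unique(leaf_ids)}

# toy usage with MSE and the current prediction D = 0
X = np.random.normal(size=(200, 3))
y = X[:, 0] + np.random.normal(scale=0.1, size=200)
g, h = 0.0 - y, np.ones(len(y))
tree = DecisionTreeRegressor(max_depth=2).fit(X, -g)    # step 1: follow the (negative) gradient
print(second_order_leaf_values(tree, X, g, h))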
48. Multiclass classification: modifying an algorithm
Most classifiers have natural generalizations to multiclass classification.
Example for logistic regression: introduce for each class $c \in \{1, 2, \ldots, C\}$ a vector $w_c$:
$d_c(x) = \langle w_c, x \rangle$
Converting to probabilities using the softmax function:
$p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde{c}} e^{d_{\tilde{c}}(x)}}$
And minimize LogLoss:
$\mathcal{L} = - \sum_i \log p_{y_i}(x_i)$
49. Softmax function
Typical way to convert $n$ numbers to $n$ probabilities.
Mapping is surjective, but not injective ($n$ dimensions to $n - 1$ dimensions).
Invariant to a global shift: $d_c(x) \to d_c(x) + \text{const}$
For the case of two classes:
$p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{d_2(x) - d_1(x)}}$
Coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$.
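A minimal numpy sketch illustrating the shift invariance:

import numpy as np

def softmax(d):
    e = np.exp(d - d.max())          # subtracting the max is the usual numerical-stability trick
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))
print(softmax(np.array([11.0, 12.0, 13.0])))   # same probabilities: invariance to a global shift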
50. Loss function: ranking example
In ranking we need to order items by $y_i$:
$y_i < y_j \Rightarrow d(x_i) < d(x_j)$
We can penalize for misordering:
$\mathcal{L} = \sum_{i, \tilde{i}} L(x_i, x_{\tilde{i}}, y_i, y_{\tilde{i}})$
$L(x, \tilde{x}, y, \tilde{y}) = \begin{cases} \sigma(d(\tilde{x}) - d(x)), & y < \tilde{y} \\ 0, & \text{otherwise} \end{cases}$
51. Adapting boosting
By modifying boosting or changing the loss function we can solve different problems:
classification
regression
ranking
HEP-specific examples in Tatiana's lecture tomorrow.
55. Gradient Boosting overview
A powerful ensembling technique (typically used over trees, GBDT)
a general way to optimize differentiable losses
can be adapted to other problems
'following' the gradient of loss at each step
making steps in the space of functions
gradient of poorly-classified events is higher
increasing the number of trees can lead to overfitting
(= getting worse quality on new data)
requires tuning, better when trees are not complex
widely used in practice
56. Feature engineering
Feature engineering = creating features to get the best result with ML
important step
mostly relying on domain knowledge
requires some understanding
most of practitioners' time is spent at this step
57. Feature engineering
Analyzing available features
scale and shape of features
Analyze which information is lacking
challenge example: maybe subleading jets matter?
Validate your guesses
58. Feature engineering
Analyzing available features
scale and shape of features
Analyze which information is lacking
challenge example: maybe subleading jets matter?
Validate your guesses
Machine learning is a proper tool for checking your understanding of data
59. Linear models example
A single event with a sufficiently large feature value can break almost all linear models.
Heavy-tailed distributions are harmful; pre-transforming is required:
logarithm
power transform
and throwing out outliers
The same tricks actually help more advanced methods too.
Which transformation is the best for Random Forest?
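A numpy sketch of such pre-transformations (the heavy-tailed feature here is synthetic):

import numpy as np

x = np.random.lognormal(mean=0.0, sigma=2.0, size=10000)   # heavy-tailed positive feature
x_log = np.log1p(x)                                        # logarithm
x_pow = x ** 0.25                                          # power transform
x_clip = np.clip(x, 0.0, np.percentile(x, 99))             # clipping / throwing out outliers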
61. Decision tree example
$\eta_{lepton}$ is hard for a tree to use, since it provides no good splitting.
Don't forget that a tree can't reconstruct linear combinations; take care of this.
62. Example of feature: invariant mass
Using HEP coordinates, invariant mass of two products is:
$m^2_{inv} \approx 2\, p_{T1}\, p_{T2}\, (\cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2))$
Can't be recovered with ensembles of trees of depth < 4 when using only canonical features.
What about the invariant mass of 3 particles?
63. Example of feature: invariant mass
Using HEP coordinates, invariant mass of two products is:
$m^2_{inv} \approx 2\, p_{T1}\, p_{T2}\, (\cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2))$
Can't be recovered with ensembles of trees of depth < 4 when using only canonical features.
What about the invariant mass of 3 particles? (see Vicens' talk today).
Good features are ones that are explainable by physics. Start from the simplest and most natural.
Mind the cost of computing the features.
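A sketch of computing this feature (massless approximation, as in the formula above; the kinematic values below are placeholders):

import numpy as np

def m2_inv(pt1, eta1, phi1, pt2, eta2, phi2):
    # m^2 ~= 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))
    return 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))

print(m2_inv(30.0, 0.5, 1.0, 45.0, -0.2, 2.5))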
64. Output engineering
Typically not discussed, but the target of learning plays an important role.
Example: predicting the number of comments for an article. Assuming the error is MSE:
$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$
Consider two cases:
100 comments when predicted 0
1200 comments when predicted 1000
65. Output engineering
Ridiculously large impact of highly-commented articles.
We need to predict order, not exact number of comments.
Possible solutions:
alternative loss function, e.g. use MAPE
apply logarithm to the target, predict $\log(\#\,\text{comments})$
Evaluation score should be changed accordingly.
66. Sample weights
Typically used to estimate the contribution of event (how often we expect this
to happen).
Sample weights also matter in some other situations:
highly imbalanced datasets (e.g. 99% of events in class 0) tend to cause problems during optimization
changing sample weights to balance the dataset frequently helps
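A scikit-learn sketch of balancing an imbalanced dataset with sample weights (toy data):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

X = np.random.normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)                    # highly imbalanced labels
weights = compute_sample_weight('balanced', y)        # rare class gets proportionally larger weights
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)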
67. Feature selection
Why?
speed up training / prediction
reduce time of data preparation in the pipeline
help algorithm to 'focus' on finding reliable dependencies
useful when amount of training data is limited
Problem:
find a subset of features, which provides best quality
68. Feature selection
Exhaustive search: $2^d$ cross-validations.
incredible amount of resources
too many cross-validation cycles lead to overly optimistic quality on the test data
69. Feature selection
Exhaustive search: $2^d$ cross-validations.
incredible amount of resources
too many cross-validation cycles lead to overly optimistic quality on the test data
basic nice solution: estimate importance with RF / GBDT.
70. Feature selection
Exhaustive search: $2^d$ cross-validations.
incredible amount of resources
too many cross-validation cycles lead to overly optimistic quality on the test data
basic nice solution: estimate importance with RF / GBDT.
Filtering methods
Eliminate variables which seem not to carry statistical information about the target,
e.g. by measuring Pearson correlation or mutual information.
Example: all $\phi$ angles will be thrown out.
71. Feature selection: embedded methods
Feature selection is a part of training.
Example: $L_1$-regularized linear models.
Forward selection
Start from empty set of features.
For each feature in the dataset check if adding this feature improves the quality.
Backward elimination
Almost the same, but this time we iteratively eliminate features.
Bidirectional combination of the above is possible, some algorithms can use
previously trained model as the new starting point for optimization.
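A minimal forward-selection sketch (cross-validated logistic regression as the quality estimate; dataset and settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)
selected, remaining, best_quality = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=3).mean()
              for f in remaining}
    best_feature, quality = max(scores.items(), key=lambda item: item[1])
    if quality <= best_quality:                 # stop once adding a feature no longer helps
        break
    selected.append(best_feature)
    remaining.remove(best_feature)
    best_quality = quality
print(selected)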
74. PCA description
PCA is based on the principal axis theorem
$Q = U \Lambda U^T$
$Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix, $\Lambda$ is a diagonal matrix:
$\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
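A numpy sketch following this decomposition (toy correlated data):

import numpy as np

X = np.random.normal(size=(1000, 5)) @ np.random.normal(size=(5, 5))   # correlated toy data
Q = np.cov(X, rowvar=False)                      # covariance matrix of the dataset
eigenvalues, U = np.linalg.eigh(Q)               # Q = U Lambda U^T (Q is symmetric)
order = np.argsort(eigenvalues)[::-1]            # sort so that lambda_1 >= lambda_2 >= ...
eigenvalues, U = eigenvalues[order], U[:, order]
projected = (X - X.mean(axis=0)) @ U[:, :2]      # keep the two principal components
print(eigenvalues)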
77. Locally linear embedding
handles the case of non-linear dimensionality reduction
Express each sample as a convex combination of its neighbours:
$\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\|^2 \to \min_w$
78. Locally linear embedding
$\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\|^2 \to \min_w$
subject to constraints: $\sum_{\tilde{i}} w_{i\tilde{i}} = 1$, $w_{i\tilde{i}} \ge 0$, and $w_{i\tilde{i}} = 0$ if $i, \tilde{i}$ are not neighbors.
Finding an optimal mapping for all points simultaneously ($y_i$ are the images, i.e. positions in the new space):
$\sum_i \left\| y_i - \sum_{\tilde{i}} w_{i\tilde{i}} y_{\tilde{i}} \right\|^2 \to \min_y$
82. Fisher's LDA (Linear Discriminant Analysis) [1936]
Original idea: find a projection to discriminate classes best
83. Fisher's LDA
Mean and variance within a single class $c$ ($c \in \{1, 2, \ldots, C\}$):
$\mu_c = \langle x \rangle_{\text{events of class } c}$
$\sigma_c = \langle \|x - \mu_c\|^2 \rangle_{\text{events of class } c}$
Total within-class variance: $\sigma_{within} = \sum_c p_c \sigma_c$
Total between-class variance: $\sigma_{between} = \sum_c p_c \|\mu_c - \mu\|^2$
Goal: find a projection to maximize the ratio $\frac{\sigma_{between}}{\sigma_{within}}$
85. LDA: solving optimization problem
We are interested in finding a 1-dimensional projection $w$:
$\frac{w^T \Sigma_{between} w}{w^T \Sigma_{within} w} \to \max_w$
Naturally connected to the generalized eigenvalue problem
Projection vector corresponds to the highest generalized eigenvalue
Finds a subspace of $C - 1$ components when applied to a classification problem with $C$ classes
Fisher's LDA is a basic popular binary classification technique.
86. Common spatial patterns
When we expect that each class is close to some linear subspace, we can
(naively) find for each class this subspace by PCA
(better idea) take into account the variation of other data and optimize
$\text{tr}\left( W^T \Sigma_{class} W \right) \to \max_W \quad \text{subject to} \quad W^T \Sigma_{total} W = I$
Natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix; $n$ and $n_1$ are the numbers of dimensions in the original and new spaces.
Frequently used in neuroscience, in particular in BCI based on EEG / MEG.
88. Dimensionality reduction summary
is capable of extracting sensible features from high-dimensional data
frequently used to 'visualize' the data
nonlinear methods rely on the distance in the space
works well with high-dimensional spaces with features of the same nature
89. Finding optimal hyperparameters
some algorithms have many parameters (regularizations, depth, learning rate,
...)
not all the parameters are guessed
checking all combinations takes too long
90. Finding optimal hyperparameters
some algorithms have many parameters (regularizations, depth, learning rate,
...)
not all the parameters are guessed
checking all combinations takes too long
We need automated hyperparameter optimization!
92. Finding optimal parameters
randomly picking parameters is a partial solution
given a target optimal value we can optimize it
no gradient with respect to parameters
noisy results
93. Finding optimal parameters
randomly picking parameters is a partial solution
given a target optimal value we can optimize it
no gradient with respect to parameters
noisy results
function reconstruction is a problem
94. Finding optimal parameters
randomly picking parameters is a partial solution
given a target optimal value we can optimize it
no gradient with respect to parameters
noisy results
function reconstruction is a problem
Before running grid optimization, make sure your metric is stable (e.g. by training/testing on different subsets).
Overfitting (= getting a too optimistic estimate of quality on a holdout) by using many attempts is a real issue.
95. Optimal grid search
stochastic optimization (Metropolis-Hastings, annealing)
requires too many evaluations, uses only the last checked combination
regression techniques, reusing all known information
(ML to optimize ML!)
96. Optimal grid search using regression
General algorithm (point of grid = set of parameters):
1. evaluations at random points
2. build regression model based on known results
3. select the point with best expected quality according to trained model
4. evaluate quality at this point
5. go to 2 if not enough evaluations
Why not use linear regression?
Exploration vs. exploitation trade-off: should we explore poorly-covered regions or try to improve around the points currently seen to be optimal?
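A rough sketch of this loop in one dimension, using scikit-learn's GaussianProcessRegressor (next slide) as the regression model and a simple "mean + std" exploration bonus; the black-box quality function is a placeholder:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def quality(param):
    # placeholder for a real (noisy) metric, e.g. a cross-validated score for one hyperparameter
    return -(param - 3.0) ** 2 + np.random.normal(scale=0.1)

grid = np.linspace(0.0, 10.0, 200).reshape(-1, 1)                   # point of grid = set of parameters
tried = list(np.random.choice(len(grid), size=5, replace=False))   # 1. evaluate at random points
results = [quality(grid[i, 0]) for i in tried]
for _ in range(20):
    gp = GaussianProcessRegressor(alpha=1e-3).fit(grid[tried], results)   # 2. regression model
    mean, std = gp.predict(grid, return_std=True)
    candidate = int(np.argmax(mean + std))                 # 3. best expected quality + exploration
    tried.append(candidate)
    results.append(quality(grid[candidate, 0]))            # 4-5. evaluate and repeat
print(grid[tried[int(np.argmax(results))], 0])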
97. Gaussian processes for regression
Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are functions of mean and covariance: $m(x)$, $K(x, \tilde{x})$
$m(x) = \mathbb{E}\, Y(x)$ represents our prior expectation of quality (may be taken constant)
$K(x, \tilde{x}) = \operatorname{cov}\big(Y(x), Y(\tilde{x})\big)$ represents the influence of known results on the expectation of values at new points
RBF kernel is used here too: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|^2)$
Another popular choice: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|)$
We can model the posterior distribution of results at each point.
99. Gaussian processes
Gaussian processes model posterior distribution at each point of the grid.
we know at which point we have already well-estimated quality
and we are able to find regions which need exploration. See also this demo.
100. Summary about hyperoptimization
parameters can be tuned automatically
be sure that metric being optimized is stable
mind the optimistic quality estimation (resolved by one more holdout)
101. Summary about hyperoptimization
parameters can be tuned automatically
be sure that metric being optimized is stable
mind the optimistic quality estimation (resolved by one more holdout)
but this is not what you should spend your time on
the gain from properly cooking features / reconsidering the problem is much higher