Machine Learning in High Energy Physics
Lectures 5 & 6
Alex Rogozhnikov
Lund, MLHEP 2016
Linear models: linear regression
$d(x) = \langle w, x \rangle + w_0$
Minimizing MSE:
$\mathcal{L} = \frac{1}{N} \sum_i L_{mse}(x_i, y_i) \to \min$
$L_{mse}(x_i, y_i) = (d(x_i) - y_i)^2$
Linear models: logistic regression
$d(x) = \langle w, x \rangle + w_0$
Minimizing logistic loss ($y_i = \pm 1$):
$\mathcal{L} = \sum_i L_{logistic}(x_i, y_i) \to \min$
Penalty for a single observation:
$L_{logistic}(x_i, y_i) = \ln(1 + e^{-y_i d(x_i)})$
Linear models: support vector machine (SVM)
$L_{hinge}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$
Margin $y_i d(x_i) > 1 \to$ no penalty
Kernel trick
We can project data into a higher-dimensional space, e.g. by adding new features.
Hopefully, in the new space the distributions are separable.
Kernel trick
$P$ is a projection operator:
$w = \sum_i \alpha_i P(x_i)$
$d(x) = \langle w, P(x) \rangle_{new}$
We need only the kernel:
$K(x, \tilde x) = \langle P(x), P(\tilde x) \rangle_{new}$
$d(x) = \sum_i \alpha_i K(x_i, x)$
Popular choices: polynomial kernel and RBF kernel.
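A minimal sketch (not part of the original slides; scikit-learn is assumed): an SVM with an RBF kernel never computes the projection $P(x)$ explicitly, only the kernel values.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # kernelized SVM
print(clf.score(X, y))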
Regularizations
$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{reg} \to \min$
$L_2$ regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2$
$L_1$ regularization: $\mathcal{L}_{reg} = \beta \sum_j |w_j|$
$L_1 + L_2$: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
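A brief sketch (not from the slides; scikit-learn assumed) of $L_1$/$L_2$-regularized linear models; note that LogisticRegression uses C = 1/alpha as the inverse regularization strength.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)

clf_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print(clf_l1.coef_)   # L1 tends to zero out irrelevant coefficients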
Stochastic optimization methods
Stochastic gradient descent: take $i$ — a random event from the training data:
$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$
(can be applied to additive loss functions)
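A toy sketch (illustration only, names are made up) of the SGD update above for linear regression with the MSE loss.

import numpy as np

rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w, eta = np.zeros(3), 0.01
for step in range(10000):
    i = rng.randint(len(X))                  # pick a random event
    grad = 2 * (X[i] @ w - y[i]) * X[i]      # dL_mse/dw for a single event
    w -= eta * grad                          # SGD update
print(w)   # should approach true_w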
Decision trees
building an optimal tree is NP-complete
heuristic: use greedy optimization
optimization criteria (impurities): misclassification, Gini, entropy
Decision trees for regression
Optimizing MSE, prediction inside a leaf is constant.
Overfitting in decision tree
pre-stopping
post-pruning
unstable to changes in the training dataset
Random Forest
Many trees built independently
bagging of samples
subsampling of features
Simple voting is used to get the prediction of an ensemble
Random Forest
overfitted (in the sense that predictions for train and test are different)
doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier
Random Forest
simple and parallelizable
doesn't require much tuning
hardly interpretable
but feature importances can be computed
doesn't fix mistakes on samples poorly classified at previous stages
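A minimal sketch (not from the slides; scikit-learn assumed) of training a Random Forest and reading the feature importances mentioned above.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)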
Ensembles
Averaging decision functions:
$D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$
Weighted decision:
$D(x) = \sum_j \alpha_j d_j(x)$
Sample weights in ML
Can be used with many estimators. We now have triples $x_i, y_i, w_i$ ($i$ — index of an event):
weight corresponds to the frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$th event
global normalization of weights doesn't matter
Example for logistic regression:
$\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$
Weights (parameters) of a classifier ≠ sample weights
In code:
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)   # pass per-event weights to the estimator
Sample weights are a convenient way to regulate the importance of training events.
Only sample weights are meant when talking about AdaBoost.
AdaBoost [Freund, Shapire, 1995]
Bagging: information from previous trees is not taken into account.
Adaptive Boosting is a weighted composition of weak learners:
$D(x) = \sum_j \alpha_j d_j(x)$
We assume $d_j(x) = \pm 1$ and labels $y_i = \pm 1$;
the $j$th weak learner misclassified the $i$th event iff $y_i d_j(x_i) = -1$.
AdaBoost
$D(x) = \sum_j \alpha_j d_j(x)$
Weak learners are built in sequence; each classifier is trained using different weights:
initially $w_i = 1$ for each training sample
After building the $j$th base classifier:
1. compute the total weight of correctly and wrongly classified events and set
$\alpha_j = \frac{1}{2} \ln\left(\frac{w_{correct}}{w_{wrong}}\right)$
2. increase the weights of misclassified samples:
$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$
AdaBoost example
Decision trees of depth 1 will be used.
(Figures: AdaBoost predictions with 1, 2, 3, 100 trees.)
AdaBoost secret
$D(x) = \sum_j \alpha_j d_j(x)$
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$
sample weight is equal to the penalty for an event:
$w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$
$\alpha_j$ is obtained as a result of analytical optimization
Exercise: prove the formula for $\alpha_j$
Loss function of AdaBoost
AdaBoost summary
is able to combine many weak learners
takes mistakes into account
simple, overhead for boosting is negligible
too sensitive to outliers
In scikit-learn, one can run AdaBoost over other algorithms.
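A sketch (not from the slides; scikit-learn assumed) of AdaBoost with its default weak learner, a depth-1 decision tree; other base estimators can be plugged in (the constructor argument name depends on the scikit-learn version).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))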
Gradient Boosting
Decision trees for regression
Gradient boosting to minimize MSE
Say we're trying to build an ensemble to minimize MSE:
$\sum_i (D(x_i) - y_i)^2 \to \min$
The ensemble's prediction is obtained by summing the estimators:
$D(x) = \sum_j d_j(x), \qquad D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$
Assuming that we have already built $j - 1$ estimators, how do we train the next one?
A natural solution is to greedily minimize MSE:
$\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$
Introduce the residual $R_j(x_i) = y_i - D_{j-1}(x_i)$; now we simply need to minimize MSE:
$\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$
So the $j$th estimator (tree) is trained using the following data: $x_i, R_j(x_i)$
Example: regression with GB
using regression trees of depth=2
(Figures: predictions with number of trees = 1, 2, 3, 100.)
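A toy sketch (not from the slides) of the greedy MSE fitting described above: each new depth-2 tree is trained on the residuals of the current ensemble; scikit-learn's DecisionTreeRegressor is assumed.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

prediction = np.zeros(len(X))                       # D_0 = 0
trees = []
for j in range(100):
    residual = y - prediction                       # R_j(x_i) = y_i - D_{j-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    prediction += tree.predict(X)                   # D_j = D_{j-1} + d_j
print(np.mean((prediction - y) ** 2))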
Gradient Boosting visualization
Gradient Boosting [Friedman, 1999]
Composition of weak regressors:
$D(x) = \sum_j \alpha_j d_j(x)$
Borrow the approach to encoding probabilities from logistic regression:
$p_{+1}(x) = \sigma(D(x)), \qquad p_{-1}(x) = \sigma(-D(x))$
Optimization of the log-likelihood ($y_i = \pm 1$):
$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
Gradient Boosting
$D(x) = \sum_j \alpha_j d_j(x)$
Optimization problem: find all $\alpha_j$ and weak learners $d_j$ such that
$\mathcal{L} = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
Mission impossible
Main point: greedy optimization of the loss function by training one more weak learner $d_j$
Each new estimator follows the gradient of the loss function
Gradient Boosting
Gradient boosting ~ steepest gradient descent.
At the $j$th iteration:
$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x), \qquad D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$
compute the pseudo-residual
$R(x_i) = -\left.\frac{\partial \mathcal{L}}{\partial D(x_i)}\right|_{D(x) = D_{j-1}(x)}$
train a regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - R(x_i))^2 \to \min$
find the optimal $\alpha_j$
Important exercise: compute the pseudo-residuals for the MSE and logistic losses.
Additional GB tricks
to make training more stable, add a learning rate $\eta$:
$D_j(x) = \eta \sum_j \alpha_j d_j(x)$
randomization to fight noise and build different trees:
subsampling of features
subsampling of training samples
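A sketch (not from the slides; scikit-learn assumed) of gradient boosting with a learning rate and the randomization tricks listed above.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3,
                                subsample=0.7,      # subsample training events
                                max_features=0.7,   # subsample features per split
                                random_state=0)
gb.fit(X, y)
print(gb.score(X, y))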
AdaBoost is a particular case of gradient boosting with a different target loss function*:
$\mathcal{L}_{ada} = \sum_i e^{-y_i D(x_i)} \to \min$
This loss function is called ExpLoss or AdaLoss.
*(also AdaBoost expects that $d_j(x_i) = \pm 1$)
Loss functions
Gradient boosting can optimize different smooth loss functions.
regression, $y \in \mathbb{R}$:
Mean Squared Error $\sum_i (d(x_i) - y_i)^2$
Mean Absolute Error $\sum_i |d(x_i) - y_i|$
binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss) $\sum_i e^{-y_i d(x_i)}$
LogLoss $\sum_i \log(1 + e^{-y_i d(x_i)})$
Usage of second-order information
For an additive loss function, apart from the gradient $g_i$, we can make use of the second derivatives $h_i$:
$\mathcal{L} = \sum_i L(D_j(x_i), y_i) = \sum_i L(D_{j-1}(x_i) + d_j(x_i), y_i) \approx$
$\approx \sum_i \left[ L(D_{j-1}(x_i), y_i) + g_i d_j(x_i) + \frac{h_i}{2} d_j^2(x_i) \right]$
E.g. select the leaf values using a second-order step:
$\mathcal{L} = \mathcal{L}_{j-1} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$
Using second-order information
Independent optimization for each leaf. Explicit solution for the optimal values in the leaves:
$\mathcal{L} \approx \mathcal{L}_{j-1} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$
where $g_{leaf} = \sum_{i \in leaf} g_i$, $h_{leaf} = \sum_{i \in leaf} h_i$.
$w_{leaf} = -\frac{g_{leaf}}{h_{leaf}}$
Using second-order information: recipe
On each iteration of gradient boosting:
1. train a tree to follow the gradient (minimize MSE with the gradient)
2. change the values assigned in the leaves to $w_{leaf} \leftarrow -\frac{g_{leaf}}{h_{leaf}}$
3. update predictions (no weight for the estimator, $\alpha_j = 1$): $D_j(x) = D_{j-1}(x) + \eta\, d_j(x)$
This improvement is quite cheap and allows smaller GBs to be more effective.
We can also use information about the hessians at the tree-building step (step 1).
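A toy numpy sketch of the second-order leaf values above, assuming LogLoss with labels $y = \pm 1$ (the function name and signature are illustrative, not a library API).

import numpy as np

def second_order_leaf_values(y, current_pred, leaf_index, n_leaves):
    p = 1.0 / (1.0 + np.exp(-current_pred))      # sigma(D_{j-1}(x_i))
    g = p - (y > 0)                              # g_i = dLogLoss / dD(x_i)
    h = p * (1.0 - p)                            # h_i = d^2 LogLoss / dD(x_i)^2
    g_leaf = np.bincount(leaf_index, weights=g, minlength=n_leaves)
    h_leaf = np.bincount(leaf_index, weights=h, minlength=n_leaves)
    return -g_leaf / (h_leaf + 1e-12)            # w_leaf = -g_leaf / h_leaf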
Multiclass classification: ensembling
One-vs-one ($\frac{n_{classes} (n_{classes} - 1)}{2}$ classifiers), one-vs-rest ($n_{classes}$ classifiers);
scikit-learn implements those as meta-algorithms.
Multiclass classification: modifying an algorithm
Most classifiers have natural generalizations to multiclass classification.
Example for logistic regression: introduce for each class $c \in 1, 2, \ldots, C$ a vector $w_c$:
$d_c(x) = \langle w_c, x \rangle$
Convert to probabilities using the softmax function:
$p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde c} e^{d_{\tilde c}(x)}}$
And minimize LogLoss:
$\mathcal{L} = -\sum_i \log p_{y_i}(x_i)$
Softmax function
Typical way to convert $n$ numbers to probabilities.
The mapping is surjective, but not injective ($n$ dimensions to $n - 1$ dimensions).
Invariant to a global shift: $d_c(x) \to d_c(x) + \text{const}$
For the case of two classes:
$p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{d_2(x) - d_1(x)}}$
Coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$.
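A minimal numpy sketch of the softmax mapping above; subtracting the maximum exploits the shift invariance for numerical stability.

import numpy as np

def softmax(d):
    """d: array of shape (n_samples, n_classes) of decision values d_c(x)."""
    d = d - d.max(axis=1, keepdims=True)    # invariance to a global shift
    e = np.exp(d)
    return e / e.sum(axis=1, keepdims=True)

print(softmax(np.array([[1.0, 2.0, 3.0]])))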
Loss function: ranking example
In ranking we need to order items by $y_i$:
$y_i < y_j \Rightarrow d(x_i) < d(x_j)$
We can penalize for misordering:
$\mathcal{L} = \sum_{i, \tilde i} L(x_i, x_{\tilde i}, y_i, y_{\tilde i})$
$L(x, \tilde x, y, \tilde y) = \begin{cases} \sigma(d(\tilde x) - d(x)), & y < \tilde y \\ 0, & \text{otherwise} \end{cases}$
Adapting boosting
By modifying boosting or changing the loss function we can solve different problems:
classification
regression
ranking
HEP-specific examples in Tatiana's lecture tomorrow.
Gradient Boosting classification playground
$n^3$-minutes break
Recapitulation: AdaBoost
Minimizes
$\mathcal{L}_{ada} = \sum_i e^{-y_i D(x_i)} \to \min$
by increasing the weights of misclassified samples:
$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$
Gradient Boosting overview
A powerful ensembling technique (typically used over trees, GBDT)
a general way to optimize differentiable losses
can be adapted to other problems
'following' the gradient of loss at each step
making steps in the space of functions
gradient of poorly-classified events is higher
increasing the number of trees can lead to overfitting (= getting worse quality on new data)
requires tuning; works better when the trees are not complex
widely used in practice
Feature engineering
Feature engineering = creating features to get the best result with ML
important step
mostly relying on domain knowledge
requires some understanding
most of practitioners' time is spent at this step
Feature engineering
Analyzing available features
scale and shape of features
Analyze which information is lacking
challenge example: maybe subleading jets matter?
Validate your guesses
Machine learning is a proper tool for checking your understanding of data
Linear models example
A single event with a sufficiently large value of a feature can break almost all linear models.
Heavy-tailed distributions are harmful; pre-transformation is required:
logarithm
power transform
and throwing out outliers
The same tricks actually help more advanced methods as well.
Which transformation is the best for Random Forest?
One-hot encoding
Categorical features (= 'not orderable'), being one-hot encoded, are easier for ML to operate with.
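A minimal sketch (pandas assumed; the column name is made up) of one-hot encoding a categorical feature.

import pandas as pd

df = pd.DataFrame({"particle": ["e", "mu", "pi", "mu"]})
print(pd.get_dummies(df, columns=["particle"]))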
Decision tree example
$\eta_{lepton}$ is hard for a tree to use, since it provides no good splitting.
Don't forget that a tree can't reconstruct linear combinations — take care of this.
Example of feature: invariant mass
Using HEP coordinates, the invariant mass of two products is:
$m^2_{inv} \approx 2\, p_{T1} p_{T2} (\cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2))$
Can't be recovered with ensembles of trees of depth < 4 when using only canonical features.
What about the invariant mass of 3 particles? (see Vicens' talk today)
Good features are ones that are explainable by physics. Start from the simplest and most natural.
Mind the cost of computing the features.
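A sketch of computing the engineered feature above with numpy (argument names are illustrative; pt, eta, phi refer to the two products).

import numpy as np

def m2_inv(pt1, eta1, phi1, pt2, eta2, phi2):
    """Approximate squared invariant mass of two (nearly massless) products."""
    return 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))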
Output engineering
Typically not discussed, but the target of learning plays an important role.
Example: predicting the number of upvotes for a comment. Assume the error is MSE:
$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$
Consider two cases:
100 comments when predicted 0
1200 comments when predicted 1000
Output engineering
Ridiculously large impact of highly-commented articles.
We need to predict the order, not the exact number of comments.
Possible solutions:
alternate loss function, e.g. use MAPE
apply a logarithm to the target, i.e. predict $\log(\#\text{comments})$
The evaluation score should be changed accordingly.
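A sketch (not from the slides; scikit-learn assumed) of regressing on log(1 + #comments) instead of the raw count, then transforming predictions back.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_log_target(X, n_comments):
    model = GradientBoostingRegressor()
    model.fit(X, np.log1p(n_comments))    # train on log(1 + y)
    return model

# predictions in the original scale: np.expm1(model.predict(X_new))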
Sample weights
Typically used to estimate the contribution of an event (how often we expect it to happen).
Sample weights matter in some other situations as well:
a highly imbalanced dataset (e.g. 99% of events in class 0) tends to cause problems during optimization;
changing sample weights to balance the dataset frequently helps.
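A sketch (scikit-learn assumed; toy data) of balancing a 99%/1% dataset with sample weights.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1                                            # ~99% of events in class 0
weights = compute_sample_weight(class_weight="balanced", y=y)
clf = GradientBoostingClassifier().fit(X, y, sample_weight=weights)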
Feature selection
Why?
speed up training / prediction
reduce time of data preparation in the pipeline
help algorithm to 'focus' on finding reliable dependencies
useful when amount of training data is limited
Problem:
find a subset of features which provides the best quality
Feature selection
Exhaustive search: $2^d$ cross-validations.
incredible amount of resources
too many cross-validation cycles lead to an overly-optimistic quality estimate on the test data
Basic nice solution: estimate importances with RF / GBDT.
Filtering methods
Eliminate variables which seem not to carry statistical information about the target,
e.g. by measuring Pearson correlation or mutual information.
Example: all angles $\phi$ will be thrown out.
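A sketch (scikit-learn assumed; the features are made up) of filtering by mutual information with the target: an angle that is uniformly distributed for both classes gets a score near zero and would be dropped.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=2000)
informative = rng.normal(loc=y, size=2000)        # correlates with the target
phi = rng.uniform(-np.pi, np.pi, size=2000)       # pure angle, no information
X = np.column_stack([informative, phi])
print(mutual_info_classif(X, y, random_state=0))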
Feature selection: embedded methods
Feature selection is a part of training.
Example: $L_1$-regularized linear models.
Forward selection
Start from an empty set of features.
For each feature in the dataset check whether adding this feature improves the quality.
Backward elimination
Almost the same, but this time we iteratively eliminate features.
A bidirectional combination of the above is possible; some algorithms can use the previously trained model as the new starting point for optimization.
Unsupervised dimensionality reduction
Principal component analysis [Pearson, 1901]
PCA is finding axes along which variance is maximal
PCA description
PCA is based on the principal axis theorem:
$Q = U \Lambda U^T$
where $Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix, and $\Lambda$ is a diagonal matrix:
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
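A minimal sketch (scikit-learn assumed): PCA keeps the directions with the largest variance; explained_variance_ corresponds to the $\lambda$'s above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))   # correlated features
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print(pca.explained_variance_)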
PCA optimization visualized
PCA: eigenfaces
Emotion = α[scared] + β[laughs] + γ[angry] + ...
Locally linear embedding
handles the case of non-linear dimensionality reduction
Express each sample as a convex combination of its neighbours:
$\sum_i \left\| x_i - \sum_{\tilde i} w_{i\tilde i}\, x_{\tilde i} \right\| \to \min_w$
Locally linear embedding
$\sum_i \left\| x_i - \sum_{\tilde i} w_{i\tilde i}\, x_{\tilde i} \right\| \to \min_w$
subject to the constraints: $\sum_{\tilde i} w_{i\tilde i} = 1$, $w_{i\tilde i} \ge 0$, and $w_{i\tilde i} = 0$ if $i, \tilde i$ are not neighbors.
Finding an optimal mapping for all points simultaneously ($y_i$ are the images, i.e. positions in the new space):
$\sum_i \left\| y_i - \sum_{\tilde i} w_{i\tilde i}\, y_{\tilde i} \right\| \to \min_y$
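A sketch (scikit-learn assumed) of non-linear dimensionality reduction with LLE on a toy manifold.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)
print(X_2d.shape)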
PCA and LLE
Isomap
Isomap is targeted to preserve the geodesic distance on the manifold between two points.
Supervised dimensionality reduction
Fisher's LDA (Linear Discriminant Analysis) [1936]
Original idea: find a projection to discriminate classes best
Fisher's LDA
Mean and variance within a single class $c$ ($c \in \{1, 2, \ldots, C\}$):
$\mu_c = \langle x \rangle_{\text{events of class } c}$
$\sigma_c = \langle \|x - \mu_c\|^2 \rangle_{\text{events of class } c}$
Total within-class variance: $\sigma_{within} = \sum_c p_c \sigma_c$
Total between-class variance: $\sigma_{between} = \sum_c p_c \|\mu_c - \mu\|^2$
Goal: find a projection to maximize the ratio $\frac{\sigma_{between}}{\sigma_{within}}$
Fisher's LDA
LDA: solving optimization problem
We are interested in finding a 1-dimensional projection $w$:
$\frac{w^T \Sigma_{between}\, w}{w^T \Sigma_{within}\, w} \to \max_w$
Naturally connected to the generalized eigenvalue problem.
The projection vector corresponds to the highest generalized eigenvalue.
Finds a subspace of $C - 1$ components when applied to a classification problem with $C$ classes.
Fisher's LDA is a basic popular binary classification technique.
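A sketch (scikit-learn assumed) of Fisher's LDA used as a supervised projection onto at most $C - 1$ components.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # 3 classes -> at most 2 components
X_proj = lda.fit_transform(X, y)
print(X_proj.shape)                                # (150, 2)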
Common spatial patterns
When we expect that each class is close to some linear subspace, we can
(naively) find this subspace for each class by PCA
(better idea) take into account the variation of the other data and optimize
$\mathrm{tr}\, W^T \Sigma_{class} W \to \max_W \quad \text{subject to} \quad W^T \Sigma_{total} W = I$
A natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix; $n$ and $n_1$ are the numbers of dimensions in the original and new spaces.
Frequently used in neural sciences, in particular in BCI based on EEG / MEG.
Common spatial patterns
The patterns found describe the projection into a 6-dimensional space.
Dimensionality reduction summary
capable of extracting sensible features from high-dimensional data
frequently used to 'visualize' the data
non-linear methods rely on the distance in the original space
works well with high-dimensional spaces whose features are of the same nature
Finding optimal hyperparameters
some algorithms have many parameters (regularizations, depth, learning rate, ...)
not all of the parameters can be guessed
checking all combinations takes too long
We need automated hyperparameter optimization!
Finding optimal parameters
randomly picking parameters is a partial solution
given a target metric, we can optimize it, but:
there is no gradient with respect to the parameters
the results are noisy
reconstructing the function is a problem
Before running grid optimization make sure your metric is stable (e.g. by training/testing on different subsets).
Overfitting (= getting a too optimistic estimate of quality on a holdout) by using many attempts is a real issue.
Optimal grid search
stochastic optimization (Metropolis-Hastings, annealing)
requires too many evaluations and uses only the last checked combination
regression techniques, reusing all known information
(ML to optimize ML!)
Optimal grid search using regression
General algorithm (a point of the grid = a set of parameters):
1. evaluate quality at random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate quality at this point
5. go to 2 if there are not enough evaluations
Why not use linear regression?
Exploration vs. exploitation trade-off: should we explore poorly-covered regions or try to refine the region currently seen as optimal?
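A toy sketch of the loop above, using a Gaussian-process regressor (scikit-learn assumed) to propose the next set of parameters with a simple upper-confidence-bound rule; quality() is a hypothetical stand-in for training plus cross-validation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def quality(depth):                                  # hypothetical noisy black-box metric
    return -(depth - 6.0) ** 2 + np.random.normal(scale=0.5)

grid = np.arange(1, 21, dtype=float).reshape(-1, 1)  # candidate values of one parameter
tried = list(np.random.choice(len(grid), 3, replace=False))
scores = [quality(grid[i, 0]) for i in tried]

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=3.0), alpha=0.3)
    gp.fit(grid[tried], scores)
    mean, std = gp.predict(grid, return_std=True)
    candidate = int(np.argmax(mean + std))           # exploration vs. exploitation
    tried.append(candidate)
    scores.append(quality(grid[candidate, 0]))
print(grid[tried[int(np.argmax(scores))], 0])        # best parameter found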
Gaussian processes for regression
Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are the mean and covariance functions $m(x)$, $K(x, \tilde x)$.
$m(x) = \mathbb{E}\, Y(x)$ represents our prior expectation of quality (may be taken constant)
$K(x, \tilde x) = \mathbb{E}\, Y(x) Y(\tilde x)$ represents the influence of known results on the expectation of values at new points
The RBF kernel is used here too: $K(x, \tilde x) = \exp(-c \|x - \tilde x\|^2)$
Another popular choice: $K(x, \tilde x) = \exp(-c \|x - \tilde x\|)$
We can model the posterior distribution of results at each point.
Gaussian Process Demo on Mathematica
Gaussian processes
Gaussian processes model posterior distribution at each point of the grid.
we know at which points the quality is already well estimated,
and we are able to find regions which need exploration. See also this demo.
Summary about hyperoptimization
parameters can be tuned automatically
be sure that the metric being optimized is stable
mind the optimistic quality estimation (resolved by one more holdout)
but this is not what you should spend most of your time on
the gain from properly cooking the features / reconsidering the problem is much higher
101 / 101

Contenu connexe

Tendances

20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modeljins0618
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic netVivian S. Zhang
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Detailed Description on Cross Entropy Loss Function
Detailed Description on Cross Entropy Loss FunctionDetailed Description on Cross Entropy Loss Function
Detailed Description on Cross Entropy Loss Function범준 김
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMarjan Sterjev
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture modelsVu Pham
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAminaRepo
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 

Tendances (15)

20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture model
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Detailed Description on Cross Entropy Loss Function
Detailed Description on Cross Entropy Loss FunctionDetailed Description on Cross Entropy Loss Function
Detailed Description on Cross Entropy Loss Function
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture models
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 

Similaire à MLHEP Lectures - day 3, basic track

Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Tomoya Murata
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsFrank Nielsen
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationAlexander Litvinenko
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learningSteve Nouri
 
Hybrid dynamics in large-scale logistics networks
Hybrid dynamics in large-scale logistics networksHybrid dynamics in large-scale logistics networks
Hybrid dynamics in large-scale logistics networksMKosmykov
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydSri Ambati
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodFrank Nielsen
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Daisuke Yoneoka
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Sangwoo Mo
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsCharles Deledalle
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 

Similaire à MLHEP Lectures - day 3, basic track (20)

Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributions
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantification
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learning
 
Hybrid dynamics in large-scale logistics networks
Hybrid dynamics in large-scale logistics networksHybrid dynamics in large-scale logistics networks
Hybrid dynamics in large-scale logistics networks
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 
02 basics i-handout
02 basics i-handout02 basics i-handout
02 basics i-handout
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Softmax
SoftmaxSoftmax
Softmax
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
 
QMC: Operator Splitting Workshop, Incremental Learning-to-Learn with Statisti...
QMC: Operator Splitting Workshop, Incremental Learning-to-Learn with Statisti...QMC: Operator Splitting Workshop, Incremental Learning-to-Learn with Statisti...
QMC: Operator Splitting Workshop, Incremental Learning-to-Learn with Statisti...
 
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 

Dernier

pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Youngkajalvid75
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 

Dernier (20)

pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 

MLHEP Lectures - day 3, basic track

  • 1. Machine Learning in High Energy Physics Lectures 5 & 6 Alex Rogozhnikov Lund, MLHEP 2016 1 / 101
  • 2. Linear models: linear regression Minimizing MSE: d(x) =< w, x > +w0  = ( , ) → min 1 N ∑i Lmse xi yi ( , ) = (d( ) −Lmse xi yi xi yi ) 2 1 / 101
  • 3. Linear models: logistic regression Minimizing logistic loss: Penalty for single observation : d(x) =< w, x > +w0  = ( , ) → min∑i Llogistic xi yi ± 1yi ( , ) = ln(1 + )Llogistic xi yi e − d( )y i xi 2 / 101
  • 4. Linear models: support vector machine (SVM) Margin no penalty ( , ) = max(0, 1 − d( ))Lhinge xi yi yi xi d( ) > 1 →yi xi 3 / 101
  • 5. Kernel trick we can project data into higher-dimensional space, e.g. by adding new features. Hopefully, in the new space distributions are separable 4 / 101
  • 6. Kernel trick is a projection operator: We need only kernel: Popular choices: polynomial kernel and RBF kernel. P w = P( )∑i αi xi d(x) = < w, P(x) >new d(x) = K( , x)∑i αi xi K(x, ) =< P(x), P( )x̃  x̃  >new 5 / 101
  • 7. Regularizations regularization : regularization: :  = L( , ) + → min 1 N ∑ i xi yi reg L2 = α |reg ∑j wj | 2 L1 = β | |reg ∑j wj +L1 L2 = α | + β | |reg ∑j wj | 2 ∑j wj 6 / 101
  • 8. Stochastic optimization methods Stochastic gradient descent take — random event from training data (can be applied to additive loss functions) i w ← w − η ∂L( , )xi yi ∂w 7 / 101
  • 9. Decision trees NP complex to build heuristic: use greedy optimization optimization criterions (impurities): misclassification, Gini, entropy 8 / 101
  • 10. Decision trees for regression Optimizing MSE, prediction inside a leaf is constant. 9 / 101
  • 11. Overfitting in decision tree pre-stopping post-pruning unstable to the changes in training dataset 10 / 101
  • 12. Random Forest Many trees built independently bagging of samples subsampling of features Simple voting is used to get prediction of an ensemble 11 / 101
  • 13. Random Forest overfitted (in the sense that predictions for train and test are different) doesn't overfit: increasing complexity (adding more trees) doesn't spoil a classifier 12 / 101
  • 14. Random Forest simple and parallelizable doesn't require much tuning hardly interpretable but feature importances can be computed doesn't fix samples poorly classifies at previous stages 13 / 101
  • 15. Ensembles Averaging decision functions Weighted decision D(x) = (x) 1 J ∑J j=1 dj D(x) = (x)∑j αj dj 14 / 101
  • 16. Sample weights in ML Can be used with many estimators. We now have triples weight corresponds to frequency of observation expected behavior: is the same as having copies of th event global normalization of weights doesn't matter , , i −  index of an eventxi yi wi = nwi n i 15 / 101
  • 17. Sample weights in ML Can be used with many estimators. We now have triples weight corresponds to frequency of observation expected behavior: is the same as having copies of th event global normalization of weights doesn't matter Example for logistic regression: , , i −  index of an eventxi yi wi = nwi n i  = L( , ) → min ∑ i wi xi yi 16 / 101
  • 18. Weights (parameters) of a classifier sample weights In code: tree = DecisionTreeClassifier(max_depth=4) tree.fit(X, y, sample_weight=weights) Sample weights are convenient way to regulate importance of training events. Only sample weights when talking about AdaBoost. ≠ 17 / 101
  • 19. AdaBoost [Freund, Shapire, 1995] Bagging: information from previous trees not taken into account. Adaptive Boosting is a weighted composition of weak learners: We assume , labels , th weak learner misclassified th event iff D(x) = (x) ∑ j αj dj (x) = ±1dj = ±1yi j i ( ) = −1yi dj xi 18 / 101
  • 20. AdaBoost Weak learners are built in sequence, each classifier is trained using different weights initially = 1 for each training sample After building th base classifier: 1. compute the total weight of correctly and wrongly classified events 2. increase weight of misclassified samples D(x) = (x) ∑ j αj dj wi j = ln ( ) αj 1 2 wcorrect wwrong ← ×wi wi e − ( )αj y i dj xi 19 / 101
  • 21. AdaBoost example Decision trees of depth will be used. 1 20 / 101
  • 24. (1, 2, 3, 100 trees) 23 / 101
  • 25. AdaBoost secret sample weight is equal to penalty for event is obtained as a result of analytical optimization Exercise: prove formula for D(x) = (x) ∑ j αj dj  = L( , ) = exp(− D( )) → min ∑ i xi yi ∑ i yi xi = L( , ) = exp(− D( ))wi xi yi yi xi αj αj 24 / 101
  • 26. Loss function of AdaBoost 25 / 101
  • 27. AdaBoost summary is able to combine many weak learners takes mistakes into account simple, overhead for boosting is negligible too sensitive to outliers In scikit-learn, one can run AdaBoost over other algorithms. 26 / 101
  • 29. Decision trees for regression 28 / 101
  • 30. Gradient boosting to minimize MSE Say, we're trying to build an ensemble to minimize MSE: When ensemble's prediction is obtained by taking weighted sum Assuming that we already built estimators, how do we train a next one? (D( ) − → min ∑ i xi yi ) 2 D(x) = (x)∑j dj (x) = (x) = (x) + (x)Dj ∑j =1j ′ dj ′ Dj−1 dj j − 1 29 / 101
  • 31. Natural solution is to greedily minimize MSE: Introduce residual: , now we need to simply minimize MSE So the th estimator (tree) is trained using the following data: ( ( ) − = ( ( ) + ( ) − → min ∑ i Dj xi yi ) 2 ∑ i Dj−1 xi dj xi yi ) 2 ( ) = − ( )Rj xi yi Dj−1 xi ( ( ) − ( ) → min ∑ i dj xi Rj xi ) 2 j , ( )xi Rj xi 30 / 101
  • 32. Example: regression with GB using regression trees of depth=2 31 / 101
  • 33. number of trees = 1, 2, 3, 100 32 / 101
  • 35. Gradient Boosting [Friedman, 1999] composition of weak regressors, Borrow an approach to encode probabilities from logistic regression Optimization of log-likelihood ( ): D(x) = (x) ∑ j αj dj (x)p+1 (x)p−1 = = σ(D(x)) σ(−D(x)) = ±1yi  = L( , ) = ln(1 + ) → min ∑ i xi yi ∑ i e − D( )y i xi 34 / 101
  • 36. Gradient Boosting Optimization problem: find all and weak leaners Mission impossible D(x) = (x) ∑ j αj dj  = ln(1 + ) → min ∑ i e − D( )y i xi αj dj 35 / 101
  • 37. Gradient Boosting Optimization problem: find all and weak leaners Mission impossible Main point: greedy optimization of loss function by training one more weak learner Each new estimator follows the gradient of loss function D(x) = (x) ∑ j αj dj  = ln(1 + ) → min ∑ i e − D( )y i xi αj dj dj 36 / 101
  • 38. Gradient Boosting Gradient boosting ~ steepest gradient descent. At jth iteration: compute pseudo-residual train regressor to minimize MSE: find optimal Important exercise: compute pseudo-residuals for MSE and logistic losses. (x) = (x)Dj ∑j =1j ′ αj ′ dj ′ (x) = (x) + (x)Dj Dj−1 αj dj R( ) = − xi ∂ ∂D( )xi ∣∣D(x)= (x)Dj−1 dj ( ( ) − R( ) → min∑i dj xi xi ) 2 αj 37 / 101
  • 39. Additional GB tricks to make training more stable, add learning rate : randomization to fight noise and build different trees: subsampling of features subsampling of training samples η (x) = η (x)Dj ∑ j αj dj 38 / 101
  • 40. AdaBoost is a particular case of gradient boosting with different target loss function*: This loss function is called ExpLoss or AdaLoss. *(also AdaBoost expects that ) = → minada ∑ i e − D( )y i xi ( ) = ±1dj xi 39 / 101
  • 41. Loss functions Gradient boosting can optimize different smooth loss function. regression, Mean Squared Error Mean Absolute Error binary classification, ExpLoss (ada AdaLoss) LogLoss y ∈ ℝ (d( ) −∑i xi yi ) 2 d( ) −∑i ∣∣ xi yi ∣∣ = ±1yi ∑i e − d( )y i xi log(1 + )∑i e − d( )y i xi 40 / 101
  • 44. Usage of second-order information For additive loss function apart from gradient , we can make use of second derivatives . E.g. select leaf value using second-order step: gi hi  = L( ( ), ) = L( ( ) + (x), ) ≈ ∑ i Dj xi yi ∑ i Dj−1 xi dj yi ≈ L( ( ), ) + (x) + (x) ∑ i Dj−1 xi yi gi dj hi 2 d 2 j  = + ( + ) → minj−1 ∑ leaf gleaf wleaf hleaf 2 w 2 leaf 43 / 101
  • 45. Using second-order information Independent optimization. Explicit solution for optimal values in the leaves:  ≈ + ( + ) → minj−1 ∑ leaf gleaf wleaf hleaf 2 w 2 leaf where = , = .gleaf ∑ i∈leaf gi hleaf ∑ i∈leaf hi = −wleaf gleaf hleaf 44 / 101
  • 46. Using second-order information: recipe On each iteration of gradient boosting 1. train a tree to follow gradient (minimize MSE with gradient) 2. change the values assigned in leaves to: 3. update predictions (no weight for estimator: ) This improvement is quite cheap and allows smaller GBs to be more effective. We can use information about hessians on the tree building step (step 1), ← −wleaf gleaf hleaf = 1aj (x) = (x) + η (x)Dj Dj−1 dj 45 / 101
  • 47. Multiclass classification: ensembling One-vs-one, One-vs-rest, scikit-learn implements those as meta-algorithms. × ( − 1)nclasses nclasses 2 nclasses 46 / 101
  • 48. Multiclass classification: modifying an algorithm Most classifiers have natural generalizations to multiclass classification. Example for logistic regression: introduce for each class a vector . Converting to probabilities using softmax function: And minimize LogLoss: c ∈ 1, 2, …, C wc (x) =< , x >dc wc (x) =pc e (x)dc ∑c̃  e (x)d c̃   = − log ( )∑i py i xi 47 / 101
  • 49. Softmax function Typical way to convert numbers to probabilities. Mapping is surjective, but not injective ( dimensions to dimension). Invariant to global shift: For the case of two classes: Coincides with logistic function for n n n n − 1 (x) → (x) + constdc dc (x) = =p1 e (x)d1 +e (x)d1 e (x)d2 1 1 + e (x)− (x)d2 d1 d(x) = (x) − (x)d1 d2 48 / 101
  • 50. Loss function: ranking example In ranking we need to order items by : We can penalize for misordering: yi < ⇒ d( ) < d( )yi yj xi xj  = L( , , , ) ∑ i,i ̃  xi x i ̃  yi y i ̃  L(x, , y, ) = { x̃  ỹ  σ(d( ) − d(x)),x̃  0, y < ỹ  otherwise 49 / 101
  • 51. Adapting boosting By modifying boosting or changing loss function we can solve different problems classification regression ranking HEP-specific examples in Tatiana's lecture tomorrow. 50 / 101
  • 52. Gradient Boosting classification playground 51 / 101
  • 54. Recapitulation: AdaBoost Minimizes by increasing weights of misclassified samples: = → minada ∑ i e − D( )y i xi ← ×wi wi e − ( )αj y i dj xi 53 / 101
  • 55. Gradient Boosting overview A powerful ensembling technique (typically used over trees, GBDT) a general way to optimize differentiable losses can be adapted to other problems 'following' the gradient of loss at each step making steps in the space of functions gradient of poorly-classified events is higher increasing number of trees can drive to overfitting (= getting worse quality on new data) requires tuning, better when trees are not complex widely used in practice 54 / 101
  • 56. Feature engineering Feature engineering = creating features to get the best result with ML important step mostly relying on domain knowledge requires some understanding most of practitioners' time is spent at this step 55 / 101
  • 57. Feature engineering Analyzing available features scale and shape of features Analyze which information lacks challenge example: maybe subleading jets matter? Validate your guesses 56 / 101
  • 58. Feature engineering Analyzing available features scale and shape of features Analyze which information lacks challenge example: maybe subleading jets matter? Validate your guesses Machine learning is a proper tool for checking your understanding of data 57 / 101
  • 59. Linear models example Single event with sufficiently large value of feature can break almost all linear models. Heavy-tailed distributions are harmful, pretransforming required logarithm power transform and throwing out outliers Same tricks actually help to more advanced methods. Which transformation is the best for Random Forest? 58 / 101
  • 60. One-hot encoding. Categorical features (= 'not orderable'), once one-hot encoded, are easier for ML algorithms to operate with. 59 / 101
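  For example, with pandas (the column and category names are illustrative):
  import pandas as pd

  df = pd.DataFrame({'particle_type': ['electron', 'muon', 'pion', 'muon']})
  # one 0/1 indicator column per category, usable by linear models and trees alike
  encoded = pd.get_dummies(df, columns=['particle_type'])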
  • 61. Decision tree example. $\eta_{\text{lepton}}$ is hard for a tree to use, since it provides no good splitting. Don't forget that a tree can't reconstruct linear combinations — take care of this. 60 / 101
  • 62. Example of feature: invariant mass. Using HEP coordinates, the invariant mass of two products is $m^2_{\text{inv}} \approx 2 \, p_{T1} p_{T2} \left( \cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2) \right)$. It can't be recovered with ensembles of trees of depth < 4 when using only canonical features. What about the invariant mass of 3 particles? 61 / 101
  • 63. Example of feature: invariant mass. Using HEP coordinates, the invariant mass of two products is $m^2_{\text{inv}} \approx 2 \, p_{T1} p_{T2} \left( \cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2) \right)$. It can't be recovered with ensembles of trees of depth < 4 when using only canonical features. What about the invariant mass of 3 particles? (See Vicens' talk today.) Good features are ones that are explainable by physics. Start from the simplest and most natural ones. Mind the cost of computing the features. 62 / 101
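  A numpy sketch of adding this feature from the canonical (pT, eta, phi) columns; the array and column names are illustrative:
  import numpy as np

  def mass2_inv(pt1, eta1, phi1, pt2, eta2, phi2):
      # approximate squared invariant mass of two (nearly massless) products
      return 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))

  # assuming X is a pandas DataFrame with the canonical columns:
  # X['m2_12'] = mass2_inv(X.pt1, X.eta1, X.phi1, X.pt2, X.eta2, X.phi2)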
  • 64. Output engineering. Typically not discussed, but the target of learning plays an important role. Example: predicting the number of upvotes for a comment. Assume the error is MSE: $\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$. Consider two cases: 100 comments when 0 were predicted; 1200 comments when 1000 were predicted. 63 / 101
  • 65. Output engineering. Ridiculously large impact of highly commented articles. We need to predict the order, not the exact number of comments. Possible solutions: use an alternative loss function, e.g. MAPE; or apply a logarithm to the target and predict $\log(\#\text{comments})$. The evaluation score should be changed accordingly. 64 / 101
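  A sketch of the second option with a scikit-learn regressor; the toy data below just stands in for real features and comment counts:
  import numpy as np
  from sklearn.ensemble import GradientBoostingRegressor

  rng = np.random.RandomState(0)
  X_train = rng.normal(size=(1000, 5))
  y_train = np.round(np.expm1(2 * X_train[:, 0] + rng.normal(size=1000))).clip(0)

  # train on log(1 + y) so that highly commented items do not dominate the MSE
  regressor = GradientBoostingRegressor().fit(X_train, np.log1p(y_train))
  predicted_counts = np.expm1(regressor.predict(X_train))   # invert the transform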
  • 66. Sample weights. Typically used to estimate the contribution of an event (how often we expect it to happen). Sample weights also matter in other situations: highly imbalanced datasets (e.g. 99% of events in class 0) tend to cause problems during optimization; changing sample weights to balance the dataset frequently helps. 65 / 101
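  A sketch of reweighting to a balanced dataset with a scikit-learn utility:
  import numpy as np
  from sklearn.utils.class_weight import compute_sample_weight

  y = np.array([0] * 99 + [1])                     # heavily imbalanced labels
  weights = compute_sample_weight('balanced', y)   # each class now gets the same total weight
  # pass via sample_weight=weights to the estimator's .fit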
  • 67. Feature selection. Why? To speed up training / prediction; to reduce the time of data preparation in the pipeline; to help the algorithm 'focus' on finding reliable dependencies; useful when the amount of training data is limited. Problem: find a subset of features which provides the best quality. 66 / 101
  • 68. Feature selection. Exhaustive search requires $2^d$ cross-validations: an incredible amount of resources, and too many cross-validation cycles lead to an overly optimistic quality estimate on the test data. 67 / 101
  • 69. Feature selection. Exhaustive search requires $2^d$ cross-validations: an incredible amount of resources, and too many cross-validation cycles lead to an overly optimistic quality estimate on the test data. A basic, nice solution: estimate feature importances with RF / GBDT. 68 / 101
  • 70. Feature selection. Exhaustive search requires $2^d$ cross-validations: an incredible amount of resources, and too many cross-validation cycles lead to an overly optimistic quality estimate on the test data. A basic, nice solution: estimate feature importances with RF / GBDT. Filtering methods: eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information. Example: all $\phi$ angles will be thrown out. 69 / 101
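  A sketch of a filtering step with scikit-learn (mutual_info_classif is available in newer scikit-learn versions; with older ones f_classif can be used instead):
  from sklearn.datasets import make_classification
  from sklearn.feature_selection import SelectKBest, mutual_info_classif

  X, y = make_classification(n_samples=2000, n_features=30, n_informative=8)
  # keep the 10 features with the highest estimated mutual information with the target
  selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
  X_reduced = selector.transform(X)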
  • 71. Feature selection: embedded methods. Feature selection is a part of training. Example: $L_1$-regularized linear models. Forward selection: start from an empty set of features; for each feature in the dataset, check whether adding it improves the quality. Backward elimination: almost the same, but this time features are iteratively eliminated. A bidirectional combination of the two is possible, and some algorithms can use a previously trained model as the new starting point for optimization. 70 / 101
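  A sketch of an embedded method: an L1-regularized logistic regression zeroes out coefficients of uninformative features, and SelectFromModel keeps the rest (the regularization strength C is illustrative):
  from sklearn.datasets import make_classification
  from sklearn.feature_selection import SelectFromModel
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=2000, n_features=30, n_informative=5)
  # the L1 penalty drives coefficients of useless features to exactly zero
  l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)
  X_selected = SelectFromModel(l1_model, prefit=True).transform(X)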
  • 73. Principal component analysis [Pearson, 1901] PCA finds the axes along which the variance is maximal 72 / 101
  • 74. PCA description. PCA is based on the principal axis theorem: $Q = U \Lambda U^T$, where $Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. 73 / 101
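  The standard scikit-learn usage, sketched on random data:
  import numpy as np
  from sklearn.decomposition import PCA

  X = np.random.normal(size=(1000, 20))
  pca = PCA(n_components=5).fit(X)         # eigendecomposition of the covariance matrix
  X_projected = pca.transform(X)           # coordinates along the principal axes
  print(pca.explained_variance_ratio_)     # fractions lambda_i / sum of lambdas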
  • 76. PCA: eigenfaces Emotion = α[scared] + β[laughs] + γ[angry] + … 75 / 101
  • 77. Locally linear embedding. Handles the case of non-linear dimensionality reduction. Express each sample as a convex combination of its neighbours: $\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\| \to \min_w$. 76 / 101
  • 78. Locally linear embedding. $\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\| \to \min_w$ subject to the constraints $\sum_{\tilde{i}} w_{i\tilde{i}} = 1$, $w_{i\tilde{i}} \ge 0$, and $w_{i\tilde{i}} = 0$ if $i, \tilde{i}$ are not neighbors. Then find an optimal mapping for all points simultaneously ($y_i$ are the images — positions in the new space): $\sum_i \left\| y_i - \sum_{\tilde{i}} w_{i\tilde{i}} y_{\tilde{i}} \right\| \to \min_y$. 77 / 101
  • 79. PCA and LLE 78 / 101
  • 80. Isomap. Isomap aims to preserve the geodesic distance on the manifold between two points. 79 / 101
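  Both LLE and Isomap are available in scikit-learn; a minimal sketch on the classic swiss-roll dataset (the parameters are illustrative):
  from sklearn.datasets import make_swiss_roll
  from sklearn.manifold import LocallyLinearEmbedding, Isomap

  X, _ = make_swiss_roll(n_samples=1000)
  X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
  X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)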
  • 82. Fisher's LDA (Linear Discriminant Analysis) [1936] Original idea: find a projection to discriminate classes best 81 / 101
  • 83. Fisher's LDA. Mean and variance within a single class $c$ ($c \in \{1, 2, \dots, C\}$): $\mu_c = \langle x \rangle_{\text{events of class } c}$, $\sigma_c = \langle \|x - \mu_c\|^2 \rangle_{\text{events of class } c}$. Total within-class variance: $\sigma_{\text{within}} = \sum_c p_c \sigma_c$. Total between-class variance: $\sigma_{\text{between}} = \sum_c p_c \|\mu_c - \mu\|^2$. Goal: find a projection that maximizes the ratio $\frac{\sigma_{\text{between}}}{\sigma_{\text{within}}}$. 82 / 101
  • 85. LDA: solving the optimization problem. We are interested in finding a 1-dimensional projection $w$: $\frac{w^T \Sigma_{\text{between}} w}{w^T \Sigma_{\text{within}} w} \to \max_w$. This is naturally connected to the generalized eigenvalue problem; the projection vector corresponds to the highest generalized eigenvalue. When applied to a classification problem with $C$ classes it finds a subspace of $C - 1$ components. Fisher's LDA is a basic, popular binary classification technique. 84 / 101
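  In scikit-learn, LDA works both as a classifier and as supervised dimensionality reduction (at most C - 1 components); a sketch on synthetic data:
  from sklearn.datasets import make_classification
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

  X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=3)
  lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)   # at most n_classes - 1 components
  X_projected = lda.transform(X)   # supervised projection
  predictions = lda.predict(X)     # or use it directly as a classifier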
  • 86. Common spatial patterns. When we expect that each class is close to some linear subspace, we can (naively) find this subspace for each class by PCA, or (better idea) take into account the variation of the other data and optimize $\mathrm{tr}\,(W^T \Sigma_{\text{class}} W) \to \max_W$ subject to $W^T \Sigma_{\text{total}} W = I$. A natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix; $n$ and $n_1$ are the numbers of dimensions in the original and new spaces. Frequently used in neural sciences, in particular in BCI based on EEG / MEG. 85 / 101
  • 87. Common spatial patterns. The patterns found describe the projection into a 6-dimensional space. 86 / 101
  • 88. Dimensionality reduction summary. Capable of extracting sensible features from high-dimensional data; frequently used to 'visualize' the data; nonlinear methods rely on the distances in the space; works well with high-dimensional spaces whose features are of the same nature. 87 / 101
  • 89. Finding optimal hyperparameters. Some algorithms have many hyperparameters (regularizations, depth, learning rate, ...); not all of them can be guessed; checking all combinations takes too long. 88 / 101
  • 90. Finding optimal hyperparameters. Some algorithms have many hyperparameters (regularizations, depth, learning rate, ...); not all of them can be guessed; checking all combinations takes too long. We need automated hyperparameter optimization! 89 / 101
  • 91. Finding optimal parameters. Randomly picking parameters is a partial solution; given a target metric, we can optimize it. 90 / 101
  • 92. Finding optimal parameters. Randomly picking parameters is a partial solution; given a target metric, we can optimize it. But: there is no gradient with respect to the parameters, and the results are noisy. 91 / 101
  • 93. Finding optimal parameters. Randomly picking parameters is a partial solution; given a target metric, we can optimize it. But: there is no gradient with respect to the parameters, the results are noisy, and reconstructing the function is a problem. 92 / 101
  • 94. Finding optimal parameters. Randomly picking parameters is a partial solution; given a target metric, we can optimize it. But: there is no gradient with respect to the parameters, the results are noisy, and reconstructing the function is a problem. Before running grid optimization, make sure your metric is stable (e.g. by training/testing on different subsets). Overfitting (= getting a too optimistic estimate of quality on a holdout) by making many attempts is a real issue. 93 / 101
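  A sketch of randomized search with cross-validation in scikit-learn; the parameter ranges are illustrative, and a separate holdout is still kept for the final, less optimistic quality estimate:
  from scipy.stats import randint
  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import RandomizedSearchCV, train_test_split

  X, y = make_classification(n_samples=3000, n_features=20, n_informative=8)
  X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3)

  param_distributions = {'max_depth': randint(2, 6), 'n_estimators': randint(50, 300)}
  search = RandomizedSearchCV(GradientBoostingClassifier(learning_rate=0.1),
                              param_distributions, n_iter=20, cv=3, scoring='roc_auc')
  search.fit(X_train, y_train)
  # quality on data never used for tuning
  print(search.best_params_, search.score(X_holdout, y_holdout))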
  • 95. Optimal grid search. Stochastic optimization (Metropolis-Hastings, annealing) requires too many evaluations and uses only the last checked combination. Regression techniques reuse all the known information (ML to optimize ML!). 94 / 101
  • 96. Optimal grid search using regression. General algorithm (a point of the grid = a set of parameters): 1. evaluate at several random points; 2. build a regression model based on the known results; 3. select the point with the best expected quality according to the trained model; 4. evaluate the quality at this point; 5. go to 2 if there are not enough evaluations yet. Why not use linear regression? Exploration vs. exploitation trade-off: should we explore poorly covered regions or try to improve around the currently best point? 95 / 101
  • 97. Gaussian processes for regression. Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are the mean and covariance functions $m(x)$, $K(x, \tilde{x})$: $m(x) = \mathbb{E}\, Y(x)$ represents our prior expectation of the quality (may be taken constant), and $K(x, \tilde{x}) = \mathbb{E}\, Y(x) Y(\tilde{x})$ represents the influence of known results on the expectation of values at new points. The RBF kernel is used here too: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|^2)$. Another popular choice: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|)$. We can model the posterior distribution of results at each point. 96 / 101
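  A sketch with scikit-learn's Gaussian process regressor and an RBF kernel; the evaluated points and the simple mean-plus-std selection rule are illustrative. The predicted standard deviation is what allows balancing exploration and exploitation:
  import numpy as np
  from sklearn.gaussian_process import GaussianProcessRegressor
  from sklearn.gaussian_process.kernels import RBF

  # a few already evaluated hyperparameter points (1-d for simplicity) and their measured quality
  params = np.array([[0.1], [0.5], [1.0], [2.0]])
  quality = np.array([0.71, 0.78, 0.80, 0.74])

  gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3).fit(params, quality)

  grid = np.linspace(0.05, 3.0, 100)[:, None]
  mean, std = gp.predict(grid, return_std=True)   # posterior mean and uncertainty at new points
  candidate = grid[np.argmax(mean + std)]         # simple upper-confidence-bound pick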
  • 98. Gaussian Process Demo on Mathematica 97 / 101
  • 99. Gaussian processes. Gaussian processes model the posterior distribution at each point of the grid: we know at which points the quality is already well estimated, and we are able to find regions which need exploration. See also this demo. 98 / 101
  • 100. Summary about hyperoptimization. Parameters can be tuned automatically; be sure that the metric being optimized is stable; mind the optimistic quality estimation (resolved by one more holdout). 99 / 101
  • 101. Summary about hyperoptimization. Parameters can be tuned automatically; be sure that the metric being optimized is stable; mind the optimistic quality estimation (resolved by one more holdout). But this is not what you should spend your time on: the gain from properly cooking features / reconsidering the problem is much higher. 100 / 101