Machine Learning in Science and Industry
Day 2, lectures 3 & 4
Alex Rogozhnikov, Tatiana Likhomanenko
Heidelberg, GradDays 2017
Recapitulation
Statistical ML: applications and problems
k nearest neighbours classifier and regressor
Binary classification: ROC curve, ROC AUC
1 / 123
Recapitulation
Optimal regression and optimal classification
Bayes optimal classifier:
p(y = 1 ∣ x) / p(y = 0 ∣ x) = [ p(y = 1) p(x ∣ y = 1) ] / [ p(y = 0) p(x ∣ y = 0) ]
Generative approach: density estimation, QDA,
Gaussian mixtures
2 / 123
Recapitulation
Clustering — unsupervised technique
3 / 123
Recapitulation
Discriminative approach: logistic regression (classification model), linear regression
with d(x) = ⟨w, x⟩ + w_0
p(y = +1 ∣ x) = p_{+1}(x) = σ(d(x))
p(y = −1 ∣ x) = p_{−1}(x) = σ(−d(x))
4 / 123
Recapitulation
Regularization
L = (1/N) ∑_i L(x_i, y_i) + L_reg → min
L_reg = ∑_i |w_i|^p
Today: continue discriminative approach
reconstruct p(y = 1 ∣ x), p(y = 0 ∣ x)
ignore p(x, y)
5 / 123
Decision Trees
6 / 123
Decision tree
Example: predict outside play based on weather conditions.
7 / 123
Decision tree: binary tree
8 / 123
Decision tree: splitting space
9 / 123
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
10 / 123
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using a greedy optimization
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat process for children if needed
11 / 123
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using a greedy optimization
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat process for children if needed
need a criterion to select best splitting (feature and threshold)
12 / 123
Splitting criterion
TreeImpurity = ∑_leaf impurity(leaf) × size(leaf)
Several impurity functions:
Misclassification = min(p, 1 − p)
Gini Index = p(1 − p)
Entropy = −p log p − (1 − p) log(1 − p)
where p is the portion of signal observations in a leaf, 1 − p is the portion of background
observations, and size(leaf) is the number of training observations in the leaf.
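A minimal numpy sketch (illustrative, not from the slides) of these impurity functions of the signal fraction p in a leaf:

import numpy as np

def misclassification(p):
    # min(p, 1 - p)
    return np.minimum(p, 1 - p)

def gini(p):
    # p (1 - p)
    return p * (1 - p)

def entropy(p):
    # -p log p - (1 - p) log(1 - p), with the convention 0 log 0 = 0
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.linspace(0, 1, 101)
print(misclassification(p).max(), gini(p).max(), entropy(p).max())  # 0.5, 0.25, ln 2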
13 / 123
Splitting criterion
Impurity as a function of p
14 / 123
Splitting criterion: why not misclassification?
15 / 123
Decision trees for regression
Greedy optimization (minimizing MSE):
TreeMSE ∼ ∑_i (y_i − ŷ_i)²
Can be rewritten as:
TreeMSE ∼ ∑_leaf MSE(leaf) × size(leaf)
MSE(leaf) is like an 'impurity' of the leaf:
MSE(leaf) = (1 / size(leaf)) ∑_{i ∈ leaf} (y_i − ŷ_i)²
16 / 123
17 / 123
18 / 123
19 / 123
20 / 123
21 / 123
Decision trees instability
A small variation in the training dataset produces a different classification rule.
22 / 123
Tree keeps splitting until each observation is correctly classified:
23 / 123
Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in a leaf
more advanced: maximal number of leaves in the tree
24 / 123
Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in a leaf
more advanced: maximal number of leaves in the tree
Any combination of the rules above is possible (see the scikit-learn sketch below).
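For illustration, these pre-stopping rules correspond to standard scikit-learn parameters (the particular values below are arbitrary):

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# toy data just to make the snippet runnable
X, y = make_classification(n_samples=500, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=5,           # limit the depth of the tree
    min_samples_split=20,  # minimal number of samples needed to split a leaf
    min_samples_leaf=10,   # minimal number of samples in a leaf
    max_leaf_nodes=32,     # maximal number of leaves in the tree
)
tree.fit(X, y)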
25 / 123
(panels: no pre-stopping, maximal depth, min # of samples in a leaf, maximal number of leaves)
26 / 123
Decision tree overview
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclass classification
But
1. Training an optimal tree is NP-complete
Trained greedily by optimizing Gini index or entropy (fast!)
2. Unstable
3. Uses only trivial conditions (single-feature thresholds)
27 / 123
Feature importances
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature ≠ quality provided by that feature alone
28 / 123
Computing feature importances
tree: counting number of splits made over this feature
tree: counting gain in purity (e.g. Gini)
fast and adequate
model-agnostic recipe: train without one feature,
compare quality on test with/without one feature
requires many evaluations
model-agnostic recipe: feature shuffling take one column in test dataset and shuffle it.
Compare quality with/without shuffling.
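A minimal sketch of the shuffling recipe, assuming a fitted binary classifier with predict_proba and numpy test arrays (names are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

def shuffled_feature_importances(model, X_test, y_test, seed=0):
    # drop in test ROC AUC after shuffling one column at a time
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    importances = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        # destroy the information carried by feature j, keep its marginal distribution
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        score = roc_auc_score(y_test, model.predict_proba(X_shuffled)[:, 1])
        importances.append(baseline - score)  # large drop = important feature
    return np.array(importances)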
29 / 123
Ensembles
30 / 123
Composition of models
Basic motivation: improve quality of classification by reusing strong sides of different
classifiers / regressors.
31 / 123
Simple Voting
Averaging predictions
ŷ = [−1, +1, +1, +1, −1] ⇒ P_{+1} = 0.6, P_{−1} = 0.4
Averaging predicted probabilities
P_{±1}(x) = (1/J) ∑_{j=1..J} p_{±1, j}(x)
Averaging decision functions
D(x) = (1/J) ∑_{j=1..J} d_j(x)
32 / 123
Weighted voting
A way to introduce the importance of classifiers:
D(x) = ∑_j α_j d_j(x)
General case of ensembling:
D(x) = f(d_1(x), d_2(x), ..., d_J(x))
33 / 123
Problems
base classifiers are very similar to each other
need to keep variation between them
while still having good quality of the base classifiers
34 / 123
Decision tree reminder
35 / 123
Generating training subset
subsampling
taking fixed part of samples (sampling without replacement)
bagging (Bootstrap AGGregating)
sampling with replacement,
If # generated samples = length of the dataset,
the fraction of unique samples in the new dataset is 1 − 1/e ≈ 63.2%
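A quick numerical check of the 1 − 1/e statement (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap = rng.integers(0, n, size=n)              # sampling with replacement
print(np.unique(bootstrap).size / n, 1 - 1 / np.e)  # both ≈ 0.632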
36 / 123
Random subspace model (RSM)
Generating subspace of features by taking random subset of features
37 / 123
Random Forest [Leo Breiman, 2001]
Random forest is a composition of decision trees.
Each individual tree is trained on a subset of training data obtained by
bagging samples
taking random subset of features
Predictions of random forest are obtained via simple voting.
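A hedged scikit-learn sketch of this recipe (toy dataset, arbitrary hyperparameters):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # trees are trained independently and can be parallelized
    max_features='sqrt',   # random subset of features considered at each split
    n_jobs=-1,
    random_state=0,
)
forest.fit(X_train, y_train)
print(roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1]))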
38 / 123
(panels: data, optimal boundary)
39 / 123
(panels: data, optimal boundary; 50 trees, 2000 trees)
40 / 123
Overfitting
doesn't overfit: increasing complexity (adding more trees) doesn't spoil a classifier
41 / 123
Works with features of different nature
Stable to outliers in data.
42 / 123
Works with features of different nature
Stable to outliers in data.
From 'Testing 179 Classifiers on 121 Datasets'
The classifiers most likely to be the bests are the random forest (RF) versions, the best of
which [...] achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the
data sets.
Applications:
a good approach for tabular data
Random forest is used to predict movements of individual body parts in Kinect
43 / 123
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
44 / 123
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: N_used = √N_features
45 / 123
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: N_used = √N_features
But
Hardly interpretable
Trained trees take much space, some kind of pre-stopping is required in practice
Doesn't fix mistakes done by previous trees
46 / 123
Sample weights in ML
Can be used with many estimators. We now have triples
(x_i, y_i, w_i),   i − index of an observation
weight corresponds to the expected frequency of an observation
expected behavior: w_i = n is the same as having n copies of the ith observation
global normalization of weights doesn't matter
47 / 123
Sample weights in ML
Can be used with many estimators. We now have triples
(x_i, y_i, w_i),   i − index of an observation
weight corresponds to the expected frequency of an observation
expected behavior: w_i = n is the same as having n copies of the ith observation
global normalization of weights doesn't matter
Example for logistic regression:
L = ∑_i w_i L(x_i, y_i) → min
48 / 123
Sample weights in ML
Weights (parameters) of a classifier ≠ sample weights
In code:
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)
Sample weights are a convenient way to regulate the importance of training observations.
Only sample weights are meant when talking about AdaBoost below.
49 / 123
AdaBoost [Freund, Schapire, 1995]
Bagging: information from previous trees not taken into account.
Adaptive Boosting is a weighted composition of weak learners:
D(x) = ∑_j α_j d_j(x)
We assume d_j(x) = ±1, labels y_i = ±1,
the jth weak learner misclassified the ith observation iff y_i d_j(x_i) = −1
50 / 123
AdaBoost
D(x) = ∑_j α_j d_j(x)
Weak learners are built in sequence, each classifier is trained using different weights
initially w_i = 1 for each training sample
After building the jth base classifier:
1. compute the total weight of correctly and wrongly classified observations
α_j = (1/2) ln(w_correct / w_wrong)
2. increase weight of misclassified samples
w_i ← w_i × exp(−α_j y_i d_j(x_i))
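A minimal numpy/scikit-learn sketch of these update rules (assumes labels y = ±1 and that every stump makes at least one weighted mistake):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_estimators=100):
    n = len(y)
    w = np.ones(n)                               # initially w_i = 1
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)                  # d_j(x_i) = ±1
        w_wrong = w[pred != y].sum()
        w_correct = w[pred == y].sum()
        alpha = 0.5 * np.log(w_correct / w_wrong)
        w = w * np.exp(-alpha * y * pred)        # misclassified samples get larger weight
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_decision(stumps, alphas, X):
    # D(x) = sum_j alpha_j d_j(x); sign(D) is the predicted label
    return sum(a * s.predict(X) for a, s in zip(alphas, stumps))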
51 / 123
AdaBoost example
Decision trees of depth 1 will be used.
52 / 123
53 / 123
54 / 123
(1, 2, 3, 100 trees) 55 / 123
AdaBoost secret
D(x) = ∑_j α_j d_j(x)
L = ∑_i L(x_i, y_i) = ∑_i exp(−y_i D(x_i)) → min
sample weight is equal to the penalty for an observation
w_i = L(x_i, y_i) = exp(−y_i D(x_i))
α_j is obtained as a result of analytical optimization
Exercise: prove the formula for α_j
56 / 123
Loss function of AdaBoost
57 / 123
AdaBoost summary
is able to combine many weak learners
takes mistakes into account
simple, overhead for boosting is negligible
too sensitive to outliers
In scikit-learn, one can run AdaBoost over other algorithms.
58 / 123
n-minute break
59 / 123
Recapitulation
1. Decision tree:
splitting criteria
pre-stopping
2. Random Forest: averaging over trees
3. AdaBoost: exponential loss, takes mistakes into account
60 / 123
Reweighting
61 / 123
Reweighting in HEP
Reweighting in HEP is used to minimize the
difference between real samples (target) and
Monte Carlo (MC) simulated samples (original)
The goal is to assign weights to the original samples s.t. the
original and target distributions coincide
62 / 123
Reweighting in HEP: typical approach
variable(s) is split into bins
in each bin the weight for original samples is multiplied by:
multiplier_bin = w_{bin, target} / w_{bin, original}
where w_{bin, target} and w_{bin, original} are the total weights of samples in a bin for the target and
original distributions.
63 / 123
Reweighting in HEP: typical approach
variable(s) is split into bins
in each bin the weight for original samples is multiplied by:
multiplier_bin = w_{bin, target} / w_{bin, original}
where w_{bin, target} and w_{bin, original} are the total weights of samples in a bin for the target and
original distributions.
1. simple and fast
2. number of variables is very limited by statistics (typically only one or two)
3. reweighting in one variable may bring disagreement in others
4. which variable is preferable for reweighting?
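A minimal one-dimensional numpy sketch of this bin-based reweighting (function name and binning are illustrative):

import numpy as np

def histogram_reweight(original, target, original_weight, target_weight, bins=20):
    # multiply original weights by w_{bin, target} / w_{bin, original} in each bin
    edges = np.histogram_bin_edges(np.concatenate([original, target]), bins=bins)
    w_orig, _ = np.histogram(original, bins=edges, weights=original_weight)
    w_tgt, _ = np.histogram(target, bins=edges, weights=target_weight)
    multiplier = np.where(w_orig > 0, w_tgt / np.maximum(w_orig, 1e-12), 1.0)
    bin_index = np.clip(np.digitize(original, edges) - 1, 0, len(multiplier) - 1)
    return original_weight * multiplier[bin_index]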
64 / 123
Reweighting in HEP: typical approach
Problems arise when there are too few observations in a
bin
This can be detected on a holdout (see the last row)
Issues:
few bins - rule is rough
many bins - rule is not reliable
65 / 123
Reweighting in HEP: typical approach
Problems arise when there are too few observations in a
bin
This can be detected on a holdout (see the last row)
Issues:
few bins - rule is rough
many bins - rule is not reliable
Reweighting rule must be checked on a holdout!
66 / 123
Reweighting quality
How to check the quality of reweighting?
One-dimensional case: two-sample tests (Kolmogorov-Smirnov test, Mann-Whitney
test, …)
Two or more dimensions?
Comparing 1d projections is not sufficient
67 / 123
Reweighting quality
How to check the quality of reweighting?
One-dimensional case: two-sample tests (Kolmogorov-Smirnov test, Mann-Whitney
test, …)
Two or more dimensions?
Comparing 1d projections is not sufficient
68 / 123
Reweighting quality with ML
Final goal: the classifier finds no original/target
disagreement information = the classifier cannot
discriminate original and target samples
Comparison of distributions shall be done using ML:
train a classifier to discriminate original and target
samples
output of the classifier is a one-dimensional variable
look at the ROC curve (an alternative to a two-sample
test) on a holdout
AUC score should be 0.5 if the classifier cannot discriminate original and target
samples
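A minimal sketch of this check, assuming numpy arrays for samples and weights (function and variable names are illustrative):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def reweighting_quality(original, target, original_weight, target_weight):
    # train original (label 0) vs target (label 1); AUC ≈ 0.5 on a holdout means good agreement
    X = np.concatenate([original, target])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    w = np.concatenate([original_weight, target_weight])
    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(X, y, w, random_state=0)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr, sample_weight=w_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1], sample_weight=w_te)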
69 / 123
Reweighting with density ratio estimation
We need to estimate the density ratio:
f_target(x) / f_original(x)
A classifier trained to discriminate original and target samples should reconstruct
the probabilities p(original ∣ x) and p(target ∣ x)
For reweighting we can use
f_target(x) / f_original(x) ∼ p(target ∣ x) / p(original ∣ x)
70 / 123
Reweighting with density ratio estimation
We need to estimate the density ratio:
f_target(x) / f_original(x)
A classifier trained to discriminate original and target samples should reconstruct
the probabilities p(original ∣ x) and p(target ∣ x)
For reweighting we can use
f_target(x) / f_original(x) ∼ p(target ∣ x) / p(original ∣ x)
1. The approach is able to reweight in many variables
2. It has been successfully tried in HEP, see D. Martschei et al., "Advanced event reweighting using
multivariate analysis", 2012
3. Reconstruction is poor when the ratio is too small / too high
4. It is slower than the histogram approach
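A minimal sketch of this density-ratio reweighting, using a generic scikit-learn classifier (illustrative, not the slides' exact setup):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def density_ratio_weights(original, target):
    # train original (0) vs target (1) and use p(target|x) / p(original|x) as multiplier
    X = np.concatenate([original, target])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    clf = GradientBoostingClassifier().fit(X, y)
    proba = clf.predict_proba(original)
    return proba[:, 1] / np.clip(proba[:, 0], 1e-6, None)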
71 / 123
Reminder
In the histogram approach, few bins are bad and many bins are bad too.
Tree splits the space of variables with orthogonal cuts
Each tree leaf can be considered as a region, or a bin
There are different criteria to construct a tree (MSE, Gini index, entropy, …)
72 / 123
Decision tree for reweighting
Write an ML algorithm that solves the reweighting
problem directly:
73 / 123
Decision tree for reweighting
Write an ML algorithm that solves the reweighting
problem directly:
use a tree as a better approach for binning the
variable space
to find regions with the highest difference
between original and target distributions, use
the symmetrized χ² criterion
74 / 123
Decision tree for reweighting
Write an ML algorithm that solves the reweighting
problem directly:
use a tree as a better approach for binning the
variable space
to find regions with the highest difference
between original and target distributions, use
the symmetrized χ² criterion
χ² = ∑_leaf (w_{leaf, original} − w_{leaf, target})² / (w_{leaf, original} + w_{leaf, target})
where w_{leaf, original} and w_{leaf, target} are the total weights of samples in a leaf for the original and
target distributions.
75 / 123
Reweighting with boosted decision trees
Repeat the following steps many times:
76 / 123
Reweighting with boosted decision trees
Repeat the following steps many times:
build a shallow tree to maximize the symmetrized χ²
77 / 123
Reweighting with boosted decision trees
Repeat the following steps many times:
build a shallow tree to maximize the symmetrized χ²
compute predictions in leaves:
leaf_pred = log( w_{leaf, target} / w_{leaf, original} )
78 / 123
Reweighting with boosted decision trees
Repeat the following steps many times:
build a shallow tree to maximize the symmetrized χ²
compute predictions in leaves:
leaf_pred = log( w_{leaf, target} / w_{leaf, original} )
reweight distributions (compare with AdaBoost):
w = w                if the sample is from the target distribution
w = w × e^pred       if the sample is from the original distribution
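This procedure is implemented, for example, in the hep_ml package (GBReweighter); the exact class and argument names below are assumptions and should be checked against the hep_ml documentation:

import numpy as np
from hep_ml.reweight import GBReweighter  # assumed API, see hep_ml documentation

# toy 2D original and target samples (illustrative)
rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.2, size=(10000, 2))
target = rng.normal(0.3, 1.0, size=(10000, 2))

reweighter = GBReweighter(n_estimators=50, max_depth=3)
reweighter.fit(original, target)
new_weights = reweighter.predict_weights(original)  # multipliers for the original samples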
79 / 123
Reweighting with boosted decision trees
Comparison with AdaBoost:
different tree splitting criterion
different boosting procedure
80 / 123
Reweighting with boosted decision trees
Comparison with AdaBoost:
different tree splitting criterion
different boosting procedure
Summary
uses a few large bins at each step (constructed intelligently)
is able to handle many variables
requires less data (for the same performance)
... but slower (being an ML algorithm)
81 / 123
Reweighting applications
high energy physics (HEP): introducing corrections for simulation samples applied to
search for rare decays
low-mass dielectrons study
82 / 123
Reweighting applications
high energy physics (HEP): introducing corrections for simulation samples applied to
search for rare decays
low-mass dielectrons study
astronomy: introducing corrections to fight bias in object observations
objects that are expected to show more interesting properties are preferred when it
comes to costly high-quality spectroscopic follow-up observations
83 / 123
Reweighting applications
high energy physics (HEP): introducing corrections for simulation samples applied to
search for rare decays
low-mass dielectrons study
astronomy: introducing corrections to fight bias in object observations
objects that are expected to show more interesting properties are preferred when it
comes to costly high-quality spectroscopic follow-up observations
polling: introducing corrections to fight non-response bias
assigning higher weight to answers from groups with low response.
84 / 123
Gradient Boosting
85 / 123
Decision trees for regression
86 / 123
Gradient boosting to minimize MSE
Say, we're trying to build an ensemble to minimize MSE:
∑_i (D(x_i) − y_i)² → min
and the ensemble's prediction is obtained by taking a sum over trees:
D(x) = ∑_j d_j(x)
How do we train this ensemble?
87 / 123
Introduce D_j(x) = ∑_{j'=1..j} d_{j'}(x) = D_{j−1}(x) + d_j(x)
Natural solution is to greedily minimize MSE:
∑_i (D_j(x_i) − y_i)² = ∑_i (D_{j−1}(x_i) + d_j(x_i) − y_i)² → min
assuming that we already built j − 1 estimators.
88 / 123
Introduce D_j(x) = ∑_{j'=1..j} d_{j'}(x) = D_{j−1}(x) + d_j(x)
Natural solution is to greedily minimize MSE:
∑_i (D_j(x_i) − y_i)² = ∑_i (D_{j−1}(x_i) + d_j(x_i) − y_i)² → min
assuming that we already built j − 1 estimators.
Introduce the residual: R_j(x_i) = y_i − D_{j−1}(x_i), and rewrite MSE:
∑_i (d_j(x_i) − R_j(x_i))² → min
So the jth estimator (tree) is trained using the following data:
{ x_i, R_j(x_i) }
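A minimal sketch of this residual-fitting loop, using scikit-learn regression trees (names and hyperparameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_mse_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # gradient boosting for MSE: each new tree is fit to the current residuals
    trees, predictions = [], np.zeros(len(y))
    for _ in range(n_estimators):
        residual = y - predictions                     # R_j(x_i) = y_i - D_{j-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                          # train the new tree on (x_i, R_j(x_i))
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def gb_mse_predict(trees, X, learning_rate=0.1):
    return learning_rate * sum(tree.predict(X) for tree in trees)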
89 / 123
Gradient Boosting visualization
90 / 123
Gradient Boosting [Friedman, 1999]
composition of weak regressors,
D(x) = ∑_j α_j d_j(x)
Borrow an approach to encode probabilities from logistic regression:
p_{+1}(x) = σ(D(x))
p_{−1}(x) = σ(−D(x))
Optimization of log-likelihood (y_i = ±1):
L = ∑_i L(x_i, y_i) = ∑_i ln(1 + e^{−y_i D(x_i)}) → min
91 / 123
Gradient Boosting
D(x) = ∑_j α_j d_j(x)
L = ∑_i ln(1 + e^{−y_i D(x_i)}) → min
Optimization problem: find all α_j and weak learners d_j
Mission impossible
92 / 123
Gradient Boosting
D(x) = ∑_j α_j d_j(x)
L = ∑_i ln(1 + e^{−y_i D(x_i)}) → min
Optimization problem: find all α_j and weak learners d_j
Mission impossible
Main point: greedy optimization of the loss function by training one more weak learner d_j
Each new estimator follows the gradient of the loss function
93 / 123
Gradient Boosting
Gradient boosting ~ steepest gradient descent.
D_j(x) = ∑_{j'=1..j} α_{j'} d_{j'}(x)
D_j(x) = D_{j−1}(x) + α_j d_j(x)
At the jth iteration:
compute pseudo-residual
R(x_i) = − ∂L / ∂D(x_i) |_{D(x) = D_{j−1}(x)}
train regressor d_j to minimize MSE:
∑_i (d_j(x_i) − R(x_i))² → min
find optimal α_j
Important exercise: compute pseudo-residuals for MSE and logistic losses.
94 / 123
Additional GB tricks
to make training more stable, add
learning rate η:
D_j(x) = ∑_j η α_j d_j(x)
randomization to fight noise and build different trees:
subsampling of features
subsampling of training samples
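For illustration, both tricks map onto standard scikit-learn parameters of GradientBoostingClassifier (the values here are arbitrary):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, random_state=0)
gb = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,   # η: shrink each tree's contribution
    subsample=0.5,        # subsampling of training samples
    max_features='sqrt',  # subsampling of features at each split
    max_depth=3,
)
gb.fit(X, y)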
95 / 123
AdaBoost is a particular case of gradient boosting with different target loss function*:
L_ada = ∑_i e^{−y_i D(x_i)} → min
This loss function is called ExpLoss or AdaLoss.
*(also AdaBoost expects that d_j(x_i) = ±1)
96 / 123
Loss functions
Gradient boosting can optimize different smooth loss functions (apart from the gradients,
second-order derivatives can be used).
regression, y ∈ R
Mean Squared Error: ∑_i (d(x_i) − y_i)²
Mean Absolute Error: ∑_i |d(x_i) − y_i|
binary classification, y_i = ±1
ExpLoss (aka AdaLoss): ∑_i e^{−y_i d(x_i)}
LogLoss: ∑_i log(1 + e^{−y_i d(x_i)})
97 / 123
98 / 123
99 / 123
Gradient Boosting classification playground
100 / 123
Multiclass classification: ensembling
One-vs-one: n_classes × (n_classes − 1) / 2 classifiers
One-vs-rest: n_classes classifiers
scikit-learn implements those as meta-algorithms.
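A hedged sketch of both meta-algorithms in scikit-learn (toy data, logistic regression as the base classifier):

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_informative=6, n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # n_classes models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # n_classes (n_classes - 1) / 2 models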
101 / 123
Multiclass classification: modifying an algorithm
Most classifiers have natural generalizations to multiclass classification.
Example for logistic regression: introduce for each class c ∈ 1, 2, ..., C a vector w_c.
d_c(x) = ⟨w_c, x⟩
Converting to probabilities using the softmax function:
p_c(x) = e^{d_c(x)} / ∑_c̃ e^{d_c̃(x)}
And minimize LogLoss: L = − ∑_i log p_{y_i}(x_i)
102 / 123
Softmax function
Typical way to convert n numbers to n probabilities.
Mapping is surjective, but not injective (n dimensions to n − 1 dimensions). Invariant to
a global shift:
d_c(x) → d_c(x) + const
For the case of two classes: p_1(x) = e^{d_1(x)} / (e^{d_1(x)} + e^{d_2(x)}) = 1 / (1 + e^{d_2(x) − d_1(x)})
Coincides with the logistic function for d(x) = d_1(x) − d_2(x)
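A minimal numpy sketch of the softmax, using the shift invariance above for numerical stability:

import numpy as np

def softmax(d):
    # convert n decision-function values into n probabilities
    d = d - np.max(d, axis=-1, keepdims=True)   # shift invariance: d_c -> d_c + const
    e = np.exp(d)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 0.5])))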
103 / 123
Loss function: ranking example
In ranking we need to order items by y_i:
y_i < y_j ⇒ d(x_i) < d(x_j)
We can penalize for misordering:
L = ∑_{i, ĩ} L(x_i, x_ĩ, y_i, y_ĩ)
L(x, x̃, y, ỹ) = σ(d(x̃) − d(x))  if y < ỹ,   0 otherwise
104 / 123
Boosting applications
By modifying boosting or changing loss function we can solve different problems.
It is powerful for tabular data of different nature.
Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17
solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the
model, while most others combined XGBoost with neural nets in ensembles.
105 / 123
Boosting applications
ranking
search engine
search engine Yandex uses its own variant of gradient boosting built on oblivious
decision trees
Bing's search uses RankNet, gradient boosting for ranking invented by Microsoft
Research
recommendations
as the final ranking model combining properties of user, text, image contents
106 / 123
Boosting applications
classification and regression
predicting clicks on ads at Facebook
in high energy physics to combine geometric and kinematic properties
B-meson flavour tagging
trigger system
particle identification
rare decays search
store sales prediction
motion detection
customer behavior prediction
web text classification
107 / 123
Trigger system
The goal is to select interesting proton-proton
collisions based on detailed online analysis of
measured physics information
The trigger system can consist of two parts:
hardware and software
It selects several thousand collisions among 40
million per second → should be fast
We have improved LHCb software topological
trigger to select beauty and charm hadrons
Yandex gradient boosting over oblivious
decision trees (MatrixNet) is used
108 / 123
Boosting to uniformity
In HEP applications additional restriction can
arise: classifier should have constant
efficiency (FPR/TPR) against some variable.
109 / 123
Boosting to uniformity
In HEP applications additional restriction can
arise: classifier should have constant
efficiency (FPR/TPR) against some variable.
select particles independently of their lifetime
(trigger system)
identify particle type independently of the
momentum (particle identification)
select particles independently of the
invariant mass (rare decays search)
110 / 123
Non-uniformity measure
difference in the efficiency can be detected by analyzing distributions
uniformity = no dependence between mass and predictions
111 / 123
Non-uniformity measure
Average contributions (difference between global and local distributions) from different regions
in the variable using the Cramer-von Mises measure (integral characteristic):
CvM = ∑_region ∫ |F_region(s) − F_global(s)|² dF_global(s)
112 / 123
Minimizing non-uniformity measure
why not minimize CvM as a loss function with GB?
113 / 123
Minimizing non-uniformity measure
why not minimize CvM as a loss function with GB?
… because we can't compute the gradient
and minimizing CvM doesn't account for the classification problem:
the minimum of CvM is achieved, e.g., by a classifier with random predictions
114 / 123
Minimizing non-uniformity measure
why not minimize CvM as a loss function with GB?
… because we can't compute the gradient
and minimizing CvM doesn't account for the classification problem:
the minimum of CvM is achieved, e.g., by a classifier with random predictions
ROC AUC and classification accuracy are not differentiable either; however, logistic loss, for
example, works well
115 / 123
Flatness loss
Put an additional term into the loss function which penalizes non-uniformity of
predictions:
L = L_AdaLoss + α L_FlatnessLoss
where α is a trade-off between classification quality and uniformity
116 / 123
Flatness loss
Put an additional term into the loss function which penalizes non-uniformity of
predictions:
L = L_AdaLoss + α L_FlatnessLoss
where α is a trade-off between classification quality and uniformity
Flatness loss approximates the non-differentiable CvM measure:
L_FlatnessLoss = ∑_region ∫ |F_region(s) − F_global(s)|² ds
∂L_FlatnessLoss / ∂D(x_i) ∼ 2 [F_region(s) − F_global(s)] |_{s = D(x_i)}
117 / 123
Particle Identification
Problem: identify the charged particle associated with a track
(multiclass classification problem)
particle "types": Ghost, Electron, Muon, Pion,
Kaon, Proton.
118 / 123
Particle Identification
Problem: identify the charged particle associated with a track
(multiclass classification problem)
particle "types": Ghost, Electron, Muon, Pion,
Kaon, Proton.
LHCb detector provides diverse and plentiful
information, collected by subdetectors:
calorimeters, (Ring Imaging) Cherenkov detector,
muon chambers and tracker observables
this information can be efficiently combined
using ML
119 / 123
Particle Identification
Problem: identify the charged particle associated with a track
(multiclass classification problem)
particle "types": Ghost, Electron, Muon, Pion,
Kaon, Proton.
LHCb detector provides diverse and plentiful
information, collected by subdetectors:
calorimeters, (Ring Imaging) Cherenkov detector,
muon chambers and tracker observables
this information can be efficiently combined
using ML
Identification should be independent of particle momentum, pseudorapidity, etc.
120 / 123
Flatness loss for particle identification
121 / 123
Gradient Boosting overview
A powerful ensembling technique (typically used over trees, GBDT)
a general way to optimize differentiable losses
can be adapted to other problems
'following' the gradient of loss at each step
making steps in the space of functions
gradient of poorly-classified observations is higher
increasing the number of trees can lead to overfitting
(= getting worse quality on new data)
requires tuning, better when trees are not complex
widely used in practice
122 / 123
123 / 123

Contenu connexe

Tendances

Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Syed Atif Naseem
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine LearningVARUN KUMAR
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavAgile Testing Alliance
 
Kaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurKaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurSawinder Pal Kaur
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advancedHouw Liong The
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Mostafa G. M. Mostafa
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature GenerationAndres Mendez-Vazquez
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada BoostAndres Mendez-Vazquez
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and morehsharmasshare
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data萍華 楊
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machinesNawal Sharma
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learningYogendra Singh
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Marina Santini
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 

Tendances (19)

Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
 
Kaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurKaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal Kaur
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advanced
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost
 
PCA and SVD in brief
PCA and SVD in briefPCA and SVD in brief
PCA and SVD in brief
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and more
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machines
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 

Similaire à Machine learning in science and industry — day 2

13 random forest
13 random forest13 random forest
13 random forestVishal Dutt
 
Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyabatubao
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3arogozhnikov
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习AdaboostShocky1
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningAbhishek Vijayvargia
 
MLHEP Lectures - day 3, basic track
MLHEP Lectures - day 3, basic trackMLHEP Lectures - day 3, basic track
MLHEP Lectures - day 3, basic trackarogozhnikov
 
Random forest
Random forestRandom forest
Random forestUjjawal
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseasesijsrd.com
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESVikash Kumar
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesMohammed Bennamoun
 
Data.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionData.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionMargaret Wang
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2arogozhnikov
 
Introduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesIntroduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesXavier Rafael Palou
 
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...IOSR Journals
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsMarina Santini
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAminaRepo
 
Business Analytics using R.ppt
Business Analytics using R.pptBusiness Analytics using R.ppt
Business Analytics using R.pptRohit Raj
 

Similaire à Machine learning in science and industry — day 2 (20)

13 random forest
13 random forest13 random forest
13 random forest
 
Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nya
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
Clustering
ClusteringClustering
Clustering
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
MLHEP Lectures - day 3, basic track
MLHEP Lectures - day 3, basic trackMLHEP Lectures - day 3, basic track
MLHEP Lectures - day 3, basic track
 
Random forest
Random forestRandom forest
Random forest
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseases
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
 
Data.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionData.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and prediction
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
 
Introduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesIntroduction to conventional machine learning techniques
Introduction to conventional machine learning techniques
 
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...
Efficient Disease Classifier Using Data Mining Techniques: Refinement of Rand...
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble Learning
 
Business Analytics using R.ppt
Business Analytics using R.pptBusiness Analytics using R.ppt
Business Analytics using R.ppt
 

Dernier

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 

Dernier (20)

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 

Machine learning in science and industry — day 2

  • 1. Machine Learning in Science and Industry Day 2, lectures 3 & 4 Alex Rogozhnikov, Tatiana Likhomanenko Heidelberg, GradDays 2017
  • 2. Recapitulation Statistical ML: applications and problems k nearest neighbours classifier and regressor Binary classification: ROC curve, ROC AUC 1 / 123
  • 3. Recapitulation Optimal regression and optimal classification Bayes optimal classifier = Generative approach: density estimation, QDA, Gaussian mixtures p(y = 0 ∣ x) p(y = 1 ∣ x) p(y = 0) p(x ∣ y = 0) p(y = 1) p(x ∣ y = 1) 2 / 123
  • 5. Recapitulation Discriminative approach: logistic regression (classification model), linear regression with d(x) =< w, x > +w p(y = +1∣x) = p (x)+1 p(y = −1∣x) = p (x)−1 = σ(d(x)) = σ(−d(x)) 0 4 / 123
  • 6. Recapitulation Regularization L = L(x , y ) + L → min L = ∣w ∣ Today: continue discriminative approach reconstruct p(y = 1 ∣ x), p(y = 0 ∣ x) ignore p(x, y) N 1 i ∑ i i reg p i ∑ i p 5 / 123
  • 8. Decision tree Example: predict outside play based on weather conditions. 7 / 123
  • 9. Decision tree: binary tree 8 / 123
  • 10. Decision tree: splitting space 9 / 123
  • 11. Decision tree fast & intuitive prediction but building an optimal decision tree is an NP complete problem 10 / 123
  • 12. Decision tree fast & intuitive prediction but building an optimal decision tree is an NP complete problem building a tree using a greedy optimization start from the root (a tree with only one leaf) each time split one leaf into two repeat process for children if needed 11 / 123
  • 13. Decision tree fast & intuitive prediction but building an optimal decision tree is an NP complete problem building a tree using a greedy optimization start from the root (a tree with only one leaf) each time split one leaf into two repeat process for children if needed need a criterion to select best splitting (feature and threshold) 12 / 123
  • 14. Splitting criterion TreeImpurity = impurity(leaf) × size(leaf) Several impurity functions: where p is a portion of signal observations in a leaf, and 1 − p is a portion of background observations, size(leaf) is number of training observations in a leaf. leaf ∑ Misclassification Gini Index Entropy = min(p, 1 − p) = p(1 − p) = −p log p − (1 − p) log(1 − p) 13 / 123
  • 15. Splitting criterion Impurity as a function of p 14 / 123
  • 16. Splitting criterion: why not misclassification? 15 / 123
  • 17. Decision trees for regression Greedy optimization (minimizing MSE): TreeMSE ∼ (y − ) Can be rewritten as: TreeMSE ∼ MSE(leaf) × size(leaf) MSE(leaf) is like an 'impurity' of the leaf: MSE(leaf) = (y − ) i ∑ i y^i 2 leaf ∑ size(leaf) 1 i∈leaf ∑ i y^i 2 16 / 123
  • 23. Decision trees instability Little variation in training dataset produce different classification rule. 22 / 123
  • 24. Tree keeps splitting until each observation is correctly classified: 23 / 123
  • 25. Pre-stopping We can stop the process of splitting by imposing different restrictions: limit the depth of tree set minimal number of samples needed to split the leaf limit the minimal number of samples in a leaf more advanced: maximal number of leaves in the tree 24 / 123
  • 26. Pre-stopping We can stop the process of splitting by imposing different restrictions: limit the depth of tree set minimal number of samples needed to split the leaf limit the minimal number of samples in a leaf more advanced: maximal number of leaves in the tree Any combinations of rules above is possible. 25 / 123
  • 27. no pre-stopping maximal depth min # of samples in a leaf maximal number of leaves 26 / 123
  • 28. Decision tree overview 1. Very intuitive algorithm for regression and classification 2. Fast prediction 3. Scale-independent 4. Supports multiclassification But 1. Training optimal tree is NP-complex Trained greedily by optimizing Gini index or entropy (fast!) 2. Non-stable 3. Uses only trivial conditions 27 / 123
  • 29. Feature importances Different approaches exist to measure an importance of feature in the final model. Importance of feature ≠ quality provided by one feature 28 / 123
  • 30. Computing feature importances tree: counting number of splits made over this feature tree: counting gain in purity (e.g. Gini) fast and adequate model-agnostic recipe: train without one feature, compare quality on test with/without one feature requires many evaluations model-agnostic recipe: feature shuffling take one column in test dataset and shuffle it. Compare quality with/without shuffling. 29 / 123
  • 32. Composition of models Basic motivation: improve quality of classification by reusing strong sides of different classifiers / regressors. 31 / 123
  • 33. Simple Voting Averaging predictions = [−1, +1, +1, +1, −1] ⇒ P = 0.6, P = 0.4 Averaging predicted probabilities P (x) = p (x) Averaging decision functions D(x) = d (x) y^ +1 −1 ±1 J 1 ∑j=1 J ±1,j J 1 ∑j=1 J j 32 / 123
  • 34. Weighted voting The way to introduce importance of classifiers D(x) = α d (x) General case of ensembling D(x) = f(d (x), d (x), ..., d (x)) ∑j j j 1 2 J 33 / 123
  • 35. Problems very close base classifiers need to keep variation and still have good quality of basic classifiers 34 / 123
  • 37. Generating training subset subsampling taking fixed part of samples (sampling without replacement) bagging (Bootstrap AGGregating) sampling with replacement, If #generated samples = length of the dataset, the fraction of unique samples in new dataset is 1 − ∼ 63.2%e 1 36 / 123
  • 38. Random subspace model (RSM) Generating subspace of features by taking random subset of features 37 / 123
  • 39. Random Forest [Leo Breiman, 2001] Random forest is a composition of decision trees. Each individual tree is trained on a subset of training data obtained by bagging samples taking random subset of features Predictions of random forest are obtained via simple voting. 38 / 123
  • 41. data optimal boundary 50 trees 2000 trees 40 / 123
  • 42. Overfitting doesn't overfit: increasing complexity (adding more trees) doesn't spoil a classifier 41 / 123
  • 43. Works with features of different nature Stable to outliers in data. 42 / 123
  • 44. Works with features of different nature Stable to outliers in data. From 'Testing 179 Classifiers on 121 Datasets' The classifiers most likely to be the bests are the random forest (RF) versions, the best of which [...] achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. Applications: a good approach for tabular data Random forest is used to predict movements of individual body parts in Kinect 43 / 123
  • 45. Random Forest overview Impressively simple Trees can be trained in parallel Doesn't overfit 44 / 123
  • 46. Random Forest overview Impressively simple Trees can be trained in parallel Doesn't overfit Doesn't require much tuning Effectively only one parameter: number of features used in each tree Recommendation: N =used √Nfeatures 45 / 123
  • 47. Random Forest overview Impressively simple Trees can be trained in parallel Doesn't overfit Doesn't require much tuning Effectively only one parameter: number of features used in each tree Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$ But Hardly interpretable Trained trees take much space, some kind of pre-stopping is required in practice Doesn't fix mistakes done by previous trees 46 / 123
  • 48. Sample weights in ML Can be used with many estimators. We now have triples $(x_i, y_i, w_i)$, where $i$ is the index of an observation weight corresponds to the expected frequency of an observation expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$th observation global normalization of weights doesn't matter 47 / 123
  • 49. Sample weights in ML Can be used with many estimators. We now have triples $(x_i, y_i, w_i)$, where $i$ is the index of an observation weight corresponds to the expected frequency of an observation expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$th observation global normalization of weights doesn't matter Example for logistic regression: $\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$ 48 / 123
  • 50. Sample weights in ML Weights (parameters) of a classifier ≠ sample weights In code: tree = DecisionTreeClassifier(max_depth=4); tree.fit(X, y, sample_weight=weights) Sample weights are a convenient way to regulate the importance of training observations. Only sample weights are meant when talking about AdaBoost. 49 / 123
  • 51. AdaBoost [Freund, Schapire, 1995] Bagging: information from previous trees is not taken into account. Adaptive Boosting is a weighted composition of weak learners: $D(x) = \sum_j \alpha_j d_j(x)$ We assume $d_j(x) = \pm 1$, labels $y_i = \pm 1$; the $j$th weak learner misclassified the $i$th observation iff $y_i d_j(x_i) = -1$ 50 / 123
  • 52. AdaBoost $D(x) = \sum_j \alpha_j d_j(x)$ Weak learners are built in sequence, each classifier is trained using different weights initially $w_i = 1$ for each training sample After building the $j$th base classifier: 1. compute the total weight of correctly and wrongly classified observations, $\alpha_j = \frac{1}{2} \ln\left(\frac{w_{\text{correct}}}{w_{\text{wrong}}}\right)$ 2. increase weights of misclassified samples: $w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$ 51 / 123
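A sketch of this update rule implemented directly over decision stumps (synthetic data; the small epsilon guarding the logarithm is my addition, not part of the original formulation):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=500, random_state=0)
    y = 2 * y01 - 1                        # labels as ±1

    w = np.ones(len(y))                    # initially w_i = 1
    alphas, learners = [], []
    for j in range(50):
        d = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = d.predict(X)
        w_correct = w[pred == y].sum()
        w_wrong = w[pred != y].sum()
        alpha = 0.5 * np.log((w_correct + 1e-12) / (w_wrong + 1e-12))
        w = w * np.exp(-alpha * y * pred)  # misclassified samples get larger weights
        alphas.append(alpha)
        learners.append(d)

    # D(x) = sum_j alpha_j d_j(x)
    D = sum(a * d.predict(X) for a, d in zip(alphas, learners))
    print(np.mean(np.sign(D) == y))        # training accuracy of the composition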
  • 53. AdaBoost example Decision trees of depth 1 will be used. 52 / 123
  • 56. AdaBoost example: plots after 1, 2, 3, 100 trees 55 / 123
  • 57. AdaBoost secret $D(x) = \sum_j \alpha_j d_j(x)$ $\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$ sample weight is equal to the penalty for an observation: $w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$ $\alpha_j$ is obtained as a result of analytical optimization Exercise: prove the formula for $\alpha_j$ 56 / 123
  • 58. Loss function of AdaBoost 57 / 123
  • 59. AdaBoost summary is able to combine many weak learners takes mistakes into account simple, overhead for boosting is negligible too sensitive to outliers In scikit-learn, one can run AdaBoost over other algorithms. 58 / 123
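For completeness, a sketch of running AdaBoost from scikit-learn over decision stumps (the constructor argument was called base_estimator around the time of these lectures; recent scikit-learn versions rename it to estimator):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    ada = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=1),  # any estimator supporting sample_weight
        n_estimators=100,
    )
    ada.fit(X, y)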
  • 61. Recapitulation 1. Decision tree: splitting criteria pre-stopping 2. Random Forest: averaging over trees 3. AdaBoost: exponential loss, takes mistakes into account 60 / 123
  • 63. Reweighting in HEP Reweighting in HEP is used to minimize the difference between real samples (target) and Monte Carlo (MC) simulated samples (original) The goal is to assign weights to original s.t. original and target distributions coincide 62 / 123
  • 64. Reweighting in HEP: typical approach variable(s) is split into bins in each bin the weight for original samples is multiplied by: $\text{multiplier}_{\text{bin}} = \frac{w_{\text{bin, target}}}{w_{\text{bin, original}}}$ where $w_{\text{bin, target}}$ and $w_{\text{bin, original}}$ are the total weights of samples in a bin for the target and original distributions. 63 / 123
  • 65. Reweighting in HEP: typical approach variable(s) is split into bins in each bin the weight for original samples is multiplied by: $\text{multiplier}_{\text{bin}} = \frac{w_{\text{bin, target}}}{w_{\text{bin, original}}}$ where $w_{\text{bin, target}}$ and $w_{\text{bin, original}}$ are the total weights of samples in a bin for the target and original distributions. 1. simple and fast 2. number of variables is very limited by statistics (typically only one or two) 3. reweighting in one variable may bring disagreement in others 4. which variable is preferable for reweighting? 64 / 123
  • 66. Reweighting in HEP: typical approach Problems arise when there are too few observations in a bin This can be detected on a holdout (see the last row) Issues: few bins - rule is rough many bins - rule is not reliable 65 / 123
  • 67. Reweighting in HEP: typical approach Problems arise when there are too few observations in a bin This can be detected on a holdout (see the last row) Issues: few bins - rule is rough many bins - rule is not reliable Reweighting rule must be checked on a holdout! 66 / 123
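A toy 1D sketch of the bin-based multiplier (the distributions, binning and variable names are invented for illustration; plain counts stand in for total weights since the toy samples are unweighted):

    import numpy as np

    rng = np.random.RandomState(0)
    original = rng.exponential(1.0, size=20_000)    # e.g. MC simulation
    target = rng.exponential(1.2, size=20_000)      # e.g. real data

    edges = np.linspace(0, 5, 21)                   # 20 bins
    w_original, _ = np.histogram(original, bins=edges)
    w_target, _ = np.histogram(target, bins=edges)

    multiplier = w_target / np.maximum(w_original, 1)                  # w_bin,target / w_bin,original
    bin_idx = np.clip(np.digitize(original, edges) - 1, 0, len(multiplier) - 1)
    weights = multiplier[bin_idx]                   # per-sample weights for the original sample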
  • 68. Reweighting quality How to check the quality of reweighting? One-dimensional case: two-sample tests (Kolmogorov-Smirnov test, Mann-Whitney test, …) Two or more dimensions? Comparing 1d projections is not enough 67 / 123
  • 69. Reweighting quality How to check the quality of reweighting? One-dimensional case: two-sample tests (Kolmogorov-Smirnov test, Mann-Whitney test, …) Two or more dimensions? Comparing 1d projections is not enough 68 / 123
  • 70. Reweighting quality with ML Final goal: classifier doesn't use original/target disagreement information = classifier cannot discriminate original and target samples Comparison of distributions shall be done using ML: train a classifier to discriminate original and target samples output of the classifier is a one-dimensional variable look at the ROC curve (an alternative to a two-sample test) on a holdout AUC score should be 0.5 if the classifier cannot discriminate original and target samples 69 / 123
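Continuing the toy example above, a sketch of this check: train a classifier to separate the reweighted original sample from the target and look at the ROC AUC on a holdout (the classifier choice and variable names are arbitrary):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # label 0 = original (carrying its reweighting weights), label 1 = target
    X_all = np.concatenate([original, target]).reshape(-1, 1)
    labels = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    sw = np.concatenate([weights, np.ones(len(target))])

    Xtr, Xte, ytr, yte, swtr, swte = train_test_split(X_all, labels, sw, random_state=0)
    clf = GradientBoostingClassifier().fit(Xtr, ytr, sample_weight=swtr)
    auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1], sample_weight=swte)
    print(auc)   # close to 0.5 means the classifier cannot discriminate the samples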
  • 71. Reweighting with density ratio estimation We need to estimate the density ratio $\frac{f_{\text{target}}(x)}{f_{\text{original}}(x)}$ A classifier trained to discriminate original and target samples should reconstruct the probabilities $p(\text{original} \mid x)$ and $p(\text{target} \mid x)$ For reweighting we can use $\frac{f_{\text{target}}(x)}{f_{\text{original}}(x)} \sim \frac{p(\text{target} \mid x)}{p(\text{original} \mid x)}$ 70 / 123
  • 72. Reweighting with density ratio estimation We need to estimate the density ratio $\frac{f_{\text{target}}(x)}{f_{\text{original}}(x)}$ A classifier trained to discriminate original and target samples should reconstruct the probabilities $p(\text{original} \mid x)$ and $p(\text{target} \mid x)$ For reweighting we can use $\frac{f_{\text{target}}(x)}{f_{\text{original}}(x)} \sim \frac{p(\text{target} \mid x)}{p(\text{original} \mid x)}$ 1. Approach is able to reweight in many variables 2. It has been successfully tried in HEP, see D. Martschei et al, "Advanced event reweighting using multivariate analysis", 2012 3. Reconstruction is poor when the ratio is too small / too high 4. It is slower than the histogram approach 71 / 123
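A sketch of the density-ratio recipe on the same toy samples (training and predicting on the same events here only for brevity; in practice a holdout or cross-validation should be used):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    X_all = np.concatenate([original, target]).reshape(-1, 1)
    labels = np.concatenate([np.zeros(len(original)), np.ones(len(target))])

    clf = GradientBoostingClassifier().fit(X_all, labels)
    p = clf.predict_proba(original.reshape(-1, 1))
    ratio_weights = p[:, 1] / p[:, 0]   # ~ p(target|x) / p(original|x) ~ f_target / f_original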
  • 73. Reminder In histogram approach few bins are bad, many bins are bad too. Tree splits the space of variables with orthogonal cuts Each tree leaf can be considered as a region, or a bin There are different criteria to construct a tree (MSE, Gini index, entropy, …) 72 / 123
  • 74. Decision tree for reweighting Write an ML algorithm that solves the reweighting problem directly: 73 / 123
  • 75. Decision tree for reweighting Write an ML algorithm that solves the reweighting problem directly: use a tree as a better approach for binning the variables space to find regions with the highest difference between original and target distributions use the symmetrized $\chi^2$ criterion 74 / 123
  • 76. Decision tree for reweighting Write an ML algorithm that solves the reweighting problem directly: use a tree as a better approach for binning the variables space to find regions with the highest difference between original and target distributions use the symmetrized $\chi^2$ criterion: $\chi^2 = \sum_{\text{leaf}} \frac{(w_{\text{leaf, original}} - w_{\text{leaf, target}})^2}{w_{\text{leaf, original}} + w_{\text{leaf, target}}}$ where $w_{\text{leaf, target}}$ and $w_{\text{leaf, original}}$ are the total weights of samples in a leaf for the target and original distributions. 75 / 123
  • 77. Reweighting with boosted decision trees Repeat the following steps many times: 76 / 123
  • 78. Reweighting with boosted decision trees Repeat the following steps many times: build a shallow tree to maximize the symmetrized $\chi^2$ 77 / 123
  • 79. Reweighting with boosted decision trees Repeat the following steps many times: build a shallow tree to maximize the symmetrized $\chi^2$ compute predictions in leaves: $\text{leaf\_pred} = \log \frac{w_{\text{leaf, target}}}{w_{\text{leaf, original}}}$ 78 / 123
  • 80. Reweighting with boosted decision trees Repeat the following steps many times: build a shallow tree to maximize the symmetrized $\chi^2$ compute predictions in leaves: $\text{leaf\_pred} = \log \frac{w_{\text{leaf, target}}}{w_{\text{leaf, original}}}$ reweight distributions (compare with AdaBoost): $w \leftarrow \begin{cases} w, & \text{if sample from target distribution} \\ w \times e^{\text{pred}}, & \text{if sample from original distribution} \end{cases}$ 79 / 123
  • 81. Reweighting with boosted decision trees Comparison with AdaBoost: different tree splitting criterion different boosting procedure 80 / 123
  • 82. Reweighting with boosted decision trees Comparison with AdaBoost: different tree splitting criterion different boosting procedure Summary uses a few large bins at each step (constructed adaptively by the tree) is able to handle many variables requires less data (for the same performance) ... but slow (being an ML algorithm) 81 / 123
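This algorithm is implemented in the hep_ml package (by one of the lecturers); a sketch assuming its GBReweighter interface, with the 1D toy samples from above reshaped into feature matrices and all hyperparameter values chosen arbitrarily:

    from hep_ml.reweight import GBReweighter

    original_features = original.reshape(-1, 1)   # columns = variables used for reweighting
    target_features = target.reshape(-1, 1)

    reweighter = GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3)
    reweighter.fit(original_features, target_features)   # original_weight=..., target_weight=... can also be passed
    gb_weights = reweighter.predict_weights(original_features)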
  • 83. Reweighting applications high energy physics (HEP): introducing corrections for simulation samples applied to search for rare decays low-mass dielectrons study 82 / 123
  • 84. Reweighting applications high energy physics (HEP): introducing corrections for simulation samples applied to search for rare decays low-mass dielectrons study astronomy: introducing corrections to fight with bias in objects observations objects that are expected to show more interesting properties are preferred when it comes to costly high-quality spectroscopic follow-up observations 83 / 123
  • 85. Reweighting applications high energy physics (HEP): introducing corrections for simulation samples applied to search for rare decays low-mass dielectrons study astronomy: introducing corrections to fight with bias in objects observations objects that are expected to show more interesting properties are preferred when it comes to costly high-quality spectroscopic follow-up observations polling: introducing corrections to fight non-response bias assigning higher weight to answers from groups with low response. 84 / 123
  • 87. Decision trees for regression 86 / 123
  • 88. Gradient boosting to minimize MSE Say, we're trying to build an ensemble to minimize MSE: $\sum_i (D(x_i) - y_i)^2 \to \min$ and the ensemble's prediction is obtained by taking the sum over trees: $D(x) = \sum_j d_j(x)$ How do we train this ensemble? 87 / 123
  • 89. Introduce $D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$ Natural solution is to greedily minimize MSE: $\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$ assuming that we already built $j - 1$ estimators. 88 / 123
  • 90. Introduce $D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$ Natural solution is to greedily minimize MSE: $\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$ assuming that we already built $j - 1$ estimators. Introduce the residual: $R_j(x_i) = y_i - D_{j-1}(x_i)$, and rewrite MSE: $\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$ So the $j$th estimator (tree) is trained using the following data: $\{x_i, R_j(x_i)\}$ 89 / 123
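A sketch of this greedy residual-fitting loop with plain regression trees on synthetic data (the tree depth and number of iterations are arbitrary):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

    trees = []
    prediction = np.zeros(len(y))               # D_0(x) = 0
    for j in range(100):
        residual = y - prediction               # R_j(x_i) = y_i - D_{j-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        trees.append(tree)
        prediction += tree.predict(X)           # D_j(x) = D_{j-1}(x) + d_j(x)

    print(np.mean((prediction - y) ** 2))       # training MSE shrinks as trees are added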
  • 92. Gradient Boosting [Friedman, 1999] composition of weak regressors, $D(x) = \sum_j \alpha_j d_j(x)$ Borrow the approach to encode probabilities from logistic regression: $p_{+1}(x) = \sigma(D(x))$, $p_{-1}(x) = \sigma(-D(x))$ Optimization of log-likelihood ($y_i = \pm 1$): $\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln\left(1 + e^{-y_i D(x_i)}\right) \to \min$ 91 / 123
  • 93. Gradient Boosting $D(x) = \sum_j \alpha_j d_j(x)$ $\mathcal{L} = \sum_i \ln\left(1 + e^{-y_i D(x_i)}\right) \to \min$ Optimization problem: find all $\alpha_j$ and weak learners $d_j$ Mission impossible 92 / 123
  • 94. Gradient Boosting $D(x) = \sum_j \alpha_j d_j(x)$ $\mathcal{L} = \sum_i \ln\left(1 + e^{-y_i D(x_i)}\right) \to \min$ Optimization problem: find all $\alpha_j$ and weak learners $d_j$ Mission impossible Main point: greedy optimization of the loss function by training one more weak learner $d_j$ Each new estimator follows the gradient of the loss function 93 / 123
  • 95. Gradient Boosting Gradient boosting ~ steepest gradient descent. $D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x)$, $D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$ At the $j$th iteration: compute the pseudo-residual $R(x_i) = -\left.\frac{\partial \mathcal{L}}{\partial D(x_i)}\right|_{D(x) = D_{j-1}(x)}$ train a regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - R(x_i))^2 \to \min$ find the optimal $\alpha_j$ Important exercise: compute pseudo-residuals for MSE and logistic losses. 94 / 123
  • 96. Additional GB tricks to make training more stable, add a learning rate $\eta$: $D(x) = \eta \sum_j \alpha_j d_j(x)$ randomization to fight noise and build different trees: subsampling of features subsampling of training samples 95 / 123
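A sketch of how these tricks appear as parameters of scikit-learn's GradientBoostingClassifier (all values are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    gb = GradientBoostingClassifier(
        n_estimators=300,
        learning_rate=0.05,   # eta: shrinks the contribution of every tree
        subsample=0.5,        # random subset of training samples for each tree
        max_features=0.7,     # random subset of features considered at each split
        max_depth=3,
    )
    gb.fit(X, y)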
  • 97. AdaBoost is a particular case of gradient boosting with a different target loss function*: $\mathcal{L}_{\text{ada}} = \sum_i e^{-y_i D(x_i)} \to \min$ This loss function is called ExpLoss or AdaLoss. *(also AdaBoost expects that $d_j(x_i) = \pm 1$) 96 / 123
  • 98. Loss functions Gradient boosting can optimize different smooth loss functions (apart from the gradients, second-order derivatives can be used). regression, $y \in \mathbb{R}$: Mean Squared Error $\sum_i (d(x_i) - y_i)^2$ Mean Absolute Error $\sum_i |d(x_i) - y_i|$ binary classification, $y_i = \pm 1$: ExpLoss (aka AdaLoss) $\sum_i e^{-y_i d(x_i)}$ LogLoss $\sum_i \log\left(1 + e^{-y_i d(x_i)}\right)$ 97 / 123
  • 101. Gradient Boosting classification playground 100 / 123
  • 102. Multiclass classification: ensembling One-vs-one ($\frac{n_{\text{classes}} (n_{\text{classes}} - 1)}{2}$ classifiers), One-vs-rest ($n_{\text{classes}}$ classifiers) scikit-learn implements those as meta-algorithms. 101 / 123
  • 103. Multiclass classification: modifying an algorithm Most classifiers have natural generalizations to multiclass classification. Example for logistic regression: introduce for each class $c \in 1, 2, ..., C$ a vector $w_c$: $d_c(x) = \langle w_c, x \rangle$ Converting to probabilities using the softmax function: $p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde{c}} e^{d_{\tilde{c}}(x)}}$ And minimize LogLoss: $\mathcal{L} = -\sum_i \log p_{y_i}(x_i)$ 102 / 123
  • 104. Softmax function Typical way to convert $n$ numbers to $n$ probabilities. Mapping is surjective, but not injective ($n$ dimensions to $n - 1$ dimensions). Invariant to global shift: $d_c(x) \to d_c(x) + \text{const}$ For the case of two classes: $p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{d_2(x) - d_1(x)}}$ Coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$ 103 / 123
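A few lines illustrating the softmax and its shift invariance (subtracting the maximum is the standard numerical-stability trick and relies exactly on that invariance):

    import numpy as np

    def softmax(d):
        d = d - d.max()          # a global shift does not change the result
        e = np.exp(d)
        return e / e.sum()

    scores = np.array([1.0, 2.0, 0.5])
    print(softmax(scores))            # probabilities, sum to 1
    print(softmax(scores + 100.0))    # identical after a global shift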
  • 105. Loss function: ranking example In ranking we need to order items by $y_i$: $y_i < y_j \Rightarrow d(x_i) < d(x_j)$ We can penalize for misordering: $\mathcal{L} = \sum_{i, \tilde{i}} L(x_i, x_{\tilde{i}}, y_i, y_{\tilde{i}})$, $L(x, \tilde{x}, y, \tilde{y}) = \begin{cases} \sigma(d(x) - d(\tilde{x})), & y < \tilde{y} \\ 0, & \text{otherwise} \end{cases}$ 104 / 123
  • 106. Boosting applications By modifying boosting or changing loss function we can solve different problems. It is powerful for tabular data of different nature. Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. 105 / 123
  • 107. Boosting applications ranking search engine search engine Yandex uses own variant of gradient boosting which uses oblivious decision trees Bing's search uses RankNet, gradient boosting for ranking invented by Microsoft Research recommendations as the final ranking model combining properties of user, text, image contents 106 / 123
  • 108. Boosting applications classification and regression predicting clicks on ads at Facebook in high energy physics to combine geometric and kinematic properties B-meson flavour tagging trigger system particle identification rare decays search store sales prediction motion detection customer behavior prediction web text classification 107 / 123
  • 109. Trigger system The goal is to select interesting proton-proton collisions based on detailed online analysis of measured physics information The trigger system can consist of two parts: hardware and software It selects several thousand collisions out of 40 million per second → should be fast We have improved the LHCb software topological trigger to select beauty and charm hadrons Yandex gradient boosting over oblivious decision trees (MatrixNet) is used 108 / 123
  • 110. Boosting to uniformity In HEP applications additional restriction can arise: classifier should have constant efficiency (FPR/TPR) against some variable. 109 / 123
  • 111. Boosting to uniformity In HEP applications additional restriction can arise: classifier should have constant efficiency (FPR/TPR) against some variable. select particles independently of their lifetime (trigger system) identify particle type independently of its momentum (particle identification) select particles independently of the invariant mass (rare decays search) 110 / 123
  • 112. Non-uniformity measure difference in the efficiency can be detected by analyzing distributions uniformity = no dependence between mass and predictions 111 / 123
  • 113. Non-uniformity measure Average contributions (difference between global and local distributions) from different regions in the variable using the Cramer-von Mises measure (integral characteristic): $\text{CvM} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 dF_{\text{global}}(s)$ 112 / 123
  • 114. Minimizing non-uniformity measure why not minimize CvM as a loss function with GB? 113 / 123
  • 115. Minimizing non-uniformity measure why not minimize CvM as a loss function with GB? … because we can't compute the gradient, and minimizing CvM doesn't address the classification problem: the minimum of CvM is achieved, e.g., on a classifier with random predictions 114 / 123
  • 116. Minimizing non-uniformity measure why not minimize CvM as a loss function with GB? … because we can't compute the gradient, and minimizing CvM doesn't address the classification problem: the minimum of CvM is achieved, e.g., on a classifier with random predictions ROC AUC and classification accuracy are not differentiable either; however, logistic loss, for example, works well 115 / 123
  • 117. Flatness loss Put an additional term in the loss function which will penalize non-uniformity of predictions: $\mathcal{L} = \mathcal{L}_{\text{AdaLoss}} + \alpha \mathcal{L}_{\text{FlatnessLoss}}$ where $\alpha$ is a trade-off between classification quality and uniformity 116 / 123
  • 118. Flatness loss Put an additional term in the loss function which will penalize non-uniformity of predictions: $\mathcal{L} = \mathcal{L}_{\text{AdaLoss}} + \alpha \mathcal{L}_{\text{FlatnessLoss}}$ where $\alpha$ is a trade-off between classification quality and uniformity Flatness loss approximates the non-differentiable CvM measure: $\mathcal{L}_{\text{FlatnessLoss}} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 ds$, $\frac{\partial}{\partial D(x_i)} \mathcal{L}_{\text{FlatnessLoss}} \sim 2 \left[ F_{\text{region}}(s) - F_{\text{global}}(s) \right]_{s = D(x_i)}$ 117 / 123
  • 119. Particle Identification Problem: identify charged particle associated with a track 
(multiclass classification problem) particle "types": Ghost, Electron, Muon, Pion, Kaon, Proton. 118 / 123
  • 120. Particle Identification Problem: identify charged particle associated with a track 
(multiclass classification problem) particle "types": Ghost, Electron, Muon, Pion, Kaon, Proton. LHCb detector provides diverse plentiful information, collected by subdetectors: calorimeters, (Ring Imaging) Cherenkov detector, muon chambers and tracker observables this information can be efficiently combined using ML 119 / 123
  • 121. Particle Identification Problem: identify charged particle associated with a track 
(multiclass classification problem) particle "types": Ghost, Electron, Muon, Pion, Kaon, Proton. LHCb detector provides diverse plentiful information, collected by subdetectors: calorimeters, (Ring Imaging) Cherenkov detector, muon chambers and tracker observables this information can be efficiently combined using ML Identification should be independent of particle momentum, pseudorapidity, etc. 120 / 123
  • 122. Flatness loss for particle identification 121 / 123
  • 123. Gradient Boosting overview A powerful ensembling technique (typically used over trees, GBDT) a general way to optimize differentiable losses can be adapted to other problems 'following' the gradient of loss at each step making steps in the space of functions gradient of poorly-classified observations is higher increasing the number of trees can lead to overfitting (= getting worse quality on new data) requires tuning, better when trees are not complex widely used in practice 122 / 123