# 7 Classification & Goodness of Fit (Theoretical)
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
@freakonometrics · freakonometrics.hypotheses.org
Machine Learning, in a Probabilistic Framework
Assume that training and validation data are drawn i.i.d. from P, i.e. (Y, X) ∼ F.
Consider y ∈ {−1, +1}. The true risk of a classifier m is
$$R(m) = \mathbb{P}_{(Y,X)\sim F}\big(m(X) \neq Y\big) = \mathbb{E}_{(Y,X)\sim F}\big[\ell(m(X), Y)\big]$$
where ℓ is the loss function (here the 0-1 loss, ℓ(m(x), y) = 1{m(x) ≠ y}).
The Bayes classifier is
$$b(x) = \mathrm{sign}\big(\mathbb{E}_{(Y,X)\sim F}[Y \mid X = x]\big)$$
which satisfies $R(b) = \inf_{m \in \mathcal{H}}\{R(m)\}$ (in the class H of all measurable functions), called the Bayes risk.
The empirical risk is
$$R_n(m) = \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, m(x_i)\big)$$
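To fix ideas, here is a minimal Python sketch (an invented toy example, not from the slides): for an assumed Gaussian-mixture distribution of (Y, X), the Bayes classifier is the sign of E[Y | X = x], and the empirical risk of any classifier is the average 0-1 loss on a simulated sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution F: Y = ±1 with probability 1/2, and X | Y ~ N(Y, 1)
def draw_sample(n):
    y = rng.choice([-1, 1], size=n)
    x = rng.normal(loc=y, scale=1.0)
    return x, y

def bayes_classifier(x):
    # b(x) = sign(E[Y | X = x]); by symmetry here, E[Y | X = x] > 0 iff x > 0
    return np.where(x >= 0, 1, -1)

def empirical_risk(m, x, y):
    # R_n(m) = (1/n) * sum of 1{m(x_i) != y_i}
    return np.mean(m(x) != y)

x, y = draw_sample(10_000)
print("empirical risk of the Bayes classifier:", empirical_risk(bayes_classifier, x, y))
# in this toy model the Bayes risk is P(N(1, 1) < 0) = Phi(-1), about 0.159
```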
One might think of using regularized empirical risk minimization,
$$m_n \in \underset{m \in \mathcal{M}}{\mathrm{argmin}}\big\{R_n(m) + \lambda\,\|m\|\big\}$$
in a class of models M, where the regularization term controls the complexity of the model, to prevent overfitting.
Let $m^\star$ denote the best model in M, $m^\star = \mathrm{argmin}_{m \in \mathcal{M}}\{R(m)\}$. Then
$$R(m_n) - R(b) = \underbrace{\big[R(m_n) - R(m^\star)\big]}_{\text{estimation error}} + \underbrace{\big[R(m^\star) - R(b)\big]}_{\text{approximation error}}$$
Since $R(m_n) = R_n(m_n) + [R(m_n) - R_n(m_n)]$, we can write
$$R(m_n) \leq R_n(m_n) + \text{something}(m, F)$$
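As a hedged sketch of regularized empirical risk minimization (the linear model class, the logistic surrogate loss and the value of λ below are arbitrary illustrative choices, not the slides' setup), one can minimize a penalized empirical risk over classifiers m(x) = sign(β₀ + β₁x) by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)

# simulated sample (x_i, y_i), with y_i in {-1, +1}
n = 500
y = rng.choice([-1, 1], size=n)
x = rng.normal(loc=y, scale=1.0)

lam = 0.1           # regularization weight lambda (arbitrary value)
beta = np.zeros(2)  # linear model m(x) = sign(beta[0] + beta[1] * x)

# gradient descent on the penalized (logistic surrogate) empirical risk
for _ in range(2000):
    z = beta[0] + beta[1] * x
    w = -y / (1.0 + np.exp(y * z))                       # d/dz log(1 + exp(-y z))
    grad = np.array([np.mean(w), np.mean(w * x)]) + 2 * lam * beta
    beta -= 0.1 * grad

m_n = lambda u: np.where(beta[0] + beta[1] * u >= 0, 1, -1)
print("R_n(m_n) =", np.mean(m_n(x) != y))   # empirical 0-1 risk of the fitted model
```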
To quantify this something(m, F), we need Hoeffding's inequality, see Hoeffding (1963, Probability Inequalities for Sums of Bounded Random Variables).
Let g(x, y) = ℓ(m(x), y), for some model m, and let
$$\mathcal{G} = \big\{g : (x, y) \mapsto \ell(m(x), y),\ m \in \mathcal{M}\big\}$$
If Z = (Y, X), set
$$R(g) = \mathbb{E}_{Z\sim F}\big[g(Z)\big] \quad\text{and}\quad R_n(g) = \frac{1}{n}\sum_{i=1}^n g(z_i).$$
Hoeffding's inequality
If Z_1, ..., Z_n are i.i.d. and if h is a bounded function (with values in [a, b]), then, for all ε > 0,
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n h(Z_i) - \mathbb{E}_F[h(Z)]\right| \geq \varepsilon\right) \leq 2\exp\left(\frac{-2n\varepsilon^2}{(b-a)^2}\right)$$
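A quick Monte Carlo check of this inequality (the Bernoulli choice of h(Z) below is an arbitrary illustration): compare the empirical frequency of deviations of size ε with the bound 2 exp(−2nε²/(b−a)²).

```python
import numpy as np

rng = np.random.default_rng(2)

n, eps, n_sim = 200, 0.1, 20_000
p = 0.3                                   # h(Z) ~ Bernoulli(p), so a = 0, b = 1

means = rng.binomial(n, p, size=n_sim) / n        # n_sim independent sample means
freq = np.mean(np.abs(means - p) >= eps)          # empirical deviation frequency
bound = 2 * np.exp(-2 * n * eps**2)               # Hoeffding bound, (b - a)^2 = 1

print(f"P(|mean - E| >= {eps}) ≈ {freq:.4f}  <=  Hoeffding bound {bound:.4f}")
```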
or equivalently (let δ denote the upper bound),
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n h(Z_i) - \mathbb{E}_F[h(Z)]\right| \geq (b-a)\sqrt{\frac{-1}{2n}\log\frac{\delta}{2}}\right) \leq \delta$$
We can actually derive a one-sided bound: with probability (at least) 1 − δ,
$$R(g) \leq R_n(g) + \sqrt{\frac{-1}{2n}\log\delta}$$
For a fixed m ∈ M, the deviation R(m) − R_n(m) therefore behaves as
$$R(m) - R_n(m) \sim \frac{1}{\sqrt{n}}$$
But it doesn't help much: we need uniform deviations (or the worst deviation) over the class.
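A quick simulation of this 1/√n behaviour (assumed Bernoulli losses, not from the slides): for a fixed model m, the loss 1{m(X) ≠ Y} is Bernoulli with mean R(m), and the typical deviation of R_n(m) shrinks at the 1/√n rate.

```python
import numpy as np

rng = np.random.default_rng(3)

R = 0.2   # true risk of a fixed classifier m: the loss 1{m(X) != Y} is Bernoulli(R)
for n in [100, 400, 1600, 6400]:
    # R_n(m) is the mean of n Bernoulli(R) losses; repeat 5000 times
    Rn = rng.binomial(n, R, size=5000) / n
    dev = np.mean(np.abs(R - Rn))
    # the mean absolute deviation is proportional to 1/sqrt(n)
    print(f"n = {n:5d}   E|R - R_n| ≈ {dev:.4f}   1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```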
Consider a finite set of models. Define the set of "bad" samples
$$\mathcal{Z}_j = \big\{(z_1, \cdots, z_n) : R(g_j) - R_n(g_j) \geq \varepsilon\big\}$$
so that $\mathbb{P}\big[(Z_1, \cdots, Z_n) \in \mathcal{Z}_j\big] \leq \delta$, and then
$$\mathbb{P}\big[(Z_1, \cdots, Z_n) \in \mathcal{Z}_1 \cup \mathcal{Z}_2\big] \leq \mathbb{P}\big[(Z_1, \cdots, Z_n) \in \mathcal{Z}_1\big] + \mathbb{P}\big[(Z_1, \cdots, Z_n) \in \mathcal{Z}_2\big] \leq 2\delta$$
so that
$$\mathbb{P}\Big[(Z_1, \cdots, Z_n) \in \bigcup_{j=1}^{\nu} \mathcal{Z}_j\Big] \leq \sum_{j=1}^{\nu} \mathbb{P}\big[(Z_1, \cdots, Z_n) \in \mathcal{Z}_j\big] \leq \nu\,\delta$$
Hence,
$$\mathbb{P}\big[\exists g \in \{g_1, \cdots, g_\nu\} : R(g) - R_n(g) \geq \varepsilon\big] \leq \nu \cdot \mathbb{P}\big[R(g) - R_n(g) \geq \varepsilon\big] \leq \nu \cdot \exp\big(-2n\varepsilon^2\big)$$
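A sketch of this union-bound argument (the true risks below are arbitrary, and for simplicity the losses of the ν models are simulated independently): the frequency of samples on which some model deviates by ε or more stays below ν exp(−2nε²).

```python
import numpy as np

rng = np.random.default_rng(4)

nu, n, eps, n_sim = 50, 500, 0.08, 5000
risks = rng.uniform(0.1, 0.4, size=nu)      # true risks R(g_j), arbitrary values

# For each simulated sample, R_n(g_j) is the mean of n Bernoulli(R(g_j)) losses
Rn = rng.binomial(n, risks, size=(n_sim, nu)) / n
any_bad = np.mean(np.any(risks - Rn >= eps, axis=1))   # some model deviates by eps
bound = nu * np.exp(-2 * n * eps**2)                   # union bound

print(f"P(exists j: R - R_n >= {eps}) ≈ {any_bad:.4f}  <=  {bound:.4f}")
```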
If δ = ν exp(−2nε²), we can write, with probability (at least) 1 − δ,
$$\forall g \in \{g_1, \cdots, g_\nu\},\quad R(g) - R_n(g) \leq \sqrt{\frac{1}{n}\big(\log\nu - \log\delta\big)}$$
Thus, we can write, for a finite set of models M = {m_1, ..., m_ν},
$$\forall m \in \{m_1, \cdots, m_\nu\},\quad R(m) \leq R_n(m) + \sqrt{\frac{1}{n}\big(\log\nu - \log\delta\big)}$$
For the worst case scenario over a finite class M with ν = |M| models,
$$\sup_{m \in \mathcal{M}_\nu}\big\{R(m) - R_n(m)\big\} \sim \sqrt{\frac{\log\nu}{n}}$$
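An illustrative experiment (arbitrary setup, not from the slides) on this worst-case rate: the expected supremum of R(m) − R_n(m) over ν models grows with ν roughly like √(log ν / n).

```python
import numpy as np

rng = np.random.default_rng(5)

n, n_sim = 1000, 2000
for nu in [2, 16, 128, 1024]:
    # nu models, each with true risk R = 0.3; R_n is the empirical version on n points
    Rn = rng.binomial(n, 0.3, size=(n_sim, nu)) / n
    sup_dev = np.mean(np.max(0.3 - Rn, axis=1))
    print(f"nu = {nu:5d}   E[sup_j (R - R_n)] ≈ {sup_dev:.4f}"
          f"   sqrt(log(nu)/n) = {np.sqrt(np.log(nu) / n):.4f}")
```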
Now, what if M is infinite?
Write Hoeffding's inequality as
$$\mathbb{P}\left(R(g) - R_n(g) \geq \sqrt{\frac{-1}{2n}\log\delta_g}\right) \leq \delta_g$$
so that, with a countable set G,
$$\mathbb{P}\left(\exists g \in \mathcal{G} : R(g) - R_n(g) \geq \sqrt{\frac{-1}{2n}\log\delta_g}\right) \leq \sum_{g \in \mathcal{G}} \delta_g$$
If δ_g = δ · µ(g), where µ is some measure on G, then with probability (at least) 1 − δ,
$$\forall g \in \mathcal{G},\quad R(g) \leq R_n(g) + \sqrt{\frac{-1}{2n}\big[\log\delta + \log\mu(g)\big]}$$
(see the previous computations with $\mu(g) = \nu^{-1}$)
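As a small numerical illustration (the geometric weights µ(g_j) = 2^(−j) are an assumed choice, not from the slides): with δ_g = δ · µ(g), each model pays a confidence width √(−(1/2n)[log δ + log µ(g)]), wider for models with smaller weight.

```python
import numpy as np

n, delta = 1000, 0.05

# mu is a weighting of a countable class G = {g_1, g_2, ...};
# here an (assumed) geometric choice mu(g_j) = 2^{-j}, which sums to 1
for j in [1, 2, 5, 10, 20]:
    mu = 2.0 ** (-j)
    width = np.sqrt(-(np.log(delta) + np.log(mu)) / (2 * n))
    print(f"g_{j:2d}:  mu = {mu:.1e}   confidence width = {width:.4f}")
```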
More generally, given a sample z = {z_1, ..., z_n}, let M_z denote the set of classifications (labelings) that can be obtained on that sample,
$$\mathcal{M}_{\boldsymbol{z}} = \big\{(m(z_1), \cdots, m(z_n)),\ m \in \mathcal{M}\big\}$$
The growth function is the maximum number of ways in which n points can be classified by the function class M,
$$G_{\mathcal{M}}(n) = \sup_{\boldsymbol{z}} \big|\mathcal{M}_{\boldsymbol{z}}\big|$$
Vapnik-Chervonenkis: with probability (at least) 1 − δ,
$$\forall m \in \mathcal{M},\quad R(m) \leq R_n(m) + 2\sqrt{\frac{2}{n}\Big[\log G_{\mathcal{M}}(2n) - \log(\delta/4)\Big]}$$
The VC (Vapnik-Chervonenkis) dimension is the largest n such that $G_{\mathcal{M}}(n) = 2^n$. It will be denoted VC(M). Observe that $G_{\mathcal{M}}(n) \leq 2^n$.
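A brute-force sketch of the growth function for one explicit (and arbitrary) class, the 1-D thresholds m_t(x) = sign(x − t): enumerating the distinct labelings of n points gives G_M(n) = n + 1, which equals 2^n only for n = 1, so VC(M) = 1.

```python
import numpy as np

def labelings_thresholds(points):
    """Distinct labelings of `points` by m_t(x) = +1 if x >= t else -1."""
    labelings = set()
    pts = np.sort(points)
    # candidate thresholds: below all points, between consecutive points, above all points
    thresholds = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    for t in thresholds:
        labelings.add(tuple(np.where(points >= t, 1, -1)))
    return labelings

rng = np.random.default_rng(6)
for n in [1, 2, 3, 5, 8]:
    points = rng.normal(size=n)
    # for this class, any n distinct points give n + 1 labelings, so |M_z| = G_M(n)
    print(f"n = {n}:  |M_z| = {len(labelings_thresholds(points))}   2^n = {2**n}")
```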
n ≤ VC(M): the growth function n ↦ G_M(n) increases exponentially, $G_{\mathcal{M}}(n) = 2^n$.
n ≥ VC(M): the growth function n ↦ G_M(n) increases at polynomial speed (Sauer's lemma),
$$G_{\mathcal{M}}(n) \leq \left(\frac{en}{VC(\mathcal{M})}\right)^{VC(\mathcal{M})}$$
Vapnik-Chervonenkis: with probability (at least) 1 − δ,
$$\forall m \in \mathcal{M},\quad R(m) \leq R_n(m) + 2\sqrt{\frac{2}{n}\left[VC(\mathcal{M})\log\frac{en}{VC(\mathcal{M})} - \log(\delta/4)\right]}$$
For the worst case scenario over an infinite class M,
$$\sup_{m \in \mathcal{M}}\big\{R(m) - R_n(m)\big\} \sim \sqrt{\frac{VC(\mathcal{M})\cdot\log n}{n}}$$
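A numerical look (illustrative only: the VC dimension d and level δ are arbitrary values) at the polynomial regime of the growth function and at the resulting deviation term 2√((2/n)[d log(en/d) + log(4/δ)]), compared with the √(VC(M) log n / n) rate.

```python
import numpy as np

d, delta = 5, 0.05   # VC dimension and confidence level (arbitrary illustrative values)

for n in [10, 100, 1_000, 10_000]:
    sauer = (np.e * n / d) ** d                                   # polynomial regime, n >= d
    vc_term = 2 * np.sqrt(2 / n * (d * np.log(np.e * n / d) + np.log(4 / delta)))
    rate = np.sqrt(d * np.log(n) / n)                             # worst-case rate above
    print(f"n = {n:6d}   (en/d)^d = {sauer:.2e}   "
          f"VC deviation term = {vc_term:.3f}   sqrt(d log(n)/n) = {rate:.3f}")
```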
To go further, see Bousquet, Boucheron & Lugosi (2005, Introduction to Statistical Learning Theory).