Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading Club)

NeurIPS2018 Reading Club@PFN
https://connpass.com/event/115476/

Published in: Technology

  1. Minimax statistical learning with Wasserstein distances
     by Jaeho Lee and Maxim Raginsky
     January 26, 2019
     Presenter: Kenta Oono @ NeurIPS 2018 Reading Club
  2. Kenta Oono (@delta2323)
     Profile
     • 2011.3: MSc. (Mathematics)
     • 2011.4-2014.10: Preferred Infrastructure (PFI)
     • 2014.10-current: Preferred Networks (PFN)
     • 2018.4-current: Ph.D. student @ U.Tokyo
     Interests
     • Mathematics
     • Bioinformatics
     • Theory of Deep Learning
  3. Summary
     What this paper does
     • Develops a distributionally robust risk minimization problem.
     • Derives the excess-risk rate $O(n^{-1/2})$, the same as in the non-robust case.
     • Applies the result to domain adaptation.
     Why I chose this paper
     • It was a spotlight talk.
     • I wanted to learn statistical learning theory, especially the minimax optimality of deep learning. The paper turned out not to be about that.
     • I wanted to learn about the Wasserstein distance.
  4. Problem Setting (Expected Risk)
     Given
     • $\mathcal{Z}$: sample space
     • $P$: (unknown) distribution over $\mathcal{Z}$
     • Dataset: $D = (z_1, \ldots, z_n) \sim P$ i.i.d.
     For a hypothesis $f : \mathcal{Z} \to \mathbb{R}$, we evaluate its expected risk:
     • Expected risk: $R(P, f) = E_{Z \sim P}[f(Z)]$
     • Hypothesis space: $\mathcal{F} \subset \{f : \mathcal{Z} \to \mathbb{R}\}$
  5. Problem Setting (Estimator)
     Goal:
     • Devise an algorithm $A : D \mapsto \hat f = \hat f(D)$.
     • We treat $D$ as a random variable, so $\hat f$ is a random variable too.
     • If $A$ is a randomized algorithm (e.g. SGD), the randomness of $\hat f(D)$ also comes from $A$.
     • Evaluate the excess risk: $R(P, \hat f) - \inf_{f \in \mathcal{F}} R(P, f)$
     Typical forms of theorems:
     • $E_{A,D}[R(P, \hat f) - \inf_{f \in \mathcal{F}} R(P, f)] = O(g(n))$
     • $R(P, \hat f) - \inf_{f \in \mathcal{F}} R(P, f) = O(g(n, \delta))$ with probability $1 - \delta$ with respect to the choice of $D$ (and $A$)
  6. Problem Setting (ERM Estimator)
     Since we cannot compute the expected risk $R$, we compute the empirical risk instead:
     $\hat R_D(f) = \frac{1}{n} \sum_{i=1}^{n} f(z_i) = R(P_n, f)$ ($P_n$: empirical distribution).
     The ERM (Empirical Risk Minimization) estimator for the hypothesis space $\mathcal{F}$ is
     $\hat f = \hat f(D) \in \arg\min_{f \in \mathcal{F}} R(P_n, f)$
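The ERM estimator can be illustrated with a minimal sketch over a finite hypothesis class. The toy data, the slope grid, and the helper names (`empirical_risk`, `erm`, `make_loss`) are assumptions made for illustration and do not come from the slides.

```python
import random

def empirical_risk(f, data):
    """R(P_n, f): the average of f over the sample."""
    return sum(f(z) for z in data) / len(data)

def erm(hypotheses, data):
    """Return the key of a hypothesis minimizing the empirical risk.

    `hypotheses` maps a label (here a slope w) to a function f: Z -> R.
    """
    return min(hypotheses, key=lambda w: empirical_risk(hypotheses[w], data))

def make_loss(w):
    """Squared loss of the linear model h_w(x) = w * x, as a function of z = (x, y)."""
    def f(z):
        x, y = z
        return (w * x - y) ** 2
    return f

# Hypothetical toy data: y = 2x + small Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

# Finite hypothesis space F indexed by the slope w.
F = {w: make_loss(w) for w in (0.0, 1.0, 2.0, 3.0)}
w_hat = erm(F, data)
print("selected slope:", w_hat)
print("empirical risk:", empirical_risk(F[w_hat], data))
```

With a finite class the minimum is always attained, which is why `erm` can return an exact minimizer; for an infinite class the argmin may have to be replaced by an approximate minimizer.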
  7. Relation
  8. Assumptions
     [Assumption statements from the paper, combined with "+" and "OR"]
     Ref. Lee and Raginsky (2018)
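  9. Example
     Supervised learning
     • $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$, $\mathcal{X} = \mathbb{R}^D$: input space, $\mathcal{Y} = \mathbb{R}$: label space
     • $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: loss function
     • $\mathcal{H} \subset \{\mathcal{X} \to \mathcal{Y}\}$: set of models
     • $\mathcal{F} = \{f_h(x, y) = \ell(h(x), y) \mid h \in \mathcal{H}\}$
     Regression
     • $\mathcal{X} = \mathbb{R}^D$, $\mathcal{Y} = \mathbb{R}$, $\ell(y, y') = (y - y')^2$
     • $\mathcal{H}$ = functions realized by a neural network with a fixed architecture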
  10. Classical Result
     Typically, we have
     $R(P, \hat f) - \inf_{f \in \mathcal{F}} R(P, f) = O_P\left(\frac{\text{complexity of } \mathcal{F}}{\sqrt{n}}\right)$
     Here "complexity of $\mathcal{F}$" is a model complexity measure (intuitively, how "large" $\mathcal{F}$ is).
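  11. Covering number
     Definition (Covering Number)
     For $\mathcal{F} \subset \mathcal{F}_0 := \{f : [-1, 1]^D \to \mathbb{R}\}$ and $\varepsilon > 0$, the (external) covering number of $\mathcal{F}$ is
     $N(\mathcal{F}, \varepsilon) := \inf\{N \in \mathbb{N} \mid \exists f_1, \ldots, f_N \in \mathcal{F}_0 \text{ s.t. } \forall f \in \mathcal{F}, \exists n \in [N] \text{ s.t. } \|f - f_n\|_\infty \le \varepsilon\}$.
     • Intuition: the minimum number of balls of radius $\varepsilon$ needed to cover $\mathcal{F}$.
     • Entropy integral: $C(\mathcal{F}) := \int_0^\infty \sqrt{\log N(\mathcal{F}, u)} \, du$.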
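To make the covering number concrete, here is a minimal sketch that greedily builds an $\varepsilon$-cover of a small finite function class in the sup norm. It is an illustration under assumptions, not the paper's construction: the sup norm is evaluated on a finite grid, the centers are taken from the class itself (an internal cover), and the names `sup_dist`, `greedy_cover`, and `linear` are hypothetical. Since an internal cover is also a valid external cover, the number of centers found is an upper bound on $N(\mathcal{F}, \varepsilon)$ for this class.

```python
def sup_dist(f, g, grid):
    """Sup-norm distance between f and g, evaluated on a finite grid."""
    return max(abs(f(x) - g(x)) for x in grid)

def greedy_cover(functions, eps, grid):
    """Greedily pick centers so that every function is within eps of a center."""
    centers = []
    for f in functions:
        # f becomes a new center only if no existing center covers it.
        if all(sup_dist(f, c, grid) > eps for c in centers):
            centers.append(f)
    return centers

def linear(w):
    """The function x -> w * x on [-1, 1]."""
    return lambda x: w * x

# Grid on [-1, 1] including both endpoints, where |w*x - v*x| is largest.
grid = [-1.0 + 2.0 * i / 20 for i in range(21)]

# Hypothetical class: linear functions with slopes 0.0, 0.1, ..., 1.0.
functions = [linear(k / 10) for k in range(11)]

centers = greedy_cover(functions, eps=0.25, grid=grid)
print("size of greedy cover:", len(centers))
```

For this class the sup-norm distance between two members equals the difference of their slopes, so the cover can be checked by hand.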
  12. Distributionally Robust Framework
     Minimize the worst-case risk over distributions close to the true distribution $P$.
     minimize $R(P, f)$
     ↓
     minimize $R_{\rho,p}(P, f) := \sup_{Q \in A_{\rho,p}(P)} R(Q, f)$
     We consider the $p$-Wasserstein distance: $A_{\rho,p}(P) = \{Q \mid W_p(P, Q) \le \rho\}$
     Applications
     • Adversarial attack: $\rho$ = noise level
     • Domain adaptation: $\rho$ = discrepancy level between the train and test distributions
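As a small illustration of the ambiguity set $A_{\rho,p}(P)$, the sketch below computes $W_p$ between two empirical distributions on the real line and checks membership in the ball. It relies on a standard fact for one-dimensional samples of equal size with ground metric $|x - y|$: the optimal coupling matches sorted values. The function names and the sample values are assumptions made for illustration.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size empirical distributions on the real line.

    In one dimension the optimal coupling pairs the i-th smallest point of
    xs with the i-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)

def in_ambiguity_set(xs, ys, rho, p=1):
    """Check whether the empirical distribution of ys lies in A_{rho,p} around that of xs."""
    return wasserstein_1d(xs, ys, p) <= rho

# Hypothetical train sample and a test sample shifted by 0.5.
train = [0.0, 1.0, 2.0]
test = [2.5, 0.5, 1.5]

print("W_1:", wasserstein_1d(train, test, p=1))
print("W_2:", wasserstein_1d(train, test, p=2))
print("within rho = 0.5:", in_ambiguity_set(train, test, rho=0.5))
print("within rho = 0.4:", in_ambiguity_set(train, test, rho=0.4))
```

In the domain adaptation reading, a pure shift of the inputs by $\delta$ moves the distribution by exactly $\delta$ in $W_p$, so $\rho$ would have to be at least as large as the shift for the test distribution to be covered.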
  13. Estimator
     Correspondingly, we change the estimator to
     $\hat f \in \arg\min_{f \in \mathcal{F}} R_{\rho,p}(P_n, f)$
     We want to evaluate
     $R_{\rho,p}(P, \hat f) - \inf_{f \in \mathcal{F}} R_{\rho,p}(P, f)$
  14. Main Theorems
     Same excess-risk rate as the non-robust setting.
     Ref. Lee and Raginsky (2018)
  15. Strategy
     From the authors' slides
     Ref: https://nips.cc/media/Slides/nips/2018/517cd(05-09-45)-05-10-20-12649-Minimax_Statist.pdf
  16. Key Lemmas
     Ref. Lee and Raginsky (2018)
  17. Why are these lemmas important?
     (Complexity of $\Psi_{\Lambda,\mathcal{F}}$) ≈ (Complexity of $\mathcal{F}$) × (Complexity of $\Lambda$)
  18. Impression
     • The dual form of the risk ($R_\rho(P, f) = \inf_{\lambda \ge 0} E[\psi_{\lambda,f}(Z)]$) may be useful in its own right.
     • Assumption 4 is mysterious (an incredibly local property of $\mathcal{F}$).
     • Is there special structure in the $p = 1$ Wasserstein distance?
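The dual form above is stated only in abbreviated form. A minimal sketch follows, assuming it abbreviates the standard Wasserstein duality $R_{\rho,p}(P, f) = \inf_{\lambda \ge 0} \{\lambda \rho^p + E_P[\sup_{z'} (f(z') - \lambda\, d^p(Z, z'))]\}$, so that $\psi_{\lambda,f}$ would absorb the $\lambda \rho^p$ term. The finite sample space, the grid over $\lambda$, the toy hypothesis $f(z) = z^2$, and the name `robust_risk_dual` are all assumptions made for illustration. Because the infimum is taken over a grid of $\lambda$ values only, the result is an upper bound on the dual value on that finite space.

```python
def robust_risk_dual(f, sample, space, rho, p=1, lambdas=None):
    """Upper bound on the robust risk via the dual form, on a finite 1-D space.

    For each lambda, computes lambda * rho**p plus the sample mean of
    phi(z) = max over z' in `space` of f(z') - lambda * |z - z'|**p,
    then returns the smallest value found over the lambda grid.
    """
    if lambdas is None:
        lambdas = [0.05 * k for k in range(201)]  # grid on [0, 10]
    best = float("inf")
    for lam in lambdas:
        phi_mean = sum(
            max(f(zp) - lam * abs(z - zp) ** p for zp in space) for z in sample
        ) / len(sample)
        best = min(best, lam * rho ** p + phi_mean)
    return best

def f(z):
    """Hypothetical hypothesis: f(z) = z^2."""
    return z * z

# Finite sample space: a grid on [-1, 1] that contains every sample point.
space = [k / 10 for k in range(-10, 11)]
sample = [-0.5, 0.0, 0.5]

empirical = sum(f(z) for z in sample) / len(sample)
print("empirical risk     :", empirical)
print("robust, rho = 0.0  :", robust_risk_dual(f, sample, space, rho=0.0))
print("robust, rho = 0.1  :", robust_risk_dual(f, sample, space, rho=0.1))
print("robust, rho = 0.2  :", robust_risk_dual(f, sample, space, rho=0.2))
```

Two sanity checks follow from the formula itself: taking $z' = z$ shows the dual value is never below the empirical risk, and taking $\lambda = 0$ shows it is never above the maximum of $f$ over the space.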
