My presentation of one of the ICLR 2017 Best Paper Award winners, by Google Brain (arxiv.org/abs/1611.03530). I believe generalization deserves more attention as we move deeper into the over-parameterized regime.
2. Quick Facts
• Google Brain
• ICLR 2017 - Best Paper Award
• Love it / Hate it
• Interesting Experiments
• Questioning the Traditional Explanations
• A “This is also not useful!” paper
14. Implications
• Rademacher complexity and VC-dimension:
The networks fit the training set with random labels perfectly, so these
complexity measures are essentially maximal and give only vacuous bounds.
• Uniform stability:
Measures how sensitive the algorithm is to replacement of a single example;
it ignores the data and the labels, so it cannot explain the gap either.
(from paper)
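To make the randomization test behind this point concrete, here is a minimal sketch (my own illustration, assuming PyTorch and synthetic data in place of CIFAR10): an over-parameterized MLP trained on purely random labels still reaches ~100% training accuracy.

```python
# Sketch of the randomization test: train on *random* labels and watch training
# accuracy approach 100%. Synthetic data stands in for CIFAR10; the model sizes
# and optimizer settings are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 256, 10                       # samples, input dim, classes
X = torch.randn(n, d)
y_random = torch.randint(0, k, (n,))         # labels carry no signal at all

model = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):                     # full-batch training
    opt.zero_grad()
    loss = loss_fn(model(X), y_random)
    loss.backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y_random).float().mean()
print(f"training accuracy on random labels: {train_acc.item():.3f}")  # typically ~1.0
```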
17. Explicit Regularizations
• “The original hypothesis space is too large to generalize,
so confine learning to a subset of the hypothesis space
with manageable complexity.”
• Data augmentation
• Weight decay
• Dropout
Can this be the reason why the networks generalize well?
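For reference, these three regularizers correspond to the usual training knobs; a minimal sketch of how they are switched on (my own illustration, assuming PyTorch/torchvision, not the paper's Inception setup):

```python
# The three explicit regularizers from this slide, expressed as standard PyTorch knobs.
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation: random crops / flips applied to each training image.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Dropout: inserted between layers, active only in model.train() mode.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

# Weight decay: an L2 penalty folded into the optimizer's update.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```

The paper's answer to the question above: turning all of these off still leaves networks that generalize reasonably well, so explicit regularization helps but is not the fundamental reason.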
20. Implicit Regularizations
• Early stopping (helps on ImageNet but not on CIFAR10)
• Batch normalization: improves accuracy by ~3-4%
(see Table 2 in the appendix for details)
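For reference, a minimal self-contained early-stopping loop (my own sketch on synthetic data, assuming PyTorch; not the paper's ImageNet/CIFAR10 setup):

```python
# Early stopping: keep the weights from the epoch with the best validation loss
# and stop once validation stops improving. Sizes, patience, and learning rate
# are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(600, 20)
w_true = torch.randn(20, 1)
y = (X @ w_true + 0.5 * torch.randn(600, 1) > 0).long().squeeze()  # noisy labels
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

best_val, bad, patience, best_state = float("inf"), 0, 20, None
for epoch in range(500):
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:
            break            # validation stopped improving: stop before overfitting
model.load_state_dict(best_state)
print(f"stopped at epoch {epoch}, best val loss {best_val:.3f}")
```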
23. Finite Sample Expressivity
• A two-layer ReLU network with 2n + d weights can express any labeling of any
n points in d dimensions; a depth-k network with O(n/k) width per layer also
does the job.
(see proof in the appendix for details)
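To see the flavor of this construction numerically, here is a minimal numpy sketch of a two-layer ReLU network f(x) = sum_j w_j * relu(<a, x> - b_j) that exactly memorizes n arbitrary labels (the projection a and thresholds b are my illustrative choices; the formal proof is in the appendix):

```python
# Two-layer ReLU memorization: d weights for the projection a, n thresholds b,
# n output weights w -- 2n + d weights in total -- fit n arbitrary labels exactly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))            # n arbitrary (distinct) points
y = rng.standard_normal(n)                 # arbitrary real-valued labels

a = rng.standard_normal(d)                 # random projection; <a, x_i> distinct w.p. 1
z = X @ a
order = np.argsort(z)
X, y, z = X[order], y[order], z[order]     # sort points by their projection

b = np.empty(n)
b[0] = z[0] - 1.0                          # b_1 below the smallest projection
b[1:] = (z[:-1] + z[1:]) / 2               # b_j between consecutive projections

A = np.maximum(z[:, None] - b[None, :], 0.0)   # A[i, j] = relu(z_i - b_j)
# A is lower-triangular with a positive diagonal, so the linear system is solvable.
w = np.linalg.solve(A, y)

pred = np.maximum(X @ a[:, None] - b[None, :], 0.0) @ w
print("max fitting error:", np.abs(pred - y).max())   # ~1e-12: exact memorization
```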
24. An Appeal to Linear Models
• “Is there a way to determine when one global minimum
will generalize whereas another will not?”
• Common way: check the curvature of the loss at the minimum
• In the linear case: the Hessian of the loss is identical (and degenerate) at
every global minimum, so curvature cannot distinguish them. NOT USEFUL!
• A good-old-friend: SGD!
• SGD acts as an implicit regularizer: starting from zero, its solution lies in
the span of the data points, which is exactly the “kernel trick” form.
• It often converges to the minimum-norm solution, which offers some guidance.
• But minimum norm is NOT -totally- predictive of generalization.
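A small numerical check of the last two points (my own numpy sketch; sizes and learning rate are illustrative): in an over-parameterized linear least-squares problem, SGD started from zero converges to the minimum-norm interpolating solution w = X^T (X X^T)^{-1} y, which is exactly the kernel-form solution.

```python
# Over-parameterized linear regression (n < d): SGD initialized at zero stays in
# the span of the data rows and converges to the minimum-l2-norm solution that
# fits the labels exactly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 200                              # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                             # SGD from zero on the squared loss
lr = 1e-3
for step in range(20000):
    i = rng.integers(n)                     # pick one example at random
    grad = (X[i] @ w - y[i]) * X[i]
    w -= lr * grad

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # kernel-form / min-norm solution
print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```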
25. Conclusion
• Effective capacity of successful NNs is large enough to
shatter the training data.
“Rich enough to memorize the training data”.
• A conceptual challenge to statistical learning theory.
• Model complexity measures struggle to explain the generalization ability of
large ANNs.
• Optimization is easy for large neural networks.
• The sources of easy optimization and of good generalization are different.
• We have yet to discover a precise formal measure
under which these enormous models are simple.