Random Matrix Theory (RMT) is applied to analyze the weight matrices
of Deep Neural Networks (DNNs), including production quality,
pre-trained models, and smaller models trained from scratch. Empirical
and theoretical results indicate that the DNN training process itself
implements a form of self-regularization, evident in the empirical
spectral density (ESD) of DNN layer matrices. To understand this, we
provide a phenomenology to identify 5+1 Phases of Training,
corresponding to increasing amounts of implicit self-regularization.
For smaller and/or older DNNs, this implicit self-regularization is
like traditional Tikhonov regularization, with a "size scale"
separating signal from noise. For state-of-the-art DNNs, however, we
identify a novel form of heavy-tailed self-regularization, similar to
the self-organization seen in the statistical physics of disordered systems.
To that end, building on the statistical mechanics of generalization,
and applying recent results from RMT, we derive a new VC-like
complexity metric that resembles the familiar product norms, but is
suitable for studying average-case generalization behavior in real
systems. We then demonstrate its effectiveness by testing how well
this new metric correlates with trends in the reported test accuracies
across models for over 450 pretrained DNNs covering a range of data
sets and architectures.
3. Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years of experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
4. Who Are We?
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 program on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
5. Research: Implicit Self-Regularization in Deep Learning
• Implicit Self-Regularization in Deep Neural Networks: Evidence from
Random Matrix Theory and Implications for Learning
(arXiv, long version)
• Traditional and Heavy Tailed Self Regularization in Neural Network Models
(arXiv, short version, accepted ICML 2019)
• Rethinking generalization requires revisiting old ideas: statistical mechanics
approaches and complex learning behavior
(arXiv, long version, submitted JMLR 2019)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large
Pre-Trained Deep Neural Networks
(arXiv, accepted ICML 2019)
6. Motivations: What is Regularization in DNNs?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper)
Large models can overfit on randomly labeled data, and regularization cannot prevent this.
7. Motivations: Remembering Regularization
Statistical Mechanics (1990s): Overtraining -> Spin Glass Phase
Binary classifier with N random labelings: 2^N over-trained solutions, locally convex,
with very high barriers, all unable to generalize.
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
8. Motivations: still, what is Regularization?
Every adjustable knob and switch is called regularization: Dropout, Batch Size, Noisify Data, …
https://arxiv.org/pdf/1710.10686.pdf
9. Motivations: what is Regularization?
Ridge Regression / Tikhonov-Phillips Regularization: the familiar optimization problem
Moore-Penrose pseudoinverse (1955); regularize (Phillips, 1962)
Soften the rank of X, focus on the large eigenvalues
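For concreteness, a standard form of the Tikhonov-Phillips (ridge) problem and its solution; the notation (data matrix X, targets y, regularizer α) is generic and not transcribed from the slide:

\min_{\mathbf{w}} \;\|\mathbf{X}\mathbf{w}-\mathbf{y}\|_{2}^{2} \;+\; \alpha\,\|\mathbf{w}\|_{2}^{2},
\qquad
\mathbf{w}^{*} \;=\; \bigl(\mathbf{X}^{T}\mathbf{X} + \alpha\,\mathbf{I}\bigr)^{-1}\mathbf{X}^{T}\mathbf{y}

The α I term softens the contribution of the small eigenvalues of X^T X, which is the sense in which the rank of X is "softened".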
10. Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining?
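Schematically (a sketch following the companion paper's notation, not an equation visible in this transcript), the energy landscape being minimized is

\min_{\{\mathbf{W}_{l}\}} \;\sum_{i=1}^{n} \mathcal{L}\bigl(y_{i},\, E_{DNN}(\mathbf{x}_{i})\bigr),
\qquad
E_{DNN}(\mathbf{x}) \;=\; h_{L}\bigl(\mathbf{W}_{L}\, h_{L-1}(\mathbf{W}_{L-1} \cdots h_{1}(\mathbf{W}_{1}\mathbf{x}))\bigr)

so the landscape is determined by the layer weight matrices W_l.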
11. Motivations: how we study Regularization
We can characterize the learning process by studying W_L:
the Energy Landscape is determined by the layer weights W_L,
i.e. by the eigenvalues of the correlation matrix
X = (1/N) W_L^T W_L
as in traditional regularization.
13. Empirical Spectral Density: SVD Analysis
Equivalently, we can form the correlation matrix X = A^T A
and compute the Empirical Spectral Density (ESD), i.e. a histogram of its eigenvalues.
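A minimal, self-contained sketch of this SVD route to the ESD (the random W below is a stand-in for a real layer matrix, and the 1/N normalization follows the X = (1/N) W^T W convention above):

import numpy as np
import matplotlib.pyplot as plt

# W: an (N x M) layer weight matrix; here a random stand-in for illustration
N, M = 1000, 250
W = np.random.randn(N, M)

# singular values of W give the eigenvalues of X = (1/N) W^T W:  lambda_i = sigma_i^2 / N
sv = np.linalg.svd(W, compute_uv=False)
evals = sv ** 2 / N

# the ESD is just the normalized histogram of these eigenvalues
plt.hist(evals, bins=100, density=True)
plt.xlabel("eigenvalue of X")
plt.ylabel("empirical spectral density")
plt.show()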
14. Heavy Tails in DNNs: ESD of Toy MLP
Trained a toy 3-layer MLP on CIFAR10 and monitored the ESD of the layer
weight matrices (Q = 1): the ESD shows a bulk plus a few spikes.
15. ESD of DNNs: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# layer weight matrix (Keras stores [weights, biases])
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
# correlation matrix and its eigenvalues
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
# the ESD is the normalized histogram of the eigenvalues
plt.hist(evals, bins=100, density=True)
16. Random Matrix Theory: detailed insight into W_L
DNN training induces a breakdown of the Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization:
Gaussian random matrix -> Bulk+Spikes -> Heavy Tailed
(small, older NNs -> large, modern DNNs, and/or small batch sizes)
17. Random Matrix Theory: Marchenko-Pastur
RMT says that if W is a simple random Gaussian matrix, then the ESD will have a very simple, known form.
Its shape depends on Q = N/M (and the variance ~ 1).
The eigenvalues are tightly bounded, with very crisp edges, plus Tracy-Widom fluctuations;
a few spikes may appear.
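For reference, a standard statement of the Marchenko-Pastur density, written in the Q = N/M, variance σ² convention used above (textbook RMT, not transcribed from the slide):

\rho_{MP}(\lambda) \;=\; \frac{Q}{2\pi\sigma^{2}}\,
\frac{\sqrt{(\lambda^{+}-\lambda)(\lambda-\lambda^{-})}}{\lambda},
\qquad
\lambda^{\pm} \;=\; \sigma^{2}\Bigl(1 \pm \tfrac{1}{\sqrt{Q}}\Bigr)^{2}

so the bulk is confined to [λ⁻, λ⁺], with crisp (Tracy-Widom) edges.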
18. Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails (i.e. it is all spikes; the bulk vanishes).
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy tailed distribution.
Nearly all pre-trained DNNs display heavy tails … as we shall soon see.
20. Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
27. Eigenvalue Analysis: Rank Collapse?
Modern DNNs: the soft rank collapses, but they do not lose hard rank (Q > 1).
λ_min > 0: all of the smallest eigenvalues remain above zero, within a numerical (Numerical Recipes) threshold.
λ_min = 0: (hard) rank collapse (Q > 1), which signifies over-regularization.
28. Bulk+Spikes: Small Models
Smaller, older models can be described perturbatively with RMT:
a rank-1 perturbation of a random matrix gives a perturbative correction, i.e. a bulk plus spikes.
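Schematically (a standard spiked-model form, stated here as context rather than copied from the slide):

\mathbf{W} \;\simeq\; \mathbf{W}^{\mathrm{rand}} \;+\; \Delta^{\mathrm{rank\text{-}1}},
\qquad
\lambda_{\mathrm{spike}} \;>\; \lambda^{+}

i.e. the low-rank correction pulls a few eigenvalues out above the MP bulk edge λ⁺.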
29. Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
a softer rank, with eigenvalues above a simple scale threshold, and the spikes carry most of the information.
30. Heavy Tailed RMT: Scale Free ESD
All large, well trained, modern DNNs exhibit heavy tailed (scale free) self-regularization:
AlexNet, VGG, ResNet, Inception, DenseNet, …
33. Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality:
~500 matrices across ~50 architectures, Linear layers and Conv2D feature maps;
80-90% of the fitted power law exponents are < 4.
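Exponents like these are typically obtained by fitting a power law to the tail of each layer's ESD. A minimal sketch using the open-source powerlaw package; the Pareto stand-in data and all names here are illustrative assumptions, not the presenters' fitting code:

import numpy as np
import powerlaw  # power-law fitting package (pip install powerlaw)

# evals: one layer's ESD; here a heavy-tailed random stand-in for illustration
evals = np.random.pareto(a=2.5, size=4000) + 1.0

fit = powerlaw.Fit(evals, verbose=False)  # fits rho(lambda) ~ lambda^(-alpha) to the tail
print("alpha =", fit.power_law.alpha)     # per-layer heavy-tailed exponent
print("xmin  =", fit.power_law.xmin)      # where the fitted tail begins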
34. Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse.
38. Heavy Tailed Metrics: GPT vs GPT2
The original GPT is poorly trained on purpose; GPT2 is well trained
40. Heavy Tailed RMT: measures of Soft Rank
The power law exponent α is like a measure of soft rank:
for random, very Heavy Tailed (and finite-size fat tailed) matrices, it behaves like the Stable Rank.
[Figure: VGG soft rank (MP) vs. reported test accuracies]
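For reference, the standard definition of the stable (numerical) rank mentioned above (a textbook definition, not transcribed from the slide), written in terms of the eigenvalues of X:

\mathcal{R}_{s}(\mathbf{W}) \;=\; \frac{\|\mathbf{W}\|_{F}^{2}}{\|\mathbf{W}\|_{2}^{2}}
\;=\; \frac{\sum_{i}\lambda_{i}}{\lambda_{\max}}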
41. Heavy Tailed RMT: MP Soft Rank
We use RMT to define a new measure of soft rank,
suitable for poorly correlated systems, and scale invariant.
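One natural way to write such an MP-based soft rank (an assumption here, following the bulk-edge construction in the companion paper rather than a formula visible in this transcript) is the ratio of the fitted MP bulk edge to the largest eigenvalue:

\mathcal{R}_{mp}(\mathbf{W}) \;=\; \frac{\lambda^{+}_{MP}}{\lambda_{\max}}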
43. Universality: Capacity metrics
Universality suggests the power law exponent α would make a good, universal DNN capacity metric.
Imagine a weighted average of the per-layer exponents, where the weights b ~ |W| reflect the scale of each matrix:
an unsupervised, VC-like, data-dependent complexity metric for
predicting trends in average-case generalization accuracy in DNNs.
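A minimal sketch of such a weighted average; the particular weights (the log of each layer's largest eigenvalue) follow the related "Heavy-Tailed Universality" paper and are an assumption about what the slide shows:

\hat{\alpha} \;=\; \frac{1}{L}\sum_{l=1}^{L} b_{l}\,\alpha_{l},
\qquad
b_{l} \;=\; \log \lambda^{\max}_{l}

so layers with larger scale (larger top eigenvalue) contribute more to the average exponent.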
44. DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs.
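Schematically, a generic form of such product-norm capacity metrics (e.g. with Frobenius norms; a standard form, not transcribed from the slide):

\mathcal{C} \;\sim\; \prod_{l=1}^{L}\|\mathbf{W}_{l}\|,
\qquad
\log \mathcal{C} \;\sim\; \sum_{l=1}^{L}\log\|\mathbf{W}_{l}\|_{F}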
45. Predicting test accuracies: Product norms
Many product norms (e.g. the Frobenius Norm) correlate with reported test accuracies.
46. Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps.
The weights compensate for the different sizes and scales of the weight matrices and feature maps.
For Conv2D layers we extract the k^2 pre-activation maps of shape N x M, and rescale them.
47. Predicting test accuracies: Product norms
Empirical product norms perform pretty well: Frobenius Norm, Spectral Norm, Weighted Alpha
… and ours (Weighted Alpha) is best :)
48. Predicting test accuracies: Soft Rank
Measures of Soft Rank can be anti-correlated! Alpha, Stable Rank, MP Soft Rank
… and ours is worst :P
(note: these 3 are not always so similar)
49. Predicting test accuracies: Heavy tailed norms
The heavy tailed metrics perform best: Weighted Alpha and the Alpha (Schatten) Norm.
52. Predicting test accuracies: 100 pretrained models
The heavy tailed metrics perform best,
on an open source sandbox of nearly 500 pretrained CV models (https://github.com/osmr/imgclsmob),
picking >= 5 models per regression.
54. Open source tool: weightwatcher
pip install weightwatcher
…
import weightwatcher as ww

# analyze all layer weight matrices of the model
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()

All results can be reproduced using the python weightwatcher tool,
a python tool to analyze Fat Tails in Deep Neural Networks:
https://github.com/CalculatedContent/WeightWatcher
56. Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engel & Van den Broeck (2001)
Generalization error ~ phase space volume
Average error ~ overlap between the Teacher (T) and the Student (J)
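For the classic continuous perceptron this relation has a standard closed form (a textbook result, stated here as context rather than transcribed from the slide): the error is a monotone function of the Teacher/Student overlap,

R \;=\; \frac{1}{N}\,\mathbf{J}\cdot\mathbf{T},
\qquad
\epsilon_{\mathrm{gen}} \;=\; \frac{1}{\pi}\arccos(R)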
57. Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engel & Van den Broeck (2001)
Standard approach:
• Teacher (T) and Student (J) are random Perceptron vectors
• Treat the data as an external random Gaussian field
• Apply Hubbard-Stratonovich to get a mean-field result
• Assume continuous or discrete (Ising) J
• Solve for the average overlap / generalization error as a function of the load (# data points / # parameters)
58. New Set Up: Matrix-generalized Student-Teacher
Continuous perceptron: uninteresting
Ising perceptron: replica theory shows phase behavior, entropy collapse, etc.
59. New Set Up: Matrix-generalized Student-Teacher
“Towards a new theory…” Martin & Mahoney (in preparation)
Real DNN matrices are N x M, strongly correlated, with heavy tailed correlation matrices.
Solve for the total integrated phase-space volume.
60. New approach: HCIZ Matrix Integrals
Write the overlap as a matrix integral: fix a Teacher; the integral is now over all random Students J that overlap with T.
Use the following result from RMT:
“Asymptotics of HCIZ integrals …” Tanaka (2008)
61. RMT: Annealed vs Quenched averages
“A First Course in Random Matrix Theory”, Potters and Bouchaud (2020)
We imagine averaging over all (random) student DNNs with correlations that look like the teacher DNN.
Annealed averages are good outside the spin-glass phases, i.e. where the system is trained well.
62. New interpretation: HCIZ Matrix Integrals
The generating functional is expressed via the R-Transform (the inverse Green's function, obtained via a contour integral),
in terms of the Teacher's eigenvalues and the Student's (free) cumulants.
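Schematically, following Tanaka (2008), the large-N asymptotics of such an HCIZ integral take the form of the Student's integrated R-transform evaluated at the Teacher's eigenvalues (a sketch of the standard result, with notation that may differ from the slide):

\frac{1}{N}\log \mathbb{E}_{J}\!\left[e^{\,N\,\mathrm{Tr}(\cdots)}\right]
\;\approx\;
\sum_{i}\int_{0}^{\lambda_{i}^{T}} R_{J}(z)\,dz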
63. Some basic RMT: Green's functions
The Green's function is the Stieltjes transform of the eigenvalue distribution,
defined from the empirical spectral density (the average eigenvalue density) as follows.
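A standard statement (textbook RMT, not transcribed from the slide), using the ESD ρ(λ) of X:

\rho(\lambda) \;=\; \frac{1}{N}\sum_{i}\delta(\lambda-\lambda_{i}),
\qquad
G(z) \;=\; \int \frac{\rho(\lambda)}{z-\lambda}\,d\lambda

i.e. G(z) is the normalized trace of the resolvent (zI - X)^{-1}.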
64. Some basic RMT: Moment generating functions
The Green's function has poles at the actual eigenvalues,
but it is analytic in the complex plane, up and away from the real axis.
Expanding it in a series around z = ∞ gives a moment generating function.
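The standard expansion (again textbook RMT, not from the slide):

G(z) \;=\; \sum_{k\ge 0}\frac{m_{k}}{z^{k+1}},
\qquad
m_{k} \;=\; \int \lambda^{k}\rho(\lambda)\,d\lambda

so the coefficients are the moments of the ESD, i.e. the normalized traces of X^k.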
65. Some basic RMT: R-Transforms
The free-cumulant-generating function (the R-transform) is related to the Green's function as below,
giving a (similar) moment generating function,
which takes a simple form for both Gaussian random and very Heavy Tailed (Levy) random matrices.
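The standard relation and expansion in free cumulants c_k (textbook form, not transcribed from the slide):

R\bigl(G(z)\bigr) \;+\; \frac{1}{G(z)} \;=\; z,
\qquad
R(z) \;=\; \sum_{k\ge 1} c_{k}\, z^{\,k-1}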
66. Results: Gaussian Random Weight Matrices
“Random Matrix Theory (book)” Bouchaud and Potters (2020)
Recover the Frobenius Norm (squared) as the metric
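A rough gloss on why the Frobenius norm appears (my sketch under the Gaussian/MP assumption, not a derivation from the slide): to leading order the integrated R-transform is linear in λ, so the sum over eigenvalues reduces to a trace,

\sum_{i}\int_{0}^{\lambda_{i}} R(z)\,dz \;\approx\; \sum_{i}\lambda_{i}
\;=\; \mathrm{Tr}\,\mathbf{X} \;\propto\; \|\mathbf{W}\|_{F}^{2}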
67. Results: (very) Heavy Tailed Weight Matrices
“Heavy-tailed random matrices”, Burda and Jurkiewicz (2009)
Recover a Schatten Norm, in terms of the Heavy Tailed exponent.
68. Application to: Heavy Tailed Weight Matrices
Some reasonable approximations give the weighted alpha metric
Q.E.D.