calculation | consulting
heavy tailed self-regularization in deep neural networks

charles@calculationconsulting.com
Stanford ICME, Jan 23, 2020

Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 program on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Research: Implicit Self-Regularization in Deep Learning
• Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (arXiv, long version)
• Traditional and Heavy Tailed Self Regularization in Neural Network Models (arXiv, short version, accepted ICML 2019)
• Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior (arXiv, long version, submitted JMLR 2019)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (arXiv, accepted ICML 2019)
Motivations: What is Regularization in DNNs ?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper)
Large models overfit on randomly labeled data; regularization cannot prevent this
Motivations: Remembering Regularization
Statistical Mechanics (1990s): (this) Overtraining -> Spin Glass Phase
Binary Classifier with N Random Labelings:
2^N over-trained solutions: locally convex, very high barriers, all unable to generalize
(see: Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior)
Motivations: still, what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout, Batch Size, Noisify Data, …
Motivations: what is Regularization ?
Moore-Penrose pseudoinverse (1955); regularize (Phillips, 1962): the familiar optimization problem
Ridge Regression / Tikhonov-Phillips Regularization: soften the rank of X, focus on the large eigenvalues
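As a sketch, in the standard Tikhonov form (notation ours, not from the slide):

$$\hat{x} = \arg\min_x \;\|Ax - b\|_2^2 + \alpha \|x\|_2^2, \qquad \hat{x} = (A^T A + \alpha I)^{-1} A^T b$$

Adding $\alpha$ to every eigenvalue of $A^T A$ softens the rank: small eigenvalues are damped and the solution is dominated by the large ones.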
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining?
Motivations: how we study Regularization
we can characterize the learning process by studying the layer weight matrices $W_L$
the Energy Landscape is determined by the layer weights $W_L$,
i.e. the eigenvalues of the correlation matrix
$X = \frac{1}{N} W_L^T W_L$
as in traditional regularization
Random Matrix Theory: Universality Classes for DNN Weight Matrices
Empirical Spectral Density: SVD Analysis
Equivalently, we can form the correlation matrix $X = A^T A$
and compute the Empirical Spectral Density (ESD), i.e. a histogram of the eigenvalues
Heavy Tails in DNNs: ESD of Toy MLP
Trained a toy 3-layer MLP on CIFAR10 and monitored the ESD of the layer weight matrices (Q=1)
(figure: MP-like Bulk plus Spikes)
ESD of DNNs: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
plt.hist(evals, bins=100, density=True)
Random Matrix Theory: detailed insight into $W_L$
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian random matrix or Bulk + Spikes: small, older NNs
Heavy Tailed: large, modern DNNs, and/or small batch sizes
Random Matrix Theory: Marchenko-Pastur
RMT says if W is a simple random Gaussian matrix,
then the ESD will have a very simple, known form
Shape depends on Q = N/M (and variance ~ 1)
Eigenvalues tightly bounded: very crisp edges, plus Tracy-Widom fluctuations
a few spikes may appear
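A minimal sketch (not from the talk; standard Marchenko-Pastur formulas assumed): sample a random Gaussian W, histogram the ESD of $X = W^T W$, and overlay the MP density for Q = N/M.

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 500                      # Q = N/M = 2
W = np.random.normal(0, 1.0/np.sqrt(N), size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W)   # ESD of X = W^T W (variance ~ 1)

Q = N / M
lam_minus = (1 - np.sqrt(1/Q))**2     # MP bulk edges: very crisp
lam_plus  = (1 + np.sqrt(1/Q))**2
lam = np.linspace(lam_minus + 1e-6, lam_plus, 400)
rho = Q * np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2*np.pi*lam)

plt.hist(evals, bins=100, density=True, alpha=0.5, label="ESD")
plt.plot(lam, rho, label="Marchenko-Pastur")
plt.legend(); plt.show()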
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. it's all spikes, the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W is drawn
from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails … as we shall soon see
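A minimal sketch of the contrast (not from the talk): a matrix with Pareto-distributed entries yields an ESD with a power-law tail and no sharp MP bulk edge.

import numpy as np
import matplotlib.pyplot as plt

N, M, mu = 1000, 500, 2.5             # mu: tail exponent of the entries
signs = np.random.choice([-1.0, 1.0], size=(N, M))
W = np.random.pareto(mu, size=(N, M)) * signs
evals = np.linalg.eigvalsh(W.T @ W / N)

plt.hist(np.log10(evals[evals > 1e-12]), bins=100, density=True)
plt.xlabel("log10(eigenvalue)")       # heavy tail: eigenvalues spread over decades
plt.show()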
Self-Regularization: 5+1 Phases of Training
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Experiments on pre-trained DNNs: Traditional and Heavy Tailed Implicit Self-Regularization
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D MaxPool Conv2D MaxPool FC FC
RMT: LeNet5
LeNet5 resembles the MP Bulk + Spikes
Conv2D MaxPool Conv2D MaxPool FC FC
softrank = 10%
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
(panels: FC1 and FC2 ESDs, zoomed in)
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
(layer W226)
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses, but they do not lose hard rank (Q > 1)
λ_min = 0: (hard) rank collapse, signifies over-regularization
λ_min > 0: all smallest eigenvalues > 0, within the numerical threshold (Numerical Recipes)
Bulk+Spikes: Small Models
Smaller, older models can be described perturbatively with RMT
Rank-1 perturbation, perturbative correction (figure: Bulk plus Spikes)
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
softer rank; eigenvalues above a simple scale threshold (the spikes) carry most of the information
Heavy Tailed RMT: Scale Free ESD
AlexNet, VGG, ResNet, Inception, DenseNet, …
All large, well trained, modern DNNs exhibit heavy tailed (scale free) self-regularization
Universality of Power Law Exponents: 10,000 Weight Matrices
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality:
500 matrices, ~50 architectures, Linear layers & Conv2D feature maps
80-90% have power law exponent α < 4
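As a sketch of how such exponents can be estimated (assumption: a Clauset-Shalizi-Newman tail fit, as implemented in the powerlaw package, which is what the weightwatcher tool builds on):

import numpy as np
import powerlaw   # pip install powerlaw

W = np.random.pareto(2.5, size=(1000, 500))      # stand-in for a trained layer
evals = np.linalg.eigvalsh(W.T @ W / 1000.0)

fit = powerlaw.Fit(evals[evals > 0])             # fit the ESD tail
print(fit.power_law.alpha, fit.power_law.xmin)   # exponent alpha, lower cutoff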
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse
Last Year's Prediction: BERT
The original pretrained BERT model has many outliers with α > 5-6
Current Results: BERT vs RoBERTa
The Robust BERT model has fewer large outliers
Current Results: BERT vs DistilBERT
The distilled BERT model has fewer large outliers
Heavy Tailed Metrics: GPT vs GPT2
The original GPT is poorly trained on purpose; GPT2 is well trained
Correlation Flow: CV Models
We can study correlation flow by looking at α vs. depth
(panels: VGG, ResNet, DenseNet)
Heavy Tailed RMT: measures of Soft Rank
The power law exponent α is like a measure of soft rank
For random, very Heavy Tailed (& finite size fat tailed) matrices, α is like the Stable Rank
Heavy Tailed RMT: MP Soft Rank
We use RMT to define a new measure of soft rank
Suitable for poorly correlated systems; scale invariant
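A minimal sketch of both soft-rank measures (assumptions: the usual stable rank, and MP soft rank taken as the ratio of the MP bulk edge to the largest eigenvalue):

import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a scale-invariant soft rank
    svals = np.linalg.svd(W, compute_uv=False)
    return np.sum(svals**2) / svals[0]**2

def mp_soft_rank(W):
    # lambda_plus / lambda_max, with the MP bulk edge for Q = N/M
    N, M = W.shape
    evals = np.linalg.eigvalsh(W.T @ W / N)
    lam_plus = (1 + np.sqrt(M / N))**2
    return lam_plus / evals[-1]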
Predicting Test / Generalization Accuracies:
Empirical results for pretrained DNNs
Universality: Capacity metrics
Universality suggests the power law exponent α would make a good, universal DNN capacity metric
Imagine a weighted average of the layer exponents, where the weights b_l reflect the scale |W_l| of each matrix
An unsupervised, VC-like, data-dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
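A minimal sketch of the weighted average (assuming, as in the authors' alpha-hat metric, that each weight b_l is the log of that layer's largest eigenvalue):

import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    # average of alpha_l weighted by b_l = log(lambda_max_l), the layer scale
    alphas = np.asarray(alphas, dtype=float)
    b = np.log(np.asarray(lambda_maxes, dtype=float))
    return float(np.sum(b * alphas) / len(alphas))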
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
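Schematically, in the standard form (notation ours): the capacity is a product over layers, so its log is a sum,

$$\mathcal{C} \sim \prod_{l=1}^{L} \|W_l\| \qquad\Rightarrow\qquad \log \mathcal{C} \sim \sum_{l=1}^{L} \log \|W_l\|$$

for some matrix norm, e.g. the Frobenius or Spectral norm.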
Predicting test accuracies: Product norms
many product norms correlate with reported test accuracies
(plot: Frobenius Norm)
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
The weights compensate for the different sizes & scales of the weight matrices & feature maps
We extract the k² (N x M) pre-activation maps, and rescale them accordingly
Predicting test accuracies: Product norms
empirical product norms perform pretty well
(plots: Frobenius Norm, Spectral Norm, Weighted Alpha)
… and we are best :)
Predicting test accuracies: Soft Rank
measures of Soft Rank can be anti-correlated!
(plots: Alpha, Stable Rank, MP Soft Rank)
… and we are worst :P
note: these 3 are not always so similar
Predicting test accuracies: Heavy tailed norms
The heavy tailed metrics perform best
(plots: Weighted Alpha, Alpha (Schatten) Norm)
Predicting test accuracies: 100 pretrained models
The heavy tailed metrics perform best
From an open source sandbox of nearly 500 pretrained CV models (picked >= 5 models / regression)
https://github.com/osmr/imgclsmob
Reproducibility
pip install weightwatcher
Open source tool: weightwatcher
a python tool to analyze Fat Tails in Deep Neural Networks
All results can be reproduced using the python weightwatcher tool
https://github.com/CalculatedContent/WeightWatcher

pip install weightwatcher
…
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
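A usage sketch (assumptions: a Keras model such as VGG16, and that analyze() returns a per-layer pandas DataFrame with columns including layer_id and alpha, as in recent weightwatcher versions):

import weightwatcher as ww
from tensorflow.keras.applications import VGG16   # any pretrained model

model = VGG16(weights="imagenet")
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()                 # per-layer metrics (DataFrame)
print(details[["layer_id", "alpha"]])       # power law exponent per layer
print(watcher.get_summary())                # averages, incl. weighted alpha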
Statistical Mechanics derivation of the alpha-norm metric
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engle & Van den Broeck (2001)
Generalization error ~ phase space volume
Average error ~ overlap between Teacher (T) and Student (J)
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engle & Van den Broeck (2001)
Standard approach:
• Teacher (T) and Student (J) random Perceptron vectors
• Treat data as an external random Gaussian field
• Apply Hubbard–Stratonovich to get a mean-field result
• Assume continuous or discrete J
• Solve for the generalization error as a function of load (# data points / # parameters)
New Set Up: Matrix-generalized Student-Teacher
Continuous perceptron: uninteresting
Ising perceptron: replica theory shows phase behavior, entropy collapse, etc.
New Set Up: Matrix-generalized Student-Teacher
“Towards a new theory…” Martin & Mahoney (in preparation)
real DNN matrices: N x M, strongly correlated, heavy tailed correlation matrices
Solve for the total integrated phase-space volume
New approach: HCIZ Matrix Integrals
Write the overlap as an HCIZ matrix integral
Fix a Teacher. The integral is now over all random Students J that overlap with T
Use the following result in RMT:
“Asymptotics of HCIZ integrals …” Tanaka (2008)
RMT: Annealed vs Quenched averages
“A First Course in Random Matrix Theory”, Potters and Bouchaud (2020)
Annealed averages are good outside spin-glass phases, where the system is trained well
We imagine averaging over all (random) student DNNs
with (correlations that) look like the teacher DNN
New interpretation: HCIZ Matrix Integrals
Generating functional: the R-Transform (inverse Green's function, via contour integral),
in terms of the Teacher's eigenvalues and the Student's cumulants
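Schematically (notation ours, an assumed reading of Tanaka's result): the leading asymptotics of the HCIZ integral reduce to an integral of the Student's R-transform over the Teacher's eigenvalues $\lambda_k$,

$$\frac{1}{N}\log \mathbb{E}_J\!\left[e^{\,N\,\mathrm{Tr}(\cdot)}\right] \;\to\; \sum_k \int_0^{\lambda_k} R_J(z)\,dz$$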
Some basic RMT: Green's functions
The Green's function is the Stieltjes transform of the eigenvalue distribution,
given the empirical spectral density (average eigenvalue density):
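In standard notation (not on the slide):

$$G(z) \;=\; \frac{1}{N}\,\mathrm{Tr}\!\left[(z I - X)^{-1}\right] \;=\; \int \frac{\rho(\lambda)}{z - \lambda}\,d\lambda$$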
Some basic RMT: Moment generating functions
The Green's function has poles at the actual eigenvalues,
but is analytic in the complex plane, up and away from the real axis
Expand in a series around z = ∞ to get a moment generating function
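In standard notation, the expansion at $z = \infty$ collects the moments $m_k$ of the spectral density:

$$G(z) \;=\; \sum_{k \ge 0} \frac{m_k}{z^{k+1}}, \qquad m_k = \int \lambda^k \rho(\lambda)\,d\lambda$$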
Some basic RMT: R-Transforms
The free-cumulant-generating function (R-transform)
is related to the Green's function, and gives a (similar) moment generating function,
which takes a simple form for both
Gaussian random and very Heavy Tailed (Lévy) random matrices
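In standard notation, R is the functional inverse of G, shifted by $1/z$, and generates the free cumulants $c_k$:

$$R(G(z)) + \frac{1}{G(z)} = z, \qquad R(z) = \sum_{k \ge 1} c_k\, z^{k-1}$$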
Results: Gaussian Random Weight Matrices
“Random Matrix Theory (book)” Bouchaud and Potters (2020)
Recover the Frobenius Norm (squared) as the metric
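A schematic reading (our sketch, not the slide's derivation): for the Marchenko-Pastur case $R(z) = 1/(1 - qz) \approx 1$ to leading order, so the integrated phase-space term reduces to the trace,

$$\sum_i \int_0^{\lambda_i} R(z)\,dz \;\approx\; \sum_i \lambda_i \;=\; \mathrm{Tr}\,X \;=\; \|W\|_F^2$$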
Results: (very) Heavy Tailed Weight Matrices
“Heavy-tailed random matrices” Burda and Jurkiewicz (2009)
Recover a Schatten Norm, in terms of the Heavy Tailed exponent
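Schematically (our sketch, with $\mu$ the tail exponent): for Lévy-type matrices the R-transform picks up a power-law branch, $R(z) \sim z^{\mu - 1}$, so the same integral yields a power of the eigenvalues, i.e. an alpha-Schatten norm,

$$\sum_i \int_0^{\lambda_i} R(z)\,dz \;\sim\; \sum_i \lambda_i^{\mu} \;=\; \|W\|_{2\mu}^{2\mu}$$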
Application to: Heavy Tailed Weight Matrices
Some reasonable approximations give the weighted alpha metric
Q.E.D.
c | c
charles@calculationconsulting.com