Random Matrix Theory for Machine Learning
The Mystery of Generalization: Why Does Deep Learning Work?
Fabian Pedregosa, Courtney Paquette, Tom Trogdon, Jeffrey Pennington
https://random-matrix-learning.github.io
Why does deep learning work?
Deep neural networks define a flexible and expressive class of functions.
State-of-the-art models have millions or billions of parameters:
• Meena: 2.6 billion
• Turing NLG: 17 billion
• GPT-3: 175 billion
Source: [Canziani et al., 2016]
Models that perform well on real data can easily memorize noise. Source: [Zhang et al., 2021]
⇒ Standard wisdom suggests they should overfit
Double descent
However, large neural networks do not obey the classical theory:
Source: [Belkin et al., 2019]
Source: [Advani et al., 2020]
Source: [Nakkiran et al., 2019]
The emerging paradigm of double descent seeks to explain this phenomenon.
Models of Double Descent
History of double descent: Kernel interpolation
1) Interpolating kernels (trained to zero error) generalize well [Belkin et al., 2018]
⇒ Double descent is not unique to deep neural networks
2) Kernels can implicitly regularize in high dimensions [Liang and Rakhlin, 2020]
[Figure: kernel regression error vs. the ridge parameter λ on MNIST digit pairs (2,5), (2,9), (3,6), (3,8), (4,7). Source: Liang and Rakhlin, 2020]
3) Consistency is a high-dimensional phenomenon [Rakhlin and Zhai, 2019]: the estimation error of the minimum-norm kernel interpolant
    arg min_{f∈H} ∥f∥_H   s.t.   f(x_i) = y_i , i = 1, …, n
does not converge to zero as n grows, unless d is proportional to n.
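As a reference point for the minimum-norm interpolant above, here is a minimal NumPy sketch (not from the slides; the Gaussian kernel, the linear teacher, and all sizes are arbitrary choices) that fits the ridgeless kernel interpolant by solving Kα = y and reports train and held-out error; the train error is zero up to round-off, while the test error stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test, sigma_eps = 200, 20, 1000, 0.1    # train size, input dim, test size, label noise

def rbf_kernel(A, B, gamma):
    # Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# Synthetic data: linear teacher plus noise
beta = rng.standard_normal(d)
X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y = X @ beta / np.sqrt(d) + sigma_eps * rng.standard_normal(n)
y_test = X_test @ beta / np.sqrt(d) + sigma_eps * rng.standard_normal(n_test)

# Minimum-norm interpolant: f(x) = k(x, X) K^{-1} y  ("ridgeless" kernel regression)
K = rbf_kernel(X, X, gamma=1.0 / d)
alpha = np.linalg.solve(K, y)

print("train MSE:", np.mean((K @ alpha - y) ** 2))                               # ~0: interpolation
print("test  MSE:", np.mean((rbf_kernel(X_test, X, 1.0 / d) @ alpha - y_test) ** 2))
```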
Models of double descent: High-dimensional linear regression
What is the simplest theoretically tractable model that exhibits double descent?
Linear regression suffices, but requires a mechanism to vary the effective number of parameters or samples:
• The size of randomly selected subsets of features [Belkin et al., 2020]
• The dimensionality of the low-variance subspace [Bartlett et al., 2020]
• The sparsity of the generative model [Mitra, 2019]
[Figure: risk vs. number of parameters p. Sources: Belkin et al., 2020; Mitra, 2019]
Models of double descent: Random feature models
In random feature regression, the number of random features controls the model
capacity, and can be tuned independently from the data.
Exact asymptotic results in high dimensions exist in many settings, including:
• Unstructured random features [Mei and Montanari, 2019]
• NTK-structured random features [Adlam and Pennington, 2020]
• Random Fourier features [Liao et al., 2020]
[Figure: test loss vs. number of random features n1, comparing the NTK regression prediction, empirical NTK regression, and empirical NN gradient descent. Sources: Adlam and Pennington, 2020; Mei and Montanari, 2019]
Random feature models
Random feature regression: definition
Random feature regression is just linear regression on a transformed feature matrix,
    F = f((1/√d) WX) ∈ ℝ^{m×n} ,   where W ∈ ℝ^{m×d} , W_ij ∼ N(0, 1).
• Model given by β⊤F (instead of β⊤X) — variable capacity (m vs d parameters)
• f(·) is a nonlinear activation function, acting elementwise
• F is equivalent to the first post-activation layer of a NN at init
For targets Y ∈ ℝ^{1×n}, the ridge-regularized loss is
    L(β; X) = ∥Y − (1/√m) β⊤F∥₂² + λ∥β∥₂² ,
and the optimal regression coefficients β̂ are given by
    β̂ = (1/√m) Y (K + λI_n)⁻¹ F⊤ ,   K = (1/m) F⊤F .
Note that Q ≡ (K + λI_n)⁻¹ is the resolvent of the kernel matrix K.
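For concreteness, here is a minimal NumPy sketch of this estimator (not from the slides; tanh is an arbitrary activation choice and the sizes are illustrative). It builds F, K, and the resolvent Q, computes β̂ through the formula above, and checks it against the equivalent primal ridge solution (1/√m)(FF⊤/m + λI_m)⁻¹FY⊤, which agrees by the push-through identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 300, 100, 200, 1e-2          # samples, input dim, random features, ridge

X = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
Y = rng.standard_normal((1, n))             # any 1 x n targets work for this check

F = np.tanh(W @ X / np.sqrt(d))             # random feature matrix, shape (m, n)
K = F.T @ F / m                             # kernel matrix, shape (n, n)
Q = np.linalg.inv(K + lam * np.eye(n))      # resolvent of K

# Form from the slide: beta_hat = (1/sqrt(m)) Y (K + lam I_n)^{-1} F^T
beta_dual = (Y @ Q @ F.T / np.sqrt(m)).ravel()

# Equivalent primal ridge solution, for comparison:
# beta_hat = (1/sqrt(m)) (F F^T / m + lam I_m)^{-1} F Y^T
beta_primal = np.linalg.solve(F @ F.T / m + lam * np.eye(m), F @ Y.T / np.sqrt(m)).ravel()

print("max |dual - primal|:", np.max(np.abs(beta_dual - beta_primal)))  # ~1e-13
```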
Random feature regression: training error
Training error is determined by the resolvent matrix Q ≡ (K + λI_n)⁻¹:
    Etrain(λ) = (1/n) ∥Y − (1/√m) β̂⊤F∥₂²
              = (1/n) ∥Y − (1/m) Y Q F⊤F∥₂²
              = (1/n) ∥Y − Y Q K∥₂²
              = (1/n) ∥Y (I_n − QK)∥₂²
              = λ² (1/n) ∥Y Q∥₂² ,
where we used that I_n − QK = I_n − Q(Q⁻¹ − λI_n) = I_n − (I_n − λQ) = λQ.
So we see that the training error measures the alignment between the resolvent and the label vector.
What about the test error?
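The chain of equalities holds at any finite size. A small sketch (not from the slides; tanh activation and arbitrary sizes) compares the directly computed training error with λ²(1/n)∥YQ∥₂²:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lam = 300, 100, 200, 1e-1

X = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
Y = rng.standard_normal((1, n))

F = np.tanh(W @ X / np.sqrt(d))
K = F.T @ F / m
Q = np.linalg.inv(K + lam * np.eye(n))

beta_hat = (Y @ Q @ F.T / np.sqrt(m)).T                      # shape (m, 1)
E_train_direct = np.sum((Y - beta_hat.T @ F / np.sqrt(m)) ** 2) / n
E_train_resolvent = lam**2 * np.sum((Y @ Q) ** 2) / n        # lambda^2 (1/n) ||Y Q||^2

print(E_train_direct, E_train_resolvent)                     # identical up to round-off
```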
Aside: Generalized cross validation (GCV)
A model’s performance on the training set, or subsets thereof, can be useful for estimating its performance on the test set.
• Leave-one-out cross validation (LOOCV):
    ELOOCV(λ) = (1/n) ∥Y Q · diag(Q)⁻¹∥₂²
• Generalized cross validation (GCV):
    EGCV(λ) = (1/n) ∥Y Q∥₂² / ((1/n) tr(Q))²
In certain high-dimensional limits, EGCV(λ) = ELOOCV(λ) = Etest(λ):
• Ridge regression [Hastie et al., 2019]
• Kernel ridge regression [Jacot et al., 2020]
• Random feature regression [Adlam and Pennington, 2020]
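A sketch of both estimators in code (not from the slides; tanh activation and the linear-teacher-plus-noise target introduced on the later asymptotics slide): it computes ELOOCV and EGCV from a single fit and compares them with a held-out estimate of the test error. At these moderate sizes the agreement is only approximate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, lam, sigma_eps = 400, 200, 300, 1e-1, 0.5

X = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
beta_star = rng.standard_normal(d)
Y = (beta_star @ X / np.sqrt(d) + sigma_eps * rng.standard_normal(n))[None, :]

F = np.tanh(W @ X / np.sqrt(d))
K = F.T @ F / m
Q = np.linalg.inv(K + lam * np.eye(n))
YQ = (Y @ Q).ravel()

E_loocv = np.mean((YQ / np.diag(Q)) ** 2)                    # (1/n) ||Y Q diag(Q)^{-1}||^2
E_gcv = np.mean(YQ**2) / (np.trace(Q) / n) ** 2              # (1/n) ||Y Q||^2 / ((1/n) tr Q)^2

# Held-out estimate of the test error
n_test = 2000
X_test = rng.standard_normal((d, n_test))
Y_test = beta_star @ X_test / np.sqrt(d) + sigma_eps * rng.standard_normal(n_test)
beta_hat = (Y @ Q @ F.T / np.sqrt(m)).ravel()
pred = beta_hat @ np.tanh(W @ X_test / np.sqrt(d)) / np.sqrt(m)
E_test = np.mean((Y_test - pred) ** 2)

print(E_loocv, E_gcv, E_test)                                # roughly comparable at these sizes
```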
Random feature regression: high-dimensional asymptotics
To develop an analytical model of double descent, we study the high-dimensional asymptotics:
    m, d, n → ∞ such that ϕ ≡ d/n and ψ ≡ d/m are constant.
In this limit, only linear functions of the data can be learned.
• Intuition: only enough constraints to disambiguate linear combinations of features
• Nonlinear target function behaves like linear function plus noise
Therefore it suffices to consider labels given by
    Y = (1/√d) β*⊤X + ε ,   ε_i ∼ N(0, σ_ε²) .
For simplicity, we focus on the specific setting in which
    X_ij ∼ N(0, 1)   and   β* ∼ N(0, I_d) .
Random feature regression: test error
In the high-dimensional asymptotic setup from above, the random feature test error can be written as
    Etest(λ) = EGCV(λ)
             = lim_{n→∞} (1/n) ∥Y Q∥₂² / ((1/n) tr(Q))²
             = lim_{n→∞} (1/n) ∥((1/√d) β*⊤X + ε) Q∥₂² / ((1/n) tr(Q))²
             = lim_{n→∞} (1/n) tr[(σ_ε² I_n + (1/d) X⊤X) Q²] / ((1/n) tr(Q))²
             ≡ − (σ_ε² τ1′(λ) + τ2′(λ)) / τ1(λ)² ,
where we used that ∂Q/∂λ = −Q², and we defined
    τ1 = lim_{n→∞} (1/n) tr(Q)   and   τ2 = lim_{n→∞} (1/n) tr((1/d) X⊤X Q) .
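The third equality replaces the random quadratic form (1/n)∥YQ∥₂² by its conditional expectation over β* and ε, namely (1/n)tr[(σ_ε² I_n + (1/d)X⊤X)Q²]. The finite-n sketch below (not from the slides; tanh activation and arbitrary sizes) checks that concentration numerically and evaluates the resulting estimate of the test error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m, lam, sigma_eps = 500, 250, 375, 1e-1, 0.5

X = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
beta_star = rng.standard_normal(d)
Y = (beta_star @ X / np.sqrt(d) + sigma_eps * rng.standard_normal(n))[None, :]

F = np.tanh(W @ X / np.sqrt(d))
Q = np.linalg.inv(F.T @ F / m + lam * np.eye(n))

lhs = np.sum((Y @ Q) ** 2) / n                       # (1/n) ||Y Q||^2
Sigma = sigma_eps**2 * np.eye(n) + X.T @ X / d
rhs = np.trace(Sigma @ Q @ Q) / n                    # (1/n) tr[(sigma^2 I + X^T X / d) Q^2]

tau1 = np.trace(Q) / n
print("quadratic form:", lhs, "  trace surrogate:", rhs)   # close, not exact, at finite n
print("test-error estimate:", rhs / tau1**2)
```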
Computing the asymptotic test error
To compute the test error, we need:
    τ1 = lim_{n→∞} (1/n) tr((K + λI_n)⁻¹)   and   τ2 = lim_{n→∞} (1/n) tr((1/d) X⊤X (K + λI_n)⁻¹) .
Recalling the definition of K,
    K = (1/m) F⊤F ,   F = f((1/√d) WX) ,
it is evident that the entries of F are nonlinearly dependent.
• Cannot simply utilize standard results for Wishart matrices
• Stieltjes transform is insufficient for τ2
These technical challenges can be overcome with two tricks:
1. Constructing an equivalent Gaussian linearized model
2. Analyzing a suitably augmented resolvent
Computing the asymptotic test error: Gaussian equivalents
The nonlinear dependencies in F = f((1/√d) WX) complicate the analysis. Can we identify a simpler matrix in the same universality class?
There exist constants c1 and c2 such that
    F ≅ Flin ≡ c1 (1/√d) WX + c2 Θ ,   Θ_ij ∼ N(0, 1) ,
where F ≅ Flin indicates the two matrices share all statistics relevant for computing the test error:
    τ1 = lim_{n→∞} (1/n) tr((1/m) F⊤F + λI_n)⁻¹ = lim_{n→∞} (1/n) tr((1/m) Flin⊤Flin + λI_n)⁻¹
    τ2 = lim_{n→∞} (1/n) tr((1/d) X⊤X ((1/m) F⊤F + λI_n)⁻¹) = lim_{n→∞} (1/n) tr((1/d) X⊤X ((1/m) Flin⊤Flin + λI_n)⁻¹)
How can we compute these traces? Need to augment the resolvent.
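The slide does not spell out c1 and c2. A common choice in the Gaussian-equivalence literature, for an activation with E[f(g)] = 0, is c1 = E[g f(g)] and c2 = √(E[f(g)²] − c1²), which matches the η and ζ appearing in the theorem at the end of this section (c1² = ζ, c2² = η − ζ). Treating that as an assumption, the sketch below (not from the slides) estimates the constants by Monte Carlo and compares (1/n)tr(Q) computed from F and from Flin:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, lam = 500, 250, 375, 1e-1
f = np.tanh                                  # odd activation, so E[f(g)] = 0

# Assumed constants (not given explicitly on the slide):
# c1 = E[g f(g)],  c2 = sqrt(E[f(g)^2] - c1^2), estimated by Monte Carlo.
g = rng.standard_normal(2_000_000)
c1 = np.mean(g * f(g))
c2 = np.sqrt(np.mean(f(g) ** 2) - c1**2)

X = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
Theta = rng.standard_normal((m, n))

F = f(W @ X / np.sqrt(d))
F_lin = c1 * W @ X / np.sqrt(d) + c2 * Theta

def tau1(features):
    # (1/n) tr (features^T features / m + lam I)^{-1}
    Q = np.linalg.inv(features.T @ features / m + lam * np.eye(n))
    return np.trace(Q) / n

print("tau1 from F    :", tau1(F))
print("tau1 from F_lin:", tau1(F_lin))       # should agree up to finite-size fluctuations
```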
Computing the asymptotic test error: resolvent method
Recall from Part 2 that the resolvent method identifies consistency relations between suitably chosen submatrices of the resolvent.
Here we can undertake a similar analysis as in Part 2, but now on an augmented matrix,
    H = [ λI_n            (1/√m) Flin⊤ ]
        [ (1/√m) Flin     −I_m         ] ,
which encodes the resolvent through Q = (K + λI_n)⁻¹ = [H⁻¹]_{1:n,1:n}.
To derive consistency relations, we consider two submatrices: H^(1) (leaving out row/column 1) and H^(n+1) (leaving out row/column n+1).
As before, we use the Sherman-Morrison formula to compute [H^(1)]⁻¹ and [H^(n+1)]⁻¹, and relate them to Q and rows/columns of Flin.
Straightforward concentration arguments eventually lead to coupled self-consistent equations for τ1 and τ2 [Adlam et al., 2019].
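The statement that H encodes Q is a finite-size block-inverse identity (no asymptotics needed), so it can be checked directly; a short sketch (not from the slides; an arbitrary Gaussian stand-in for Flin):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, lam = 60, 40, 1e-1

F_lin = rng.standard_normal((m, n))          # any (m, n) matrix works for this identity
K = F_lin.T @ F_lin / m
Q = np.linalg.inv(K + lam * np.eye(n))

# Augmented matrix H = [[lam*I_n, F_lin^T/sqrt(m)], [F_lin/sqrt(m), -I_m]]
H = np.block([
    [lam * np.eye(n), F_lin.T / np.sqrt(m)],
    [F_lin / np.sqrt(m), -np.eye(m)],
])
Q_from_H = np.linalg.inv(H)[:n, :n]          # upper-left n x n block of H^{-1}

print("max deviation:", np.max(np.abs(Q - Q_from_H)))   # ~1e-14
```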
Computing the asymptotic test error: free probability
An alternative augmentation of the resolvent completely linearizes the dependence on the random matrices:
    M = [ λI_n    (c2/m) Θ⊤    (c1/√(dm)) X⊤    0          ]
        [ c2 Θ    −I_m         0                (c1/√d) W  ]
        [ 0       W⊤           −I_d             0          ]
        [ X       0            0                −I_d       ] ,
where the Schur complement formula now gives
    τ1 = lim_{n→∞} (1/n) tr([M⁻¹]_{1,1})   and   τ2 = lim_{n→∞} (1/n) tr([M⁻¹]_{4,3}) .
The asymptotic blockwise traces tr([M⁻¹]_{a,b}) can themselves be computed using free probability [Adlam and Pennington, 2020].
Computing the asymptotic test error: free probability
M is linear in the random matrices X, W, and Θ:
    M = [ λI_n   0      0      0    ]     [ 0   0   (c1/√(dm)) X⊤   0 ]
        [ 0      −I_m   0      0    ]  +  [ 0   0   0                0 ]
        [ 0      0      −I_d   0    ]     [ 0   0   0                0 ]
        [ 0      0      0      −I_d ]     [ X   0   0                0 ]

          [ 0   0    0   0          ]     [ 0      (c2/m) Θ⊤   0   0 ]
       +  [ 0   0    0   (c1/√d) W  ]  +  [ c2 Θ   0            0   0 ]
          [ 0   W⊤   0   0          ]     [ 0      0            0   0 ]
          [ 0   0    0   0          ]     [ 0      0            0   0 ]

Can we compute the blockwise traces with free probability via the R-transform?
Not naively: the additive terms are independent, but not free over ℂ.
However, they are free over M₄(ℂ), and there exists a suitable operator-valued generalization of the R-transform that enables the necessary computations [Mingo and Speicher, 2017].
Asymptotic test error
Theorem
Let η = E[f(g)²] and ζ = (E[g f(g)])² for g ∼ N(0, 1). Then, the asymptotic traces τ1(λ) and τ2(λ) are given by solutions to the polynomial system
    ζ τ1 τ2 (1 − λτ1) = ϕ/ψ (ζ τ1 τ2 + ϕ(τ2 − τ1)) = (τ1 − τ2) ϕ ((η − ζ)τ1 + ζτ2) ,
and
    Etrain = −λ² (σ_ε² τ1′ + τ2′)   and   Etest = −(σ_ε² τ1′ + τ2′)/τ1² .
[Figure: test MSE vs. m/n for random feature regression with d = n and with d = n/2, for log10(λ) ∈ {−5, −4, −3, −2, −1}]
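To reproduce the qualitative shape of these curves empirically, the sketch below (not from the slides; tanh activation, a single train/test draw, and illustrative sizes) sweeps the number of random features m at fixed n = d with a small ridge λ and prints the held-out MSE, which peaks near the interpolation threshold m ≈ n:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam, sigma_eps = 300, 300, 1e-5, 0.5    # d = n, as in the left panel of the figure
n_test = 2000

X = rng.standard_normal((d, n))
X_test = rng.standard_normal((d, n_test))
beta_star = rng.standard_normal(d)
Y = (beta_star @ X / np.sqrt(d) + sigma_eps * rng.standard_normal(n))[None, :]
Y_test = beta_star @ X_test / np.sqrt(d) + sigma_eps * rng.standard_normal(n_test)

for ratio in [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.5, 5.0, 10.0]:
    m = int(ratio * n)
    W = rng.standard_normal((m, d))
    F = np.tanh(W @ X / np.sqrt(d))
    Q = np.linalg.inv(F.T @ F / m + lam * np.eye(n))
    beta_hat = (Y @ Q @ F.T / np.sqrt(m)).ravel()
    pred = beta_hat @ np.tanh(W @ X_test / np.sqrt(d)) / np.sqrt(m)
    print(f"m/n = {ratio:5.2f}   test MSE = {np.mean((Y_test - pred) ** 2):.3f}")
```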
References
Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of
generalization. In International Conference on Machine Learning. PMLR, 2020. URL
https://arxiv.org/pdf/2008.06786.pdf.
Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning.
arXiv preprint arXiv:1912.00827, 2019.
Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.
Neural Networks, 132:428–446, 2020.
Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the
National Academy of Sciences, 117(48):30063–30070, 2020.
Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In
International Conference on Machine Learning, pages 541–549. PMLR, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data
Science, 2(4):1167–1180, 2020.
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications.
arXiv preprint arXiv:1605.07678, 2016.
Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares
interpolation. arXiv preprint arXiv:1903.08560, 2019. URL https://arxiv.org/pdf/1903.08560.pdf.
Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, and Franck Gabriel. Kernel alignment risk estimator: risk
prediction from training data. arXiv preprint arXiv:2006.09796, 2020.
Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48
(3):1329–1347, 2020.
Zhenyu Liao, Romain Couillet, and Michael W Mahoney. A random matrix analysis of random fourier features: beyond the
gaussian kernel, a precise phase transition, and the corresponding double descent. arXiv preprint arXiv:2006.05013, 2020. URL
https://arxiv.org/pdf/2006.05013.pdf.
Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double
descent curve. arXiv preprint arXiv:1908.05355, 2019. URL https://arxiv.org/pdf/1908.05355.pdf.
James A Mingo and Roland Speicher. Free probability and random matrices, volume 35. Springer, 2017.
Partha P Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for l_2 and l_1 penalized
interpolation. arXiv preprint arXiv:1906.03667, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger
models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
Alexander Rakhlin and Xiyu Zhai. Consistency of interpolation with laplace kernels is a high-dimensional phenomenon. In
Conference on Learning Theory, pages 2595–2623. PMLR, 2019.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires
rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
