1. Convolutional Neural Networks on Graphs
Xavier Bresson
School of Information Systems, Singapore Management University (SMU)
School of Computer Science and Engineering, Nanyang Technological University
Oct 19th, 2017
Joint work with M. Defferrard (EPFL), P. Vandergheynst (EPFL),
M. Bronstein (USI), F. Monti (USI), R. Levie (TAU)
2. Outline
• Deep Learning
- A brief history
- High-dimensional learning
• Convolutional Neural Networks
- Architecture
- Non-Euclidean Data
• Spectral Graph Theory
- Graphs and operators
- Spectral decomposition
- Fourier analysis
- Convolution
- Coarsening
• Spectral ConvNets
- SplineNets
- ChebNets* [NIPS’16]
- GraphConvNets
- CayleyNets*
• Spectral ConvNets on Multiple Graphs
- Recommender Systems* [NIPS’17]
• Conclusion
4. A brief history of artificial neural networks
[Timeline figure, 1958-2015: AI birth (1958 Perceptron, Rosenblatt; 1959 primary visual cortex, Hubel & Wiesel; 1962 birth of data science, split from statistics, Tukey), the AI winter [1960s-2012] (backprop, Werbos; Neocognitron, Fukushima; CNN, LeCun; SVM/kernel techniques, Vapnik; RNN/LSTM, Schmidhuber; first NIPS; first KDD; handcrafted features, kernel techniques, graphical models, wavelets/numerical PDEs), and the AI renaissance (deep learning breakthrough, Hinton; GAN, Goodfellow; ResNet, He; deep RL, Mnih; first NVIDIA GPU; first Amazon cloud center; 2010 Kaggle platform; Google AI TensorFlow; Facebook AI Torch; data scientist as #1 job in the US; Facebook and OpenAI centers). Trends: GPU speed 2x/year (hardware), data volume 2x/1.5 years (big data). The 4th industrial revolution, a digital intelligence revolution, or a new AI bubble?]
5. Breakthrough in image recognition
[Figure: image recognition accuracy over time, from handcrafted features (SIFT) to learned features (end-to-end systems); human accuracy: 4.1%.]
6. Convolutional neural networks: LeNet-5[1]
2 convolutional + 2 fully connected layers
tanh non-linearity
1M parameters
Training set: MNIST 70K images
Trained on CPU
[1] LeCun et al. 1998
7. Convolutional neural networks: AlexNet[2]
5 convolutional + 3 fully connected layers
ReLU non-linearity
Dropout regularization
60M parameters
Training set: ImageNet 1.5M images
Trained on GPU
[2] Krizhevsky, Sutskever, Hinton 2012
More learning capacity
More data
More computational power
9. High-dimensional learning
NNs are powerful tools for solving high-dimensional learning problems.
Prototypical learning problem: evaluate a function f(x) given n samples {(xi, f(xi))}. How? If f is regular, interpolate from nearby samples.
[Figure: 1-dimensional learning, interpolating f(x) from samples (xi, f(xi)).]
10. Curse of dimensionality
What about high dimensions? Interpolation is no longer possible: we can never have enough nearby points to interpolate well.
[Figure: N samples per dimension on a regular grid of $[0,1]^2$, with spacing $\varepsilon = N^{-1}$.]
How many points are needed to cover the 2D plane? $\varepsilon^{-2}$. A d-dimensional cube $[0,1]^d$ requires $\varepsilon^{-d}$ points.
With dim(image) = 512 × 512 ≈ $10^6$ and N = 10 samples/dim, that is $10^{1{,}000{,}000}$ points!
Curse of dimensionality: the number of samples x of f needed is exponential in the dimension, and all samples are far from each other in high dimensions[3].
Optimization: $\varepsilon^{-d}$ points are also needed to find the solution of
$\min_f L(f) = \mathbb{E}[\ell(f(x_i), y_i)]$
[3] Beyer 1998
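The claim that all samples are far from each other in high dimensions[3] is easy to check numerically. A minimal sketch (assumed, not from the slides), using NumPy/SciPy: as the dimension grows, the smallest and largest pairwise distances become comparable, so there are no meaningfully "close" neighbours to interpolate from.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    X = rng.random((500, d))                 # 500 points uniform in [0,1]^d
    D = pdist(X)                             # all pairwise Euclidean distances
    print(d, round(D.min() / D.max(), 3))    # ratio tends to 1 as d grows
```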
11. Blessing of structure
In our universe, data have structure: they concentrate on (non-)smooth low-dimensional spaces inside the high-dimensional space.
These structures can be expressed as mathematical properties (statistics, geometry, groups, etc.).
[Figure: data in a high-dimensional space concentrating on a low-dimensional space.]
12. Learning exponential functions
Neural networks can learn parametric exponential functions with a linear number of parameters.
A very large number of parameters, on the order of billions, can be learned by backpropagation. Huge capacity to learn complex data[4].
Example: $n_{in} = 2$, $n_{out} = 3$ ⇒ $|W| = 2 \times 3 = 6$ parameters, but #configurations $= 2^{n_{out}} = 2^3 = 8$.
The number of parameters grows linearly, but the function capacity grows exponentially.
[4] Zhang, Bengio, Hardt, Recht, Vinyals, 2016
13. Universal function approximation[5]
[5] Cybenko 1989, Hornik 1991
Universal Approximation Theorem: Let $\xi$ be a non-constant, bounded, and monotonically increasing continuous activation function, $y : [0,1]^p \to \mathbb{R}$ a continuous function, and $\epsilon > 0$. Then there exist $n$ and parameters $a, b \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times p}$ s.t.
$\Big|\sum_{i=1}^{n} a_i\, \xi(w_i^\top f + b_i) - y(f)\Big| < \epsilon \quad \forall f \in [0,1]^p$
14. Shallow neural nets as universal approximators?
Any continuous function can be approximated arbitrarily well by
a neural network with a single hidden layer. Then,
How many neurons?
How long is the optimization?
Does it generalize well/overfit?
Shallow NNs: data such as images require an exponential number of neurons in a single hidden layer, optimize badly, and do not generalize.
Deep NNs: Multiple layers and data structure can alleviate the
above issues[6].
[6] Montufar, Pascanu, Cho, Bengio 2014
15. Fully connected networks
A deep neural network consists of multiple layers.
$f_l$ = $l$-th image feature (R, G, B channels), $\dim(f_l) = n \times 1$
$g_l^{(k)}$ = $l$-th feature map, $\dim(g_l^{(k)}) = n_l^{(k)} \times 1$
Linear layer: $g_l^{(k)} = \xi\Big(\sum_{l'=1}^{q_{k-1}} W_{l,l'}^{(k)}\, \xi\Big(\sum_{l'=1}^{q_{k-2}} W_{l,l'}^{(k-1)}\, \xi\big(\cdots f_{l'}\big)\Big)\Big)$
Activation, e.g. $\xi(x) = \max\{x, 0\}$ (rectified linear unit, ReLU)
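As a concrete illustration of the linear layer above, a minimal NumPy sketch (assumed, not the slides' code; the function name is illustrative) of a forward pass through two fully connected layers with ReLU activations:

```python
import numpy as np

def fc_forward(f, weights):
    """f: (n,) input feature vector; weights: list of (n_k x n_{k-1}) matrices W^(k)."""
    g = f
    for W in weights:
        g = np.maximum(W @ g, 0)   # linear layer followed by the ReLU activation xi
    return g

# Example: a 6 -> 4 -> 3 network with random weights
rng = np.random.default_rng(0)
out = fc_forward(rng.standard_normal(6),
                 [rng.standard_normal((4, 6)), rng.standard_normal((3, 4))])
```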
16. Fully connected networks
FC networks capture non-linear factorizations of the data.
They overfit: too many parameters to learn.
NN architectures require additional data structure to successfully analyze images, sound, documents, etc.
18. Convolutional neural networks[1]
Data (image, video, sound) are compositional, i.e. they are formed of patterns that are:
• Local
• Stationary
• Multi-scale (/hierarchical)
ConvNets leverage this compositional structure: they extract compositional features and feed them to a classifier, recommender, etc. (end-to-end).
Applications: Computer Vision, NLP, Drug discovery, Games.
[1] LeCun et al. 1998
19. Key properties
Locality: Property inspired by visual cortex neurons[7].
Local receptive fields activate in the presence of (learned) local features.
Neocognitron[8]
[7] Hubel, Wiesel 1959, [8] Fukushima 1980
21. Key properties
Multi-scale: Simple structures combine to compose slightly more
abstract structures, and so on, in a hierarchical way.
Also inspired by the primary visual cortex of the brain (V1 and V2 neurons).
Features learned by a ConvNet become increasingly complex at deeper layers.[9]
[9] Zeiler, Fergus 2013
22. Implementation
Locality: compact-support kernels ⇒ O(1) parameters per filter.
Stationarity: convolutional operators ⇒ O(n log n) in general (FFT), O(n) for compact kernels.
Multi-scale: downsampling + pooling ⇒ O(n).
23. Compositional features
$f_l$ = $l$-th image feature (R, G, B channels), $\dim(f_l) = n \times 1$
$g_l^{(k)}$ = $l$-th feature map, $\dim(g_l^{(k)}) = n_l^{(k)} \times 1$
Compositional features consist of multiple convolutional + pooling layers.
Convolutional layer: $g_l^{(k)} = \xi\Big(\sum_{l'=1}^{q_{k-1}} W_{l,l'}^{(k)} \star \xi\Big(\sum_{l'=1}^{q_{k-2}} W_{l,l'}^{(k-1)} \star \xi\big(\cdots f_{l'}\big)\Big)\Big)$
Activation, e.g. $\xi(x) = \max\{x, 0\}$ (rectified linear unit, ReLU)
Pooling: $g_l^{(k)}(x) = \big\| g_l^{(k-1)}(x') : x' \in \mathcal{N}(x) \big\|_p$, $p = 1, 2,$ or $\infty$
24. ConvNets[1]
Filters localized in space (Locality)
Convolutional filters (Stationarity)
Multiple layers (Multi-scale)
O(1) parameters per filter (independent of input image size n)
O(n) complexity per layer (filtering done in the spatial domain)
[1] LeCun et al. 1998
25. ConvNets in computer vision
Image/object recognition
[Krizhevsky-Sutskever-Hinton’12]
Image inpainting
[Pathak-Efros-etal’16]
Image captioning
[Karpathy-Li’15]
Self-driving cars
Highly successful: ConvNets are the Swiss Army knife of computer vision!
27. Data domain for ConvNets
Image, volume, video: 2D, 3D, 2D+1D Euclidean domains (2D grids).
Sentence, word, character, sound: 1D Euclidean domain (1D grid).
These domains have nice regular spatial structures.
All CNN operations are mathematically well defined and fast (convolution, pooling).
28. Non-Euclidean data
Non-Euclidean data = graphs/networks.
Also NLP, physics, social science, communication networks, etc.
29. Challenges of graph deep learning
Extend neural network techniques to graph-structured data.
Assumption: Non-Euclidean data are locally stationary and manifest
hierarchical structures.
How to define compositionality on graphs? (convolution and pooling
on graphs)
How to make them fast? (linear complexity)
31. Graphs[10]
Graph: $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
Vertices: $\mathcal{V} = \{1, \dots, n\}$
Edges: $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$
Vertex weights: $b_i > 0$ for $i \in \mathcal{V}$
Edge weights: $a_{ij} \ge 0$ for $(i,j) \in \mathcal{E}$
Vertex fields: $f = (f_1, \dots, f_n)$, represented as $L^2(\mathcal{V}) = \{f : \mathcal{V} \to \mathbb{R}\}$, a Hilbert space with inner product $\langle f, g \rangle_{L^2(\mathcal{V})} = \sum_{i \in \mathcal{V}} b_i f_i g_i$
[10] Chung 1994
32. Calculus on graphs
Gradient operator: $\nabla : L^2(\mathcal{V}) \to L^2(\mathcal{E})$, $(\nabla f)_{ij} = \sqrt{a_{ij}}\,(f_i - f_j)$
Divergence operator: $\mathrm{div} : L^2(\mathcal{E}) \to L^2(\mathcal{V})$, adjoint to the gradient operator:
$\langle F, \nabla f \rangle_{L^2(\mathcal{E})} = \langle \nabla^* F, f \rangle_{L^2(\mathcal{V})} = \langle -\mathrm{div}\, F, f \rangle_{L^2(\mathcal{V})}$
$(\mathrm{div}\, F)_i = \frac{1}{b_i} \sum_{j : (i,j) \in \mathcal{E}} \sqrt{a_{ij}}\,(F_{ij} - F_{ji})$
33. Graph Laplacian
Laplacian operator: $\Delta : L^2(\mathcal{V}) \to L^2(\mathcal{V})$, $(\Delta f)_i = \frac{1}{b_i} \sum_{j : (i,j) \in \mathcal{E}} a_{ij}\,(f_i - f_j)$
= difference between f and its local average (2nd derivative on graphs).
Core operator in spectral graph theory.
Represented as a positive semi-definite $n \times n$ matrix, where $A = (a_{ij})$ and $D = \mathrm{diag}\big(\sum_{j \ne i} a_{ij}\big)$:
Unnormalized Laplacian: $\Delta = D - A$
Normalized Laplacian: $\Delta = I - D^{-1/2} A D^{-1/2}$
Random walk Laplacian: $\Delta = I - D^{-1} A$
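A minimal sketch (assumed, not the authors' code; function name illustrative) of the three Laplacian matrices above, built with NumPy/SciPy from a weight matrix A:

```python
import numpy as np
import scipy.sparse as sp

def laplacians(A):
    """A: symmetric (n x n) adjacency/weight matrix with a_ij >= 0."""
    A = sp.csr_matrix(A)
    d = np.asarray(A.sum(axis=1)).ravel()                       # degrees d_i = sum_j a_ij
    L_unnorm = sp.diags(d) - A                                  # Delta = D - A
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_norm = sp.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt   # I - D^-1/2 A D^-1/2
    L_rw = sp.eye(A.shape[0]) - sp.diags(1.0 / np.maximum(d, 1e-12)) @ A   # I - D^-1 A
    return L_unnorm, L_norm, L_rw

# Example: a weighted 4-cycle graph
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L, L_norm, L_rw = laplacians(A)
```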
35. Spectral decomposition
The Laplacian of a graph with n vertices admits n eigenvectors:
$\Delta \phi_k = \lambda_k \phi_k$, $k = 1, 2, \dots$
Eigenvectors are real and orthonormal (due to self-adjointness): $\langle \phi_k, \phi_{k'} \rangle_{L^2(\mathcal{V})} = \delta_{kk'}$
Eigenvalues are non-negative (due to positive semi-definiteness): $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
Eigendecomposition of a graph Laplacian: $\Delta = \Phi \Lambda \Phi^\top$, where $\Phi = (\phi_1, \dots, \phi_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$
36. Interpretation
Find the smoothest orthonormal basis on a graph:
$\min_{\phi_k} E_{\mathrm{Dir}}(\phi_k)$ s.t. $\|\phi_k\| = 1$, $k = 2, 3, \dots, n$, $\phi_k \perp \mathrm{span}\{\phi_1, \dots, \phi_{k-1}\}$
where $E_{\mathrm{Dir}}(f) = f^\top \Delta f$ is the Dirichlet energy = a measure of the smoothness of a function.
In matrix form: $\min_{\Phi \in \mathbb{R}^{n \times n}} \underbrace{\mathrm{trace}(\Phi^\top \Delta \Phi)}_{\|\Phi\|_{\mathcal{G}}\ \text{Dirichlet norm}}$ s.t. $\Phi^\top \Phi = I$
Solution: the first n Laplacian eigenvectors $\Phi = (\phi_1, \dots, \phi_n)$.
39. Fourier analysis on Euclidean spaces
A function $f : [-\pi, \pi] \to \mathbb{R}$ can be written as a Fourier series:
$f(x) = \sum_{k \ge 0} \underbrace{\frac{1}{2\pi} \int_{-\pi}^{\pi} f(x')\, e^{-ikx'}\, dx'}_{\hat f_k = \langle f, e^{ikx} \rangle_{L^2([-\pi,\pi])}}\, e^{ikx}$
Fourier basis = Laplace-Beltrami eigenfunctions: $\Delta e^{ikx} = k^2 e^{ikx}$
$e^{ikx}$ = Fourier mode, $k$ = frequency of the Fourier mode.
40. Fourier analysis on graphs[11]
[11] Hammond, Vandergheynst, Gribonval, 2011
Graph Fourier basis = Laplacian eigenvectors $\Phi = [\phi_1, \dots, \phi_n]$: $\Delta \phi_k = \lambda_k \phi_k$
$\phi_k$ = graph Fourier mode, $\lambda_k$ = (squared) frequency
A function $f : \mathcal{V} \to \mathbb{R}$ can be written as a Fourier series:
$f_i = \sum_{k=1}^{n} \underbrace{\langle f, \phi_k \rangle_{L^2(\mathcal{V})}}_{\hat f_k}\, \phi_{k,i}$
where $\hat f_k$ is the k-th graph Fourier coefficient.
In matrix-vector notation, with the $n \times n$ Fourier matrix $\Phi$:
$\hat f = \Phi^\top f$ (Fourier transform) and $f = \Phi \hat f$ (inverse Fourier transform)
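A minimal sketch (assumed, not from the slides; function names illustrative) of the graph Fourier transform and its inverse via the Laplacian eigendecomposition:

```python
import numpy as np

def graph_fourier_basis(L):
    """L: dense symmetric (n x n) graph Laplacian. Returns eigenvalues and eigenvectors."""
    lam, Phi = np.linalg.eigh(L)   # lam[0] ~ 0; eigenvectors are orthonormal columns
    return lam, Phi

def gft(Phi, f):
    return Phi.T @ f               # f_hat = Phi^T f

def igft(Phi, f_hat):
    return Phi @ f_hat             # f = Phi f_hat

# Example on the 4-cycle Laplacian L from the previous sketch:
# lam, Phi = graph_fourier_basis(L.toarray())
# f = np.random.randn(4); assert np.allclose(igft(Phi, gft(Phi, f)), f)
```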
42. Convolution on Euclidean spaces
Given two functions $f, g : [-\pi, \pi] \to \mathbb{R}$, their convolution is the function
$(f \star g)(x) = \int_{-\pi}^{\pi} f(x')\, g(x - x')\, dx'$
Shift-invariance: $f(x - x_0) \star g(x) = (f \star g)(x - x_0)$
The convolution operator commutes with the Laplacian: $\Delta(f \star g) = (\Delta f) \star g$
Convolution theorem: convolution can be computed in the Fourier domain as $\widehat{(f \star g)} = \hat f \cdot \hat g$
Efficient computation using the FFT: $O(n \log n)$
43. Convolution on discrete Euclidean spaces
Convolution of two vectors $f = (f_1, \dots, f_n)^\top$ and $g = (g_1, \dots, g_n)^\top$:
$(f \star g)_i = \sum_m g_{(i-m) \bmod n}\, f_m$
i.e. multiplication of $f$ by a circulant (Toeplitz) matrix, which is diagonalized by the Fourier basis:
$f \star g = \underbrace{\begin{bmatrix} g_1 & g_2 & \cdots & g_n \\ g_n & g_1 & \cdots & g_{n-1} \\ \vdots & & \ddots & \vdots \\ g_2 & g_3 & \cdots & g_1 \end{bmatrix}}_{\text{circulant matrix}} \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix} = \Phi \begin{bmatrix} \hat g_1 & & \\ & \ddots & \\ & & \hat g_n \end{bmatrix} \Phi^\top f$
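A minimal sketch (assumed, not from the slides) checking the convolution theorem on a discrete 1D Euclidean domain: circulant multiplication equals the pointwise product of DFTs.

```python
import numpy as np
from scipy.linalg import circulant

n = 8
f, g = np.random.randn(n), np.random.randn(n)

conv_circ = circulant(g) @ f   # circulant(g) has first column g: (f*g)_i = sum_m g_{(i-m) mod n} f_m
conv_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))   # Phi diag(g_hat) Phi^T f

assert np.allclose(conv_circ, conv_fft)
```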
44. Convolution on graphs[11]
Spectral convolution of $f, g \in L^2(\mathcal{V})$ can be defined by analogy:
$(f \star g)_i = \sum_{k \ge 1} \underbrace{\langle f, \phi_k \rangle_{L^2(\mathcal{V})}\, \langle g, \phi_k \rangle_{L^2(\mathcal{V})}}_{\text{product in the Fourier domain}}\, \underbrace{\phi_{k,i}}_{\text{inverse Fourier transform}}$
In matrix-vector notation:
$f \star g = \Phi\big((\Phi^\top g) \odot (\Phi^\top f)\big) = \underbrace{\Phi\, \mathrm{diag}(\hat g_1, \dots, \hat g_n)\, \Phi^\top}_{G}\, f = \Phi\, \hat g(\Lambda)\, \Phi^\top f = \hat g(\Phi \Lambda \Phi^\top)\, f = \hat g(\Delta)\, f$
Not shift-invariant! ($G$ has no circulant structure)
Commutes with the Laplacian: $\Delta G f = G \Delta f$
Filter coefficients depend on the basis $\phi_1, \dots, \phi_n$
Expensive computation (no FFT on graphs): $O(n^2)$
[11] Hammond, Vandergheynst, Gribonval, 2011
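A minimal sketch (assumed, not the authors' code) of this spectral definition of graph convolution:

```python
import numpy as np

def graph_conv(L, f, g):
    """Spectral convolution of two vertex signals f, g for a dense Laplacian L."""
    lam, Phi = np.linalg.eigh(L)          # O(n^3) eigendecomposition: no graph FFT exists
    f_hat, g_hat = Phi.T @ f, Phi.T @ g   # forward graph Fourier transforms
    return Phi @ (g_hat * f_hat)          # pointwise product, then inverse transform
```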
45. No shift invariance on graphs
A signal f on a graph can be translated to vertex i by convolution with a delta: $T_i f = f \star \delta_i$
[Figure: translated signals (a)-(c) $T_s f, T_{s'} f, T_{s''} f$ on a Euclidean domain vs. (d)-(f) $T_i f, T_{i'} f, T_{i''} f$ on a graph domain[12].]
[12] Shuman et al. 2016
47. Graph coarsening for graph pooling
Goals:
Pool similar local features (max pooling).
Series of pooling layers create invariance to global geometric
deformations.
Challenges:
Design a multi-scale coarsening algorithm that preserves non-linear graph structures.
How to make graph pooling fast?
48. Graph coarsening = graph partitioning
Graph partitioning decomposes G into smaller meaningful clusters.
NP-hard ⇒ approximation needed.
49. Balanced cuts
Powerful combinatorial graph partitioning models:
Normalized cut[13] (partitioning by min edge cuts): $\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{\mathrm{Cut}(C_k, C_k^c)}{\mathrm{Vol}(C_k)}$
Normalized association (partitioning by max vertex matching): $\max_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{\mathrm{Assoc}(C_k)}{\mathrm{Vol}(C_k)}$
where $\mathrm{Cut}(A, B) := \sum_{i \in A, j \in B} a_{ij}$, $\mathrm{Assoc}(A) := \sum_{i, j \in A} a_{ij}$, $\mathrm{Vol}(A) := \sum_{i \in A} d_i$, and $d_i := \sum_{j \in \mathcal{V}} a_{ij}$.
Both models are equivalent, but lead to different algorithms.
[13] Shi, Malik, 2000
50. Balanced cuts
Balanced cuts are NP-hard ⇒ the most popular approximation techniques focus on linear spectral relaxations (an eigenproblem with a global solution).
Graph geometry is generally not linear ⇒ the Graclus[14] algorithm computes non-linear clusters that locally maximize the normalized association.
Graclus offers control of the coarsening ratio (≈ 2, like an image grid) using heavy-edge matching[15].
[14] Dhillon, Guan, Kulis 2007
[15] Karypis, Kumar 1995
51. Heavy-Edge Matching (HEM)
HEM proceeds by two successive steps, vertex matching and graph coarsening (which guarantees a local solution of the normalized association):
(1) Vertex matching: $\{i, j\} = \arg\max_j \dfrac{a^l_{ii} + 2 a^l_{ij} + a^l_{jj}}{d^l_i + d^l_j}$
(2) Graph coarsening: $G^{l+1}$ defined by $a^{l+1}_{ij} = \mathrm{Cut}(C^l_i, C^l_j)$ and $a^{l+1}_{ii} = \mathrm{Assoc}(C^l_i)$
Matched vertices $\{i, j\}$ of $G^l$ are merged into a super-vertex at the next coarsening level, producing a sequence of coarsened graphs $G^l, G^{l+1}, G^{l+2}, \dots$
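A minimal sketch (assumed; the real Graclus/HEM implementation is more involved, and the function name is illustrative) of one coarsening level: greedy matching by the normalized-association score, then building the coarser weight matrix:

```python
import numpy as np

def hem_one_level(A):
    """A: dense symmetric weight matrix. Returns cluster assignment and coarsened weights."""
    n = A.shape[0]
    d = A.sum(axis=1)
    cluster = -np.ones(n, dtype=int)
    parents = []
    for i in np.random.permutation(n):               # visit vertices in random order
        if cluster[i] >= 0:
            continue
        # pick the unmatched neighbour j maximizing (a_ii + 2 a_ij + a_jj) / (d_i + d_j)
        best_j, best_gain = -1, -np.inf
        for j in np.flatnonzero(A[i] > 0):
            if cluster[j] < 0 and j != i:
                gain = (A[i, i] + 2 * A[i, j] + A[j, j]) / (d[i] + d[j])
                if gain > best_gain:
                    best_j, best_gain = j, gain
        members = [i] if best_j < 0 else [i, int(best_j)]
        for v in members:
            cluster[v] = len(parents)
        parents.append(members)
    # coarsened weights: off-diagonal = Cut between clusters, diagonal = Assoc within a cluster
    P = np.zeros((len(parents), n))
    P[cluster, np.arange(n)] = 1
    return cluster, P @ A @ P.T
```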
52. Unstructured pooling
Sequence of coarsened graphs produced by HEM: stores a table of indices for the graph and all its coarsened versions ⇒ computationally inefficient.
53. Fast graph pooling[18]
Structured pooling: arrangement of the node indexing such that adjacent nodes are hierarchically merged at the next coarser level.
As efficient as 1D Euclidean grid pooling.
[18] Defferrard, Bresson, Vandergheynst 2016
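A minimal sketch (assumed, not the paper's code): once vertices are re-indexed so that siblings in the coarsening tree are adjacent, graph max-pooling reduces to regular 1D pooling of the vertex signal, with fake/disconnected nodes padded by $-\infty$.

```python
import numpy as np

def graph_max_pool(x, p=2):
    """x: (n,) vertex signal ordered so that each group of p consecutive
    entries is merged into one super-vertex at the next coarser level."""
    return x.reshape(-1, p).max(axis=1)

x = np.array([3., 1., -np.inf, 4., 2., 2.])   # -inf marks a padded fake node
print(graph_max_pool(x))                       # -> [3. 4. 2.]
```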
55. Vanilla spectral graph ConvNets[16]
Graph convolutional layer
$f_l$ = $l$-th data feature on the graph, $\dim(f_l) = n \times 1$
$g_l$ = $l$-th feature map, $\dim(g_l) = n \times 1$
Convolutional layer: $g_l = \xi\Big(\sum_{l'=1}^{p} W_{l,l'} \star f_{l'}\Big)$
Activation, e.g. $\xi(x) = \max\{x, 0\}$ (rectified linear unit, ReLU)
56. Spectral graph convolution
The convolutional layer in the spatial domain,
$g_l = \xi\Big(\sum_{l'=1}^{p} W_{l,l'} \star f_{l'}\Big)$,
where $W_{l,l'}$ is the matrix of the graph spatial filter, can also be expressed in the spectral domain (using $g \star f = \Phi\, \hat g(\Lambda)\, \Phi^\top f$):
$g_l = \xi\Big(\sum_{l'=1}^{p} \Phi\, \hat W_{l,l'}\, \Phi^\top f_{l'}\Big)$,
where $\hat W_{l,l'}$ is the $n \times n$ diagonal matrix of the graph spectral filter.
We will denote the spectral filter without the hat symbol, i.e. $W_{l,l'}$.
57. Vanilla spectral graph ConvNets[16]
Series of spectral convolutional layers
$g^{(k)}_l = \xi\Big(\sum_{l'=1}^{q_{k-1}} \Phi\, W^{(k)}_{l,l'}\, \Phi^\top g^{(k-1)}_{l'}\Big)$,
with spectral coefficients $W^{(k)}_{l,l'}$ to be learned at each layer.
First graph CNN architecture
No guarantee of spatial localization of filters
O(n) parameters per layer
O(n²) computation of the forward and inverse Fourier transforms Φ, Φᵀ (no FFT on graphs)
Filters are basis-dependent ⇒ do not generalize across graphs
[16] Bruna, Zaremba, Szlam, LeCun 2014; Henaff, Bruna, LeCun 2015
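A minimal sketch (assumed, not the authors' implementation; names illustrative) of one such vanilla spectral layer, with a free spectral coefficient per eigenvalue and channel pair:

```python
import numpy as np

def vanilla_spectral_layer(Phi, F, W_hat):
    """Phi: (n, n) Laplacian eigenvectors; F: (n, p) input feature maps;
    W_hat: (p, q, n) learned spectral coefficients (one diagonal filter per
    input/output channel pair). Returns (n, q) output feature maps."""
    n, p = F.shape
    q = W_hat.shape[1]
    F_hat = Phi.T @ F                                         # graph Fourier transform of inputs
    G = np.zeros((n, q))
    for l in range(q):
        for lp in range(p):
            G[:, l] += Phi @ (W_hat[lp, l] * F_hat[:, lp])    # Phi diag(w) Phi^T f
    return np.maximum(G, 0)                                   # ReLU activation
```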
58. Spatial localization and spectral smoothness
In the Euclidean setting (by Parseval's identity):
$\int_{-\infty}^{+\infty} |x|^{2k}\, |w(x)|^2\, dx = \int_{-\infty}^{+\infty} \Big| \frac{\partial^k \hat w(\lambda)}{\partial \lambda^k} \Big|^2 d\lambda$
Localization in space = smoothness in the frequency domain:
smooth spectral filter function $\hat w(\lambda)$ ⇔ spatial localization of $w(x)$.
59. Smooth parametric spectral filter
[17] (Litman, Bronstein, 2014); Henaff, Bruna, LeCun 2015
Parametrize the smooth spectral filter function with a linear combination of smooth kernel functions $\beta_1(\lambda), \dots, \beta_r(\lambda)$, e.g. splines[17]:
$w_\alpha(\lambda) = \sum_{j=1}^{r} \alpha_j\, \beta_j(\lambda)$, so that $w_\alpha(\lambda_i) = \sum_{j=1}^{r} \alpha_j\, \beta_j(\lambda_i) = (B\alpha)_i \;\Rightarrow\; W = \mathrm{Diag}(B\alpha)$,
where $\alpha = (\alpha_1, \dots, \alpha_r)^\top$ is the vector of filter parameters.
60. Smooth parametric spectral filter
Graph convolutional layer: apply the spectral filter $w$ to a feature signal $f$:
$w(\Delta) f = \Phi\, w(\Lambda)\, \Phi^\top f = \Phi \begin{pmatrix} w(\lambda_1) & & \\ & \ddots & \\ & & w(\lambda_n) \end{pmatrix} \Phi^\top f$
Application of the parametric spectral filter with learnable parameters $\alpha$:
$w_\alpha(\Delta) f = \Phi \begin{pmatrix} w_\alpha(\lambda_1) & & \\ & \ddots & \\ & & w_\alpha(\lambda_n) \end{pmatrix} \Phi^\top f$
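A minimal sketch (assumed; the slides do not fix the kernel family, so cubic B-splines are used here as an illustrative choice) of a smooth parametric spectral filter applied as $\Phi\,\mathrm{Diag}(B\alpha)\,\Phi^\top f$:

```python
import numpy as np
from scipy.interpolate import BSpline

def spline_basis(lam, r):
    """B[i, j] = beta_j(lambda_i) for r cubic B-spline kernels on [lam.min(), lam.max()]."""
    knots = np.linspace(lam.min(), lam.max(), r + 4)
    return np.stack([BSpline.basis_element(knots[j:j + 5], extrapolate=False)(lam)
                     for j in range(r)], axis=1)

def spline_filter(Phi, lam, alpha, f):
    B = np.nan_to_num(spline_basis(lam, len(alpha)))   # (n, r), zeros outside each kernel's support
    return Phi @ ((B @ alpha) * (Phi.T @ f))           # Phi Diag(B alpha) Phi^T f
```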
61. SplineNets[17]
Series of spectral convolutional layers
$g^{(k)}_l = \xi\Big(\sum_{l'=1}^{q_{k-1}} \Phi\, W^{(k)}_{l,l'}\, \Phi^\top g^{(k-1)}_{l'}\Big)$,
with smooth spectral parametric coefficients $W^{(k)}_{l,l'} = \mathrm{Diag}(B\alpha^{(k)}_{l,l'})$ to be learned at each layer.
Fast-decaying filters in space
O(n) parameters per layer
O(n²) computation of the forward and inverse Fourier transforms Φ, Φᵀ (no FFT on graphs)
Filters are basis-dependent ⇒ do not generalize across graphs
[17] Henaff, Bruna, LeCun 2015
63. Spectral polynomial filters[18]
Represent smooth spectral functions with polynomials of the Laplacian eigenvalues:
$w_\alpha(\lambda) = \sum_{j=0}^{r} \alpha_j\, \lambda^j$,
where $\alpha = (\alpha_0, \dots, \alpha_r)^\top$ is the vector of filter parameters.
Convolutional layer: apply the spectral filter to a feature signal $f$:
$w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j\, \Delta^j f$
Key observation: each application of the Laplacian increases the support of a function by 1 hop ⇒ exact control of the size of Laplacian-based filters.
[Figure: a heat source s spreading to its 1-hop, 2-hop, ..., j-hop neighborhoods.]
[18] Defferrard, Bresson, Vandergheynst 2016
64. Linear complexity
Application of the filter to a feature signal $f$: $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j\, \Delta^j f$
Denote $\bar X_j = \Delta^j f$ and define $\bar X_0 = f$ and the sequence $\bar X_j = \Delta \bar X_{j-1}$ (so $\bar X_1 = \Delta \bar X_0 = \Delta f$), giving $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j \bar X_j$.
Two important observations:
1. No need to compute the eigendecomposition of the Laplacian (Φ, Λ).
2. The $\{\bar X_j\}$ are generated by multiplications of a sparse matrix by a vector ⇒ complexity is $O(|\mathcal{E}|\, r) = O(n)$ for sparse graphs.
Graph convolutional layers are GPU friendly.
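A minimal sketch (assumed, not the paper's code) of this recursion: only sparse matrix-vector products, no eigendecomposition.

```python
import numpy as np
import scipy.sparse as sp

def poly_filter(L, f, alpha):
    """L: sparse (n x n) Laplacian; f: (n,) signal; alpha: (r+1,) polynomial coefficients."""
    L = sp.csr_matrix(L)
    X = f.copy()
    out = alpha[0] * X
    for a in alpha[1:]:
        X = L @ X                  # one more hop of support per multiplication
        out += a * X
    return out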
65. Spectral graph ConvNets with polynomial filters
Series of spectral convolutional layers
$g^{(k)}_l = \xi\Big(\sum_{l'=1}^{q_{k-1}} \Phi\, W^{(k)}_{l,l'}\, \Phi^\top g^{(k-1)}_{l'}\Big)$,
with spectral polynomial coefficients to be learned at each layer.
Filters are exactly localized in an r-hop support
O(1) parameters per layer
No computation of Φ, Φᵀ ⇒ O(n) computational complexity (assuming sparsely-connected graphs)
Unstable under coefficient perturbation (hard to optimize)
Filters are basis-dependent ⇒ do not generalize across graphs
[18] Defferrard, Bresson, Vandergheynst 2016
66. Chebyshev polynomials
Graph convolution with the (non-orthogonal) monomial basis $1, x, x^2, x^3, \dots$:
$w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j\, \Delta^j f$
Graph convolution with (orthogonal) Chebyshev polynomials $T_0, T_1, T_2, T_3, \dots$:
$w_\alpha(\tilde\Delta) f = \sum_{j=0}^{r} \alpha_j\, T_j(\tilde\Delta) f$
Chebyshev polynomials are orthonormal on $L^2([-1, +1])$ w.r.t. $\langle f, g \rangle = \int_{-1}^{+1} \frac{f(\tilde\lambda)\, g(\tilde\lambda)}{\sqrt{1 - \tilde\lambda^2}}\, d\tilde\lambda$ and stable under perturbation of the coefficients.
67. ChebNets[18]
[18] Defferrard, Bresson, Vandergheynst 2016
Application of the filter with the scaled Laplacian $\tilde\Delta = 2\lambda_n^{-1} \Delta - I$:
$w_\alpha(\tilde\Delta) f = \sum_{j=0}^{r} \alpha_j\, T_j(\tilde\Delta) f = \sum_{j=0}^{r} \alpha_j\, X^{(j)}$
with $X^{(j)} = T_j(\tilde\Delta) f = 2 \tilde\Delta X^{(j-1)} - X^{(j-2)}$, $X^{(0)} = f$, $X^{(1)} = \tilde\Delta f$
Filters are exactly localized in an r-hop support
O(1) parameters per layer
No computation of Φ, Φᵀ ⇒ O(n) computational complexity (assuming sparsely-connected graphs)
Stable under coefficient perturbation
Filters are basis-dependent ⇒ do not generalize across graphs
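A minimal sketch (assumed, not the authors' implementation) of a Chebyshev spectral filter using this recurrence:

```python
import numpy as np
import scipy.sparse as sp

def cheb_filter(L, lmax, f, alpha):
    """L: sparse Laplacian; lmax: largest eigenvalue; alpha: (r+1,) Chebyshev coefficients."""
    n = L.shape[0]
    L_tilde = (2.0 / lmax) * sp.csr_matrix(L) - sp.eye(n)    # rescale the spectrum to [-1, 1]
    X_prev, X = f, L_tilde @ f                                # X^(0), X^(1)
    out = alpha[0] * X_prev + (alpha[1] * X if len(alpha) > 1 else 0)
    for a in alpha[2:]:
        X_prev, X = X, 2 * (L_tilde @ X) - X_prev             # Chebyshev recurrence
        out += a * X
    return out

# lmax can be estimated with scipy.sparse.linalg.eigsh(L, k=1, which='LM')
```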
70. Graph convolutional nets[19]: simplified ChebNets
[19] Kipf, Welling 2016
Use Chebyshev polynomials of degree r = 2 and assume $\lambda_n \approx 2$:
$w_\alpha(\Delta) f = \alpha_0 f + \alpha_1 (\Delta - I) f = \alpha_0 f - \alpha_1 D^{-1/2} W D^{-1/2} f$
Further constrain $\alpha = \alpha_0 = -\alpha_1$ to obtain a single-parameter filter:
$w_\alpha(\Delta) f = \alpha \big( I + D^{-1/2} W D^{-1/2} \big) f$
Caveat: the eigenvalues of $I + D^{-1/2} W D^{-1/2}$ are now in [0, 2] ⇒ repeated application of the filter results in numerical instability.
Fix: apply a renormalization $w_\alpha(\Delta) f = \alpha\, \tilde D^{-1/2} \tilde W \tilde D^{-1/2} f$, with $\tilde W = W + I$ and $\tilde D = \mathrm{diag}\big(\sum_j \tilde w_{ij}\big)$.
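A minimal sketch (assumed, not the authors' code) of the single-parameter filter with the renormalization trick:

```python
import numpy as np

def gcn_filter(W, f, alpha=1.0):
    """W: dense (n x n) adjacency/weight matrix; f: (n,) or (n, p) vertex features."""
    W_tilde = W + np.eye(W.shape[0])                  # add self-loops: W~ = W + I
    d_tilde = W_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return alpha * (D_inv_sqrt @ W_tilde @ D_inv_sqrt @ f)
```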
73. Community graphs
[Figure: synthetic graph with 15 communities; normalized Laplacian eigenvalues; Chebyshev filters learned on the 15-communities graph.]
ChebNet has difficulty producing filters concentrated on the frequency bands of interest.
74. Spectral graph ConvNets with Cayley polynomials[20]
Cayley polynomials are a family of real-valued rational functions with complex coefficients:
$p(\lambda) = c_0 + 2\,\mathrm{Re}\Big\{ \sum_{j=1}^{r} c_j\, \underbrace{\Big( \frac{\lambda - i}{\lambda + i} \Big)}_{C(\lambda)}^{j} \Big\}$
Cayley transform: $C(\lambda) = \frac{\lambda - i}{\lambda + i}$ is a smooth bijection from $\mathbb{R}$ to $e^{i\mathbb{R}} \setminus \{1\}$ (the unit circle minus $\{1\}$).
Expressivity: since $z^{-1} = \bar z$ for $z \in e^{i\mathbb{R}}$ and $2\,\mathrm{Re}\{z\} = z + \bar z$, writing $C(\lambda) = e^{i\omega(\lambda)}$ we can rewrite $p(\lambda)$ as a real trigonometric polynomial w.r.t. $C(\lambda)$:
$p(\lambda) = c_0 + \sum_{j=1}^{r} \big( c_j\, \underbrace{C^j(\lambda)}_{e^{ij\omega}} + \bar c_j\, \underbrace{C^{-j}(\lambda)}_{e^{-ij\omega}} \big) = c_0 + 2 \sum_{j=1}^{r} \big( \mathrm{Re}\{c_j\} \cos(j\omega) - \mathrm{Im}\{c_j\} \sin(j\omega) \big)$
[20] (Cayley 1846); Levie, Monti, Bresson, Bronstein 2017
75. Spectral zoom[20]
Applying the Cayley transform to the scaled Laplacian $h\Delta$,
$C(h\Delta) = (h\Delta - iI)(h\Delta + iI)^{-1}$,
results in a non-linear transformation of the eigenvalues controlled by the zoom parameter $h$ (spectral zoom).
[Figure: Laplacian spectrum of the 15-communities graph and its Cayley transform $C(h\Delta)$ for h = 0.1, 1, 10.]
[20] Levie, Monti, Bresson, Bronstein 2017
76. Fast inversion
Application of a Cayley filter $w(h\Delta) f$ requires the solution of a recursive linear system:
$y_0 = f$, $(h\Delta + iI)\, y_j = (h\Delta - iI)\, y_{j-1}$, $j = 1, \dots, r$
Exact (stable) solution: $O(n^3)$ complexity.
Approximate solution using K Jacobi iterations:
$\tilde y_j^{(k+1)} = J\, \tilde y_j^{(k)} + \mathrm{Diag}^{-1}(h\Delta + iI)(h\Delta - iI)\, \tilde y_{j-1}$, $\quad \tilde y_j^{(0)} = \mathrm{Diag}^{-1}(h\Delta + iI)(h\Delta - iI)\, \tilde y_{j-1}$,
with $J = -\mathrm{Diag}^{-1}(h\Delta + iI)\, \mathrm{Off}(h\Delta + iI)$
Application of the approximate Cayley filter: $\sum_{j=0}^{r} c_j\, \tilde y_j \approx \tau(h\Delta) f$
Complexity: $O(|\mathcal{E}|\, r K) = O(n)$
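A minimal sketch (assumed, not the CayleyNet code; names illustrative) of applying a Cayley filter with K Jacobi iterations per linear solve, in complex NumPy arithmetic:

```python
import numpy as np

def cayley_filter(L, f, c, h=1.0, K=10):
    """L: dense (n x n) Laplacian; c: complex coefficients (c_0, ..., c_r)."""
    n = L.shape[0]
    A = h * L + 1j * np.eye(n)                    # h*Delta + iI
    B = h * L - 1j * np.eye(n)                    # h*Delta - iI
    Dinv = np.diag(1.0 / np.diag(A))              # Diag^{-1}(h*Delta + iI)
    J = -Dinv @ (A - np.diag(np.diag(A)))         # J = -Diag^{-1}(A) Off(A)
    y = f.astype(complex)
    out = c[0].real * f
    for cj in c[1:]:
        b = Dinv @ (B @ y)                        # right-hand side of the Jacobi update
        y_new = b.copy()                          # y~^(0)
        for _ in range(K):
            y_new = J @ y_new + b                 # Jacobi iterations for (A y_j = B y_{j-1})
        y = y_new
        out = out + 2 * np.real(cj * y)           # c_0 f + 2 Re{ sum_j c_j y_j }
    return out
```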
77. CayleyNets[20]
[20] Levie, Monti, Bresson, Bronstein 2017
Represent the spectral transfer function as a Cayley polynomial of order r:
$w_{c,h}(\lambda) = c_0 + 2\,\mathrm{Re}\Big\{ \sum_{j=1}^{r} c_j\, (h\lambda - i)^j (h\lambda + i)^{-j} \Big\}$,
where the filter parameters are the vector of complex coefficients $c = (c_0, \dots, c_r)^\top$ and the spectral zoom $h$.
Filters have guaranteed exponential spatial decay
O(1) parameters per layer
O(n) computational complexity with approximate Jacobi inversion (assuming sparsely-connected graphs)
Spectral zoom property allows better localization in frequency
Richer class of filters than Chebyshev for the same order
Filters are basis-dependent ⇒ do not generalize across graphs
86. Factorized matrix completion models[23]
$\min_{W \in \mathbb{R}^{m \times s},\, H \in \mathbb{R}^{n \times s}}\; \mu\, \|\Omega \circ (X - A)\|_F^2 + \mu_c \|W\|_F^2 + \mu_r \|H\|_F^2$, with $X = W H^\top$
O(n+m) variables instead of O(nm)
rank(X) = rank(WHᵀ) ≤ s by construction
[23] Srebro, Rennie, Jaakkola 2004
87. Factorized matrix completion on graphs[24]
$\min_{W \in \mathbb{R}^{m \times s},\, H \in \mathbb{R}^{n \times s}}\; \mu\, \|\Omega \circ (X - A)\|_F^2 + \mu_c\, \mathrm{tr}(H^\top \Delta_c H) + \mu_r\, \mathrm{tr}(W^\top \Delta_r W)$, with $X = W H^\top$
O(n+m) variables instead of O(nm)
rank(X) = rank(WHᵀ) ≤ s by construction
[24] Rao et al. 2015
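A minimal sketch (assumed; names and regularization weights are illustrative) evaluating the graph-regularized factorized objective of slides 86-87, with $X = W H^\top$ and $\Delta_r, \Delta_c$ the row- and column-graph Laplacians:

```python
import numpy as np

def objective(W, H, A, Omega, L_r, L_c, mu=1.0, mu_r=1e-3, mu_c=1e-3):
    """A: (m x n) ratings with observed entries masked by Omega (0/1 matrix)."""
    X = W @ H.T
    data = mu * np.sum((Omega * (X - A)) ** 2)            # fit on observed entries only
    return data + mu_c * np.trace(H.T @ L_c @ H) + mu_r * np.trace(W.T @ L_r @ W)
```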
89. 2D Fourier transform
Column-wise transform + Row-wise transform = 2D transform
From 1D signal processing to 2D image processing
90. Multi-graph Fourier transform and convolution[25]
Multi-graph Fourier transform: $\hat X = \Phi_r^\top X\, \Phi_c$
where $\Phi_c, \Phi_r$ are the eigenvectors of the column- and row-graph Laplacians $\Delta_c, \Delta_r$, respectively.
Multi-graph spectral convolution:
$X \star Y = \Phi_r \big( (\Phi_r^\top X \Phi_c) \circ (\Phi_r^\top Y \Phi_c) \big) \Phi_c^\top = \Phi_r (\hat X \circ \hat Y) \Phi_c^\top$
[25] Monti, Bresson, Bronstein 2017
91. Multi-graph spectral filters
Parametrize the multi-graph filter as a smooth function of the column and row frequencies, e.g. with a bivariate Chebyshev polynomial:
$w_\Theta(\tilde\lambda_c, \tilde\lambda_r) = \sum_{j,j'=0}^{r} \theta_{jj'}\, T_j(\tilde\lambda_c)\, T_{j'}(\tilde\lambda_r)$,
where $\Theta = (\theta_{jj'})$ is the matrix of coefficients and $-1 \le \tilde\lambda_c, \tilde\lambda_r \le 1$.
Multi-graph convolutional layer: apply the filter to a matrix X:
$Y = \sum_{j,j'=0}^{r} \theta_{jj'}\, T_{j'}(\tilde\Delta_r)\, X\, T_j(\tilde\Delta_c)$,
where the scaled Laplacians are $\tilde\Delta_c = 2\lambda_{c,n}^{-1} \Delta_c - I$ and $\tilde\Delta_r = 2\lambda_{r,m}^{-1} \Delta_r - I$.
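A minimal sketch (assumed, not the paper's code; function names illustrative) of a separable bivariate Chebyshev filter on the row and column graphs, $Y = \sum_{j,j'} \theta_{jj'}\, T_j(\tilde\Delta_r)\, X\, T_{j'}(\tilde\Delta_c)$:

```python
import numpy as np

def cheb_basis(L_tilde, X, r, side):
    """Return [T_0, ..., T_r] applied to X from the left ('row') or the right ('col')."""
    mul = (lambda T: L_tilde @ T) if side == 'row' else (lambda T: T @ L_tilde)
    Ts = [X, mul(X)]
    for _ in range(2, r + 1):
        Ts.append(2 * mul(Ts[-1]) - Ts[-2])       # Chebyshev recurrence
    return Ts[:r + 1]

def multigraph_filter(Lr_tilde, Lc_tilde, X, Theta):
    """Lr_tilde: (m x m), Lc_tilde: (n x n) scaled Laplacians; X: (m x n); Theta: (r+1, r+1)."""
    r = Theta.shape[0] - 1
    rows = cheb_basis(Lr_tilde, X, r, 'row')                 # T_j(Lr~) X
    Y = np.zeros_like(X)
    for j in range(r + 1):
        cols = cheb_basis(Lc_tilde, rows[j], r, 'col')       # T_j(Lr~) X T_{j'}(Lc~)
        for jp in range(r + 1):
            Y += Theta[j, jp] * cols[jp]
    return Y
```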
92. Multi-graph ConvNets[25]
[25] Monti, Bresson, Bronstein 2017
Multi-graph spectral filters applied to p input channels ($m \times n$ matrices $Y^{(k-1)}_1, \dots, Y^{(k-1)}_p$) and producing q output channels $Y^{(k)}_1, \dots, Y^{(k)}_q$:
$Y^{(k)}_l = \xi\Big( \sum_{l'=1}^{p} \sum_{j,j'=0}^{r} \theta_{jj'll'}\, T_j(\tilde\Delta_r)\, Y^{(k-1)}_{l'}\, T_{j'}(\tilde\Delta_c) \Big)$, $\quad l = 1, \dots, q$
Filters have guaranteed r-hop support on both graphs
O(1) parameters per layer
O(nm) computational complexity (assuming sparse graphs)
93. Factorized multi-graph ConvNets
Two spectral convolutional layers applied to each of the factors W, H:
$u_l = \xi\Big( \sum_{l'=1}^{p} \sum_{j=0}^{r} \theta_{r,jll'}\, T_j(\tilde\Delta_r)\, w_{l'} \Big)$, $\quad v_l = \xi\Big( \sum_{l'=1}^{p} \sum_{j'=0}^{r} \theta_{c,j'll'}\, T_{j'}(\tilde\Delta_c)\, h_{l'} \Big)$, $\quad l = 1, \dots, q$
Filters have guaranteed r-hop support on both graphs
O(1) parameters per layer
O(n+m) computational complexity (assuming sparse graphs)
May lose some coupling information
94. Matrix completion with Recurrent Multi-Graph ConvNets[25]
Extract compositional features from the graphs of users and items.
Complexity: O(nm)
[25] Monti, Bresson, Bronstein 2017
96. Incremental updates with RNNs
The RNN/LSTM[26] casts the completion problem as a non-linear diffusion process: the RNN predicts an incremental update dX(t), and X(t+1) = X(t) + dX(t).
Replacing the multi-layer CNN by a simple RNN + diffusion:
far fewer parameters to learn (reduces overfitting);
more favorable for highly sparse entries (better information diffusion);
fixes the number of parameters of the model independently of the entry sparsity;
allows learning the temporal dynamics of the entries if such information is available.
[26] Hochreiter, Schmidhuber 1997
101. Conclusion
Contributions:
Generalization of ConvNets to data on (fixed) graphs
Exact/tight localized filters on graphs
Linear complexity for sparse graphs
GPU implementation
Rich expressivity
Multiple graphs
Several potential applications in network sciences.
102. Papers & code
Papers
Convolutional neural networks on graphs with fast localized spectral
filtering, M Defferrard, X Bresson, P Vandergheynst, NIPS, 2016,
arXiv:1606.09375
Geometric matrix completion with recurrent multi-graph neural
networks, F Monti, MM Bronstein, X Bresson, NIPS, 2017,
arXiv:1704.06803
CayleyNets: Graph Convolutional Neural Networks with Complex
Rational Spectral Filters, R Levie, F Monti, X Bresson, MM Bronstein,
2017, arXiv:1705.07664
Code
https://github.com/xbresson/graph_convnets_pytorch
103. Tutorials
Computer Vision and Pattern Recognition (CVPR)
Geometric deep learning on graphs and manifolds
M. Bronstein, J. Bruna, A. Szlam, X. Bresson, Y. LeCun
July 21, 2017
Slides: https://www.dropbox.com/s/appafg1fumb6u7f/tutorial_CVPR17_DL_graphs.pdf?dl=0
Video will be posted on YouTube/ComputerVisionFoundation Videos
Neural Information Processing Systems (NIPS)
Geometric deep learning on graphs and manifolds
M. Bronstein, J. Bruna, A. Szlam, X. Bresson, Y. LeCun
Dec 4, 2017
104. Workshop
New Deep Learning Techniques
Feb 5-9, 2018
Organizers
Xavier Bresson, NTU
Michael Bronstein, USI/Intel
Joan Bruna, NYU/Berkeley
Yann LeCun, NYU/Facebook
Stanley Osher, UCLA
Arthur Szlam, Facebook
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques