Tensor Train decomposition in machine learning

Tensor Train in machine learning
Alexander Novikov
October 11, 2016

Recommender systems
Assume low-rank structure.

Tensor Train summary
Tensor Train (TT) decomposition [Oseledets 2011]:
A compact representation for tensors (= multidimensional arrays);
Allows for efficient application of linear algebra operations.

Low-rank decomposition
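A matrix $A$ of rank $r$ factorizes as $A = G_1 G_2$, where $G_1$ is a collection of rows and $G_2$ is a collection of columns:
$$A_{i_1 i_2} = \underbrace{G_1[i_1]}_{1 \times r} \, \underbrace{G_2[i_2]}_{r \times 1}.$$
For example, with $i_1 = 2$ and $i_2 = 3$: $A_{23} = G_1[2] \, G_2[3]$.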

Tensor Train decomposition
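$$A_{i_1 \ldots i_d} = \underbrace{G_1[i_1]}_{1 \times r} \, \underbrace{G_2[i_2]}_{r \times r} \cdots \underbrace{G_d[i_d]}_{r \times 1}$$
An example of computing one element of a 4-dimensional tensor, with $i_1 = 2$, $i_2 = 4$, $i_3 = 2$, $i_4 = 3$:
$$A_{2423} = G_1[2] \, G_2[4] \, G_3[2] \, G_4[3].$$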
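The element lookup above can be sketched in NumPy. This is a minimal illustration, assuming each core is stored as an array of shape `(r_prev, n, r_next)`; that storage layout, the random cores, and the helper name `tt_element` are assumptions made for the example, not part of the talk.

```python
import numpy as np

def tt_element(cores, index):
    """Return A[index] as the matrix product G1[i1] @ G2[i2] @ ... @ Gd[id]."""
    result = np.ones((1, 1))
    for core, i in zip(cores, index):
        result = result @ core[:, i, :]  # (1, r_prev) @ (r_prev, r_next)
    return result[0, 0]

# Hypothetical random cores for a 3 x 4 x 2 x 3 tensor with TT-rank 2.
rng = np.random.default_rng(0)
shape = (3, 4, 2, 3)
ranks = [1, 2, 2, 2, 1]
cores = [rng.standard_normal((ranks[k], shape[k], ranks[k + 1])) for k in range(4)]

# The element A_{2423} (1-based indices) is index (1, 3, 1, 2) in 0-based NumPy terms.
value = tt_element(cores, (1, 3, 1, 2))

# Cross-check against the full tensor obtained by contracting all cores.
full = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)
```

Each lookup costs $d$ small matrix products and never forms the full tensor; the `einsum` line is there only to check the result.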

Tensor Train decomposition Cont’d
A tensor $A$ is said to be in the TT-format if
$$A_{i_1, \ldots, i_d} = G_1[i_1] \, G_2[i_2] \cdots G_d[i_d], \quad i_k \in \{1, \ldots, n\},$$
where $G_k[i_k]$ is a matrix of size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.
Notation & terminology:
$G_k$: TT-cores;
$r_k$: TT-ranks;
$r = \max_{k = 0, \ldots, d} r_k$: the maximal TT-rank.
The TT-format uses $O(n d r^2)$ memory to store $n^d$ elements. It is efficient only if the TT-rank is small.
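To make the memory estimate concrete, here is a small worked count in plain Python. The values $n = 10$, $d = 10$, $r = 5$ are illustrative assumptions, and `tt_num_params` is a hypothetical helper name.

```python
def tt_num_params(shape, ranks):
    """Number of stored values: sum over k of r_{k-1} * n_k * r_k."""
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(shape))

n, d, r = 10, 10, 5
shape = [n] * d
ranks = [1] + [r] * (d - 1) + [1]  # r_0 = r_d = 1, all inner ranks equal r

tt_params = tt_num_params(shape, ranks)  # 2 * 50 + 8 * 250 = 2100
full_size = n ** d                       # 10_000_000_000
upper_bound = n * d * r ** 2             # the O(n d r^2) bound: 2500
```

The exact count stays below $n d r^2$ because the first and last cores have a boundary rank of 1.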

TT-format: example
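$$A_{i_1, i_2, i_3} = i_1 + i_2 + i_3,$$
$$i_1 \in \{1, 2, 3\}, \quad i_2 \in \{1, 2, 3, 4\}, \quad i_3 \in \{1, 2, 3, 4, 5\}.$$
$$A_{i_1, i_2, i_3} = G_1[i_1] \, G_2[i_2] \, G_3[i_3],$$
$$G_1[i_1] = \begin{pmatrix} i_1 & 1 \end{pmatrix}, \quad G_2[i_2] = \begin{pmatrix} 1 & 0 \\ i_2 & 1 \end{pmatrix}, \quad G_3[i_3] = \begin{pmatrix} 1 \\ i_3 \end{pmatrix}.$$
Let's check:
$$A(i_1, i_2, i_3) = \begin{pmatrix} i_1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ i_2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ i_3 \end{pmatrix} = \begin{pmatrix} i_1 + i_2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ i_3 \end{pmatrix} = i_1 + i_2 + i_3.$$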

TT-format: example
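$$A_{i_1, i_2, i_3} = i_1 + i_2 + i_3,$$
$$i_1 \in \{1, 2, 3\}, \quad i_2 \in \{1, 2, 3, 4\}, \quad i_3 \in \{1, 2, 3, 4, 5\}.$$
$$A_{i_1, i_2, i_3} = G_1[i_1] \, G_2[i_2] \, G_3[i_3],$$
$$G_1 = \left( \begin{pmatrix} 1 & 1 \end{pmatrix}, \begin{pmatrix} 2 & 1 \end{pmatrix}, \begin{pmatrix} 3 & 1 \end{pmatrix} \right)$$
$$G_2 = \left( \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 3 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 4 & 1 \end{pmatrix} \right)$$
$$G_3 = \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \begin{pmatrix} 1 \\ 4 \end{pmatrix}, \begin{pmatrix} 1 \\ 5 \end{pmatrix} \right)$$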
The tensor has 3 · 4 · 5 = 60 elements.
The TT-format uses 32 parameters to describe it.
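The example can be checked numerically. The sketch below builds the three cores as small NumPy matrices (the function names `G1`, `G2`, `G3` simply mirror the slide's symbols), verifies all 60 elements, and counts the stored parameters.

```python
import numpy as np
from itertools import product

def G1(i):
    return np.array([[i, 1]])          # 1 x 2

def G2(i):
    return np.array([[1, 0], [i, 1]])  # 2 x 2

def G3(i):
    return np.array([[1], [i]])        # 2 x 1

n1, n2, n3 = 3, 4, 5

# Every element G1[i1] G2[i2] G3[i3] should equal i1 + i2 + i3 (1-based indices).
all_match = all(
    (G1(i1) @ G2(i2) @ G3(i3))[0, 0] == i1 + i2 + i3
    for i1, i2, i3 in product(range(1, n1 + 1), range(1, n2 + 1), range(1, n3 + 1))
)

# 3 row vectors of 2 entries, 4 matrices of 4 entries, 5 column vectors of 2 entries.
num_params = n1 * G1(1).size + n2 * G2(1).size + n3 * G3(1).size
num_elements = n1 * n2 * n3
```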

Sum of tensors
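Tensors $A$ and $B$ are in the TT-format:
$$A_{i_1 \ldots i_d} = G_1^A[i_1] \cdots G_d^A[i_d], \quad B_{i_1 \ldots i_d} = G_1^B[i_1] \cdots G_d^B[i_d].$$
Find the TT-format of $C = A + B$:
$$C_{i_1 \ldots i_d} = A_{i_1 \ldots i_d} + B_{i_1 \ldots i_d}.$$
TT-cores of the result:
$$G_k^C[i_k] = \begin{pmatrix} G_k^A[i_k] & 0 \\ 0 & G_k^B[i_k] \end{pmatrix}, \quad k = 2, \ldots, d - 1,$$
$$G_1^C[i_1] = \begin{pmatrix} G_1^A[i_1] & G_1^B[i_1] \end{pmatrix}, \quad G_d^C[i_d] = \begin{pmatrix} G_d^A[i_d] \\ G_d^B[i_d] \end{pmatrix}.$$
The TT-ranks of the result are the sums of the TT-ranks of $A$ and $B$.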
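The block construction of the summed cores can be sketched as follows. It assumes cores stored as arrays of shape `(r_prev, n, r_next)` and $d \ge 2$; the helper names `random_tt`, `tt_full` and `tt_add` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def random_tt(shape, rank, rng):
    """Random TT-cores of shape (r_prev, n, r_next) with all inner ranks equal to `rank`."""
    ranks = [1] + [rank] * (len(shape) - 1) + [1]
    return [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k, n in enumerate(shape)]

def tt_full(cores):
    """Contract the cores into the full tensor (only feasible for small examples)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]

def tt_add(cores_a, cores_b):
    """TT-cores of A + B: row block first, block-diagonal in the middle, column block last."""
    d = len(cores_a)
    result = []
    for k, (ga, gb) in enumerate(zip(cores_a, cores_b)):
        ra0, n, ra1 = ga.shape
        rb0, _, rb1 = gb.shape
        if k == 0:
            core = np.concatenate([ga, gb], axis=2)   # [G^A  G^B]
        elif k == d - 1:
            core = np.concatenate([ga, gb], axis=0)   # [G^A; G^B]
        else:
            core = np.zeros((ra0 + rb0, n, ra1 + rb1))
            core[:ra0, :, :ra1] = ga
            core[ra0:, :, ra1:] = gb
        result.append(core)
    return result

rng = np.random.default_rng(0)
shape = (2, 3, 4)
cores_a = random_tt(shape, 2, rng)
cores_b = random_tt(shape, 3, rng)
cores_c = tt_add(cores_a, cores_b)
inner_ranks_c = [core.shape[2] for core in cores_c[:-1]]  # 2 + 3 at every inner bond
```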

TT-rounding
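Given a tensor $A$ in the TT-format with rank $r$, the TT-rounding [Oseledets, 2011]
$$\tilde{A} = \text{tt-round}(A, \varepsilon), \quad \varepsilon > 0,$$
finds a tensor $\tilde{A}$ such that
1. $\|A - \tilde{A}\|_F \le \varepsilon \|A\|_F$;
2. the TT-rank of $\tilde{A}$ is minimal among all $B$ with $\|A - B\|_F \le \frac{\varepsilon}{\sqrt{d - 1}} \|A\|_F$,
where $\|A\|_F = \sqrt{\sum_{i_1, \ldots, i_d} A_{i_1, \ldots, i_d}^2}$.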
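A compact sketch of one common way to implement TT-rounding: a right-to-left QR sweep followed by a left-to-right truncated-SVD sweep. It assumes cores of shape `(r_prev, n, r_next)`; the names `tt_round` and `tt_full` are hypothetical here, and this is an illustration under those assumptions rather than the talk's own code.

```python
import numpy as np

def tt_full(cores):
    """Contract the cores into the full tensor (small examples only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]

def tt_round(cores, eps):
    cores = [c.copy() for c in cores]
    d = len(cores)
    # Right-to-left sweep: make cores d, ..., 2 row-orthonormal via QR.
    for k in range(d - 1, 0, -1):
        r0, n, r1 = cores[k].shape
        q, r = np.linalg.qr(cores[k].reshape(r0, n * r1).T)
        cores[k] = q.T.reshape(-1, n, r1)
        cores[k - 1] = np.tensordot(cores[k - 1], r.T, axes=([2], [0]))
    # After the sweep, the Frobenius norm of A equals the norm of the first core.
    delta = eps / np.sqrt(max(d - 1, 1)) * np.linalg.norm(cores[0])
    # Left-to-right sweep: truncate each bond with an SVD.
    for k in range(d - 1):
        r0, n, r1 = cores[k].shape
        u, s, vt = np.linalg.svd(cores[k].reshape(r0 * n, r1), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[j] = norm of s[j:]
        rank = max(1, int(np.sum(tail > delta)))       # discarded part has norm <= delta
        cores[k] = u[:, :rank].reshape(r0, n, rank)
        carry = s[:rank, None] * vt[:rank]
        cores[k + 1] = np.tensordot(carry, cores[k + 1], axes=([1], [0]))
    return cores

# A rank-2 tensor stored wastefully with TT-rank 4 by zero-padding its cores.
rng = np.random.default_rng(0)
shape, small_ranks, padded_ranks = (3, 4, 5), [1, 2, 2, 1], [1, 4, 4, 1]
small = [rng.standard_normal((small_ranks[k], shape[k], small_ranks[k + 1])) for k in range(3)]
padded = []
for k, core in enumerate(small):
    big = np.zeros((padded_ranks[k], shape[k], padded_ranks[k + 1]))
    big[:core.shape[0], :, :core.shape[2]] = core
    padded.append(big)

rounded = tt_round(padded, eps=1e-10)
rounded_ranks = [core.shape[2] for core in rounded[:-1]]
```

In this example the rounding recovers the underlying rank 2 from the padded rank-4 representation while leaving the tensor itself unchanged up to floating-point error.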

How to find the TT-decomposition of a given tensor
Analytical formulas for special cases;
An exact algorithm based on SVD for medium-sized tensors. E.g. for a tensor with $5^8 \approx 400\,000$ elements it takes 8 ms on my laptop;
For large tensors (e.g. $2^{50}$ elements), approximate algorithms that look at a fraction of the tensor elements: DMRG-cross [Savostyanov and Oseledets, 2011], AMEn-cross [Dolgov and Savostyanov, 2013].
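A minimal sketch of the SVD-based algorithm (sequential SVDs of unfoldings), applied to the small tensor $A_{i_1 i_2 i_3} = i_1 + i_2 + i_3$ from the earlier example. The function name `tt_svd`, the core layout `(r_prev, n, r_next)` and the tolerance handling are assumptions for illustration; no timing is claimed for this sketch.

```python
import numpy as np

def tt_svd(tensor, eps=1e-10):
    """Decompose a full tensor into TT-cores by sequential truncated SVDs."""
    shape = tensor.shape
    d = len(shape)
    delta = eps / np.sqrt(max(d - 1, 1)) * np.linalg.norm(tensor)
    cores, r_prev = [], 1
    rest = tensor.reshape(shape[0], -1)
    for k in range(d - 1):
        rest = rest.reshape(r_prev * shape[k], -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[j] = norm of s[j:]
        rank = max(1, int(np.sum(tail > delta)))
        cores.append(u[:, :rank].reshape(r_prev, shape[k], rank))
        rest = s[:rank, None] * vt[:rank]              # pass the remainder to the right
        r_prev = rank
    cores.append(rest.reshape(r_prev, shape[-1], 1))
    return cores

# The tensor A[i1, i2, i3] = i1 + i2 + i3 with 1-based index values.
i1, i2, i3 = np.meshgrid(np.arange(1, 4), np.arange(1, 5), np.arange(1, 6), indexing='ij')
A = (i1 + i2 + i3).astype(float)

cores = tt_svd(A)
found_ranks = [core.shape[2] for core in cores[:-1]]
reconstructed = np.einsum('aib,bjc,ckd->ijk', *cores)
```

The recovered TT-ranks match the rank-2 cores written out by hand in the example, although the numerical cores themselves differ by an invertible change of basis.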

TT-format operations
Operation                              Rank of the result
$C = c \cdot A$                        $r(C) = r(A)$
$C = A + c$                            $r(C) = r(A) + 1$
$C = A + B$                            $r(C) \le r(A) + r(B)$
$C = A \odot B$                        $r(C) \le r(A) \, r(B)$
$C = \text{round}(A, \varepsilon)$     $r(C) \le r(A)$
$\text{sum} \, A$                      –
$\|A\|_F$                              –
(Ask me about differential equations)
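Two rows of the table can be illustrated directly on the cores. The sketch assumes that the product row refers to the elementwise (Hadamard) product and that cores are stored with shape `(r_prev, n, r_next)`; the helper names `tt_hadamard`, `tt_sum` and `tt_full` are hypothetical.

```python
import numpy as np

def tt_full(cores):
    """Contract the cores into the full tensor (small examples only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]

def tt_hadamard(cores_a, cores_b):
    """Elementwise product: each core slice is the Kronecker product of the two slices."""
    result = []
    for ga, gb in zip(cores_a, cores_b):
        ra0, n, ra1 = ga.shape
        rb0, _, rb1 = gb.shape
        core = np.einsum('aib,cid->acibd', ga, gb).reshape(ra0 * rb0, n, ra1 * rb1)
        result.append(core)
    return result

def tt_sum(cores):
    """Sum of all elements: multiply the cores after summing each over its index."""
    result = np.ones((1, 1))
    for core in cores:
        result = result @ core.sum(axis=1)
    return result[0, 0]

rng = np.random.default_rng(0)
shape = (2, 3, 4)
ranks_a, ranks_b = [1, 2, 2, 1], [1, 3, 3, 1]
cores_a = [rng.standard_normal((ranks_a[k], shape[k], ranks_a[k + 1])) for k in range(3)]
cores_b = [rng.standard_normal((ranks_b[k], shape[k], ranks_b[k + 1])) for k in range(3)]

cores_c = tt_hadamard(cores_a, cores_b)
inner_ranks_c = [core.shape[2] for core in cores_c[:-1]]  # 2 * 3 at every inner bond
total = tt_sum(cores_a)
```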

Example application: TensorNet
1. Neural networks use fully-connected layers: $y = f(W x + b)$.
2. The matrix $W$ has millions of parameters.
3. Let's store and train the matrix $W$ in the TT-format.
This cannot work for general matrices, but for the VGG-16 net we compressed a 4048 × 4048 matrix to 320 parameters without loss of accuracy.

Linear model
Model
$$y(x) = w^\top x + b, \quad b \in \mathbb{R}, \; w \in \mathbb{R}^d$$
Loss function
$$\sum_{k=1}^{N} \ell\left(w^\top x^{(k)} + b, \; y^{(k)}\right).$$
Linear regression
Logistic regression
Linear SVM
...

Need for interactions
Linear models give everyone the same recommendations
The same story holds e.g. in bag-of-words text tasks
Use interactions (products of features)!

Models with interactions
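$$y(x) = b + w^\top x + \sum_{i,j} P_{ij} x_i x_j, \quad b \in \mathbb{R}, \; w \in \mathbb{R}^d, \; P \in \mathbb{R}^{d \times d}$$
For $d$ features there are $d^2$ parameters: overfitting on sparse data
Complexity is also $O(d^2)$
For recommender systems $d$ is in the millions
An SVM with a polynomial kernel has the same drawbacks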

Factorization machines
$$y(x) = b + w^\top x + \sum_{i,j} P_{ij} x_i x_j$$
Factorization machines [Rendle 2010] use rank $r$ for $P$:
$$y(x) = b + w^\top x + \sum_{i,j} \sum_{f=1}^{r} V_{if} V_{jf} x_i x_j, \quad b \in \mathbb{R}, \; w \in \mathbb{R}^d, \; V \in \mathbb{R}^{d \times r}$$
The matrix $P = V V^\top$ is not sparse, but structured (low rank)
Control the number of parameters with $r$
Can represent almost any matrix with large $r$
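The low-rank structure also makes evaluation cheap. For the formula as written on the slide (the sum runs over all pairs $i, j$), the quadratic term equals $\|V^\top x\|^2$, which costs $O(dr)$ instead of $O(d^2)$. The sketch below checks this identity; the function names and the random data are assumptions for illustration.

```python
import numpy as np

def fm_naive(x, b, w, V):
    """Evaluate the model through the dense d x d matrix P = V V^T."""
    P = V @ V.T
    return b + w @ x + x @ P @ x

def fm_fast(x, b, w, V):
    """Same value in O(d r): sum_f (sum_i V[i, f] * x[i])**2."""
    z = V.T @ x
    return b + w @ x + z @ z

rng = np.random.default_rng(0)
d, r = 6, 3
x = rng.standard_normal(d)
w = rng.standard_normal(d)
V = rng.standard_normal((d, r))
b = 0.5

y_naive = fm_naive(x, b, w, V)
y_fast = fm_fast(x, b, w, V)
```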

High order analysis
Factorization machines model (3rd order):
$$y(x) = b + w^\top x + \sum_{i,j} \sum_{f=1}^{r} V_{if} V_{jf} x_i x_j + \sum_{i,j,k} \sum_{f=1}^{r} U_{if} U_{jf} U_{kf} x_i x_j x_k.$$
In fact, factorization machines just use the CP-decomposition for the weight tensor $P_{ijk}$:
$$P_{ijk} = \sum_{f=1}^{r} U_{if} U_{jf} U_{kf}$$
But
They converge poorly at high order
Complexity of inference and learning
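The third-order term can be illustrated the same way: with the CP-structured weight tensor, the triple sum as written above collapses to $\sum_f (U^\top x)_f^3$. A small check, with hypothetical function names and random data:

```python
import numpy as np

def third_order_naive(x, U):
    """Build the full d x d x d weight tensor P and contract it with x three times."""
    P = np.einsum('if,jf,kf->ijk', U, U, U)
    return np.einsum('ijk,i,j,k->', P, x, x, x)

def third_order_fast(x, U):
    """Same value without forming P: sum_f (sum_i U[i, f] * x[i])**3."""
    return np.sum((U.T @ x) ** 3)

rng = np.random.default_rng(0)
d, r = 5, 3
x = rng.standard_normal(d)
U = rng.standard_normal((d, r))

term_naive = third_order_naive(x, U)
term_fast = third_order_fast(x, U)
```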

Exponential machines
Let's encode interactions by a binary code. Every bit indicates whether the corresponding feature is included in the current interaction.
Exponential machines example ($d = 3$):
$$y(x) = W_{000} + W_{100} x_1 + W_{010} x_2 + W_{001} x_3 + W_{110} x_1 x_2 + W_{101} x_1 x_3 + W_{011} x_2 x_3 + W_{111} x_1 x_2 x_3.$$
In general:
$$y(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} W_{i_1, \ldots, i_d} \, x_1^{i_1} \cdots x_d^{i_d}, \quad W \in \mathbb{R}^{2 \times \ldots \times 2} \text{ with TT-rank } r$$
Captures all $2^d$ interactions
Control the number of parameters with the TT-rank $r$
Can represent any polynomial function with large $r$
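A direct, exponential-cost evaluation of the model makes the binary encoding explicit. The sketch stores $W$ as a full $2 \times \ldots \times 2$ array; the function name `exm_predict_full` and the example weights are hypothetical.

```python
import numpy as np
from itertools import product

def exm_predict_full(W, x):
    """y(x) = sum over binary codes (i1, ..., id) of W[i1, ..., id] * prod_k x_k**i_k."""
    d = len(x)
    y = 0.0
    for code in product((0, 1), repeat=d):
        # Bit k of the code says whether feature k takes part in this interaction.
        monomial = np.prod([x[k] for k in range(d) if code[k] == 1])
        y += W[code] * monomial
    return y

# Example with d = 3: y = 1 + 2*x1 + 3*x1*x2 - x1*x2*x3.
W = np.zeros((2, 2, 2))
W[0, 0, 0] = 1.0   # bias
W[1, 0, 0] = 2.0   # x1
W[1, 1, 0] = 3.0   # x1 * x2
W[1, 1, 1] = -1.0  # x1 * x2 * x3

x = np.array([2.0, 3.0, 5.0])
y = exm_predict_full(W, x)  # 1 + 4 + 18 - 30 = -7
```

The loop visits all $2^d$ codes, which is exactly the cost that the TT-format is meant to avoid.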

Exponential machines inference
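Linear $O(r^2 d)$ inference:
$$y(x) = \sum_{i_1, \ldots, i_d} G_1[i_1] \cdots G_d[i_d] \prod_{k=1}^{d} x_k^{i_k}$$
$$= \sum_{i_1, \ldots, i_d} x_1^{i_1} G_1[i_1] \cdots x_d^{i_d} G_d[i_d]$$
$$= \left( \sum_{i_1=0}^{1} x_1^{i_1} G_1[i_1] \right) \cdots \left( \sum_{i_d=0}^{1} x_d^{i_d} G_d[i_d] \right)$$
$$= \underbrace{A_1}_{1 \times r} \underbrace{A_2}_{r \times r} \cdots \underbrace{A_d}_{r \times 1},$$
where $A_k = \sum_{i_k=0}^{1} x_k^{i_k} G_k[i_k] = G_k[0] + x_k G_k[1]$.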
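The factorized sum translates into a short loop over the cores. This sketch assumes cores of shape `(r_prev, 2, r_next)` and compares the result with a brute-force contraction of the full weight tensor; the names `exm_predict_tt` and `tt_full` are hypothetical.

```python
import numpy as np

def tt_full(cores):
    """Contract the cores into the full 2 x ... x 2 tensor (small d only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]

def exm_predict_tt(cores, x):
    """y(x) = A_1 A_2 ... A_d with A_k = G_k[0] + x_k * G_k[1]."""
    result = np.ones((1, 1))
    for core, xk in zip(cores, x):
        a_k = core[:, 0, :] + xk * core[:, 1, :]
        result = result @ a_k
    return result[0, 0]

rng = np.random.default_rng(0)
d, r = 6, 3
ranks = [1] + [r] * (d - 1) + [1]
cores = [rng.standard_normal((ranks[k], 2, ranks[k + 1])) for k in range(d)]
x = rng.standard_normal(d)

y_tt = exm_predict_tt(cores, x)

# Brute force: contract every mode of the full tensor with the vector (1, x_k).
y_full = tt_full(cores)
for xk in x:
    y_full = np.tensordot(np.array([1.0, xk]), y_full, axes=([0], [0]))
y_full = float(y_full)
```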

Exponential machines learning
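$$\underset{W}{\text{minimize}} \quad \sum_{k=1}^{N} \ell\left(\langle W, X^{(k)} \rangle, \; y^{(k)}\right),$$
$$\text{subject to} \quad \text{TT-rank}(W) = r_0,$$
1. Autodiff to compute gradients with respect to the TT-cores $G$
2. OR Riemannian optimization
Theorem [Holtz, 2012]
The set of all $d$-dimensional tensors with fixed TT-rank $r$
$$\mathcal{M}_r = \{W \in \mathbb{R}^{2 \times \ldots \times 2} : \text{TT-rank}(W) = r\}$$
forms a Riemannian manifold.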

Riemannian optimization
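[Figure: one step of Riemannian optimization. The negative gradient $-\partial L / \partial W_t$ at the point $W_t$ is projected onto the tangent space $T_W \mathcal{M}_r$, giving $-G_t$; TT-rounding then maps the step back onto $\mathcal{M}_r$, giving $W_{t+1}$.]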

Riemannian optimization Cont’d
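Loss function
$$L(W) = \sum_{k=1}^{N} \ell\left(\langle W, X^{(k)} \rangle, \; y^{(k)}\right)$$
Gradient
$$\frac{\partial L}{\partial W} = \sum_{k=1}^{N} \frac{\partial \ell}{\partial \hat{y}} X^{(k)}, \quad \hat{y} = \langle W, X^{(k)} \rangle.$$
Where $X$ is of TT-rank 1!
$$X_{i_1 \ldots i_d} = \prod_{k=1}^{d} x_k^{i_k}.$$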
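The rank-1 structure of $X$ and the gradient formula can be checked on a tiny example. The sketch assumes a squared loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ for a single object, which is an assumption made here for illustration (the talk leaves $\ell$ generic); the function names are hypothetical.

```python
import numpy as np
from functools import reduce

def rank_one_x(x):
    """Full tensor X[i1, ..., id] = prod_k x_k**i_k, the outer product of the vectors (1, x_k)."""
    return reduce(np.multiply.outer, [np.array([1.0, xk]) for xk in x])

def loss(W, x, y):
    y_hat = np.sum(W * rank_one_x(x))  # inner product <W, X>
    return 0.5 * (y_hat - y) ** 2

def loss_grad(W, x, y):
    """Gradient with respect to W: (d loss / d y_hat) * X, here (y_hat - y) * X."""
    X = rank_one_x(x)
    y_hat = np.sum(W * X)
    return (y_hat - y) * X

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2, 2))
x = np.array([0.5, -1.0, 2.0])
y = 1.0

grad = loss_grad(W, x, y)

# Central finite difference for one entry of W as a sanity check.
h = 1e-6
E = np.zeros_like(W)
E[1, 0, 1] = h
numeric = (loss(W + E, x, y) - loss(W - E, x, y)) / (2 * h)
```

Since each $X^{(k)}$ is an outer product of $d$ vectors of length 2, it never has to be stored in full in practice; the full array is built here only because $d = 3$.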

Experiments: optimization
[Figure: training loss (log scale) versus time in seconds (log scale) on (a) the Car dataset and (b) the HIV dataset. Methods compared: Cores GD, Cores SGD 100, Cores SGD 500, Riemann GD, Riemann 100, Riemann 500, Riemann GD rand init.]

Experiments: classification
1. We generated $10^5$ train and $10^5$ test objects with $d = 30$ features.
2. $X_{ij} \sim U\{-1, +1\}$.
3. Ground truth for 3 interactions of order 2:
$$y(x) = \varepsilon_1 x_1 x_5 + \varepsilon_2 x_3 x_8 + \varepsilon_3 x_4 x_5; \quad \varepsilon_1, \varepsilon_2, \varepsilon_3 \sim U(-1, 1).$$
4. We used 20 interactions of order 6.
Method           Test AUC      Training time (s)   Inference time (s)
Log. reg.        0.50 ± 0.0    0.4                 0.0
RF               0.55 ± 0.0    21.4                1.3
SVM RBF          0.50 ± 0.0    2262.6              1076.1
SVM poly. 2      0.50 ± 0.0    1152.6              852.0
SVM poly. 6      0.56 ± 0.0    4090.9              754.8
2nd-order FM     0.50 ± 0.0    638.2               0.1
6th-order FM     0.57 ± 0.05   1412.0              0.2
ExM rank 2       0.54 ± 0.05   198.4               0.1
ExM rank 4       0.69 ± 0.02   443.0               0.1
ExM rank 8       0.75 ± 0.02   998.3               0.2

Conclusion
The Tensor Train decomposition compactly represents tensors.
Machine learning models can be parametrized with TT-tensors.
E.g. the weights of a neural network.
Or to model all $2^d$ interactions (products of features).
Control the number of underlying parameters via the TT-rank.
Learning with Riemannian optimization sometimes outperforms SGD.
There is Python code for everything: TT, TensorNet, and Exponential Machines.
