Higher-order Factorization Machines
Mathieu Blondel
NTT Communication Science Laboratories
Kyoto, Japan
Joint work with M. Ishihata, A. Fujino and
N. Ueda
2016/9/28
Regression analysis
• Variables
  y ∈ ℝ: target variable
  x ∈ ℝ^d: explanatory variables (features)
• Training data
  y = [y_1, …, y_n]ᵀ ∈ ℝ^n
  X = [x_1, …, x_n] ∈ ℝ^{d×n}
• Goal
  ◦ Learn model parameters
  ◦ Compute a prediction ŷ for a new x
Linear regression
• Model
    ŷ_LR(x; w) := ⟨w, x⟩ = ∑_{j=1}^d w_j x_j
• Parameters
  w ∈ ℝ^d: feature weights
• Pros and cons
  + O(d) predictions
  + Learning w can be cast as a convex optimization problem
  − Does not use feature interactions
Polynomial regression
• Model
    ŷ_PR(x; w, W) := ⟨w, x⟩ + xᵀWx = ⟨w, x⟩ + ∑_{j,j'=1}^d w_{j,j'} x_j x_{j'}
• Parameters
  w ∈ ℝ^d: feature weights
  W ∈ ℝ^{d×d}: weight matrix
• Pros and cons
  + Learning w and W can be cast as a convex optimization problem
  − O(d²) time and memory cost
Kernel regression
• Model
    ŷ_KR(x; α) := ∑_{i=1}^n α_i K(x_i, x)
• Parameters
  α ∈ ℝ^n: instance weights
• Pros and cons
  + Can use non-linear kernels (RBF, polynomial, etc.)
  + Learning α can be cast as a convex optimization problem
  − O(dn) predictions (linear dependence on training set size)
Factorization Machines (FMs) (Rendle, ICDM 2010)
• Model
    ŷ_FM(x; w, P) := ⟨w, x⟩ + ∑_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}
  where p̄_j denotes the jth row of P
• Parameters
  w ∈ ℝ^d: feature weights
  P ∈ ℝ^{d×k}: weight matrix
• Pros and cons
  + Takes into account feature combinations
  + O(2dk) predictions (linear time) instead of O(d²)
  − Parameter estimation involves a non-convex optimization problem
Application 1: recsys without features
• Formulate it as a matrix completion problem

              Movie 1   Movie 2   Movie 3   Movie 4
    Alice                  ?                   ?
    Bob                    ?                   ?
    Charlie                ?         ?

• Matrix factorization: find U, V that approximately
  reconstruct the rating matrix
    R ≈ UVᵀ
Conversion to a regression problem

              Movie 1   Movie 2   Movie 3   Movie 4
    Alice                  ?                   ?
    Bob                    ?                   ?
    Charlie                ?         ?

⇓ one-hot encoding

    y = vector of the observed ratings, and

        ⎡ 1 0 0   1 0 0 0 ⎤
        ⎢ 1 0 0   0 0 1 0 ⎥
        ⎢ 0 1 0   1 0 0 0 ⎥
    X = ⎢ 0 1 0   0 0 1 0 ⎥
        ⎢ 0 0 1   1 0 0 0 ⎥
        ⎣ 0 0 1   0 0 0 1 ⎦

    where each row of X concatenates a one-hot user indicator (first 3
    columns) and a one-hot movie indicator (last 4 columns)

Using this representation, FMs are equivalent to MF!
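
As a minimal sketch of this conversion in Python (the rating values below are invented for illustration; only the (user, movie) pattern matches the slide):

    import numpy as np

    users  = ["Alice", "Bob", "Charlie"]
    movies = ["Movie 1", "Movie 2", "Movie 3", "Movie 4"]
    # Hypothetical observed (user, movie, rating) triples.
    observed = [("Alice", "Movie 1", 5.0), ("Alice", "Movie 3", 3.0),
                ("Bob", "Movie 1", 4.0), ("Bob", "Movie 3", 1.0),
                ("Charlie", "Movie 1", 2.0), ("Charlie", "Movie 4", 5.0)]

    X = np.zeros((len(observed), len(users) + len(movies)))
    y = np.zeros(len(observed))
    for i, (u, mv, r) in enumerate(observed):
        X[i, users.index(u)] = 1.0                  # one-hot user block
        X[i, len(users) + movies.index(mv)] = 1.0   # one-hot movie block
        y[i] = r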
Generalization ability of FMs
• The weight of x_j x_{j'} is ⟨p̄_j, p̄_{j'}⟩, compared to w_{j,j'} for PR
• The same parameters p̄_j are shared for the weights of x_j x_{j'} ∀ j' > j
• This increases the amount of data used to estimate p̄_j, at the cost of
  introducing some bias (low-rank assumption)
• This allows FMs to generalize to feature interactions that were not
  observed in the training set
Application 2: recsys with features

  Rating    User              Movie
            Gender   Age      Genre       Director
            M        20-30    Adventure   S. Spielberg
            F        0-10     Anime       H. Miyazaki
            M        20-30    Drama       A. Kurosawa
            ⋮        ⋮        ⋮           ⋮

• Interactions between categorical variables
  ◦ Gender × genre: {M, F} × {Adventure, Anime, Drama, ...}
  ◦ Age × director: {0-10, 10-20, ...} × {S. Spielberg, H. Miyazaki, A. Kurosawa, ...}
• In practice, the number of interactions can be huge!
Conversion to regression

  Rating    Gender   Age      Genre       Director
            M        20-30    Adventure   S. Spielberg
            F        0-10     Anime       H. Miyazaki
            M        20-30    Drama       A. Kurosawa
            ⋮        ⋮        ⋮           ⋮

⇓ one-hot encoding

    y = vector of ratings, and

        ⎡ 1 0 0 0 1 1 0 0 … ⎤
    X = ⎢ 0 1 1 0 0 0 1 0 … ⎥
        ⎢ 1 0 0 0 1 0 0 1 … ⎥
        ⎣ ⋮                 ⎦

→ very sparse binary data!
FMs revisited (Blondel+, ICML 2016)
• ANOVA kernel of degree m = 2 (Stitson+, 1997; Vapnik, 1998)
    A²(p, x) := ∑_{j'>j} p_j x_j p_{j'} x_{j'}
• Then
    ŷ_FM(x; w, P) = ⟨w, x⟩ + ∑_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}
                  = ⟨w, x⟩ + ∑_{s=1}^k A²(p_s, x)
  where p_s denotes the sth column of P
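
A quick numerical sanity check of this identity, as a sketch with random values:

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 6, 3
    P = rng.normal(size=(d, k))   # rows are p̄_j, columns are p_s
    x = rng.normal(size=d)

    # ∑_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}
    fm = sum(P[j] @ P[jp] * x[j] * x[jp]
             for jp in range(d) for j in range(jp))
    # ∑_{s=1}^k A²(p_s, x), with A² expanded as a double sum
    anova2 = sum((P[j, s] * x[j]) * (P[jp, s] * x[jp])
                 for s in range(k) for jp in range(d) for j in range(jp))
    assert np.isclose(fm, anova2)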
ANOVA kernel (arbitrary-order case)
• ANOVA kernel of degree 2 ≤ m ≤ d
    A^m(p, x) := ∑_{j_m > ⋯ > j_1} (p_{j_1} x_{j_1}) ⋯ (p_{j_m} x_{j_m})
  where the sum runs over all possible m-combinations of {1, …, d}
• Intuitively, the kernel uses all m-combinations of features without
  replacement: x_{j_1} ⋯ x_{j_m} for j_1 ≠ ⋯ ≠ j_m
• Computing A^m(p, x) naively takes O(d^m)
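
For reference, a direct Python transcription of this definition (a sketch; enumerating all m-combinations is exactly the O(d^m) computation):

    import itertools
    import numpy as np

    def anova_naive(p, x, m):
        # Sum of products p_{j1} x_{j1} ... p_{jm} x_{jm} over all
        # m-combinations of distinct indices j1 < ... < jm.
        return sum(np.prod([p[j] * x[j] for j in comb])
                   for comb in itertools.combinations(range(len(p)), m))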
Higher-order FMs (HOFMs)
• Model
    ŷ_HOFM(x; w, {P^t}_{t=2}^m) := ⟨w, x⟩ + ∑_{t=2}^m ∑_{s=1}^k A^t(p_s^t, x)
  where p_s^t denotes the sth column of P^t
• Parameters
  w ∈ ℝ^d: feature weights
  P², …, P^m ∈ ℝ^{d×k}: weight matrices
• Pros and cons
  + Takes into account higher-order feature combinations
  + O(dkm²) prediction cost using our proposed algorithms
  − More complex than 2nd-order FMs
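
A sketch of the HOFM prediction function, reusing anova_naive from the previous sketch (the efficient DP evaluation comes later in the talk):

    import numpy as np

    def hofm_predict(x, w, Ps):
        # Ps = [P^2, ..., P^m]; column s of P^t carries the degree-t weights.
        return w @ x + sum(anova_naive(P[:, s], x, t)
                           for t, P in enumerate(Ps, start=2)
                           for s in range(P.shape[1]))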
Learning HOFMs (1/2)
• We use alternating minimization w.r.t. w, P², …, P^m
• Learning w alone reduces to linear regression
• Learning P^m can be cast as minimizing
    F(P) := (1/n) ∑_{i=1}^n ℓ(y_i, ∑_{s=1}^k A^m(p_s, x_i) + o_i) + (β/2) ‖P‖²
  where o_i is the contribution of degrees other than m
Learning HOFMs (2/2)
• Stochastic gradient update (a worked instance for the squared loss follows below)
    p_s ← p_s − η ℓ′(y_i, ŷ_i) ∇A^m(p_s, x_i) − η β p_s
  where η is a learning rate hyper-parameter and
    ŷ_i := ∑_{s=1}^k A^m(p_s, x_i) + o_i
• We propose O(dm) (linear-time) DP algorithms for
  ◦ Evaluating the ANOVA kernel A^m(p, x) ∈ ℝ
  ◦ Computing its gradient ∇A^m(p, x) ∈ ℝ^d
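
For instance, with the squared loss ℓ(y, ŷ) = (y − ŷ)²/2, the loss used in the experiments later, we have ℓ′(y_i, ŷ_i) = ŷ_i − y_i, so the update reads p_s ← (1 − ηβ) p_s − η (ŷ_i − y_i) ∇A^m(p_s, x_i): shrink p_s, then step against the residual-weighted kernel gradient.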
Evaluating the ANOVA kernel (1/3)
• Recursion (Blondel+, ICML 2016)
    A^m(p, x) = A^m(p_{¬j}, x_{¬j}) + p_j x_j A^{m−1}(p_{¬j}, x_{¬j})  ∀j
  where p_{¬j}, x_{¬j} ∈ ℝ^{d−1} are the vectors with the jth element removed
• We can use this recursion to remove features until computing the kernel
  becomes trivial (see the recursive sketch below)
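
A direct recursive transcription of this idea in Python (a sketch; it uses the base cases made explicit on the next slide, and is exponential-time because subproblems are recomputed):

    def anova_rec(p, x, t, j=None):
        # A^t(p_{1:j}, x_{1:j}), computed by removing the last feature.
        j = len(p) if j is None else j
        if t == 0:
            return 1.0   # empty product: A^0 := 1
        if j < t:
            return 0.0   # fewer features than the degree
        return (anova_rec(p, x, t, j - 1)
                + p[j - 1] * x[j - 1] * anova_rec(p, x, t - 1, j - 1))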
Evaluating the ANOVA kernel (2/3)
• Unrolling the recursion from the value we want to compute, a_{d,m} := A^m(p, x),
  gives a binary tree: a_{d,m} splits into a_{d−1,m} and p_d x_d · a_{d−1,m−1},
  which in turn split via p_{d−1} x_{d−1}, and so on; the same subproblems
  (e.g., a_{d−2,m−1}) are reached along several paths, causing redundant
  computation
• Shortcut: continue until all features have been eliminated
• Formally, let p_{1:j} := [p_1, …, p_j]ᵀ and x_{1:j} := [x_1, …, x_j]ᵀ, and
  denote a_{j,t} := A^t(p_{1:j}, x_{1:j}). Then

    a_{j,t} = 1                                  if t = 0
    a_{j,t} = 0                                  if j < t
    a_{j,t} = a_{j−1,t} + p_j x_j a_{j−1,t−1}    otherwise

  and A^m(p, x) = a_{d,m}. The DP procedure derived from this recursion
  (next slides) takes O(dm) time and memory.
Evaluating the ANOVA kernel (3/3)
Ways to avoid redundant computations:
• Top-down approach with memory table
• Bottom-up dynamic programming (DP): starting from the t = 0 row, fill the
  table left to right; each entry combines its left neighbor a_{j−1,t} with
  its upper-left neighbor a_{j−1,t−1} weighted by p_j x_j; the goal is a_{d,m}

         j = 0    j = 1     j = 2     …    j = d
  t = 0    1        1         1       …      1
  t = 1    0      a_{1,1}   a_{2,1}   …    a_{d,1}
  t = 2    0        0       a_{2,2}   …    a_{d,2}
  ⋮        ⋮        ⋮         ⋮       ⋱      ⋮
  t = m    0        0         0       …    a_{d,m}  ← goal
Algorithm 1: Evaluating A^m(p, x) in O(dm)

  Input: p ∈ ℝ^d, x ∈ ℝ^d
  a_{j,t} ← 0  ∀t ∈ {1, …, m}, j ∈ {0, 1, …, d}
  a_{j,0} ← 1  ∀j ∈ {0, 1, …, d}
  for t := 1, …, m do
    for j := t, …, d do
      a_{j,t} ← a_{j−1,t} + p_j x_j a_{j−1,t−1}
    end for
  end for
  Output: A^m(p, x) = a_{d,m}
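
A Python sketch of Algorithm 1; it returns the full DP table, whose last entry is the kernel value (the table is reused by the gradient algorithm below):

    import numpy as np

    def anova_kernel(p, x, m):
        # Bottom-up DP of Algorithm 1; a[t, j] = a_{j,t} = A^t(p_{1:j}, x_{1:j}).
        d = len(p)
        a = np.zeros((m + 1, d + 1))
        a[0, :] = 1.0                   # A^0 := 1
        for t in range(1, m + 1):
            for j in range(t, d + 1):   # entries with j < t stay 0
                a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]
        return a                        # the kernel value is a[m, d]

On random inputs, anova_kernel(p, x, m)[m, -1] agrees with anova_naive(p, x, m) from the earlier sketch.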
Backpropagation (chain rule)
Ex: compute the derivatives of the composite function f(g(h(p)))
• Forward pass
    a = h(p)
    b = g(a)
    c = f(b)
• Backward pass
    ∂c/∂p_j = (∂c/∂b) (∂b/∂a) (∂a/∂p_j) = f′(b) g′(a) h′_j(p)
  Only the last factor depends on j, so all derivatives can be computed
  in one pass!
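
A toy numeric check of this sketch, with hypothetical choices of h, g, f (not from the slides):

    import numpy as np

    h = lambda p: p.sum()      # a = h(p),  ∂a/∂p_j = 1
    g = lambda a: a ** 2       # b = g(a),  g'(a) = 2a
    f = lambda b: np.sin(b)    # c = f(b),  f'(b) = cos(b)

    p = np.array([0.3, -0.1, 0.5])
    a = h(p)
    b = g(a)

    # One backward pass: the shared scalar f'(b) g'(a) is computed once,
    # then multiplied by ∂a/∂p_j (= 1 here) for every j.
    grad = np.cos(b) * (2 * a) * np.ones_like(p)

    # Finite-difference check
    eps = 1e-6
    num = np.array([(f(g(h(p + eps * e))) - f(g(h(p - eps * e)))) / (2 * eps)
                    for e in np.eye(len(p))])
    assert np.allclose(grad, num, atol=1e-5)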
Gradient computation (1/2)
• We want to compute ∇A^m(p, x) = [p̃_1, …, p̃_d]ᵀ
• Using the chain rule, we have
    p̃_j := ∂a_{d,m}/∂p_j = ∑_{t=1}^m (∂a_{d,m}/∂a_{j,t}) (∂a_{j,t}/∂p_j)
          = ∑_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j
  where ã_{j,t} := ∂a_{d,m}/∂a_{j,t} and ∂a_{j,t}/∂p_j = a_{j−1,t−1} x_j,
  since p_j influences a_{j,t} ∀t ∈ [m]
• ã_{j,t} can be computed recursively in reverse order:
    ã_{j,t} = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}
Gradient computation (2/2)

           j = 1     j = 2     …    j = d−1        j = d
  t = 1   ã_{1,1}   ã_{2,1}    …       0             0
  t = 2     0       ã_{2,2}    ⋱       0             0
  ⋮         ⋮         ⋮        ⋱   ã_{d−1,t−1}       0
  t = m     0         0        0       1             1   ← start
  t = m+1   0         0        0       0             0

  Start from ã_{d,m} = 1 and fill the table from right to left: each entry
  combines its right neighbor with its lower-right neighbor weighted by
  p_{j+1} x_{j+1}; the goal is the j = 1 column.
Algorithm 2: Computing ∇A^m(p, x) in O(dm)

  Input: p ∈ ℝ^d, x ∈ ℝ^d, {a_{j,t}}_{j,t=0}^{d,m}
  ã_{j,t} ← 0  ∀t ∈ [m+1], j ∈ [d]
  ã_{d,m} ← 1
  for t := m, …, 1 do
    for j := d−1, …, t do
      ã_{j,t} ← ã_{j+1,t} + ã_{j+1,t+1} p_{j+1} x_{j+1}
    end for
  end for
  p̃_j := ∑_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j  ∀j ∈ [d]
  Output: ∇A^m(p, x) = [p̃_1, …, p̃_d]ᵀ
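
A Python sketch of Algorithm 2, reusing the DP table returned by the anova_kernel sketch above, followed by a finite-difference check:

    import numpy as np

    def anova_grad(p, x, m, a):
        # Reverse-mode sweep: a_tilde[t, j] = ã_{j,t} = ∂a_{d,m} / ∂a_{j,t}.
        d = len(p)
        a_tilde = np.zeros((m + 2, d + 1))
        a_tilde[m, d] = 1.0
        for t in range(m, 0, -1):
            for j in range(d - 1, t - 1, -1):
                a_tilde[t, j] = (a_tilde[t, j + 1]
                                 + a_tilde[t + 1, j + 1] * p[j] * x[j])
        # p̃_j = ∑_t ã_{j,t} a_{j−1,t−1} x_j
        return np.array([sum(a_tilde[t, j] * a[t - 1, j - 1] * x[j - 1]
                             for t in range(1, m + 1))
                         for j in range(1, d + 1)])

    # Finite-difference check (assumes anova_kernel from the earlier sketch)
    rng = np.random.default_rng(0)
    p, x, m = rng.normal(size=5), rng.normal(size=5), 3
    g = anova_grad(p, x, m, anova_kernel(p, x, m))
    eps = 1e-6
    num = np.array([(anova_kernel(p + eps * e, x, m)[m, -1]
                     - anova_kernel(p - eps * e, x, m)[m, -1]) / (2 * eps)
                    for e in np.eye(5)])
    assert np.allclose(g, num, atol=1e-5)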
Summary so far
• HOFMs can be expressed using the ANOVA kernel A^m
• We proposed O(dm)-time algorithms for computing A^m(p, x) and ∇A^m(p, x)
• The cost per epoch of stochastic gradient algorithms for learning P^m is
  therefore O(dnkm)
• The prediction cost is O(dkm²)
Other contributions
• Coordinate-descent algorithm for learning P^m, based on a different recursion
  ◦ Cost per epoch is O(dnkm²)
  ◦ However, no learning rate to tune!
• HOFMs with shared parameters: P² = ⋯ = P^m
  ◦ Total prediction cost is O(dkm) instead of O(dkm²)
  ◦ Corresponds to using new kernels derived from the ANOVA kernel
Experiments
Application to link prediction
Goal: predict missing links (marked "?") between nodes in a graph
• Graph: co-author network, enzyme network
• Bipartite graph: user-movie, gene-disease
Application to link prediction
• We assume two sets of nodes A (e.g., users) and B (e.g., movies) of sizes
  n_A and n_B
• Nodes in A are represented by feature vectors a_i ∈ ℝ^{d_A}
• Nodes in B are represented by feature vectors b_j ∈ ℝ^{d_B}
• We are given a matrix Y ∈ {−1, +1}^{n_A×n_B} such that y_{i,j} = +1 if there
  is a link between a_i and b_j
• The number of positive samples is n+
Datasets

  Dataset   n+       Columns of A   n_A     d_A      Columns of B   n_B      d_B
  NIPS      4,140    Authors        2,037   13,649
  Enzyme    2,994    Enzymes        668     325
  GD        3,954    Diseases       3,209   3,209    Genes          12,331   25,275
  ML 100K   21,201   Users          943     49       Movies         1,682    29

Features:
• NIPS: word occurrence in author publications
• Enzyme: phylogenetic information, gene expression information and gene
  location information
• GD: MimMiner similarity scores (diseases) and HumanNet similarity scores (genes)
• ML 100K: age, gender, occupation, living area (users); release year, genre (movies)
Models compared
Goal: predict if there is a link between a_i and b_j
• HOFM: ŷ_{i,j} = ŷ_HOFM(a_i ⊕ b_j; w, {P^t}_{t=2}^m), where ⊕ denotes
  vector concatenation
• HOFM-shared: same, but with P² = ⋯ = P^m
• Polynomial network (PN): replace the ANOVA kernel by the polynomial kernel
• Bilinear regression (BLR): ŷ_{i,j} = a_iᵀ U Vᵀ b_j
Experimental protocol
• We sample n− = n+ negative samples (missing edges are treated as negative samples)
• We use 50% of the samples for training and 50% for testing
• We use ROC-AUC (area under the ROC curve) for evaluation
• β is tuned by CV; k is fixed to 30
• P², …, P^m are initialized randomly
• ℓ is set to the squared loss
[Figure: ROC-AUC of HOFM, HOFM-shared, PN and BLR for m = 2, 3, 4, 5 on
(a) NIPS, (b) Enzyme, (c) GD and (d) ML 100K]
Solver comparison
• Coordinate descent (CD)
• AdaGrad
• L-BFGS
AdaGrad and L-BFGS use the proposed DP algorithm to compute ∇A^m(p, x)
[Figure, NIPS dataset: objective value minus best vs. CPU time for CD, AdaGrad
and L-BFGS, for (a) m = 2, (b) m = 3, (c) m = 4; and (d) time to complete one
epoch (sec.) vs. degree m]