DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Conditional Neural Processes (ICML2018)
Neural Processes (ICML2018Workshop)
Kazuki Fujikawa, DeNA
• Papers covered
  – Conditional Neural Processes (CNP)
    » ICML 2018
    » Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, S. M. Ali Eslami
  – Neural Processes (NP)
    » ICML 2018 Workshop
    » Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, Yee Whye Teh
• Overview
  – The papers propose Conditional Neural Processes (CNPs) and Neural Processes (NPs), neural models that combine the strengths of deep networks and Gaussian Processes (GPs)
    » Like GPs, they make sensible predictions with uncertainty after observing only a handful of data points
    » Like neural networks, they are trained by gradient descent and scale to complex functions and large datasets; conditional prediction costs O(n + m) for n observations and m targets
  – Demonstrated on 1-D regression, pixel-wise image completion (MNIST, CelebA) and one-shot classification (Omniglot)
• Outline
  – Background
    » Gaussian Processes
    » Learning to learn / meta-learning
  – Conditional Neural Processes (CNP)
  – Neural Processes (NP)
  – Experiments
  – Summary
• Background: why predictive uncertainty matters for deep learning
  – Deep neural networks are excellent function approximators, but they are usually trained as point predictors and can be over-confident on inputs far from the training data
  – Over-confident mistakes are costly in deployed systems
    » e.g. Google Photos labelling photos of people as "gorillas" (Guynn, 2015)
    » e.g. the U.S. Department of Transportation / National Highway Traffic Safety Administration preliminary evaluation report on a fatal Tesla Autopilot crash
  – Estimating uncertainty with (or alongside) deep networks is therefore an active topic, e.g. Bayesian deep learning (Kendall+, NIPS 17) and deep networks viewed as Gaussian Processes (Lee+, ICLR 18)
• Background: Gaussian Processes (GPs)
• Regression: problem setup
  – Observed targets t = (t_1, …, t_n) at inputs x_1, …, x_n, with a prior p(t) over the targets
  – For a new input x_{n+1} we want the predictive distribution of its target t_{n+1}, obtained from the joint p(t_{n+1}, t) as the conditional p(t_{n+1} | t)
• Weight-space view: Bayesian linear regression with basis functions
  – Model each output as a linear combination of basis functions φ(x) with weights w
    » y_i = w^T φ(x_i)
    » prior over the weights: p(w) = N(w | 0, σ_w² I)
  – Stacking the n outputs, Y = (y_1, …, y_n)^T = Φ w, where Φ is the n × M design matrix:
    \[ Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
       = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_n) & \cdots & \phi_M(x_n) \end{pmatrix} \mathbf{w} \]
• The outputs follow a Gaussian Process
  – Y = Φw is a linear transform of the Gaussian w, so Y = (y_1, …, y_n) is jointly Gaussian:
    » p(Y) = N(Y | μ, K)
  – Mean: E[Y] = Φ E[w] = μ = 0
  – Covariance:
    \[ \mathrm{cov}(Y) = \mathbb{E}[YY^\top] = \Phi\,\mathbb{E}[\mathbf{w}\mathbf{w}^\top]\,\Phi^\top = \sigma_w^2\,\Phi\Phi^\top
       = \sigma_w^2 \begin{pmatrix} \phi(x_1)^\top\phi(x_1) & \cdots & \phi(x_1)^\top\phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi(x_n)^\top\phi(x_1) & \cdots & \phi(x_n)^\top\phi(x_n) \end{pmatrix}
       = \begin{pmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n) \end{pmatrix} = K \]
  – i.e. the kernel is k(x, x') = σ_w² φ(x)^T φ(x'), and the distribution over outputs is fully determined by the kernel evaluated at the inputs
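To make the weight-space derivation above concrete, here is a small numerical sketch (not from the slides or papers; the RBF basis, sizes and σ_w are arbitrary choices) that checks cov(Y) = σ_w² ΦΦ^T against a Monte Carlo estimate obtained by sampling weight vectors.

```python
# Minimal check of the weight-space view: y = Phi @ w with w ~ N(0, sigma_w^2 I)
# implies cov(y) = sigma_w^2 Phi Phi^T.
import numpy as np

rng = np.random.default_rng(0)
sigma_w = 1.0
x = np.linspace(-3, 3, 5)                      # n = 5 inputs
centers = np.linspace(-3, 3, 20)               # M = 20 RBF basis functions (arbitrary)
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)   # design matrix Phi[i, j] = phi_j(x_i)

K_analytic = sigma_w ** 2 * Phi @ Phi.T        # kernel matrix K = sigma_w^2 Phi Phi^T

# Monte Carlo check: sample many weight vectors and average Y Y^T
W = rng.normal(0.0, sigma_w, size=(200000, Phi.shape[1]))
Y = W @ Phi.T                                  # each row is one function sample at the 5 inputs
K_empirical = Y.T @ Y / len(Y)

print(np.max(np.abs(K_analytic - K_empirical)))  # small value -> the two covariances agree
```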
• Adding observation noise
  – Each observed target t_i is the function value y_i corrupted by i.i.d. Gaussian noise ε_i ~ N(0, σ_n²):
    » t_i = y_i + ε_i
  – For T = (t_1, …, t_n), the likelihood and the marginal over the targets are
    » p(T | Y) = N(T | Y, σ_n² I)
    » \[ p(T) = \int p(T \mid Y)\,p(Y)\,dY = \mathcal{N}(T \mid \boldsymbol{\mu}, C), \qquad
         C = \begin{pmatrix} k(x_1,x_1)+\sigma_n^2 & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n)+\sigma_n^2 \end{pmatrix} = K + \sigma_n^2 I \]
• Predictive distribution for a new input
  – Joint distribution of the observed targets T and a new target t_{n+1}:
    » \[ p(t_{n+1}, T) = \mathcal{N}\!\left(\cdot \mid \boldsymbol{\mu}, C_{n+1}\right), \qquad
         C_{n+1} = \begin{pmatrix} k(x_1,x_1)+\sigma_n^2 & \cdots & k(x_1,x_{n+1}) \\ \vdots & \ddots & \vdots \\ k(x_{n+1},x_1) & \cdots & k(x_{n+1},x_{n+1})+\sigma_n^2 \end{pmatrix}
         = \begin{pmatrix} C_n & \mathbf{k} \\ \mathbf{k}^\top & c \end{pmatrix} \]
  – Conditioning the joint Gaussian on T gives the prediction:
    » \[ p(t_{n+1} \mid T) = \frac{p(t_{n+1}, T)}{p(T)}
         = \frac{\mathcal{N}\!\left((T, t_{n+1}) \mid \boldsymbol{\mu}, C_{n+1}\right)}{\mathcal{N}(T \mid \boldsymbol{\mu}, C_n)} \]
  – which is again Gaussian, with mean and variance given by the standard conditioning formula:
    » \[ p(t_{n+1} \mid T) = \mathcal{N}\!\left(t_{n+1} \mid \mathbf{k}^\top C_n^{-1}\,T,\; c - \mathbf{k}^\top C_n^{-1}\mathbf{k}\right) \]
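The predictive formula above translates directly into a few lines of linear algebra. The following is an illustrative numpy sketch (the squared-exponential kernel, noise level and toy data are assumptions, not anything from the papers) of the GP posterior mean k^T C_n^{-1} T and covariance c − k^T C_n^{-1} k.

```python
# Illustrative GP regression: p(t_new | T) = N(k^T C^{-1} T, c - k^T C^{-1} k)
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_predict(x_train, t_train, x_test, sigma_n=0.1, length_scale=1.0):
    C = rbf(x_train, x_train, length_scale) + sigma_n ** 2 * np.eye(len(x_train))  # C_n
    k = rbf(x_train, x_test, length_scale)                                         # k
    c = rbf(x_test, x_test, length_scale) + sigma_n ** 2 * np.eye(len(x_test))     # c
    solve = np.linalg.solve(C, k)              # C_n^{-1} k (avoids an explicit inverse)
    mean = solve.T @ t_train                   # k^T C_n^{-1} T
    cov = c - k.T @ solve                      # c - k^T C_n^{-1} k
    return mean, cov

# toy data: a noisy sine observed at 5 points
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t_train = np.sin(x_train) + 0.1 * np.random.default_rng(0).normal(size=5)
x_test = np.linspace(-3, 3, 7)
mean, cov = gp_predict(x_train, t_train, x_test)
print(mean)                    # predictive means
print(np.sqrt(np.diag(cov)))   # predictive standard deviations
```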
• Properties of GP regression
  – Pros
    » Sensible predictions, with uncertainty estimates, from only a handful of observations
    » Prior knowledge about the function class is encoded directly in the kernel
  – Cons
    » Exact prediction scales as O((n + m)³) for n observations and m targets, which is prohibitive for large datasets
    » Designing an appropriate kernel (i.e. an appropriate prior) by hand can be difficult
• Deep Kernel Learning (Wilson+, AISTATS 2016)
  – Tackles kernel design by feeding deep-network features into a base kernel
    » the base kernel k(x_i, x_j | θ) becomes k(g(x_i, w), g(x_j, w) | θ, w), where g(·, w) is a deep (e.g. convolutional) network with weights w
  – The base-kernel hyperparameters θ and the network weights w are learned jointly by maximizing the GP marginal likelihood
In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic GP framework. Specifically, starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, we transform the inputs (predictors) x as
\[ k(x_i, x_j \mid \theta) \;\to\; k\big(g(x_i, w),\, g(x_j, w) \mid \theta, w\big), \qquad (5) \]
where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(x_i, x_j | θ). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):
\[ k_{\mathrm{SM}}(x, x' \mid \theta) \;=\; \sum_{q=1}^{Q} a_q \,\frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}}\, \exp\!\left(-\tfrac{1}{2}\big\|\Sigma_q^{1/2}(x - x')\big\|^2\right) \cos\big\langle x - x',\, 2\pi\mu_q \big\rangle. \qquad (6) \]
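As a rough illustration of Eq. (5), the sketch below wraps a base RBF kernel around features g(x, w) produced by a small, randomly initialised MLP. The network architecture, sizes and hyperparameters are made up for the example; in the paper w and θ are learned jointly via the marginal likelihood rather than fixed.

```python
# Sketch of a deep kernel: k(g(x_i, w), g(x_j, w) | theta) with an RBF base kernel.
import numpy as np

rng = np.random.default_rng(0)

def mlp_features(x, params):
    # g(x, w): a 2-layer tanh MLP mapping inputs to a feature space
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def rbf(a, b, length_scale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def deep_kernel(xa, xb, params, length_scale=1.0):
    # base RBF kernel evaluated on the MLP features instead of the raw inputs
    return rbf(mlp_features(xa, params), mlp_features(xb, params), length_scale)

d_in, d_h, d_feat = 1, 16, 8
params = (rng.normal(0, 0.5, (d_in, d_h)), np.zeros(d_h),
          rng.normal(0, 0.5, (d_h, d_feat)), np.zeros(d_feat))

x = np.linspace(-3, 3, 10)[:, None]
K = deep_kernel(x, x, params)
print(K.shape, np.allclose(K, K.T))   # (10, 10), symmetric kernel matrix
```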
encodes the inductive bias that function values closer together in the input space are more correlated. The complexity of the functions in the input space is determined by the interpretable length-scale hyperparameter ℓ. Shorter length-scales correspond to functions which vary more rapidly with the inputs x.
The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters γ, provides a principled probabilistic framework for kernel learning:
\[ \log p(y \mid \gamma, X) \;\propto\; -\left[\, y^\top (K_\gamma + \sigma^2 I)^{-1} y \;+\; \log\big|K_\gamma + \sigma^2 I\big| \,\right], \qquad (4) \]
where we have used K_γ as shorthand for K_{X,X} given γ. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning is performed by optimizing Eq. (4) with respect to γ.
The computational bottleneck for inference is solving the linear system (K_{X,X} + σ²I)^{-1} y, and for kernel learning it is computing the log determinant log|K_{X,X} + σ²I|. The standard approach is to compute the Cholesky decomposition of the n × n matrix K_{X,X},
Deep Kernel Learning
[Figure 1 (Deep Kernel Learning): inputs x are passed through L parametric hidden layers and then a hidden layer with an infinite number of basis functions, i.e. a Gaussian process with base kernel hyperparameters θ; all deep-kernel parameters γ = {w, θ} are learned jointly through the marginal likelihood of the Gaussian process.]
For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:
\[ \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\frac{\partial K_\gamma}{\partial \theta}, \qquad
   \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\,\frac{\partial K_\gamma}{\partial g(x, w)}\,\frac{\partial g(x, w)}{\partial w}. \]
The implicit derivative of the log marginal likelihood with respect to our n × n data covariance matrix K_γ is given by
\[ \frac{\partial \mathcal{L}}{\partial K_\gamma} = \frac{1}{2}\left( K_\gamma^{-1} y y^\top K_\gamma^{-1} - K_\gamma^{-1} \right), \qquad (7) \]
where we have absorbed the noise covariance σ²I into our covariance matrix, and treat it as part of the base kernel hyperparameters θ. ∂K_γ/∂θ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly,
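The following sketch (illustrative only; the kernel and data are toy assumptions) evaluates the log marginal likelihood of Eq. (4) with a Cholesky factorisation and the implicit gradient ∂L/∂K of Eq. (7), and checks the latter against a finite difference.

```python
# L = log p(y | gamma, X), and dL/dK = 0.5 (K^{-1} y y^T K^{-1} - K^{-1}) as in Eq. (7),
# with the noise sigma^2 I absorbed into K.
import numpy as np

def log_marginal_likelihood(K, y):
    n = len(y)
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 log|K|
            - 0.5 * n * np.log(2 * np.pi))

def dL_dK(K, y):
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * (np.outer(alpha, alpha) - K_inv)          # Eq. (7)

# toy check against a symmetric finite-difference perturbation of one entry of K
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 1))
K = np.exp(-0.5 * (X - X.T) ** 2) + 0.1 * np.eye(6)
y = rng.normal(size=6)

G = dL_dK(K, y)
eps = 1e-6
E = np.zeros_like(K); E[1, 2] = E[2, 1] = eps
fd = (log_marginal_likelihood(K + E, y) - log_marginal_likelihood(K - E, y)) / (2 * eps)
print(fd, G[1, 2] + G[2, 1])                               # the two values should match closely
```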
• Neural networks as Gaussian Processes
  – Deep Neural Networks as Gaussian Processes (Lee+, ICLR 2018)
    » A fully-connected network whose weights and biases are drawn i.i.d. with zero mean becomes a Gaussian Process in the limit of infinite layer width
  – Single hidden layer
    » With N₁ → ∞ hidden units, the output z¹ᵢ(x) is a sum of i.i.d. terms, so by the Central Limit Theorem any finite collection of outputs is jointly Gaussian, i.e. a GP (Neal, 1994; Williams, 1997)
  – Deep networks
    » Taking the hidden layer widths to infinity in succession, every layer's pre-activations are governed by a GP, and the kernel Kˡ of layer l is obtained recursively from Kˡ⁻¹ (excerpts below)
denote its output. The ith component of the activations in the lth layer, post-nonlinearity and post-affine transformation, are denoted x^l_i and z^l_i respectively. We will refer to these as the post- and pre-activations. (We let x^0_i ≡ x_i for the input, dropping the Arabic numeral superscript, and instead use a Greek superscript x^α to denote a particular input α.) Weight and bias parameters for the lth layer have components W^l_ij, b^l_i, which are independent and randomly drawn, and we take them all to have zero mean and variances σ_w²/N_l and σ_b², respectively. GP(μ, K) denotes a Gaussian process with mean and covariance functions μ(·), K(·, ·), respectively.
2.2 REVIEW OF GAUSSIAN PROCESSES AND SINGLE-LAYER NEURAL NETWORKS
We briefly review the correspondence between single-hidden layer neural networks and GPs (Neal (1994a;b); Williams (1997)). The ith component of the network output, z^1_i, is computed as,
\[ z_i^1(x) = b_i^1 + \sum_{j=1}^{N_1} W_{ij}^1 x_j^1(x), \qquad x_j^1(x) = \phi\Big( b_j^0 + \sum_{k=1}^{d_{\mathrm{in}}} W_{jk}^0 x_k \Big), \qquad (1) \]
where we have emphasized the dependence on input x. Because the weight and bias parameters are taken to be i.i.d., the post-activations x^1_j, x^1_{j'} are independent for j ≠ j'. Moreover, since z^1_i(x) is a sum of i.i.d. terms, it follows from the Central Limit Theorem that in the limit of infinite width N_1 → ∞, z^1_i(x) will be Gaussian distributed. Likewise, from the multidimensional Central Limit Theorem, any finite collection of {z^1_i(x^{α=1}), ..., z^1_i(x^{α=k})} will have a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process. Therefore we conclude that z^1_i ∼ GP(μ^1, K^1), a GP with mean μ^1 and covariance K^1, which are themselves independent of i. Because the parameters have zero mean, we have that μ^1(x) = E[z^1_i(x)] = 0 and,
\[ K^1(x, x') \equiv \mathbb{E}\big[z_i^1(x)\, z_i^1(x')\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[x_i^1(x)\, x_i^1(x')\big] \equiv \sigma_b^2 + \sigma_w^2\, C(x, x'), \qquad (2) \]
where we have introduced C(x, x') as in Neal (1994a); it is obtained by integrating against the distribution of W^0, b^0. Note that, as any two z^1_i, z^1_j for i ≠ j are joint Gaussian and have zero covariance, they are guaranteed to be independent despite utilizing the same features produced by the hidden layer.
2.3 GAUSSIAN PROCESSES AND DEEP NEURAL NETWORKS
The arguments of the previous section can be extended to deeper layers by induction. We proceed by taking the hidden layer widths to be infinite in succession (N_1 → ∞, N_2 → ∞, etc.) as we continue with the induction, to guarantee that the input to the layer under consideration is already governed by a GP. In Appendix C we provide an alternative derivation, in terms of marginalization over intermediate layers, which does not depend on the order of limits, in the case of a Gaussian prior on the weights. A concurrent work (Anonymous, 2018) further derives the convergence rate towards a GP if all layers are taken to infinite width simultaneously.
Suppose that z^{l-1}_j is a GP, identical and independent for every j (and hence x^l_j(x) are independent and identically distributed). After l − 1 steps, the network computes
\[ z_i^l(x) = b_i^l + \sum_{j=1}^{N_l} W_{ij}^l x_j^l(x), \qquad x_j^l(x) = \phi\big(z_j^{l-1}(x)\big). \qquad (3) \]
As before, z^l_i(x) is a sum of i.i.d. random terms so that, as N_l → ∞, any finite collection {z^l_i(x^{α=1}), ..., z^l_i(x^{α=k})} will have joint multivariate Gaussian distribution and z^l_i ∼ GP(0, K^l). The covariance is
\[ K^l(x, x') \equiv \mathbb{E}\big[z_i^l(x)\, z_i^l(x')\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{z_i^{l-1} \sim \mathrm{GP}(0, K^{l-1})}\big[\phi(z_i^{l-1}(x))\, \phi(z_i^{l-1}(x'))\big]. \qquad (4) \]
By induction, the expectation in Equation (4) is over the GP governing z^{l-1}_i, but this is equivalent to integrating against the joint distribution of only z^{l-1}_i(x) and z^{l-1}_i(x'). The latter is described by a zero mean, two-dimensional Gaussian whose covariance matrix has distinct entries K^{l-1}(x, x'), K^{l-1}(x, x), and K^{l-1}(x', x'). As such, these are the only three quantities that appear in the result. We introduce the shorthand
\[ K^l(x, x') = \sigma_b^2 + \sigma_w^2\, F_\phi\!\left(K^{l-1}(x, x'),\, K^{l-1}(x, x),\, K^{l-1}(x', x')\right) \qquad (5) \]
to emphasize the recursive relationship between K^l and K^{l-1} via a deterministic function F_φ whose form depends only on the nonlinearity φ. This gives an iterative series of computations which can
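A simple way to see the recursion of Eqs. (4)–(5) in action is to estimate F_φ by Monte Carlo: sample the two-dimensional Gaussian defined by K^{l−1} and average φ(z(x))φ(z(x′)). The sketch below does that for a tanh nonlinearity; σ_b², σ_w² and the base kernel K⁰ are illustrative choices (the paper instead evaluates F_φ analytically or on a grid).

```python
# Monte Carlo estimate of K^l(x, x') = sigma_b^2 + sigma_w^2 * E[phi(z(x)) phi(z(x'))],
# where (z(x), z(x')) ~ N(0, K^{l-1}) restricted to the pair of inputs.
import numpy as np

def next_kernel_entry(k_xx, k_xx1, k_x1x1, sigma_b2=0.1, sigma_w2=1.6,
                      phi=np.tanh, n_samples=200000, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[k_xx, k_xx1],
                    [k_xx1, k_x1x1]])          # 2x2 covariance of (z^{l-1}(x), z^{l-1}(x'))
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
    return sigma_b2 + sigma_w2 * np.mean(phi(z[:, 0]) * phi(z[:, 1]))

def base_kernel(x, x1, sigma_b2=0.1, sigma_w2=1.6, d_in=1):
    # K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
    return sigma_b2 + sigma_w2 * np.dot(x, x1) / d_in

x, x1 = np.array([0.8]), np.array([-0.3])
K = {"xx": base_kernel(x, x), "xx1": base_kernel(x, x1), "x1x1": base_kernel(x1, x1)}
for layer in range(3):                          # iterate the recursion for a 3-hidden-layer net
    K = {"xx":   next_kernel_entry(K["xx"],   K["xx"],  K["xx"]),
         "xx1":  next_kernel_entry(K["xx"],   K["xx1"], K["x1x1"]),
         "x1x1": next_kernel_entry(K["x1x1"], K["x1x1"], K["x1x1"])}
print(K)   # kernel entries of the GP corresponding to the infinitely wide deep network
```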
• Background: learning to learn / meta-learning (from Oriol Vinyals' NIPS 17 meta-learning talk)
  – Meta-learning approaches can be grouped into model-based, metric-based and optimization-based methods
Model Based
● Santoro et al. ’16
● Duan et al. ’17
● Wang et al. ‘17
● Munkhdalai & Yu ‘17
● Mishra et al. ‘17
Optimization Based
● Schmidhuber ’87, ’92
● Bengio et al. ’90, ‘92
● Hochreiter et al. ’01
● Li & Malik ‘16
● Andrychowicz et al. ’16
● Ravi & Larochelle ‘17
● Finn et al ‘17
Metric Based
● Koch ’15
● Vinyals et al. ‘16
● Snell et al. ‘17
● Shyam et al. ‘17
Trends: Learning to Learn / Meta Learning
• Model-based meta-learning
  – e.g. One-shot Learning with Memory-Augmented Neural Networks (MANN) (Santoro+, ICML 16)
    » The model itself (e.g. an external memory or a recurrent state) stores the observed examples of a new task
    » Predictions for new inputs are conditioned on this stored context in a single forward pass, with no gradient updates at test time
Oriol Vinyals, NIPS 17
Model Based Meta Learning
• Metric-based meta-learning
  – e.g. Matching Networks (Vinyals+, NIPS 2016)
    » Support examples are embedded by a network g_θ(x_i, S) and the query by a network f_θ(x̂, S)
    » The query label is predicted as an attention-weighted sum of the support labels, ŷ = Σ_i a(x̂, x_i) y_i, where the attention a is a softmax over similarities between the f_θ and g_θ embeddings
    » The embedding networks f_θ, g_θ are trained end-to-end over many episodes (tasks)
Oriol Vinyals, NIPS 17
Matching Networks, Vinyals et al, NIPS 2016
Metric Based Meta Learning
• Optimization-based meta-learning
  – e.g. Model-Agnostic Meta-Learning (MAML) (Finn+, ICML 2017)
    » Learns an initialization θ from which a few gradient steps on a small amount of data from a new task already give good performance
    » Inner loop: adapt θ to each sampled task by gradient descent; outer loop: update θ so that the adapted parameters θ'_i perform well (Algorithm 1 and the sketch below)
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Algorithm 1 Model-Agnostic Meta-Learning
Require: p(T): distribution over tasks
Require: α, β: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:   Sample batch of tasks T_i ∼ p(T)
4:   for all T_i do
5:     Evaluate ∇_θ L_{T_i}(f_θ) with respect to K examples
6:     Compute adapted parameters with gradient descent: θ'_i = θ − α ∇_θ L_{T_i}(f_θ)
7:   end for
8:   Update θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i})
9: end while
Figure 1. Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a representation θ that can quickly adapt to new tasks.
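Below is a hedged sketch of Algorithm 1 on a toy sine-regression problem. To keep it short it uses the first-order approximation (the gradient through the inner update is dropped, i.e. FOMAML rather than full MAML), a hand-coded three-parameter model and analytic gradients; none of these choices come from the paper.

```python
# First-order MAML sketch: inner adaptation per task, then meta-update of the initialization.
import numpy as np

rng = np.random.default_rng(0)

def predict(theta, x):
    # tiny model: y_hat = theta[0] * sin(x) + theta[1] * x + theta[2]
    return theta[0] * np.sin(x) + theta[1] * x + theta[2]

def grad_loss(theta, x, y):
    # analytic gradient of the mean-squared error w.r.t. theta
    err = predict(theta, x) - y
    basis = np.stack([np.sin(x), x, np.ones_like(x)])
    return 2.0 * basis @ err / len(x)

def sample_task():
    # task T_i: fit y = a * sin(x) + b with random amplitude a and offset b
    a, b = rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)
    return lambda x: a * np.sin(x) + b

alpha, beta, K = 0.05, 0.01, 10
theta = np.zeros(3)                              # 1: initialize theta
for step in range(2000):                         # 2: while not done
    meta_grad = np.zeros_like(theta)
    for _ in range(4):                           # 3: sample a batch of tasks
        f = sample_task()
        x = rng.uniform(-3, 3, K); y = f(x)
        theta_i = theta - alpha * grad_loss(theta, x, y)   # 5-6: inner gradient step
        x2 = rng.uniform(-3, 3, K); y2 = f(x2)
        meta_grad += grad_loss(theta_i, x2, y2)  # first-order outer gradient (FOMAML)
    theta = theta - beta * meta_grad / 4         # 8: meta-update of the initialization
print(theta)
```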
Oriol Vinyals, NIPS 17
Summing Up
Model Based
Metric Based
Optimization Based
Oriol Vinyals, NIPS 17
Examples of Optimization Based Meta Learning
Finn et al, 17
Ravi et al, 17
• Conditional Neural Processes (CNP) (Garnelo+, ICML 2018)
  – Motivation
    » Deep networks need large datasets and are typically trained from scratch for each new function
    » GPs can infer the shape of a new function from a handful of observations (x_i, y_i), but are computationally expensive and require a hand-designed prior (kernel)
  – Idea
    » Train a single neural model that, conditioned on an arbitrary set of observations (x_i, y_i), directly outputs a predictive distribution for new inputs
    » For n observations and m target inputs, prediction costs only O(n + m)
Conditional Neural Processes
Marta Garnelo 1
Dan Rosenbaum 1
Chris J. Maddison 1
Tiago Ramalho 1
David Saxton 1
Murray Shanahan 1 2
Yee Whye Teh 1
Danilo J. Rezende 1
S. M. Ali Eslami 1
Abstract
Deep neural networks excel at function approxi-
mation, yet they are typically trained from scratch
for each new function. On the other hand,
Bayesian methods, such as Gaussian Processes
(GPs), exploit prior knowledge to quickly infer
the shape of a new function at test time. Yet GPs
are computationally expensive, and it can be hard
to design appropriate priors. In this paper we
propose a family of neural models, Conditional
Neural Processes (CNPs), that combine the bene-
fits of both. CNPs are inspired by the flexibility
of stochastic processes such as GPs, but are struc-
tured as neural networks and trained via gradient
descent. CNPs make accurate predictions after
observing only a handful of training data points,
yet scale to complex functions and large datasets.
We demonstrate the performance and versatility
of the approach on a range of canonical machine
learning tasks, including regression, classification
and image completion.
1. Introduction
Deep neural networks have enjoyed remarkable success
in recent years, but they require large datasets for effec-
tive training (Lake et al., 2017; Garnelo et al., 2016). One
[Figure 1. Conditional Neural Process. a) Data description. b) Training regime of conventional supervised deep learning models. c) Our model.]
• CNP model: three components (Garnelo+, ICML 2018)
  – Encoder h
    » Maps each observed pair in O = {(x_i, y_i)}_{i=0}^{n-1} to a representation r_i = h_θ(x_i, y_i)
  – Aggregator
    » Combines the r_i with a commutative operation into a single fixed-dimensional, order-invariant representation r
  – Decoder g
    » For each target input in T = {x_i}_{i=n}^{n+m-1}, outputs the parameters φ_i = g_θ(x_i, r) of the predictive distribution
      • for regression: mean and variance (μ_i, σ_i²)
      • for classification: class probabilities
Conditional Neural Processes
spect to some prior process. CNPs parametrize distributions over f(T) given a distributed representation of O of fixed dimensionality. By doing so we give up the mathematical guarantees associated with stochastic processes, trading this off for functional flexibility and scalability.
Specifically, given a set of observations O, a CNP is a conditional stochastic process Q_θ that defines distributions over f(x) for inputs x ∈ T. θ is the real vector of all parameters defining Q. Inheriting from the properties of stochastic processes, we assume that Q_θ is invariant to permutations of O and T. If O', T' are permutations of O and T, respectively, then Q_θ(f(T) | O, T) = Q_θ(f(T') | O, T') = Q_θ(f(T) | O', T). In this work, we generally enforce permutation invariance with respect to T by assuming a factored structure. Specifically, we consider Q_θs that factor Q_θ(f(T) | O, T) = Π_{x∈T} Q_θ(f(x) | O, x). In the absence of assumptions on output space Y, this is the easiest way to ensure a valid stochastic process. Still, this framework can be extended to non-factored distributions; we consider such a model in the experimental section.
The defining characteristic of a CNP is that it conditions on O via an embedding of fixed dimensionality. In more detail, we use the following architecture,
\[ r_i = h_\theta(x_i, y_i) \qquad \forall (x_i, y_i) \in O \qquad (1) \]
\[ r = r_1 \oplus r_2 \oplus \ldots \oplus r_{n-1} \oplus r_n \qquad (2) \]
\[ \phi_i = g_\theta(x_i, r) \qquad \forall (x_i) \in T \qquad (3) \]
where h_θ : X × Y → R^d and g_θ : X × R^d → R^e are neural networks, ⊕ is a commutative operation that takes elements in R^d and maps them into a single element of R^d, and φ_i are parameters for Q_θ(f(x_i) | O, x_i) = Q(f(x_i) | φ_i). Depending on the task the model learns to parametrize a different output distribution. This architecture ensures permutation invariance and O(n + m) scaling for conditional prediction. We note that, since r_1 ⊕ ... ⊕ r_n can be computed in O(1) from r_1 ⊕ ... ⊕ r_{n-1}, this architecture supports streaming observations with minimal overhead.
For regression tasks we use φ_i to parametrize the mean and variance φ_i = (μ_i, σ_i²) of a Gaussian distribution N(μ_i, σ_i²) for every x_i ∈ T. For classification tasks φ_i parametrizes the class probabilities.
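To make Eqs. (1)–(3) concrete, here is a minimal numpy sketch of a CNP forward pass with randomly initialised MLPs standing in for h_θ and g_θ, and a mean as the commutative aggregation ⊕. All sizes and the (μ, log σ) output parametrisation are illustrative assumptions; a real CNP trains these networks with the loss in Eq. (4).

```python
# CNP forward pass: encode each (x_i, y_i), aggregate to r, decode each target with (x_t, r).
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # returns an apply function for a small tanh MLP with random weights
    params = [(rng.normal(0, 0.3, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]
    def apply(z):
        for i, (W, b) in enumerate(params):
            z = z @ W + b
            if i < len(params) - 1:
                z = np.tanh(z)
        return z
    return apply

d_r = 16
h = mlp([2, 32, d_r])        # encoder h_theta : (x_i, y_i) -> r_i                    (Eq. 1)
g = mlp([1 + d_r, 32, 2])    # decoder g_theta : (x_i, r) -> phi_i = (mu_i, log sigma_i)  (Eq. 3)

def cnp_predict(x_context, y_context, x_target):
    r_i = h(np.stack([x_context, y_context], axis=-1))   # one r_i per observation
    r = r_i.mean(axis=0)                                  # commutative aggregation    (Eq. 2)
    inp = np.concatenate([x_target[:, None], np.tile(r, (len(x_target), 1))], axis=-1)
    phi = g(inp)
    mu, sigma = phi[:, 0], np.exp(phi[:, 1])              # Gaussian parameters per target
    return mu, sigma

x_c = np.array([-1.0, 0.0, 1.0]); y_c = np.sin(x_c)
mu, sigma = cnp_predict(x_c, y_c, np.linspace(-2, 2, 5))
print(mu, sigma)
```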
• Training CNPs (Garnelo+, ICML 2018)
  – Training data consists of functions f ∼ P, each observed at points {(x_i, y_i)}_{i=0}^{n-1}
  – At each step, sample a function f and a number of context points N, and form the context O_N = {(x_i, y_i)}_{i=0}^{N}
  – Condition on O_N and predict all targets {y_i}_{i=0}^{n-1} at inputs {x_i}_{i=0}^{n-1} (both observed and unobserved points)
  – Train by minimizing the negative conditional log probability (Eq. (4) below), using Monte Carlo estimates of its gradient obtained by sampling f and N
O_N = {(x_i, y_i)}_{i=0}^{N} ⊂ O, the first N elements of O. We minimize the negative conditional log probability
\[ \mathcal{L}(\theta) = -\mathbb{E}_{f \sim P}\Big[\, \mathbb{E}_N\big[\, \log Q_\theta\big(\{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1}\big) \,\big] \Big] \qquad (4) \]
Thus, the targets it scores Q_θ on include both the observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling f and N.
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the Q_θ are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that. In summary,
1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions f ∼ P.
2. A CNP is permutation invariant in O and T.
3. A CNP is scalable, achieving a running time complexity of O(n + m) for making m predictions with n observations.
Within this specification of the model there are still some aspects that can be modified to suit specific requirements. The exact implementation of h, for example, can be adapted to the data type. For low dimensional data the encoder can be implemented as an MLP, whereas for inputs with larger dimensions and spatial correlations it can also include convolutions. Finally, in the setup described the model is not able to produce any coherent samples, as it learns to model only a factored prediction of the mean and the variances, disregarding the covariance between target points. This is a result of this particular implementation of the model. One way we can obtain coherent samples is by introducing a latent variable that we can sample from. We carry out some proof-of-concept experiments on such a model in section 4.2.3.
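The loss in Eq. (4) can be estimated exactly as the text describes: sample a function f, sample N, condition on the first N points and score the Gaussian predictions on all targets. The sketch below is an illustrative, self-contained version; the sine "function distribution" and the naive stand-in predictor are assumptions (in practice predict_fn would be the CNP from the previous sketch, trained by gradient descent on this loss).

```python
# Monte Carlo estimate of L(theta) = -E_f E_N [ log Q_theta({y_i} | O_N, {x_i}) ]
import numpy as np

rng = np.random.default_rng(1)

def gaussian_nll(y, mu, sigma):
    # negative log N(y | mu, sigma^2), summed over all target points
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2) + 0.5 * ((y - mu) / sigma) ** 2)

def naive_predict(x_context, y_context, x_target):
    # stand-in for a CNP's output: predict the context mean with the context std everywhere
    mu = np.full_like(x_target, y_context.mean())
    sigma = np.full_like(x_target, y_context.std() + 1e-1)
    return mu, sigma

def cnp_loss(sample_function, predict_fn, n_points=20, n_mc=16):
    total = 0.0
    for _ in range(n_mc):
        f = sample_function()                       # f ~ P
        x = np.sort(rng.uniform(-2, 2, n_points))
        y = f(x)
        N = rng.integers(1, n_points)               # number of context points, sampled each time
        mu, sigma = predict_fn(x[:N], y[:N], x)     # condition on O_N, score all targets
        total += gaussian_nll(y, mu, sigma)
    return total / n_mc

# stand-in function distribution: random-amplitude sines (the paper samples curves from a GP)
def sample_function():
    a = rng.uniform(0.5, 2.0)
    return lambda x: a * np.sin(x)

print(cnp_loss(sample_function, naive_predict))     # training would minimise this w.r.t. theta
```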
• Experiment 1: 1-D function regression (CNP vs GP)
  – Setup
    » Functions are sampled from a GP with an exponential kernel; one dataset uses fixed kernel parameters, the other switches between two sets of kernel parameters at a random point on the real line
    » At every training step a curve is sampled from the GP, a subset of n points (x_i, y_i) is used as observations and another subset (x_t, y_t) as targets
  – Results
    » The predicted mean and variance of the CNP get closer to the ground truth as the number of context points increases
    » The CNP also handles the switching-kernel dataset, which is not trivial for GPs, simply by changing the dataset used for training
3.2. Meta-Learning
Deep learning models are generally more scalable and are
very successful at learning features and prior knowledge
from the data directly. However they tend to be less flex-
ible with regards to input size and order. Additionally, in
general they only approximate one function as opposed to
distributions over functions. Meta-learning approaches ad-
dress the latter and share our core motivations. Recently
meta-learning has been applied to a wide range of tasks
like RL (Wang et al., 2016; Finn et al., 2017) or program
induction (Devlin et al., 2017).
Often meta-learning algorithms are implemented as deep
generative models that learn to do few-shot estimations
of the underlying density of the data. Generative Query
Networks (GQN), for example, predict new viewpoints in
3D scenes given some context observations using a similar
training regime to NPs (Eslami et al., 2018). As such, NPs
can be seen as a generalisation of GQN to few-shot predic-
tion tasks beyond scene understanding, such as regression
and classification. Another way of carrying out few-shot
density estimation is by updating existing models like Pix-
elCNN (van den Oord et al., 2016) and augmenting them
with attention mechanisms (Reed et al., 2017) or including
a memory unit in a VAE model (Bornschein et al., 2017).
Another successful latent variable approach is to explicitly
condition on some context during inference (J. Rezende
et al., 2016). Given the generative nature of these models
they are usually applied to image generation tasks, but mod-
els that include a conditioning class-variable can be used for
classification as well.
Classification itself is another common task in meta-
learning. Few-shot classification algorithms usually rely
on some distance metric in feature space to compare target
images to the observations provided (Koch et al., 2015),
(Santoro et al., 2016). Matching networks (Vinyals et al.,
2016; Bartunov & Vetrov, 2016) are closely related to CNPs.
In their case features of samples are compared with target
features using an attention kernel. At a higher level one can
interpret this model as a CNP where the aggregator is just
the concatenation over all input samples and the decoder g
contains an explicitly defined distance kernel. In this sense
matching networks are closer to GPs than to CNPs, since
they require the specification of a distance kernel that CNPs
learn from the data instead. In addition, as MNs carry out all-
to-all comparisons they scale with O(n ⇥ m), although they
other generative models the neural statistician learns to esti-
mate the density of the observed data but does not allow for
targeted sampling at what we have been referring to as input
positions xi. Instead, one can only generate i.i.d. samples
from the estimated density. Finally, the latent variant of
CNP can also be seen as an approximated amortized version
of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al.,
2015; Louizos et al., 2017; Louizos & Welling, 2017)
4. Experimental Results
Figure 2. 1-D Regression. Regression results on a 1-D curve
(black line) using 5 (left column) and 50 (right column) context
points (black dots). The first two rows show the predicted mean
and variance for the regression of a single underlying kernel for
GPs (red) and CNPs (blue). The bottom row shows the predictions
of CNPs for a curve with switching kernel parameters.
• Experiment 2: image completion as regression (MNIST, CelebA)
  – Setup
    » Image completion is framed as regression over functions f : [0, 1]² → [0, 1] for grayscale MNIST, or f : [0, 1]² → [0, 1]³ for RGB CelebA
    » The input x is the 2-D pixel coordinate normalized to [0, 1]², and the output y is the (normalized) pixel intensity
  – Results
    » Even with only a handful of context pixels the model produces reasonable completions, and the predictions sharpen as more context points are given
    » Conditioned on a single non-informative context point, the prediction corresponds to the average over all training images, i.e. a learned, dataset-specific prior
4.1. Function Regression
As a first experiment we test CNP on the classical 1D re-
gression task that is used as a common baseline for GPs.
We generate two different datasets that consist of functions
generated from a GP with an exponential kernel. In the first
dataset we use a kernel with fixed parameters, and in the
second dataset the function switches at some random point
on the real line between two functions each sampled with
different kernel parameters.
At every training step we sample a curve from the GP, select
a subset of n points (xi, yi) as observations, and a subset of
points (xt, yt) as target points. Using the model described
that increases in accuracy as the number of context points increases.
Furthermore the model achieves similarly good performance on the switching-kernel task, which is not trivial for GPs, whereas for CNPs it only requires changing the dataset used for training.
4.2. Image Completion
We consider image completion as a regression task over functions in either f : [0, 1]² → [0, 1] for grayscale images, or f : [0, 1]² → [0, 1]³ for RGB images, with the 2D pixel coordinates normalized to [0, 1]².
Conditional Neural Processes
Figure 3. Pixel-wise image regression on MNIST. Left: Two examples of image regression with varying numbers of observations. We
provide the model with 1, 40, 200 and 728 context points (top row) and query the entire image. The resulting mean (middle row) and
variance (bottom row) at each pixel position is shown for each of the context images. Right: model accuracy with increasing number of
observations that are either chosen at random (blue) or by selecting the pixel with the highest variance (red).
• Image completion on CelebA: quantitative comparison with kNN and GPs
  – Evaluation
    » Pixel-wise mean squared error over the whole image, given 10, 100 or 1000 context points chosen either at random or in order from the top-left corner (Table 1)
  – Results
    » With few context points (roughly half the image or less) CNPs outperform kNNs and GPs
    » When most of the image is given as context, exact methods such as GPs and kNN perform better
    » CNPs perform well regardless of the order of the context points, whereas GPs and kNNs degrade when the context is ordered
    » Both observations suggest that CNPs learn a dataset-specific "prior" that produces good predictions even from very few context points
Conditional Neural Processes
4.2.1. MNIST
We first test CNP on the MNIST dataset and use the test
set to evaluate its performance. As shown in Figure 3a the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point
(e.g. a black pixel on the edge) the model’s prediction corre-
sponds to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowl-
edge. It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as we have a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution. We consider this model
later in section 4.2.3.
An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of Figure 3a, as we add more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
Figure 4. Pixel-wise image completion on CelebA. Two exam-
ples of CelebA image regression with varying numbers of obser-
vations. We provide the model with 1, 10, 100 and 1000 context
points (top row) and query the entire image. The resulting mean
(middle row) and variance (bottom row) at each pixel position is
shown for each of the context images.
An important aspect of CNPs demonstrated in Figure 5, is
its flexibility not only in the number of observations and
targets it receives but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
Figure 5. Flexible image completion. In contrast to standard con-
ditional models, CNPs can be directly conditioned on observed
pixels in arbitrary patterns, even ones which were never seen in
the training set. Similarly, the model can predict values for pixel
coordinates that were never included in the training set, like sub-
pixel values in different resolutions. The dotted white lines were
added for clarity after generation.
and prediction task, and has the capacity to extract domain
knowledge from a training set.
We compare CNPs quantitatively to two related models:
kNNs and GPs. As shown in Table 4.2.3 CNPs outperform
the latter when number of context points is small (empiri-
cally when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the ta-
ble we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.
4.2.3. LATENT VARIABLE MODEL
The main model we use throughout this paper is a factored
model that predicts the mean and variance of the target
outputs. Although we have shown that the mean is by itself
a useful prediction, and that the variance is a good way to
capture the uncertainty, this factored model prevents us from
Figure 6. Image completion with a latent variable model. The
latent variables capture the global uncertainty, allowing the sam-
pling of different coherent images which conform to the obser-
vations. As the number of observations increases, uncertainty is
reduced and the samples converge to a single estimate.
servations and targets. This forms a multivariate Gaussian
which can be used to coherently draw samples. In order to
maintain this property in a trained model, one approach is to
train the model to predict a GP kernel (Wilson et al., 2016).
However the difficulty is the need to back-propagate through
the sampling which involves a large matrix inversion (or
some approximation of it).
        Random Context            Ordered Context
  #     10      100     1000      10      100     1000
  kNN   0.215   0.052   0.007     0.370   0.273   0.007
  GP    0.247   0.137   0.001     0.257   0.220   0.002
  CNP   0.039   0.016   0.009     0.057   0.047   0.021
Table 1. Pixel-wise mean squared error for all of the pixels in the
image completion task on the CelebA data set with increasing
number of context points (10, 100, 1000). The context points are
chosen either at random or ordered from the top-left corner to the
bottom-right. With fewer context points CNPs outperform kNNs
and GPs. In addition CNPs perform well regardless of the order of
the context points, whereas GPs and kNNs perform worse when
the context is ordered.
In contrast, the approach we use is to simply add latent
variables z to our decoder g, allowing our model to capture
• Flexibility of CNPs (image completion)
  – Flexibility in the observations
    » CNPs can be conditioned directly on observed pixels in arbitrary patterns, even ones never seen during training (e.g. one half of the image)
  – Flexibility in the targets
    » CNPs can predict values at pixel coordinates that were never included in the training set, such as sub-pixel values at different resolutions
  – This contrasts with GPs on one hand and with standard trained conditional generative models on the other (Figure 5)
• Experiment 3: one-shot classification on Omniglot (CNP)
  – Setup
    » Omniglot: 1,623 character classes from 50 alphabets, 20 examples per class; 1,200 randomly selected classes for training, the rest for testing, with the augmentation protocol of Santoro et al. (2016)
    » N-way, 1-shot / 5-shot classification: at test time the model sees one labelled example per class and outputs class probabilities for a new unlabelled example (Figure 7)
  – Results (Table 2)
    » CNP is competitive with MANN (Santoro+, ICML 16) and somewhat below Matching Networks (Vinyals+, NIPS 16) in accuracy
    » It uses a significantly simpler architecture and runs in O(n + m) at test time, whereas MANN and Matching Networks are O(nm)
Conditional Neural Processes
a conditional Gaussian prior p(z|O) that is conditioned on
the observations, and a Gaussian posterior p(z|O, T) that is
also conditioned on the target points.
We apply this model to MNIST and CelebA (Figure 6).
We use the same models as before, but we concatenate the
representation r to a vector of latent variables z of size 64
(for CelebA we use bigger models where the sizes of r and
z are 1024 and 128 respectively). For both the prior and
posterior models, we use three layered MLPs and average
their outputs. We emphasize that the difference between the
prior and posterior is that the prior only sees the observed
pixels, while the posterior sees both the observed and the
target pixels. When sampling from this model with a small
number of observed pixels, we get coherent samples and
we see that the variability of the datasets is captured. As
the model is conditioned on more and more observations,
the variability of the samples drops and they eventually
converge to a single possibility.
4.3. Classification
Finally, we apply the model to one-shot classification using
the Omniglot dataset (Lake et al., 2015) (see Figure 7 for an
overview of the task). This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. As in (Vinyals
et al., 2016) we use 1,200 randomly selected classes as
our training set and the remainder as our testing data set.
In addition we augment the dataset following the protocol
described in (Santoro et al., 2016). This includes cropping
the image from 32 ⇥ 32 to 28 ⇥ 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. We generate
Figure 7. One-shot Omniglot classification. At test time the
model is presented with a labelled example for each class, and
outputs the classification probabilities of a new unlabelled example.
The model uncertainty grows when the new example comes from
an un-observed class.
        5-way Acc            20-way Acc           Runtime
        1-shot    5-shot     1-shot    5-shot
MANN    82.8%     94.9%      -         -           O(nm)
MN      98.1%     98.9%      93.8%     98.5%       O(nm)
CNP     95.3%     98.5%      89.9%     96.8%       O(n + m)
Table 2. Classification results on Omniglot. Results on the same
task for MANN (Santoro et al., 2016), and matching networks
(MN) (Vinyals et al., 2016) and CNP.
using a significantly simpler architecture (three convolu-
tional layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm).
• Neural Processes (NP) (Garnelo+, ICML 2018 Workshop)
  – Difference from CNPs
    » NPs add a global latent variable z, so that different coherent function samples can be drawn for the same context, capturing global uncertainty over the underlying function
  – Relation to Generative Query Networks (GQN)
    » NPs and CNPs can be seen as generalizations of GQN (and Consistent GQN), which use a similar training procedure to predict new viewpoints in 3-D scenes from context observations
Neural Processes
[Figure 1. Neural process model. (a) Graphical model of a neural process. x and y correspond to the data where y = f(x). C and T are the number of context points and target points respectively and z is the global latent variable. A grey background indicates that the variable is observed. (b) Diagram of our neural process implementation: context pairs (x_C, y_C) are encoded by h into r_C, aggregated by a into r, which parameterises z; the decoder g maps z and the target inputs x_T to predictions y_T.]
the prior on the context. As such this prior is equivalent to a less informed posterior of the underlying function. This
formulation makes it clear that the posterior given a subset
of the context points will serve as the prior when additional
context points are included. By using this setup, and train-
ing with different sizes of context, we encourage the learned
model to be flexible with regards to the number and position
of the context points.
2.4. The Neural process model
In our implementation of NPs we accommodate for two additional desiderata: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1b):
• An encoder h from input space into representation space that takes in pairs of (x, y)_i context values and produces a representation r_i = h((x, y)_i) for each of the pairs. We parameterise h as a neural network.
• An aggregator a that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation r that parameterises the latent distribution z ∼ N(μ(r), I σ(r)). The simplest operation that ensures order-invariance and works well in practice is the mean function r = a(r_i) = (1/n) Σ_{i=1}^{n} r_i. Crucially, the aggregator reduces the runtime to O(n + m) where n and m are the number of context and target points respectively.
• A conditional decoder g that takes as input the sampled global latent variable z as well as the new target locations x_T and outputs the predictions ŷ_T for the corresponding values of f(x_T) = y_T.
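A minimal sketch of the three NP components listed above, with random weights and illustrative sizes: encode and average the context into r, map r to (μ(r), σ(r)), sample z, and decode (z, x_T) into predictions. Running it twice with the same context gives two different coherent samples, which is the point of the global latent variable; a real NP would be trained with a variational objective that is not shown here.

```python
# Neural Process sampling pass: encoder h, mean aggregator, latent z ~ N(mu(r), I sigma(r)), decoder g.
import numpy as np

rng = np.random.default_rng(0)

def dense(m, n):
    return rng.normal(0, 0.3, (m, n)), np.zeros(n)

d_r, d_z = 16, 8
Wh1, bh1 = dense(2, 32); Wh2, bh2 = dense(32, d_r)        # encoder h
Wmu, bmu = dense(d_r, d_z); Wsig, bsig = dense(d_r, d_z)  # r -> (mu(r), sigma(r))
Wg1, bg1 = dense(d_z + 1, 32); Wg2, bg2 = dense(32, 1)    # decoder g

def np_sample(x_context, y_context, x_target):
    # encoder: r_i = h((x, y)_i); aggregator: r = mean_i r_i
    h_in = np.stack([x_context, y_context], axis=-1)
    r_i = np.tanh(np.tanh(h_in @ Wh1 + bh1) @ Wh2 + bh2)
    r = r_i.mean(axis=0)
    # latent: z ~ N(mu(r), I * sigma(r))
    mu, sigma = r @ Wmu + bmu, np.exp(r @ Wsig + bsig) * 0.1
    z = mu + sigma * rng.normal(size=d_z)
    # decoder: y_hat_T = g(z, x_T); a new z gives a different coherent function sample
    g_in = np.concatenate([np.tile(z, (len(x_target), 1)), x_target[:, None]], axis=-1)
    return np.tanh(g_in @ Wg1 + bg1) @ Wg2 + bg2

x_c = np.array([-1.0, 0.0, 1.0]); y_c = np.sin(x_c); x_t = np.linspace(-2, 2, 5)
print(np_sample(x_c, y_c, x_t).ravel())   # sample 1
print(np_sample(x_c, y_c, x_t).ravel())   # sample 2 (differs because a new z is drawn)
```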
3. Related work
3.1. Conditional neural processes
Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs). A large part of the motivation behind NPs is that CNPs lack a latent variable that allows for global samples (see Figure 2c for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connections. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear.
In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.
Finally, NPs and CNPs themselves can be seen as generalizations of recently published generative query networks (GQN) which apply a similar training procedure to predicting new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).
3.2. Gaussian processes
We start by considering models that, like NPs, lie on a spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs on the other hand can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.
Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods like (Calandra et al., 2016; Huang et al., 2015) remain fairly close to the GPs, but
Figure 2 (Neural Processes paper). Graphical models of related models (a–c) and of the neural process (d); the panels include the neural statistician, (c) the conditional neural process and (d) the neural process. Gray shading indicates the variable is observed. C stands for context variables and T for target variables, i.e. the variables to predict given C.
Figure 4. Pixel-wise regression on MNIST and CelebA. The diagram on the left visualises how pixel-wise image completion can be framed as a 2-D regression task where f(pixel coordinates) = pixel brightness. The figures to the right of the diagram show the results on image completion for MNIST (10, 100, 300 and 784 context points) and CelebA (15, 30, 90 and 1024 context points). The images on the top correspond to the context points provided to the model. For better clarity the unobserved pixels have been coloured blue for the MNIST images and white for CelebA. Each of the remaining rows (Sample 1–3) corresponds to a different sample of the latent variable conditioned on the same context.
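Framing image completion as 2-D regression amounts to treating each pixel as an (x, y) pair, with x the normalised pixel coordinates and y the brightness, and revealing only a random subset of pixels as context. A minimal sketch of that conversion follows; the helper name and normalisation are illustrative, not taken from the paper's code.

```python
import numpy as np

def image_to_regression_task(img, n_context):
    """Frame image completion as 2-D regression: x = pixel coords, y = brightness."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_all = np.stack([xs.ravel() / (w - 1), ys.ravel() / (h - 1)], axis=-1)  # coords in [0, 1]^2
    y_all = img.reshape(-1, 1).astype(np.float32) / 255.0                    # brightness in [0, 1]
    idx = np.random.choice(len(x_all), size=n_context, replace=False)        # observed pixels
    return x_all[idx], y_all[idx], x_all, y_all    # context pairs + full target grid

# e.g. a 28x28 MNIST-like image observed at 100 random pixels
img = np.random.randint(0, 256, size=(28, 28))
x_c, y_c, x_t, y_t = image_to_regression_task(img, n_context=100)
```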
Figure 5. Thompson sampling with neural processes on a 1-D objective function. The plots show the optimisation process over five
iterations. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points
(black circles). The underlying ground truth function is depicted as a black dotted line. The red triangle indicates the next evaluation point
which corresponds to the minimum value of the sampled NP curve. The red circle in the following iteration corresponds to this evaluation
point with its underlying ground truth value that serves as a new context point to the NP.
4.3. Black-box optimisation with Thompson sampling
To showcase the utility of sampling entire consistent trajecto-
ries we apply neural processes to Bayesian optimisation on
a 1-D function using Thompson sampling (Thompson, 1933).
Thompson sampling (also known as randomised proba-
bility matching) is an approach to tackle the exploration-
exploitation dilemma by maintaining a posterior distribution
over model parameters. A decision is taken by drawing a
sample of model parameters and acting greedily under the
resulting policy. The posterior distribution is then updated
and the process is repeated. Despite its simplicity, Thomp-
son sampling has been shown to be highly effective both
empirically and in theory. It is commonly applied to black
box optimisation and multi-armed bandit problems (e.g.
Agrawal & Goyal, 2012; Shahriari et al., 2016).
… function evaluations increases.

Neural process: 0.26 | Gaussian process: 0.14 | Random search: 1.00
Table 1. Bayesian Optimisation using Thompson sampling.
Average number of optimisation steps needed to reach the global
minimum of a 1-D function generated by a Gaussian process. The
values are normalised by the number of steps taken using random
search. The performance of the Gaussian process with the correct
kernel constitutes an upper bound on performance.
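The Thompson-sampling loop described above is straightforward to write down once an NP surrogate is available. The sketch below assumes the NeuralProcess class from the earlier sketch and a user-supplied objective evaluated on a fixed 1-D grid; all names and shapes are illustrative.

```python
import torch

def thompson_sampling_step(model, objective, x_context, y_context, x_grid):
    """One Thompson-sampling iteration with an NP surrogate (illustrative)."""
    with torch.no_grad():
        # Draw ONE global latent sample z -> one coherent function over the grid
        y_mu, _, _, _ = model(x_context, y_context, x_grid)
    # Act greedily on the sampled function: evaluate its argmin
    idx = int(y_mu.squeeze(0).squeeze(-1).argmin())
    x_next = x_grid[:, idx:idx + 1, :]
    y_next = objective(x_next)                 # expected to return a [1, 1, 1] tensor
    # The new observation joins the context for the next iteration
    x_context = torch.cat([x_context, x_next], dim=1)
    y_context = torch.cat([y_context, y_next], dim=1)
    return x_context, y_context, x_next, y_next

# Usage with a toy objective; `model` is a NeuralProcess instance from the earlier sketch
objective = lambda x: torch.sin(3.0 * x)
x_grid = torch.linspace(-1, 1, 200).view(1, -1, 1)
x_c = torch.zeros(1, 1, 1)
y_c = objective(x_c)
for _ in range(5):
    x_c, y_c, x_next, y_next = thompson_sampling_step(model, objective, x_c, y_c, x_grid)
```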
4.4. Contextual bandits
– 2 NFLO 8 S FS L" /ON JSJON L FT L ODFRRFR" 4N 4/87 "
– 2 NFLO 8 S FS L" FT L ODFRRFR" 4N 4/87 VO KRIOP S DK "
• 8FS 7F NJNH
– J DIJN N 3THO 7 ODIFLLF" :PSJMJY SJON R MO FL GO GFV RIOS LF NJNH" 4N 4/7 "
– ?JN LR : JOL FS L" 8 SDIJNH NFSVO KR GO ONF RIOS LF NJNH" 4N 4 "
– 1JNN /IFLRF JFSF .CCFFL N F HF 7F JNF" 8O FL HNORSJD MFS LF NJNH GO G RS PS SJON OG FFP
NFSVO KR" 4N 4/87 "
– NSO O . M FS L" 8FS LF NJNH VJSI MFMO THMFNSF NFT L NFSVO KR" 4N 4/87 "
– FF DOSS N O F 1 FJS R : JOL ?JN LR" 0FFP 7F NJNH, DSJDF N FN R" 4N 4 TSO J L "
– ?J LR : JOL" 8O FL R :PSJMJY SJON 8FS 7F NJNH" 4N 4 8FS 7F NJNH MPORJTM "
• 2 TRRJ N ODFRRFR
– JLRON .N FV 2O ON FS L" 0FFP KF NFL LF NJNH" . SJGJDJ L 4NSFLLJHFNDF N S SJRSJDR" "
– 6FN LL .LFW N A JN 2 L" I S TNDF S JNSJFR O VF NFF JN C FRJ N FFP LF NJNH GO DOMPTSF
JRJON-" 4N 4 VO KRIOP S DK "
– 7FF FIOON FS L" 0FFP NFT L NFSVO KR R H TRRJ N P ODFRRFR" 4N 4/7 "

[DL輪読会]Conditional Neural Processes

  • 1. 1 DEEP LEARNING JP [DL Papers] http://deeplearning.jp/ Conditional Neural Processes (ICML2018) Neural Processes (ICML2018Workshop) Kazuki Fujikawa, DeNA
  • 2. • – DC DC8A 0 IF8A 1FD: GG GR 01S • , . • 8F 8 8FC AD 8C 2DG C 8I F G DE F 8 GDC 8 D 28 8A D 8J 8L DC IFF8M 8C8 8C M 8C AD 2 N C A GA8 – 0 IF8A 1FD: GG GR01S • , . DF G DE • 8F 8 8FC AD DC8 8C : 8FN 8C 2DG C 8I 8 D DA8 8C AD 2 N C A GA8 M • – 1P00 c rb a cz VYtyc • 1 mkop • 00 uT x k i vWahg nlo d – w s ihe c
  • 5. • : 715 - 2 0 – t R I Cn L i e LF o v c Cg a o C – hR - 2 0 5 - c n w g a sI lrI C l Published as a conference paper at ICLR 2017 o ,- 2 - 5 1 , &
  • 6. • l yʼ v – y l m% g olgm T k f n h k r • W p sd W U Ur U Udh g aT r c – 61: , 7 % :A DIE G LAKJLN .AK LNHAIN J :L I KJLN NEJI 6 NEJI G 1ECD R :L E ANR , HEIE NL NEJI I % :A G -L D 7LAGEHEI LR P G NEJI 8AKJLN • t ʼu t ʼu rwʼ h T ef v f nec – A E 0 RII 0JJCGA KDJNJ G AGA G KAJKGA SCJLEGG S % – k g hT y khef yU r h g • blyU k fm i r k h g + AI GG 627 % + 5AA 2-58 %
  • 8. • – • • ) ( ( – • 8 • ! = ($%, … , $() * ! • ! • + = (,%, … , ,() *(+) • ,(-% *(,(-%, +) • *(,(-%|+)
  • 9. • – • – !" #" , 9 $ % » !" = %' $(#") » % : * % = + % ,, ./0 1 . – 2 = (!3, … , !5) 9 6 D » 2 = 6% !3 ⋮ !5 $0(#3) ⋯ $9(#3) ⋮ ⋱ ⋮ $0(#5) ⋯ $9(#5)
  • 10. • – ) • ! = ($%, … , $() : * ! – ! ( + 1 0 * ! 1 » * ! = ,(!|., /) • E ! = 1E ! = . • cov ! = E !56 = 17 ++6 16 = 89: 116 = 89: ;(<%);(<%) ⋯ ;(<%);(<() ⋮ ⋱ ⋮ ;(<();(<%) ⋯ ;(<();(<() = @(<%, <%) ⋯ @(<%, <() ⋮ ⋱ ⋮ @(<(, <%) ⋯ @(<(, <() = /
  • 11. • : – • ! – "# 1 $ ε# &# » "# = &# + ε# • ) = ("+, … , ".) 0()) – 0()|!) 1 » 0 ) ! = 2()|!, 345 6) 3 » 0 ) = ∫ 0 ) ! 0 ! 8! = 2 ) 9, :; :; = < =+, =+ + 345 ⋯ <(=+, =.) ⋮ ⋱ ⋮ <(=., =+) ⋯ < =., =. + 345
  • 12. • – • !"#$ 1 : %(!"#$, () – %(!"#$, () % ( » % !"#$, ( = + ( ,, -.#/ ) -.#/ = 0 1$, 1$ + 34/ ⋯ 0(1$, 1"#$) ⋮ ⋱ ⋮ 0(1", 1$) ⋯ 0 1", 1"#$ + 34/ = -. 8 89 : • %(!"#$|() – 2 » % !"#$ ( = <(=>?@,() <(() = " !"#$, ( ,, -.#/ " ( ,, -. = |-.| |-.?/| exp − $ E (, !"#$ F -.#/ (, !"#$ + $ E (F -. ( – » % !"#$ ( = + !"#$ 8F -" 4$ (, : − 8F -" 4$ 8
  • 13. • – • • 3 3 – • 1 3 – • – ! " #((! + ")' ) – • –
  • 14. • ] W – 006 0 40 0 4241 ,2 4 • e D:! " !, $ N – » "(!, $) ++ o snpe ++ » ' [ [ • $, ' C D i r lK – • $, ' e a gL – (2) y µX) , 1 KX,X⇤ . n matrix of X⇤ and X. X and X⇤, valuated at s implicitly dels which space, and ese models or support of any con- Indeed the induced by ernel func- In this section we show how we can contruct kernels which encapsulate the expressive power of deep archi- tectures, and how to learn the properties of these ker- nels as part of a scalable probabilistic GP framework. Specifically, starting from a base kernel k(xi, xj|✓) with hyperparameters ✓, we transform the inputs (pre- dictors) x as k(xi, xj|✓) ! k(g(xi, w), g(xj, w)|✓, w) , (5) where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(xi, xj|✓). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013): kSM(x, x0 |✓) = (6) QX q=1 aq |⌃q| 1 2 (2⇡) D 2 exp ✓ 1 2 ||⌃ 1 2 q (x x0 )||2 ◆ coshx x0 , 2⇡µqi . 372 he marginal likeli- des significant ad- ied to the output twork. Moreover, ng non-parametric very recent KISS- , 2015) and exten- iently representing e deep kernels. ses (GPs) and the dictions and kernel Williams (2006) for . (predictor) vectors on D, which index (x1), . . . , y(xn))> . ection of function bution, N (µ, KX,X ) , (1) e matrix defined nce kernel of the 2 encodes the inductive bias that function values closer together in the input space are more correlated. The complexity of the functions in the input space is deter- mined by the interpretable length-scale hyperparam- eter `. Shorter length-scales correspond to functions which vary more rapidly with the inputs x. The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparame- ters , provides a principled probabilistic framework for kernel learning: log p(y| , X) / [y> (K + 2 I) 1 y+log |K + 2 I|] , (4) where we have used K as shorthand for KX,X given . Note that the expression for the log marginal likeli- hood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning is performed by optimizing Eq. (4) with respect to . The computational bottleneck for inference is solving the linear system (KX,X + 2 I) 1 y, and for kernel learning is computing the log determinant log |KX,X + 2 I|. The standard approach is to compute the Cholesky decomposition of the n ⇥ n matrix KX,X , Deep Kernel Learning . . h (L) 1 h (L) C (L) h1(✓) h1(✓) ers 1 layer y1 yP Output layer . . ... ...... ... ... ing: A Gaussian process with onal inputs x through L para- by a hidden layer with an in- ns, with base kernel hyperpa- an process with a deep kernel ping with an infinite number parametrized by = {w, ✓}. jointly through the marginal ocess. For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters: @L @✓ = @L @K @K @✓ , @L @w = @L @K @K @g(x, w) @g(x, w) @w . The implicit derivative of the log marginal likelihood with respect to our n ⇥ n data covariance matrix K is given by @L @K = 1 2 (K 1 yy> K 1 K 1 ) , (7) where we have absorbed the noise covariance 2 I into our covariance matrix, and treat it as part of the base kernel hyperparameters ✓. 
@K @✓ are the derivatives of the deep kernel with respect to the base kernel hyper- parameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly,
  • 15. • D c a – +77 7 5: 7 C 9 5 ,5 85 0 7 7 77 1 4 • ,0G] wl nr puRu RL N tW I 75: 4 28::85 4 – ko » – ] nr » • ] nr s m k P – i[e ko » – ] nr » » denote its output. The ith component of the activations in the lth layer, post-nonlinearity and post- affine transformation, are denoted xl i and zl i respectively. We will refer to these as the post- and pre-activations. (We let x0 i ⌘ xi for the input, dropping the Arabic numeral superscript, and instead use a Greek superscript x↵ to denote a particular input ↵). Weight and bias parameters for the lth layer have components Wl ij, bl i, which are independent and randomly drawn, and we take them all to have zero mean and variances 2 w/Nl and 2 b , respectively. GP(µ, K) denotes a Gaussian process with mean and covariance functions µ(·), K(·, ·), respectively. 2.2 REVIEW OF GAUSSIAN PROCESSES AND SINGLE-LAYER NEURAL NETWORKS We briefly review the correspondence between single-hidden layer neural networks and GPs (Neal (1994a;b); Williams (1997)). The ith component of the network output, z1 i , is computed as, z1 i (x) = b1 i + N1X j=1 W1 ijx1 j (x), x1 j (x) = ✓ b0 j + dinX k=1 W0 jkxk ◆ , (1) where we have emphasized the dependence on input x. Because the weight and bias parameters are taken to be i.i.d., the post-activations x1 j , x1 j0 are independent for j 6= j0 . Moreover, since z1 i (x) is a sum of i.i.d terms, it follows from the Central Limit Theorem that in the limit of infinite width N1 ! 1, z1 i (x) will be Gaussian distributed. Likewise, from the multidimensional Central Limit Theorem, any finite collection of {z1 i (x↵=1 ), ..., z1 i (x↵=k )} will have a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process. Therefore we conclude that z1 i ⇠ GP(µ1 , K1 ), a GP with mean µ1 and covariance K1 , which are themselves independent of i. Because the parameters have zero mean, we have that µ1 (x) = E ⇥ z1 i (x) ⇤ = 0 and, K1 (x, x0 ) ⌘ E ⇥ z1 i (x)z1 i (x0 ) ⇤ = 2 b + 2 w E ⇥ x1 i (x)x1 i (x0 ) ⇤ ⌘ 2 b + 2 wC(x, x0 ), (2) where we have introduced C(x, x0 ) as in Neal (1994a); it is obtained by integrating against the distribution of W0 , b0 . Note that, as any two z1 i , z1 j for i 6= j are joint Gaussian and have zero covariance, they are guaranteed to be independent despite utilizing the same features produced by the hidden layer. 3 We briefly review the correspondence between single-hidden layer neural networks and GPs (Neal (1994a;b); Williams (1997)). The ith component of the network output, z1 i , is computed as, z1 i (x) = b1 i + N1X j=1 W1 ijx1 j (x), x1 j (x) = ✓ b0 j + dinX k=1 W0 jkxk ◆ , (1) where we have emphasized the dependence on input x. Because the weight and bias parameters are taken to be i.i.d., the post-activations x1 j , x1 j0 are independent for j 6= j0 . Moreover, since z1 i (x) is a sum of i.i.d terms, it follows from the Central Limit Theorem that in the limit of infinite width N1 ! 1, z1 i (x) will be Gaussian distributed. Likewise, from the multidimensional Central Limit Theorem, any finite collection of {z1 i (x↵=1 ), ..., z1 i (x↵=k )} will have a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process. Therefore we conclude that z1 i ⇠ GP(µ1 , K1 ), a GP with mean µ1 and covariance K1 , which are themselves independent of i. 
Because the parameters have zero mean, we have that µ1 (x) = E ⇥ z1 i (x) ⇤ = 0 and, K1 (x, x0 ) ⌘ E ⇥ z1 i (x)z1 i (x0 ) ⇤ = 2 b + 2 w E ⇥ x1 i (x)x1 i (x0 ) ⇤ ⌘ 2 b + 2 wC(x, x0 ), (2) where we have introduced C(x, x0 ) as in Neal (1994a); it is obtained by integrating against the distribution of W0 , b0 . Note that, as any two z1 i , z1 j for i 6= j are joint Gaussian and have zero covariance, they are guaranteed to be independent despite utilizing the same features produced by the hidden layer. 3 2.3 GAUSSIAN PROCESSES AND DEEP NEURAL NETWORKS The arguments of the previous section can be extended to deeper layers by induction. We proceed by taking the hidden layer widths to be infinite in succession (N1 ! 1, N2 ! 1, etc.) as we continue with the induction, to guarantee that the input to the layer under consideration is already governed by a GP. In Appendix C we provide an alternative derivation, in terms of marginalization over intermediate layers, which does not depend on the order of limits, in the case of a Gaussian prior on the weights. A concurrent work (Anonymous, 2018) further derives the convergence rate towards a GP if all layers are taken to infinite width simultaneously. Suppose that zl 1 j is a GP, identical and independent for every j (and hence xl j(x) are independent and identically distributed). After l 1 steps, the network computes zl i(x) = bl i + NlX j=1 Wl ijxl j(x), xl j(x) = (zl 1 j (x)). (3) As before, zl i(x) is a sum of i.i.d. random terms so that, as Nl ! 1, any finite collection {z1 i (x↵=1 ), ..., z1 i (x↵=k )} will have joint multivariate Gaussian distribution and zl i ⇠ GP(0, Kl ). The covariance is Kl (x, x0 ) ⌘ E ⇥ zl i(x)zl i(x0 ) ⇤ = 2 b + 2 w Ezl 1 i ⇠GP(0,Kl 1) ⇥ (zl 1 i (x)) (zl 1 i (x0 )) ⇤ . (4) By induction, the expectation in Equation (4) is over the GP governing zl 1 i , but this is equivalent to integrating against the joint distribution of only zl 1 i (x) and zl 1 i (x0 ). The latter is described by a zero mean, two-dimensional Gaussian whose covariance matrix has distinct entries Kl 1 (x, x0 ), continue with the induction, to guarantee that the input to the layer under consideration is already governed by a GP. In Appendix C we provide an alternative derivation, in terms of marginalization over intermediate layers, which does not depend on the order of limits, in the case of a Gaussian prior on the weights. A concurrent work (Anonymous, 2018) further derives the convergence rate towards a GP if all layers are taken to infinite width simultaneously. Suppose that zl 1 j is a GP, identical and independent for every j (and hence xl j(x) are independent and identically distributed). After l 1 steps, the network computes zl i(x) = bl i + NlX j=1 Wl ijxl j(x), xl j(x) = (zl 1 j (x)). (3) As before, zl i(x) is a sum of i.i.d. random terms so that, as Nl ! 1, any finite collection {z1 i (x↵=1 ), ..., z1 i (x↵=k )} will have joint multivariate Gaussian distribution and zl i ⇠ GP(0, Kl ). The covariance is Kl (x, x0 ) ⌘ E ⇥ zl i(x)zl i(x0 ) ⇤ = 2 b + 2 w Ezl 1 i ⇠GP(0,Kl 1) ⇥ (zl 1 i (x)) (zl 1 i (x0 )) ⇤ . (4) By induction, the expectation in Equation (4) is over the GP governing zl 1 i , but this is equivalent to integrating against the joint distribution of only zl 1 i (x) and zl 1 i (x0 ). The latter is described by a zero mean, two-dimensional Gaussian whose covariance matrix has distinct entries Kl 1 (x, x0 ), Kl 1 (x, x), and Kl 1 (x0 , x0 ). As such, these are the only three quantities that appear in the result. 
We introduce the shorthand Kl (x, x0 ) = 2 b + 2 w F ⇣ Kl 1 (x, x0 ), Kl 1 (x, x), Kl 1 (x0 , x0 ) ⌘ (5) to emphasize the recursive relationship between Kl and Kl 1 via a deterministic function F whose form depends only on the nonlinearity . This gives an iterative series of computations which can 2.3 GAUSSIAN PROCESSES AND DEEP NEURAL NETWORKS The arguments of the previous section can be extended to deeper layers by induction. We proceed by taking the hidden layer widths to be infinite in succession (N1 ! 1, N2 ! 1, etc.) as we continue with the induction, to guarantee that the input to the layer under consideration is already governed by a GP. In Appendix C we provide an alternative derivation, in terms of marginalization over intermediate layers, which does not depend on the order of limits, in the case of a Gaussian prior on the weights. A concurrent work (Anonymous, 2018) further derives the convergence rate towards a GP if all layers are taken to infinite width simultaneously. Suppose that zl 1 j is a GP, identical and independent for every j (and hence xl j(x) are independent and identically distributed). After l 1 steps, the network computes zl i(x) = bl i + NlX j=1 Wl ijxl j(x), xl j(x) = (zl 1 j (x)). (3) As before, zl i(x) is a sum of i.i.d. random terms so that, as Nl ! 1, any finite collection {z1 i (x↵=1 ), ..., z1 i (x↵=k )} will have joint multivariate Gaussian distribution and zl i ⇠ GP(0, Kl ). The covariance is Kl (x, x0 ) ⌘ E ⇥ zl i(x)zl i(x0 ) ⇤ = 2 b + 2 w Ezl 1 i ⇠GP(0,Kl 1) ⇥ (zl 1 i (x)) (zl 1 i (x0 )) ⇤ . (4) By induction, the expectation in Equation (4) is over the GP governing zl 1 i , but this is equivalent l 1 l 1 0
  • 16. • 76: Learning to Learn 134 Losses + , 2 0 1
  • 17. • 7 : + 12 ,0 Losses Learning to Learn Model Based ● Santoro et al. ’16 ● Duan et al. ’17 ● Wang et al. ‘17 ● Munkhdalai & Yu ‘17 ● Mishra et al. ‘17 Optimization Based ● Schmidhuber ’87, ’92 ● Bengio et al. ’90, ‘92 ● Hochreiter et al. ’01 ● Li & Malik ‘16 ● Andrychowicz et al. ’16 ● Ravi & Larochelle ‘17 ● Finn et al ‘17 Metric Based ● Koch ’15 ● Vinyals et al. ‘16 ● Snell et al. ‘17 ● Shyam et al. ‘17 Trends: Learning to Learn / Meta Learning 135
  • 18. • m : P – I I C L + C I 0 0 I A +00 7 I -, 8 • I I U ih kN !P ol gnk s • t "P Ts u Mnk wmnp y g ]Td ol Tr ecR e • nk 0 Pg M ol UkN Sd g wmnp y e • y u ]eaV [ emnp U bcRM0 I 2 I L + 2 + g ) Oriol Vinyals, NIPS 17 Model Based Meta Learning 6 L 0-1 ( I LC C
  • 19. • hcr I 6 : 2 6 – 2 9: 6 7 , 6 9 62 : 0 : 2 - 1 • 6 I N ca d c ! "# Sg e p "#(%&, !) • u l I% )# Sg e p )# (%, !) • )# %, ! 6 "# %&, ! o wV ia o m • f bS 6 nSs iaV u l o MkNO P] [L )#, "# r • y Ms M g e V ] [L )#, "# r t : 2 - 6 2 62 : : Oriol Vinyals, NIPS 17 Matching Networks, Vinyals et al, NIPS 2016 Metric Based Meta Learning
• 20. • Optimization Based Meta Learning: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML, Finn et al., ICML 2017)
– MAML learns an initialisation θ from which a few gradient steps on a new task's small training set already give good performance
– Inner loop: adapt θ'_i = θ − α ∇_θ L_{T_i}(f_θ) on each sampled task T_i; outer loop: update θ ← θ − β ∇_θ Σ_{T_i∼p(T)} L_{T_i}(f_{θ'_i})
– The method makes no assumption on the form of the model other than differentiability (a code sketch of the update follows below)

Algorithm 1 Model-Agnostic Meta-Learning
Require: p(T): distribution over tasks
Require: α, β: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:   Sample batch of tasks T_i ∼ p(T)
4:   for all T_i do
5:     Evaluate ∇_θ L_{T_i}(f_θ) with respect to K examples
6:     Compute adapted parameters with gradient descent: θ'_i = θ − α ∇_θ L_{T_i}(f_θ)
7:   end for
8:   Update θ ← θ − β ∇_θ Σ_{T_i∼p(T)} L_{T_i}(f_{θ'_i})
9: end while

Figure 1. Diagram of the model-agnostic meta-learning algorithm (MAML), which optimizes for a representation θ that can quickly adapt to new tasks.
(figures summarising the Model Based / Metric Based / Optimization Based families and examples of optimization-based meta-learning (Finn et al. '17, Ravi et al. '17): Oriol Vinyals, NIPS '17)
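A minimal PyTorch sketch of one MAML meta-update corresponding to Algorithm 1 above, assuming PyTorch 2.x for torch.func.functional_call; the task interface (tuples of train/validation tensors) and the single inner gradient step are illustrative simplifications.

```python
import torch
from torch.func import functional_call

def maml_meta_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
    """One outer-loop update of MAML (Finn et al., 2017).

    tasks: iterable of (x_train, y_train, x_val, y_val) tensors, one tuple per task T_i.
    """
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for x_tr, y_tr, x_val, y_val in tasks:
        # Inner loop: one gradient step on the task's K training examples.
        inner_loss = loss_fn(functional_call(model, params, (x_tr,)), y_tr)
        grads = torch.autograd.grad(inner_loss, params.values(), create_graph=True)
        adapted = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}
        # Outer objective: loss of the adapted parameters on held-out task data.
        meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (x_val,)), y_val)
    # Outer loop: gradient step on the shared initialisation theta.
    meta_grads = torch.autograd.grad(meta_loss, params.values())
    with torch.no_grad():
        for p, g in zip(params.values(), meta_grads):
            p -= beta * g
```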
• 22. • Conditional Neural Processes (CNPs, Garnelo et al., ICML 2018; arXiv:1807.01613)
– Motivation
• GPs exploit prior knowledge over functions to infer the shape of a new function from a handful of observations, but they are computationally expensive and it can be hard to design appropriate priors (kernels) by hand
• Deep networks scale to complex functions and large datasets, but are typically trained from scratch for each new function and need a lot of data
– CNPs combine the benefits of both: they are inspired by the flexibility of stochastic processes such as GPs, but are structured as neural networks and trained via gradient descent; predicting m targets from n observations costs O(n + m)

Abstract (from the paper): Deep neural networks excel at function approximation, yet they are typically trained from scratch for each new function. On the other hand, Bayesian methods, such as Gaussian Processes (GPs), exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. In this paper we propose a family of neural models, Conditional Neural Processes (CNPs), that combine the benefits of both. CNPs are inspired by the flexibility of stochastic processes such as GPs, but are structured as neural networks and trained via gradient descent. CNPs make accurate predictions after observing only a handful of training data points, yet scale to complex functions and large datasets. We demonstrate the performance and versatility of the approach on a range of canonical machine learning tasks, including regression, classification and image completion.

Figure 1. Conditional Neural Process. a) Data description b) Training regime of conventional supervised deep learning models c) Our model.
• 23. • CNP model: three components
– Encoder h_θ (Observe)
• Each context pair (x_i, y_i) in O = {(x_i, y_i)}_{i=1}^{n} is mapped to a representation r_i = h_θ(x_i, y_i)
– Aggregator (Aggregate)
• The r_i are combined by a commutative operation ⊕ (in practice the mean) into a single fixed-dimensional representation r = r_1 ⊕ r_2 ⊕ … ⊕ r_n, which makes the model invariant to the order of the context points
– Decoder g_θ (Predict)
• For each target input x_i in T, the decoder outputs the parameters φ_i = g_θ(x_i, r) of the predictive distribution: mean μ_i and variance σ_i² of a Gaussian for regression, or class probabilities for classification

From the paper: given a set of observations O, a CNP is a conditional stochastic process Q_θ that defines distributions over f(x) for inputs x ∈ T, where θ is the real vector of all parameters defining Q. Q_θ is assumed invariant to permutations of O and T, and permutation invariance with respect to T is enforced through the factored structure Q_θ(f(T) | O, T) = Π_{x∈T} Q_θ(f(x) | O, x). The defining characteristic of a CNP is that it conditions on O via an embedding of fixed dimensionality:

  r_i = h_θ(x_i, y_i)   ∀(x_i, y_i) ∈ O   (1)
  r = r_1 ⊕ r_2 ⊕ … ⊕ r_{n−1} ⊕ r_n        (2)
  φ_i = g_θ(x_i, r)     ∀x_i ∈ T           (3)

where h_θ : X × Y → R^d and g_θ : X × R^d → R^e are neural networks and ⊕ is a commutative operation that takes elements in R^d and maps them into a single element of R^d. This architecture ensures permutation invariance and O(n + m) scaling for conditional prediction; since r_1 ⊕ … ⊕ r_n can be computed in O(1) from r_1 ⊕ … ⊕ r_{n−1}, it also supports streaming observations with minimal overhead.
In summary: (1) a CNP is a conditional distribution over functions, trained to model the empirical conditional distributions of functions f ∼ P; (2) a CNP is permutation invariant in O and T; (3) a CNP is scalable, achieving a running time complexity of O(n + m) for making m predictions with n observations. The exact implementation of h can be adapted to the data type, for example an MLP for low dimensional data or a convolutional encoder for images. A minimal sketch of the forward pass follows below.
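A minimal PyTorch sketch of the CNP forward pass in Eqs. (1)-(3), using mean aggregation for ⊕ and a Gaussian predictive distribution for regression; the layer sizes and the variance parameterisation are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # Encoder h_theta: (x_i, y_i) -> r_i                 (Eq. 1)
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        # Decoder g_theta: (x_target, r) -> (mu, raw sigma)  (Eq. 3)
        self.g = nn.Sequential(nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, 2 * y_dim))

    def forward(self, x_context, y_context, x_target):
        # x_context: (n, x_dim), y_context: (n, y_dim), x_target: (m, x_dim)
        r_i = self.h(torch.cat([x_context, y_context], dim=-1))     # (n, r_dim)
        r = r_i.mean(dim=0, keepdim=True)                           # Eq. 2: commutative aggregation
        r = r.expand(x_target.size(0), -1)
        out = self.g(torch.cat([x_target, r], dim=-1))              # (m, 2*y_dim)
        mu, raw_sigma = out.chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * nn.functional.softplus(raw_sigma)       # bounded variance, a common trick
        return torch.distributions.Normal(mu, sigma)
```

For classification, the decoder head would instead output class logits for φ_i, as the paper does for Omniglot.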
• 24. • CNP training
– Training data: many functions f drawn from a distribution P (e.g. curves sampled from a GP, or images treated as functions)
– For each f, sample n points {(x_i, y_i)}_{i=0}^{n−1} and let the context be the first N of them, O_N = {(x_i, y_i)}_{i=0}^{N} ⊂ O, with N chosen at random
– Train by minimising the negative conditional log probability

  L(θ) = −E_{f∼P}[ E_N[ log Q_θ({y_i}_{i=0}^{n−1} | O_N, {x_i}_{i=0}^{n−1}) ] ]   (4)

  so the targets scored by Q_θ include both the observed and the unobserved values; in practice, Monte Carlo estimates of the gradient are obtained by sampling f and N
– This shifts the burden of imposing prior knowledge from an analytic prior to empirical data: the practitioner no longer needs to specify an analytic form for the prior. Note that the Q_θ are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that. A sketch of one training step follows below.
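A sketch of one Monte Carlo training step for the loss in Eq. (4), assuming the CNP module sketched above and a hypothetical sample_function() helper that returns (x, y) tensors drawn from one function f ∼ P.

```python
import torch

def cnp_train_step(model, optimizer, sample_function, n_points=50):
    """One gradient step on L(theta) = -E_f E_N [ log Q_theta(targets | context) ]."""
    x, y = sample_function(n_points)                 # (n, x_dim), (n, y_dim): one draw f ~ P
    N = torch.randint(1, n_points, (1,)).item()      # random context size N
    ctx = torch.randperm(n_points)[:N]               # context O_N: N randomly chosen points
    dist = model(x[ctx], y[ctx], x)                  # predict at ALL n points (observed + unobserved)
    loss = -dist.log_prob(y).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```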
• 25. • Experiment: 1-D function regression
– Setup
• Functions are generated from a GP with an exponential kernel; two datasets are used
» In the first the kernel parameters are fixed; in the second the function switches at a random point on the real line between two functions, each sampled with different kernel parameters
• At every training step a curve is sampled from the GP, a subset of n points (x_i, y_i) is selected as observations, and a subset of points (x_t, y_t) as targets
– Results
• With 5 (left) and 50 (right) context points, the predicted mean and variance of the CNP (blue) are shown alongside the GP (red) used to generate the data; accuracy increases as the number of context points grows
• The CNP also handles the switching-kernel dataset, which is not trivial for a GP without changing the kernel
(The screenshot also shows Section 3.2 of the paper, which relates CNPs to meta-learning: GQN as a closely related model, few-shot density estimation with attention-augmented PixelCNN or memory-augmented VAEs, and matching networks as a kernel-based relative whose all-to-all comparisons scale as O(n × m).)
Figure 2. 1-D Regression. Regression results on a 1-D curve (black line) using 5 (left column) and 50 (right column) context points (black dots).
The first two rows show the predicted mean and variance for the regression of a single underlying kernel for GPs (red) and CNPs (blue). The bottom row shows the predictions of CNPs for a curve with switching kernel parameters.
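For reference, a small NumPy sketch of how training curves such as those in Figure 2 can be drawn from a GP prior; the squared-exponential kernel and its hyperparameters are stand-ins for the paper's exponential-kernel setup.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.4, sigma_f=1.0):
    """Squared-exponential kernel k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq / length_scale ** 2)

def sample_gp_curve(n_points=100, jitter=1e-6):
    """Draw one function f ~ GP(0, k) at random input locations and return (x, y) pairs."""
    x = np.random.uniform(-2.0, 2.0, size=n_points)
    K = rbf_kernel(x, x) + jitter * np.eye(n_points)
    y = np.random.multivariate_normal(np.zeros(n_points), K)
    return x[:, None], y[:, None]
```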
• 26. • Experiment: image completion (MNIST)
– Setup
• Image completion is cast as regression of functions f : [0,1]² → [0,1]: each input x_i is a 2-D pixel coordinate normalised to [0,1]², each output y_i is the greyscale pixel intensity in [0,1]
• The model is given 1, 40, 200 and 728 context pixels (top row of Figure 3) and queried on the entire image
– Results
• Prediction accuracy increases as the number of context points increases
• Choosing the next context pixel at the location of highest predictive variance (red curve) improves accuracy faster than choosing it at random (blue curve)
Figure 3. Pixel-wise image regression on MNIST. Left: two examples of image regression with varying numbers of observations; the resulting mean (middle row) and variance (bottom row) at each pixel position are shown for each of the context images. Right: model accuracy with an increasing number of observations that are either chosen at random (blue) or by selecting the pixel with the highest variance (red).
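A small sketch of how a greyscale image is converted into the (coordinate, intensity) pairs that the CNP consumes for this task; the function name and tensor shapes are illustrative and follow the CNP sketch above.

```python
import torch

def image_to_regression_pairs(image, n_context):
    """Turn a greyscale image (H, W) with values in [0, 1] into CNP inputs.

    Returns context (x_c, y_c) and full-image targets (x_t, y_t), where each x is a
    2-D pixel coordinate normalised to [0, 1]^2 and each y is the pixel intensity.
    """
    H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten() / (H - 1), xs.flatten() / (W - 1)], dim=-1)  # (H*W, 2)
    values = image.flatten().unsqueeze(-1)                                          # (H*W, 1)
    idx = torch.randperm(H * W)[:n_context]
    return coords[idx], values[idx], coords, values
```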
• 27. • Experiment: image completion (CelebA) and comparison with kNN / GP
– Setup
• The same pixel-wise regression formulation as MNIST, now with RGB outputs f : [0,1]² → [0,1]³; the model is given 1, 10, 100 and 1000 context pixels and queried on the entire image
– Flexibility
• CNPs can be conditioned on arbitrary patterns of observed pixels, including patterns never seen during training (for example one half of the image), and can be queried at pixel coordinates never included in the training set (for example sub-pixel values at different resolutions)
(The screenshot also shows Section 4.2.1 on MNIST: conditioned on a single non-informative context point, the prediction corresponds to the average over all MNIST digits; as context is added, the predictive variance shifts from being spread over the whole digit area to being localised around the underlying digit's edges, which can be useful for tasks such as active exploration. Even with all pixels observed the reconstruction is not pixel-perfect, since the fixed-size representation is a bottleneck and the outputs are factored.)
Figure 4. Pixel-wise image completion on CelebA. Two examples of CelebA image regression with varying numbers of observations: the model is given 1, 10, 100 and 1000 context points (top row) and queried on the entire image; the resulting mean (middle row) and variance (bottom row) at each pixel position are shown for each of the context images.
– Results (from the paper)
• CNPs outperform the kNN and GP baselines when the number of context points is small (empirically, when half of the image or less is provided as context); when the majority of the image is given as context, exact methods such as GPs and kNN perform better
• The order in which context points are provided matters little for CNPs (providing them from top to bottom still gives good performance), whereas GPs and kNNs degrade with ordered context; both observations indicate that CNPs learn a data-specific 'prior' that produces good completions even from very few context points

  Table 1. Pixel-wise mean squared error for all pixels in the CelebA image completion task, with 10 / 100 / 1000 context points chosen either at random or ordered from the top-left corner to the bottom-right:
           Random context             Ordered context
  #        10       100      1000     10       100      1000
  kNN      0.215    0.052    0.007    0.370    0.273    0.007
  GP       0.247    0.137    0.001    0.257    0.220    0.002
  CNP      0.039    0.016    0.009    0.057    0.047    0.021

(Section 4.2.3, also visible here, notes that the factored model cannot draw coherent joint samples; rather than training the model to predict a GP kernel, which would require back-propagating through a large matrix inversion, the authors simply add latent variables z to the decoder g. Figure 6 shows that these latent variables capture global uncertainty: different coherent completions all conform to the observations, and as the number of observations increases the samples converge to a single estimate.)
• 28. • Flexible conditioning and the latent variable extension
– Flexible image completion
• In contrast to standard conditional generative models, CNPs can be directly conditioned on observed pixels in arbitrary patterns, even ones never seen in the training set, and can predict values for pixel coordinates that were never included in the training set, such as sub-pixel values at different resolutions
– Latent variable model
• The factored CNP outputs only per-pixel mean and variance, so the best it can do given limited context is to average over all predictions that agree with the context; adding latent variables z that are sampled conditioned on the context yields globally coherent, high-probability completions
Figure 5. Flexible image completion. In contrast to standard conditional models, CNPs can be directly conditioned on observed pixels in arbitrary patterns, even ones which were never seen in the training set. Similarly, the model can predict values for pixel coordinates that were never included in the training set, like sub-pixel values in different resolutions. The dotted white lines were added for clarity after generation.
• 29. • Experiment: one-shot classification (Omniglot, Lake et al., 2015)
– Setup
• 1,623 character classes from 50 alphabets with only 20 examples per class, which makes the dataset particularly suitable for few-shot learning; as in Vinyals et al. (2016), 1,200 randomly selected classes are used for training and the rest for testing, with the augmentation protocol of Santoro et al. (2016): cropping from 32×32 to 28×28, small random translations and rotations, and new classes created by rotating every character by 90 degrees
• At test time the model is presented with one labelled example per class and outputs classification probabilities for a new unlabelled example; the model's uncertainty grows when the new example comes from an unobserved class
– Results
• CNP approaches the accuracy of MANN (Santoro et al., 2016) and Matching Networks (Vinyals et al., 2016) with a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and a lower test-time cost of O(n + m) instead of O(nm)

  Table 2. Classification results on Omniglot:
            5-way Acc             20-way Acc            Runtime
            1-shot    5-shot      1-shot    5-shot
  MANN      82.8%     94.9%       -         -            O(nm)
  MN        98.1%     98.9%       93.8%     98.5%        O(nm)
  CNP       95.3%     98.5%       89.9%     96.8%        O(n + m)

(The screenshot also shows the latent-variable details from Section 4.2.3: a conditional Gaussian prior p(z|O) conditioned on the observations and a Gaussian posterior p(z|O, T) conditioned additionally on the targets, each implemented as a three-layer MLP with averaged outputs; the representation r is concatenated with a latent z of size 64 for MNIST, and r and z have sizes 1024 and 128 for CelebA. With few observed pixels the samples are coherent and varied; as more pixels are observed the variability drops and the samples converge to a single possibility.)
Figure 7. One-shot Omniglot classification. At test time the model is presented with a labelled example for each class, and outputs the classification probabilities of a new unlabelled example. The model uncertainty grows when the new example comes from an un-observed class.
• 30. • Neural Processes (NPs, Garnelo et al., ICML 2018 Workshop)
– Motivation
• A deterministic CNP cannot produce coherent function samples: it outputs a factored prediction of per-target mean and variance and disregards the covariance between target points, so it cannot draw different plausible functions for the same context, which matters when modelling global uncertainty
– Model (three core components)
• An encoder h from input space into representation space that takes pairs of (x, y)_i context values and produces a representation r_i = h((x, y)_i) for each pair; h is parameterised as a neural network
• An aggregator a that summarises the encoded inputs into a single order-invariant global representation r, which parameterises the latent distribution z ∼ N(μ(r), Iσ(r)); the simplest operation that ensures order invariance and works well in practice is the mean, r = a(r_i) = (1/n) Σ_i r_i, and the aggregation keeps the runtime at O(n + m) for n context and m target points
• A conditional decoder g that takes the sampled global latent variable z together with the new target locations x_T and outputs the predictions ŷ_T for the corresponding values f(x_T) = y_T
• The posterior given a subset of the context points serves as the prior when additional context points are included; training with different context sizes encourages the learned model to be flexible with regards to the number and position of context points
– Relation to other models
• NPs generalise the deterministic CNP, with clearer parallels to latent variable models and approximate Bayesian methods; NPs and CNPs can also be seen as generalisations of Generative Query Networks (GQN), which use a similar training regime to predict new viewpoints in 3D scenes given context observations
• On the spectrum between neural networks, which fit a single function from a large amount of data, and Gaussian processes, which represent a distribution over a family of functions constrained by an assumed covariance, NPs sit in between and learn their 'prior' from data
Figure 1. Neural process model. (a) Graphical model of a neural process: x and y correspond to the data where y = f(x), C and T are the numbers of context and target points respectively, and z is the global latent variable; a grey background indicates that the variable is observed. (b) Computational diagram of the neural process implementation, showing the encoder h, the aggregator a, the latent z and the decoder g along the generation and inference paths.
Figure 2. Graphical models of related models (a-c), among them the neural statistician and the conditional neural process, and of the neural process (d). Gray shading indicates the variable is observed; C stands for context variables and T for target variables, i.e. the variables to predict given C.
A minimal sketch of the NP components follows below.
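A minimal PyTorch sketch of the NP components described above (encoder h, mean aggregator a, global latent z, decoder g), with the reparameterisation trick for sampling z; the layer sizes are illustrative. At test time the context-only latent distribution is used for generation; training (not shown here) would maximise an ELBO in which a posterior conditioned on context and targets is regularised towards the context-only distribution.

```python
import torch
import torch.nn as nn

class NeuralProcess(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, z_dim=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))               # encoder h
        self.r_to_z = nn.Linear(r_dim, 2 * z_dim)                     # r -> (mu, raw sigma) of z
        self.g = nn.Sequential(nn.Linear(x_dim + z_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, 2 * y_dim))           # decoder g

    def latent(self, x, y):
        """q(z | x, y): encode pairs, aggregate with a mean, map to a Gaussian over z."""
        r = self.h(torch.cat([x, y], dim=-1)).mean(dim=0)             # aggregator a
        mu, raw_sigma = self.r_to_z(r).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, nn.functional.softplus(raw_sigma) + 1e-3)

    def forward(self, x_context, y_context, x_target):
        q_context = self.latent(x_context, y_context)                 # context-only distribution over z
        z = q_context.rsample()                                       # reparameterised sample
        z_rep = z.unsqueeze(0).expand(x_target.size(0), -1)
        mu, raw_sigma = self.g(torch.cat([x_target, z_rep], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, 0.1 + 0.9 * nn.functional.softplus(raw_sigma))
```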
• 31. • NP experiments: pixel-wise image regression (MNIST, CelebA)
– With few context points, different draws of the global latent variable give visibly different yet coherent completions
– As the number of context points increases, the samples become more similar and converge towards the underlying image
Figure 4. Pixel-wise regression on MNIST and CelebA. The diagram on the left visualises how pixel-wise image completion can be framed as a 2-D regression task where f(pixel coordinates) = pixel brightness. The figures to the right of the diagram show the results on image completion for MNIST and CelebA: the images on the top correspond to the context points provided to the model (10/100/300/784 for MNIST, 15/30/90/1024 for CelebA; for better clarity the unobserved pixels are coloured blue for MNIST and white for CelebA), and each of the rows below (Sample 1-3) corresponds to a different sample.
• 32. • NP experiments: black-box optimisation with Thompson sampling
– Because NPs can sample entire consistent trajectories, they can be used for Bayesian optimisation of a 1-D function with Thompson sampling (Thompson, 1933): maintain a posterior, draw a sample, act greedily under it (evaluate the objective at the sampled curve's minimum), add the result to the context, and repeat
– The paper reports the average number of optimisation steps needed to reach the global minimum of a 1-D function generated by a Gaussian process, normalised by the number of steps taken by random search; the GP with the correct kernel constitutes an upper bound on performance

  Table 1. Bayesian optimisation using Thompson sampling (normalised number of steps, lower is better):
  Neural process 0.26    Gaussian process 0.14    Random Search 1.00

Figure 5. Thompson sampling with neural processes on a 1-D objective function. The plots show the optimisation process over five iterations. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black circles). The underlying ground truth function is depicted as a black dotted line. The red triangle indicates the next evaluation point, which corresponds to the minimum value of the sampled NP curve; the red circle in the following iteration corresponds to this evaluation point with its underlying ground truth value, which serves as a new context point to the NP.
(The paper additionally evaluates NPs on contextual bandits, Section 4.4.)
A sketch of the Thompson-sampling loop follows below.
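A sketch of the Thompson-sampling loop described above, assuming the NeuralProcess module from the earlier sketch and a hypothetical objective() returning a (1, 1) tensor; the candidate grid and the greedy choice of the sampled function's minimum mirror Figure 5 rather than the paper's exact code.

```python
import torch

def thompson_sampling(model, objective, n_iters=5, n_candidates=200):
    """Minimise a 1-D black-box objective by repeatedly acting greedily on an NP sample."""
    x_ctx = torch.empty(0, 1)
    y_ctx = torch.empty(0, 1)
    grid = torch.linspace(-2.0, 2.0, n_candidates).unsqueeze(-1)    # candidate evaluation points
    for _ in range(n_iters):
        with torch.no_grad():
            pred = model(x_ctx, y_ctx, grid) if len(x_ctx) > 0 else None
            # One forward pass samples z internally, so the predictive mean given that z
            # is one coherent function draw; with no context yet, fall back to random values.
            sample = pred.mean.squeeze(-1) if pred is not None else torch.randn(n_candidates)
        x_next = grid[sample.argmin()].unsqueeze(0)                  # greedy step: sampled minimum
        y_next = objective(x_next)                                   # evaluate the true objective
        x_ctx = torch.cat([x_ctx, x_next], dim=0)                    # grow the context set
        y_ctx = torch.cat([y_ctx, y_next], dim=0)
    return x_ctx, y_ctx
```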
• 33. • Summary
– CNPs and NPs combine the strengths of neural networks (scalability, representations learned from data) and Gaussian processes (conditioning on a context set at test time, uncertainty estimates) without requiring a hand-designed prior
– Both condition on the context through a fixed-dimensional aggregated representation, giving O(n + m) prediction cost; NPs additionally introduce a global latent variable that enables coherent function samples
– The experiments cover 1-D regression, image completion (MNIST, CelebA), one-shot classification (Omniglot) and black-box optimisation with Thompson sampling
– Compared with exact GPs the predictions are approximate and the learned conditionals are not guaranteed to be consistent, but the approach scales to large datasets and lets the prior be learned from data
• 34. • References
• CNP / NP
– Garnelo, Marta, et al. "Conditional Neural Processes." In ICML, 2018.
– Garnelo, Marta, et al. "Neural Processes." In ICML Workshop Track, 2018.
• Meta Learning
– Ravi, Sachin and Hugo Larochelle. "Optimization as a model for few-shot learning." In ICLR, 2017.
– Vinyals, Oriol, et al. "Matching networks for one shot learning." In NIPS, 2016.
– Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." In ICML, 2017.
– Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." In ICML, 2016.
– Reed, Scott, Nando de Freitas, and Oriol Vinyals. "Deep Learning: Practice and Trends." NIPS Tutorial, 2017.
– Vinyals, Oriol. "Model vs Optimization Meta Learning." NIPS Meta-Learning Symposium, 2017.
• Gaussian Processes
– Wilson, Andrew Gordon, et al. "Deep kernel learning." In Artificial Intelligence and Statistics (AISTATS), 2016.
– Kendall, Alex and Yarin Gal. "What uncertainties do we need in Bayesian deep learning for computer vision?" In NIPS, 2017.
– Lee, Jaehoon, et al. "Deep neural networks as Gaussian processes." In ICLR, 2018.