Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
February 22, 2018
Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Meta-Learning
Meta-Learning
Which is Aconitum napellus?
Meta-Learning
Which is Aconitum napellus?
Same information, but this version of the task is impossible
for humans. We clearly have something that helps us process
new visual information.
Meta-Learning
Which is Aconitum napellus?
Some humans have (meta-)learned to answer this question.
Meta-learning can occur using acquired knowledge.
Meta-Learning
Previous Deep Meta-Learning Methods
Metric Learning¹,²,³,⁴
Learn a metric in image space
Specific to few-shot classification (Omniglot, miniImageNet, etc.)
Learning = nearest neighbor, Meta-learning = metric
¹ Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. In: ICML (2015).
² Oriol Vinyals et al. “Matching Networks for One Shot Learning”. In: NIPS (2016).
³ Jake Snell, Kevin Swersky, and Richard S. Zemel. “Prototypical Networks for Few-shot Learning”. In: NIPS (2017).
⁴ Flood Sung et al. “Learning to Compare: Relation Network for Few-Shot Learning”. In: arXiv (2017).
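A toy sketch of this recipe (my illustration, not any specific paper's code): learning is nearest-neighbor lookup in an embedding space, and meta-learning shapes the embedding f.

import jax.numpy as jnp

def nearest_neighbor_predict(f, support_x, support_y, query_x):
    # Embed support and query points with the (meta-learned) map f, then
    # label each query with the class of its nearest support embedding.
    s = f(support_x)   # (N, D) support embeddings
    q = f(query_x)     # (M, D) query embeddings
    d = jnp.sum((q[:, None, :] - s[None, :, :]) ** 2, axis=-1)  # (M, N) squared distances
    return support_y[jnp.argmin(d, axis=1)]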
Previous Deep Meta-Learning Methods
RNNs as learners⁶,⁷
Should be able to approximate any learning algorithm.
Temporal convolutions⁵ have also been used in a similar way.
Learning = RNN rollforward, Meta-learning = RNN weights
⁵ Nikhil Mishra et al. “A Simple Neural Attentive Meta-Learner”. In: ICLR (2018).
⁶ Adam Santoro et al. “One-shot Learning with Memory-Augmented Neural Networks”. In: ICML (2016).
⁷ Yan Duan et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”. In: arXiv (2016).
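A minimal sketch of this recipe (names and dimensions are my assumptions): the support set is fed through the RNN, whose hidden state accumulates task knowledge, and the meta-learned quantities are the RNN weights.

import jax.numpy as jnp

def rnn_learner(params, support_xy, query_x):
    # params = (Wh, Wx, Wo): recurrent, input, and output weights.
    Wh, Wx, Wo = params
    h = jnp.zeros(Wh.shape[0])
    for xy in support_xy:                 # "learning" = rolling the RNN forward
        h = jnp.tanh(Wh @ h + Wx @ xy)    # xy is a concatenated (x, y) pair
    return Wo @ jnp.concatenate([h, query_x])  # predict on the query input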
Previous Deep Meta-Learning Methods
Optimizer Learning⁸,⁹
Learn the parameter update given gradients (search space includes SGD, RMSProp, Adam, etc.)
Applicable to any architecture/task
Learning = generalized SGD with the learned optimizer, Meta-learning = optimizer parameters
⁸ Marcin Andrychowicz et al. “Learning to learn by gradient descent by gradient descent”. In: NIPS (2016).
⁹ Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-shot Learning”. In: ICLR (2017).
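As a concrete instance of such an update rule (the functional form here is invented for illustration, not taken from these papers): a parametric function of the gradient that reduces to plain SGD for one setting of its parameters.

import jax.numpy as jnp

def learned_update(params, grads, opt_params):
    # opt_params = (alpha, beta): step scale and gradient-nonlinearity gain.
    # For small beta, tanh(beta * g) / beta ≈ g, recovering plain SGD; the
    # meta-learner is free to choose other behaviors.
    alpha, beta = opt_params
    return params - alpha * jnp.tanh(beta * grads) / beta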
Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Gradient-Based Meta-Learning
MAML¹⁰
¹⁰ Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: ICML (2017).
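(The slide's figure is not reproduced here.) MAML meta-learns an initialization such that a few SGD steps on a new task's support set already perform well on its query set. A minimal sketch in JAX, assuming a toy linear model; all names are mine:

import jax
import jax.numpy as jnp

def predict(params, x):
    w, b = params          # toy model: y = w*x + b
    return w * x + b

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

def inner_update(params, x_s, y_s, alpha=0.01):
    # One task-specific SGD step: MAML's "learning".
    grads = jax.grad(loss)(params, x_s, y_s)
    return jax.tree_util.tree_map(lambda p, g: p - alpha * g, params, grads)

def maml_objective(params, x_s, y_s, x_q, y_q):
    # Query loss after adapting on the support set.
    return loss(inner_update(params, x_s, y_s), x_q, y_q)

# Meta-gradient w.r.t. the initialization: MAML's "meta-learning".
meta_grad_fn = jax.grad(maml_objective)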
Gradient-Based Meta-Learning
Can approximate any learning algorithm¹¹
Can be interpreted as hierarchical Bayes¹²
Unlike other methods, learning and meta-learning happen in the same parameter space.
Learning = SGD, Meta-learning = initial parameters
¹¹ Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm”. In: ICLR (2018).
¹² Erin Grant et al. “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes”. In: ICLR (2018).
Gradient-Based Meta-Learning
Implicit assumption: meta-learning and learning require the
same number of parameters.
Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Yoonho Lee, Seungjin Choi
arXiv:1801.05558, submitted to ICML 2018
MT-nets
Idea: task-specific learning should require fewer degrees of freedom than meta-learning.
MT-nets
MT-nets
From a task-specific learner’s point of view, T alters the
activation space.
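As I read it from the propositions below, each layer's effective weight factors as A = TW: only W is updated during task-specific learning (optionally only in the coordinates selected by a binary mask M), while T and M are meta-learned and held fixed during adaptation. A hedged sketch with my own names:

import jax
import jax.numpy as jnp

def mt_layer(T, W, b, x):
    # Effective weight A = T @ W; T reshapes the activation space that the
    # task-specific learner sees.
    return T @ (W @ x) + b

def task_loss(W, T, b, x, y):
    return jnp.mean((mt_layer(T, W, b, x) - y) ** 2)

def inner_step(W, T, b, M, x, y, alpha=0.01):
    # Gradient step on W only, masked by M: masked-out entries keep their
    # meta-learned values, so learning happens in a subspace.
    g = jax.grad(task_loss)(W, T, b, x, y)
    return W - alpha * M * g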
MT-nets
Proposition
Fix x and A. Let U be a d-dimensional subspace of ℝⁿ (d ≤ n). There exist configurations of T, W, and ζ such that the span of y_new − y is U while satisfying A = TW.

Proposition
Fix x, A, and a loss function L_T. Let U be a d-dimensional subspace of ℝⁿ, and g(·, ·) a metric tensor on U. There exist configurations of T, W, and ζ such that the vector y_new − y is in the steepest direction of descent on L_T with respect to the metric g.
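For reference, steepest descent under a metric with positive-definite matrix G premultiplies the ordinary gradient by G⁻¹; the short derivation below (mine, not the slide's) shows how T realizes such a metric.

% Steepest descent under a metric tensor g with matrix G (standard fact):
\Delta y \;\propto\; -\,G^{-1}\nabla_{y}\,\mathcal{L}_{T}(y)
% With A = TW we have \nabla_W \mathcal{L}_T = T^{\top}\nabla_A \mathcal{L}_T,
% so one SGD step on W moves the effective weight A by
\Delta A = T\,\Delta W = -\,\alpha\, T T^{\top}\,\nabla_{A}\mathcal{L}_{T},
% i.e. steepest descent on A under the metric G = (T T^{\top})^{-1}.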
Experiments
Ablation. All components are necessary.
Experiments
Robust to step size α, since T can change effective step size.
Experiments
3 meta-tasks: regression to polynomials of order n (n ∈ {0, 1, 2}).
MT-nets choose to update more parameters for more
complicated meta-tasks.
Experiments
miniImageNet one-shot classification
Experiments
miniImageNet one-shot classification
5-way 1-shot classification accuracy.
Summary
MT-nets are robust to step size because of T, and the mask
M reflects the complexity of the meta-task.
MT-nets achieve state-of-the-art performance on a
challenging few-shot learning task.
Future Work
Our work shows that gradient-based meta-learning can benefit from additional structure. Other architectures for meta-learners?
Our method performs gradient descent under a learned metric that makes learning faster; this might relate to natural gradients¹³.
Our metric is learned layerwise, which is similar to how a recent work¹⁴ factors parameter space to tractably approximate natural gradients.
¹³ Shun-Ichi Amari. “Natural gradient works efficiently in learning”. In: Neural computation 10.2 (1998), pp. 251–276.
¹⁴ James Martens and Roger Grosse. “Optimizing neural networks with Kronecker-factored approximate curvature”. In: ICML (2015).
References I
[1] Shun-Ichi Amari. “Natural gradient works efficiently in learning”. In: Neural computation 10.2 (1998), pp. 251–276.
[2] Marcin Andrychowicz et al. “Learning to learn by gradient descent by gradient descent”. In: NIPS (2016).
[3] Yan Duan et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”. In: arXiv (2016).
[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: ICML (2017).
[5] Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm”. In: ICLR (2018).
[6] Erin Grant et al. “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes”. In: ICLR (2018).
References II
[7] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. In: ICML (2015).
[8] James Martens and Roger Grosse. “Optimizing neural networks with Kronecker-factored approximate curvature”. In: ICML (2015).
[9] Nikhil Mishra et al. “A Simple Neural Attentive Meta-Learner”. In: ICLR (2018).
[10] Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-shot Learning”. In: ICLR (2017).
[11] Adam Santoro et al. “One-shot Learning with Memory-Augmented Neural Networks”. In: ICML (2016).
[12] Jake Snell, Kevin Swersky, and Richard S. Zemel. “Prototypical Networks for Few-shot Learning”. In: NIPS (2017).
References III
[13] Flood Sung et al. “Learning to Compare: Relation Network for Few-Shot Learning”. In: arXiv (2017).
[14] Oriol Vinyals et al. “Matching Networks for One Shot Learning”. In: NIPS (2016).
Thank You
Pseudocode
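The pseudocode on this backup slide was an image and is not recoverable; below is a hedged reconstruction of the meta-training loop as I understand it (a MAML-style outer loop over the MT-net inner step sketched earlier; M is treated as a fixed mask here, whereas the paper meta-learns it).

import jax

def meta_objective(meta_params, task):
    # meta_params: task-specific init W plus meta-learned T, b, and mask M.
    W, T, b, M = meta_params
    x_s, y_s, x_q, y_q = task
    W_adapted = inner_step(W, T, b, M, x_s, y_s)   # adapt on the support set
    return task_loss(W_adapted, T, b, x_q, y_q)    # evaluate on the query set

def meta_step(meta_params, tasks, beta=0.001):
    # Average the meta-gradient over a batch of tasks, then take one SGD
    # step on all meta-parameters (inner_step/task_loss are defined above).
    grads = [jax.grad(meta_objective)(meta_params, t) for t in tasks]
    mean_grads = jax.tree_util.tree_map(lambda *g: sum(g) / len(g), *grads)
    return jax.tree_util.tree_map(lambda p, g: p - beta * g,
                                  meta_params, mean_grads)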
