1 / 26
Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
February 22, 2018
5 / 26
Meta-Learning
Which is Aconitum napellus?
Same information, but this version of the task is impossible
for humans. We clearly have something that helps us process
new visual information.
6 / 26
Meta-Learning
Which is Aconitum napellus?
Some humans have (meta-)learned to answer this question.
Meta-learning can occur using acquired knowledge.
8 / 26
Previous Deep Meta-Learning Methods
Metric Learning¹,²,³,⁴
Learn a metric in image space
Specific to few-shot classification (Omniglot, MiniImageNet, etc.)
Learning = nearest neighbor, Meta-Learning = metric (see the sketch below)
¹ Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. In: ICML (2015).
² Oriol Vinyals et al. “Matching Networks for One Shot Learning”. In: NIPS (2016).
³ Jake Snell, Kevin Swersky, and Richard S. Zemel. “Prototypical Networks for Few-shot Learning”. In: NIPS (2017).
⁴ Flood Sung et al. “Learning to Compare: Relation Network for Few-Shot Learning”. In: arXiv (2017).
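A minimal sketch of this recipe in the style of prototypical networks³ (the embedding function, shapes, and names are illustrative assumptions, not the exact model from any of the papers above):

```python
import numpy as np

def embed(x, W):
    # Stand-in for the meta-learned embedding; W plays the role of the metric.
    return np.tanh(x @ W)

def classify(query, support_x, support_y, W, n_classes):
    # Learning = nearest neighbor: adapting to a new task needs no gradient steps.
    z_support = embed(support_x, W)   # (n_support, k)
    z_query = embed(query, W)         # (k,)
    # One prototype per class: mean of that class's embedded support points.
    protos = np.stack([z_support[support_y == c].mean(axis=0)
                       for c in range(n_classes)])
    dists = ((protos - z_query) ** 2).sum(axis=1)
    return int(np.argmin(dists))      # nearest prototype wins

# Meta-Learning = training W across many episodes so that nearest-neighbor
# classification in the embedded space generalizes to unseen classes.
```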
9 / 26
Previous Deep Meta-Learning Methods
RNNs as learners⁶,⁷
Should be able to approximate any learning algorithm.
Temporal convolutions⁵ have also been used in a similar way.
Learning = RNN rollforward, Meta-Learning = RNN weights (see the sketch below)
⁵ Nikhil Mishra et al. “A Simple Neural Attentive Meta-Learner”. In: ICLR (2018).
⁶ Adam Santoro et al. “One-shot Learning with Memory-Augmented Neural Networks”. In: ICML (2016).
⁷ Yan Duan et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”. In: arXiv (2016).
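Schematically, the learner here is the rollforward itself; `step` and `predict` below stand in for the meta-learned recurrent cell and readout (the names and the label-feeding scheme are illustrative assumptions):

```python
def rnn_learner(xs, ys, h0, step, predict):
    # Meta-Learning = the RNN weights hidden inside `step` and `predict`.
    # Learning = rolling the hidden state h forward over the episode:
    # the state, not the weights, is what adapts to the new task.
    h, preds = h0, []
    for x, y in zip(xs, ys):
        preds.append(predict(h, x))  # predict before the label arrives
        h = step(h, x, y)            # absorb the labeled example into h
    return preds
```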
10 / 26
Previous Deep Meta-Learning Methods
Optimizer Learning⁸,⁹
Learn the parameter update given gradients (the search space includes SGD, RMSProp, Adam, etc.)
Applicable to any architecture/task
Learning = generalized SGD with the learned optimizer, Meta-Learning = optimizer parameters (see the sketch below)
⁸ Marcin Andrychowicz et al. “Learning to learn by gradient descent by gradient descent”. In: NIPS (2016).
⁹ Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-shot Learning”. In: ICLR (2017).
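A minimal sketch of a learned update rule; the particular parameterization (learning rate plus momentum) is an illustrative assumption, chosen so that plain SGD is recoverable as a special case:

```python
import numpy as np

def learned_step(params, grads, state, meta):
    # meta holds the meta-learned optimizer parameters.
    # With meta = {"lr": 0.01, "momentum": 0.0} this is exactly SGD, so SGD
    # lies inside the search space; richer rules (RMSProp- or Adam-like
    # statistics, or an RNN over gradients) extend the same interface.
    state = meta["momentum"] * state + grads
    return params - meta["lr"] * state, state

# Learning = applying the rule to a task; Meta-Learning = tuning meta.
params, state = np.zeros(3), np.zeros(3)
params, state = learned_step(params, np.ones(3), state,
                             {"lr": 0.1, "momentum": 0.9})
```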
12 / 26
Gradient-Based Meta-Learning
MAML¹⁰ (a schematic sketch follows below)
¹⁰ Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: ICML (2017).
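A minimal first-order sketch of the MAML update (it drops the second-derivative term of the full algorithm; `grad_loss` and the task-dict layout are illustrative assumptions):

```python
import numpy as np

def maml_outer_step(theta, tasks, grad_loss, inner_lr=0.01, outer_lr=0.001):
    # Learning = SGD from the shared initialization theta (inner loop).
    # Meta-Learning = moving that initialization (outer loop).
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        # Inner loop: one task-specific gradient step (more are possible).
        theta_task = theta - inner_lr * grad_loss(theta, task["train"])
        # Outer loop: evaluate the adapted parameters on held-out task data.
        meta_grad += grad_loss(theta_task, task["test"])
    return theta - outer_lr * meta_grad / len(tasks)
```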
13 / 26
Gradient-Based Meta-Learning
Can approximate any learning algorithm¹¹
Can be interpreted as hierarchical Bayes¹²
Unlike other methods, learning and meta-learning happen in the same parameter space.
Learning = SGD, Meta-Learning = initial parameters
¹¹ Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm”. In: ICLR (2018).
¹² Erin Grant et al. “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes”. In: ICLR (2018).
14 / 26
Gradient-Based Meta-Learning
Implicit assumption: meta-learning and learning require the
same number of parameters.
15 / 26
Gradient-Based Meta-Learning with Learned
Layerwise Metric and Subspace
Yoonho Lee, Seungjin Choi
arXiv:1801.05558, submitted to ICML 2018
16 / 26
MT-nets
Idea: task-specific learning should require fewer degrees of freedom than meta-learning.
18 / 26
MT-nets
From a task-specific learner’s point of view, T alters the activation space (see the sketch below).
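A minimal sketch of one such layer under squared loss (the single linear layer, these shapes, and the loss are illustrative assumptions). The effective weight is A = TW; a gradient step on W alone is a preconditioned step on A, which is how the meta-learned T reshapes the space the task-specific learner moves in:

```python
import numpy as np

def tw_layer(x, W, T):
    # Effective weight A = T @ W: only W receives task-specific updates,
    # while the meta-learned T transforms (preconditions) the activations.
    return T @ (W @ x)

def inner_step(x, y_target, W, T, M, lr):
    # One task-specific step on W for L = 0.5 * ||T W x - y_target||^2.
    resid = tw_layer(x, W, T) - y_target
    grad_W = T.T @ resid[:, None] @ x[None, :]
    # MT-nets: a binary mask M (its distribution parameterized by zeta)
    # freezes part of W, restricting learning to a subspace.
    return W - lr * (M * grad_W)
```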
19 / 26
MT-nets
Proposition
Fix x and A. Let U be a d-dimensional subspace of R^n (d ≤ n). There exist configurations of T, W, and ζ such that the span of y_new − y is U while satisfying A = TW.
Proposition
Fix x, A, and a loss function L_T. Let U be a d-dimensional subspace of R^n, and let g(·, ·) be a metric tensor on U. There exist configurations of T, W, and ζ such that the vector y_new − y is in the steepest direction of descent on L_T with respect to the metric g.
22 / 26
Experiments
3 meta-tasks: regression to polynomials of order n (n ∈ {0, 1, 2}); a sketch of the task distribution follows below.
MT-nets choose to update more parameters for more complicated meta-tasks.
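A sketch of what a single task from such a meta-task could look like (coefficient and input ranges are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def sample_task(order, n_points=10, seed=None):
    # One task = regression to one random polynomial of the given order.
    rng = np.random.default_rng(seed)
    coeffs = rng.uniform(-1.0, 1.0, size=order + 1)
    xs = rng.uniform(-2.0, 2.0, size=n_points)
    return xs, np.polyval(coeffs, xs)

# An order-0 meta-task needs only a bias to adapt; order 2 needs more degrees
# of freedom, which the learned mask M reflects by unfreezing more parameters.
xs, ys = sample_task(order=2)
```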
25 / 26
Summary
MT-nets are robust to step size because of T, and the mask
M reflects the complexity of the meta-task.
MT-nets achieve state-of-the-art performance on a
challenging few-shot learning task.
26 / 26
Future Work
Our work shows that gradient-based meta-learning can benefit from additional structure. What other architectures could help meta-learners?
Our method performs gradient descent under a learned metric that speeds up learning; this might relate to natural gradients¹³.
Our metric is learned layerwise, which is similar to how a recent work¹⁴ factors parameter space to tractably approximate natural gradients.
¹³ Shun-Ichi Amari. “Natural gradient works efficiently in learning”. In: Neural Computation 10.2 (1998), pp. 251–276.
¹⁴ James Martens and Roger Grosse. “Optimizing neural networks with Kronecker-factored approximate curvature”. In: ICML (2015).
27 / 26
References I
[1] Shun-Ichi Amari. “Natural gradient works efficiently in learning”. In: Neural Computation 10.2 (1998), pp. 251–276.
[2] Marcin Andrychowicz et al. “Learning to learn by gradient descent by gradient descent”. In: NIPS (2016).
[3] Yan Duan et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”. In: arXiv (2016).
[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: ICML (2017).
[5] Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm”. In: ICLR (2018).
[6] Erin Grant et al. “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes”. In: ICLR (2018).
28 / 26
References II
[7] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. In: ICML (2015).
[8] James Martens and Roger Grosse. “Optimizing neural networks with Kronecker-factored approximate curvature”. In: ICML (2015).
[9] Nikhil Mishra et al. “A Simple Neural Attentive Meta-Learner”. In: ICLR (2018).
[10] Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-shot Learning”. In: ICLR (2017).
[11] Adam Santoro et al. “One-shot Learning with Memory-Augmented Neural Networks”. In: ICML (2016).
[12] Jake Snell, Kevin Swersky, and Richard S. Zemel. “Prototypical Networks for Few-shot Learning”. In: NIPS (2017).
29 / 26
References III
[13] Flood Sung et al. “Learning to Compare: Relation Network for Few-Shot Learning”. In: arXiv (2017).
[14] Oriol Vinyals et al. “Matching Networks for One Shot Learning”. In: NIPS (2016).