This document discusses Neural Turing Machines (NTMs): neural networks combined with an external memory system, trained end-to-end with backpropagation. NTMs can learn simple algorithms and generalize well on tasks such as language modeling and question answering. However, they are difficult to train because of numerical instability and the difficulty of learning to use memory well. The document recommends techniques such as gradient clipping, loss clipping, and curriculum learning to improve training. It also covers later developments such as the differentiable neural computer, which can allocate and deallocate memory.
Neural Turing Machines can…
Learn simple algorithms (copy, repeat, recognize simple formal languages…)
Generalize to inputs longer than those seen in training
Do well at language modeling
Do well at bAbI question answering
bAbI dataset
1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? bathroom 1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? office 11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? bathroom 8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra? bathroom 2
Small vocabulary
Short stories built from simple statements
Questions must be answered from context (the number after each answer marks the supporting statement)
https://research.facebook.com/research/babi/
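A minimal parser sketch for this format. The function name is hypothetical, and note that the real bAbI files separate question, answer, and supporting-fact id with tabs, which the flat rendering above collapses to spaces:

```python
def parse_babi(lines):
    """Sketch of a bAbI reader. Statement lines are 'id text';
    question lines are 'id question<TAB>answer<TAB>supporting id'.
    Line ids reset to 1 at the start of each new story."""
    stories, story = [], []
    for line in lines:
        idx, text = line.strip().split(' ', 1)
        if int(idx) == 1 and story:       # id reset: a new story begins
            stories.append(story)
            story = []
        if '\t' in text:                  # question line
            question, answer, support = text.split('\t')
            story.append(('question', question, answer, int(support)))
        else:                             # plain statement
            story.append(('statement', text))
    if story:
        stories.append(story)
    return stories
```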
Graves’ RMSprop
A variant of the standard RMSprop optimizer, used to apply the gradients computed by backpropagation
Used in many of Graves’ RNN papers:
n_i = α·n_i + (1 - α)·ε_i²
g_i = α·g_i + (1 - α)·ε_i
Δ_i = β·Δ_i - γ·ε_i / √(n_i - g_i² + δ)
w_i = w_i + Δ_i
(ε_i denotes the gradient of the loss with respect to weight w_i)
This amounts to normalizing each gradient update by an estimate of its variance: unlike standard RMSprop, which divides by the raw second moment n_i, Graves’ version subtracts g_i² to center the estimate. That normalization is important for the NTM’s highly variable loss.
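A minimal NumPy sketch of the update above; the hyperparameter defaults follow Graves 2013, while the function name and in-place style are mine:

```python
import numpy as np

def graves_rmsprop_update(w, grad, n, g, delta_w,
                          alpha=0.95, beta=0.9, gamma=1e-4, delta=1e-4):
    """One Graves-style RMSprop step.
    w: parameters, grad: dL/dw (the epsilon_i above),
    n, g: running moments, delta_w: momentum buffer.
    All arrays are updated in place and returned."""
    n[:] = alpha * n + (1 - alpha) * grad ** 2   # running mean of grad^2
    g[:] = alpha * g + (1 - alpha) * grad        # running mean of grad
    # scale the step by an estimate of the gradient's standard deviation
    delta_w[:] = beta * delta_w - gamma * grad / np.sqrt(n - g ** 2 + delta)
    w[:] = w + delta_w
    return w, n, g, delta_w
```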
Adam Optimizer
Works well for many tasks
Comes pre-loaded in most ML frameworks
Like Graves’ RMSprop, it smooths gradient updates
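For example, in PyTorch, where Adam comes pre-loaded as torch.optim.Adam (the tiny model and stand-in loss below are just for illustration):

```python
import torch

model = torch.nn.Linear(10, 10)                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 10)).pow(2).mean()   # stand-in loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```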
Attention to initialization
Memory initialization is extremely important
Poor initialization can prevent convergence
Pay particularly close attention to the starting value of the memory
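A sketch of one reasonable scheme, assuming an N×M memory matrix as in the NTM paper; the specific constants are illustrative, not prescribed by the slides:

```python
import numpy as np

N, M = 128, 20  # memory slots x slot width (the NTM's N x M memory matrix)

# Start memory near a small constant rather than large random values,
# so early reads are low-magnitude and uniform.
memory = np.full((N, M), 1e-6)

# Read/write weightings can start uniform (or near one-hot);
# uniform is the simplest choice.
read_weights = np.full(N, 1.0 / N)
```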
Short sequences first (“Curriculum Learning”)
1) Feed in short training data
2) When the loss hits a target, increase the size of the input
3) Repeat
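A minimal sketch of this loop; model, make_batch, and train_step are hypothetical stand-ins for your own training code:

```python
def curriculum_train(model, make_batch, train_step,
                     start_len=2, max_len=20, loss_target=0.01):
    """Train on short sequences first; lengthen the input each
    time the loss drops below the target."""
    seq_len = start_len
    while seq_len <= max_len:
        batch = make_batch(seq_len)    # 1) short training data first
        loss = train_step(model, batch)
        if loss < loss_target:         # 2) loss hit the target...
            seq_len += 1               # ...so increase the input size
        # 3) repeat
```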
Neural Turing Machines “V2”: the Differentiable Neural Computer (Graves et al. 2016)
Similar to NTMs, except…
No shift-based (index) addressing
Can ‘allocate’ and ‘deallocate’ memory
Remembers recent memory use
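For the allocation step, here is a NumPy sketch of the DNC’s allocation weighting (Graves et al. 2016), under which the least-used memory slots receive the most allocation weight:

```python
import numpy as np

def allocation_weighting(usage):
    """Allocation weight per slot from a usage vector in [0, 1]:
    a[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]],
    where phi sorts slots by ascending usage (freest first)."""
    order = np.argsort(usage)                  # least-used slots first
    a = np.zeros_like(usage)
    shifted = np.concatenate(([1.0], np.cumprod(usage[order])[:-1]))
    a[order] = (1.0 - usage[order]) * shifted
    return a

print(allocation_weighting(np.array([0.9, 0.1, 0.5])))
# -> most weight on slot 1, the least-used slot
```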
References
Implementations:
Tensorflow: https://github.com/carpedm20/NTM-tensorflow
Go: https://github.com/fumin/ntm
Torch: https://github.com/kaishengtai/torch-ntm
Node.JS: https://github.com/gcgibson/NTM
Lasagne: https://github.com/snipsco/ntm-lasagne
Theano: https://github.com/shawntan/neural-turing-machines
Papers:
Graves et al. 2016 – Hybrid computing using a neural network with dynamic external memory
Graves et al. 2014 – Neural Turing Machines
Yu et al. 2015 – Empirical Study on Deep Learning Models for Question Answering
Rae et al. 2016 – Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
A Turing machine is a simplified model of a computer. Instead of RAM or a hard drive, it has only a ‘tape’ with symbols written on it. The machine’s ‘head’ can read the symbol at a given location and decide whether to write a new symbol or move along the tape. This models a program that makes use of an external memory source.
A Neural Turing Machine is a kind of neural network whose architecture mirrors that of a Turing machine. Unlike the classical Turing machine, a Neural Turing Machine receives external input and produces external output. It also has an internal memory component, the ‘tape’, over which the machine can move. The main achievement of the NTM is representing the usual operations of a Turing machine with differentiable functions, so that it can be trained like any other sequence-to-sequence RNN with backpropagation.
NTMs can learn discrete procedures and (in theory) extrapolate them to inputs of varying size.
These are the operations that take finite-step procedures, like accessing memory, and make them expressible in terms of backpropagation. In other words, they let us interpret choices such as “which location of memory should I access” in terms of the relative ‘loss’ or cost of that decision.
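A minimal NumPy sketch of the idea, using NTM-style content addressing: instead of a hard lookup, the read is a similarity-weighted blend over all memory slots, so the addressing decision itself has a gradient. The sharpening parameter beta follows the NTM paper; the function name is mine:

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Soft, differentiable memory read: weight every slot by its
    (sharpened) cosine similarity to the key, then blend the rows."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()              # softmax: attention weights over slots
    return w @ memory         # blended read vector

memory = np.random.randn(128, 20)
read = content_addressing(memory, key=memory[5], beta=5.0)
# With a high beta, `read` concentrates on the slot matching the key.
```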