5. Basic Recurrent cells (RNN)
Source: http://colah.github.io/
Issues
× Difficulty dealing with long-term dependencies
× Difficult to train - vanishing gradient issues
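To make the vanishing-gradient point concrete, here is a minimal NumPy sketch (not from the slides) of one step of a vanilla RNN cell; the sizes and weights are illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new state mixes the current input with the previous state through tanh.
    # Backpropagating through many steps multiplies gradients by W_hh over and
    # over, which is what makes them vanish (or explode) on long sequences.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 8)) * 0.5   # input (4-d) to hidden (8-d), illustrative sizes
W_hh = rng.normal(size=(8, 8)) * 0.5   # hidden-to-hidden recurrence
b_h = np.zeros(8)

h = np.zeros(8)
for t in range(10):                    # unroll over 10 time steps
    h = rnn_step(rng.normal(size=4), h, W_xh, W_hh, b_h)
print(h.shape)                         # (8,)
```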
6. Long-term issues
Source: http://colah.github.io/,
CS224d notes
Sentence 1
"Jane walked into the room. John walked
in too. Jane said hi to ___"
Sentence 2
"Jane walked into the room. John walked in
too. It was late in the day, and everyone was
walking home after a long day at work. Jane
said hi to ___"
7. LSTM in 2 min...
Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data
Source: http://colah.github.io/
8. LSTM in 2 min...
Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data
Source: http://colah.github.io/
[Diagram: LSTM cell - cell state, forget gate, input gate, output gate]
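A minimal NumPy sketch of one LSTM step, assuming the usual gate formulation from the colah post; all sizes and weights are illustrative, not the slides' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W packs the weights of all four gates; input and previous state are concatenated.
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # cell state carries long-term information
    h = o * np.tanh(c)                             # hidden state exposed to the next step
    return h, c

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 8                                # illustrative sizes
W = rng.normal(size=(x_dim + h_dim, 4 * h_dim)) * 0.5
b = np.zeros(4 * h_dim)
h, c = np.zeros(h_dim), np.zeros(h_dim)
for t in range(5):
    h, c = lstm_step(rng.normal(size=x_dim), h, c, W, b)
print(h.shape, c.shape)                            # (8,) (8,)
```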
9. Gated recurrent unit (GRU) in 2 min ...
Review
× Fewer parameters
× Trains faster
× Better results with less data
Source: http://www.wildml.com/,
arXiv:1412.3555
10. Gated recurrent unit (GRU) in 2 min ...
Review
× Fewer parameters
× Trains faster
× Better results with less data
Source: http://www.wildml.com/,
arXiv:1412.3555
[Diagram: GRU cell - reset gate, update gate]
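A comparable NumPy sketch of one GRU step, again with illustrative sizes: two gates instead of the LSTM's three, and no separate cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ W_z + b_z)                    # update gate: how much to refresh the state
    r = sigmoid(xh @ W_r + b_r)                    # reset gate: how much of the past to drop
    h_cand = np.tanh(np.concatenate([x_t, r * h_prev]) @ W_h + b_h)
    return (1.0 - z) * h_prev + z * h_cand         # interpolate old state and candidate

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 8                                # illustrative sizes
shape = (x_dim + h_dim, h_dim)
W_z, W_r, W_h = (rng.normal(size=shape) * 0.5 for _ in range(3))
b_z = b_r = b_h = np.zeros(h_dim)

h = np.zeros(h_dim)
for t in range(5):
    h = gru_step(rng.normal(size=x_dim), h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h.shape)                                     # (8,)
```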
14. Basic idea
"Variable" size input (encoder) ->
Fixed size vector representation ->
"Variable" size output (decoder)
"Machine",
"Learning",
"is",
"fun"
"Aprendizado",
"de",
"Máquina",
"é",
"divertido"
0.636
0.122
0.981
Input
One word at a time Stateful
Model
Stateful
Model
Encoded
Sequence
Output
One word at a time
First RNN
(Encoder)
Second
RNN
(Decoder)
Memory of previous
word influence next
result
Memory of previous
word influence next
result
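A toy, untrained NumPy sketch of the idea above: the encoder folds a variable-length sentence into one fixed-size state vector, and the decoder unrolls from that vector one word at a time. All names, sizes, and weights here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_in  = {"Machine": 0, "Learning": 1, "is": 2, "fun": 3}
vocab_out = ["Aprendizado", "de", "Máquina", "é", "divertido", "<EOS>"]

emb   = rng.normal(size=(len(vocab_in), 8))        # toy word embeddings
W_enc = rng.normal(size=(16, 8)) * 0.1             # encoder recurrence
W_dec = rng.normal(size=(8, 8)) * 0.1              # decoder recurrence
W_out = rng.normal(size=(8, len(vocab_out)))       # state -> output vocabulary

# Encoder: fold the whole sentence into a single fixed-size state vector.
state = np.zeros(8)
for word in ["Machine", "Learning", "is", "fun"]:
    state = np.tanh(np.concatenate([emb[vocab_in[word]], state]) @ W_enc)

# Decoder: generate greedily until <EOS> (capped at 10 steps).
out = []
for _ in range(10):
    state = np.tanh(state @ W_dec)
    word = vocab_out[int(np.argmax(state @ W_out))]
    if word == "<EOS>":
        break
    out.append(word)
print(out)   # untrained weights, so the "translation" is meaningless
```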
15. Sequence to Sequence Learning with Neural Networks (2014)
"Machine",
"Learning",
"is",
"fun"
"Aprendizado",
"de",
"Máquina",
"é",
"divertido"
0.636
0.122
0.981
1000d word
embeddings
4 layers
1000
cells/layer
Encoded
Sequence
LSTM
(Encoder)
LSTM
(Decoder)
Source: arXiv 1409.3215v3
TRAINING → SGD without momentum, fixed learning rate of 0.7, 7.5 epochs, batches of 128
sentences, 10 days of training (WMT'14 English-to-French dataset)
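A hedged PyTorch sketch of that scale (this is not the authors' code; the vocabulary, embedding, and layer sizes follow the figures quoted above and in the paper, and everything else is a placeholder).

```python
import torch
from torch import nn, optim

SRC_VOCAB, TGT_VOCAB, DIM = 160_000, 80_000, 1000    # 1000-d embeddings, 1000 cells/layer

src_emb = nn.Embedding(SRC_VOCAB, DIM)
tgt_emb = nn.Embedding(TGT_VOCAB, DIM)
encoder = nn.LSTM(DIM, DIM, num_layers=4, batch_first=True)   # 4-layer LSTM encoder
decoder = nn.LSTM(DIM, DIM, num_layers=4, batch_first=True)   # 4-layer LSTM decoder
proj    = nn.Linear(DIM, TGT_VOCAB)                           # state -> target vocabulary

params = [p for m in (src_emb, tgt_emb, encoder, decoder, proj) for p in m.parameters()]
# Plain SGD, no momentum, fixed learning rate of 0.7, as in the training recipe above.
opt = optim.SGD(params, lr=0.7, momentum=0.0)
```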
16. Recurrent encoder-decoders
[Diagram: unrolled encoder-decoder. The source sequence "Les chiens aiment les os <EOS>" is read
by the encoder; the decoder then receives "Dogs love bones" step by step and predicts the target
sequence "Dogs love bones <EOS>".]
Source: arXiv 1409.3215v3
19. Recurrent encoder-decoders - issues
Source: arXiv 1409.3215v3
● Difficult to cope with long sentences (longer than those seen in the training corpus)
● A decoder with an attention mechanism relieves the encoder from squashing everything into a
fixed-length vector
20. NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)
Source: arXiv 1409.0473v7
[Diagram: decoder with attention - a context vector for each target word, built from weights over
each annotation hj]
21. NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)
Source: arXiv 1409.0473v7
[Diagram: decoder with attention - a context vector for each target word, built from weights over
each annotation hj; the resulting alignment is non-monotonic]
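A small NumPy sketch of the attention step described above, assuming the additive scoring form from the paper (a score of v_a·tanh(W_a s + U_a h_j) per annotation, softmaxed into weights); matrix names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    # H: (T, d) matrix of encoder annotations h_1..h_T; s_prev: previous decoder state.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = softmax(scores)          # weights of each annotation h_j
    return alpha @ H, alpha          # context vector for this target word, and the weights

rng = np.random.default_rng(0)
d, T = 8, 6                          # illustrative sizes
W_a, U_a, v_a = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
c_i, alpha = attention_context(rng.normal(size=d), rng.normal(size=(T, d)), W_a, U_a, v_a)
print(alpha.round(2), alpha.sum())   # weights over the source positions, summing to 1
```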
22. Attention models for NLP
Source: arXiv 1409.0473v7
[Diagram: attention over the source "Les chiens aiment les os <EOS>"; decoding starts from <EOS>
with a weighted sum (+) of the encoder states]
23. Attention models for NLP
Source: arXiv 1409.0473v7
[Diagram: the first weighted sum of encoder states produces the first target word, "Dogs"]
24. Attention models for NLP
Source: arXiv 1409.0473v7
[Diagram: "Dogs" is fed back into the decoder and a new weighted sum over the source produces
"love"]
25. Attention models for NLP
Source: arXiv 1409.0473v7
[Diagram: the process repeats - "love" feeds back in and the next weighted sum produces "bones"]
26. Challenges in using the model
Source: http://suriyadeepan.github.io/
● Cannot handle truly variable-size input → PADDING
● Hard to deal with both short and long sentences → BUCKETING
● Capture context / semantic meaning → WORD EMBEDDINGS
27. Padding
Source: http://suriyadeepan.github.io/
EOS : End of sentence
PAD : Filler
GO : Start decoding
UNK : Unknown; word not in vocabulary
Q : "What time is it? "
A : "It is seven thirty."
Q : [ PAD, PAD, PAD, PAD, PAD, "?", "it", "is", "time", "What" ]
A : [ GO, "It", "is", "seven", "thirty", ".", EOS, PAD, PAD, PAD ]
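A minimal Python sketch (not from the slides) of this padding scheme: the question is reversed and left-padded to a fixed encoder length, the answer gets GO/EOS markers and is right-padded; pad_pair and its fixed lengths are hypothetical helpers.

```python
# PAD / GO / EOS / UNK follow the token legend above.
PAD, GO, EOS, UNK = "PAD", "GO", "EOS", "UNK"

def pad_pair(question, answer, enc_len=10, dec_len=10, vocab=None):
    q = [w if vocab is None or w in vocab else UNK for w in question]
    a = [w if vocab is None or w in vocab else UNK for w in answer]
    enc = [PAD] * (enc_len - len(q)) + list(reversed(q))   # reversed and left-padded
    dec = [GO] + a + [EOS]
    dec = dec + [PAD] * (dec_len - len(dec))               # right-padded with filler
    return enc, dec

q = ["What", "time", "is", "it", "?"]
a = ["It", "is", "seven", "thirty", "."]
enc, dec = pad_pair(q, a)
print(enc)  # ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'it', 'is', 'time', 'What']
print(dec)  # ['GO', 'It', 'is', 'seven', 'thirty', '.', 'EOS', 'PAD', 'PAD', 'PAD']
```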
28. Bucketing
Source: https://www.tensorflow.org/
Efficiently handle sentences of different lengths
Ex: the largest sentence in the corpus has 100 tokens
What about short sentences like "How are you?" → lots of PAD
Bucket list: [(5, 10), (10, 15), (20, 25), (40, 50)]
(default in TensorFlow's translate.py)
Q : [ PAD, PAD, ".", "go", "I" ]
A : [ GO, "Je", "vais", ".", EOS, PAD, PAD, PAD, PAD, PAD ]
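A minimal Python sketch of bucket selection (this is not translate.py itself): pick the smallest (encoder, decoder) bucket that fits the sentence pair, then pad only up to that bucket's lengths instead of the corpus maximum.

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]   # (encoder_len, decoder_len) pairs

def pick_bucket(src_len, tgt_len, buckets=BUCKETS):
    # Return the first bucket large enough for both the source and the target.
    for enc_len, dec_len in buckets:
        if src_len <= enc_len and tgt_len <= dec_len:
            return enc_len, dec_len
    raise ValueError("sentence pair longer than the largest bucket")

print(pick_bucket(3, 5))    # "I go ." / GO "Je" "vais" "." EOS fits the (5, 10) bucket
print(pick_bucket(18, 22))  # a longer pair falls into the (20, 25) bucket
```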
29. Word embeddings (remember the previous presentation ;-)
Distributed representations → syntactic and semantic information is captured
"Take" = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
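A small NumPy sketch of an embedding lookup: only the "Take" vector comes from the slide; the other rows and the tiny vocabulary are made up to show how cosine similarity compares words once the matrix is trained.

```python
import numpy as np

vocab = {"Take": 0, "give": 1, "fun": 2}
embeddings = np.array([
    [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271],  # "Take" (from the slide)
    [0.310, 0.801, -0.150, -0.090, 0.120, -0.500, 0.330, 0.260],  # made-up nearby vector
    [-0.40, 0.050,  0.700,  0.210, -0.330, 0.120, -0.100, 0.080], # made-up unrelated vector
])

def embed(word):
    # Each word maps to one row of the dense embedding matrix.
    return embeddings[vocab[word]]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(embed("Take"), embed("give")))  # high: the made-up neighbour is close to "Take"
print(cos(embed("Take"), embed("fun")))   # lower: unrelated vector
```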
36. Google Smart reply
Source: arXiv 1606.04870v1
Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set of 238 million messages
37. Google Smart reply
Source: arXiv 1606.04870v1
Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set of 238 million messages
[Diagram: system components - Seq2Seq response model, feedforward triggering model,
semi-supervised semantic clustering]
38. Image captioning (Paper: Show and Tell: A Neural Image Caption Generator)
Source: arXiv 1411.4555v2
39. Image captioning (Paper: Show and Tell: A Neural Image Caption Generator)
[Diagram: a CNN image model as the encoder, an LSTM language model as the decoder]
Source: arXiv 1411.4555v2
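A hedged PyTorch sketch of the Show-and-Tell idea (not the paper's code): a CNN encodes the image into one feature vector that seeds an LSTM decoder, which then predicts the caption word by word. The tiny CNN and all sizes here are placeholders, not the pretrained network used in the paper.

```python
import torch
from torch import nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=10_000, feat_dim=512, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(                     # stand-in for a pretrained image CNN
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image, caption_tokens):
        feats = self.cnn(image).unsqueeze(1)          # (B, 1, feat_dim) image feature
        words = self.embed(caption_tokens)            # (B, T, feat_dim) caption so far
        seq = torch.cat([feats, words], dim=1)        # image vector first, then words
        h, _ = self.lstm(seq)
        return self.out(h)                            # word logits at every step

model = CaptionModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 10_000, (2, 5)))
print(logits.shape)   # torch.Size([2, 6, 10000])
```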