Sequence Modelling
with Deep Learning
ODSC London 2019 Tutorial
Natasha Latysheva
Overview
I. Introduction to sequence modelling
II. Quick neural network review
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Speaker Intro
• Welocalize
• We provide language services
• Fairly large: 8th largest globally by revenue,
4th largest in the US; 1,500+ employees.
• Lots of localisation (translation)
• International marketing, site optimisation
• NLP engineering team
• 14 people remote across US, Ireland, UK,
Germany, China
• Various NLP things: machine translation,
text-to-speech, NER, sentiment, topics,
classification, etc.
I. Introduction to Sequence Modelling
Other sequence problems
Less conventional sequence data
• Activity on a website:
• [click_button, move_cursor, wait,
wait, click_subscribe, close_tab]
• Customer history:
• [inactive -> mildly_active ->
payment_made -> complaint_filed
-> inactive -> account_closed]
• Code (constrained language) is
sequential data – can learn the
structure
II. Quick Neural Network Review
Feed-forward networks
Simplifying the notation
• Single neurons
• Weight matrices, bias vectors
• Fully-connected layer
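To make the weight-matrix and bias-vector notation concrete, here is a minimal NumPy sketch of a single fully-connected layer; the sizes and the tanh activation are illustrative choices, not values from the slides.

```python
import numpy as np

# One fully-connected layer: y = activation(W x + b).
def dense_layer(x, W, b, activation=np.tanh):
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # 4 input features
W = rng.normal(size=(3, 4))     # 3 output units, each with 4 weights
b = np.zeros(3)                 # one bias per output unit
print(dense_layer(x, W, b))     # 3 output activations
```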
III. Recurrent Neural Networks
Why do we need fancy methods to
model sequences?
• Say we are training a translation
model, English->French
• “The cat is black” to “Le chat
est noir”
• Could in theory use a feed-
forward network to translate
word-by-word
Why do we need fancy methods?
• A feed-forward network treats
time steps as completely
independent
• Even in this simple 1-to-1
correspondence example, things
are broken
• How you translate “black” depends
on noun gender (“noir” vs. “noire”)
• How you translate “The” also
depends on gender (“Le” vs. “La”)
• More generally, getting the
translation right requires context
Why do we need fancy methods?
• We need a way for the network
to remember information from
previous time steps
Recurrent neural networks
• Extremely popular way of modelling
sequential data
• Process data one time step at a
time, while updating a running
internal hidden state
Standard FF network to RNN
• At each time step, RNN
passes on its activations
from previous time step
• In theory all the way back
to the first time step
Standard FF network to RNN
*Activation function is typically tanh or ReLU
Standard FF network to RNN
• So you can say this is a
form of memory
• Cell hidden state
transferred
• Basis for RNNs
remembering context
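As a rough sketch of this recurrence (not from the slides), one vanilla RNN step in NumPy; the dimensions and the tanh activation are illustrative assumptions.

```python
import numpy as np

# One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
hidden, inputs = 8, 5
W_xh = rng.normal(size=(hidden, inputs))
W_hh = rng.normal(size=(hidden, hidden))
b_h = np.zeros(hidden)

h = np.zeros(hidden)                       # initial hidden state
for x_t in rng.normal(size=(10, inputs)):  # a toy sequence of 10 steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # hidden state carries context forward
```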
Memory problems
• Basic RNNs are not great at
long-term dependencies,
but there are plenty of ways to
improve this
• Information gating
mechanisms
• Condensing input using
encoders
Gating mechanisms
• Gates regulate the flow of
information
• Very helpful - basic RNN cells not really
used anymore. Responsible for recent
RNN popularity.
• Add explicit mechanisms to remember
information and forget information
• Why use gates?
• Helps you learn long-term
dependencies
• Not all time points are equally relevant
– not everything has to be remembered
• Speeds up training/convergence
Gated recurrent
units (GRUs)
• GRUs were developed later
than LSTMs but are simpler
• Motivation is to get the main
benefits of LSTMs but with less
computation
• Reset gate: Mechanism to
decide when to remember vs.
forget/reset previous
information (hidden state)
• Update gate: Mechanism to
decide when to update
hidden state
GRU mechanics
• Reset gate controls how
much past info we use
• Rt = 0 means we are resetting
our RNN, not using any
previous information
• Rt = 1 means we use all of
previous information (back to
our normal vanilla RNN)
GRU mechanics
• Update gate controls whether
we bother updating our
hidden state using new
information
• Zt = 1 means you’re not
updating, you’re just using
previous hidden state
• Zt = 0 means you’re updating as
much as possible
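A minimal NumPy sketch of one GRU step, following the slide's convention that Zt = 1 keeps the previous hidden state and Zt = 0 replaces it; all weight shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One GRU step with reset gate r and update gate z.
def gru_step(x_t, h_prev, p):
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand   # z = 1 keeps old state, z = 0 replaces it

rng = np.random.default_rng(0)
hidden, inputs = 6, 4
p = {f"W_{g}": rng.normal(size=(hidden, inputs)) for g in "rzh"}
p.update({f"U_{g}": rng.normal(size=(hidden, hidden)) for g in "rzh"})
p.update({f"b_{g}": np.zeros(hidden) for g in "rzh"})

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h = gru_step(x_t, h, p)
```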
LSTM mechanics
• LSTMs add a memory unit to
further control the flow of
information through the cell
• Also whereas GRUs have 2
gates, an LSTM cell has 3
gates:
• An input gate – should I ignore
or consider the input?
• A forget gate – should I keep
or throw away the information
in memory?
• An output gate – how should I
use input, hidden state and
memory to output my next
hidden state?
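A comparable sketch of one LSTM step with the input, forget and output gates plus the memory cell; the shapes are again illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One LSTM step: gates regulate what enters memory, what is kept, and what is output.
def lstm_step(x_t, h_prev, c_prev, p):
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    c_cand = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c = f * c_prev + i * c_cand        # memory updated via forget gate + candidate
    h = o * np.tanh(c)                 # output gate shapes the next hidden state
    return h, c

rng = np.random.default_rng(0)
hidden, inputs = 6, 4
p = {f"W_{g}": rng.normal(size=(hidden, inputs)) for g in "ifoc"}
p.update({f"U_{g}": rng.normal(size=(hidden, hidden)) for g in "ifoc"})
p.update({f"b_{g}": np.zeros(hidden) for g in "ifoc"})

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, p)
```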
GRUs vs. LSTMs
• GRUs are simpler + train
faster
• LSTMs more popular – can
give slightly better
performance, but GRU
performance often on par
• LSTMs would in theory
outperform GRUs in tasks
requiring very long-range
modelling
IV. Game of Thrones Language Model
Notebook
• ~30 mins
• Jupyter
notebook on
building an RNN-
based language
model
• Python 3 + Keras
for neural
networks
tinyurl.com/wbay5o3
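For orientation, a minimal Keras sketch of the kind of RNN-based language model the notebook builds; the vocabulary size and layer sizes here are assumptions, not the notebook's actual values.

```python
from tensorflow.keras import layers, models

vocab_size = 5000   # assumed vocabulary size

# Embed tokens, run an LSTM over the sequence, predict the next token.
model = models.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),  # next-token distribution
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```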
V. Components of SOTA RNN models
Encoder-Decoder architectures
• Translating word-by-word forces
the model to immediately
output a French
word for every
English word
Encoder-Decoder architectures
• Tends to work a lot
better than using a
single sequence-to-
sequence RNN to
produce an output
for each input step
• You often need to
see the whole
sequence before
knowing what to
output
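A hedged Keras sketch of an RNN encoder-decoder trained with teacher forcing; the vocabulary and layer sizes are illustrative assumptions.

```python
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, units = 8000, 8000, 256   # assumed sizes

# Encoder: read the whole source sentence, keep only the final states.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, units)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generate the target sentence conditioned on the encoder states.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, units)(dec_in)
dec_out, _, _ = layers.LSTM(units, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
dec_pred = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], dec_pred)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

At inference time the decoder would be run one token at a time, feeding each prediction back in as the next input.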
Bidirectionality in RNN encoder-decoders
• For the encoder,
bidirectional RNNs
(BRNNs) are often used
• BRNNs read the
input sequences
forwards and
backwards
Bidirectional
RNNs
• Process input
sequences in both
directions
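A minimal example of wrapping a recurrent layer in Keras's Bidirectional wrapper; the sizes are illustrative.

```python
import numpy as np
from tensorflow.keras import layers

units = 16
birnn = layers.Bidirectional(layers.GRU(units, return_sequences=True))

x = np.random.normal(size=(2, 10, 8)).astype("float32")  # (batch, time, features)
h = birnn(x)
print(h.shape)  # (2, 10, 32): forward and backward hidden states concatenated
```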
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense input
sequence down to a
small vector?!
• Memorise long
sequence + regurgitate
• Not how humans work
• Long computation
paths
Attention concept
• Has been very influential in
deep learning
• Originally developed for
MT (Bahdanau, 2014)
• As you’re producing your
output sequence, maybe
not every part of your input
is as equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back
to the input
sequence, instead of
forcing it to encode
all information into
one fixed-length
vector
• Encoder: uses a BRNN
to compute a rich set of
features about source
words and their
surrounding words
• Decoder is asked to
choose which hidden
states to use and
ignore
• Weighted sum of
hidden states used to
predict the next word
Attention intuition
• Decoder RNN uses
attention parameters
to decide how much
to pay attention to
different parts of the
input
• Allows the model to
amplify the signal
from relevant parts of
the input sequence
• This improves
modelling
Main benefits
• Encoder passes a lot
more data to
the decoder
• Not just last hidden
state
• Passes all hidden states
at every time step
• Computation path
problem: relevant
information is now
closer by
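As a rough illustration of the weighted-sum idea: score each encoder hidden state against the current decoder state, softmax the scores, and take a weighted sum. The dot-product scoring used here is an illustrative simplification; Bahdanau attention actually uses a small feed-forward scoring network.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 16))   # 6 source positions, 16-dim hidden states
dec_state = rng.normal(size=16)         # current decoder hidden state

scores = enc_states @ dec_state         # one relevance score per source position
weights = softmax(scores)               # attention weights, sum to 1
context = weights @ enc_states          # weighted sum fed to the decoder
```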
Summary so far
• Sequence modelling
• Recurrent neural
networks
• Some key components
of SOTA RNN-based
models:
• Gating mechanisms
(GRUs and LSTMs)
• Encoder-decoders
• Bidirectional encoding
• Attention
VI. Transformers and self-attention
Transformers are taking over NLP
• Translation, language
models, question
answering, summarisation,
etc.
• Some of the best word
embeddings are based on
Transformers
• BERT, ELMo, OpenAI GPT-2
models
A single Transformer encoder block
• No recurrence, no convolutions
• “Attention is all you need” paper
• The core concept is the self-
attention mechanism
• Much more parallelisable than
RNN-based models, which
means faster training
Self-attention is a
sequence-to-sequence
operation
• At the highest level – self-
attention takes t input
vectors and outputs t
output vectors
• Take input embedding for
“the” and update it by
incorporating
information from its
context
How is the vector for “the” updated?
• Each output vector
is a weighted sum
of the input vectors
• But all of these
weights are
different
These are not learned weights in the
traditional neural network sense
• The weights are
calculated by taking
dot products
• Can use different
functions over input
Example calculation of a single weight
Calculating a weight matrix row
Attention weight matrix
• The dot product can be
anything (negative infinity to
positive infinity)
• We normalise by length
• We softmax this so that the
weights are positive values
summing to 1
• Attention weight matrix
summarises relationship
between words
• Because dot products capture
similarity between vectors
Multi-headed attention
• Attention weight matrix
captures relationship
between words
• But there’s many
different ways words can
be related
• And which ones you want
to capture depends on
your task
• Different attention heads
learn different relations
between word pairs
Difference to RNNs
• Whereas RNNs update context
token-by-token by updating an
internal hidden state, self-
attention captures context by
updating all word representations
simultaneously
• Lower computational complexity,
scales better with more data
• More parallelisable = faster
training
Connecting all
these concepts
• “Useful” input representations are
learned
• “Useful” weights for transforming
input vectors are learned
• These quantities should produce
“useful” dot products
• That lead to “useful” updated input
vectors
• That lead to “useful” input to the
feed-forward network layer
• … etc. … that eventually lead to
lower overall loss on the training set
Summary
I. Introduction to sequence modelling
II. Quick neural network review
• How a single neuron functions
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Further Reading
• More accessible: Andrew Ng's Sequence
Models course on Coursera
• https://www.coursera.org/learn/nlp-sequence-models
• More technical: Deep Learning book
by Goodfellow et al.
• https://www.deeplearningbook.org/contents/rnn.html
• Also: Alex Smola's Berkeley lectures
• https://www.youtube.com/user/smolix/videos
Just for fun
• Talk to transformer
• https://talktotransformer.com/
• Using OpenAI’s “too
dangerous to release” GPT-
2 language model
Thanks, questions?
Extra slides
Sequences in natural language
• Sequence modelling very popular in
NLP because language is sequential by
nature
• Text
• Sequences of words
• Sequences of characters
• We process text sequentially, though in
principle could see all words at once
• Speech
• Sequence of amplitudes over time
• Frequency spectrogram over time
• Extracted frequency features over time
Sequences in biology
• Genomics, DNA and
RNA sequences
• Proteomics, protein
sequences,
structural biology
• Trying to represent
sequences in some
way, or predict some
function or
association of the
sequence
Sequences in finance
• Lots of time series data
• Numerical sequences (stocks,
indices)
• Lots of forecasting work –
predicting the future (trading
strategies)
• Deep learning for these
sequences perhaps not as
popular as you might think
• Quite well-developed methods
based on classical statistics,
interpretability important
Single neuron computation
• What computation is
happening inside 1
neuron?
• If you understand how 1
neuron computes output
given input, it’s a small
step to understand how an
entire network computes
output given input
Perceptrons
• Modelling a binary outcome using
binary input features
• Should I have a cup of tea?
• 0 = no
• 1 = yes
• Three features with 1 weight each:
• Do they have Earl Grey?
• earl_grey, w1 = 3
• Have I just had a cup of tea?
• already_had, w2 = -1
• Can I get it to go?
• to_go, w3 = 2
Perceptrons
• Here weights are
cherry-picked, but
perceptrons learn these
weights automatically
from training data by
shifting parameters to
minimise error
Perceptrons
• Formalising the perceptron
calculation
• Instead of a threshold, more
common to see a bias term
• Instead of writing out the
sums using sigma notation,
more common to see dot
products.
• Vectorisation for efficiency
• Here, I manually chose these
values – but given a dataset of
past inputs/outputs, you could
learn the optimal parameter
values
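The tea perceptron above, written as a dot product plus a bias; the weights come from the example, while the bias of -2 (i.e. a threshold of 2) is an assumed value.

```python
import numpy as np

w = np.array([3.0, -1.0, 2.0])   # earl_grey, already_had, to_go
b = -2.0                         # assumed bias (threshold of 2)

def perceptron(x):
    return int(w @ x + b > 0)    # 1 = have tea, 0 = don't

print(perceptron(np.array([1, 0, 1])))  # Earl Grey available, to go -> 1
print(perceptron(np.array([0, 1, 0])))  # just had a cup           -> 0
```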
Sigmoid neurons
• Want to handle continuous
values
• Where input can be
something other than just 0 or
1
• Where output can be
something other than just 0 or
1
• We put the weighted sum of
inputs through an activation
function
• Sigmoid or logistic function
Sigmoid neurons
• The sigmoid function is
basically a smoothed out
perceptron!
• Output no longer a
sudden jump
• It’s the smoothness of the
function that we care
about
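The same tea example pushed through a sigmoid instead of a hard threshold, reusing the assumed bias from the perceptron sketch; the output becomes a graded value rather than a sudden jump.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, b = np.array([3.0, -1.0, 2.0]), -2.0   # weights from the example, assumed bias
x = np.array([1.0, 0.0, 1.0])             # Earl Grey available, to go
print(sigmoid(w @ x + b))                 # ~0.95: a graded "yes" instead of a hard 1
```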
Activation functions
• Which activation function
to use?
• Heuristics based on
experiments, not proof-
based
More layers!
• Increase
number of
layers to
increase
capacity for
abstraction,
hierarchical
processing of
input
Training on big window sizes
• How large should the window be? On a very long sequence, the unrolled
RNN becomes a very deep network
• Same problems with vanishing/exploding gradients as normal
networks
• And takes a longer time to train
• The normal tricks can help – good initialization of parameters, non-
saturating activation functions, gradient clipping, batch norm
• Training over a limited number of steps – truncated
backpropagation through time
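As a small illustration of one of these tricks, gradient clipping as exposed by Keras optimizers; the threshold of 1.0 is an arbitrary choice. Truncated backpropagation through time is then a matter of training on fixed-length windows of the sequence rather than the whole thing.

```python
from tensorflow.keras.optimizers import Adam

# clipnorm caps the norm of each gradient, which tames exploding gradients.
opt = Adam(learning_rate=1e-3, clipnorm=1.0)
```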
LSTM mechanics
• Input, forget, output gates are
little neural networks within the
cell
• Memory being updated via
forget gate and candidate
memory
• Hidden state being updated by
output gate, which weighs up all
information
Query, Key, and Value transformations
• Notice that we are using
each input vector on 3
separate occasions
• E.g. vector x2
1. To take dot products
with each other input
vector when calculating
y2
2. In the dot products when the
other output vectors (y1,
y3, y4) are calculated
3. And in the weighted
sum to produce output
vector y2
Query, Key, and Value transformations
• To model these 3
different functions for
each input vector, and
give the model extra
expressivity and
flexibility, we are going
to modify the input
vectors
• Apply simple linear
transformations
Input transformation
matrices
• These weight matrices
are learnable
parameters
• Gives something else
to learn by gradient
descent
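Putting the pieces together, a minimal NumPy sketch of self-attention with query, key and value projections; the random matrices here stand in for parameters that would be learned by gradient descent.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
t, d = 4, 8                                  # assumed sizes
X = rng.normal(size=(t, d))                  # input vectors, one row per token
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # three learned views of each input vector
A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # attention weight matrix
Y = A @ V                                    # updated output vectors
```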