Deep learning has been making headlines around the world for its unparalleled performance in everything from recognizing cat videos to playing Go. However, deep learning remains well beyond the reach of average developers and researchers due to its massive compute and data requirements. Transfer learning is an old field of study going through a renaissance that promises to change that by re-using portions of deep network models to drastically reduce both compute and data requirements. Come learn about some traditional transfer learning techniques such as word2vec and newer technologies like GPT and BERT.
5. Network Models?
(Timeline: 1940 → 1980 → today, pairing each network-model idea with its critique)
• Hebbian learning → "Maybe this is actually the opposite of how things work?"
• Spike-timing-dependent plasticity → "Oh, I guess this doesn't really work in machine learning"
• All-or-nothing neurons all wired together → "Connectivity in the brain is complex; all-or-nothing isn't an absolute rule"
• Backprop → ???
• Step functions → "Non-linearities are critical; step functions don't work that well" → ReLUs, convolution, recurrence
6. "Neuroscientists have long criticised [sic] deep learning algorithms as incompatible with current knowledge of neurobiology."
- Yoshua Bengio et al., Towards Biologically Plausible Deep Learning (2015)
14. The Problem With Text
John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.
Feature(s):
• Name
15. The Problem With Text (same passage)
Feature(s):
• Name
• Gender
• Location
• Age
16. The Problem With Text (same passage)
Feature(s):
• Name
• Gender
• Location
• Age
• Activity
• Prior Affliction/Treatment
• Travel
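To make the difficulty concrete, here is a minimal sketch of traditional automatic feature extraction over this passage, using spaCy's off-the-shelf NER (the model name is an assumption; any pretrained English pipeline would do). It recovers surface features like name and location, but nothing like "prior affliction/treatment":

```python
# Minimal sketch: off-the-shelf NER over the example passage.
# Assumes spaCy and the en_core_web_sm model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("John Malkovitch plays tennis in Winchester. He has been reporting "
        "soreness in his elbow. His 60th birthday is in two weeks.")

for ent in nlp(text).ents:
    print(ent.text, ent.label_)  # e.g. PERSON, GPE, DATE
```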
17. The Problem With Text

Problem: Linguistic Context
• Traditional Solution: stemming, synonym sets, lexicons
• Traditional Problem: brittle, labor-intensive, messy real-world data

Problem: Local Context
• Traditional Solution: parse trees, n-grams, phrase lexicons
• Traditional Problem: inaccurate parsing, limited context, messy real-world data

Problem: Out-of-Vocabulary Issues
• Traditional Solution: lemmatization, expanded vocabulary, ignore
• Traditional Problem: computationally expensive, diminishing returns, messy real-world data
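As a taste of the "brittle" column, here is a minimal sketch comparing two of the traditional solutions above, stemming and lemmatization, with NLTK (assumes the WordNet corpus has been downloaded); note that the stemmer happily produces non-words:

```python
# Sketch: stemming vs. lemmatization with NLTK.
# Assumes nltk is installed and nltk.download("wordnet") has been run.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "soreness", "inflammation"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))
# "studies" stems to the non-word "studi" but lemmatizes to "study"
```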
26. What is an Embedding?
Text Space (e.g. English) → Embedding Method (e.g. Word2Vec, trained on linguistic context such as Wikipedia) → Embedding Space (e.g. R^300)
Each token maps to a dense vector, e.g. [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]
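A minimal sketch of that mapping, using pretrained 300-dimensional GloVe vectors via gensim's downloader (the dataset name comes from gensim-data; the download is large):

```python
# Sketch: text space -> embedding space (R^300) lookup.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-300")  # pretrained on Wikipedia + Gigaword
vec = vectors["tennis"]
print(vec.shape)   # (300,)
print(vec[:4])     # first few coordinates of the dense vector
```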
27. Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits (see the sketch after this list)
• The Curse of Dimensionality
• Effective Benchmarking
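On the second pitfall, a minimal sketch (scikit-learn, toy placeholder data) of what a clean split means in practice: feature extraction is fit on the training fold only, so nothing from the test fold leaks into the features.

```python
# Sketch: avoiding test/train leakage when featurizing text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["great movie", "terrible plot", "loved it", "awful acting"]  # placeholder data
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(X_train)  # fit on train only
test_features = vectorizer.transform(X_test)        # never re-fit on test
```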
28. How do Embeddings Work?
king - man + woman ≈ queen (royalty)
• Meaning is "encoded" into the embedding space
• Individual dimensions are not human-interpretable
• Embedding method learns by examining large corpora of generic language
• Goal is accurate language representation as a proxy for downstream performance
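The analogy above, run as actual vector arithmetic over pretrained GloVe vectors (gensim's most_similar handles the normalization; the exact score varies by embedding set):

```python
# Sketch: king - man + woman ~= queen, with pretrained vectors.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-300")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.67)] with this particular embedding set
```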
31. "Word" Embeddings

Token → Value
"great" → [0.1, 0.3, …]
… → …

Training:
• CBOW: The quick brown fox _____ over the lazy dog (predict the word from its context)
• Skip-gram: ___ ___ ___ ___ jumps ___ ___ ___ ___ (predict the context from the word)

In Practice:
• Word2vec
• GloVe
• fastText
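Both training objectives are a one-flag switch in gensim's Word2Vec (parameter names follow gensim 4.x; the one-sentence corpus is a stand-in for a large generic corpus):

```python
# Sketch: CBOW vs. skip-gram training with gensim.
from gensim.models import Word2Vec

corpus = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

cbow = Word2Vec(corpus, vector_size=300, window=4, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=300, window=4, min_count=1, sg=1)  # skip-gram

print(cbow.wv["jumps"][:4])  # a (mostly untrained) 300-d vector
```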
32. Do They Really Preserve Algorithmic Value?
• Embeddings generally outperform raw text at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is 4-year-old tech. Embeddings have improved drastically. Text has not.
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which Logistic Regression hyperparameters are optimized. Generated using Enso.
[Chart: GloVe Benchmark (Movie Review Sentiment Analysis). Accuracy (0.5 to 0.9) vs. Number of Data Points (50 to 500); series: tf-idf, GloVe.]
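A rough sketch of that benchmark setup (scikit-learn plus gensim; dataset loading and the hyperparameter search are elided, and mean-pooled word vectors stand in for whatever document representation Enso uses):

```python
# Sketch: tf-idf vs. pretrained GloVe features under cross-validated
# logistic regression, as in the benchmark above. texts/labels assumed given.
import numpy as np
import gensim.downloader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vectors = gensim.downloader.load("glove-wiki-gigaword-300")

def glove_features(texts):
    # mean-pool the vectors of in-vocabulary tokens per document
    return np.array([
        np.mean([vectors[t] for t in doc.lower().split() if t in vectors]
                or [np.zeros(300)], axis=0)
        for doc in texts
    ])

def benchmark(texts, labels):
    for name, X in [("tf-idf", TfidfVectorizer().fit_transform(texts)),
                    ("glove", glove_features(texts))]:
        score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
        print(name, round(score, 3))
```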
39. Problems with Small Data
• Add Linguistic Context (Semantics)
• Add Local Context
• Prevent Out-of-Vocabulary Issues
40. The Power of Context
"We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied"
- Improving Language Understanding by Generative Pre-Training*
* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
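Byte-pair encoding builds its subword vocabulary by repeatedly merging the most frequent adjacent symbol pair; here is a minimal sketch of one merge step, in the style of Sennrich et al.'s reference code (the naive string replace is simplified relative to the real implementation):

```python
# Sketch: one BPE merge step. Words start as space-separated characters
# with an end-of-word marker; the most frequent adjacent pair is merged.
from collections import Counter

def pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, vocab):
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
best = pair_counts(vocab).most_common(1)[0][0]   # ('w', 'e') here
vocab = merge(best, vocab)
print(best, vocab)   # 'we' is now a single subword symbol
```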
42. Do They Really Preserve Algorithmic Value?
• Newer transfer learning techniques have made deep learning at low data volumes tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which Logistic Regression hyperparameters are optimized. Generated using Enso.
[Chart: Finetune Benchmark (Movie Review Sentiment Analysis). Accuracy (0.5 to 0.9) vs. Number of Data Points (50 to 500); series: tf-idf, GloVe, Finetune.]
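The "Finetune" series comes from indico's open-source finetune library, which wraps a GPT-style pretrained transformer behind a scikit-learn-like interface; a hedged sketch following its README (treat the exact call signatures as assumptions):

```python
# Sketch: fine-tuning a pretrained language model for classification
# with the finetune library (https://github.com/IndicoDataSolutions/finetune).
from finetune import Classifier

texts = ["one of the best films of the year", "a tedious, joyless slog"]  # toy data
labels = ["positive", "negative"]

model = Classifier()       # loads pretrained base model weights
model.fit(texts, labels)   # fine-tunes on the small labelled set
print(model.predict(["surprisingly delightful"]))
```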