What Do Neural Models "Know" About Natural Language?
Ekaterina Vylomova
1943: Artificial Neuron (McCulloch-Pitts)
... or, in other words, $\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$,
and the activation function might be the sigmoid: $\mathrm{sig}(x) = \frac{1}{1 + e^{-x}}$
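To make the formula concrete, here is a minimal sketch of such a neuron in Python (NumPy); the weights, bias, and inputs are illustrative values only:

```python
import numpy as np

def sigmoid(x):
    # sig(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, b, f=sigmoid):
    # y_hat = f(sum_i w_i * x_i + b)
    return f(np.dot(w, x) + b)

# Illustrative values only.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.3, 0.8])
print(neuron(x, w, b=-0.2))  # ~0.75
```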
1957: Simple Perceptron
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Trained with a trial-and-error method
It can:
– generalize over characters
– discover character-specific features
But:
– failed to recognize badly written, differently sized, or partially occluded characters
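Rosenblatt's trial-and-error training amounts to the perceptron error-correction rule; the toy sketch below learns a logical AND gate (the data and learning rate are illustrative assumptions, not the original character-recognition setup):

```python
import numpy as np

# Toy dataset: logical AND (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
for epoch in range(10):
    for xi, target in zip(X, y):
        pred = int(np.dot(w, xi) + b > 0)   # step activation
        # Error-correction update: adjust weights only on mistakes.
        w += (target - pred) * xi
        b += (target - pred)

print([int(np.dot(w, xi) + b > 0) for xi in X])  # [0, 0, 0, 1]
```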
1960s: Single Layer Perceptron
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Perceptrons: An Introduction to Computational Geometry (Minsky & Papert)
XOR Problem: a single-layer perceptron cannot represent the XOR function
1980s: Multi-Layer Perceptrons with Back-Propagation
Learning Internal Representations by Error Propagation
Solving problems with data that is not linearly separable
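A small illustration of why the hidden layer matters: a two-layer network trained with backpropagation can fit XOR, which no single-layer perceptron can. A toy sketch with made-up hyperparameters (with this seed and enough steps it typically converges):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

sigmoid = lambda z: 1 / (1 + np.exp(-z))
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)          # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)          # output layer

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (squared error, chain rule).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out;  b2 -= d_out.sum(0)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```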
1980s: The Past Tense Debate
Rumelhart & McClelland (1985): On learning the past tenses of English verbs
Pinker & Prince (1988): Extremely poor empirical performance!
1990s: RNNs
Finding structure in time
Exploring
– context-dependent learning
– structure in letter sequences
– learning lexical classes from word order
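The core of Elman's network is a hidden state fed back as "context units" at the next step; a minimal sketch of one recurrent step (dimensions are illustrative):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h): the previous hidden
    # state plays the role of Elman's context units.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative dimensions: 5-dim inputs, 3-dim hidden state.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 5)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(4, 5)):   # a toy sequence of 4 inputs
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h)
```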
1990s: CNNs
Backpropagation Applied to Handwritten Zip Code Recognition
Training data: 9,298 segmented numerals from U.S. mail
Misclassified: training – 0.14%; test – 5.0%
Meanwhile in NLP: Language Modelling (mostly n-grams with Kneser–Ney smoothing)
OK, Marvin, which word comes next: Two cats are ___
Hmmm, let me guess ...
sitting 3.01 × 10⁻⁴
play 2.87 × 10⁻⁴
running 2.53 × 10⁻⁴
nice 2.32 × 10⁻⁴
lost 1.97 × 10⁻⁴
playing 1.66 × 10⁻⁴
sat 1.54 × 10⁻⁴
plays 1.32 × 10⁻⁴
...
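A simplified sketch of interpolated Kneser–Ney for bigrams (toy corpus and discount; real systems use higher orders and handle unseen histories more carefully):

```python
from collections import Counter

def kn_bigram_lm(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams (simplified sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigram = Counter(tokens[:-1])                   # history counts
    followers = Counter(u for (u, w) in bigrams)     # |{w : c(u,w) > 0}|
    histories = Counter(w for (u, w) in bigrams)     # |{u : c(u,w) > 0}|
    n_bigram_types = len(bigrams)

    def prob(u, w):
        # Continuation probability: how many contexts does w complete?
        p_cont = histories[w] / n_bigram_types
        if unigram[u] == 0:
            return p_cont                            # back off entirely
        discounted = max(bigrams[(u, w)] - d, 0) / unigram[u]
        lam = d * followers[u] / unigram[u]          # leftover mass
        return discounted + lam * p_cont

    return prob

corpus = "two cats are sitting two cats are playing one cat is sitting".split()
p = kn_bigram_lm(corpus)
print(p("are", "sitting"), p("are", "playing"))
```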
2013: Word2Vec Skip-Gram
Distributed Representations of Words and Phrases and their Compositionality
Training Objective
$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$
$p(w_O \mid w_I) = \frac{\exp(v_{w_O}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp(v_{w}^{\top} v_{w_I})}$
For efficiency, the softmax was replaced with Negative Sampling.
Levy et al., 2015 experimented with a positive pointwise mutual information (PMI) matrix and showed that Word2Vec Skip-Gram with negative sampling is an implicit matrix factorization.
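A minimal sketch of one skip-gram update with negative sampling (toy vocabulary; the dimensionality, learning rate, and sampled negatives are illustrative):

```python
import numpy as np

def sgns_step(center, context, negatives, W_in, W_out, lr=0.025):
    """One skip-gram negative-sampling update (simplified sketch).

    W_in[center] is v_{w_I}; rows of W_out are the output vectors.
    Maximizes log sig(v_ctx . v_in) + sum_neg log sig(-v_neg . v_in).
    """
    sig = lambda z: 1 / (1 + np.exp(-z))
    v_in = W_in[center]
    grad_in = np.zeros_like(v_in)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sig(W_out[w] @ v_in)
        g = score - label            # gradient of the logistic loss
        grad_in += g * W_out[w]
        W_out[w] -= lr * g * v_in
    W_in[center] -= lr * grad_in

# Toy usage: vocabulary of 10 words, 8-dim embeddings.
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(0, 0.1, (10, 8)), rng.normal(0, 0.1, (10, 8))
sgns_step(3, 7, [1, 5], W_in, W_out)
```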
2013: Word2Vec CBOW
Efficient Estimation of Word Representations in Vector Space
Training Objective
$\frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid w_{[t-c,\, t+c]})$
$p(w_O \mid w_I) = \frac{\exp\left(v_{w_O}^{\top} \sum_{-c \leq j \leq c,\, j \neq 0} v_{w_{I+j}}\right)}{\sum_{w=1}^{W} \exp\left(v_{w}^{\top} \sum_{-c \leq j \leq c,\, j \neq 0} v_{w_{I+j}}\right)}$
2013: Word2Vec
Linear Relations and Compositionality
2013: Word2Vec: Word Analogies
Linear Relations and Compositionality: Russia + river = Volga_river
2013: Word2Vec: Word Analogies
Linear Relations and Compositionality: king − man + woman = queen?
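The analogy test itself is a nearest-neighbour search over vector offsets (the 3CosAdd formulation); a sketch with toy random vectors, since real use would load pre-trained embeddings:

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """b - a + c, nearest by cosine (3CosAdd); the query words
    themselves are excluded, as in the original evaluation."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    sims = {w: (v / np.linalg.norm(v)) @ target
            for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# Toy vectors for illustration only; real use would load pre-trained
# embeddings (e.g., via gensim's KeyedVectors).
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen"]}
print(analogy("man", "king", "woman", vecs))  # "queen", with real vectors
```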
Word Analogies on other embeddings
Word Embeddings, Analogies, and Machine Learning: Beyond King − Man + Woman = Queen
Pre-trained Word2Vec (Google News): Bias and Stereotypes
Man is to Computer Programmer as Woman is to Homemaker?
Word2Vec trained on Reddit data: Bias and Stereotypes
Black is to Criminal as Caucasian is to Police
Data Bias and Stereotypes
Gendered Language
Positive adjectives describing women are often related to their bodies, while positive adjectives describing men are often related to their behavior.
Word2Vec and similar models
What do the models learn?
Morphology
– Capable of learning inflections, but not so much derivations (which are less regular and compositional)
Lexical Semantics
– Challenging, especially meronyms, antonyms, synonyms
Major Difficulties
– Polysemy (all word senses in a single vector)
– Negation
Broader context – back to RNNs!
Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
The resulting LSTM has 384M parameters; 64M are pure recurrent connections
BUT: Longer contexts – lower quality (vanishing gradient)
Long Short-Term Memory will solve it!
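For reference, the standard LSTM cell (Hochreiter & Schmidhuber, 1997) replaces the plain recurrent update with gated ones; the additive cell-state update is what mitigates vanishing gradients:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$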
Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
PCA projection of the LSTM hidden states of the corresponding sequences
We can also use both directions (to encode source language)
Neural Machine Translation: Seq2Seq Models w/Attention (Bahdanau et al., 2014)
A whole sentence shouldn't be compressed into a single vector! Use Attention!
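A minimal sketch of Bahdanau-style additive attention (toy dimensions): the decoder scores each encoder state against its previous state and takes a weighted sum:

```python
import numpy as np

def additive_attention(s_prev, enc_states, W_a, U_a, v_a):
    """Bahdanau-style attention (simplified sketch).

    s_prev: previous decoder state (d,); enc_states: (T, d) encoder states.
    e_j = v_a . tanh(W_a s_prev + U_a h_j); weights via softmax.
    """
    scores = np.tanh(s_prev @ W_a + enc_states @ U_a) @ v_a   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax
    context = weights @ enc_states                            # weighted sum
    return context, weights

# Toy shapes: 4 encoder states of dimension 6, attention dimension 5.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 6))
W_a, U_a, v_a = rng.normal(size=(6, 5)), rng.normal(size=(6, 5)), rng.normal(size=5)
ctx, w = additive_attention(rng.normal(size=6), enc, W_a, U_a, v_a)
print(w)   # attention weights over the 4 source positions; they sum to 1
```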
Neural Machine Translation: Seq2Seq Models w/Attention (Bahdanau et al., 2014)
It learns alignment, and it can be visualized!
Neural Machine Translation: Seq2Seq Models w/Attention (Bahdanau et al., 2014)
What do the models learn?
Belinkov et al., 2018a, 2018b
– Higher layers are better at learning semantics, while lower layers tend to be better for part-of-speech tagging
– Lower layers of the neural network are better at capturing morphology
Linzen et al., 2018, 2020
English subject–verb agreement:
– LSTMs were able to learn the verb-number agreement task in most cases, although their error rate increased on particularly difficult sentences
– the LM objective is not by itself sufficient for learning structure-sensitive dependencies; a joint training objective is suggested
Neural Machine Translation: Seq2Seq Models w/Attention (Bahdanau et al., 2014)
What do the models learn?
Vylomova et al., 2019
– Contextual inflection in 10 languages: Three little kittens were _sit_ on the mat. Predict: sitting
– Agreement: adjective–noun is fine, subject–verb is more challenging
– Morphological complexity matters (Uralic languages are more challenging than Germanic)
– Inherent vs. contextual categories: inherent ones (e.g., tense, noun number), which lack agreement or any other contextual signal, cannot be predicted
Back to the Past Tense Debate: Seq2Seq Models w/Attention
Kirov & Cotterell, 2018: The model obviates most of Pinker and Prince's criticisms
SIGMORPHON 2016 Shared Task
Task 1: run + V;PRES;3SG → runs
On Arabic, Finnish, Georgian, German, Hungarian, Maltese, Navajo, Russian, Spanish
Lake et al., 2018: Compositionality of RNNs
Simplified version of the CommAI Navigation tasks
Zero-shot generalization succeeds when the differences between training and test commands are small
Trained on "run", "jump", and "run twice", the model fails on "jump twice"
Contextualized Embeddings: Addressing the Polysemy Problem!
Context matters! ELMo: Let's make context-specific embeddings!
Features
– Two independent(!) LSTMs (a forward and a backward language model)
– Pre-trained embeddings
– Task-specific weighted sum of embeddings (two hidden states + word vector)
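The task-specific combination can be written compactly (following the ELMo paper: the $s_j$ are softmax-normalized layer weights, $\gamma$ is a scalar, and layer $j=0$ is the context-independent word vector):

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$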
Self-Attention (Cheng et al., 2016)
Relate parts of a single sequence to compute its representation
Shows similarity to other parts!
Helpful for coreference resolution!
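A minimal single-head, scaled dot-product variant (the form later used in the Transformer, not Cheng et al.'s exact LSTM formulation) makes the idea concrete; shapes are illustrative:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (sketch).

    Each position attends to every position of the same sequence:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_v)

# Toy sequence: 5 tokens with 8-dim representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```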
Contextualized Embeddings
Transformer: Attention Is All You Need
Features
– No recurrence, but a wide context window (somewhat similar to CNNs)
– Positional embeddings (to access token positions)
– Self-attention with several heads (matrices) and separate key, query, and value projections (with masking)
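As an aside, the original Transformer used fixed sinusoidal positional encodings (learned positional embeddings are an alternative); a minimal sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need':
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to token embeddings so the model can access token positions.
print(positional_encoding(max_len=512, d_model=64).shape)  # (512, 64)
```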
Contextualized Embeddings
BERT: Deep Bidirectional Transformers
Features
– Trained on masked token prediction + next-sentence prediction (binary)
– WordPiece (BPE-style) tokenization
– Context window: 512 tokens; the special CLS token is used for classification
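To see masked-token prediction in action, a quick sketch assuming the Hugging Face transformers package is installed (the model name is the standard pre-trained checkpoint):

```python
from transformers import pipeline

# Fill-mask pipeline: BERT predicts the [MASK] token from both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Two cats are [MASK] on the mat."):
    print(pred["token_str"], round(pred["score"], 3))
```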
Contextualized Embeddings
BERT: Deep Bidirectional Transformers
Contextualized Embeddings
BERTs
BERT-Base: L=12, H=768, A=12, total parameters = 110M
BERT-Large: L=24, H=1024, A=16, total parameters = 340M
Contextualized Embeddings: BERT
Contextualized Embeddings: Word Sense Disambiguation
Word Sense Disambiguation
"A mouse consists of an object held in one’s hand, with one or more buttons."
"Mouse" – an electronic device
Contextualized Embeddings: Coreference Resolution
Coreference resolution task
The secretary called the physician and told _him_ about a new patient.
him → physician
Contextualized Embeddings: Coreference Resolution
Gender Bias in Coreference Resolution
WinoBias: Winograd-schema-style sentences with entities corresponding to people referred to by their occupation
Contextualized Embeddings: Bias, bias, bias
Zhao et al., 2019
– a SOTA coreference system that depends on ELMo inherits its bias and demonstrates significant bias on WinoBias
– the training data for ELMo contains significantly more male than female entities
– the trained ELMo embeddings systematically encode gender information
– ELMo unequally encodes gender information about male and female entities
Contextualized Embeddings: What does BERT know (Rogers et al., 2020)?
Syntax
– Representations are hierarchical rather than linear and encode POS and syntactic roles (Liu et al., 2019a,b)
– Does not "understand" negation and is insensitive to malformed input (Ettinger, 2019)
Semantics
– Has some knowledge of semantic roles (Ettinger, 2019)
– Struggles with representations of numbers (floating point; Wallace et al., 2019b)
World Knowledge
– Cannot reason over its world knowledge ("A dog entered the room" does not yield that "the room is larger than the dog")
Extra resources
NLP Progress
Hugging Face – Models
"Embeddings in Natural Language Processing" book
"Dive into Deep Learning" interactive book
Thank you! Questions?