Slides of my talk at EMNLP 2018:
"Explaining Character-Aware Neural Networks for Word-Level Prediction: Do They Discover Linguistic Rules?"
Character-level features are currently used in different neural network-based natural language processing algorithms. However, little is known about the character-level patterns those models learn. Moreover, models are often compared only quantitatively while a qualitative analysis is missing. In this paper, we investigate which character-level patterns neural networks learn and whether those patterns coincide with manually defined word segmentations and annotations. To that end, we extend the contextual decomposition (Murdoch et al., 2018) technique to convolutional neural networks, which allows us to compare convolutional neural networks and bidirectional long short-term memory networks. We evaluate and compare these models for the task of morphological tagging on three morphologically different languages and show that these models implicitly discover understandable linguistic rules.
1. Explaining Character-Aware Neural Networks for Word-Level Prediction: Do They Discover Linguistic Rules?
Fréderic Godin, Kris Demuynck, Joni Dambre, Wesley De Neve and Thomas Demeester
Department of Electronics and Information Systems
Ghent University, Belgium
3. Fréderic Godin - Explaining Character-Aware Neural Networks for Word-Level Prediction
Example: Rule-based tagger for PoS tagging
Brill (1994)’s transformation-based error-driven tagger
Template: Change the most-likely tag X to Y if the last (1, 2, 3, 4) characters of the word are x
Rule: Change the tag common noun to plural common noun if the word has suffix -s
Easily interpretable
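The rule above can be read as a single conditional, which is what makes it easy to interpret. The following is a minimal sketch, not Brill's actual implementation; the function name `apply_suffix_rule` and the tag names `NOUN` and `NOUN_PLURAL` are hypothetical and chosen only for illustration.

```python
def apply_suffix_rule(word, tag, from_tag, to_tag, suffix):
    """Change `tag` to `to_tag` when the word carries `from_tag` and ends in `suffix`."""
    if tag == from_tag and word.endswith(suffix):
        return to_tag
    return tag


# Hypothetical tag names; the slide's rule is "common noun -> plural common noun if suffix -s".
print(apply_suffix_rule("cats", "NOUN", "NOUN", "NOUN_PLURAL", "s"))
print(apply_suffix_rule("cat", "NOUN", "NOUN", "NOUN_PLURAL", "s"))
```

Every decision of such a tagger can be traced back to the one rule that fired.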
4. Interpretability in NLP used to be easy
Rule-based/tree-based models: very transparent, follow the trace
Shallow statistical models (e.g., logistic regression, CRF): essentially weight + feature
5. Current NLP interpretability...
6. Our proposed method
We present contextual decomposition (CD) for CNNs
- Extends CD for LSTMs (Murdoch et al. 2018)
- White box approach to interpretability
We trace back morphological tagging decisions to the character level
- Which characters are important?
- Are the patterns the same as the linguistically known ones?
- How do the CNN and the BiLSTM differ?
8. Contextual decomposition
Idea: every output value can be “decomposed” into
- Relevant contributions originating from the input we are interested in (e.g., some characters)
- Irrelevant contributions originating from all the other inputs (e.g., all the other characters in a word)
[Figure: a CNN maps the word "economicas" to the label plural; the characters of "economicas" are marked as relevant or irrelevant]
9. Contextual decomposition for CNNs
Three main components of CNN
- Convolution
- Activation function
- Max-over-time pooling
Classification layer
[Figure: characters ^ e c o n o m i c a s $ → CNN filters → max over time → FC → Gender = feminine]
10. Contextual decomposition for CNNs: Convolution
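Output of a single convolutional filter at timestep t:
[Equation: the filter output split into a Relevant term (inputs in S) and an Irrelevant term (all other inputs)]
n = filter size
S = indexes of relevant inputs
Wi = i-th column of filter W
Example: ^ e c o n o m i c a s $
Indexes covered by the filter: 8, 9, 10, 11
Relevant: 9. Irrelevant: 8, 10, 11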
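The split can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, since the slide's equation is not part of the transcript: the filter output is taken to be the sum of Wi · x(t+i) over i = 0..n-1 plus a bias b, and each term goes to the relevant part when its input index is in S and to the irrelevant part otherwise. The embeddings, filter weights and bias are made-up toy values, and `decompose_conv` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose_conv(x, W, b, t, S):
    """Split one filter's output at timestep t into (relevant, irrelevant, bias).

    x: list of character embedding vectors
    W: list of n filter columns; W[i] is applied to x[t + i]
    S: set of input indexes treated as relevant
    """
    n = len(W)  # filter size
    beta = sum(dot(W[i], x[t + i]) for i in range(n) if (t + i) in S)
    gamma = sum(dot(W[i], x[t + i]) for i in range(n) if (t + i) not in S)
    return beta, gamma, b


# Toy data (made up): 12 positions for ^ e c o n o m i c a s $, 2-dimensional embeddings.
x = [[k, 1] for k in range(12)]
W = [[1, 0], [0, 1], [1, 1], [2, -1]]
b = 3

# The filter covers indexes 8..11; only index 9 (the character "a") is relevant.
beta, gamma, bias = decompose_conv(x, W, b, t=8, S={9})
z = sum(dot(W[i], x[8 + i]) for i in range(4)) + b  # undecomposed filter output
print(beta, gamma, bias, z)
```

Because the convolution is linear, the three parts always add up to the undecomposed output z.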
11. Contextual decomposition for CNNs: Activation function
Goal: Linearize activation function to be able to split output.
Linearization formula:
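The linearization formula itself is not part of this transcript. As a sketch, assuming the permutation-averaging linearization used for LSTMs by Murdoch et al. (2018), each part of the pre-activation receives its average marginal effect on the activation over all orderings in which the parts can be added. The function name `linearize` and the example values are hypothetical.

```python
from itertools import permutations


def relu(v):
    return max(0.0, v)


def linearize(f, parts):
    """Average each part's marginal effect on f over all orderings of the parts."""
    contrib = [0.0] * len(parts)
    orders = list(permutations(range(len(parts))))
    for order in orders:
        running = 0.0
        for k in order:
            contrib[k] += f(running + parts[k]) - f(running)
            running += parts[k]
    return [c / len(orders) for c in contrib]


# Made-up pre-activation split: relevant part, irrelevant part, bias.
parts = [2.0, -3.0, 0.5]
contributions = linearize(relu, parts)
print(contributions, relu(sum(parts)))
```

For an activation with f(0) = 0, such as ReLU, the per-part contributions sum to the activation of the full input, so the output can again be split into a relevant and an irrelevant share.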
12. Contextual decomposition for CNNs: Max pooling
Max-over-time pooling:
Determine t first and just copy that split:
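A minimal sketch of this step: find the timestep t with the largest total activation, then copy the relevant/irrelevant split at that timestep. The per-timestep values are made up, and the irrelevant parts are assumed to already include the bias share; `decompose_max_pool` is a hypothetical name.

```python
def decompose_max_pool(betas, gammas):
    """Return (t, relevant, irrelevant) at the timestep with the largest total."""
    totals = [b + g for b, g in zip(betas, gammas)]
    t = max(range(len(totals)), key=totals.__getitem__)
    return t, betas[t], gammas[t]


# Made-up relevant/irrelevant activations of one filter over three timesteps.
betas = [0.0, 1.5, 0.2]
gammas = [0.3, 0.5, 2.5]
t, pooled_beta, pooled_gamma = decompose_max_pool(betas, gammas)
print(t, pooled_beta, pooled_gamma)
```

Note that the winning timestep is chosen on the total activation, not on the relevant part alone, so the pooled relevant share can be small even when another timestep had a larger one.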
13. Contextual decomposition of classification layer
Probability of certain class:
We simplify:
Relevant contribution to class j
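The formulas on this slide are not part of the transcript. As a sketch under assumptions, the class probability is taken to be a softmax over a fully connected layer, and the simplified relevant contribution to class j is taken to be the j-th weight row applied to the relevant part only, in the spirit of Murdoch et al. (2018). All weights and vectors below are made-up toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


# Made-up pooled features, split into relevant (beta) and irrelevant (gamma) parts.
W = [[1.0, -1.0], [0.5, 2.0]]
b = [0.0, 0.1]
beta = [0.2, 0.1]
gamma = [1.0, 0.3]

h = [p + q for p, q in zip(beta, gamma)]
probs = softmax([s + bj for s, bj in zip(matvec(W, h), b)])  # probability per class
relevant = matvec(W, beta)  # assumed simplification: pre-softmax score of beta only
print(probs, relevant)
```

Working with the pre-softmax score keeps the contribution additive, which the normalized probability is not.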
15. Task
Morphological tagging: predict morphological labels for a word (gender, tense, singular/plural, ...)
Example word: economicas
For a subset of words, we have manual segmentations and annotations
Example annotation for economicas: lemma=económico, gender=feminine, number=plural
16. Datasets
Universal Dependencies 1.4:
- Finnish, Spanish and Swedish
- Select all unique words and their morphological labels
Manual annotations and segmentations of 300 test set words
17. Architectures: CNN vs BiLSTM
[Figure: two architectures, both reading the characters ^ e c o n o m i c a s $. CNN: CNN filters → max over time → FC → Gender = feminine. BiLSTM: BiLSTM → FC → Gender = feminine]
18. Do the NN patterns follow manual segmentations?
All = every possible combination of characters
Cons = all consecutive character n-grams
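The difference between the two candidate sets can be made concrete by enumerating them for one word. This is a small sketch with hypothetical function names; it only lists the candidate character sets and does not compute their contributions.

```python
from itertools import combinations


def all_sets(length, size):
    """All: every combination of `size` character positions."""
    return list(combinations(range(length), size))


def consecutive_sets(length, size):
    """Cons: only consecutive character n-grams of length `size`."""
    return [tuple(range(s, s + size)) for s in range(length - size + 1)]


word = "^kronor$"
all_pairs = all_sets(len(word), 2)
cons_pairs = consecutive_sets(len(word), 2)
print(len(all_pairs), len(cons_pairs))
```

For this 8-character string there are 28 possible pairs but only 7 consecutive bigrams, so the All setting searches a much larger space than Cons.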
19. Visualizing contributions: 1 character
Spanish
^ g r a t u i t a $
Label: Gender=feminine
20. Visualizing contributions: 2 characters (Swedish)
[Figure: two heatmaps, CNN and BiLSTM, of character-pair contributions for the word ^ k r o n o r $, with its characters on both axes]
Label: number=plural
21. Most important patterns per language: Spanish
Linguistic rules for feminine gender:
- Feminine adjectives often end with “a”
- Nouns ending with “dad” or “ión” are often feminine
Found patterns:
- “a” is a very important pattern
- “dad” and “sió” are important trigrams
22. Most important patterns per language: Swedish
Linguistic rules for plural form:
- 5 suffixes: or, ar, (e)r, n, and no ending
- “na” is the definite article in plural forms
Found patterns:
- “or” and “ar”
- But also “na” and “rn”
23. Interactions/compositions of patterns
How do positive and negative patterns interact?
Consider the Spanish verb “gusta”
- Gender=Not Applicable (NA)
- We know that the suffix “a” is an indicator for gender=feminine
Consider the most positive/negative set of characters per class:
The stem provides counterevidence for gender=feminine
25. Summary
We introduced a white box approach to understanding CNNs
We showed that:
- BiLSTMs and CNNs sometimes choose different patterns
- The learned patterns coincide with our linguistic knowledge
- Sometimes other plausible patterns are used