In recent years, deep learning (DL) has proven to be a transformative force that has made impressive advances in different fields. In fact, within the area of natural language processing (NLP), deep learning has outperformed many former state-of-the-art approaches, for example in machine translation and named entity recognition (NER). In this talk I will present various deep learning algorithms and architectures for NLP, with examples of how they can be applied to real-world applications.
Ana Peleteiro - Tendam
https://dataxday.fr
video available: https://www.youtube.com/watch?v=qpkt1sVHzd0
DataXDay - The wonders of deep learning: how to leverage it for natural language processing
1. THE WONDERS OF DEEP LEARNING: HOW TO LEVERAGE IT FOR NLP
DATAXDAY 2018
Paris
17/05/2018
DR. ANA PELETEIRO RAMALLO
DATA SCIENCE DIRECTOR
@PeleteiroAna
@TendamRetail
@DataXDay
3. DEEP LEARNING FOR NLP
Deep learning is having a transformative impact in many areas where machine learning has been applied.
NLP was somewhat behind other fields in terms of adopting deep learning for applications. However, this has changed over the last few years, thanks to the use of RNNs, specifically LSTMs, as well as word embeddings.
There are distinct areas in which deep learning can be beneficial for NLP tasks, such as named entity recognition, machine translation, language modelling, parsing, chunking and POS tagging, amongst others.
4. WORD EMBEDDINGS
Representing words as IDs: encodings are arbitrary, they carry no information about the relationships between words, and they lead to data sparsity.
https://www.tensorflow.org/tutorials/word2vec
Word embeddings are a better representation for words.
Words live in a continuous vector space where semantically similar words are mapped to nearby points.
The dense embedding vectors are learned from data.
Two training architectures: skip-gram and CBOW.
• CBOW predicts target words from the context. E.g., Tendam ?? Talk
• Skip-gram predicts source context-words from the target words. E.g., ?? conference ??
Computing word embeddings is a standard preprocessing step for NLP.
They are also used as features in downstream approaches (e.g., classification or clustering).
There are several parameters we can experiment with, e.g., the size of the word embedding or the context window, as in the sketch below.
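As a rough illustration, here is a minimal sketch of training such embeddings with gensim's Word2Vec; the toy corpus, dimensions and parameter values are illustrative assumptions, not taken from the talk.

# A minimal sketch of learning word embeddings with gensim's Word2Vec
# (hypothetical toy corpus; parameter names follow gensim >= 4.0).
from gensim.models import Word2Vec

corpus = [
    ["deep", "learning", "for", "nlp"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["semantically", "similar", "words", "end", "up", "close", "together"],
]

# sg=1 selects skip-gram (predict context from target); sg=0 selects CBOW.
# vector_size is the embedding dimension and window the context size,
# the two parameters mentioned above that are worth experimenting with.
model = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

# Nearby points in the vector space should correspond to related words.
print(model.wv.most_similar("words", topn=3))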
5. CHARACTER EMBEDDINGS
Word embeddings are able to capture syntactic and semantic information.
However, for tasks such as POS tagging and NER, this is not enough:
• They do not capture intra-word morphological and shape information, or learn sub-token patterns (suffixes, prefixes, etc.).
• Out-of-vocabulary (OOV) words have no embedding.
• In some languages (e.g., Chinese), text is not composed of separated words but of individual characters.
We can overcome these problems by using character embeddings.
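A minimal tf.keras sketch of building word representations from character embeddings; the character vocabulary size, word length and BiLSTM composition are assumptions for illustration, not the exact model from the talk.

# Character embeddings composed into a word vector with a small BiLSTM.
import tensorflow as tf

NUM_CHARS = 100      # size of the character vocabulary (assumed)
MAX_WORD_LEN = 20    # words padded/truncated to 20 characters
CHAR_EMB_DIM = 25

char_ids = tf.keras.Input(shape=(MAX_WORD_LEN,), dtype="int32")
# Each character id is mapped to a dense vector; OOV words still get a
# representation because they are built from known characters.
char_emb = tf.keras.layers.Embedding(NUM_CHARS, CHAR_EMB_DIM)(char_ids)
# The BiLSTM composes the character vectors into a word-level vector that
# can capture prefixes, suffixes and word shape.
word_vec = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(char_emb)

model = tf.keras.Model(char_ids, word_vec)
model.summary()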
6. CNNs in NLP
CNNs have proven their effectiveness in computer vision tasks.
In NLP, they are able to extract salient n-gram features from the input sentence to create an informative latent semantic representation of the sentence for downstream tasks.
They are used for several tasks, such as sentence classification and summarization.
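A minimal sketch of a CNN sentence classifier in tf.keras in this spirit; the vocabulary size, filter settings and binary-classification head are illustrative assumptions.

# Conv1D filters act as n-gram feature detectors over the word embeddings;
# max-over-time pooling keeps the strongest feature per filter, yielding a
# fixed-size latent representation of the sentence.
import tensorflow as tf

VOCAB_SIZE = 10000
MAX_LEN = 50
EMB_DIM = 100

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
x = tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu")(x)  # 3-gram features
x = tf.keras.layers.GlobalMaxPooling1D()(x)                                   # max-over-time pooling
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)                    # e.g., sentiment

model = tf.keras.Model(tokens, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])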
8. Why not basic Deep Nets or CNNs?
Traditional neural networks and CNNs do not use information from the past; each input is independent.
This is fine for several applications, such as classifying images.
However, other applications, such as video or language modelling, rely on what has happened in the past to predict the future.
Recurrent Neural Networks (RNNs) are capable of conditioning the model on previous units in the corpus.
They can also handle inputs of arbitrary length.
9. RNNs
RNNs make use of sequential information.
The output depends on the previous information.
An RNN shares the same parameters W at each step, so there are fewer parameters to learn.
http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
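To make the parameter sharing concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass; the sizes are toy values chosen for illustration.

# The same W_xh, W_hh and W_hy are reused at every time step, and the hidden
# state h carries information from all previous inputs.
import numpy as np

np.random.seed(0)
H, D, V = 8, 5, 10                 # hidden size, input size, output size (assumed)
W_xh = np.random.randn(H, D) * 0.1
W_hh = np.random.randn(H, H) * 0.1
W_hy = np.random.randn(V, H) * 0.1

def rnn_forward(inputs):
    """inputs: a list of input vectors, one per time step (arbitrary length)."""
    h = np.zeros(H)
    outputs = []
    for x in inputs:               # same weights applied at every time step
        h = np.tanh(W_xh @ x + W_hh @ h)
        outputs.append(W_hy @ h)   # each output depends on all previous inputs via h
    return outputs, h

outputs, h = rnn_forward([np.random.randn(D) for _ in range(7)])
print(len(outputs), h.shape)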
11. RNNs (II)
In theory, RNNs are absolutely capable of handling such long-term dependencies. In practice, things are "a bit" different.
1. Parameters are shared by all time steps in the network, so the gradient at each output depends not only on the calculations of the current time step, but also on those of previous time steps.
2. Exploding gradients: easier to spot. Remedy: clip the gradient to a maximum.
3. Vanishing gradients: harder to identify. Remedies: use ReLUs instead of sigmoid, and initialize the recurrent matrix to the identity matrix.
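A minimal tf.keras sketch of these remedies, under assumed toy sizes: gradient clipping against exploding gradients, and ReLU activations plus an identity-initialized recurrent matrix to ease vanishing gradients.

# Remedies for exploding/vanishing gradients in a simple recurrent model.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.SimpleRNN(
        128,
        activation="relu",                    # ReLU instead of sigmoid/tanh
        recurrent_initializer="identity",     # initialize recurrent matrix to identity
    ),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# clipnorm caps the gradient norm: the standard fix for exploding gradients.
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
model.compile(optimizer=optimizer, loss="binary_crossentropy")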
12. The oversized mannish coats looked positively edible over the bun-skimming dresses when combined with novelty knitwear such as punk-like fisherman's sweaters. As another look, the ballet pink Elizabeth and James jacket provides a cozy cocoon for the 20-year-old to top off her ensemble of a T-shirt and Parker Smith jeans. But I have to admit that my favorite is the bun-skimming dresses with the ??
• In theory, RNNs are capable of handling such long-term dependencies.
• However, in practice, they cannot.
• LSTMs and GRUs avoid the long-term dependency problem.
• They remove or add information to the cell state, carefully regulated by structures called gates.
• Gates are a way to optionally let information through.
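A minimal tf.keras sketch of swapping a vanilla RNN for gated units; the layer sizes and the prediction head are illustrative assumptions. The LSTM and GRU layers implement the gates internally.

import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 64))      # sequences of 64-dim vectors, any length
lstm_out = tf.keras.layers.LSTM(128)(inputs)   # gated cell state keeps long-range information
# A GRU is a lighter alternative with fewer gates:
# gru_out = tf.keras.layers.GRU(128)(inputs)
prediction = tf.keras.layers.Dense(1, activation="sigmoid")(lstm_out)

model = tf.keras.Model(inputs, prediction)
model.summary()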
17. APPLICATIONS
Word-level classification: NER
Sentence classification: tweet sentiment polarity, semantic matching between texts
Text classification
Language modelling
Speech recognition
Caption generation
Machine translation
Document summarization
Question answering
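As one concrete instance, a minimal tf.keras sketch of word-level classification (NER-style tagging) with a bidirectional LSTM; the vocabulary and tag-set sizes are illustrative assumptions.

# One tag is predicted per token; mask_zero lets padded positions be ignored.
import tensorflow as tf

VOCAB_SIZE = 20000
NUM_TAGS = 9          # e.g., BIO tags for PER/LOC/ORG/MISC plus O (assumed)

tokens = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, 100, mask_zero=True)(tokens)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(x)
tags = tf.keras.layers.Dense(NUM_TAGS, activation="softmax")(x)

model = tf.keras.Model(tokens, tags)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")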
18. EX1: TEXT GENERATION
All text from Shakespeare (4.4MB)
3-layer RNN with 512 hidden nodes on each layer.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
https://github.com/martin-gorner/tensorflow-rnn-shakespeare
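A minimal tf.keras sketch in the spirit of the referenced char-rnn setup (three recurrent layers of 512 units, here LSTMs); the character vocabulary size is assumed, and corpus loading, training and the sampling loop are omitted.

# Character-level language model: predict the next character at every position.
import tensorflow as tf

VOCAB = 65            # number of distinct characters in the corpus (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 64),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dense(VOCAB, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training pairs are (chars[t:t+L], chars[t+1:t+L+1]); sampling repeatedly
# from the softmax output generates new Shakespeare-like text.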
19. Q&A
Pedro del Hierro
SS18
How can I help you today?
I was wondering what is trending this spring.
This spring is all about new wave slip, for example in jumpsuits.
Is that appropriate for a work dinner?
Yes, it totally works! I would recommend this chili oil jumpsuit. You can combine it with a dark brown belt and cherry tomato heels. All from Pedro del Hierro.
That sounds great!