QANet: Towards Efficient and Human-Level Reading Comprehension on SQuAD
Adams Wei Yu
Deview 2018, Seoul
Collaborators: Quoc Le, Thang Luong, Rui Zhao, Mohammad Norouzi, Kai Chen, David Dohan
Bio
Adams Wei Yu
● Ph.D. candidate @ MLD, CMU
○ Advisors: Jaime Carbonell, Alex Smola
○ Large-scale optimization
○ Machine reading comprehension
Question Answering
Concrete answer vs. no clear answer
Early Success
http://www.aaai.org/Magazine/Watson/watson.php
Watson: complex multi-stage system
Moving towards end-to-end systems
● Translation
● Question Answering
Lots of Datasets Available
TriviaQA
NarrativeQA
MS MARCO
Stanford Question Answering Dataset (SQuAD)
Passage: In education, teachers facilitate student learning, often in a school or academy or perhaps in another environment such as outdoors. A teacher who teaches on an individual basis may be described as a tutor.
Question: What is the role of teachers in education?
Ground truth: facilitate student learning
Prediction 1: facilitate student learning → EM = 1, F1 = 1
Prediction 2: student learning → EM = 0, F1 = 0.8
Prediction 3: teachers facilitate student learning → EM = 0, F1 = 0.86
Data: crowdsourced 100k question-answer pairs on 500 Wikipedia articles.
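For reference, a minimal token-level F1 sketch that reproduces the numbers above, assuming plain whitespace tokenization (the official SQuAD script additionally lowercases and strips punctuation and articles):

```python
def f1_score(prediction: str, groundtruth: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred, gold = prediction.split(), groundtruth.split()
    overlap = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score("facilitate student learning", "facilitate student learning"))  # 1.0
print(f1_score("student learning", "facilitate student learning"))             # 0.8
print(f1_score("teachers facilitate student learning",
               "facilitate student learning"))                                 # ~0.857
```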
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Bag of words
That movie was awful. → tokenized: That movie was awful .
embed each token → sum → h_out
Continuous bag-of-words and skip-gram architectures (Mikolov et al., 2013a; 2013b)
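A minimal PyTorch sketch of such a bag-of-words encoder (vocabulary size, token ids, and dimensions are illustrative):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=10_000, embedding_dim=128)

tokens = torch.tensor([4, 17, 9, 256, 3])   # ids for "That movie was awful ."
h_out = embed(tokens).sum(dim=0)            # (128,) - word order is discarded
```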
Bag of N-grams
That movie was awful .
embed each token → conv over windows → sum → h_out
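The same idea with a convolution in front, sketched below: each conv output mixes a small window of words, so the sum becomes a bag of n-grams rather than a bag of words (sizes again illustrative):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10_000, 128)
conv = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, padding=1)

tokens = torch.tensor([[4, 17, 9, 256, 3]])    # (batch=1, seq_len=5)
x = embed(tokens).transpose(1, 2)              # (1, 128, 5): channels first for Conv1d
ngrams = conv(x)                               # each position now encodes a 3-gram
h_out = ngrams.sum(dim=2)                      # (1, 128): order beyond 3-grams discarded
```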
Recurrent Neural Networks
That movie was awful .
embed each token → f(x,h) applied left to right from h_init → h_out
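A sketch of the recurrence: the same function f(x, h) is applied at every position, threading a hidden state from h_init to h_out (a GRU is chosen arbitrarily here):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10_000, 128)
rnn = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

tokens = torch.tensor([[4, 17, 9, 256, 3]])
states, h_out = rnn(embed(tokens))   # h_out: (1, 1, 128), depends on every token in order
```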
The quick brown fox jumped over the lazy doo
The quick brown fox jumped over the lazy dog
A feed-forward neural network language model (Bengio et al., 2001; 2003)
Language Models
<s> The quick brown fox
embed each token → f(x,h) from h_init → project to vocabulary
The quick brown fox jumped
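A sketch of the full loop: embed the prefix, run the recurrence, project back to the vocabulary to score the next word (sizes and ids illustrative):

```python
import torch
import torch.nn as nn

vocab, dim = 10_000, 128
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
project = nn.Linear(dim, vocab)

prefix = torch.tensor([[1, 34, 56, 78, 90]])   # ids for "<s> The quick brown fox"
states, _ = rnn(embed(prefix))
logits = project(states)                       # (1, 5, vocab): next-word scores per position
next_id = logits[0, -1].argmax()               # prediction after "fox" -> ideally "jumped"
```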
Language Models → Seq2Seq
Yes please <de> ja bitte
embed each token → f(x,h) from h_init → project
Ja bitte </s>
Seq2Seq + Attention
Yes please <s> ja bitte
embed each token → f(x,h) from h_init
Ja bitte </s>
Encoder → h_init → Decoder: how can the decoder look back at the encoder?
Figures: https://distill.pub/2016/augmented-rnns/#
Attention: a weighted average
The cat stuck out its tongue and licked its owner
Convolution: different linear transformations by relative position.
The cat stuck out its tongue and licked its owner
Multi-head Attention
Parallel attention layers, each with different linear transformations on input and output.
The cat stuck out its tongue and licked its owner
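PyTorch ships this as a module; a self-attention sketch over the example sentence (dimensions illustrative):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 128)     # embeddings of "The cat stuck out its tongue and licked its owner"
out, weights = attn(x, x, x)    # queries = keys = values = x (self-attention)
# weights[0, i, j]: how much position i attends to position j (averaged over heads)
```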
Seq2Seq + Attention
Yes please <s> Ja
embed each token → f(x,h) from h_init
Ja bitte ... </s>
Encoder → h_init → Decoder, with attention weights w1, w2 over the encoder states
Language Models with attention
<s> The quick brown fox
embed each token → f(x,h) → project
The quick brown fox jumped
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
General (Doc, Question) → Answer Model
A general framework for neural QA systems
Bi-directional Attention Flow (BiDAF)
[Seo et al., ICLR’17]
Base Model (BiDAF)
Similar general architectures:
● R-Net [Wang et al., ACL'17]
● DCN [Xiong et al., ICLR'17]
Base Model (BiDAF): RNNs at every layer.
Two challenges with RNNs remain...
First challenge: hard to capture long-range dependencies.
Being a long-time fan of Japanese film, I expected more than this. I can't really be
bothered to write too much, as this movie is just so poor. The story might be the cutest
romantic little something ever, pity I couldn't stand the awful acting, the mess they called
pacing, and the standard "quirky" Japanese story. If you've noticed how many Japanese
movies use characters, plots and twists that seem too "different", forcedly so, then steer
clear of this movie. Seriously, a 12-year old could have told you how this movie was
going to move along, and that's not a good thing in my book. Fans of "Beat" Takeshi: his
part in this movie is not really more than a cameo, and unless you're a rabid fan, you
don't need to suffer through this waste of film.
Second challenge: hard to compute in parallel
Strictly Sequential!
What do RNNs Capture?
1. Local context
2. Global interaction
3. Temporal info
Substitution?
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Convolution: Capturing Local Context
[Figure: a filter of width k = 2 slides over the d = 3-dimensional word vectors of "The weather is nice today", with zero padding at the edge, producing one output per position; a second filter with k = 3 runs alongside.]
k-gram features
Fully parallel!
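An explicit view of one such filter, matching the k = 2, d = 3 setting of the figure (random numbers stand in for the values shown on the slide):

```python
import torch
import torch.nn.functional as F

d, k, n = 3, 2, 5                    # dim, filter width, sequence length
X = torch.rand(d, n)                 # word vectors for "The weather is nice today"
W = torch.rand(k * d)                # one filter spanning a window of k words
X_pad = F.pad(X, (k - 1, 0))         # zero-pad on the left so output length stays n
feats = torch.stack([W @ X_pad[:, t:t + k].T.reshape(-1) for t in range(n)])
# Every window is independent of the others, so all n outputs parallelize trivially.
```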
How about Global Interaction?
Stacking convolutions over "The weather is nice today" (layer 1 → layer 2 → layer 3):
1. May need O(log_k N) layers
2. Interaction may become weaker
N: sequence length. k: filter size.
Self-Attention [Vaswani et al., NIPS'17]
Each output is a weighted average over all inputs. For "The" in "The weather is nice today":
h_The = w1·x_The + w2·x_weather + w3·x_is + w4·x_nice + w5·x_today
(w1, ..., w5) = softmax(similarity of "The" with each word)
Self-attention is fully parallel & all-to-all!
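The whole mechanism in a few lines (scaled dot-product similarities assumed, as in Vaswani et al.):

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 128)                 # vectors for "The weather is nice today"
scores = x @ x.T / x.shape[-1] ** 0.5   # (5, 5) pairwise similarities
w = F.softmax(scores, dim=-1)           # each row: weights summing to 1
out = w @ x                             # out[i] = sum_j w[i, j] * x[j]
# One matrix product connects every pair of positions: parallel and all-to-all.
```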
Complexity

            Per Unit    Total Per Layer    Sequential Ops (Path Memory)
Self-Attn   O(Nd)       O(N²d)             O(1)
Conv        O(kd²)      O(kNd²)            O(1)
RNN         O(d²)       O(Nd²)             O(N)

N: sequence length. d: dimension (N > d). k: filter size.
Explicitly Encode Temporal Info
RNN: implicitly encodes position (1 2 3 4 5) by processing left to right.
Position embedding: explicitly encodes position by adding (+) a position vector to each input.
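A sketch of the fixed sinusoidal encoding, assuming the variant from Vaswani et al. (QANet follows the same recipe), added directly to the word embeddings:

```python
import torch

def position_embedding(seq_len: int, dim: int) -> torch.Tensor:
    pos = torch.arange(seq_len).float().unsqueeze(1)                 # (seq_len, 1)
    freq = 1.0 / 10_000 ** (torch.arange(0, dim, 2).float() / dim)   # one rate per pair
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dimensions
    return pe

x = torch.randn(5, 128)                # word embeddings, no notion of order
x = x + position_embedding(5, 128)     # now position t is explicitly encoded
```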
QANet Encoder [Yu et al., ICLR'18]
Position Emb
→ Layer Norm → Convolution → +
→ Layer Norm → Self Attention → +
→ Layer Norm → Feedforward → +
(each + is a residual connection)
Repeat the block if you want to go deeper.
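A simplified sketch of one block. The paper stacks several depthwise-separable convolutions per block; a single plain convolution is used here for brevity, and position embeddings (see the sketch above) are added to the input before the first block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QANetEncoderBlock(nn.Module):
    def __init__(self, dim=128, kernel=7, heads=8):
        super().__init__()
        self.norm_c = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm_a = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, seq, dim)
        y = self.norm_c(x)                                    # layer norm -> convolution
        x = x + self.conv(y.transpose(1, 2)).transpose(1, 2)  # residual +
        y = self.norm_a(x)                                    # layer norm -> self-attention
        x = x + self.attn(y, y, y)[0]                         # residual +
        return x + F.relu(self.ff(self.norm_f(x)))            # layer norm -> feedforward +

block = QANetEncoderBlock()
h = block(torch.randn(1, 20, 128))    # stack more blocks "if you want to go deeper"
```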
Base Model (BiDAF) → QANet: replace each RNN with a QANet Encoder.
QANet – 130+ layers (Deepest NLP NN)
QANet – First QA system with No Recurrence
● Very fast!
○ Training: 3x - 13x
○ Inference: 4x - 9x
QANet – 130+ layers (Deepest NLP NN)
● Layer normalization
● Residual connections
● L2 regularization
● Stochastic depth
● Squeeze-and-Excitation
● ...
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Data augmentation: popular in vision & speech
More data with NMT back-translation
Input: Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.
Translation (English → French): Autrefois, le thé avait été utilisé surtout pour les moines bouddhistes pour rester éveillé pendant la méditation.
Paraphrase (French → English): In the past, tea was used mostly for Buddhist monks to stay awake during the meditation.
More data with NMT back-translation
Input → (English → French) → Translation → (French → English) → Paraphrase
Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.
→ In the past, tea was used mostly for Buddhist monks to stay awake during the meditation.
● More data: keep both (Input, label) and (Paraphrase, label)
Applicable to virtually any NLP task!
QANet augmentation
Same Input → Translation → Paraphrase pipeline, using 2 language pairs (English-French, English-German): 3x data.
Improvement: +1.1 F1
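A pipeline sketch. Here `translate` stands for any NMT system (a hypothetical callable, not a real API), and note that the real QANet pipeline also re-locates the answer span inside the paraphrased passage, which is omitted here:

```python
def back_translate(text, translate, pivot):
    """Paraphrase text by translating to a pivot language and back."""
    return translate(translate(text, src="en", tgt=pivot), src=pivot, tgt="en")

def augment(dataset, translate, pivots=("fr", "de")):
    """dataset: iterable of (passage, question, answer) triples."""
    augmented = list(dataset)                  # keep the originals
    for passage, question, answer in dataset:
        for p in pivots:
            augmented.append((back_translate(passage, translate, p), question, answer))
    return augmented                           # two pivots -> roughly 3x data
```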
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Transfer learning for richer representation
Language Models (recap)
<s> The quick brown fox
embed each token → f(x,h) from h_init → project to vocabulary
The quick brown fox jumped
(Slide credit: Sebastian Ruder @ Indaba 2018)
Transfer learning for richer representation
● Pretrained language model (ELMo, [Peters et al., NAACL'18]): +4.0 F1
● Pretrained machine translation model (CoVe, [McCann et al., NIPS'17]): +0.3 F1
QANet – 3 key ideas
● Deep architecture without RNNs
○ 130-layer (deepest in NLP)
● Transfer learning
○ Leverage unlabeled data
● Data augmentation
○ With back-translation
#1 on SQuAD (Mar-Aug 2018)
QA is not Solved!!
Thank you!