BERT: Bidirectional Encoder Representations from Transformers


BERT was developed by Google AI Language and released in October 2018. At release it achieved state-of-the-art results on many NLP tasks, so if you are interested in NLP, studying BERT is a good place to start.

Liangqun Lu
MS in CS and PhD in Biology
2019-02-25
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805 [cs.CL]. http://arxiv.org/abs/1810.04805.

Related Previous Work
● Attention: Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2014)
● Transformer: Attention Is All You Need (Vaswani et al. 2017)
● ELMo: Deep Contextualized Word Representations (Peters et al. 2018)
● GPT: Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
[Figure: prior work leading to BERT. Labels: Seq2seq, NMT, Attention, Transformer, BERT; Word2Vec, GloVe, ELMo, GPT]

Sequence to sequence neural network
● Many NLP tasks can be phrased as sequence-to-sequence:
○ Language translation (input → output)
○ Summarization (long text → short text)
○ Dialogue (previous utterances → next utterance)
○ Parsing (input text → output parse as sequence)
○ Code generation (natural language → Python code)
[Figure: Input → Encoder → Decoder → Output]

NMT: Neural machine translation
● Two RNNs are involved: an encoder, which reads the source sentence, and a decoder, which generates the translation
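As a rough illustration of the encoder-decoder idea, the sketch below runs two untrained vanilla RNNs in plain Python. Everything in it (the function names, the tanh cell, the sizes, the random weights, greedy decoding) is an assumption made for illustration and is not taken from the slides; real NMT systems use trained LSTM or GRU cells and learned embeddings.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def rnn_step(w_hidden, w_input, hidden, x):
    """One vanilla RNN step: h' = tanh(W_h h + W_x x)."""
    a = matvec(w_hidden, hidden)
    b = matvec(w_input, x)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def encode(source_ids, params, vocab_size, hidden_size):
    """Encoder RNN: read the source tokens, return the final hidden state."""
    hidden = [0.0] * hidden_size
    for token in source_ids:
        hidden = rnn_step(params["enc_h"], params["enc_x"], hidden,
                          one_hot(token, vocab_size))
    return hidden

def decode(hidden, params, vocab_size, start_id, steps):
    """Decoder RNN: start from the encoder state, emit tokens greedily."""
    token, output = start_id, []
    for _ in range(steps):
        hidden = rnn_step(params["dec_h"], params["dec_x"], hidden,
                          one_hot(token, vocab_size))
        logits = matvec(params["out"], hidden)
        token = max(range(vocab_size), key=lambda i: logits[i])
        output.append(token)
    return output

rng = random.Random(0)
VOCAB, HIDDEN = 6, 4          # hypothetical toy sizes
params = {
    "enc_h": random_matrix(HIDDEN, HIDDEN, rng),
    "enc_x": random_matrix(HIDDEN, VOCAB, rng),
    "dec_h": random_matrix(HIDDEN, HIDDEN, rng),
    "dec_x": random_matrix(HIDDEN, VOCAB, rng),
    "out": random_matrix(VOCAB, HIDDEN, rng),
}
summary = encode([1, 4, 2, 5], params, VOCAB, HIDDEN)
print(len(summary))           # size of the encoder's summary vector
print(decode(summary, params, VOCAB, start_id=0, steps=3))
```

Note that `encode` returns a vector of the same size no matter how long the source sequence is; the decoder sees the source only through that one vector.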

Pros and cons of NMT
● Pros:
○ Better performance than earlier statistical machine translation
○ Requires much less human engineering effort
○ A single neural network optimized end-to-end
● Cons:
○ Less interpretable
○ Difficult to control (can’t easily specify rules or guidelines for translation)
○ Information bottleneck: the encoder must compress the whole source sentence into a single fixed-size vector

Attention provides a solution to the bottleneck problem: at each step, the decoder focuses on a particular part of the source sequence.

Attention is great!
● Attention significantly improves NMT performance
● Attention helps with the vanishing gradient problem
● Attention provides some interpretability
○ By inspecting the attention distribution, we can see the alignment between source and target words, which shows that the network learns the alignment
Attention is a way to focus on particular parts of the input; it improves sequence-to-sequence models a lot

Attention is a general Deep Learning technique
● More general definition of attention:
○ Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
● For example, in the seq2seq + attention model, each decoder hidden state attends to the encoder hidden states.

● Intuition:
○ The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
○ Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
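The definition and intuition above can be sketched in a few lines of plain Python. This is a minimal sketch that assumes dot-product scoring, with each value vector also serving as its own key; the names `softmax` and `attention` and the toy vectors are hypothetical. The optional `scale` flag divides the scores by the square root of the query dimension, as in the scaled dot-product attention of Vaswani et al. (2017).

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, values, scale=False):
    """Weighted sum of `values`, with weights determined by `query`."""
    # Score each value by its dot product with the query.
    scores = [sum(q * v for q, v in zip(query, value)) for value in values]
    if scale:
        scores = [s / math.sqrt(len(query)) for s in scores]
    weights = softmax(scores)             # the attention distribution
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values))
              for i in range(dim)]
    return output, weights

# Toy seq2seq case: one decoder hidden state attends to three encoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = attention(decoder_state, encoder_states)
print(weights)   # largest weights on the states most similar to the query
print(context)   # same size as one value, however many values there are
```

The returned `weights` are the attention distribution that can be inspected for alignment, and `context` keeps the size of a single value vector regardless of how many values are passed in.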

Transformer Overview
● Sequence-to-sequence: encoder to decoder
● Task: machine translation with a parallel corpus
● Predict each translated word
● The final cost/error function is the standard cross-entropy error on top of a softmax classifier
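To make the last bullet concrete, here is a minimal sketch of cross-entropy on top of a softmax, written in plain Python. The function names, the toy vocabulary of four words, and the averaging over positions are assumptions for illustration; the logits would normally come from the decoder's output layer.

```python
import math

def softmax(logits):
    """Convert raw scores over the vocabulary into probabilities."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability that softmax assigns to the correct word."""
    return -math.log(softmax(logits)[target_id])

def sequence_loss(step_logits, target_ids):
    """Average per-word cross-entropy over one translated sentence."""
    losses = [cross_entropy(z, y) for z, y in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)

# Toy example: vocabulary of 4 words, 2 output positions.
step_logits = [[2.0, 0.5, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]]
print(sequence_loss(step_logits, target_ids=[0, 3]))
```

The loss is small when the model puts high probability on the reference word and equals the log of the vocabulary size when the prediction is uniform.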

BERT outline
● Contextual word representations
● Masked language model
● Next sentence prediction
● Model architecture
● Experiments
a. Sentence Pair Classification [MNLI]
b. Single Sentence Classification [SST-2]
c. Question Answering [SQuAD]
d. Single Sentence Tagging [CoNLL-NER]
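The masked language model item can be illustrated with a short sketch of the masking step described in Devlin et al. (2018): about 15% of token positions are selected for prediction, and a selected token is replaced by `[MASK]` 80% of the time, replaced by a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are assumptions for illustration; the sketch ignores special tokens such as `[CLS]` and `[SEP]` and is not the authors' implementation.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (model inputs, labels); labels are None where nothing is predicted."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # position not selected for prediction
        labels[i] = token                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

tokens = "my dog is hairy and he likes to play in the park".split()
vocab = sorted(set(tokens))                # hypothetical toy vocabulary
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
print(inputs)
print(labels)
```

Because the model cannot tell which unmasked tokens were selected, it has to build a contextual representation of every input token using both left and right context.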

SQuAD: Stanford Question Answering Dataset

Conclusion
● BERT is a strong pre-trained language model built on a bidirectional Transformer encoder
● BERT can be fine-tuned to achieve good performance on many NLP tasks
● The source code is available on GitHub

