Building a state-of-the-art approach to Grammatical Error Correction
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi (Grammarly)
Who are we?
What is Grammarly?
Grammarly’s AI-powered writing
assistant helps you make your
communication clear and effective,
wherever you type.
● Correctness: grammar, spelling, punctuation, writing
mechanics, and updated terminologies
● Clarity: clear writing, conciseness, complexity, and
readability
● Engagement: vocabulary and variety
● Delivery: formality, politeness, and confidence
Grammarly’s
writing assistant
350+
team members across offices
in San Francisco, New York,
Kyiv, and Vancouver
2,000+
Institutional and enterprise
clients
20M+
daily active users around
the world
Grammarly by the numbers
Grammarly’s NLP team
Our NLP team consists of 30+ people in the following roles:
- applied research scientists
- machine learning engineers
- computational linguists
This work
This presentation is based on the paper written by:
- Kostiantyn Omelianchuk
- Oleksandr Skurzhanskyi
- Vitaliy Atrasevych
- Artem Chernodub
What task are we trying to solve?
The goal of the GEC task is to produce a grammatically correct sentence from a sentence that contains mistakes.
Example:
Source: He go at school.
Target: He goes to school.
GEC: Grammatical
Error Correction
A Comprehensive Survey of Grammar Error Correction (Yu Wang, Yuelin Wang, Jie Liu, and Zhuo Liu)
GEC: History overview
Categorization of research
in GEC
GEC: Data
GEC: Shared task
- CoNLL-2014 (the test set is composed of 50 essays written by non-native English students; metric: M2 score)
- BEA-2019 (new data; the test set consists of 4477 sentences; metric: ERRANT)
GEC: Progress
Sequence
Tagging
We approach the GEC task as a sequence tagging problem.
In this formulation, for each token in the source sentence a GEC system should produce a tag (edit) that represents the required correction operation for that token.
To solve this problem, we use a non-autoregressive, transformer-based model.
Sequence Tagging
Basic transformations:
- KEEP: keep the current token unchanged
- DELETE: delete the current token
- APPEND_ti: append a new token ti
- REPLACE_ti: replace the current token with token ti

g-transformations:
- CASE: change the case of the current token
- MERGE: merge the current and the next token
- SPLIT: split the current token into two
- NOUN NUMBER: convert singular nouns to plural and vice versa
- VERB FORM: change the form of regular/irregular verbs
Token-level transformations
Source: A ten years old boy go school

Per-token alignment with the target:
A ⇨ A
ten ⇨ ten -
years ⇨ year -
old ⇨ old
boy ⇨ boy
go ⇨ goes to
school ⇨ school .

Tags: $KEEP, $MERGE_HYPH, $NOUN_NUMBER_SINGULAR, $VERB_FORM_VB_VBZ, $APPEND_{to}, $APPEND_{.}

Target: A ten-year-old boy goes to school.
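To make the tag-application step concrete, here is a minimal Python sketch (not the GECToR implementation; the tag handling and the toy verb-form lookup are simplified) that applies a sequence of edit tags to source tokens, using the earlier "He go at school." example.

# Minimal sketch of applying edit tags to source tokens (illustrative only).
VERB_VB_VBZ = {"go": "goes"}  # toy stand-in for the $VERB_FORM_VB_VBZ lookup

def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)                          # keep token unchanged
        elif tag == "$DELETE":
            continue                                   # drop the token
        elif tag.startswith("$APPEND_"):
            out.extend([token, tag[len("$APPEND_"):]]) # append a new token after it
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])         # replace the token
        elif tag == "$VERB_FORM_VB_VBZ":
            out.append(VERB_VB_VBZ.get(token, token))  # change the verb form
        else:
            out.append(token)                          # unknown tag: leave as is
    return out

# "He go at school." -> "He goes to school ."
tokens = ["He", "go", "at", "school", "."]
tags = ["$KEEP", "$VERB_FORM_VB_VBZ", "$REPLACE_to", "$KEEP", "$KEEP"]
print(" ".join(apply_tags(tokens, tags)))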
1. Generating tags is a fully independent operation that is easy to parallelize.
2. Smaller output vocabulary size compared to seq2seq models.
3. Lower memory usage and faster inference compared to encoder-decoder models.
Sequence Tagging: Pros
Sequence Tagging: Cons
1. Independent generation of tags relies on the assumption that errors are independent of each other.
2. Word reordering is tricky for this architecture.
First baselines
Academia vs Industry
- different data (both training/evaluation)
- latency/throughput matters
- aggressive deadlines
Baseline architecture
Input sentence → Embeddings → Encoder → linear layers (error detection, error replacement, error insertion) → Predicted tags → Postprocess → Corrected sentence
Baseline hyperparameters
- Embeddings (trainable random, GloVe, BERT, DistilBERT)
- Encoder (CNN, LSTM, stacked LSTM, pass-through, transformers)
- Output layers (1-2 for insertions, 1 for replacement, 1 for detection)
- Output vocabulary size (100-50,000)
- Other (dropouts, training schedule, tp/tn ratio, etc.)
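As an illustration of the baseline architecture and hyperparameters above, here is a minimal PyTorch sketch of such a tagger. The class name, layer sizes, and the single insertion head are illustrative assumptions, not Grammarly's actual code.

import torch
import torch.nn as nn

class BaselineTagger(nn.Module):
    """Sketch of the baseline: embeddings -> encoder -> per-token linear heads."""

    def __init__(self, vocab_size=30_000, emb_dim=300, hidden=512, tag_vocab=1000):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, emb_dim)           # or GloVe/BERT
        self.encoder = nn.LSTM(emb_dim, hidden // 2, num_layers=2,
                               batch_first=True, bidirectional=True)  # stacked LSTM
        self.detection = nn.Linear(hidden, 2)            # correct / incorrect
        self.replacement = nn.Linear(hidden, tag_vocab)  # replace-type tags
        self.insertion = nn.Linear(hidden, tag_vocab)    # append-type tags

    def forward(self, token_ids):
        states, _ = self.encoder(self.embeddings(token_ids))
        return (self.detection(states),
                self.replacement(states),
                self.insertion(states))

# Example: logits for a batch of 8 sentences of 40 tokens each
# detect, replace, insert = BaselineTagger()(torch.randint(0, 30_000, (8, 40)))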
Baseline results
Baseline (M2 score, CoNLL-2014 test):
- Grammarly Spellchecker: 26.8
- Stacked LSTM (vocab_size=1000): 30.5
- Stacked LSTM (vocab_size=1000; + g-transformations): 35.6
- Stacked LSTM (vocab_size=1000; + g-transformations; + BERT embeddings): 46
- Academic SOTA (single model): 61.3
Insights
- Increasing the size of the output vocabulary does not help
- Adding BERT embeddings helps a lot
- Training BERT with a small batch size failed (did not converge); training with bigger batches required gradient accumulation
Sudden
competitor
PIE paper
Our SOTA on NUCLE: 46
Transformers
Transformer
Attention Is All You Need
(Vaswani et al. 2017)
Multihead attention
BERT-like architecture
BERT (Devlin et al. 2018)
Training
transformers
huggingface / transformers
Training transformers.
First try
Things that made our transformers work:
● small learning rate (1e-5)
● big batch size (at least 128 sentences), or gradient accumulation
● freeze the BERT encoder first and train only the linear layers, then train jointly
Transformers recipe
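A minimal sketch of this recipe with PyTorch and the Hugging Face transformers library. The head size, epochs per phase, and accumulation steps are illustrative assumptions; only the ingredients themselves (lr=1e-5, an effective batch of 128+ sentences via gradient accumulation, freezing the encoder before joint training) come from the list above.

import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-cased")     # pretrained encoder
head = torch.nn.Linear(encoder.config.hidden_size, 1000)   # tag classifier (illustrative size)

def train(batches, epochs, freeze_encoder, accumulation_steps=8):
    for p in encoder.parameters():
        p.requires_grad = not freeze_encoder
    params = list(head.parameters()) + ([] if freeze_encoder else list(encoder.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-5)          # small learning rate
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, (input_ids, attention_mask, labels) in enumerate(batches, 1):
            states = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            loss = torch.nn.functional.cross_entropy(
                head(states).view(-1, 1000), labels.view(-1))
            (loss / accumulation_steps).backward()         # gradient accumulation
            if step % accumulation_steps == 0:             # e.g. 16 sentences x 8 steps = 128
                optimizer.step()
                optimizer.zero_grad()

# Phase 1: frozen encoder, train only the linear layer; Phase 2: train jointly.
# train(batches, epochs=2, freeze_encoder=True)
# train(batches, epochs=3, freeze_encoder=False)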
BERTology works
*on CoNLL-2014 (test)
- LSTM: 35.6
- LSTM + BERT embeddings: 46.0
- DistilBERT: 52.8
- BERT-base: 57.3
Additional tricks
Source: A ten years old boy go school (zero corrections)
Iteration 1: A ten-years old boy goes school (2 total corrections)
Iteration 2: A ten-year-old boy goes to school (5 total corrections)
Iteration 3: A ten-year-old boy goes to school. (6 total corrections)
Iterative Approach
GECToR model: iterative
pipeline
Input sentence → Encoder (transformer-based) → Error detection linear layer / Error correction linear layer → Predicted tags → Postprocess → Corrected sentence → Repeat N times
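A minimal sketch of this loop; predict_tags stands in for the model's tagging step (a hypothetical callable) and apply_tags is the helper sketched earlier.

def correct(tokens, predict_tags, apply_tags, max_iterations=3):
    """Sketch of GECToR-style iterative correction: tag, apply, repeat N times."""
    for _ in range(max_iterations):
        tags = predict_tags(tokens)          # one tag per token (hypothetical model call)
        if all(tag == "$KEEP" for tag in tags):
            break                            # nothing left to correct
        tokens = apply_tags(tokens, tags)    # rewrite the sentence and feed it back in
    return tokens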
● Training is split into N stages
● Each stage has its own data
● Each stage has its own hyperparameters
● At each stage, the model is initialized with the best weights from the previous stage
Staged training
Training stages
I. Pre-training on synthetic errorful sentences as in (Awasthi et al., 2019)
II. Fine-tuning on errorful-only sentences (new!)
III. Fine-tuning on a subset of errorful and error-free sentences as in (Kiyono et al., 2019)

Dataset         # sentences   % errorful sentences
PIE-synthetic   9,000,000     100.0%
Lang-8          947,344       52.5%
NUCLE           56,958        38.0%
FCE             34,490        62.4%
W&I+LOCNESS     34,304        67.3%
● Adam optimizer (Kingma and Ba, 2015);
● Early stopping after 3 epochs of 10K updates each without improvement;
● Epochs & batch sizes:
○ Stage I (pretraining):
batch size=256, 20 epochs
○ Stages II, III (finetuning):
batch size=128, 2-3 epochs
Training details
Training stage #     CoNLL-2014 (test)*, F0.5     BEA-2019 (dev)*, F0.5
I                    49.9                         33.2
II                   59.7                         44.4
III                  62.5                         50.3
III + Inf. tweaks    65.3                         55.5
* Results are given for GECToR (XLNet).
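The same schedule, written out as a sketch configuration; the key names are illustrative and do not come from the released code.

# Training configuration as described above (illustrative key names).
training_config = {
    "optimizer": "Adam",  # Kingma and Ba, 2015
    "early_stopping": {"patience_epochs": 3, "updates_per_epoch": 10_000},
    "stage_I":   {"batch_size": 256, "epochs": 20},  # pretraining on synthetic data
    "stage_II":  {"batch_size": 128, "epochs": 3},   # fine-tuning, errorful-only (2-3 epochs)
    "stage_III": {"batch_size": 128, "epochs": 3},   # fine-tuning, errorful + error-free
}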
Inference tweaks 1
By increasing the probability of the $KEEP tag, we can force the model to make only confident corrections.
In this way, we can increase precision at the cost of recall.
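A minimal sketch of this tweak, assuming tag probabilities of shape (num_tokens, num_tags) and an index for the $KEEP class; the bias value itself would be tuned on a dev set.

import numpy as np

KEEP_ID = 0            # assumed index of the $KEEP tag in the tag vocabulary
confidence_bias = 0.2  # tuned on a dev set in practice

def pick_tags(tag_probs: np.ndarray) -> np.ndarray:
    biased = tag_probs.copy()
    biased[:, KEEP_ID] += confidence_bias  # favour keeping tokens unchanged
    return biased.argmax(axis=-1)          # one tag id per token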
Inference tweaks 2
We also use the output of the error detection layer: the highest probability of the incorrect class across all tokens in the sentence must exceed a threshold (min_error_probability) in order to run the next iteration.
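A minimal sketch under the same assumptions, using the per-token probabilities of the incorrect class produced by the error detection head.

import numpy as np

min_error_probability = 0.5  # threshold, tuned on a dev set in practice

def should_run_next_iteration(error_probs: np.ndarray) -> bool:
    # error_probs[i] = probability that token i is incorrect (error detection head)
    return float(error_probs.max()) > min_error_probability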
BERT family
Varying encoders from pretrained transformers

Encoder    CoNLL-2014 (test)        BEA-2019 (test)
           P      R      F0.5       P      R      F0.5
LSTM**     51.6   15.3   35.0       -      -      -
ALBERT     59.5   31.0   50.3       43.8   22.3   36.7
BERT       65.6   36.9   56.8       48.3   29.0   42.6
GPT-2      61.0   6.3    22.2       44.5   5.0    17.2
RoBERTa    67.5   38.3   58.6       50.3   30.5   44.5
XLNet      64.6   42.6   58.6       47.1   34.2   43.8

* Training was performed on data from training stage II only. ** Baseline.
Model inference
time
Speed comparison
● NVIDIA Tesla V100
● CoNLL-2014 (test)
● single model
● batch size=128
Speed comparison
Final results
Results
Big picture
P.S. Writing paper
Writing paper
- It took us approximately 2 months for 4 authors to write 4 pages
- Better to start as early as possible
Support your code
Thank you!
Links
● The paper
● The code
● GEC survey