SlideShare a Scribd company logo
1 of 33
Download to read offline
Paraphrase Detection in NLP
Yuriy Guts
ML Engineer
=
?
Paraphrasing
Any trip to Italy should include a visit to Tuscany to sample their exquisite wines.
Be sure to include a Tuscan wine-tasting experience when visiting Italy.
Paraphrase Identification
Where can I get very professional and reliable
envelope printing service in Sydney?
Where can I get very affordable branded
envelope printing service in Sydney?
Why are doctors always late?
Why doctors always make you wait for 15-20 minutes
before they see you?
404,290 pairs
Training set
2,345,796 pairs
Test set
Data
Q1 (ID, Title) Q2 (ID, Title)
Target
Binary (metric: log-loss)
Challenges
What are the best books on IT leadership?
What are the best books about leadership?
Sometimes a single word matters
Will the Miami Heat win the NBA championship in 2011?
Will the Miami Heat win the NBA championship in 2012?
How do I lose 20 pounds?
How do I lose 15 pounds?
Sometimes there are almost no shared words
I am unable to talk to girls, leave being friendly with them. Why?
I am shy to talk to any woman because i get nervous and
freaked out around them. What is the solution?
Is there a Quora user who have seen a UFO?
Have you seen an alien?
Sometimes ALL the words are shared
Is the Government of Pakistan encouraging India by not
taking any real action against ceasefire violations?
Is the Government of India encouraging Pakistan by not
taking any real action against ceasefire violations?
What is the most interesting thing we learned about
Portugal's World Cup team in their match against Germany?
What is the most interesting thing we learned about
Germany's World Cup team in their match against Portugal?
Approaches
1. Counting Stuff.
2. Information Theory.
3. Linguistics.
4. Deep Learning.
BoW Representations
Image copyright © S. Mukherjee
Cosine Similarity
Image copyright © C. Perone
Jaccard Similarity
Image copyright © C. Wagner
Edit Distances
1. Levenshtein
2. Damerau-Levenshtein
3. Jaro
4. Jaro-Winkler
… … ...
Lexical Databases and Ontologies
house, home
dwelling
hermitage cottage
backyard
veranda
study
“A place that serves as the living
quarters of one or more families”
. . .
penthouse
Meronym
Hyponym
HyponymHypernym
Hyponym
Def
WordNet
“The complete meaning of a word is always contextual, and no study of
meaning apart from context can be taken seriously”
Distributed Hypothesis of Language
John R. Firth. The technique of semantics.
Transactions of the Philological Society, 1935.
The minimum amount of “work” needed to transform document 1 to document 2.
Inspired by Earth Mover’s Distance, a well-studied transportation problem.
Word Mover’s Distance (WMD)
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015.
http://proceedings.mlr.press/v37/kusnerb15.pdf
WMD: Linear Optimization Problem
M. Kusner et al. “From Word Embeddings to Document Distances”, 2015.
http://proceedings.mlr.press/v37/kusnerb15.pdf
nBOW frequency of the
i-th word in the document
“How much” of word i
travels to word j
Morphology & Syntax Features
https://demos.explosion.ai/displacy/
Architectural Principles
for State-of-the-Art Neural NLP
Embed
Encode
Attend
Predict
State-of-the-Art NLP Pipeline
Image copyright © Explosion AI
Recurrent Highway Networks
Highway layer:
Srivastava et al. “Recurrent Highway Networks”,
2015
https://arxiv.org/pdf/1607.03474.pdf
Two Input Documents? Siamese Network
Run the same encoding step for every input.
Share encoder weights, don’t learn WE1
and WE2
separately.
Distance Learning, Contrastive Loss
Penalize similar pairs by a monotonically increasing function of their learned distance.
Penalize different pairs by a monotonically decreasing function of their learned distance.
Hadsell, Chopra, LeCun “Dimensionality Reduction by Learning an Invariant Mapping”, 2006.
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
404,290 pairs
Training set
2,345,796 pairs
Test set
Data
Q1 (ID, Title) Q2 (ID, Title)
Target
Binary (metric: log-loss)
How can I prevent pneumonia?
How do you prevent pneumonia?
Sometimes the labeling is just plain wrong
What is the car in this picture?
What car is in this picture?
How do I protect single phasing of a 3 phase motor?
What is single phase and 3 phase?
People misspell. A lot.
demonitization demonitozation masturbution masturbtion
demonitizing demonetiztion masturation mastburation
demonetising demoneitisation mastrubating masturabate
demonitisation demoneticzation masturbuation mastrubration
demonitize demonitizition masterbuting mastubrate
demonatisation demonetisations mastubate masturbute
demonitazation demonitasation mastribution mastribute
demonitesation demonestisation masturburation masterbuate
demoneytisation demotenization mastrabutation mastubating
demonestation demonitised mastubation
demonitising demonsation
100,000+
spelling corrections
Different target balance for Train and Test
Final Solution (Simplified)
Tokenized Text
Sequences of
Pre-trained
Embeddings
Statistical Features
Fuzzy Distances
Linguistic Features
Topic Features
Leaks ¯_(ツ)_/¯
Deep NN Predictions
Embedding Features
Final Model
(GBM)
Prediction
Calibration
~200 Features
for the final model
facebook.com/yuriy.guts
github.com/YuriyGuts/kaggle-quora-question-pairs

More Related Content

What's hot

Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
weak slot and filler structure
weak slot and filler structureweak slot and filler structure
weak slot and filler structureAmey Kerkar
 
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)Weiwei Guo
 
Gpt1 and 2 model review
Gpt1 and 2 model reviewGpt1 and 2 model review
Gpt1 and 2 model reviewSeoung-Ho Choi
 
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...taeseon ryu
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Edureka!
 
Convolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language ProcessingConvolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language ProcessingThomas Delteil
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Simplilearn
 

What's hot (20)

NLP
NLPNLP
NLP
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Bert
BertBert
Bert
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Lstm
LstmLstm
Lstm
 
weak slot and filler structure
weak slot and filler structureweak slot and filler structure
weak slot and filler structure
 
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
Deep Natural Language Processing for Search Systems (sigir 2019 tutorial)
 
Gpt1 and 2 model review
Gpt1 and 2 model reviewGpt1 and 2 model review
Gpt1 and 2 model review
 
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...
Bart : Denoising Sequence-to-Sequence Pre-training for Natural Language Gener...
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
 
Convolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language ProcessingConvolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language Processing
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
 

Similar to Paraphrase Detection in NLP

Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsForward Gradient
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialAlyona Medelyan
 
KiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonKiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonAlyona Medelyan
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
download
downloaddownload
downloadbutest
 
download
downloaddownload
downloadbutest
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 
Why use video by david mann
Why use video by david mannWhy use video by david mann
Why use video by david mannDavid Mann
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games ResearchJose Zagal
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptxHaibinSu2
 
Deep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDeep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDebdoot Sheet
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowValeria de Paiva
 
The expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsThe expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsLawrie Hunter
 
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...Peter Doolittle
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...Amit Sheth
 

Similar to Paraphrase Detection in NLP (20)

Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorial
 
KiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with PythonKiwiPyCon 2014 talk - Understanding human language with Python
KiwiPyCon 2014 talk - Understanding human language with Python
 
Nltk
NltkNltk
Nltk
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
download
downloaddownload
download
 
download
downloaddownload
download
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
Why use video by david mann
Why use video by david mannWhy use video by david mann
Why use video by david mann
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptx
 
Deep Learning - What's the buzz all about
Deep Learning - What's the buzz all aboutDeep Learning - What's the buzz all about
Deep Learning - What's the buzz all about
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Portuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and HowPortuguese Linguistic Tools: What, Why and How
Portuguese Linguistic Tools: What, Why and How
 
The expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigmsThe expanding palette: emergent CALL paradigms
The expanding palette: emergent CALL paradigms
 
Bird05 nltk-intro
Bird05 nltk-introBird05 nltk-intro
Bird05 nltk-intro
 
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...From Silver Bullets to First Principles: Effectively Leveraging Technology in...
From Silver Bullets to First Principles: Effectively Leveraging Technology in...
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
 

More from Yuriy Guts

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Yuriy Guts
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine LearningYuriy Guts
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)Yuriy Guts
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivYuriy Guts
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of RedisYuriy Guts
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in ScalaYuriy Guts
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET DevelopersYuriy Guts
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETYuriy Guts
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsYuriy Guts
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software ArchitectureYuriy Guts
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business AnalystsYuriy Guts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceYuriy Guts
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseYuriy Guts
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesYuriy Guts
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingYuriy Guts
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryYuriy Guts
 

More from Yuriy Guts (18)

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine Learning
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of Redis
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software Architecture
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business Analysts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT Audience
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
 

Recently uploaded

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Recently uploaded (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

Paraphrase Detection in NLP

  • 1. Paraphrase Detection in NLP Yuriy Guts ML Engineer = ?
  • 2. Paraphrasing Any trip to Italy should include a visit to Tuscany to sample their exquisite wines. Be sure to include a Tuscan wine-tasting experience when visiting Italy.
  • 3. Paraphrase Identification Where can I get very professional and reliable envelope printing service in Sydney? Where can I get very affordable branded envelope printing service in Sydney? Why are doctors always late? Why doctors always make you wait for 15-20 minutes before they see you?
  • 4. 404,290 pairs Training set 2,345,796 pairs Test set Data Q1 (ID, Title) Q2 (ID, Title) Target Binary (metric: log-loss)
  • 6. What are the best books on IT leadership? What are the best books about leadership? Sometimes a single word matters Will the Miami Heat win the NBA championship in 2011? Will the Miami Heat win the NBA championship in 2012? How do I lose 20 pounds? How do I lose 15 pounds?
  • 7. Sometimes there are almost no shared words I am unable to talk to girls, leave being friendly with them. Why? I am shy to talk to any woman because i get nervous and freaked out around them. What is the solution? Is there a Quora user who have seen a UFO? Have you seen an alien?
  • 8. Sometimes ALL the words are shared Is the Government of Pakistan encouraging India by not taking any real action against ceasefire violations? Is the Government of India encouraging Pakistan by not taking any real action against ceasefire violations? What is the most interesting thing we learned about Portugal's World Cup team in their match against Germany? What is the most interesting thing we learned about Germany's World Cup team in their match against Portugal?
  • 9. Approaches 1. Counting Stuff. 2. Information Theory. 3. Linguistics. 4. Deep Learning.
  • 13. Edit Distances 1. Levenshtein 2. Damerau-Levenshtein 3. Jaro 4. Jaro-Winkler … … ...
  • 14. Lexical Databases and Ontologies house, home dwelling hermitage cottage backyard veranda study “A place that serves as the living quarters of one or more families” . . . penthouse Meronym Hyponym HyponymHypernym Hyponym Def
  • 16. “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously” Distributed Hypothesis of Language John R. Firth. The technique of semantics. Transactions of the Philological Society, 1935.
  • 17. The minimum amount of “work” needed to transform document 1 to document 2. Inspired by Earth Mover’s Distance, a well-studied transportation problem. Word Mover’s Distance (WMD) M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf
  • 18. WMD: Linear Optimization Problem M. Kusner et al. “From Word Embeddings to Document Distances”, 2015. http://proceedings.mlr.press/v37/kusnerb15.pdf nBOW frequency of the i-th word in the document “How much” of word i travels to word j
  • 19. Morphology & Syntax Features https://demos.explosion.ai/displacy/
  • 22. State-of-the-Art NLP Pipeline Image copyright © Explosion AI
  • 23. Recurrent Highway Networks Highway layer: Srivastava et al. “Recurrent Highway Networks”, 2015 https://arxiv.org/pdf/1607.03474.pdf
  • 24. Two Input Documents? Siamese Network Run the same encoding step for every input. Share encoder weights, don’t learn WE1 and WE2 separately.
  • 25. Distance Learning, Contrastive Loss Penalize similar pairs by a monotonically increasing function of their learned distance. Penalize different pairs by a monotonically decreasing function of their learned distance. Hadsell, Chopra, LeCun “Dimensionality Reduction by Learning an Invariant Mapping”, 2006. http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
  • 26. 404,290 pairs Training set 2,345,796 pairs Test set Data Q1 (ID, Title) Q2 (ID, Title) Target Binary (metric: log-loss)
  • 27. How can I prevent pneumonia? How do you prevent pneumonia? Sometimes the labeling is just plain wrong What is the car in this picture? What car is in this picture? How do I protect single phasing of a 3 phase motor? What is single phase and 3 phase?
  • 28. People misspell. A lot. demonitization demonitozation masturbution masturbtion demonitizing demonetiztion masturation mastburation demonetising demoneitisation mastrubating masturabate demonitisation demoneticzation masturbuation mastrubration demonitize demonitizition masterbuting mastubrate demonatisation demonetisations mastubate masturbute demonitazation demonitasation mastribution mastribute demonitesation demonestisation masturburation masterbuate demoneytisation demotenization mastrabutation mastubating demonestation demonitised mastubation demonitising demonsation 100,000+ spelling corrections
  • 29. Different target balance for Train and Test
  • 30. Final Solution (Simplified) Tokenized Text Sequences of Pre-trained Embeddings Statistical Features Fuzzy Distances Linguistic Features Topic Features Leaks ¯_(ツ)_/¯ Deep NN Predictions Embedding Features Final Model (GBM) Prediction Calibration ~200 Features for the final model
  • 31.
  • 32.