Dat Tran - Head of Data Science
Transformer based clustering:
Identifying product clusters for E-commerce
Christopher Lennan
Sebastian Wanner
13/04/2022 PyConDE & PyData Berlin
Sebastian Wanner
Senior ML Engineer
Christopher Lennan
Lead ML Engineer
idealo key facts
• More than 20 years of experience
• 900+ "idealos" from 40 nations
• Active in 6 different countries (DE, AT, ES, IT, FR, UK)
• 18 million visitors/month
• 50,000 shops
• Over 330 million offers and 2 million products
• Germany's 4th largest eCommerce website
idealo product catalogue
Problem: the vast majority of offers are not mapped to the product catalogue!
idealo open catalogue
Cluster A
Cluster B
Cluster C
Offer clustering – EAN matching
Offers (EAN): 123, 123, 123, null, 321, null, 234
Clusters: A, B, C
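A minimal sketch of EAN matching, assuming each offer is a dict with hypothetical `id` and `ean` fields (the EAN values are the ones on the slide, the ids are made up). Offers that share an EAN form a cluster; offers with a null EAN cannot be matched this way, which is the gap the text-based approach is meant to close.

```python
from collections import defaultdict

def cluster_by_ean(offers):
    """Group offer ids by EAN; offers without an EAN stay unmatched."""
    clusters = defaultdict(list)
    unmatched = []
    for offer in offers:
        ean = offer.get("ean")
        if ean is None:
            unmatched.append(offer["id"])
        else:
            clusters[ean].append(offer["id"])
    return dict(clusters), unmatched

# EAN values from the slide; ids are illustrative only.
offers = [
    {"id": 1, "ean": "123"}, {"id": 2, "ean": "123"}, {"id": 3, "ean": "123"},
    {"id": 4, "ean": None}, {"id": 5, "ean": "321"},
    {"id": 6, "ean": None}, {"id": 7, "ean": "234"},
]
clusters, unmatched = cluster_by_ean(offers)
```

In this toy data two of the seven offers carry no EAN and end up in `unmatched`.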
Offer clustering – ML on text attributes
Offers:
• EAN: 123, title: abc, colour: lmn
• EAN: 123, title: abc, colour: lmn
• EAN: 123, title: cde, colour: stu
• EAN: null, title: cde, colour: null
• EAN: 321, title: cd-e, colour: stu
• EAN: null, title: bc d, colour: mno
• EAN: 234, title: bcd, colour: null
Clusters: A, B, C
So we tried various ML approaches ...
Results
Dataset: 10k products (shoe category), ⌀ 17 offers per product
* no exhaustive hyper-parameter tuning performed
[Results table not recoverable from the slide; surviving labels: scaling, ruleset, precision 👍, recall 👎]
https://github.com/moj-analytical-services/splink
Embeddings based clustering
Transformer encoders + KNN clustering
[Diagram: offers (text attributes as features, e.g. EAN: 123, title: abc, colour: lmn) → ML model (outputs embeddings) → offers as vectors (e.g. [1, 2, 3], [2, 3, 4], [1, 2, 3]) → cluster similar vectors (cluster embeddings) → cluster A]
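A small sketch of the "text attributes as features" step, assuming offers are dicts and that attributes are joined into one string before encoding. The `name: value | name: value` format and the function name `offer_to_text` are hypothetical; the slides do not say how attributes are serialized.

```python
def offer_to_text(offer, fields=("title", "colour")):
    """Join the non-null text attributes of an offer into a single encoder input."""
    parts = [f"{name}: {offer[name]}" for name in fields if offer.get(name) is not None]
    return " | ".join(parts)

# Attribute values taken from the slide.
text_full = offer_to_text({"ean": "123", "title": "abc", "colour": "lmn"})
text_partial = offer_to_text({"ean": "234", "title": "bcd", "colour": None})
```

Null attributes are skipped, so an offer with a missing colour still yields a usable input string.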
Siamese networks with Transformer models perform best …
Transfer Learning with Transformers
Learn one task, transfer knowledge to a new task
Pretraining → Fine-tuning
Unlabeled text data → Pretrained model
Training objective: Masked language modelling
• Sentence: Where are we [MASK]
• Label: going
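To make the masked language modelling objective concrete, here is a simplified sketch that builds one training example from an unlabeled sentence. It assumes whitespace tokenization and a single masked position chosen by the caller; real pre-training masks a random subset of subword tokens, and `make_mlm_example` is a hypothetical helper name.

```python
def make_mlm_example(sentence, position, mask_token="[MASK]"):
    """Replace the token at `position` with the mask token; the removed token is the label."""
    tokens = sentence.split()
    label = tokens[position]
    tokens[position] = mask_token
    return " ".join(tokens), label

masked, label = make_mlm_example("Where are we going", position=3)
```

This reproduces the example on the slide: the model sees the masked sentence and is trained to predict the hidden word, so no manual labels are needed.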
Transfer Learning with Transformers
Leverage large scale pre-trained language models
Pre-training: microsoft / mpnet-base
• Transformer encoder with 110M parameters
• 160 GB uncompressed texts (five English-language corpora)
• training time 35 days on 32 GPUs
Fine-tuning: sentence-transformers / all-mpnet-base-v2
• trained on 1.2 billion English sentence pairs
• transferred to 100+ languages through Multi-Lingual Knowledge Distillation
Fine-tuning: idealo-offer-clustering
• trained on >5 million idealo offer pairs
• training time 28 hours on an NVIDIA V100 GPU
Siamese Networks
Train on positive and negative training pairs.
Label:
1 = similar
0 = not similar
Before fine-tuning: 0.58
After fine-tuning: 0.76
+18 pp
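A minimal sketch of how a Siamese setup scores one training pair, under stated assumptions: the same encoder embeds both offers, and the loss is the squared difference between their cosine similarity and the 0/1 label (in the style of a cosine-similarity loss). The slides do not say which loss was used, and `toy_encode` is only a letter-count stand-in for the Transformer encoder.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_encode(text):
    """Stand-in encoder (assumption): counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

def siamese_pair_loss(encode, text_a, text_b, label):
    """The same encoder (shared weights) embeds both sides of the pair."""
    similarity = cosine_similarity(encode(text_a), encode(text_b))
    return (similarity - label) ** 2

# Titles from the earlier slide: "cde" and "cd-e" should count as similar.
loss_similar = siamese_pair_loss(toy_encode, "cde", "cd-e", label=1)
loss_dissimilar = siamese_pair_loss(toy_encode, "abc", "xyz", label=1)
```

A pair that the encoder already places close together gives a loss near zero for label 1, while a pair it places far apart gives a large loss, which is the signal fine-tuning uses to move embeddings.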
Sentence Transformers
• Provides access to language models fine-tuned on 1 billion sentence pairs
• Integrated with the Hugging Face Model Hub
• Multilingual models available, with support for 100+ languages
• 10+ loss functions implemented and ready to use
Training pair generation makes a difference …
Generate Training Pairs
Choose positive pairs and negative pairs randomly
Lessons Learned
• Randomly selected negative pairs are too easy for the model.
• Random negative pairs do not contribute much to training progress.
• The model converges quickly, but its performance is not good enough.
Generate Training Pairs
Select hard-negative pairs: offline strategy
Repeated every epoch:
• Compute embeddings
• Average embedding for each product cluster
• Search for neighbors
• Generate pairs
• Training
Result: +6 pp
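The offline hard-negative strategy could be sketched roughly as follows. The slide names only the steps, so the details here are assumptions: cosine similarity for the neighbor search, `n_neighbors` nearest products per product, and pairing every offer of a product with every offer of its neighboring product. All function and variable names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def mine_hard_negatives(embeddings, product_of, n_neighbors=1):
    """Return (offer_a, offer_b, 0) pairs drawn from different but neighboring products."""
    # Step 1: average embedding for each product cluster.
    offers_by_product = {}
    for offer_id in embeddings:
        offers_by_product.setdefault(product_of[offer_id], []).append(offer_id)
    centroids = {
        product: mean_vector([embeddings[o] for o in offer_ids])
        for product, offer_ids in offers_by_product.items()
    }
    # Step 2: for each product, search for its most similar other products.
    pairs = set()
    for product, centroid in centroids.items():
        others = sorted(
            (q for q in centroids if q != product),
            key=lambda q: cosine_similarity(centroid, centroids[q]),
            reverse=True,
        )
        # Step 3: offers of neighboring products become negative pairs (label 0).
        for neighbor in others[:n_neighbors]:
            for a in offers_by_product[product]:
                for b in offers_by_product[neighbor]:
                    pairs.add((min(a, b), max(a, b), 0))
    return sorted(pairs)

# Toy embeddings: products A and B are close, product C is far from both.
embeddings = {"o1": [1.0, 0.0], "o2": [0.9, 0.1], "o3": [0.8, 0.3], "o4": [0.0, 1.0]}
product_of = {"o1": "A", "o2": "A", "o3": "B", "o4": "C"}
hard_negatives = mine_hard_negatives(embeddings, product_of)
```

In the toy data the easy pair between products A and C is never generated, because C is not A's nearest neighbor. In the talk's setup the embeddings would be recomputed with the current model before each epoch, so the mined pairs follow the model as it improves.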
Building product clusters can be challenging …
Building product clusters
Challenges
• Scale to millions of vector searches
• Search quality is important
• Search time should be small
Find the K nearest neighbors and apply a threshold (K=10)
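A brute-force sketch of "find the K nearest neighbors and apply a threshold", assuming cosine similarity as the measure. K=10 comes from the slide; the threshold value of 0.8 is an assumption, since the slides do not state it. This version compares every pair of vectors and only illustrates the logic; at millions of vectors a similarity search library is needed.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_with_threshold(vectors, k=10, threshold=0.8):
    """For each offer, keep at most k nearest neighbors with similarity >= threshold."""
    neighbors = {}
    for offer_id, vector in vectors.items():
        scored = [
            (cosine_similarity(vector, other), other_id)
            for other_id, other in vectors.items()
            if other_id != offer_id
        ]
        scored.sort(reverse=True)  # most similar first
        neighbors[offer_id] = [(j, s) for s, j in scored[:k] if s >= threshold]
    return neighbors

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbors = knn_with_threshold(vectors, k=10, threshold=0.8)
```

The threshold matters because K nearest neighbors always exist: without it, an offer with no true matches would still be linked to its ten closest, unrelated offers.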
Faiss built by Facebook Research
• Scales to billions of vectors ("Billion Scale Similarity Search" paper)
• Native distributed GPU support
• Out-of-the-box optimization strategies:
  • Compressed representation using product quantization methods
  • Approximate nearest neighbor search
Source: https://github.com/facebookresearch/faiss/wiki
Performance
• Index size: 25 GB
• Vectors: > 13 million
• Hardware: NVIDIA V100 (multi-GPU)
• Time: 4.3 hrs (⌀ 1.2 ms per vector)
Let's talk about challenges ...
Identify final product clusters
Approach
• create KNN graph with edge weights = cosine similarity
• use Label Propagation Algorithm (LPA) to identify clusters
• GraphFrames Spark library
[Figures: KNN for two offers; KNN graph; clusters after LPA]
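A small, single-machine sketch of label propagation on a weighted KNN graph. It is not the GraphFrames implementation used in the talk: it updates nodes one after another in sorted order and breaks ties by the smallest label so that the result is deterministic, and those choices are assumptions made for illustration.

```python
from collections import defaultdict

def label_propagation(edges, max_iter=10):
    """edges: {(u, v): weight} for an undirected graph. Returns {node: cluster label}."""
    adjacency = defaultdict(dict)
    for (u, v), weight in edges.items():
        adjacency[u][v] = weight
        adjacency[v][u] = weight
    labels = {node: node for node in adjacency}  # each node starts as its own cluster
    for _ in range(max_iter):
        changed = False
        for node in sorted(adjacency):
            # Sum the edge weights (cosine similarities) per neighboring label.
            scores = defaultdict(float)
            for neighbor, weight in adjacency[node].items():
                scores[labels[neighbor]] += weight
            # Heaviest label wins; ties go to the smallest label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Toy KNN graph: offers a, b, c are mutually similar; x and y form a second group.
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.85, ("x", "y"): 0.95}
labels = label_propagation(edges)
```

Offers that end up with the same label form one product cluster; here the two connected groups receive two different labels.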
Noisy Text Attributes
Hard to identify product variants
Title: Adidas Originals Superstar UNISEX schwarz weiß
Title: Adidas Originals Sportschuhe FV3139_35, 5 Sneakers White, 35.5 EU
Next Steps …
Thank you!
