idealo.de offers a price comparison service on millions of products from a wide range of categories. Each day we receive millions of offers that we cannot map to our product catalogue. We started clustering these offers to create new product clusters and ultimately enrich our product catalogue. For this, we mainly use two open-source libraries:
Sentence-Transformers to encode the offers into a vector space
Facebook Faiss to do K-Nearest-Neighbours search in vector space
We will present our results for various optimisation strategies to fine-tune Transformers for our clustering use case. The strategies include siamese and triplet network architectures, as well as an approach with an additive angular margin loss. Results will also be compared against probabilistic record linkage and TF-IDF baselines.
Further, we will share our lessons learned, e.g. how both libraries make a Machine Learning Engineer's life fairly easy and how we created informative training data for our best-performing solution.
1. Transformer-based clustering: Identifying product clusters for E-commerce
Christopher Lennan
Sebastian Wanner
Dat Tran, Head of Data Science
13/04/2022, PyConDE & PyData Berlin
3. idealo key facts
• More than 20 years of experience
• 900+ "idealos" from 40 nations
• Active in 6 countries (DE, AT, ES, IT, FR, UK)
• 18 million visitors/month
• 50,000 shops
• Over 330 million offers and 2 million products
• Germany's 4th largest eCommerce website
11. Results
Dataset: 10k products (shoe category), ⌀ 17 offers per product
*No exhaustive hyper-parameter tuning performed
[Results chart, probabilistic record linkage baseline with splink (https://github.com/moj-analytical-services/splink): precision 👍, recall 👎; annotations on scaling and the ruleset]
12. Embeddings-based clustering
[Pipeline diagram: offers with text attributes as features (e.g. EAN: 123, title: abc, colour: lmn) → Transformer encoders (ML model) output embeddings → offers as vectors → KNN clustering groups similar vectors into product clusters (e.g. cluster A)]
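To make the diagram concrete, here is a minimal sketch of this pipeline, assuming the all-mpnet-base-v2 model from the later slides and toy offer strings:

```python
# Minimal sketch of the pipeline above: encode offer texts with a
# Sentence-Transformers model, then search for similar offers with Faiss.
import faiss
from sentence_transformers import SentenceTransformer

offers = [  # toy offers mirroring the diagram
    "EAN 123 | title abc | colour lmn",
    "EAN 123 | title cde | colour stu",
    "EAN 234 | title bcd | colour null",
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = model.encode(offers, normalize_embeddings=True)  # unit length

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
scores, neighbours = index.search(embeddings, 2)  # 2 nearest neighbours per offer
```

Normalising the embeddings makes the inner-product index behave like cosine similarity, which is the usual choice for sentence embeddings.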
14. Transfer Learning with Transformers
Learn one task, transfer knowledge to a new task: pretraining → fine-tuning
[Diagram: unlabeled text data → pretraining → pretrained model]
Training objective: masked language modelling
• Sentence: Where are we [MASK]
• Label: going
An example of this objective is sketched below.
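The masked-language-modelling objective can be tried in a few lines with the Hugging Face fill-mask pipeline; bert-base-uncased is used here purely for illustration (its mask token matches the slide's example) and is not the model from this talk:

```python
from transformers import pipeline

# Masked language modelling: the model predicts the token behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Where are we [MASK]?"):
    print(prediction["token_str"], prediction["score"])
```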
17. Transfer Learning with Transformers
Leverage large-scale pre-trained language models
Pre-training → microsoft/mpnet-base:
• Transformer encoder with 110M parameters
• 160 GB of uncompressed text (five English-language corpora)
• Training time: 35 days on 32 GPUs
Fine-tuning → sentence-transformers/all-mpnet-base-v2:
• Trained on 1.2 billion English sentence pairs
• Transferred to 100+ languages through Multilingual Knowledge Distillation
Fine-tuning → idealo-offer-clustering:
• Trained on >5 million idealo offer pairs
• Training time: 28 hours on an NVIDIA V100 GPU
19. Siamese Networks
Train on positive and negative training pairs (label: 1 = similar, 0 = not similar), as sketched below.
Before fine-tuning: 0.58; after fine-tuning: 0.76 (+18 pp)
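A minimal sketch of such a siamese fine-tuning run with Sentence-Transformers; the offer pairs are invented, and since the slide does not name the exact loss, ContrastiveLoss stands in as one pair loss that fits the 1/0 labels:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_examples = [  # hypothetical offer pairs: 1 = similar, 0 = not similar
    InputExample(texts=["nike air zoom 42 black", "nike air zoom black size 42"], label=1),
    InputExample(texts=["nike air zoom 42 black", "adidas terrex 44 blue"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)  # pulls positives together, pushes negatives apart

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```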
20. Sentence Transformers
• Provide access to language models fine-tuned on 1 billion sentence pairs
• Integrated with the Hugging Face Model Hub
• Multilingual models available, support for 100+ languages
• 10+ loss functions implemented and ready to use (a few shown below)
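For example, a small selection of the ready-to-use losses, covering the siamese and triplet setups mentioned in the abstract:

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# A few of the implemented loss functions:
pair_loss = losses.ContrastiveLoss(model)                   # siamese: labelled pairs
triplet_loss = losses.TripletLoss(model)                    # (anchor, positive, negative) triplets
in_batch_loss = losses.MultipleNegativesRankingLoss(model)  # positive pairs, in-batch negatives
```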
23. Generate Training Pairs
Choosing positive and negative pairs randomly? Lessons learned:
• Randomly selected negative pairs are too easy for the model.
• Random negative pairs do not contribute much to training progress.
• The model converges quickly, but its performance is insufficient.
24. Generate Training Pairs
Select hard-negative pairs: offline strategy (see the sketch below)
[Diagram: loop per epoch — compute embeddings → average embedding for each product cluster → search for neighbors → generate pairs → training]
Result: +6 pp
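A sketch of this offline strategy; the helper and its inputs (`offers` texts and per-offer `cluster_ids`) are hypothetical, and pairing only one offer per cluster is a simplification:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def mine_hard_negative_pairs(model, offers, cluster_ids, k=5):
    """Pair offers from neighbouring product clusters as hard negatives."""
    emb = model.encode(offers, normalize_embeddings=True)
    ids = np.array(cluster_ids)
    clusters = sorted(set(cluster_ids))
    # Average embedding for each product cluster
    centroids = np.stack([emb[ids == c].mean(axis=0) for c in clusters])
    faiss.normalize_L2(centroids)
    # Search for each cluster's nearest neighbouring clusters
    index = faiss.IndexFlatIP(centroids.shape[1])
    index.add(centroids)
    _, nn = index.search(centroids, k + 1)  # first hit is the cluster itself
    # Offers from close but different clusters become hard-negative pairs
    pairs = []
    for i, cluster in enumerate(clusters):
        anchor = offers[np.flatnonzero(ids == cluster)[0]]
        for j in nn[i][1:]:
            negative = offers[np.flatnonzero(ids == clusters[j])[0]]
            pairs.append((anchor, negative, 0))  # label 0 = not similar
    return pairs
```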
26. Building product clusters
Challenges:
• Scale to millions of vector searches
• Search quality is important
• Search time should be small
Approach: find the K nearest neighbours (K=10) and apply a similarity threshold, as sketched below.
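A minimal sketch of this step, with random vectors standing in for offer embeddings and an assumed threshold of 0.9 (the talk's actual threshold is not stated):

```python
import numpy as np
import faiss

K, THRESHOLD = 10, 0.9

# Stand-in for real offer embeddings: random, L2-normalised float32 vectors
embeddings = np.random.default_rng(0).standard_normal((1_000, 768)).astype("float32")
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
scores, neighbours = index.search(embeddings, K)  # K nearest neighbours per offer

# Keep only sufficiently similar neighbours as candidate cluster edges
edges = [(i, int(j))
         for i in range(len(embeddings))
         for s, j in zip(scores[i], neighbours[i])
         if s >= THRESHOLD and j != i]
```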
27. Faiss, built by Facebook Research
• Scales to billions of vectors ("Billion-scale similarity search with GPUs" paper)
• Native distributed GPU support
• Out-of-the-box optimisation strategies:
  • Compressed representations via product quantization
  • Approximate nearest-neighbour search
Source: https://github.com/facebookresearch/faiss/wiki
Performance:
• Index size: 25 GB
• Vectors: >13 million
• Hardware: NVIDIA V100 (multi-GPU)
• Time: 4.3 hrs (⌀ 1.2 ms per vector)
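A minimal sketch of the two optimisation strategies named above, an IVF index for approximate search combined with product quantization for compressed codes; all sizes are illustrative:

```python
import numpy as np
import faiss

d = 768  # embedding dimension
xb = np.random.default_rng(0).standard_normal((100_000, d)).astype("float32")

nlist, m, nbits = 1024, 64, 8     # IVF cells; PQ sub-quantizers (64 bytes/vector); bits per code
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer assigning vectors to cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)    # learn coarse centroids and PQ codebooks
index.add(xb)
index.nprobe = 16  # cells visited at query time: speed vs. recall trade-off

distances, ids = index.search(xb[:5], 10)  # approximate 10-NN for the first 5 vectors
# On a multi-GPU machine, the index can be moved to the GPUs with
# faiss.index_cpu_to_all_gpus(index).
```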