SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#1 What is Skroutz.gr?
• Skroutz.gr is a marketplace & shopping assistant which
makes online shopping easier and more reliable
• It includes more than 11,000,000 products from 3,200
different e-shops
• On a monthly basis the website welcomes more than 8
million unique visitors ranking in the top positions in the
Greek Web
#1 Some Numbers
3,200
merchants
11
million products
270 mil.
pageviews /
mo
1.1 mil.
searches/day
33 mil.
sessions/mo
#1 The Problem
• Each day we collect thousands of new products by
downloading e-shop feeds (XML, CSV etc. - product
catalogs)
• We want to categorize incoming product payloads as
provided by eshops to the most relevant categories in
Skroutz category tree taxonomy with the minimum human
intervention.
- Difficult
- Important
#1 Why Difficult?
• Many leaf categories in
Skroutz taxonomy (>2k)
• Sibling categories
(subjective categorization)
• Misleading product titles
and shop-categories from
shops
#1 Why Important?
Robot MO collects
products from shop
feeds and stores them
to DB
Megatron category
classifier categorizes
products to the correct
category
Tron groups similar
products to entities
called SKUs to be
ready for indexing
Elasticsearch indexes
products to be
searchable from user
interface
#1 Facts
•Merchants send more than ~15k new products every day in
Skroutz!!!
•2.3k unique leaf categories in our category tree (taxonomy)
•Manual “move-to-category” action:
- Costs ~7.8s on average for content managers
- Subjective decisions may add extra overhead
#1 Old Solution - Overview
•Use Elasticseach to match specific product attributes:
- PN (manufacturer part number)
- Name
- Shop-category
•Aggregate matches and group by categories
•Normalize results and use custom weights to calculate a score
•Take Top-K results
#1 Old Solution - Limitations
•Plain cosine similarity distance on TF/IDF weights:
- No learning feedback loop
- No advanced statistics utilization (e.g. correlation between price
value and text features)
•No easy way to tune custom weights applied on final scoring
•Heuristics don’t take into account category specific context
•Heuristics don’t take into account word level context. E.g.
word “samsung” is followed by word “galaxy” most of the time
and then probably follows a model number.
#1 Old Solution - Good Parts
•Simple solution (except for custom scoring stuff)
•Easy to debug
•Easy to deploy
•Online
#1 New Solution - “Megatron”
#1 Overview
•Approach problem as a supervised learning task
•Rely on probabilities to obtain a meaningful score
•Use more features from multiple sources and use datasets
•Learn new patterns and relations by training
•Measure performance on dataset splits
•Use a microservice to serve classification requests
•Apply threshold for low confidence results
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#2 Service Architecture
1.Training Phase
2.Inference Phase
3.APIs
#2.1 Training Phase
1. Export dataset (product features labeled with category_id) and upload to Swift
2. Download specific dataset version in “training VM”
3. Start a training session using a train/val split from dataset
4. Save best performing model params snapshot (based on validation set loss)
5. Compress and upload model params to Swift container
#2.2 Inference Phase
1. Application Part: Send classification request
upon new product arrivals:
- Kafka producer (asynchronous request)
- Megatron Client HTTP synchronous
requests (2nd alternative)
2. Category Classifier Microservice Part:
- Pop messages from stream (Kafka
consumer)
- Dispatch messages to in-memory Neural
Network instance
- Fetch predictions (scores) and post-back
to Core Application API endpoint
#2.3 APIs
1. Megatron microservice internal API
- Common API (wraps Keras API)
- Basic methods:
✓ build
✓ train
✓ save
✓ load
✓ predict
- CLI commands
#2.3 APIs(2)
1. Skroutz Application Ecosystem (Ruby client)
- Megatron::Client
✓ Issues requests to microservice
- Megatron DB model
✓ Stores prediction results
- ApiController endpoint
✓ Receives callbacks from microservice
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#3 Data
•Product attribute values (potential features)
Product
Name
Shop manufacturer
Part number
EAN
Price
Shop category
...
Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW
Αρχική > Ηλεκτρονικά > Τηλεοράσεις
PNDFD22
300 €
#3 Data(2)
•Training Dataset - Raw Features
Image
Numerical
Categorical
Label
Text
#3 Data(3)
•Preprocessing
- Text
- Numerical
- Categorical
- Labels
X
y
#3 Preprocessing - Text
• Our best solution involves “Word Vectors”
• Steps to prepare for word vectors:
- Learn a words Vocabulary (mapping of words to numeric id)
- Transform text sentences to Sequences of ids based on Vocabulary
- Decide a representative sequence length (E.g. 60 words)
- Apply zero padding (pre or post) and truncation to maintain a fixed length
#3 Preprocessing - Text(2)
• Use of Pretrained Embeddings (see W2Vec, FastText, GloVe etc.)
• We use FastText library with skipgram algorithm (unsupervised)
- https://fasttext.cc/docs/en/unsupervised-tutorial.html
#3 Preprocessing - Text(3)
• Embeddings:
- Outputs 100 dim Vector
- Total 1,500,000 rows (vocab)
• 2 versions (Name, Shop-category)
#3 Preprocessing - Numerical
• “Pricevat” and “Name Length” values
• Apply Standard Scaling
#3 Preprocessing - Categorical
• All discrete value attributes/features:
- shop_id
- matching Product PNs category_id list
• One-Hot encoding:
#3 Label Encoding
• “category_id “ values are the “true” labels which should be learned by NN
• One-Hot encoding
• OR just use IDs and rely to “Keras” conventions (E.g. use an internal sparse categorical
representation to save huge amounts of RAM)
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#4 Training
1.Basic Concepts
2.Model Architecture
3.Training “In Action”
#4.1 Basic Concepts
•Objective:
- Find a combination of mathematical functions and a set of
corresponding params to maximize prediction accuracy (or minimize
error rate).
- Ensure that the above generalizes well for production.
- Learn params in an acceptable time window.
•Experiment with Neural Network architectures
•GPUS to the rescue (speedup x10)
#4.1 Basic Concepts(2)
•Loss function
- Categorical Crossentropy
•Optimizer
- Adam (Gradient Descent)
•Hyper-params
- Mini-Batch Size
- Learning Rate
- Epochs
#4.1 Validation
•Why?
- Simulate unseen data
- Compare different:
✓ training methods
✓ hyper -params
- Avoid Overfitting
•Should be representative
•Validation Strategy
- 10% of whole Dataset
- Stratification on Categories
#4.2 Model Architecture
Text
#4.2 Model Architecture
•Hybrid End-to-End architecture
•4 branches (4 input vectors):
A. Name Features Branch
B. Shop-Category Features Branch
C. Basic Features Branch (Numerics, Categorical)
D. Matching PNs Branch (Categorical)
Text
#4.2 Text Branches
• Inspired by “Embed, Encode, Attend, Predict”
- https://explosion.ai/blog/deep-learning-formula-nlp
• Each of “name” and “shop-category” sequence flows through:
- 1 x Embeddings Layer
- 1 x Bi-LSTM Encoder
- 1 x Attention Module
- 1 x LSTM Encoder
#4.2 Text Branches - Why LSTM?
• LSTM stands for “Long Short Term Memory” Layer (Encoder):
- Memory Cells / Captures context
- Propagates signal from previous words to the next in a Sequence
- 2 Stacked Layers performed better in our experiments
- 128 dimension output vector
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
128dim
#cells = sequence length
#4.2 Text Branches - Why pay Attention?
• Attention Mechanism:
- Controll how much signal should be propagated to next layers
- https://distill.pub/2016/augmented-rnns/
#4.2 Other Branches
• Basic Features Branch
- Inputs a concatenation of basic feats
- 1 Dense layer with #classes output
- ReLU activation
• Matching PNs Branch
- Inputs a concatenation of PN feats
- Short-circuited to final layer
InputVector
#classes
(~2kforSkroutz)
#4.2 Final Layer
• Merging Layer
- Concatenates all 4 branches outputs
- softmax activation
- Output: probabilities for each class
#4.2
#4.2 Model Architecture
• Model Capacity/Complexity:
#4.3 Training In Action - Model Selection
•Conducted 100s of experiments with different combinations
of features, layers, modules (e.g. Embeddings, Bag of Words,
TF/IDF, LSTM, etc.)
•10s of Ablations studies: remove specific features to see how
performance is affected
•Read many papers and applied some common tricks (Bi-LSTM,
AdaptivePooling etc.)
•It is an alchemy!
#4.3 Training In Action - Tools
•Training Scheduler Process runs weekly
•CLI training commands
- CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs
8 --batch_size 128
•Model Versioning
- E.g. “skroutz_models_2018_09_01_v1.tar.gz”
#4.3 Training In Action
Training run output example:
GPU monitoring:
#4.3 Training In Action
Learning Curves (Tensorboard):
Current best
Previous Arch Current bestPrevious Arch
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#5 Inference
1.Inference Pipeline
2.Inference API
3.Production
#5.1 Inference Pipeline
•Online execution:
- preprocessing
- vectorization
- Prediction
•Utilized by CategoryClassifier Class
- Wrapper of external API
•Utilize scikit-learn Pipelines
- http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
#5.2 Inference API
• REPL
• Kafka Worker
• Flask App
#5.3 Production
•x2 inference VMs
- inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers)
•x2 Flavors (Greece, UK)
•Grafana Monitoring for Kafka Part
Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#6 Evaluation
•More than 6% error rate reduction overall in Skroutz!
•Currently, more than ~2 content-editor hours saved per day in
Skroutz (this is scaling)!
•Move operations from list with “uncategorized” products
reduced significantly (by an order of magnitude)!
#6 Performance Summary
Success Rate Failure Rate No Prediction Rate
Megatron Old Megatron Old Megatron Old
Skroutz (GR)
2.3k categories
90.10% 82.6% 7.9% 13.8% 2% 3.5%
91.85% 85.7% 8.14% 14.32% N/A N/A
Scrooge (UK)
350 categories
87.56% 38.9% 2.5% 26.24% 9.9% 58.48%
97.1% 93.67% 2.8% 6.32% N/A N/A
#6 Monitoring Dashboard
#Future Improvements
• Utilize Image Features (in End-To-End model)
• Utilize Entity Recognition to extract more features
• Find ways to utilize more features (color, sizes etc.)
• Categorical Self-Trained Embeddings
• Experiment with newer solutions like “Transformer”
#Contact Info
Andreas Loupasakis
• Email: alup@skroutz.gr
• Kaggle: https://www.kaggle.com/andreaslup
• Twitter: https://twitter.com/andy_lupo
• LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47
Thank you!

Contenu connexe

Similaire à Automated product categorization

Similaire à Automated product categorization (20)

Boosting the Performance of your Rails Apps
Boosting the Performance of your Rails AppsBoosting the Performance of your Rails Apps
Boosting the Performance of your Rails Apps
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Performance Testing Java Applications
Performance Testing Java ApplicationsPerformance Testing Java Applications
Performance Testing Java Applications
 
Visual Studio Profiler
Visual Studio ProfilerVisual Studio Profiler
Visual Studio Profiler
 
The Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance TuningThe Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance Tuning
 
Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2
 
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
 
AWS re:Invent 2014 | (ARC202) Real-World Real-Time Analytics
AWS re:Invent 2014 | (ARC202) Real-World Real-Time AnalyticsAWS re:Invent 2014 | (ARC202) Real-World Real-Time Analytics
AWS re:Invent 2014 | (ARC202) Real-World Real-Time Analytics
 
Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017
 
CQRS recepies
CQRS recepiesCQRS recepies
CQRS recepies
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Alfresco Business Reporting - Tech Talk Live 20130501
Alfresco Business Reporting - Tech Talk Live 20130501Alfresco Business Reporting - Tech Talk Live 20130501
Alfresco Business Reporting - Tech Talk Live 20130501
 
Test strategy utilising mc useful tools
Test strategy utilising mc useful toolsTest strategy utilising mc useful tools
Test strategy utilising mc useful tools
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Design principles & quality factors
Design principles & quality factorsDesign principles & quality factors
Design principles & quality factors
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
ITARC15 Workshop - Architecting a Large Software Project - Lessons Learned
ITARC15 Workshop - Architecting a Large Software Project - Lessons LearnedITARC15 Workshop - Architecting a Large Software Project - Lessons Learned
ITARC15 Workshop - Architecting a Large Software Project - Lessons Learned
 
Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016
 
Sumo Logic QuickStart
Sumo Logic QuickStartSumo Logic QuickStart
Sumo Logic QuickStart
 

Plus de Warply

4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
Warply
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
Warply
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
Warply
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
Warply
 

Plus de Warply (20)

Chatbot workshop - How to build one.#digitized16
Chatbot workshop - How to build one.#digitized16Chatbot workshop - How to build one.#digitized16
Chatbot workshop - How to build one.#digitized16
 
Chatbot workshop introduction.#digitized16
Chatbot workshop introduction.#digitized16 Chatbot workshop introduction.#digitized16
Chatbot workshop introduction.#digitized16
 
Warply Mobile Banking solutions
Warply Mobile Banking solutionsWarply Mobile Banking solutions
Warply Mobile Banking solutions
 
Warply Mobile Banking solutions
Warply Mobile Banking solutionsWarply Mobile Banking solutions
Warply Mobile Banking solutions
 
Chatbots - A new era in digital banking
Chatbots - A new era in digital bankingChatbots - A new era in digital banking
Chatbots - A new era in digital banking
 
The CNN Greece Case study
The CNN Greece Case studyThe CNN Greece Case study
The CNN Greece Case study
 
Programmatic Demystified (?)
Programmatic Demystified (?)Programmatic Demystified (?)
Programmatic Demystified (?)
 
In store Retail sap forum
In store Retail sap forum In store Retail sap forum
In store Retail sap forum
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Lavip...
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_Nestl...
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_First...
 
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
4th Mobile Marketing Event by Warply: Mobile: In-store Retail’s New Era_iStor...
 
Approaching Customers on Mobile. Bonus: sneak peek of Warply Engage Platform 2.0
Approaching Customers on Mobile. Bonus: sneak peek of Warply Engage Platform 2.0Approaching Customers on Mobile. Bonus: sneak peek of Warply Engage Platform 2.0
Approaching Customers on Mobile. Bonus: sneak peek of Warply Engage Platform 2.0
 
Mobile Payments Event by Warply: Eurobank's presentation
Mobile Payments Event by Warply: Eurobank's presentationMobile Payments Event by Warply: Eurobank's presentation
Mobile Payments Event by Warply: Eurobank's presentation
 
Mobile Payments Event by Warply: Apple Pay
Mobile Payments Event by Warply: Apple PayMobile Payments Event by Warply: Apple Pay
Mobile Payments Event by Warply: Apple Pay
 
Data Privacy in Modern Advertisement
Data Privacy in Modern AdvertisementData Privacy in Modern Advertisement
Data Privacy in Modern Advertisement
 
Mobile Loyalty that works: a successful case study by Warply and Eurobank
Mobile Loyalty that works: a successful case study by Warply and Eurobank Mobile Loyalty that works: a successful case study by Warply and Eurobank
Mobile Loyalty that works: a successful case study by Warply and Eurobank
 
3rd Mobile Marketing Event by Warply: WIND Telecommunications Hellas presenta...
3rd Mobile Marketing Event by Warply: WIND Telecommunications Hellas presenta...3rd Mobile Marketing Event by Warply: WIND Telecommunications Hellas presenta...
3rd Mobile Marketing Event by Warply: WIND Telecommunications Hellas presenta...
 
3rd Mobile Marketing event by Warply: Mobile as a Revenue Channel
3rd Mobile Marketing event by Warply: Mobile as a Revenue Channel3rd Mobile Marketing event by Warply: Mobile as a Revenue Channel
3rd Mobile Marketing event by Warply: Mobile as a Revenue Channel
 
3rd Mobile Marketing event by Warply: Travelplanet24 presentation
3rd Mobile Marketing event by Warply: Travelplanet24 presentation3rd Mobile Marketing event by Warply: Travelplanet24 presentation
3rd Mobile Marketing event by Warply: Travelplanet24 presentation
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 

Automated product categorization

  • 1.
  • 2. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 3. #1 What is Skroutz.gr? • Skroutz.gr is a marketplace & shopping assistant which makes online shopping easier and more reliable • It includes more than 11,000,000 products from 3,200 different e-shops • On a monthly basis the website welcomes more than 8 million unique visitors ranking in the top positions in the Greek Web
  • 4. #1 Some Numbers 3,200 merchants 11 million products 270 mil. pageviews / mo 1.1 mil. searches/day 33 mil. sessions/mo
  • 5. #1 The Problem • Each day we collect thousands of new products by downloading e-shop feeds (XML, CSV etc. - product catalogs) • We want to categorize incoming product payloads as provided by eshops to the most relevant categories in Skroutz category tree taxonomy with the minimum human intervention. - Difficult - Important
  • 6. #1 Why Difficult? • Many leaf categories in Skroutz taxonomy (>2k) • Sibling categories (subjective categorization) • Misleading product titles and shop-categories from shops
  • 7. #1 Why Important? Robot MO collects products from shop feeds and stores them to DB Megatron category classifier categorizes products to the correct category Tron groups similar products to entities called SKUs to be ready for indexing Elasticsearch indexes products to be searchable from user interface
  • 8. #1 Facts •Merchants send more than ~15k new products every day in Skroutz!!! •2.3k unique leaf categories in our category tree (taxonomy) •Manual “move-to-category” action: - Costs ~7.8s on average for content managers - Subjective decisions may add extra overhead
  • 9. #1 Old Solution - Overview •Use Elasticseach to match specific product attributes: - PN (manufacturer part number) - Name - Shop-category •Aggregate matches and group by categories •Normalize results and use custom weights to calculate a score •Take Top-K results
  • 10. #1 Old Solution - Limitations •Plain cosine similarity distance on TF/IDF weights: - No learning feedback loop - No advanced statistics utilization (e.g. correlation between price value and text features) •No easy way to tune custom weights applied on final scoring •Heuristics don’t take into account category specific context •Heuristics don’t take into account word level context. E.g. word “samsung” is followed by word “galaxy” most of the time and then probably follows a model number.
  • 11. #1 Old Solution - Good Parts •Simple solution (except for custom scoring stuff) •Easy to debug •Easy to deploy •Online
  • 12. #1 New Solution - “Megatron”
  • 13. #1 Overview •Approach problem as a supervised learning task •Rely on probabilities to obtain a meaningful score •Use more features from multiple sources and use datasets •Learn new patterns and relations by training •Measure performance on dataset splits •Use a microservice to serve classification requests •Apply threshold for low confidence results
  • 14. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 15. #2 Service Architecture 1.Training Phase 2.Inference Phase 3.APIs
  • 16.
  • 17. #2.1 Training Phase 1. Export dataset (product features labeled with category_id) and upload to Swift 2. Download specific dataset version in “training VM” 3. Start a training session using a train/val split from dataset 4. Save best performing model params snapshot (based on validation set loss) 5. Compress and upload model params to Swift container
  • 18. #2.2 Inference Phase 1. Application Part: Send classification request upon new product arrivals: - Kafka producer (asynchronous request) - Megatron Client HTTP synchronous requests (2nd alternative) 2. Category Classifier Microservice Part: - Pop messages from stream (Kafka consumer) - Dispatch messages to in-memory Neural Network instance - Fetch predictions (scores) and post-back to Core Application API endpoint
  • 19. #2.3 APIs 1. Megatron microservice internal API - Common API (wraps Keras API) - Basic methods: ✓ build ✓ train ✓ save ✓ load ✓ predict - CLI commands
  • 20. #2.3 APIs(2) 1. Skroutz Application Ecosystem (Ruby client) - Megatron::Client ✓ Issues requests to microservice - Megatron DB model ✓ Stores prediction results - ApiController endpoint ✓ Receives callbacks from microservice
  • 21. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 22. #3 Data •Product attribute values (potential features) Product Name Shop manufacturer Part number EAN Price Shop category ... Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW Αρχική > Ηλεκτρονικά > Τηλεοράσεις PNDFD22 300 €
  • 23. #3 Data(2) •Training Dataset - Raw Features Image Numerical Categorical Label Text
  • 24. #3 Data(3) •Preprocessing - Text - Numerical - Categorical - Labels X y
  • 25. #3 Preprocessing - Text • Our best solution involves “Word Vectors” • Steps to prepare for word vectors: - Learn a words Vocabulary (mapping of words to numeric id) - Transform text sentences to Sequences of ids based on Vocabulary - Decide a representative sequence length (E.g. 60 words) - Apply zero padding (pre or post) and truncation to maintain a fixed length
  • 26. #3 Preprocessing - Text(2) • Use of Pretrained Embeddings (see W2Vec, FastText, GloVe etc.) • We use FastText library with skipgram algorithm (unsupervised) - https://fasttext.cc/docs/en/unsupervised-tutorial.html
  • 27. #3 Preprocessing - Text(3) • Embeddings: - Outputs 100 dim Vector - Total 1,500,000 rows (vocab) • 2 versions (Name, Shop-category)
  • 28. #3 Preprocessing - Numerical • “Pricevat” and “Name Length” values • Apply Standard Scaling
  • 29. #3 Preprocessing - Categorical • All discrete value attributes/features: - shop_id - matching Product PNs category_id list • One-Hot encoding:
  • 30. #3 Label Encoding • “category_id “ values are the “true” labels which should be learned by NN • One-Hot encoding • OR just use IDs and rely to “Keras” conventions (E.g. use an internal sparse categorical representation to save huge amounts of RAM)
  • 31. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 32. #4 Training 1.Basic Concepts 2.Model Architecture 3.Training “In Action”
  • 33. #4.1 Basic Concepts •Objective: - Find a combination of mathematical functions and a set of corresponding params to maximize prediction accuracy (or minimize error rate). - Ensure that the above generalizes well for production. - Learn params in an acceptable time window. •Experiment with Neural Network architectures •GPUS to the rescue (speedup x10)
  • 34. #4.1 Basic Concepts(2) •Loss function - Categorical Crossentropy •Optimizer - Adam (Gradient Descent) •Hyper-params - Mini-Batch Size - Learning Rate - Epochs
  • 35. #4.1 Validation •Why? - Simulate unseen data - Compare different: ✓ training methods ✓ hyper -params - Avoid Overfitting •Should be representative •Validation Strategy - 10% of whole Dataset - Stratification on Categories
  • 37. #4.2 Model Architecture •Hybrid End-to-End architecture •4 branches (4 input vectors): A. Name Features Branch B. Shop-Category Features Branch C. Basic Features Branch (Numerics, Categorical) D. Matching PNs Branch (Categorical) Text
  • 38. #4.2 Text Branches • Inspired by “Embed, Encode, Attend, Predict” - https://explosion.ai/blog/deep-learning-formula-nlp • Each of “name” and “shop-category” sequence flows through: - 1 x Embeddings Layer - 1 x Bi-LSTM Encoder - 1 x Attention Module - 1 x LSTM Encoder
  • 39. #4.2 Text Branches - Why LSTM? • LSTM stands for “Long Short Term Memory” Layer (Encoder): - Memory Cells / Captures context - Propagates signal from previous words to the next in a Sequence - 2 Stacked Layers performed better in our experiments - 128 dimension output vector - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 128dim #cells = sequence length
  • 40. #4.2 Text Branches - Why pay Attention? • Attention Mechanism: - Controll how much signal should be propagated to next layers - https://distill.pub/2016/augmented-rnns/
  • 41. #4.2 Other Branches • Basic Features Branch - Inputs a concatenation of basic feats - 1 Dense layer with #classes output - ReLU activation • Matching PNs Branch - Inputs a concatenation of PN feats - Short-circuited to final layer InputVector #classes (~2kforSkroutz)
  • 42. #4.2 Final Layer • Merging Layer - Concatenates all 4 branches outputs - softmax activation - Output: probabilities for each class
  • 43. #4.2
  • 44. #4.2 Model Architecture • Model Capacity/Complexity:
  • 45. #4.3 Training In Action - Model Selection •Conducted 100s of experiments with different combinations of features, layers, modules (e.g. Embeddings, Bag of Words, TF/IDF, LSTM, etc.) •10s of Ablations studies: remove specific features to see how performance is affected •Read many papers and applied some common tricks (Bi-LSTM, AdaptivePooling etc.) •It is an alchemy!
  • 46. #4.3 Training In Action - Tools •Training Scheduler Process runs weekly •CLI training commands - CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs 8 --batch_size 128 •Model Versioning - E.g. “skroutz_models_2018_09_01_v1.tar.gz”
  • 47. #4.3 Training In Action Training run output example: GPU monitoring:
  • 48. #4.3 Training In Action Learning Curves (Tensorboard): Current best Previous Arch Current bestPrevious Arch
  • 49. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 51. #5.1 Inference Pipeline •Online execution: - preprocessing - vectorization - Prediction •Utilized by CategoryClassifier Class - Wrapper of external API •Utilize scikit-learn Pipelines - http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
  • 52. #5.2 Inference API • REPL • Kafka Worker • Flask App
  • 53. #5.3 Production •x2 inference VMs - inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers) •x2 Flavors (Greece, UK) •Grafana Monitoring for Kafka Part
  • 54. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 55. #6 Evaluation •More than 6% error rate reduction overall in Skroutz! •Currently, more than ~2 content-editor hours saved per day in Skroutz (this is scaling)! •Move operations from list with “uncategorized” products reduced significantly (by an order of magnitude)!
  • 56. #6 Performance Summary Success Rate Failure Rate No Prediction Rate Megatron Old Megatron Old Megatron Old Skroutz (GR) 2.3k categories 90.10% 82.6% 7.9% 13.8% 2% 3.5% 91.85% 85.7% 8.14% 14.32% N/A N/A Scrooge (UK) 350 categories 87.56% 38.9% 2.5% 26.24% 9.9% 58.48% 97.1% 93.67% 2.8% 6.32% N/A N/A
  • 58. #Future Improvements • Utilize Image Features (in End-To-End model) • Utilize Entity Recognition to extract more features • Find ways to utilize more features (color, sizes etc.) • Categorical Self-Trained Embeddings • Experiment with newer solutions like “Transformer”
  • 59. #Contact Info Andreas Loupasakis • Email: alup@skroutz.gr • Kaggle: https://www.kaggle.com/andreaslup • Twitter: https://twitter.com/andy_lupo • LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47