In this session you will learn how H&M has created a reference architecture for deploying machine learning models on Azure using Databricks, following DevOps principles. The architecture is currently used in production and has been iterated on multiple times to address discovered pain points. The presenting team is responsible for ensuring that best practices are implemented across all H&M use cases, covering hundreds of models across the entire H&M group. <br> This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session also covers lifecycle management, traceability, automation, scalability, and version control.
6. Algo library, IT platform, Business Impact
Our journey:
• 2016 – Exploration: run initial PoCs; test AA appetite & applicability
• 2017 – Initiation: industrialize early use cases; define organization and capability needs; establish the IT / data environment
• 2018 – Establish AA & AI function: roll out & hand over successful pilots; establish AA ways of working (AA-WoW), team, governance
• 2019 (today) – AA Leader: increasingly data- & algo-driven retail business; analytical support across the entire value chain; strong internal AA teams; engage in partnerships with strong AI players
• 2022 – AI Leader of the Fashion Industry: lead the frontier of AI at scale in delivering customer value; global leader in developing talent pools and supporting AI hubs and networks; AI-powered tools and capabilities supporting core processes and business decisions in all functions; world-leading ecosystem of cutting-edge AI partners
7. AI @ H&M quick facts
• 100+ co-located FTEs, with a growing number of colleagues
• 30+ different nationalities
• Combined teams: algo, cloud, consultants
• New ways of working: sprints, standups, product mgmt., epics
• HAAL, Azure Databricks
11. Model development & usage process
Model development: training data ingestion → data preparation → feature engineering → model development. Model & data versioning and deployment orchestration span the whole flow, and the persisted model is written to data storage.
Model usage: unseen data ingestion → data preparation → transform data into features → model prediction → results.
12. Generic AI development process
• Model exploration: data exploration, feature engineering, model exploration, try out different libraries
• Model implementation: data onboarding / ETL, model implementation, set up model training pipeline, implement model serving, set up container, unit tests
• Model training: execute pipeline, performance evaluation, build model, cross-validation, output model
• Model tuning: hyperparameter tuning, model assembling, data augmentation
• Build model env: build model serving container
• Offline model prediction: offline prediction, output result
• Online model serving: A/B deployment of model serving, rolling upgrade
• Model monitoring: performance monitoring, monitoring of non-functional aspects
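The model tuning step above can be illustrated with a minimal grid search; this is only a sketch, where the loss function and the parameter grid are hypothetical stand-ins for a real training pipeline and its validation metric.

```python
from itertools import product

def evaluate(params):
    # Hypothetical validation loss: a real pipeline would train a model
    # with these hyperparameters and score it on held-out data.
    lr, reg = params["lr"], params["reg"]
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def grid_search(grid):
    # Try every combination of hyperparameter values and keep the best.
    best_params, best_loss = None, float("inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

grid = {"lr": [0.01, 0.1, 1.0], "reg": [0.001, 0.01, 0.1]}
best, loss = grid_search(grid)
```

In practice this loop is what the tuning stage automates, typically with smarter search strategies than an exhaustive grid.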
13. Development process – tool mapping
• Model exploration: Azure Databricks, Data Lake Store
• Model implementation: VS Code, PyCharm
• Model training & model tuning: Azure Databricks, Data Lake Store
• Build model applying env: Kubernetes, Container Registry
• Offline model prediction: PyCharm, Azure Databricks
• Online model serving: Kubernetes, Container Registry
• Model monitoring: Azure Databricks, Airflow
16. Agenda
• AI journey @ H&M
• Machine learning blueprint
• Automated ML development process
• ML orchestration for scale
17. Automated ML development
Components: PyCharm, a CI orchestrator, Azure Databricks, a model repository, a container registry, and Kubernetes.
1. Code commit
2. Code static check, unit test, package
3.1 Push package to DBFS
3.2 Trigger pipeline
4.1 Job execution
4.2 Log model info
4.3 Commit model
5.1 Fetch model
5.2 Build container image
6. Push container image
7. Auto-deploy new container
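Step 3.2 (trigger pipeline) is typically done through the Databricks Jobs API. The sketch below only builds the request for the `run-now` endpoint; the workspace host, job id, and parameter names are hypothetical, and the actual HTTP call and token authentication are left out.

```python
import json

def build_run_now_request(host, job_id, notebook_params):
    # Request for POST /api/2.0/jobs/run-now on the Databricks Jobs API:
    # job_id identifies a pre-defined Databricks job; notebook_params are
    # passed through to the job's notebook task.
    url = f"https://{host}/api/2.0/jobs/run-now"
    body = json.dumps({"job_id": job_id, "notebook_params": notebook_params})
    return url, body

# Hypothetical workspace host and parameters for illustration only.
url, body = build_run_now_request(
    "adb-123.azuredatabricks.net", 42, {"model_version": "1.3.0"})
```

A CI step would POST this JSON body to the URL with a bearer token for the workspace.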
18. Connect the dots
• Exploration: shared vs. dedicated cluster
• Implementation: notebooks vs. Python modules
• Build and packaging: library management
• Training and prediction: training on worker nodes
• Monitoring: logging with MLflow
19. Agenda
• AI journey @ H&M
• Machine learning blueprint
• Automated ML development process
• ML orchestration for scale
22. Scenarios on a shared Databricks cluster
Scenarios are parameterized by geo location, product type, and time (scenario 1: l1, p1, t1; scenario 2: l2, p2, t2; scenario 3: l3, p3, t3; … scenario m: lm, pm, tm). Each scenario runs the same pipeline: source data → prep data → feature engineering → train → optimize. On one Databricks cluster the pipeline mixes Spark tasks with Python tasks. With per-stage timings of roughly 30, 60, and 5 minutes per scenario, the open question is the total lead time for m scenarios (? mins).
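The lead-time question can be made concrete with a small calculation. The per-stage timings below follow the slide; the scenario count and the degree of parallelism are hypothetical illustrations.

```python
def lead_time(num_scenarios, stage_minutes, parallelism):
    # Each scenario runs its stages sequentially; scenarios run in
    # batches of `parallelism` side by side, so total lead time is the
    # per-scenario time multiplied by the number of batches.
    per_scenario = sum(stage_minutes)
    batches = -(-num_scenarios // parallelism)  # ceiling division
    return per_scenario * batches

stages = [30, 60, 5]          # minutes per stage, from the slide
serial = lead_time(8, stages, parallelism=1)    # 8 scenarios one by one
parallel = lead_time(8, stages, parallelism=4)  # 4 scenarios at a time
```

Running scenarios strictly one by one multiplies the per-scenario time by m, which is why the following slides look at fanning scenarios out across dedicated compute.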
23. Scenarios on dedicated compute
The same parameterized scenarios (scenario 1: l1, p1, t1; scenario 2: l2, p2, t2; scenario 3: l3, p3, t3; … scenario i: li, pi, ti) form a scenario set, and each scenario's pipeline (source data → prep data → feature engineering → train → optimize) runs on its own compute target: a Databricks cluster, a VM, or a container.
24. What we are looking for
• An ML orchestrator to train models for different scenarios (scenario sets)
• Scenario sets can be parameterized
• Leverage different computation patterns, such as Spark and Docker
• Parallelize each scenario as much as possible
• Optimize both resource utilization and total lead time
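A scenario set can be parameterized along the dimensions the slides use (geo location, product type, time). The sketch below expands such a set into per-scenario pipeline runs and fans them out with a thread pool; the dimension values and the pipeline stub are hypothetical, and a real orchestrator would dispatch to Spark clusters or containers instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical scenario dimensions, mirroring the slides.
LOCATIONS = ["l1", "l2", "l3"]
PRODUCT_TYPES = ["p1", "p2"]
TIMES = ["t1"]

def run_pipeline(scenario):
    # Stub for: source data -> prep data -> feature engineering -> train -> optimize.
    loc, ptype, time = scenario
    return f"model({loc},{ptype},{time})"

def run_scenario_set(locations, product_types, times, max_workers=4):
    # Expand the parameterized scenario set into concrete scenarios
    # and run them in parallel.
    scenarios = list(product(locations, product_types, times))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_pipeline, scenarios))

models = run_scenario_set(LOCATIONS, PRODUCT_TYPES, TIMES)
```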
25. ML orchestrator – Airflow
How Apache Airflow distributes jobs on Celery workers.
Features:
• Workflow scheduler by Airbnb
• Implement pipelines/DAGs in Python
• Integration with different sources & sinks
Challenges:
• Multiple sources of failure
• Lack of elasticity (scaling up/down)
• Coupling of app dependencies with infrastructure
26. Scenario-set orchestration with Airflow on Kubernetes
An Airflow DAG expands a scenario set into one branch per scenario (scenario 1, scenario 2, scenario 3 … scenario i), each branch running source data → prep data → feature engineering → train → optimize. Scenario tasks are dispatched to Databricks clusters and to AKS, with container images pulled from the Container Registry. Airflow itself runs as Kubernetes pods: the webserver and scheduler are backed by the Airflow MetaDB, with Airflow DAGs and logs kept on a persistent volume on an Azure File share.
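The DAG shape used here (one linear branch of pipeline stages per scenario, fanned out from a common scenario-set root) can be sketched generically. In production this would be an Airflow DAG; the minimal stdlib model below, with hypothetical task names, only shows the dependency structure and a valid execution order.

```python
STAGES = ["source_data", "prep_data", "feature_engineering", "train", "optimize"]

def build_dag(scenarios):
    # Map each task to its upstream dependency: every scenario gets a
    # linear chain of stages hanging off a shared "scenario_set" root.
    deps = {"scenario_set": None}
    for s in scenarios:
        upstream = "scenario_set"
        for stage in STAGES:
            task = f"{s}.{stage}"
            deps[task] = upstream
            upstream = task
    return deps

def execution_order(deps):
    # Topological order for this tree-shaped DAG: repeatedly emit tasks
    # whose upstream dependency has already been emitted.
    done, order = set(), []
    while len(order) < len(deps):
        for task, upstream in deps.items():
            if task not in done and (upstream is None or upstream in done):
                done.add(task)
                order.append(task)
    return order

dag = build_dag(["scenario_1", "scenario_2"])
order = execution_order(dag)
```

An orchestrator like Airflow schedules independent branches concurrently, which is what allows the scenarios in a set to run in parallel while each branch stays sequential.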