This document summarizes Walmart's transition to building an enterprise data platform on Azure Databricks to enable machine learning and data science at scale. Previously, Walmart had a complex and slow legacy technology stack. The new platform goals were to centralize data in the cloud, increase productivity with data science tools, and reduce costs. Key aspects of the new platform included using Azure and Databricks for data processing and machine learning, Airflow for orchestration, and building several machine learning models for applications like fraud detection and product recommendations. Challenges in the transition included optimizing performance and managing resources across the platforms.
2. Building an Enterprise Data Platform with Azure Databricks to Enable Machine Learning and Data Science at Scale
Andrew Ray
Craig Covey
#UnifiedAnalytics #SparkAISummit
5. Previous State
Technology managed by Walmart:
• 10+ big Hadoop clusters
• Multiple large instances of Teradata
• Data dispersed in operational DBs
– Oracle, MySQL, MS SQL, DB2, Informix, SAP
13. Airflow
• 3 Airflow environments (dev, stage, prod)
• Hosted in Walmart's internal cloud
• Code in GitHub, running on Docker
• 180+ DAGs
• 600,000,000,000+ rows
• 150+ TB
• 6 sources
• Zero to prod in ~2 months
14. Airflow DAG Workflow
• Create DAGs dynamically (a sketch follows this list)
• Various Hadoop jobs move the data
• Send data to Azure Blob in two regions in parallel
• Databricks notebooks verify that the source and Azure data are statistically nearly equal
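A minimal sketch of what dynamic DAG generation can look like, assuming Airflow 2.x; the table list, DAG ids, and echo commands are illustrative placeholders, not the deck's actual pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative config; in practice this could come from a metadata store.
SOURCE_TABLES = ["orders", "inventory", "returns"]

for table in SOURCE_TABLES:
    dag_id = f"copy_{table}_to_blob"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="hadoop_extract",
            bash_command=f"echo extract {table} from Hadoop",
        )
        # Mirror the slide's "two regions in parallel" copy step.
        copy_east = BashOperator(
            task_id="copy_to_blob_east",
            bash_command=f"echo copy {table} to East region Blob",
        )
        copy_west = BashOperator(
            task_id="copy_to_blob_west",
            bash_command=f"echo copy {table} to West region Blob",
        )
        extract >> [copy_east, copy_west]

    # Expose each DAG at module scope so the scheduler discovers it.
    globals()[dag_id] = dag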
16. Databricks Workspaces
• Data lives in Blob Storage
• Tables in Databricks are created with a specific path in DBFS that is mounted to a container in Azure Blob Storage
• Multiple Databricks workspaces to segregate access
• Blob containers are mounted with read-only SAS tokens to protect data (see the mount sketch below)
• Create Delta tables:
CREATE TABLE db1.tb1
USING delta
LOCATION '/mnt/general/...'
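A minimal sketch of the read-only SAS mount, assuming it runs in a Databricks notebook (where dbutils is in scope); the storage account, container, and secret scope names are placeholders:

# Mount a Blob container read-only by supplying a read-only SAS token;
# the account, container, and secret names below are illustrative.
dbutils.fs.mount(
    source="wasbs://general@myaccount.blob.core.windows.net",
    mount_point="/mnt/general",
    extra_configs={
        "fs.azure.sas.general.myaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="blob-sas", key="general-readonly-sas")
    },
)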
17. Databricks Notebooks
• Hybrid of Jupyter notebooks and Google Docs
• Allows for collaborative editing
– Easy technical support
– Remote pair programming
• Multiple languages in one notebook
18. Databricks Delta
Next-gen format built on top of Spark for batch and streaming big data use cases
• Performance: queries run fast (e.g., one query went from 1+ hours to <6 seconds)
• Transactional updates: no downtime or consistency issues during updates
• Change Data Capture: MERGE makes it easy (see the sketch below)
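A minimal sketch of change data capture via Delta Lake MERGE, assuming a Databricks notebook where spark is in scope; the target table and the updates view are placeholder names:

# Upsert a batch of changed rows into a Delta table; db1.tb1 and the
# `updates` view are illustrative names.
spark.sql("""
    MERGE INTO db1.tb1 AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")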
21. Assortment
Purpose: select products and inventory levels to maximize profit/sales within finite shelf space
Machine Learning: Expectation-Maximization (EM), CatBoost (boosting to estimate true demand; see the sketch below)
Production: Databricks to manipulate data, train ML models, and run production scoring jobs
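An illustrative sketch only, assuming the catboost Python package; the toy features and targets stand in for the real demand data, which the deck does not describe:

import pandas as pd
from catboost import CatBoostRegressor

# Toy training frame; in production this would be read from Delta tables.
train = pd.DataFrame({
    "store_id": ["s1", "s1", "s2", "s2"],
    "item_id": ["a", "b", "a", "b"],
    "shelf_space": [4, 2, 3, 1],
    "units_sold": [120, 40, 95, 15],
})
features = ["store_id", "item_id", "shelf_space"]

# Gradient-boosted trees to estimate demand from assortment features.
model = CatBoostRegressor(iterations=200, depth=4, verbose=False)
model.fit(train[features], train["units_sold"],
          cat_features=["store_id", "item_id"])
demand_estimate = model.predict(train[features])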
24. Access Control
Databricks role-based access control:
• CURRENT: separate workspaces for projects that need special data
• PHASE 1: provision/deprovision users to workspaces with AD groups
• PHASE 2: sync groups for use with table access control (see the sketch below)
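A minimal sketch of what phase 2's table access control can look like, assuming table ACLs are enabled on the workspace and spark is in scope; the table and group names are placeholders:

# Grant read access on a table to a synced AD group; names are illustrative.
spark.sql("GRANT SELECT ON TABLE db1.tb1 TO `data-scientists`")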
26. JupyterHub on AKS
• Multi-user hub that spawns Jupyter notebook servers
• Deployed on Azure Kubernetes Service (see the config sketch below)
• Targeting non-distributed workloads
• Preferred solution for non-distributed GPU needs (TensorFlow)
• Easy access to data via Databricks
Benefits:
• Flexible
• Cloud independent
• Data source connectivity
• Low cost
• Open source
• Python & R notebooks
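A minimal jupyterhub_config.py sketch, assuming JupyterHub with the kubespawner package on AKS; the images, profile names, and resource limits are placeholders, not the deck's actual deployment:

# Spawn per-user notebook servers as Kubernetes pods.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Offer CPU and GPU server options, matching the deck's focus on
# non-distributed (including GPU/TensorFlow) workloads.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Python & R (CPU)",
        "kubespawner_override": {
            "image": "jupyter/datascience-notebook:latest",
        },
    },
    {
        "display_name": "TensorFlow (GPU)",
        "kubespawner_override": {
            "image": "jupyter/tensorflow-notebook:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]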