This document summarizes Walmart's transition to building an enterprise data platform on Azure Databricks to enable machine learning and data science at scale. Previously, Walmart had a complex and slow legacy technology stack. The new platform goals were to centralize data in the cloud, increase productivity with data science tools, and reduce costs. Key aspects of the new platform included using Azure and Databricks for data processing and machine learning, Airflow for orchestration, and building several machine learning models for applications like fraud detection and product recommendations. Challenges in the transition included optimizing performance and managing resources across the platforms.
2. Building an Enterprise Data Platform with Azure Databricks to Enable Machine Learning and Data Science at Scale
Andrew Ray
Craig Covey
#UnifiedAnalytics #SparkAISummit
5. Previous State
Technology managed by Walmart:
• 10+ big Hadoop clusters
• Multiple large instances of Teradata
• Data dispersed in operational DBs
– Oracle, MySQL, MS SQL, DB2, Informix, SAP
13. Airflow
• 3 Airflow environments (dev, stage, prod)
• Hosted in Walmart's internal cloud
• Code in GitHub, running on Docker
• 180+ DAGs
• 600,000,000,000+ rows
• 150+ TB
• 6 sources
• Zero to prod in ~2 months
14. Airflow DAG Workflow
• Create DAGs dynamically (a sketch follows this list)
• Various Hadoop jobs move the data
• Send data to Azure Blob in two regions in parallel
• Databricks notebooks verify that the source and Azure data are statistically nearly equal
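A minimal sketch of what dynamic DAG generation can look like, assuming Airflow 2.x; the table list, DAG ids, and echo commands are illustrative placeholders, not the deck's actual pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative config; in practice this could come from a metadata store.
SOURCE_TABLES = ["orders", "inventory", "returns"]

for table in SOURCE_TABLES:
    dag_id = f"copy_{table}_to_blob"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="hadoop_extract",
            bash_command=f"echo extract {table} from Hadoop",
        )
        # Mirror the slide's "two regions in parallel" copy step.
        copy_east = BashOperator(
            task_id="copy_to_blob_east",
            bash_command=f"echo copy {table} to East region Blob",
        )
        copy_west = BashOperator(
            task_id="copy_to_blob_west",
            bash_command=f"echo copy {table} to West region Blob",
        )
        extract >> [copy_east, copy_west]

    # Expose each DAG at module scope so the scheduler discovers it.
    globals()[dag_id] = dag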
16. Databricks Workspaces
• Data lives in Blob Storage
• Tables in Databricks are created with a specific path in DBFS that is mounted to a container in Azure Blob Storage
• Multiple Databricks workspaces to segregate access
• Blob containers are mounted with read-only SAS tokens to protect data (see the mount sketch below)
• Create Delta tables:
CREATE TABLE db1.tb1
USING delta
LOCATION '/mnt/general/...'
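A minimal sketch of the read-only SAS mount, assuming it runs in a Databricks notebook (where dbutils is in scope); the storage account, container, and secret scope names are placeholders:

# Mount a Blob container read-only by supplying a read-only SAS token;
# the account, container, and secret names below are illustrative.
dbutils.fs.mount(
    source="wasbs://general@myaccount.blob.core.windows.net",
    mount_point="/mnt/general",
    extra_configs={
        "fs.azure.sas.general.myaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="blob-sas", key="general-readonly-sas")
    },
)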
17. Databricks Notebooks
• Hybrid of Jupyter notebooks and Google Docs
• Allows for collaborative editing
– Easy technical support
– Remote pair programming
• Multiple languages in one notebook
18. Databricks Delta
Next-gen format built on top of Spark for batch and streaming big data use cases
• Performance: queries run fast (e.g., one query went from 1+ hours to <6 seconds)
• Transactional updates: no downtime or consistency issues during updates
• Change Data Capture: MERGE makes it easy (see the sketch below)
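A minimal sketch of change data capture via Delta Lake MERGE, assuming a Databricks notebook where spark is in scope; the target table and the updates view are placeholder names:

# Upsert a batch of changed rows into a Delta table; db1.tb1 and the
# `updates` view are illustrative names.
spark.sql("""
    MERGE INTO db1.tb1 AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")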
21. Assortment
Purpose: select products and inventory levels to maximize profit/sales within finite shelf space
Machine Learning: Expectation-Maximization (EM), CatBoost (boosting to estimate true demand; see the sketch below)
Production: Databricks to manipulate data, train ML models, and run production scoring jobs
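An illustrative sketch only, assuming the catboost Python package; the toy features and targets stand in for the real demand data, which the deck does not describe:

import pandas as pd
from catboost import CatBoostRegressor

# Toy training frame; in production this would be read from Delta tables.
train = pd.DataFrame({
    "store_id": ["s1", "s1", "s2", "s2"],
    "item_id": ["a", "b", "a", "b"],
    "shelf_space": [4, 2, 3, 1],
    "units_sold": [120, 40, 95, 15],
})
features = ["store_id", "item_id", "shelf_space"]

# Gradient-boosted trees to estimate demand from assortment features.
model = CatBoostRegressor(iterations=200, depth=4, verbose=False)
model.fit(train[features], train["units_sold"],
          cat_features=["store_id", "item_id"])
demand_estimate = model.predict(train[features])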
24. Access Control
Databricks role-based access control:
• CURRENT: separate workspaces for projects that need special data
• PHASE 1: provision/deprovision users to workspaces with AD groups
• PHASE 2: sync groups for use with table access control (see the sketch below)
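A minimal sketch of what phase 2's table access control can look like, assuming table ACLs are enabled on the workspace and spark is in scope; the table and group names are placeholders:

# Grant read access on a table to a synced AD group; names are illustrative.
spark.sql("GRANT SELECT ON TABLE db1.tb1 TO `data-scientists`")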
26. JupyterHub on AKS
• Multi-user hub that spawns Jupyter notebook servers
• Deployed on Azure Kubernetes Service (see the config sketch below)
• Targeting non-distributed workloads
• Preferred solution for non-distributed GPU needs (TensorFlow)
• Easy access to data via Databricks
Benefits:
• Flexible
• Cloud independent
• Data source connectivity
• Low cost
• Open source
• Python & R notebooks
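A minimal jupyterhub_config.py sketch, assuming JupyterHub with the kubespawner package on AKS; the images, profile names, and resource limits are placeholders, not the deck's actual deployment:

# Spawn per-user notebook servers as Kubernetes pods.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Offer CPU and GPU server options, matching the deck's focus on
# non-distributed (including GPU/TensorFlow) workloads.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Python & R (CPU)",
        "kubespawner_override": {
            "image": "jupyter/datascience-notebook:latest",
        },
    },
    {
        "display_name": "TensorFlow (GPU)",
        "kubespawner_override": {
            "image": "jupyter/tensorflow-notebook:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]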