We will share our experiences in building Data Science and Machine Learning (DS/ML) into organizations. As new DS/ML teams are created, many wrestle with questions such as: How can we most efficiently achieve short-term goals while planning for scale and production long-term? How should DS/ML be incorporated into a company?
We will bring unique perspectives: one as a former Databricks customer leading a DS team, one as the second ML engineer at Databricks, and both as current Solutions Architects guiding customers through their DS/ML journeys. We will cover best practices through the crawl-walk-run journey of DS/ML: how to immediately become more productive with an initial team, how to scale and move towards production when needed, and how to integrate effectively with the broader organization.
This talk is meant for technical leaders who are building new DS/ML teams or helping to spread DS/ML practices across their organizations. Technology discussion will focus on Databricks, but the lessons apply to any tech platform in this space.
Building Data Science into Organizations: Field Experience
1. Building Data Science into Organizations: Field Experience
Chris Robison
Joseph Bradley
Data + AI Summit 2021
2. Our perspectives
Joseph Bradley
● Sr. Solutions Architect
● 2nd ML Engineer at Databricks
● Apache Spark committer and PMC member
Chris Robison
● Sr. Solutions Architect
● Former Director of Data Science and Omni-channel Marketing at Overstock.com
● Career data scientist and avid Apache Spark user
4. So you want to do Data Science...
98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives.
14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production.
Source: New Vantage Partners
5. Goals of a DS/ML/AI program
Short-term
● Validate that DS is worthwhile
● Get resources:
○ Data
○ Data Scientists
○ Executive sponsorship
● Show vision
Long-term
● Show business impact
● Increase productivity
● Scale DS across the organization
6. Challenges of a DS/ML/AI program
Technology and platform
● Poor integration between Data Science and other data teams
● Planning for scale and production, under investment constraints
Organization
● Team building: skill sets, hiring, and training
● Team organization: embedded vs. standalone
● Business and executive alignment
● R&D
7. Crawl Walk Run
Platform
● Use popular tools. Emphasize productivity.
● Scale and automate DS/ML workloads.
● Set up reliable and efficient production processes.
Organization
● Build communication channels. Think about products.
● Improve executive visibility and cross-team integration.
● Embed Data Science in the organization’s DNA.
Philosophy / Strategy
● Get quick wins. Plan for the future.
● Reproduce end-to-end and across multiple verticals.
8. Execution
Use agile processes for data science
● Iterate with sprints and standups
● Fail fast in R&D
Transparency is key
● Communicate frequently to your business partners and executives
● Make business partners and consumers an integral part of the process
Collaborate with the data and platform teams
● Make your needs known and understood
● Beware shortcuts which build technical debt
10. Organization building -- “Crawl” stage
ML/AI Success
● Successful MVPs with a few models manually in production
● Starting to build an AI/ML Strategy
● In discovery phase for new projects and low-hanging fruit
Company
● Desire to become data driven
● Smaller in size (startup) or an existing organization with new data initiatives
Team
● 1-2 Data Scientists (likely) reporting to a CTO
● Acting as full stack data scientists
● Typically a math or computer science background
11. Keep using familiar tools
Common tools | Descriptions
Notebooks and IDEs | Python notebooks, RStudio, local IDEs
Languages | Python, R -- and potentially SQL, Scala, Java, etc.
ML libraries | Standard libraries, plus bring-your-own libraries and versions
Git | Notebook versioning, and syncing across platforms with Git
Data | Pandas, Spark, Koalas; any data sources or formats
Visualization | Matplotlib, Plotly, Seaborn, etc.
Integrations | Platforms must integrate with any libraries, systems, or services.
Platforms which are cloud-native and have both UIs and APIs are ideal.
12. Build around OSS standards for portability
# Downloads / month: 990K, 350K, 1.7M, 516K
13. Be more productive with self-service analytics
Compute resources
● Start up machines or clusters on demand
● Option 1: Use your own cluster
● Option 2: Share clusters, with separate Python env per user or project.
Libraries and environment
● Plug & play environments with popular ML libraries
● And customization: requirements.txt, conda.yaml
Cost controls: Autoscaling, auto-termination, spot instances, cost tracking
Governance: Cluster policies for enforcement
14. Running example: ML prioritization of Sales opps
● Platform enablement and improvement
● Customer history and Sales data access
● Long-term platform and data pipeline planning
Get an ML workspace: Simple machine or cluster creation. Ready-to-go DS environments.
Sync code: Import .py or .ipynb notebook, and sync with Git.
Load data: Efficient data loading from S3, ADLS, etc.
Develop DL model: Use notebooks + TensorBoard for interactive development.
Analyze results: Review auto-logged MLflow metrics to analyze model performance.
Share results: Share insights with other stakeholders.
● Discussion with Sales stakeholders to understand the problem and data, and to set expectations
● Explanation of results and future potential to Sales
● Build executive alignment and buy-in for long-term initiatives
● DS team training and hiring
16. Organization building -- “Walk” stage
ML/AI Success
● Successful MVPs and production models in multiple business units
● Uniform testing standards are being established
Company
● Data initiatives being discussed at the executive level
● Business units pushing for data projects
● Emerging business champions for AI/ML
Team
● Data Science team(s) supporting multiple business units
● Integrations with software engineering for production
● Diversifying skill-sets for domain expertise
19. Automate and reproduce wherever possible
Auto-logging for reproducibility
Reproducibility checklist (Reproduce Run feature):
✓ Code versioning
✓ Data versioning
✓ Cluster configuration
✓ Environment specification
Job scheduling in platform
Automation: schedule, alert, retry, API
Secure: IAM Passthrough | Cluster Policies | Table ACLs
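The reproducibility checklist above can be captured as a small record saved with each training run. The following is a minimal sketch in plain Python, not the platform's Reproduce Run feature. The `RunRecord` class, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, asdict, field
import hashlib
import json


@dataclass
class RunRecord:
    """Hypothetical record of what is needed to reproduce one training run."""
    code_version: str = ""                               # e.g. a Git commit hash
    data_version: str = ""                               # e.g. a table version or snapshot
    cluster_config: dict = field(default_factory=dict)   # e.g. node type, worker count
    environment: tuple = ()                              # pinned packages, as in requirements.txt

    def missing_items(self):
        """Return the names of checklist items that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

    def fingerprint(self):
        """Stable hash of the record, so two runs can be compared quickly."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only; none of these come from a real project.
record = RunRecord(
    code_version="3f2a9c1",
    data_version="sales_opps@v12",
    cluster_config={"node_type": "example-node", "workers": 4},
    environment=("scikit-learn==1.0.2", "pandas==1.3.5"),
)
incomplete = RunRecord(code_version="3f2a9c1")
```

A run with an empty `missing_items()` list has all four checklist items recorded. Comparing `fingerprint()` values is one simple way to tell whether two runs used the same code, data, cluster, and environment.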
20. Your Existing Data Lake
Ingestion, Tables, Data Catalog, Feature Store
Azure Data Lake Storage, Amazon S3
Streaming, Batch, 3rd Party Data Marketplace, Files
for Data Science and ML
● Schema enforced high quality data
● Optimized performance
● Full data lineage / governance
● Reproducibility through time travel
ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs
Infrastructure
Data Engineering, Data Science, ML Engineer
21. Running example: ML-driven products
Scale up or out: Larger machines. Multiple GPUs. Distributed training.
Schedule training and inference jobs: Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks.
Automate for downstream consumption: Integrate with 3rd-party tools and systems to export ML insights to business stakeholders.
Integrate with data pipelines: Automate ingestion of new data for ML and output of ML insights for business/product.
Improve modeling process: Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging.
● Executive <> Data Science team alignment on data-driven initiatives
● Knowledge sharing across business units for ML-driven projects
● Education for business stakeholders to understand ML models and insights
● Platform adoption by multiple business units
● Increased governance needs for platform, covering needs of more business units and personas
● Platform plays a key role in establishing best practices
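The "schedules, retries, and alerts" and "model validation checks" items in this running example can be illustrated with a small plain-Python sketch. It shows the control flow only and is not a platform API; in practice the platform's job scheduler would provide these features. The names `run_with_retries` and `validate_model`, and the 0.8 accuracy threshold, are hypothetical.

```python
import time


def run_with_retries(task, max_retries=2, delay_seconds=0.0, alert=print):
    """Run task(); retry on failure, and send an alert if every attempt fails."""
    last_error = None
    for attempt in range(1, max_retries + 2):  # first try plus the retries
        try:
            return task()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt <= max_retries:
                time.sleep(delay_seconds)
    alert(f"Job failed after {max_retries + 1} attempts: {last_error}")
    raise last_error


def validate_model(metrics, min_accuracy=0.8):
    """Hypothetical validation check: reject models below an accuracy threshold."""
    if metrics["accuracy"] < min_accuracy:
        raise ValueError(
            f"accuracy {metrics['accuracy']:.2f} is below {min_accuracy:.2f}"
        )
    return metrics
```

A training job would typically be wrapped as `run_with_retries(train_job)`, with `validate_model` called on its metrics before anything is published downstream. Passing a different `alert` callable (for example one that sends an email) changes where failures are reported.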
23. Organization building -- “Run” stage
ML/AI Success
● Successful production models in multiple verticals
● Uniform testing standards established
● Program to grow citizen data scientists
Company
● Data initiatives are reported at the board level
● Data driven decision making across an organization
Team
● Multiple Data Science teams across verticals led by an AI executive
● Standard development and deployment processes for models
● COE across verticals
24. model lifecycle
Data Scientists, Deployment Engineers
Tracking: Parameters, Metrics, Artifacts, Metadata, Models
Models: Flavor 1, Flavor 2, Custom Models
Model Registry: v1, v2; Staging, Production, Archived
Serving
Model Deployment Options: In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions
25. Example of ML Ops
● Training: Create model version in the Model Registry
● Webhook for new model versions in staging: triggers the Model Validation Job
● Model Validation Job: Comment with test results + transition request to production
● Email: ML Ops person receives email that transition request to production was made
● ML Ops person: Approve new production model
● Webhook for new model version in production: triggers the Production Batch Inference Job
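The flow on this slide can be sketched as a toy, in-memory registry with webhook-style callbacks. This is an illustration of the sequence of events only, not the MLflow Model Registry API. `ToyRegistry`, its method names, and the event names are hypothetical, and the sketch assumes new versions land directly in Staging.

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical in-memory stand-in for a model registry with webhooks."""

    def __init__(self):
        self.stages = {}                    # version -> current stage
        self.comments = defaultdict(list)   # version -> list of comments
        self.pending = set()                # versions awaiting approval
        self.hooks = defaultdict(list)      # event name -> callbacks

    def on(self, event, callback):
        """Register a webhook-style callback for an event."""
        self.hooks[event].append(callback)

    def _fire(self, event, version):
        for callback in self.hooks[event]:
            callback(self, version)

    def create_version(self, version):
        # Assumption: new versions start in Staging, which fires the staging hook.
        self.stages[version] = "Staging"
        self._fire("staging", version)

    def request_production(self, version, comment):
        self.comments[version].append(comment)
        self.pending.add(version)
        self._fire("transition_requested", version)

    def approve(self, version):
        if version not in self.pending:
            raise ValueError(f"no pending request for version {version}")
        self.pending.discard(version)
        # Archive whichever version was in Production before.
        for other, stage in self.stages.items():
            if stage == "Production":
                self.stages[other] = "Archived"
        self.stages[version] = "Production"
        self._fire("production", version)


log = []
registry = ToyRegistry()
# Validation job: runs its tests, then comments and requests the transition.
registry.on("staging", lambda r, v: r.request_production(v, "tests passed"))
# Email to the ML Ops person who approves transitions.
registry.on("transition_requested", lambda r, v: log.append(f"email: review v{v}"))
# The production batch inference job picks up the newly approved model.
registry.on("production", lambda r, v: log.append(f"batch inference with v{v}"))

registry.create_version(1)   # done by the training job
registry.approve(1)          # done by the ML Ops person
```

The human approval step stays explicit: nothing reaches Production until `approve` is called, while the validation job and the inference job react to registry events on their own.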
26. Modes of deployment
Model training, Model Tracking and Registry, Delta Lake / Feature Store, BI tools
Mode | Latency | Cost
Batch | Minutes | Low
Streaming | Sec - Min | Low - Med
REST API | < 1 Sec | High
Embedded | varies | varies
27. Repeatable Data Science lifecycle
● Business understanding
● Executive sponsorship
● Center of Excellence for DS & ML
● End user feedback
● Metric discussions and KPIs
● Business value realization
● Exploratory data analysis
● Data ingestion and preparation
● Model deployment and automation
● ML modeling
● Model monitoring and feedback
● ML and Data platform and pipeline integration
● Simple onboarding process for new teams and use cases
● Data and resource sharing and governance
● Standard handoff process for production jobs
● Sharable documentation and usage education
28. Resources to learn more
Related talks and blogs
▪ Building Machine Learning Platforms Webinar
▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features
Customer success stories
▪ Comcast, Starbucks, H&M
▪ Searchable customer stories
Databricks
▪ Data science and machine learning product page
▪ Managed MLflow product page