Building Data Science into Organizations: Field Experience
Chris Robison
Joseph Bradley
Data + AI Summit 2021
Our perspectives
Joseph Bradley
● Sr. Solutions Architect
● 2nd ML Engineer at Databricks
● Apache Spark committer and PMC member
Chris Robison
● Sr. Solutions Architect
● Former Director of Data Science and Omni-channel Marketing at Overstock.com
● Career data scientist and avid Apache Spark user
The Data and AI Company
5000+ customers across the globe
Lakehouse: One simple platform to unify all of your data, analytics, and AI workloads
Original creators [project logos not recovered]
So you want to do Data Science...
98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives.
14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production.
Source: New Vantage Partners
Goals of a DS/ML/AI program
Short-term
● Validate that DS is worthwhile
● Get resources:
○ Data
○ Data Scientists
○ Executive sponsorship
● Show vision
Long-term
● Show business impact
● Increase productivity
● Scale DS across the organization
Challenges of a DS/ML/AI program
Technology and platform
● Poor integration between Data Science and other data teams
● Planning for scale and production, under investment constraints
Organization
● Team building: skill sets, hiring, and training
● Team organization: embedded vs. standalone
● Business and executive alignment
● R&D
Set up reliable and efficient production processes.
Scale and automate DS/ML workloads.
Use popular tools.
Emphasize productivity.
Platform
Improve executive visibility and cross-team integration.
Build communication channels.
Think about products.
Get quick wins.
Plan for the future.
Philosophy
Strategy
Organization
Crawl Walk Run
Embed Data Science in the organization’s DNA.
Reproduce end-to-end and across multiple verticals.
Execution
Use agile processes for data science
● Iterate with sprints and standups
● Fail fast in R&D
Transparency is key
● Communicate frequently to your business partners and executives
● Make business partners and consumers an integral part of process
Collaborate with the data and platform teams
● Make your needs known and understood
● Beware shortcuts which build technical debt
Organization building -- “Crawl” stage
ML/AI Success
● Successful MVPs with a few models manually in production
● Starting to build an AI/ML Strategy
● In discovery phase for new projects and low-hanging fruit
Company
● Desire to become data driven
● Smaller in size (startup) or an existing organization with new data initiatives
Team
● 1-2 Data Scientists (likely) reporting to a CTO
● Acting as full stack data scientists
● Typically a math or computer science background
Keep using familiar tools
Common tools | Descriptions
Notebooks and IDEs | Python notebooks, RStudio, local IDEs
Languages | Python, R -- and potentially SQL, Scala, Java, etc.
ML libraries | Standard libraries, plus bring-your-own libraries and versions
Git | Notebook versioning, and syncing across platforms with Git
Data | Pandas, Spark, Koalas; any data sources or formats
Visualization | Matplotlib, Plotly, Seaborn, etc.
Integrations | Platforms must integrate with any libraries, systems, or services. Platforms which are cloud-native and have both UIs and APIs are ideal.
Build around OSS standards for portability
[Figure: # downloads / month for four projects: 990K, 350K, 1.7M, 516K; project logos not recovered]
Be more productive with self-service analytics
Compute resources
● Start up machines or clusters on demand
● Option 1: Use your own cluster
● Option 2: Share clusters, with separate Python env per user or project.
● Cost controls: Autoscaling, auto-termination, spot instances, cost tracking
● Governance: Cluster policies for enforcement
Libraries and environment
● Plug & play environments with popular ML libraries
● And customization: requirements.txt, conda.yaml
Running example: ML prioritization of Sales opps
Get an ML workspace: Simple machine or cluster creation. Ready-to-go DS environments.
Sync code: Import .py or .ipynb notebook, and sync with Git.
Load data: Efficient data loading from S3, ADLS, etc.
Develop DL model: Use notebooks + TensorBoard for interactive development.
Analyze results: Review auto-logged MLflow metrics to analyze model performance.
Share results: Share insights with other stakeholders.
Discussion with Sales stakeholders to understand the problem and data, and to set expectations
Explanation of results and future potential to Sales
Build executive alignment and buy-in for long-term initiatives
DS team training and hiring
Platform enablement and improvement
Customer history and Sales data access
Long-term platform and data pipeline planning
Organization building -- “Walk” stage
ML/AI Success
● Successful MVPs and production models in multiple business units
● Uniform testing standards are being established
Company
● Data initiatives being discussed at the executive level
● Business units pushing for data projects
● Emerging business champions for AI/ML
Team
● Data Science team(s) supporting multiple business units
● Integrations with software engineering for production
● Diversifying skill-sets for domain expertise
Scaling in a typical machine learning workflow
Data Preparation
Feature Engineering
Model Training
Model Evaluation
Model Deployment
Model Tuning
Model Consumption
● Koalas
● Spark DataFrames
● Spark UDFs
● Larger instances
● GPUs
● Distributed training (Spark ML, HorovodRunner, etc.)
● Hyperopt
● MLflow
● Spark DataFrames & UDFs
● Jobs & Model Servers
● MLflow
Automate and reproduce wherever possible
Auto-logging for reproducibility
Reproducibility checklist (✓ = covered by the Reproduce Run feature):
✓ Code versioning
✓ Data versioning
✓ Cluster configuration
✓ Environment specification
Job scheduling in platform
Automation: schedule, alert, retry, API
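One way to picture the reproducibility checklist is as a small record stored with every run. The sketch below is illustrative only and uses just the Python standard library: `RunRecord`, `capture_environment`, `fingerprint`, and all sample values are hypothetical names and placeholders, not part of the talk or of any platform API.

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunRecord:
    """The four checklist items captured for one run (hypothetical structure)."""
    code_version: str      # e.g. a Git commit hash
    data_version: str      # e.g. a table version or snapshot ID
    cluster_config: dict   # machine type, worker count, ...
    environment: dict      # interpreter and library versions


def capture_environment() -> dict:
    """Record the interpreter details available from the standard library."""
    return {"python": platform.python_version(), "platform": sys.platform}


def fingerprint(record: RunRecord) -> str:
    """Hash the record so two runs can be compared for identical inputs."""
    payload = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values, not taken from the talk.
baseline = RunRecord(
    code_version="git:abc1234",
    data_version="sales_opps@v12",
    cluster_config={"workers": 4, "instance_type": "example.large"},
    environment=capture_environment(),
)
rerun = replace(baseline)                                   # same inputs
drifted = replace(baseline, data_version="sales_opps@v13")  # data changed

print(fingerprint(baseline) == fingerprint(rerun))    # compares identical inputs
print(fingerprint(baseline) == fingerprint(drifted))  # compares differing data versions
```

If any of the four items is missing from the record, two runs can share a fingerprint while still differing in practice, which is the case the checklist is meant to rule out.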
[Product logo not recovered] for Data Science and ML
● Schema enforced high quality data
● Optimized performance
● Full data lineage / governance
● Reproducibility through time travel
[Architecture diagram; recoverable labels: Streaming, Batch, 3rd Party Data Marketplace, Files; Your Existing Data Lake (Azure Data Lake Storage, Amazon S3); Ingestion, Tables, Data Catalog, Feature Store; ML Runtime; Infrastructure; Data Engineering, Data Science, ML Engineer]
Secure: IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs
Running example: ML-driven products
Scale up or out: Larger machines. Multiple GPUs. Distributed training.
Schedule training and inference jobs: Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks.
Automate for downstream consumption: Integrate with 3rd-party tools and systems to export ML insights to business stakeholders.
Integrate with data pipelines: Automate ingestion of new data for ML and output of ML insights for business/product.
Improve modeling process: Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging.
Executive <> Data Science team alignment on data-driven initiatives
Knowledge sharing across business units for ML-driven projects
Education for business stakeholders to understand ML models and insights
Platform adoption by multiple business units
Increased governance needs for platform, covering needs of more business units and personas
Platform plays a key role in establishing best practices
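The "schedules, retries, and alerts" step in this example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a platform job API: `run_with_retries`, the flaky job, and the alert callback are all hypothetical, and a real scheduler would supply these features itself.

```python
import time
from typing import Callable, List


def run_with_retries(
    job: Callable[[], str],
    max_retries: int,
    alert: Callable[[str], None],
    backoff_seconds: float = 0.0,
) -> str:
    """Run `job`, retrying on failure and alerting once retries are exhausted."""
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:  # broad on purpose: any failure counts
            if attempt == attempts:
                alert(f"job failed after {attempts} attempts: {exc}")
                raise
            time.sleep(backoff_seconds)


# Demo 1: a hypothetical training job that fails twice, then succeeds.
calls = {"count": 0}


def flaky_training_job() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient cluster error")
    return "model trained"


alerts: List[str] = []
result = run_with_retries(flaky_training_job, max_retries=3, alert=alerts.append)


# Demo 2: a job that never succeeds triggers exactly one alert.
def always_fails() -> str:
    raise ValueError("bad input data")


failure_alerts: List[str] = []
try:
    run_with_retries(always_fails, max_retries=1, alert=failure_alerts.append)
    gave_up = False
except ValueError:
    gave_up = True
```

Alerting only after the last attempt keeps transient failures quiet while still surfacing jobs that need a person to look at them.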
Organization building -- “Run” stage
ML/AI Success
● Successful production models in multiple verticals
● Uniform testing standards established
● Program to grow citizen data scientists
Company
● Data initiatives are reported at the board level
● Data driven decision making across an organization
Team
● Multiple Data Science teams across verticals led by an AI executive
● Standard development and deployment processes for models
● COE across verticals
model lifecycle
[Diagram; recoverable labels]
Tracking: Parameters, Metrics, Artifacts, Metadata, Models
Models: Flavor 1, Flavor 2, Custom Models
Model Registry: v1, v2; Staging, Production, Archived; Data Scientists, Deployment Engineers
Serving / Model Deployment Options: In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions
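The Staging, Production, and Archived stages in the diagram behave like a small state machine over model versions. The toy registry below is a sketch of that idea only: `ToyModelRegistry` is an in-memory stand-in, not a real registry client, and both the "Unassigned" starting stage and the one-Production-version-at-a-time rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# "Unassigned" is an assumed starting stage; the others come from the slide.
STAGES = ("Unassigned", "Staging", "Production", "Archived")


@dataclass
class ToyModelRegistry:
    """In-memory stand-in for a model registry (not a real registry client)."""
    stages: Dict[int, str] = field(default_factory=dict)  # version -> stage

    def create_version(self) -> int:
        version = len(self.stages) + 1
        self.stages[version] = "Unassigned"
        return version

    def transition(self, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if version not in self.stages:
            raise KeyError(version)
        if stage == "Production":
            # Assumption: only one Production version; archive the previous one.
            for other, current in self.stages.items():
                if current == "Production" and other != version:
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def production_version(self) -> Optional[int]:
        for version, stage in self.stages.items():
            if stage == "Production":
                return version
        return None


registry = ToyModelRegistry()
v1 = registry.create_version()
registry.transition(v1, "Staging")
registry.transition(v1, "Production")
v2 = registry.create_version()
registry.transition(v2, "Staging")
registry.transition(v2, "Production")  # v1 is archived as a side effect
```

Consumers that always ask for `production_version()` pick up v2 without any code change, which is the handoff between data scientists and deployment engineers that the diagram describes.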
Example of ML Ops
Components: Training, Model Registry, Model Validation Job, Production Batch Inference Job, Email
● Create model version
● Webhook for new model versions in staging
● Comment with test results + transition request to production
● ML Ops person receives email that transition request to production was made
● Approve new production model
● Webhook for new model version in production
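The webhook-driven flow on this slide can be approximated with a tiny publish/subscribe loop. Everything here is a hypothetical stand-in: `EventBus`, the event names, the 0.8 accuracy threshold, and the `inbox` list that plays the role of the ML Ops person's email are assumptions for illustration, not a real webhook API.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Minimal publish/subscribe stand-in for registry webhooks."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self._handlers.get(event, []):
            handler(payload)


bus = EventBus()
log: List[str] = []
inbox: List[str] = []  # stands in for the ML Ops person's email


def validation_job(payload: dict) -> None:
    # Hypothetical check: accept versions whose accuracy clears a threshold.
    passed = payload["accuracy"] >= 0.8
    log.append(f"v{payload['version']} validation passed={passed}")
    if passed:
        bus.publish("transition_requested", payload)


def notify_mlops(payload: dict) -> None:
    inbox.append(f"transition to production requested for v{payload['version']}")


def batch_inference_job(payload: dict) -> None:
    log.append(f"batch inference now uses v{payload['version']}")


bus.subscribe("version_in_staging", validation_job)
bus.subscribe("transition_requested", notify_mlops)
bus.subscribe("version_in_production", batch_inference_job)

# Training creates a model version and moves it to staging.
bus.publish("version_in_staging", {"version": 2, "accuracy": 0.91})
# A person approves after reading the email, which fires the production hook.
if inbox:
    bus.publish("version_in_production", {"version": 2})
```

The approval step stays manual in this sketch, matching the slide, while validation and the switch of the batch inference job run without anyone triggering them by hand.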
Modes of deployment
Mode | Latency | Cost
Batch | Minutes | Low
Streaming | Sec - Min | Low - Med
REST API | < 1 Sec | High
Embedded | varies | varies
[Diagram; other recoverable labels: Model training, Model Tracking and Registry, Delta Lake / Feature Store, BI tools]
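The latency and cost trade-off above can be turned into a simple lookup when choosing a mode. The sketch below is illustrative: the latency and cost labels come from the slide, but the numeric `best_case_seconds` values are assumptions added so the comparison can run, and Embedded is left out because the slide lists it as "varies".

```python
from typing import List, NamedTuple


class Mode(NamedTuple):
    name: str
    latency: str              # label from the slide
    cost: str                 # label from the slide
    best_case_seconds: float  # assumed numeric floor, for illustration only


# Ordered from lowest to highest cost, following the slide's cost labels.
MODES = [
    Mode("Batch", "Minutes", "Low", 60.0),
    Mode("Streaming", "Sec - Min", "Low - Med", 1.0),
    Mode("REST API", "< 1 Sec", "High", 0.0),
]


def candidate_modes(latency_budget_seconds: float) -> List[str]:
    """Return modes whose assumed best-case latency fits the budget, cheapest first."""
    return [m.name for m in MODES if m.best_case_seconds <= latency_budget_seconds]


print(candidate_modes(0.5))    # sub-second budget
print(candidate_modes(5.0))    # a few seconds
print(candidate_modes(600.0))  # ten minutes
```

Taking the first candidate gives the cheapest mode that still meets the latency requirement, which is usually the sensible default before paying for always-on serving.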
Repeatable Data Science lifecycle
Business understanding
Executive sponsorship
Center of Excellence for DS & ML
End user feedback
Metric discussions and KPIs
Business value realization
Exploratory data analysis
Data ingestion and preparation
Model deployment and automation
ML modeling
Model monitoring and feedback
ML and Data platform and pipeline integration
Simple onboarding process for new teams and use cases
Data and resource sharing and governance
Standard handoff process for production jobs
Sharable documentation and usage education
Resources to learn more
Related talks and blogs
▪ Building Machine Learning Platforms Webinar
▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features
Customer success stories
▪ Comcast, Starbucks, H&M
▪ Searchable customer stories
Databricks
▪ Data science and machine learning product page
▪ Managed MLflow product page
  • 6. Technology and platform ● Poor integration between Data Science and other data teams ● Planning for scale and production, under investment constraints Organization ● Team building: skill sets, hiring, and training ● Team organization: embedded vs. standalone ● Business and executive alignment ● R&D Challenges of a DS/ML/AI program
  • 7. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.
  • 8. Execution Use agile processes for data science ● Iterate with sprints and standups ● Fail fast in R&D Transparency is key ● Communicate frequently to your business partners and executives ● Make business partners and consumers an integral part of process Collaborate with the data and platform teams ● Make your needs known and understood ● Beware shortcuts which build technical debt
  • 9. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.
  • 10. ML/AI Success ● Successful MVPs with a few models manually in production ● Starting to build an AI/ML Strategy ● In discovery phase for new projects and low-hanging fruit Company ● Desire to become data driven ● Smaller in size (startup) or an existing organization with new data initiatives Team ● 1-2 Data Scientists (likely) reporting to a CTO ● Acting as full stack data scientists ● Typically a math or computer science background Organization building -- “Crawl” stage
  • 11. Keep using familiar tools. Common tools and descriptions: ● Notebooks and IDEs: Python notebooks, RStudio, local IDEs ● Languages: Python, R, and potentially SQL, Scala, Java, etc. ● ML libraries: Standard libraries, plus bring-your-own libraries and versions ● Git: Notebook versioning, and syncing across platforms with Git ● Data: Pandas, Spark, Koalas; any data sources or formats ● Visualization: Matplotlib, Plotly, Seaborn, etc. ● Integrations: Platforms must integrate with any libraries, systems, or services. Platforms which are cloud-native and have both UIs and APIs are ideal.
  • 12. Build around OSS standards for portability # Downloads / month 990K 350K 1.7M 516K
  • 13. Be more productive with self-service analytics. Compute resources: ● Start up machines or clusters on demand ● Option 1: Use your own cluster ● Option 2: Share clusters, with separate Python env per user or project ● Cost controls: Autoscaling, auto-termination, spot instances, cost tracking ● Governance: Cluster policies for enforcement. Libraries and environment: ● Plug & play environments with popular ML libraries ● And customization: requirements.txt, conda.yaml
  • 14. Running example: ML prioritization of Sales opps Platform enablement and improvement Customer history and Sales data access Long-term platform and data pipeline planning Develop DL model Use notebooks + TensorBoard for interactive development. Analyze results Review auto-logged MLflow metrics to analyze model performance. Load data Efficient data loading from S3, ADLS, etc. Get an ML workspace Simple machine or cluster creation. Ready-to-go DS environments. Share results Share insights with other stakeholders Sync code Import .py or .ipynb notebook, and sync with Git. Discussion with Sales stakeholders to understand the problem and data, and to set expectations Explanation of results and future potential to Sales Build executive alignment and buy-in for long-term initiatives DS team training and hiring
  • 15. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.
  • 16. ML/AI Success ● Successful MVPs and production models in multiple business units ● Uniform testing standards are being established Company ● Data initiatives being discussed at the executive level ● Business units pushing for data projects ● Emerging business champions for AI/ML Team ● Data Science team(s) supporting multiple business units ● Integrations with software engineering for production ● Diversifying skill-sets for domain expertise Organization building -- “Walk” stage
  • 18. Data Preparation Feature Engineering Model Training Model Evaluation Model Deployment Model Tuning Model Consumption ● Koalas ● Spark DataFrames ● Spark UDFs ● Larger instances ● GPUs ● Distributed training (Spark ML, HorovodRunner, etc.) ● Hyperopt ● MLflow ● Spark DataFrames & UDFs ● Jobs & Model Servers ● MLflow Scaling in a typical machine learning workflow
  • 19. Automate and reproduce wherever possible. Auto-logging for reproducibility. Reproducibility checklist (Reproduce Run feature: ✓ for each item): ● Code versioning ● Data versioning ● Cluster configuration ● Environment specification. Job scheduling in platform. Automation: schedule, alert, retry, API. Secure: IAM Passthrough | Cluster Policies | Table ACLs
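The four-item reproducibility checklist on this slide can be sketched as a small record that a run either fills in or does not. This is only an illustration in plain Python: `RunRecord`, `missing_items`, and `is_reproducible` are hypothetical names, not part of MLflow or any platform API, and the example values in the comments are assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what a run needs in order to be reproduced."""
    code_version: Optional[str] = None      # e.g. a Git commit hash
    data_version: Optional[str] = None      # e.g. a table version or snapshot ID
    cluster_config: Optional[str] = None    # e.g. instance type and worker count
    environment_spec: Optional[str] = None  # e.g. requirements.txt or conda.yaml


def missing_items(record: RunRecord) -> List[str]:
    """Return the checklist items that were not captured for this run."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def is_reproducible(record: RunRecord) -> bool:
    """A run passes the checklist only when all four items are recorded."""
    return not missing_items(record)
```

Auto-logging, as the slide suggests, amounts to filling in such a record without asking the data scientist to do it by hand.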
  • 20. Your Existing Data Lake Ingestion Tables Data Catalog Feature Store Azure Data Lake Storage Amazon S3 Streaming Batch 3rd Party Data Marketplace Files for Data Science and ML ● Schema enforced high quality data ● Optimized performance ● Full data lineage / governance ● Reproducibility through time travel ML Runtime IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs Infrastructure Data Engineering Data Science ML Engineer
  • 21. Running example: ML-driven products Scale up or out Larger machines. Multiple GPUs. Distributed training. Schedule training and inference jobs Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks. Automate for downstream consumption Integrate with 3rd-party tools and systems to export ML insights to business stakeholders Integrate with data pipelines Automate ingestion of new data for ML and output of ML insights for business/product Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging. Improve modeling process Executive <> Data Science team alignment on data-driven initiatives Knowledge sharing across business units for ML-driven projects Education for business stakeholders to understand ML models and insights Platform adoption by multiple business units Increased governance needs for platform, covering needs of more business units and personas Platform plays a key role in establishing best practices
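The "schedules, retries, and alerts" plus "model validation checks" pattern from this slide can be sketched in plain Python. Everything here is hypothetical (`run_with_retries`, the `alert` callback, the metric threshold); on a real platform, retries and alerts would normally be job configuration rather than hand-written code.

```python
from typing import Callable


def run_with_retries(job: Callable[[], float],
                     validate: Callable[[float], bool],
                     max_retries: int,
                     alert: Callable[[str], None]) -> float:
    """Run `job`, retrying on exceptions, then apply a validation check.

    `job` returns a single evaluation metric (an assumption made to keep
    the sketch small). A failed validation is not retried.
    """
    last_error = None
    # One initial attempt plus up to `max_retries` retries.
    for attempt in range(1, max_retries + 2):
        try:
            metric = job()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            alert(f"attempt {attempt} failed: {exc}")
            continue
        if not validate(metric):
            alert(f"validation failed: metric={metric}")
            raise ValueError("model did not pass validation")
        return metric
    raise RuntimeError(f"job failed after {max_retries} retries") from last_error
```

Keeping the validation check separate from the retry loop means a transient infrastructure failure is retried, while a model that trains successfully but scores poorly is stopped before downstream consumers see it.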
  • 22. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.
  • 23. ML/AI Success ● Successful production models in multiple verticals ● Uniform testing standards established ● Program to grow citizen data scientists Company ● Data initiatives are reported at the board level ● Data driven decision making across an organization Team ● Multiple Data Science teams across verticals led by an AI executive ● Standard development and deployment processes for models ● COE across verticals Organization building -- “Run” stage
  • 24. model lifecycle Staging Production Archived Data Scientists Deployment Engineers v1 v2 Models Tracking Flavor 2 Flavor 1 Model Registry Custom Models In-Line Code Containers Batch & Stream Scoring Cloud Inference Services OSS Serving Solutions Serving Parameters Metrics Artifacts Models Metadata Model Deployment Options
  • 25. Example of ML Ops Training Model Validation Job Production Batch Inference Job Email Create model version Webhook for new model versions in staging Comment with test results + transition request to production Webhook for new model version in production ML Ops person receives email that transition request to production was made Approve new production model Model Registry
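The webhook-driven flow on this slide can be illustrated with a toy in-memory registry. This is a sketch only: `ToyRegistry` and its event names are hypothetical and are not the MLflow Model Registry API, and archiving the previous production version on approval is an assumption rather than something the slide states.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyRegistry:
    """In-memory stand-in for a model registry (hypothetical, not MLflow)."""

    def __init__(self) -> None:
        self.stage: Dict[int, str] = {}   # version -> current stage
        self.pending: List[int] = []      # versions awaiting approval
        self.hooks: Dict[str, List[Callable[[int], None]]] = defaultdict(list)

    def on(self, event: str, callback: Callable[[int], None]) -> None:
        """Register a webhook-like callback for an event name."""
        self.hooks[event].append(callback)

    def _fire(self, event: str, version: int) -> None:
        for callback in self.hooks[event]:
            callback(version)

    def create_version(self) -> int:
        """Training creates a new model version, which starts in Staging."""
        version = len(self.stage) + 1
        self.stage[version] = "Staging"
        self._fire("staging", version)
        return version

    def request_production(self, version: int) -> None:
        """Record a transition request and notify whoever must approve it."""
        self.pending.append(version)
        self._fire("transition_requested", version)

    def approve(self, version: int) -> None:
        """Promote a requested version; archive the old one (assumption)."""
        if version not in self.pending:
            raise ValueError("no transition request for this version")
        self.pending.remove(version)
        for v, s in self.stage.items():
            if s == "Production":
                self.stage[v] = "Archived"
        self.stage[version] = "Production"
        self._fire("production", version)


# Wiring that mirrors the slide: validation job, email, batch inference job.
registry = ToyRegistry()
log: List[str] = []


def validation_job(version: int) -> None:
    log.append(f"validated v{version}")
    registry.request_production(version)


registry.on("staging", validation_job)
registry.on("transition_requested", lambda v: log.append(f"email: review v{v}"))
registry.on("production", lambda v: log.append(f"batch inference with v{v}"))

v1 = registry.create_version()  # training creates a model version
registry.approve(v1)            # ML Ops person approves the request
```

The human approval step is the only manual action; everything before and after it is triggered by registry events.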
  • 26. Modes of deployment (diagram labels: Model training, Delta Lake / Feature Store, Model Tracking and Registry, BI tools). Mode, latency, cost: ● Batch: minutes, low ● Streaming: seconds to minutes, low to medium ● REST API: < 1 second, high ● Embedded: varies, varies
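As a rough illustration of the latency and cost trade-off on this slide, the choice of mode can be written as a lookup on the latency a consumer can tolerate. The numeric cut-offs (1 second and 60 seconds) are assumptions read off the slide's "< 1 Sec", "Sec - Min", and "Minutes" labels, and `pick_mode` is a hypothetical helper. Embedded deployment is left out because the slide lists both its latency and its cost as "varies".

```python
# Relative cost labels as given on the slide.
COST = {"Batch": "Low", "Streaming": "Low - Med", "REST API": "High"}


def pick_mode(max_latency_seconds: float) -> str:
    """Pick the cheapest listed mode that meets the latency requirement."""
    if max_latency_seconds < 1:
        return "REST API"   # sub-second responses need an online endpoint
    if max_latency_seconds < 60:
        return "Streaming"  # seconds to minutes
    return "Batch"          # minutes or longer is tolerable
```

For example, a nightly report that tolerates an hour of delay maps to `pick_mode(3600)`, which selects batch scoring, the lowest-cost option on the slide.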
  • 27. Repeatable Data Science lifecycle Business understanding Executive sponsorship Center of Excellence for DS & ML End user feedback Metric discussions and KPIs Business value realization Exploratory data analysis Data ingestion and preparation Model deployment and automation ML modeling Model monitoring and feedback ML and Data platform and pipeline integration Simple onboarding process for new teams and use cases Data and resource sharing and governance Standard handoff process for production jobs Sharable documentation and usage education
  • 28. Resources to learn more Related talks and blogs ▪ Building Machine Learning Platforms Webinar ▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features Customer success stories ▪ Comcast, Starbucks, H&M ▪ Searchable customer stories Databricks ▪ Data science and machine learning product page ▪ Managed MLflow product page
  • 29. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.