From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera

One Other Thing About Me

Data Science: Another Definition

Data Scientists Build Data Products.

A Shift In Perspective

Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine

Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions

All* Products Become Data Products

Identifying the Bottlenecks

Oryx: Model Building and Serving
• Algorithms
  • ALS Recommenders
  • K-Means Parallel
  • RDF
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models

Oryx Design

Generational Thinking

The Limits of Our Models

Space Exploration

Data Science Needs DevOps

Introducing Gertrude
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
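
To make "multiple independent experiments on every request" concrete, here is a minimal sketch of layered diversion in the spirit of Tang et al. (2010). It is not Gertrude's implementation: the layer names, experiment names, bucket count, and the helper functions bucket and assign are all assumptions made for illustration. Hashing the layer name together with the diversion id is what lets one request land in unrelated buckets in different layers.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count, not taken from the slides

# Hypothetical layers: each owns a disjoint set of parameters, and each
# experiment claims a half-open bucket range [start, end) within its layer.
LAYERS = {
    "ranking": [("rank_model_v2", 0, 100), ("rank_model_v3", 100, 200)],
    "ui": [("blue_button", 0, 500)],
}


def bucket(layer: str, diversion_id: str) -> int:
    """Hash the layer name with the diversion id so layers divert independently."""
    digest = hashlib.sha256(f"{layer}:{diversion_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign(diversion_id: str) -> dict:
    """Return, per layer, the experiment this request falls into (or None)."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(layer, diversion_id)
        assignment[layer] = next(
            (name for name, start, end in experiments if start <= b < end), None
        )
    return assignment


# The same id always gets the same assignment, one slot per layer.
print(assign("user-42"))
```

A request can be in at most one experiment per layer, but in several experiments overall, which is how a small amount of traffic can support many concurrent tests.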
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
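
The split between flags declared in code and rules kept in a config file might look like the sketch below. The slides do not show Gertrude's actual config format, so the JSON layout, the flag names, and the flag_values function are hypothetical; the point is only that flag values are computed per request from declared defaults plus simple matching rules.

```python
import json

# Flags declared in code with their default values (names are hypothetical).
FLAG_DEFAULTS = {"num_results": 10, "use_new_ranker": False}

# Hypothetical config, shipped separately from code. Each rule applies its
# overrides when every key in "when" matches the request's attributes.
CONFIG_TEXT = """
{
  "rules": [
    {"when": {"country": "US"}, "set": {"num_results": 20}},
    {"when": {"experiment": "rank_v2"}, "set": {"use_new_ranker": true}}
  ]
}
"""


def flag_values(request: dict, config: dict) -> dict:
    """Start from the defaults and apply matching rules in order."""
    values = dict(FLAG_DEFAULTS)
    for rule in config["rules"]:
        if all(request.get(k) == v for k, v in rule["when"].items()):
            for name, value in rule["set"].items():
                if name not in FLAG_DEFAULTS:
                    raise KeyError(f"undeclared flag: {name}")
                values[name] = value
    return values


config = json.loads(CONFIG_TEXT)
print(flag_values({"country": "US", "experiment": "rank_v2"}, config))
# {'num_results': 20, 'use_new_ranker': True}
print(flag_values({"country": "FR"}, config))
# {'num_results': 10, 'use_new_ranker': False}
```

Rejecting rules that set an undeclared flag keeps the config from drifting away from what the compiled code actually reads.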
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
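
A rough sketch of the file-based variant follows, assuming a JSON config with a "rules" list; the validate function and ConfigHolder class are hypothetical names, not part of Gertrude. The server checks the file's modification time, validates any new contents, and swaps the active config only if validation passes, so a bad data push cannot take down running code. The Zookeeper via Curator variant from the slide is not shown here.

```python
import json
import os


def validate(config) -> None:
    """Reject configs that are structurally wrong before they reach a server."""
    if not isinstance(config, dict) or not isinstance(config.get("rules"), list):
        raise ValueError("config must be an object with a 'rules' list")
    for rule in config["rules"]:
        if not isinstance(rule, dict) or not {"when", "set"} <= set(rule):
            raise ValueError(f"rule missing 'when' or 'set': {rule!r}")


class ConfigHolder:
    """Holds the active config and reloads it when the file changes."""

    def __init__(self, path: str):
        self.path = path
        self.mtime_ns = None
        self.config = {"rules": []}

    def maybe_reload(self) -> bool:
        """Reload if the modification time changed; keep the old config on error."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self.mtime_ns:
            return False
        self.mtime_ns = mtime_ns
        try:
            with open(self.path, encoding="utf-8") as f:
                candidate = json.load(f)
            validate(candidate)
        except (ValueError, OSError):
            return False  # bad push: keep serving with the previous config
        self.config = candidate  # single reference swap
        return True
```

A server would call maybe_reload() on a timer or before handling a batch of requests; readers always see either the old config or the new one, never a partially loaded file.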
The Experiments Dashboard

A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

Thank you!
Josh Wills, Director of Data Science, Cloudera

@josh_wills

Cloudera User Group - From the Lab to the Factory
