Imagine 5 Years from Now
How will predictive applications be put in production?
Our Goal Today
How are we doing today?
What is difficult?
What should be simpler?
What is a predictive application?
Churn Prevention
Fraud Detection
Demand Forecast
Targeting
Maintenance
Match Making
Ad Bidding
Drug Studies
Pricing
Ranking
This discussion is not relevant to all

Use case       Data Span     Retrain every …   Score every …
Churn          Multi-Years   Monthly           Weekly
Maintenance    Multi-Years   Monthly           Weekly
Drug Studies   Multi-Years   Yearly            Yearly
Bidding        Two Weeks     Day               Sub-Second

Production = Dev
Online Learning
Not just a “model”
Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
Let’s call this a Predictive Service Specification
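To make the idea of a specification that covers more than the model concrete, here is a minimal sketch. The class name `PredictiveServiceSpec`, the stage names and the toy scoring rule are all hypothetical; the slides do not define any such API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Stage = Callable[[Any], Any]


@dataclass
class PredictiveServiceSpec:
    """Hypothetical container describing every stage, not only the model."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, fn: Stage) -> "PredictiveServiceSpec":
        self.stages.append((name, fn))
        return self

    def run(self, record: Any) -> Any:
        # Apply the stages in declaration order.
        for _name, fn in self.stages:
            record = fn(record)
        return record


# Illustrative stages only; the "model" is a stand-in rule, not a trained model.
spec = (
    PredictiveServiceSpec()
    .add("data_prep", lambda r: {**r, "age": int(r["age"])})
    .add("feature_eng", lambda r: {**r, "is_senior": r["age"] >= 65})
    .add("model", lambda r: {**r, "proba": 0.8 if r["is_senior"] else 0.2})
    .add("decision", lambda r: "A" if r["proba"] > 0.5 else "B")
)
```

Because the whole chain lives in one object, the same definition can in principle be replayed at build time and at run time.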
How much effort?
Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
20% 30% 25% 5% 5% 15%
Who does what?
Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
Data / Domain Engineers, Data Analysts, Data Scientists, Business Intelligence Engineers
Huge Variety of Tech
Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
ETL? Ad-Hoc? | ETL? Ad-Hoc? | ETL? SQL? R? Python? Matlab? | R? Python? | R? Python? SAS? | Java / Python, Business Rules Management System
From Build to Run
Build Time: Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
Run Time: Input Data → ? → Decision
How do people do that today?
Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
PMML, ETL, WebService, Script/SQL
A Predictive Service = up to 4 different “applications” that can run out of sync
Some Integrated Per-Platform Approaches
in Database: SQL Commercial Warehouse + Scoring UDF
in SAS: End-to-end integration script
in Hadoop/Spark: Ad-hoc development
Top Companies invested a lot
Each probably >5M$ in their ML production platform
Reason 1: Prohibitive Costs Kill Projects
Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
R, SQL, Python, R
SQL, ETL, WebService, SQL, PMML
300K$ 50K$ 200K$ 100K$
50K$
650K$
Reason 2: Distribution Drift
New behaviour
New product
New competitor
Model stops working as planned
You need to be able to update within the same week
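One common way to notice that a score or feature distribution has drifted is the Population Stability Index. The sketch below is an illustration, not something the slides prescribe; the function name `psi`, the bin edges and the sample values are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index between two samples over shared bin edges."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for e in edges if v >= e)] += 1
        total = len(values)
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / total, 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical score samples: training time versus last week.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
recent_scores = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.8, 0.85]
bin_edges = [0.25, 0.5, 0.75]
drift = psi(train_scores, recent_scores, bin_edges)
```

A value of zero means both samples fill the bins identically. The alert level that should trigger a retrain is a business choice.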
Reason 3: Mitigate Data Hazards
You need to be able to update within the same week
The most interesting “Big Data” sources are fragile
Reason 4: Deciding Goes Beyond Predicting
Most interesting problems require combining
Models + Heuristics + Non-local Optimization
Reason 5: “Suits ready” for scalability
Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision
Your CTO could certainly keep it up and running all by himself
Imagine the Dream Platform
That Would Solve All This
Let’s call it Blue Box
New Data → Blue Box → Decision
Feature: Cleanse, Enrich and Merge
Blue Box must be the perfect Data Blending runtime
Feature: Aggregating Data
Raw Events Stream (1TB-100TB+) → Aggregate State (100MB-10GB)
Consolidating history must be part of Blue Box
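A minimal sketch of consolidating a raw event stream into a much smaller per-entity state. The event shape `(customer_id, amount)` and the function name `update_state` are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, float]          # hypothetical: (customer_id, purchase_amount)
State = Dict[str, Dict[str, float]]


def update_state(state: State, events: Iterable[Event]) -> State:
    """Fold a batch of raw events into the running per-customer aggregates."""
    for customer_id, amount in events:
        entry = state.setdefault(customer_id, {"count": 0, "total": 0.0})
        entry["count"] += 1
        entry["total"] += amount
    return state


state: State = {}
update_state(state, [("a", 10.0), ("b", 5.0), ("a", 2.5)])
update_state(state, [("b", 1.0)])  # a later batch only touches what changed
```

The state grows with the number of entities rather than the number of events, which is what makes the size reduction on the slide plausible.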
Feature: External Data Compliant
main data + additional data (e.g. Census, Map, etc.) → enriched main data
Third-Party Data Must Be “In” the Blue Box
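As an illustration of enriching main data with an external source, the sketch below left-joins customer rows to a reference table after a light cleansing of the join key. The column names, the `enrich` helper and the sample figures are hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def enrich(main_rows: Iterable[Row], reference_rows: Iterable[Row],
           key: str) -> List[Row]:
    """Left join: every main row is kept, matched rows gain reference columns."""

    def clean(value: object) -> str:
        # Minimal cleansing so that " 75001" and "75001" match.
        return str(value).strip().upper()

    index = {clean(ref[key]): ref for ref in reference_rows}
    enriched = []
    for row in main_rows:
        extra = index.get(clean(row[key]), {})
        enriched.append({**extra, **row})  # main data wins on conflicting keys
    return enriched


# Made-up sample data, for illustration only.
customers = [{"id": 1, "zip": " 75001"}, {"id": 2, "zip": "99999"}]
census = [{"zip": "75001", "median_income": 32000}]
result = enrich(customers, census, "zip")
```

Unmatched rows pass through unchanged, so a missing or late external file degrades the features instead of dropping records.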
Feature: Update Data Service
Smart Lazy Human
A/B Test Support in Blue Box
Decision Ver. A
Decision Ver. B
P D F M S
New Model
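One way A/B test support could work is to route each entity to a decision version deterministically, so the same entity always sees the same version. This is a sketch under that assumption; `assign_variant`, the salt and the 50/50 split are illustrative choices.

```python
import hashlib


def assign_variant(entity_id: str, share_b: float = 0.5,
                   salt: str = "experiment-1") -> str:
    """Map an entity to decision version "A" or "B" in a stable way."""
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < share_b else "A"


variants = {assign_variant(f"user-{i}") for i in range(1000)}
```

Changing the salt reshuffles the assignment for a new experiment without storing any per-entity state.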
Feature: Programmatic Decision
Need for Business-Compliant “Real-Time” Rules in Blue Box
Combine model 1, model 2 and model 3 with “if” rules:
if proba > 0.63: decision A, else decision B
if proba > 0.79: decision A, else decision B
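The rules above can be sketched as ordinary code that picks a model, then applies a threshold. The thresholds 0.63 and 0.79 come from the slide; the routing rule on `is_new_customer`, the averaging of two models and all function names are assumptions made for this example.

```python
from typing import Callable, Dict

Model = Callable[[dict], float]


def decide(proba: float, threshold: float) -> str:
    return "A" if proba > threshold else "B"


def programmatic_decision(record: dict, models: Dict[str, Model]) -> str:
    # Hypothetical business rule: new customers get a dedicated model
    # and the stricter threshold.
    if record["is_new_customer"]:
        return decide(models["model_1"](record), 0.79)
    # Otherwise combine two models by averaging their probabilities.
    proba = (models["model_2"](record) + models["model_3"](record)) / 2
    return decide(proba, 0.63)


# Stand-in models returning fixed probabilities, for illustration only.
models = {
    "model_1": lambda r: 0.80,
    "model_2": lambda r: 0.60,
    "model_3": lambda r: 0.70,
}
```

Keeping the rule in code rather than inside the model makes it reviewable by the business and changeable without retraining.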
Feature: Audit and Logs
Smart Lazy Human
Blue Box needs to keep track of its decisions and why
Decision Cause Log
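A decision cause log could be as simple as one structured record per decision. The sketch below assumes a JSON-lines style log; the field names and the `log_decision` helper are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def log_decision(log: List[str], entity_id: str, decision: str,
                 causes: Dict[str, float], model_version: str,
                 at: Optional[datetime] = None) -> None:
    """Append one JSON record saying what was decided, when and why."""
    entry = {
        "at": (at or datetime.now(timezone.utc)).isoformat(),
        "entity_id": entity_id,
        "decision": decision,
        "causes": causes,                # e.g. the score and the rule applied
        "model_version": model_version,  # needed for rollback and audits
    }
    log.append(json.dumps(entry, sort_keys=True))


audit_log: List[str] = []
log_decision(audit_log, "customer-42", "A",
             {"proba": 0.81, "threshold": 0.79}, "v3",
             at=datetime(2015, 1, 1, tzinfo=timezone.utc))
```

Recording the model version next to each decision is what lets an auditor replay or explain it after the model has been replaced.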
External Data
Advanced Join / Matching
Ad-Hoc Transformation
Python / R / Spark DataFrame transformations
SQL Like Transformations
Scoring Causes / Audit
A/B Test Support
Model Rollback / Versioning
Prediction Log, Stats / Audit
Ad-hoc scoring/decision code
Open Source
What does Blue Box look like?
Interesting / Potential Open Source Projects
Real-Time Entity Update, Management, Scoring
Open Source PMML Scoring in Java
Oryx: Lambda Architecture built on Spark and Kafka, specialised in real-time machine learning
How will we create the “blue box”?
Specification? PMML Extension?
Open Source Framework?
Hadoop / Spark Specific?
Thank you!
Convince decision makers to make data their competitive advantage
Wanna work on this topic?
Wanna share your dream features?
florian.douetteau@dataiku.com
jobs@dataiku.com

PREDICTIVE APPLICATIONS PRODUCTION

  • 1. Imagine: How will predictive applications be put in production 5 years from now? Our Goal Today: How are we doing today? What is difficult? What should be simpler?
  • 2. What is a predictive application ? Churn Prevention Fraud Detection Demand Forecast Targeting Maintenance Match Making Ad Bidding Drug Studies Pricing Ranking
  • 3. This discussion not relevant to all Churn Maintenance Drug Studies Multi-Years Multi-Years Multi-Years Weekly Weekly Yearly Bidding Two Weeks Sub-Second Data Span Retrain every … Score every… Yearly Day Monthly Monthly Production = Dev Online Learning
  • 4. Not just a “model” Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision Data Collection Let’s call this a Predictive Service Specification
  • 5. How much effort ? Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision 20% 30% 25% 5% 5% 15% Data Collection
  • 6. Who Does What ? Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision Data Domain Engineers Data Analysts Data Scientists Business Intelligence Engineers
  • 7. Huge Variety of Tech Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision Data Collection ETL ? Ad-Hoc? ETL ? Ad-Hoc? ETL ? SQL ? R ? Python ? Matlab ? R ? Python ? R ? Python ? SAS? Java / Python Business Rules Management System Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision
  • 8. From Build to Run Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision ? Input Data Decision Build Time Run Time
  • 9. How Do People Do That Today ? Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision PMML ETL WebService Script/SQL Data Collection A Predictive Service = Up to 4 different “Applications” that can run out-of-sync
  • 10. Some Integrated Per-Platform Approach in Database in SAS in Hadoop/Spark SQL Commercial Warehouse + Scoring UDF End-to-end integration script Ad-hoc development
  • 11. Top Companies invested a lot Each probably >5M$ in their ML production platform
  • 12. Reason 1 : Prohibitive Costs kill projects Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision R SQL Python R Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision SQL ETL WebService SQL PMML 300K$ 50K$ 200K$ 100K$ 50K$ 650K$
  • 13. Reason 2: Distribution Drift New behaviour New product New competitor Model stops working as planned You need to be able to do same week update
  • 14. Reason 3: Mitigate with Data Hazards You need to be able to do same week update Most interesting “Big Data” Sources are fragile
  • 15. Reason 4: Decide is beyond Predict Most Interesting Problems Require To Combine Models + Heuristics + Non-local Optimization
  • 16. Reason 5: “Suits ready” for scalability Data Prep Domain Specific Feature Eng. Feature Eng. Model(s) Scoring /Decision Your CTO could certainly keep it up and running all by himself
  • 17. Imagine the Dream Platform That Would Solve All This ? Let’s call it Blue Box New Data Decision
  • 18. Feature : Cleansing, Enrich and Merge Blue Box must be the perfect Data Blending runtime
  • 19. Feature : Aggregating Data Raw Events Stream Aggregate State Consolidating History Must be part of Blue Box 1TB-100TB+ 100MB-10GB
  • 20. Feature : External Data Compliant main data enriched main data additional data e.g. Census, Map, etc. Third-Party Data Must Be “In” the Blue Box
  • 21. Feature : Update Data Service Smart Lazy Human A/B Test Support in Blue Box Decision Ver. A Decision Ver. B P D F M S New Model
  • 22. Feature : Programmatic Decision Need for Business Compliant “Real-Time” Rules in Blue Box model 1 model 2 model 3 if combine with if proba > 0.63 decision A else decision B if proba > 0.79 decision A else decision B
  • 23. Feature : Audit and Logs Smart Lazy Human ? Blue Box needs to keep track of its decisions and Why Decision Cause Log
  • 24. What does Blue Box look like? External Data Advanced Join / Matching Ad-Hoc Transformation Python / R / Spark DataFrame transformations SQL Like Transformations Scoring Causes / Audit A/B Test Support Model Rollback / Versioning Prediction Log. Stats / Audit Ad-hoc scoring/decision code/scoring Open Source
  • 25. Interesting / Potential Open Source Project Real-Time Entity Update, Management, Scoring Open Source PMML Scoring in Java Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning
  • 26. How will we create the “blue box” ? ? Specification ? PMML Extension ? Open Source Framework ? Hadoop / Spark Specific ?
  • 27. Thank you ! is blue Convince decision makers to make data their competitive advantage florian.douetteau@dataiku.com jobs@dataiku.com Wanna work on this topic ? Wanna share your dream features?
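
The "Programmatic Decision" and "Audit and Logs" features from slides 22 and 23 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Blue Box design: the `Rule`, `Decision` and `decide` names are hypothetical, the models are stand-in functions returning a probability, and the combination strategy (the first rule whose probability exceeds its threshold yields decision A, otherwise decision B) is an assumption, since the slide does not say how the models are combined. The 0.63 and 0.79 thresholds are the ones shown on slide 22.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained model: maps features to a probability.
Model = Callable[[Dict[str, float]], float]


@dataclass
class Rule:
    name: str         # label written to the audit log
    model: Model      # model whose probability the rule tests
    threshold: float  # e.g. 0.63 or 0.79, as on slide 22


@dataclass
class Decision:
    outcome: str  # "A" or "B"
    cause: str    # human-readable reason, kept for audit


def decide(features: Dict[str, float], rules: List[Rule],
           log: List[Decision]) -> Decision:
    """Return decision A for the first rule that fires, else decision B.

    Every decision is appended to `log` together with its cause.
    The "first rule wins" combination is an assumption for illustration.
    """
    for rule in rules:
        proba = rule.model(features)
        if proba > rule.threshold:
            decision = Decision(
                "A", f"{rule.name}: proba {proba:.2f} > {rule.threshold}")
            break
    else:
        # No rule exceeded its threshold.
        decision = Decision("B", "no rule exceeded its threshold")
    log.append(decision)
    return decision


# Usage: two toy "models" that simply read a precomputed score.
rules = [
    Rule("model_1", lambda f: f["score_1"], 0.63),
    Rule("model_2", lambda f: f["score_2"], 0.79),
]
audit_log: List[Decision] = []
first = decide({"score_1": 0.50, "score_2": 0.90}, rules, audit_log)
second = decide({"score_1": 0.10, "score_2": 0.20}, rules, audit_log)
print(first.outcome, "-", first.cause)
print(second.outcome, "-", second.cause)
```

Keeping the cause next to each outcome is what would let the service answer "why" after the fact, which is the point of slide 23; a real system would presumably write these records to durable storage rather than an in-memory list.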