
Dataiku productive application to production - pap is may 2015


Beyond Predictive Analytics: Deploying apps to production and keeping them improving

Some smart companies have been putting predictive applications in production for decades. Still, whether because of a lack of sharing or a lack of generality, there is no single, obvious way to put a predictive application in production today.

As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.

Behind the single word "production" lies a great number of questions, such as: what exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after the first development, or should I lay half of them off? Etc.

Let’s make a small analogy with the development of web sites in the 90’s and early 00’s:
Back then, the winners were not necessarily the web sites with an amazing design; a winner had clearly made the necessary efforts and had a robust way to put their web site reliably in production.

Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap … and so we forget. How much time before we get the same comfort for the predictive world?


  1. Imagine: How will predictive applications be put in production 5 years from now? Our goal today: How are we doing today? What is difficult? What should be simpler?
  2. What is a predictive application? Churn Prevention, Fraud Detection, Demand Forecast, Targeting, Maintenance, Match Making, Ad Bidding, Drug Studies, Pricing, Ranking
  3. This discussion is not relevant to all: Churn Maintenance Drug Studies Multi-Years Multi-Years Multi-Years Weekly Weekly Yearly Bidding Two Weeks Sub-Second Data Span Retrain every … Score every … Yearly Day Monthly Monthly Production = Dev Online Learning
  4. Not just a “model”: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision, Data Collection. Let’s call this a Predictive Service Specification
  5. How much effort? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision: 20% 30% 25% 5% 5% 15%, Data Collection
  6. Who does what? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision: Data Domain Engineers, Data Analysts, Data Scientists, Business Intelligence Engineers
  7. Huge variety of tech: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision, Data Collection: ETL? Ad-Hoc? ETL? Ad-Hoc? ETL? SQL? R? Python? Matlab? R? Python? R? Python? SAS? Java / Python, Business Rules Management System
  8. From Build to Run: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision? Input Data, Decision, Build Time, Run Time
  9. How do people do that today? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision: PMML, ETL, WebService, Script/SQL, Data Collection. A Predictive Service = up to 4 different “Applications” that can run out of sync
  10. Some integrated per-platform approaches: in Database, in SAS, in Hadoop/Spark; SQL Commercial Warehouse + Scoring UDF; End-to-end integration script; Ad-hoc development
  11. Top companies invested a lot: each probably >5M$ in their ML production platform
  12. Reason 1: Prohibitive costs kill projects. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision: R, SQL, Python, R. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision: SQL, ETL, WebService, SQL, PMML. 300K$ 50K$ 200K$ 100K$ 50K$ 650K$
  13. Reason 2: Distribution drift. New behaviour, new product, new competitor: the model stops working as planned. You need to be able to do a same-week update
  14. Reason 3: Mitigate with Data Hazards. You need to be able to do a same-week update. Most interesting “Big Data” sources are fragile
  15. Reason 4: Decide is beyond Predict. Most interesting problems require combining Models + Heuristics + Non-local Optimization
  16. Reason 5: “Suits ready” for scalability. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring/Decision. Your CTO could certainly keep it up and running all by himself
  17. Imagine the dream platform that would solve all this? Let’s call it Blue Box: New Data, Decision
  18. Feature: Cleansing, Enrich and Merge. Blue Box must be the perfect Data Blending runtime
  19. Feature: Aggregating Data. Raw Events Stream, Aggregate State. Consolidating history must be part of Blue Box. 1TB-100TB+, 100MB-10GB
  20. Feature: External Data Compliant. Main data, enriched main data, additional data (e.g. Census, Map, etc.). Third-party data must be “in” the Blue Box
  21. Feature: Update Data Service. Smart Lazy Human. A/B Test Support in Blue Box: Decision Ver. A, Decision Ver. B. P D F M S, New Model
  22. Feature: Programmatic Decision. Need for business-compliant “Real-Time” rules in Blue Box: model 1, model 2, model 3; if, combine with; if proba > 0.63 decision A else decision B; if proba > 0.79 decision A else decision B
  23. Feature: Audit and Logs. Smart Lazy Human? Blue Box needs to keep track of its decisions and why: Decision Cause Log
  24. What does Blue Box look like? External Data; Advanced Join / Matching; Ad-Hoc Transformation; Python / R / Spark DataFrame transformations; SQL-like transformations; Scoring Causes / Audit; A/B Test Support; Model Rollback / Versioning; Prediction Log, Stats / Audit; Ad-hoc scoring/decision code/scoring; Open Source
  25. Interesting / Potential Open Source Projects: Real-Time Entity Update, Management, Scoring; Open Source PMML Scoring in Java; Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning
  26. How will we create the “blue box”? Specification? PMML Extension? Open Source Framework? Hadoop / Spark Specific?
  27. Thank you! is blue. Convince decision makers to make data their competitive advantage. florian.douetteau@dataiku.com jobs@dataiku.com Wanna work on this topic? Wanna share your dream features?
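
Slides 21 to 23 of the transcript ask for A/B test support, programmatic decision rules on top of several models, and a log of each decision and its cause. The sketch below shows one way these three features could fit together; it is an illustration under stated assumptions, not the design the talk proposes. The thresholds 0.63 and 0.79 come from slide 22, but tying each one to a decision version, averaging the model scores, the hash-based variant assignment, and all names (`assign_variant`, `decide`, `DecisionRecord`) are hypothetical choices made for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

# Thresholds taken from slide 22; mapping one threshold to each
# decision version (A/B) is an assumption made for this sketch.
THRESHOLDS: Dict[str, float] = {"A": 0.63, "B": 0.79}


@dataclass
class DecisionRecord:
    """One audit entry: what was decided, and the inputs that caused it."""
    entity_id: str
    variant: str
    scores: Dict[str, float]
    combined: float
    threshold: float
    decision: str


def assign_variant(entity_id: str) -> str:
    """Deterministically split entities between decision versions A and B."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def decide(entity_id: str, scores: Dict[str, float], log: List[str]) -> DecisionRecord:
    """Combine model scores, apply the variant's rule, and log the cause."""
    variant = assign_variant(entity_id)
    # Assumption: the models are combined with a plain average.
    combined = sum(scores.values()) / len(scores)
    threshold = THRESHOLDS[variant]
    decision = "decision A" if combined > threshold else "decision B"
    record = DecisionRecord(entity_id, variant, dict(scores), combined, threshold, decision)
    # Append one JSON line per decision so it can be audited later.
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


if __name__ == "__main__":
    audit_log: List[str] = []
    result = decide("customer-42", {"model_1": 0.9, "model_2": 0.8, "model_3": 0.7}, audit_log)
    print(result.decision)
    print(audit_log[0])
```

Hashing the entity identifier keeps an entity in the same variant across calls without storing any assignment state, and writing the scores and threshold into each log line is what lets someone later answer why a given decision was made.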
