Enabling data scientists within an enterprise requires a well-thought-out approach from an organization, technology, and business results perspective. In this talk, Tim and Hussain will share common pitfalls of data science enablement in the enterprise and provide their recommendations for avoiding them. Using an actionable example use case from the financial services industry, they will focus on how Anaconda plays a pivotal role in setting up big data infrastructure, integrating data science experimentation and production environments, and deploying insights to production. Along the way, they will highlight opportunities for leveraging open source and unleashing data science teams while meeting regulatory and compliance challenges.
How to make your data scientists happy
1. How to Make Your Data Scientists Happy
A use-case backed approach for enabling data science in enterprise
April 2018
ANACONDACON 2018
2. Introduction
HUSSAIN SULTAN, WASHINGTON DC
• Leader in computational Python development and Data Science
• Amazon and Capital One
• Consulting clients: leading Fintech lenders and mega-regional banks
TIM HORAN, WASHINGTON DC
• 10 years of consumer lending
• Led US Credit Card Valuations at Capital One
• Consulting clients: leading marketplace installment loan lender and global 100 banks
3. Modern Analytics
Analytics and data management technology have progressed significantly in the last 10 years
• Explosion of Data: 90% of today's data was created in the last two years [1]
• Cloud Computing: $219.6 billion spent globally on public cloud services in 2016 and predicted to be $411 billion by 2020 [2]
• Software Development: The line between software development and sustainable analysis is blurring
• Open Source: The hive-mind of open source clearly has a space in modern analytics as enterprise solutions build on top and around it
• Predictive Analytics: Low cost compute and storage make Machine Learning and Artificial Intelligence accessible
• Infrastructure Automation: By the end of 2018, spending on IT-as-a-Service for data centers, software, and services will be just under $550 billion worldwide [3]
[1] IBM 10 Key Marketing Trends for 2017 - https://ibm.co/2y0r7Ee
[2] Gartner Press Release - http://gtnr.it/2Fw5LmJ
[3] Deloitte Technology, Media, and Telecommunications Predictions 2017 - http://bit.ly/2jMYdwm
4. In 2014, Gartner Research predicted that 60% of Big Data projects through 2017 would be failures.
When 2017 rolled around ...
Despite significant investment by enterprises to embrace Big Data and modern analytics, most efforts are failing.
9. Who are your Data Scientists, and what do they do?
[Matrix: roles (Biz Analyst, Data Scientist, Developer, Data Engineer, DevOps) against activities (Business Insight Generation, Model Building, Insight / Model Deployment, Analytical Tool Creation, Data Science Enablement, Data Management)]
[1] Leveraging Base Framework from Anaconda – Journey to Open Data Science - http://bit.ly/2FyvpHD
10. Data Scientists play a critical bridge role between Biz Analysts and traditional IT roles in the enterprise
11. Deployment in the enterprise requires the most coordination across teams
14. Data Scientists want to drive change in their organization using Data
It begins with the first word, Data …
• Raw data handled and stored consistently to eliminate data silos
• Metadata readily available, in particular, lineage working backward to raw data sources
• A well understood and thoughtful data access process
Tools to get the job done ...
• Open source first and foremost (Python/Anaconda, R)
• Scaled Data Science platform to enable interactive exploration and visualization
• Thoughtful and well understood open source governance process
Transparent path from insight to impact …
• Automated workflows to deploy new insights to market and monitor results
• At minimum, transparency on how to bring insights to market/production
15. The path from insight to implementation is consistently the largest gap on the way to successful “Big Data” / modern analytics projects.
16. A Common “Big Data” Project Life Cycle
[Diagram: two parallel tracks.
Enterprise BAU (Business as Usual) Solution: Biz Analyst / Data Scientists; Analytics and Monitoring Stack; Implementation Process; Production Infrastructure.
Parallel Modernization Lab or Center of Excellence: New ETL Process; On-Prem Hadoop or Cloud Database; Scaled Data Science Environment; Data Scientists; New Insights (Model, Strategy); New Implementation Process.]
17. Common Challenge #1
• Key performance indicator for new ETL focused on moving as much data into the lake as possible
• Data landing with limited metadata or challenging structures
• BAU solution not built on raw schema may not have been re-created in new ETL process
18. Common Challenge #2
• Translation required due to separate development environments
• New technology implemented on legacy infrastructure creates unexpected hurdles or brick walls
• Production implementation requires buy-in that prototypes or proofs of concept don’t require
19. Recommended Approach
Our modern analytics and Big Data engagements center around an effective use case that informs all software, infrastructure, and organizational investments.
• Identify a Use Case
• Modernize analytics infrastructure as needed
• Build use case as an iteratively improving product
• Sustain the new product and infrastructure
20. Identify a strategically important, actionable, material use case to gain support and guide your investment
Strategically Important
• Does the use case align with corporate imperatives?
• Will its success open the door for more use cases in your direct team and across the broader organization?
Actionable
• Will insights or results from the use case lead to in-market changes?
• Can insights or results drive change quickly and be iteratively improved over time?
Material
• Can insights or results from the use case drive material impact to the business?
21. To illustrate the approach, a specific use case: a marketing response propensity model
[Diagram labels: Analytical Database; Biz Analyst / Data Scientists; Linear Response Model Specifications; Marketing Targeting Production Intent; 3rd Party Implementation & Data Partner (Credit Bureau); Marketed Prospects; Non-Marketed Prospects]
22. Modern Analytics Use Case Litmus Test
Strategically Important: Response model used regularly to target marketing spend, driving the growth of a critical business.
23. Actionable: There is an opportunity to leverage new machine learning techniques to build models that typically outperform traditional linear response models. It is unclear whether our implementation partner can support new model types.
24. Material: Determined by measuring the net incremental responders generated when the model is implemented. If the juice is not worth the squeeze, don’t invest.
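One way to make the Material test concrete is sketched below. It is an illustration only, not the speakers' method: it assumes a historical campaign where response was observed, scores from both the incumbent linear model and a candidate model, and a fixed marketing volume n. The names Prospect, responders_in_top_n, and net_incremental_responders are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prospect:
    responded: bool          # observed response in a historical campaign (assumed available)
    incumbent_score: float   # score from the existing linear response model
    candidate_score: float   # score from the proposed machine learning model


def responders_in_top_n(prospects, score_of, n):
    """Count responders among the n highest-scored prospects."""
    ranked = sorted(prospects, key=score_of, reverse=True)
    return sum(p.responded for p in ranked[:n])


def net_incremental_responders(prospects, n):
    """Responders gained by targeting with the candidate model instead of
    the incumbent model, at the same marketing volume n."""
    incumbent = responders_in_top_n(prospects, lambda p: p.incumbent_score, n)
    candidate = responders_in_top_n(prospects, lambda p: p.candidate_score, n)
    return candidate - incumbent


# Made-up illustration data, not results from the talk.
prospects = [
    Prospect(True, 0.9, 0.8),
    Prospect(False, 0.8, 0.2),
    Prospect(True, 0.3, 0.9),
    Prospect(False, 0.7, 0.4),
    Prospect(True, 0.2, 0.7),
    Prospect(False, 0.1, 0.1),
]
print(net_incremental_responders(prospects, n=3))
```

Multiplying the result by the value of a responder and comparing it with the cost of the project gives a rough "juice versus squeeze" check before investing.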
25. Legacy process and tools are marred by manual touch-points and lack modern techniques
[Diagram, built up across slides 25-28. Labels: Source Systems (Raw Data); Engineer; Modeling & Analytics Environment (Processed Data, Enterprise Guide); Data Scientist: Response Model Passed; Biz Analyst: Marketing Targeting Passed; Developer & DevOps: Implemented in Production and Compared with Analytics Environment; Production C Environment; Marketed Prospects; Non-Marketed Prospects]
29. Pain Point #1: Limited modern analytics tool chest for response model building
30. Pain Point #2: Manual, bespoke testing and go-to-production process
31. Rather than standing up a whole new process, we focused on the largest pain points and improved them
Changing either the Source Systems or the Production Environment involved the most interdependencies outside of the use case, so both were left unchanged.
[Diagram labels: Source Systems; Production C Environment; Marketed Prospects; Non-Marketed Prospects]
32. Replacing the overall Modeling and Analytics environment was costly and time consuming, so we stood up a separate Open Source Sandbox.
[Diagram adds: Open Source Sandbox; Modeling Data Ported; Data Scientist]
33. To enable Machine Learning models like GBM (Gradient Boosting Machine), we created an XGBoost model dump to C translation package.
[Diagram adds: XGBoost to C Package; Data Scientist: Response Model Passed]
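The idea of translating a model dump to C can be sketched as follows. This is a minimal illustration, not the package described in the talk: it assumes trees in the JSON shape that XGBoost's Booster.get_dump(dump_format="json") produces, default feature names (f0, f1, ...), and it only sums raw leaf values (no base score or link function). The tree below is hand-written with made-up values, and tree_to_c, model_to_c, and predict_tree are hypothetical names.

```python
import json

# Hand-written example in the shape of an XGBoost JSON tree dump (values are made up).
EXAMPLE_TREE = json.loads("""
{"nodeid": 0, "split": "f0", "split_condition": 0.5, "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "leaf": -0.2},
   {"nodeid": 2, "split": "f1", "split_condition": 3.0, "yes": 3, "no": 4, "missing": 4,
    "children": [{"nodeid": 3, "leaf": 0.1}, {"nodeid": 4, "leaf": 0.4}]}
 ]}
""")


def feature_index(name):
    """Map a default feature name such as 'f12' to its column index."""
    return int(name[1:])


def tree_to_c(node, indent=1):
    """Emit nested C if/else statements for one tree."""
    pad = "    " * indent
    if "leaf" in node:
        return f"{pad}return {node['leaf']!r};\n"
    children = {c["nodeid"]: c for c in node["children"]}
    x = f"x[{feature_index(node['split'])}]"
    test = f"{x} < {node['split_condition']!r}"
    # Missing values (NaN) follow the branch named by "missing".
    if node["missing"] == node["yes"]:
        test = f"isnan({x}) || {test}"
    else:
        test = f"!isnan({x}) && {test}"
    return (f"{pad}if ({test}) {{\n"
            + tree_to_c(children[node["yes"]], indent + 1)
            + f"{pad}}} else {{\n"
            + tree_to_c(children[node["no"]], indent + 1)
            + f"{pad}}}\n")


def model_to_c(trees, name="score"):
    """Emit a C source string whose entry point sums the leaf values of all trees."""
    parts = ["#include <math.h>\n\n"]
    for i, tree in enumerate(trees):
        parts.append(f"static double tree_{i}(const double *x) {{\n{tree_to_c(tree)}}}\n\n")
    total = " + ".join(f"tree_{i}(x)" for i in range(len(trees))) or "0.0"
    parts.append("/* Raw margin: sum of leaf values, before base score or link function. */\n"
                 f"double {name}(const double *x) {{\n    return {total};\n}}\n")
    return "".join(parts)


def predict_tree(node, x):
    """Reference evaluator in Python, for comparing against the compiled C output."""
    while "leaf" not in node:
        children = {c["nodeid"]: c for c in node["children"]}
        value = x[feature_index(node["split"])]
        if value != value:  # NaN check
            nxt = node["missing"]
        elif value < node["split_condition"]:
            nxt = node["yes"]
        else:
            nxt = node["no"]
        node = children[nxt]
    return node["leaf"]


print(model_to_c([EXAMPLE_TREE]))
```

A production translator would also need to handle the model's base score, the objective's link function, and numeric precision so that C and Python scores agree; the reference evaluator is the kind of check that supports comparing production with the analytics environment.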
34. To lessen the iterative, manual Marketing Targeting intent checks, we deployed testing that verified Excel inputs against production outputs.
[Diagram adds: Local Intent Testing; Biz Analyst: Marketing Targeting Passed]
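A local intent test of this kind might look like the sketch below. It is an assumption-laden illustration rather than the tooling from the talk: it assumes the Excel intent sheet has been exported to CSV, that both files carry a prospect_id column and a yes/no marketed column, and the function names load_decisions and intent_mismatches are hypothetical.

```python
import csv
import io


def load_decisions(csv_text, id_col="prospect_id", decision_col="marketed"):
    """Read {prospect_id: marketed?} from CSV text (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: row[decision_col].strip().lower() == "yes" for row in reader}


def intent_mismatches(expected, actual):
    """List every prospect where production output disagrees with the stated intent."""
    problems = []
    for pid in sorted(expected.keys() | actual.keys()):
        if pid not in actual:
            problems.append((pid, "missing from production output"))
        elif pid not in expected:
            problems.append((pid, "not in intent spreadsheet"))
        elif expected[pid] != actual[pid]:
            problems.append((pid, f"expected marketed={expected[pid]}, got {actual[pid]}"))
    return problems


# Made-up test cases standing in for the exported spreadsheet and the production run.
INTENT_CSV = "prospect_id,marketed\n1001,yes\n1002,no\n1003,yes\n"
PRODUCTION_CSV = "prospect_id,marketed\n1001,yes\n1002,yes\n1004,no\n"

for pid, problem in intent_mismatches(load_decisions(INTENT_CSV),
                                      load_decisions(PRODUCTION_CSV)):
    print(pid, problem)
```

Running a check like this locally, before handing off to Developer & DevOps, is what replaces the repeated manual comparison rounds.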
35. The initial use case deliverable enabled modern machine learning models and lessened the manual testing previously required.
[Diagram: full build with Open Source Sandbox, XGBoost to C Package, and Local Intent Testing]
36. Remember our Common “Big Data” Project Life Cycle
[Diagram: the project life cycle from slide 16, repeated]
37. Our use case approach often leads to hybrid solutions that get material results as quickly as possible
[Diagram: Initial Use Case Solution. Labels: Biz Analyst / Data Scientists; Analytics and Monitoring Stack; New Open Source Sandbox; Implementation Process; GBM Conversion Routine and Local Intent Testing; Production Infrastructure]
38. As part of initial launch, we build the use case as a product that can iteratively improve
The product team delivers features with a focus on continued improvement, not on getting the product “done”.
[Diagram: Illustrative Product Structure. Product: Machine Learning Response Model, iteratively improved as more or new data becomes available. Labels: Product Team; Backlog; Features; Test, Build, Deploy; Software Engineering Best Practices; Internal Customer(s): Biz Analysts]
39. Finalize potential architecture as you iterate
[Diagram labels: Biz Analyst; Data Scientist; Build New Model; Computational Frameworks; Distributed Compute & Storage; Model Grid Search, Distributed Model Training, Model Conversion; Automated Model Validation; Continuous Integration: Automated Builds and Job Scheduling; Anaconda Repository: historical model versions are built and stored for future use; Model Build Package; Deploy; Databases / 3rd Party Services / Prediction APIs; Marketed Prospects; Non-Marketed Prospects]
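The Automated Model Validation step in this architecture could be as simple as a gate that compares a newly built model with the current one on held-out data. The sketch below is an assumption, not the architecture's actual implementation: it uses AUC as the metric, a made-up promotion threshold (min_gain), and hypothetical function names.

```python
def auc(labels, scores):
    """Probability that a random responder outscores a random non-responder
    (ties count as half). Quadratic in sample size; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_promote(labels, current_scores, candidate_scores, min_gain=0.01):
    """Promote the candidate only if it beats the current model by min_gain AUC."""
    return auc(labels, candidate_scores) - auc(labels, current_scores) >= min_gain


# Made-up holdout data for illustration.
labels = [1, 0, 1, 0, 1, 0]
current_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
candidate_scores = [0.8, 0.2, 0.9, 0.4, 0.7, 0.1]

print(should_promote(labels, current_scores, candidate_scores))
```

In a continuous integration job, a failed gate would stop the build before the model package is stored or deployed, so every stored historical version has passed the same check.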
41. Takeaways
Work Backwards from a Specific Use Case
• Identify the problem you want to solve, not the technology you want to use.
Identify Path to Implementation ASAP
• The Path to Implementation is historically the largest challenge to successful Big Data and modern analytics projects. Learn from others’ mistakes.
Think MacGyver not Michelangelo
• The goal is to get material enhancement into production as quickly as possible. You won’t have the perfect architecture on your first pass.
Organize Around Products
• By setting up a product team, clear customers, and a backlog, you can enhance the initial answer bit by bit while continuing to drive better in-production solutions.