Many organisations are creating groups dedicated to data. These groups go by many names: Data Team, Data Labs, Analytics Team...
Whatever the name, the success of those teams depends heavily on the quality of the data infrastructure and on their ability to actually deploy data science applications in production.
In that regard, a new role of “DataOps” is emerging. Similar to DevOps for (web) development, DataOps is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a DataOps would also have a perspective on data quality and the relevance of predictive models.
Do you want to be a DataOps? We’ll discuss the role and its challenges during this talk.
The Rise of the DataOps - Dataiku - J On the Beach 2016
1. The Role of the DevOps in the Data Analytics Teams
J ON THE BEACH
05/21/16
MORPHED WITH
DEEP LEARNING™
TYPICAL OPS GUY
(source: Reddit)
TYPICAL YOUNG DATA SCIENTIST
(source: Common Sense)
2. My initial interests
Type Systems
Automated Proving
Abstract Program Interpretation
Functional Programming
Garbage Collection and VMs
Graph Analytics
Chess AI
Natural Language Processing
80% Emacs / 20% Vim
3. So to sum it up …
I (USED TO?)
BE A BIG NERD
6. Let’s get back to the (brief) history of DevOps
Agile Conference, 2008
Scrum and Agile
in an operational context
Hey! We should have
our own Velocity in
Belgium
10 deploys per day: Dev and Ops
Cooperation at Flickr
O’Reilly Velocity, June 2009
Patrick Debois
2007
Dev
Ops
QA
DevOpsDays
Ghent, October 2009
7. DevOps
DevOps is the practice of
operations and development
engineers participating together
in the entire service lifecycle,
from design through the
development
process to production support.
DevOps is also characterized
by operations staff making
use of many of the same
techniques as developers for
their systems work.
Invite Ops to the Dev Meeting
Oh. And let them SPEAK
Ops should know how to code
8. Let’s take an example: John devops from 2009
Learnt Python the Hard Way
Started with Puppet 1.0
Used EC2 before ELB and EBS !
10. Hasn’t there been ops associated with data for a while?
It’s called Business Intelligence!
11. History of Data Analytics (Oversimplified)
2013 2014 2015 2016 2017 2018
Moving to a world of automated decision making
DATA
FOR MORE INSIGHTS
DATA
FOR AUTOMATED DECISIONS
12. The Age Of Distributed Intelligence
Global, Personalised
and Real Time Data
Driven Services
13. Data, Analytics and Data Science
Conflict and Frustration
Concept
Combination
Catharsis
Create Culture
Share
Create Tools
Data
+
Science
15. Classic Business Intelligence Team Organization
Business Leader
Data Consumer
Line-of-business
Data Consumer
Business Project
Sponsor
BI Solution Architect
Model Designer
ETL Developer
Dashboard / Report Designer
Specs
Dim
Big Boss
16. Data Science Team Organization
Business Leader
Data Consumer
Line-of-business
Data Consumer
Business Project
Sponsor
Data Engineer
Data Analyst
System Engineer /
Data Architect
Business
Needs
Data Scientist
IT
Constraints
I.T.
17. Is there room for a new role ?
Data
Plumber
Data
Engineer
Data
Scientist
Data
Waiter
Data
Cleaner
Data
Analyst
REAL
JOB
DREAM
JOB
DevOps For Data?
18. Imagine
a company building
a new “smart car” app: AutoFine™
“A revolutionary collaborative network that checks the quality of your driving and punishes
you with virtual fines if you’re a bad driver”
19. Imagine
a company building
a new “smart car” service: AutoFine™
10 TB of Data
Every Month
Hive / Spark /
Python
10 Different
Predictive Models
Real-Time API
/ Workflow
20. ????
??
??
OPERATIONS: Who is responsible for …
• Check that the newly trained model performs as expected
• Check that the product catalog and the website tags remain consistent
• Check that the Hadoop cluster scales as expected and has enough bandwidth to handle the workload
• Test the performance of the real-time API
• Monitor the performance of the model and decide to roll back / maintain / roll out
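The last responsibility, monitoring a model and deciding between rollback, maintain and rollout, can be sketched as a simple comparison against a baseline. This is only an illustration: the function name `decide_action`, the score scale and the `tolerance` value are assumptions, not something taken from the talk.

```python
def decide_action(live_score, baseline_score, tolerance=0.02):
    """Suggest what to do with a newly deployed model.

    live_score and baseline_score are any "higher is better" metric
    (for example AUC); tolerance is an assumed noise margin.
    """
    if live_score < baseline_score - tolerance:
        return "rollback"   # clearly worse than the previous model
    if live_score > baseline_score + tolerance:
        return "rollout"    # clearly better: extend it to more traffic
    return "maintain"       # within the noise margin: keep observing


print(decide_action(0.70, 0.80))  # rollback
print(decide_action(0.85, 0.80))  # rollout
print(decide_action(0.81, 0.80))  # maintain
```

In practice the decision would probably also weigh business metrics and API latency, but even a rule this small forces the team to agree on who owns the threshold.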
24. Create an API culture
Do not share
• Random Piece of Code
• Flat File
• Email
Do share
• Reproducible, documented workflows
• Clean, documented APIs
25. Defensive Data Programming
• Software has errors.
• You are not your software, yet you are responsible for the errors.
• You can never remove the errors, only reduce their probability.
26. Defensive Data Programming
• Handle the case when one of the input files is empty
• Handle the case when a new value appears
• Handle the case when two columns become completely correlated
• Handle the case when a column is 16k long
• Etc.
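A few of the cases listed above can be sketched with the Python standard library alone. The helper names `read_rows` and `check_rows`, the `known_values` mapping and the 16,000-character limit are assumptions made for this illustration. The check for identical columns is only a narrow special case of "completely correlated"; a real pipeline would likely compute an actual correlation.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts (one per data row)."""
    return list(csv.DictReader(io.StringIO(text)))


def check_rows(rows, known_values, max_len=16_000):
    """Return a list of human-readable problems found in the rows.

    known_values maps a column name to the set of values seen so far.
    """
    if not rows:
        return ["input is empty"]
    problems = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if not isinstance(value, str):
                continue  # ragged row (missing or extra fields): skip it
            if len(value) > max_len:
                problems.append(f"row {i}: column {col!r} is {len(value)} characters long")
            if col in known_values and value not in known_values[col]:
                problems.append(f"row {i}: unseen value {value!r} in column {col!r}")
    # Identical columns are a narrow special case of complete correlation.
    cols = list(rows[0].keys())
    for idx, a in enumerate(cols):
        for b in cols[idx + 1:]:
            if all(row.get(a) == row.get(b) for row in rows):
                problems.append(f"columns {a!r} and {b!r} are identical")
    return problems


rows = read_rows("speed,road\n50,city\n130,moon\n")
print(check_rows(rows, {"road": {"city", "highway"}}))
```

Here the second data row carries a road type that was never seen before, so the check reports it instead of letting it flow silently into a model.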
27. Monitoring : the alerts for people who love it
• Performance ….
• Time Spent …
• Number of Errors …
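Technical alerts of this kind often boil down to comparing a few reported numbers against limits. The sketch below assumes hypothetical metric names (`p95_latency_ms`, `job_minutes`, `error_count`) and limits chosen only for illustration.

```python
def find_alerts(metrics, limits):
    """Compare reported job metrics to upper limits and return alert messages."""
    alerts = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is None:
            # A metric that stopped reporting is itself worth an alert.
            alerts.append(f"{name}: no value reported")
        elif value > limit:
            alerts.append(f"{name}: {value} is above the limit of {limit}")
    return alerts


limits = {"p95_latency_ms": 200, "job_minutes": 60, "error_count": 10}
nightly_run = {"p95_latency_ms": 150, "job_minutes": 75, "error_count": 3}
for message in find_alerts(nightly_run, limits):
    print(message)  # job_minutes: 75 is above the limit of 60
```

The returned messages could then be pushed to whatever channel the team already watches.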
28. Monitoring : Business Informal Monitoring
• % Opening
• Market Spent
• Exception User Events …
29. Resource Allocation
I’ve got this strange error, “OutOfMemory”. Do you know what it is?
Why is the Hadoop cluster going slower than my laptop?
31. Get to the latest package culture …
Data Scientist
I need the latest version of scikit
And NetworkX…
And could you repackage that
to enable TensorFlow optimizations?
System Administrator
…..
34. Job Title : a matter of name, $$ and social ladder
Data scientist Data Ops
Developer
Statistician
Full Stack Developer
Sys Admin
DevOps
35. Job Role : A matter of Do or Don’t
DO: Things you really want to do
DON’T: Things you really don’t want to get into
36. FIGHT THE
TOY PLATFORM ANTI-PATTERN
Test and Invest in Infrastructure == Skilled People
or
Go For Cloud / Packaged Infrastructure
Your Brand New Hadoop Cluster
is perceived as slow, not so used
and not reliable
37. FIGHT THE
TECHNO MISMATCH ANTI-PATTERN
Embrace Being Polyglot
or
Be a Dictator
VS
VS
The Python
Clan
The R
Tribe
The Old Elephant
Fraternity
The New Elephant
Club
39. GETTING DATA POLITICS
THE FOX
Hunt for a Big Problem!
Convince the CEO that you can solve a business-critical problem,
and use it as an excuse to get all the data you want!
THE SPIDER
Create a Network!
Create a set of trackers or addictive data collection internally
to get data on your side!
40. PREDICTIVE ANALYTICS DEPLOYMENT STRATEGY
Website: 2000s winners
Companies that were able to release fast
"Artificial Intelligence with Data for
Internet of Things": 2010s winners
Companies able to put intelligence in production
?
Design a way to put “PREDICTIVE MODELS”
IN PRODUCTION
41. OWN ANONYMISATION / PRIVACY
/ DATA SECURITY WITH PARTNERS ISSUES
Technical Feasibility ? What can or cannot be done ?
42. Let’s Wrap IT Up !
A Company Building a GPS powered automated car fine system
10 TB of Data
Every Month
Hive / Spark /
Python
10 Different
Predictive Models
Real-Time API
/ Workflow
Robust Workflow With Data Quality Checks
Functional Monitoring By Business People through Slack and Dashboards
Monitoring for the API
Feature Engineering Pipeline in Python
43. But where do you stand?
What's your roll-back strategy like?
What kind of multivariate testing or strategies do you have in place for predictive models?
How do you manage the robustness of your data flow production scripts?
How can business people monitor the performance of the application?