4. Is there life after data science?
Dating, flowers, dreams -> Marriage -> Lived happily ever after?
Collect & prepare data -> Build ML model
5. This talk is for people who are aware of the “other 99% of data science”
Dating, flowers, dreams -> Marriage -> Lived happily ever after?
Collect & prepare data -> Build ML model
6. This talk is NOT about
- Setting up Apache Spark/Hadoop cluster
- Configuring CI/CD tools like Jenkins
- Configuring monitoring tools & dashboards
- Agile/DevOps brainwashing & consulting story
7. Agenda
- Challenges in deploying analytics into production
- Deploying analytics as a service
- Feedback loops: testing, monitoring, analytics of analytics
10. What is the deliverable of a data scientist and a data engineer?
11. What is the deliverable of a data scientist?
- Academic paper?
- ML model?
- R/Python script?
- Jupyter Notebook?
- BI dashboard?
12. What should the deliverable of a data scientist be?
Data pipelines and machine learning models that are deployed as pluggable, testable, supportable, monitorable analytics services.
22. Step 5: Use feedback loops: testing, monitoring, analytics for analytics
Collect & prepare data -> Build ML model -> Test -> Deploy as a service -> Monitor, maintain, analyze -> (back to the start)
23. Agenda
- Challenges in deploying analytics into production
- Deploying analytics as a service
- Feedback loops: testing, monitoring, analytics of analytics
24. Deploying analytics as a service
- Defines the deliverable for the data scientist / data engineer.
- Plugs analytics into end-to-end products through an API.
- With the right tooling, lets data scientists deploy it themselves (self-service).
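To make "plugs analytics into products through an API" concrete, here is a minimal sketch in plain Python. It is not the Hydrosphere Mist API: the registry `JOBS`, the `job` decorator, `handle_request`, and the `average_fare` job are all hypothetical names, and the job body stands in for a real Spark job.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping a job name to a plain Python function.
JOBS: Dict[str, Callable[..., Any]] = {}


def job(name: str):
    """Register a function as a named analytics job."""
    def register(fn):
        JOBS[name] = fn
        return fn
    return register


@job("average_fare")
def average_fare(fares, surge=1.0):
    # Stand-in for a real Spark job: average fare after a surge factor.
    return sum(f * surge for f in fares) / len(fares)


def handle_request(body: str) -> str:
    """Turn a JSON request into a job call and a JSON response."""
    request = json.loads(body)
    fn = JOBS.get(request.get("job"))
    if fn is None:
        return json.dumps({"status": "error", "error": "unknown job"})
    try:
        result = fn(**request.get("params", {}))
    except Exception as exc:  # report failures to the caller instead of crashing
        return json.dumps({"status": "error", "error": str(exc)})
    return json.dumps({"status": "ok", "result": result})


response = handle_request(
    json.dumps({"job": "average_fare",
                "params": {"fares": [10.0, 20.0], "surge": 2.0}})
)
```

The point of the sketch is the contract: the product team sees a job name, typed parameters, and a status, while the data scientist only writes and registers the function.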
25. Look around - proprietary ML-based APIs
- Alchemy API
- Google Prediction API
- Cloud Vision API
- Azure ML
Can we build our own on top of Apache Spark?
27. Bad Practice #2. Database as API
- Set a flag to build a report
- Poll for new tasks
- Execute reporting job
- Mark job as complete & save result
- Poll for result
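The flow on this slide can be sketched with an in-memory SQLite table. The table name `report_tasks`, its columns, and the three functions are assumptions made for illustration; the slide does not say which database or schema was involved.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE report_tasks ("
    "id INTEGER PRIMARY KEY, params TEXT, status TEXT, result TEXT)"
)


def request_report(params: str) -> int:
    """Client side: set a flag (a 'new' row) asking for a report."""
    cur = db.execute(
        "INSERT INTO report_tasks (params, status) VALUES (?, 'new')", (params,)
    )
    db.commit()
    return cur.lastrowid


def run_worker_once() -> int:
    """Worker side: poll for new tasks, execute them, save the result."""
    rows = db.execute(
        "SELECT id, params FROM report_tasks WHERE status = 'new'"
    ).fetchall()
    for task_id, params in rows:
        result = f"report for {params}"  # stand-in for the real reporting job
        db.execute(
            "UPDATE report_tasks SET status = 'complete', result = ? WHERE id = ?",
            (result, task_id),
        )
    db.commit()
    return len(rows)


def poll_result(task_id: int):
    """Client side: return the result once complete, otherwise None."""
    row = db.execute(
        "SELECT result FROM report_tasks WHERE id = ? AND status = 'complete'",
        (task_id,),
    ).fetchone()
    return row[0] if row else None


task_id = request_report("region=EU")
before = poll_result(task_id)    # the worker has not run yet
processed = run_worker_once()
after = poll_result(task_id)
```

One reading of why this counts as a bad practice: the table schema silently becomes the contract, both sides burn cycles polling, and there is no natural place for errors or parameter validation.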
28. Bad Practice #3. Low-level HTTP API
When data scientists design an API...
29. Hydrosphere Mist - a service for exposing analytics jobs and machine learning models as web services
30. Types of analytics services
- Enterprise Analytics services
- Reactive or Streaming services
- Realtime ML services
31. Enterprise analytics services
- Cannot be pre-calculated
- On-demand parametrized jobs
- Require large-scale processing
- Reporting
- Simulation (pricing, bank stress testing, taxi rides)
- Forecasting (ad campaign, energy savings, others)
- Ad-hoc analytics tools for business users
36. Realtime Machine Learning Services
Train models in Apache Spark and deploy them for real-time, low-latency serving/scoring with high throughput
37. PMML is not an option
Spark ML, TensorFlow, H2O, Vowpal Wabbit, and every new ML library uses its own serialisation format
38. Format is not an issue if we redefine the deliverable for an ML model
Deliverable artifact for a machine learning model:
- Model format: xml, json, parquet, pojo, other
- Single-row serving / scoring layer
- Large-scale batch processing engine
- Monitoring, testing integration
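A minimal sketch of such an artifact, assuming a logistic regression whose parameters were exported as JSON. The JSON layout, the `ModelArtifact` class, and its method names are hypothetical; a real export from Spark ML or another library would look different.

```python
import json
import math
from typing import Dict, Iterable, List

# Hypothetical exported model: logistic regression parameters as JSON.
MODEL_JSON = '{"intercept": -1.0, "weights": {"clicks": 0.5, "age": 0.0}}'


class ModelArtifact:
    """Bundles one scoring function with batch, test and monitoring hooks."""

    def __init__(self, serialized: str):
        params = json.loads(serialized)
        self.intercept: float = params["intercept"]
        self.weights: Dict[str, float] = params["weights"]
        self.rows_scored = 0      # simple counters a monitor could read
        self.rows_rejected = 0

    def score_row(self, row: Dict[str, float]) -> float:
        """Single-row scoring: no cluster needed, just arithmetic."""
        missing = [name for name in self.weights if name not in row]
        if missing:
            self.rows_rejected += 1
            raise ValueError(f"missing features: {missing}")
        z = self.intercept + sum(w * row[n] for n, w in self.weights.items())
        self.rows_scored += 1
        return 1.0 / (1.0 + math.exp(-z))

    def score_batch(self, rows: Iterable[Dict[str, float]]) -> List[float]:
        """Batch scoring reuses score_row, so both paths agree."""
        return [self.score_row(row) for row in rows]

    def self_test(self, golden: List[tuple], tolerance: float = 1e-9) -> bool:
        """Testing hook: check known inputs against expected scores."""
        return all(abs(self.score_row(x) - y) <= tolerance for x, y in golden)


artifact = ModelArtifact(MODEL_JSON)
single = artifact.score_row({"clicks": 2.0, "age": 30.0})
batch = artifact.score_batch(
    [{"clicks": 2.0, "age": 30.0}, {"clicks": 0.0, "age": 5.0}]
)
```

Once the serving layer, batch path, and test hook ship together, the serialisation format is an internal detail of the artifact rather than something consumers depend on.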
41. Agenda
- Challenges in deploying analytics into production
- Deploying analytics as a service
- Feedback loops: testing, monitoring, analytics of analytics
42. Testing, monitoring, analytics of analytics
- Rarely discussed in the community.
- We are in production, baby!
- Regression.
- State matters. Model lifetime is limited.
- Data drifts; pipelines and models fail silently.
● Saves time
● Saves money
● Saves lives
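As an illustration of catching data drift before it fails silently, here is a small sketch that compares a live batch of one feature against its training-time baseline. The mean-shift measure, the threshold of 3 standard deviations, and the sample values are assumptions chosen for the example, not a method from the talk.

```python
import statistics
from typing import Sequence


def mean_shift(baseline: Sequence[float], live: Sequence[float]) -> float:
    """Distance of the live mean from the baseline mean, in baseline std devs."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.pstdev(baseline)
    if base_std == 0:
        return 0.0 if statistics.fmean(live) == base_mean else float("inf")
    return abs(statistics.fmean(live) - base_mean) / base_std


def has_drifted(baseline: Sequence[float], live: Sequence[float],
                threshold: float = 3.0) -> bool:
    """Flag drift when the mean moves by more than `threshold` std devs."""
    return mean_shift(baseline, live) > threshold


training_ages = [20.0, 30.0, 40.0, 50.0]   # feature values seen at training time
healthy_batch = [25.0, 35.0, 45.0]
broken_batch = [2500.0, 3500.0, 4500.0]    # e.g. an upstream unit change
```

Calling `has_drifted(training_ages, broken_batch)` flags the second batch even though every value in it would pass a type check, which is the kind of silent failure the slide refers to.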
44. TDD world does not work here
Pff… easy:
- Unit tests - by platform developers
- Integration tests - often impossible
Not clear who does it, or how:
- Regression
- Data Validation
- Production testing
- Data and ML pipelines quality monitoring
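Data validation is one item on this list that can be sketched in a few lines. The field names and rules below are hypothetical; the idea is only that each record is checked against explicit predicates and rejected records keep their reasons.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical rules: field name -> predicate the value must satisfy.
RULES: Dict[str, Callable[[Any], bool]] = {
    "user_id": lambda v: isinstance(v, int) and v > 0,
    "text": lambda v: isinstance(v, str) and v.strip() != "",
    "sentiment": lambda v: isinstance(v, float) and -1.0 <= v <= 1.0,
}


def validate(record: Record) -> List[str]:
    """Return the names of fields that are missing or fail their rule."""
    return [
        field for field, is_valid in RULES.items()
        if field not in record or not is_valid(record[field])
    ]


def split_valid(
    records: List[Record],
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate clean records from rejected ones, keeping the reasons."""
    valid, invalid = [], []
    for record in records:
        errors = validate(record)
        if errors:
            invalid.append((record, errors))
        else:
            valid.append(record)
    return valid, invalid


records = [
    {"user_id": 1, "text": "great talk", "sentiment": 0.9},
    {"user_id": -5, "text": "", "sentiment": 0.1},
    {"user_id": 2, "text": "hm"},
]
valid, invalid = split_valid(records)
```

Because the rules are plain data, they can be reviewed and versioned by whoever ends up owning "Data QA", independently of the pipeline code.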
45. Need either “Data QA” & “Data Ops” people or … AI
(formula for the next 10 000 startups - take something and add AI)
53. Solution: loop of analytics for analytics
- ML pipeline: emit metrics to Kafka
- Analytics jobs for metrics
- Stream it back into the Spark context
- Use insights to make our data structures smart
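The loop can be sketched end to end in plain Python. A `deque` stands in for the Kafka topic, and the `emit`, `ingest`, `invalid_ratio`, and `should_pause` functions as well as the 0.2 threshold are hypothetical; a real setup would use Kafka clients and Spark jobs instead.

```python
from collections import Counter, deque
from typing import Deque, Dict, List

# Stand-in for a Kafka topic that carries metric events.
metrics_topic: Deque[Dict[str, str]] = deque()


def emit(stage: str, event: str) -> None:
    """Called from inside the ML pipeline to publish one metric event."""
    metrics_topic.append({"stage": stage, "event": event})


def ingest(raw_records: List[str]) -> List[int]:
    """Toy pipeline stage: parse integers and emit a metric per record."""
    parsed = []
    for raw in raw_records:
        try:
            parsed.append(int(raw))
            emit("ingest", "valid")
        except ValueError:
            emit("ingest", "invalid")
    return parsed


def invalid_ratio(stage: str) -> float:
    """Analytics job over the metrics: share of invalid records for a stage."""
    counts = Counter(m["event"] for m in metrics_topic if m["stage"] == stage)
    total = counts["valid"] + counts["invalid"]
    return counts["invalid"] / total if total else 0.0


def should_pause(stage: str, max_invalid_ratio: float = 0.2) -> bool:
    """Feed the insight back: signal the pipeline when quality drops."""
    return invalid_ratio(stage) > max_invalid_ratio


parsed = ingest(["1", "2", "oops", "4"])
ratio = invalid_ratio("ingest")
pause = should_pause("ingest")
```

The metrics are just another data stream, so the same tools used for the original analytics can analyze them.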
54. Benefits
● Don’t need to talk to Ops! :)
● Already have Apache Spark and Kafka in place
● Data Scientist in the loop!
● Unlimited flexibility in analytics, correlation and using ML for ML
● Models can be fed back into smart, self-QA-ed data structures.
55. Hydrosphere Swirl - a system that creates a swirl of analytics for analytics
56. Hydrosphere Swirl: Vision
- Original ML pipeline
- Kafka
- Streaming or batch Swirl jobs
- Hydrosphere Swirl
- Plug, modify, deploy, run jobs & consume results
- Metrics definition, notebook integration
- Hydrosphere Mist
- (1) Emit metrics
60. Swirl Demo: Serve ads to users with positive tweets
Pipeline: Twitter -> Ingest & transform -> Serve ads to user
Hydrosphere Swirl metrics: invalid records, clicks, ratio (values shown: 10/sec, 10/sec, 0.2)
Events: new ML model deployment, deployed bug in ML code
61. Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/
- spushkarev@hydrosphere.io