In this talk, we will explore how Uber enables rapid experimentation with machine learning models and optimization algorithms through Uber’s Data Science Workbench (DSW). DSW covers a series of stages in the data scientist’s workflow, including data exploration, feature engineering, machine learning model training, testing, and production deployment. DSW provides interactive notebooks for multiple languages with on-demand resource allocation, and lets users share their work through community features.
It also supports notebooks and intelligent applications backed by Spark job servers. Deep learning applications based on TensorFlow and Torch can be brought into DSW smoothly, with resource management taken care of by the system. The DSW environment is customizable: users can bring their own libraries and frameworks. Moreover, DSW provides support for Shiny and Python dashboards as well as many other in-house visualization and mapping tools.
In the second part of this talk, we will explore use cases where custom machine learning models developed in DSW are productionized within the platform. Uber applies machine learning extensively to solve hard problems. Use cases include calculating the right prices for rides in over 600 cities and applying NLP technologies to customer feedback to offer safe rides and reduce support costs. We will look at the options evaluated for productionizing custom models (server-based and serverless). We will also look at how DSW integrates into Uber’s larger ML ecosystem, e.g. model/feature stores and other ML tools, to realize the vision of a complete ML platform for Uber.
Building Intelligent Applications, Experimental ML with Uber’s Data Science Workbench with Atul Gupte and Felix Cheung
1. Building Intelligent Applications & Experimental ML
with Uber’s Data Science Workbench
Felix Cheung & Atul Gupte
Uber Technologies, Inc.
2. / Data at Uber
/ Analytics Stack
/ Spark at Uber
/ Machine Learning at Uber
/ Data Science Workbench
/ Common User Flows & Impact
Contents
3. Engineer turned Product Manager
Previously: building FarmVille & the mobile advertising platform @ Zynga
Currently: Product Manager for Data Science Workbench & Data Warehouse
/ About Atul
4. Apache Spark PMC & Committer
Engineer, Tech Lead & Area Owner of Spark @ Uber
/ About Felix
9. 6,000+ data scientists, engineers, and operations
managers rely on us to support the business
10. Data is what differentiates Uber
but, data at Uber is unlike anywhere else.
11. Delicate marketplace with
network effects
Bits to atoms
Business
New LOBs spun up in a snap
Pluggable mobility platform
Spatio-temporal
Analytics
Sheer scale
Real-time. Real-world.
ML is Uber’s brain
Apps/Machine generated queries
Varied skills: BI to DNN
Consumers
Internal and external
6,000 and growing
What makes Uber unique
12. MISSION
Move the world with
global data, local
insights, and intelligent
decisions.
Data Platform Team
14. The Data Team
Ingest
Workflow
Management
Store
Produce Model
Ad-Hoc &
Streaming
Analytics
Business
Intelligence
Machine
Learning
Metadata/
Knowledge
Experimentation/
Segmentation
Visualization
Data
Infrastructure
Data Platforms
Data Services
& Analytics
Disperse
15. Kafka
Schemaless
SOA
BI Apps Ad-hoc Experimentation ML Notebooks
Cluster
Management
All-Active
Observability
Security
Raw
Data
Raw
Tables
Hadoop
Hive Presto Spark
Modeled
Tables
Vertica
Vertica
Warehouse
AthenaX
Apollo
Streaming
Real-time
Metadata/Workflow Management
Data Infrastructure
17. at Uber Scale
100,000+
Spark jobs per day
~96%
ETL pipelines
~98%
YARN job resource use (in
vcore-seconds) on Spark
● 11,000+ machines across multiple data-centers
● Many 10s-petabytes of data
● Runs on one of the largest production HDFS clusters
18. Introducing Uber’s Spark Compute Service
Simplifies lives of developers & cluster operators
Consolidate Infrastructure Investments
YARN, Mesos
Available across multiple data-centers
Improve Developer Experience
Standardized Spark builds across Uber
Bring-your-own-stack (optional)
Advanced monitoring & debugging
Serve Multiple Use Cases
Exploratory, bursty & scheduled batch
Manage full Spark application lifecycle
Proliferate
Better language support (R/Python/Java)
Consumption Interfaces (CLI/REST/GUI)
19. Session Recap (June 5th)
Karthikeyan Natarajan
Senior Software Engineer
Bo Yang
Senior Software Engineer
21. The hype
● Ability of a machine to learn without being explicitly programmed
● Identify hidden patterns in the world based on current and historical data
and use it to predict the future
● Ability of a machine to get better at a task with data and experience
● Learn from mistakes and improve when given newer/more information
24. UNDERSTAND
BUSINESS NEED(S)
DEFINE MINIMUM
VIABLE PRODUCT (MVP)
○ Customers + cross-functional team
○ Define objectives and key results
○ Data-driven
○ Research
○ Ruthless prioritization
2. prototype
3. productionize
4. measure
1. define
Problem Definition
25. UNDERSTAND
BUSINESS NEED(S)
DEFINE MINIMUM
VIABLE PRODUCT (MVP)
2. prototype
1. define
GET DATA
DATA PREPARATION
TRAIN MODELS
EVALUATE MODELS
3. productionize
4. measure
validation
computational cost
interpretability
SQL, Spark
data cleansing and
pre-processing,
R / Python
CPU or GPU
Exploration
26. UNDERSTAND
BUSINESS NEED(S)
2. prototype
1. define
DATA PREPARATION
TRAIN MODELS
EVALUATE MODELS
4. measure
GET DATA
PRODUCTIONIZE
MODELS
3. productionize
DEPLOY MODELS
Engineers + Data Scientists,
Java or Go,
unit tests
MAKE PREDICTIONS
Real-time or batch
Experimentation and
rollout monitoring;
Retraining strategy
DEFINE MINIMUM
VIABLE PRODUCT (MVP)
Production
27. UNDERSTAND
BUSINESS NEED(S)
DEFINE MINIMUM
VIABLE PRODUCT (MVP)
2. prototype
1. define
DATA PREPARATION
TRAIN MODELS
EVALUATE MODELS
GET DATA
DEPLOY MODELS
PRODUCTIONIZE
MODELS
MONITOR
PREDICTIONS
4. measure
MAKE PREDICTIONS
3. productionize
Automatically detect
degradations
GATHER AND ANALYZE
INSIGHTS
Deep-dive analyses
inform future product
roadmap
Measure
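The "automatically detect degradations" step on this slide can be illustrated with a minimal sketch. It is not Uber's monitoring logic, which the slides do not describe: the function name, the 5% tolerance, and the accuracy values are all assumptions chosen for illustration. The idea is simply to compare a recent window of a model metric against a baseline window and flag a large relative drop.

```python
from statistics import mean
from typing import Sequence


def detect_degradation(
    baseline: Sequence[float],
    recent: Sequence[float],
    max_relative_drop: float = 0.05,  # assumed tolerance, not from the talk
) -> bool:
    """Return True if the recent mean fell more than max_relative_drop
    (as a fraction of the baseline mean) below the baseline mean."""
    if not baseline or not recent:
        raise ValueError("both windows must contain at least one value")
    baseline_mean = mean(baseline)
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    relative_drop = (baseline_mean - mean(recent)) / baseline_mean
    return relative_drop > max_relative_drop


# Hypothetical daily accuracy values, for illustration only.
baseline_accuracy = [0.91, 0.92, 0.90, 0.91]
recent_accuracy = [0.84, 0.83, 0.85]
print(detect_degradation(baseline_accuracy, recent_accuracy))
```

A check like this would typically run on a schedule after each batch of predictions is scored, with the flag feeding the retraining strategy mentioned on the previous slide.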
28. 3x growth in Data Science community
Python and R machine learning was mostly DIY, and on laptops
Moving Python models to production was hard
Proliferation of tools, libraries, infra
None of which could scale to 1000s
Collaboration and Sharing non-existent
Security / Compliance / DC redundancy
Our world in 2016
30. Unleash the productivity of the Data Science
community at Uber by providing scalable
infrastructure, tools, customization, and support.
Mission
32. Fully hosted environment - nothing to install
One-click to Jupyter Notebook or RStudio IDE
Pre-baked environment
Session Customization (BYOP)
Wired to all internal sources and compute engines
Our world today
Share/publish/comment on data/notebooks
One-click publish to Shiny dashboards
Multi-DC
Secure and GDPR Compliant
Support & documentation
34. RStudio and Shiny are trademarks of RStudio, Inc
"Jupyter" is a trademark of the NumFOCUS foundation, of which Project Jupyter is a part.
"Python" is a registered trademark of the PSF. The Python logos (in several variants) are use trademarks of the PSF as well.
40. DSW + Spark Architecture
Storage Service
Data Scientists
FrontEnd
Application
Management
DSW DSW cluster
Container Container
Container
RStudio
Server
Container
Jupyter
Server Compute
Service
Hadoop Cluster
Hive
Presto
Spark
HDFS
SparkMagic
Livy
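The architecture above routes notebook code to Spark through SparkMagic and Livy. As a rough sketch of what that hop involves, the snippet below builds the two REST calls Livy exposes for interactive work: creating a session and submitting a statement. The host name is a placeholder and the helper names are hypothetical; how DSW configures or wraps Livy is not stated in the slides. The requests are only constructed, not sent, so the block runs without a server.

```python
import json
import urllib.request

# Hypothetical endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.internal:8998"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request() -> urllib.request.Request:
    """POST /sessions starts an interactive PySpark session."""
    return build_request("/sessions", {"kind": "pyspark"})


def run_statement_request(session_id: int, code: str) -> urllib.request.Request:
    """POST /sessions/{id}/statements runs code inside that session."""
    return build_request(f"/sessions/{session_id}/statements", {"code": code})


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON reply (needs a live Livy server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


req = run_statement_request(0, "spark.sql('SHOW TABLES').show()")
print(req.get_method(), req.full_url)
```

In a notebook, SparkMagic issues calls of this kind on the user's behalf, so data scientists write Spark code in a cell rather than handling HTTP themselves.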
41. DSW + Spark Use-cases
● Explore large-scale dataset
● Parallelise Python native packages for feature
generation & model training
● Collaborate and review on a common interface for
ad-hoc analysis & prototyping
42. Common DS Patterns (#1)
PySpark
Python
Native
packages
PySpark
Hive Tables Hive Tables
scikit-learn
Features
DSW
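One plausible reading of pattern #1 is: read Hive tables with PySpark, apply native Python code in parallel to generate features, write the features back to Hive, then train with scikit-learn inside DSW. The sketch below follows that reading. Table names, column names (trip_id, distance_km, duration_s, label), and the choice of logistic regression are all assumptions for illustration; the slide names none of them.

```python
from typing import Dict, Iterable, Iterator


def trip_features(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Plain-Python feature generation; Spark calls this once per partition."""
    for row in rows:
        hours = row["duration_s"] / 3600.0
        yield {
            "trip_id": row["trip_id"],
            "avg_speed_kmh": row["distance_km"] / hours if hours > 0 else 0.0,
            "label": row["label"],
        }


def run_pipeline(source_table: str, feature_table: str):
    """Hive table -> PySpark feature generation -> Hive table -> scikit-learn."""
    # Imported lazily so trip_features can be tried without a cluster.
    from pyspark.sql import Row, SparkSession
    from sklearn.linear_model import LogisticRegression

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    source = spark.table(source_table)

    # Distribute the native-Python function across partitions.
    feature_rdd = source.rdd.mapPartitions(
        lambda part: (Row(**f) for f in trip_features(r.asDict() for r in part))
    )
    features = spark.createDataFrame(feature_rdd)
    features.write.mode("overwrite").saveAsTable(feature_table)

    # Assumes the feature table fits in notebook memory.
    pdf = spark.table(feature_table).toPandas()
    return LogisticRegression().fit(pdf[["avg_speed_kmh"]], pdf["label"])


# Local dry run of the feature function on one hypothetical row.
sample = [{"trip_id": 1, "distance_km": 10.0, "duration_s": 1800, "label": 1}]
print(list(trip_features(sample)))
```

Keeping the feature logic in an ordinary function means it can be checked on a handful of rows in the notebook before being run across the full table.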
43. Common DS Patterns (#2)
Spark
Scala
mllib
Hive Tables HDFS
Trained
Model
Production
DSW
Evaluate
44. DSW + Spark Impact
Safety
Trip classification
Risk
Driver account check
Driver referral risk scoring
Uber Eats
Restaurant recommendations
Support
NLP model for support tickets
Operations
Lifetime value (LTV) model
more!
46. We’re hiring!
Excited to build the data platform that moves the world?
Come join us!
http://t.uber.com/datahire
San Francisco, Palo Alto, Seattle, Bangalore