Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history (>10M profiles, >100M job posts, proposals and hiring decisions, >10B messages, transaction and feedback records). Beyond sheer quantity, our data is also contextually rich: we have client and contractor data for the entire job funnel, from finding jobs to getting the job done.
For various machine learning applications, including search, recommendations and labor marketplace optimization (rate, supply and demand), we relied heavily on a Greenplum-based data warehouse for data processing and on ad-hoc ML pipelines (Weka, scikit-learn, R) for offline model development and online model scoring.
In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks model scoring, scoring over structured streams, and microservices, and 4) orchestration of all these processes using Apache Airflow and a CI/CD workflow customized to our data science product engineering needs. The focus of this talk is on how we leveraged the Databricks service offering to reduce DevOps overhead and costs, complete the entire modernization with moderate effort, and adopt a collaborative notebook-based environment in which all our data scientists develop models, reuse features and share results. We also share the core lessons learned and pitfalls we encountered during this journey.
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Thanh Tran
1. Data Science Infrastructure Team, Thanh Tran
Upwork
How to Rebuild Data and ML
Platform using Kinesis, S3,
Spark, MLlib, Databricks,
Airflow and Upwork
3.
Nikolay Melnik
Lead ML Engineer
Ukraine
Dimitris Manikis
Senior Data Engineer
Greece
Artem Moskvin
Data/ML Engineer
Germany
Roman Tkachuk
Senior Data Engineer
Ukraine
Andrei Demus
Data/ML Engineer
Ukraine
Igor Korsunov
ML Engineer
Russia
Anna Lysak
Data/ML Engineer
Ukraine
Yongtao Ma
Senior ML Engineer
Germany
Giannis Koutsoubos
Lead Backend Engineer
Greece
4.
Me:
● Highest-skilled experts for the job (QUALITY)
● Competitive/lower rate (COST/EARNING)
● Mix of long-term and project-based staff (AGILITY)
My Team:
● Work on cutting-edge projects (QUALITY)
● Happy with competitive compensation + flexibility in location and work hours (COST/EARNING)
● Work only when they want work (AGILITY)
With Upwork, our new hires AND I are better off!
5.
We believe significant
welfare improvements
can be achieved through
data science driven
optimization of the online
labor marketplace.
6.
We have the biggest
closed-loop online
dataset of jobs and job
seekers in labor
history.
Contract progress (~1B)
Feedback (~10M)
Web site activity (~10B)
Money transactions (~100M)
Profiles (~10M)
Job Posts (~10M)
Proposals (~100M)
Messages (~100M)
Hiring decisions (~10M)
8.
We need to support an agile data science workflow
to provide quick and validated improvements!
● Data Science analytics
○ Complete and cleansed data, single ground-truth
○ Tools for computing metrics, continuous validation
● Data Science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Batch + live data mismatches
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline)
○ Flexibility to accommodate variety of ML frameworks
● Data Science model productionization
○ Minimize differences between trained model and production code
○ Code modularized, tested, integrated into CI/CD workflow
○ Standardized model serving that is scalable, available, high throughput, low latency...
11.
● Kinesis and Spark Structured Streaming for high throughput live event data processing
● Moving away from traditional DWH solution to distributed Spark-based batch data
processing to avoid performance issues and workload limitations
● Spark MLlib + Tensorflow as core ML libraries to balance the tradeoff between
flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and
orchestrated via Airflow
● Experiment management and model versioning integral part of CI/CD workflow
● Adapt the engineering CI/CD workflow to data science using Jenkins, Databricks and Airflow:
standalone model testing + live regression tests help identify batch and live data
mismatches
● Spark-based pipeline developed by data scientists directly used for model scoring in
production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments
and feature engineering code
● Graphite, ELK and Pagerduty for logging, monitoring and alerts
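As a concrete illustration of the live event processing bullet above, here is a minimal sketch of wiring a Kinesis stream into Spark Structured Streaming on Databricks. The stream name, region and paths are hypothetical placeholders, not Upwork's actual configuration, and since no Spark session is assumed here, the Spark calls themselves are shown as comments; only the option wiring is executable Python.

```python
# Hypothetical option wiring for a Kinesis source; stream name and
# region are made-up placeholders.
kinesis_options = {
    "streamName": "site-activity-events",  # hypothetical stream
    "region": "us-west-2",                 # hypothetical region
    "initialPosition": "latest",           # start at the tip of the stream
}

# On a Databricks cluster, the live event stream would be declared roughly as:
#
# events = (spark.readStream
#                .format("kinesis")        # Databricks-provided source
#                .options(**kinesis_options)
#                .load())
#
# and written out with a micro-batch sink, e.g.:
#
# (events.writeStream
#        .option("checkpointLocation", "/checkpoints/site-activity")
#        .start("/data/site-activity"))
```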
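The Airflow orchestration bullet can be sketched as a task dependency graph. Task names and the schedule below are illustrative, not the real job names; the Airflow declaration is shown in comments since the library is not assumed installed here, while the dependency graph itself is plain Python.

```python
# Hypothetical retraining pipeline expressed as task -> upstream tasks.
deps = {
    "extract_events": [],
    "compute_features": ["extract_events"],
    "retrain_model": ["compute_features"],
    "validate_model": ["retrain_model"],
    "publish_model": ["validate_model"],
}

# In Airflow, the same graph would be declared roughly as:
#
# with DAG("ml_retraining", schedule_interval="@daily") as dag:
#     tasks = {name: PythonOperator(task_id=name, python_callable=run[name])
#              for name in deps}
#     for name, upstreams in deps.items():
#         for up in upstreams:
#             tasks[up] >> tasks[name]

# Sanity check: every upstream task is itself defined in the graph.
assert all(up in deps for ups in deps.values() for up in ups)
```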
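The live regression test mentioned above, used to catch batch vs. live data mismatches, can be illustrated with a small framework-free parity check; feature names and values here are hypothetical.

```python
import math

def feature_mismatches(batch_row, live_row, tol=1e-6):
    """Return names of features whose batch and live values disagree.

    Both inputs are dicts mapping feature name -> float; a key missing
    on either side also counts as a mismatch.
    """
    names = set(batch_row) | set(live_row)
    bad = []
    for name in sorted(names):
        b, lv = batch_row.get(name), live_row.get(name)
        if b is None or lv is None or not math.isclose(b, lv, rel_tol=tol, abs_tol=tol):
            bad.append(name)
    return bad

# Hypothetical feature vectors for the same job post, one computed by the
# batch pipeline and one by the streaming pipeline.
batch = {"proposal_count": 12.0, "client_spend": 3400.0, "posting_age_h": 5.0}
live  = {"proposal_count": 12.0, "client_spend": 3400.0, "posting_age_h": 4.0}

print(feature_mismatches(batch, live))  # -> ['posting_age_h']
```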
22. • Microservices can lead to data fragmentation and high downstream
processing overhead
• Structured streaming latency when number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data
• Differences between trained vs. deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflow and
artefacts
• Databricks notebooks very convenient for collaboration, documentation,
code sharing and reuse, results dissemination
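A minimal sketch of the kind of model-serving microservice referred to above, using only the Python standard library. The linear "model", its weights, and the feature names are placeholders standing in for a real trained pipeline, not Upwork's actual service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: a tiny linear scorer standing in for a real
# Spark/TensorFlow model loaded at service startup.
WEIGHTS = {"proposal_count": 0.02, "client_spend": 0.0001}
BIAS = 0.1

def score(features):
    """Dot product of known features with the model weights, plus bias."""
    return BIAS + sum(WEIGHTS.get(k, 0.0) * v for k, v in features.items())

class ScoreHandler(BaseHTTPRequestHandler):
    """POST a JSON feature dict to receive {"score": <float>} back."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        features = json.loads(body or b"{}")
        payload = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve (blocking call, not run here):
# HTTPServer(("127.0.0.1", 8080), ScoreHandler).serve_forever()
```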
23.
Interested in search & recommendations, multi-sided matching or
online labor marketplace optimization? We are hiring!
Interested in doing work only when you want work?
Join Upwork as a contractor!