Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history (>10M profiles, >100M job posts, proposals and hiring decisions, >10B messages, transaction and feedback records). Beyond sheer quantity, our data is also contextually rich: we have client and contractor data for the entire job funnel, from finding jobs to getting the job done.
For various machine learning applications, including search, recommendations and labor marketplace optimization (rate, supply and demand), we relied heavily on a Greenplum-based data warehouse for data processing and on ad-hoc ML pipelines (Weka, scikit-learn, R) for offline model development and online model scoring.
In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks model scoring, scoring over structured streams, and microservices, and 4) orchestration of all these processes using Apache Airflow and a CI/CD workflow customized to our data science product engineering needs. The focus of this talk is on how we leveraged the Databricks service offering to reduce DevOps overhead and costs, complete the entire modernization with moderate effort, and adopt a collaborative notebook-based environment in which all our data scientists develop models, reuse features and share results. We also share the core lessons learned and pitfalls we encountered during this journey.
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Thanh Tran
1. Data Science Infrastructure Team, Thanh Tran
Upwork
How to Rebuild Data and ML
Platform using Kinesis, S3,
Spark, MLlib, Databricks,
Airflow and Upwork
3.
Nikolay Melnik
Lead ML Engineer
Ukraine
Dimitris Manikis
Senior Data Engineer
Greece
Artem Moskvin
Data/ML Engineer
Germany
Roman Tkachuk
Senior Data Engineer
Ukraine
Andrei Demus
Data/ML Engineer
Ukraine
Igor Korsunov
ML Engineer
Russia
Anna Lysak
Data/ML Engineer
Ukraine
Yongtao Ma
Senior ML Engineer
Germany
Giannis Koutsoubos
Lead Backend Engineer
Greece
4.
Me:
● Highest-skilled experts for the job (QUALITY)
● Competitive/lower rate (COST/EARNING)
● Mix of long-term and project-based staff (AGILITY)
My Team:
● Work on cutting-edge projects (QUALITY)
● Happy with competitive compensation + flexibility in location and work hours (COST/EARNING)
● Work only when they want work (AGILITY)
With Upwork, our new hires AND I are better off!
5.
We believe significant
welfare improvements
can be achieved through
data science driven
optimization of the online
labor marketplace.
6.
We have the biggest
closed-loop online
dataset of jobs and job
seekers in labor
history.
Contract progress (~1B)
Feedback (~10M)
Web site activity (~10B)
Money transactions (~100M)
Profiles (~10M)
Job Posts (~10M)
Proposals (~100M)
Messages (~100M)
Hiring decisions (~10M)
8.
We need to support an agile data science workflow
to provide quick and validated improvements!
● Data Science analytics
○ Complete and cleansed data, single ground-truth
○ Tools for computing metrics, continuous validation
● Data Science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Batch + live data mismatches
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline)
○ Flexibility to accommodate variety of ML frameworks
● Data Science model productionization
○ Minimize differences between trained model and production code
○ Code modularized, tested, integrated into CI/CD workflow
○ Standardized model serving that is scalable, available, high throughput, low latency...
11.
● Kinesis and Spark Structured Streaming for high throughput live event data processing
● Moving away from traditional DWH solution to distributed Spark-based batch data
processing to avoid performance issues and workload limitations
● Spark MLlib + Tensorflow as core ML libraries to balance the tradeoff between
flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and
orchestrated via Airflow
● Experiment management and model versioning integral part of CI/CD workflow
● Adapt the engineering CI/CD workflow to data science using Jenkins, Databricks and Airflow:
standalone model testing + live regression tests help identify batch and live data
mismatches
● Spark-based pipeline developed by data scientists directly used for model scoring in
production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments
and feature engineering code
● Graphite, ELK and Pagerduty for logging, monitoring and alerts
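As a concrete illustration of the live event processing bullet above, here is a minimal sketch of wiring a Kinesis stream into Spark Structured Streaming on Databricks. The stream name, region and paths are hypothetical placeholders, not Upwork's actual configuration, and since no Spark session is assumed here, the Spark calls themselves are shown as comments; only the option wiring is executable Python.

```python
# Hypothetical option wiring for a Kinesis source; stream name and
# region are made-up placeholders.
kinesis_options = {
    "streamName": "site-activity-events",  # hypothetical stream
    "region": "us-west-2",                 # hypothetical region
    "initialPosition": "latest",           # start at the tip of the stream
}

# On a Databricks cluster, the live event stream would be declared roughly as:
#
# events = (spark.readStream
#                .format("kinesis")        # Databricks-provided source
#                .options(**kinesis_options)
#                .load())
#
# and written out with a micro-batch sink, e.g.:
#
# (events.writeStream
#        .option("checkpointLocation", "/checkpoints/site-activity")
#        .start("/data/site-activity"))
```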
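The Airflow orchestration bullet can be sketched as a task dependency graph. Task names and the schedule below are illustrative, not the real job names; the Airflow declaration is shown in comments since the library is not assumed installed here, while the dependency graph itself is plain Python.

```python
# Hypothetical retraining pipeline expressed as task -> upstream tasks.
deps = {
    "extract_events": [],
    "compute_features": ["extract_events"],
    "retrain_model": ["compute_features"],
    "validate_model": ["retrain_model"],
    "publish_model": ["validate_model"],
}

# In Airflow, the same graph would be declared roughly as:
#
# with DAG("ml_retraining", schedule_interval="@daily") as dag:
#     tasks = {name: PythonOperator(task_id=name, python_callable=run[name])
#              for name in deps}
#     for name, upstreams in deps.items():
#         for up in upstreams:
#             tasks[up] >> tasks[name]

# Sanity check: every upstream task is itself defined in the graph.
assert all(up in deps for ups in deps.values() for up in ups)
```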
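The live regression test mentioned above, used to catch batch vs. live data mismatches, can be illustrated with a small framework-free parity check; feature names and values here are hypothetical.

```python
import math

def feature_mismatches(batch_row, live_row, tol=1e-6):
    """Return names of features whose batch and live values disagree.

    Both inputs are dicts mapping feature name -> float; a key missing
    on either side also counts as a mismatch.
    """
    names = set(batch_row) | set(live_row)
    bad = []
    for name in sorted(names):
        b, lv = batch_row.get(name), live_row.get(name)
        if b is None or lv is None or not math.isclose(b, lv, rel_tol=tol, abs_tol=tol):
            bad.append(name)
    return bad

# Hypothetical feature vectors for the same job post, one computed by the
# batch pipeline and one by the streaming pipeline.
batch = {"proposal_count": 12.0, "client_spend": 3400.0, "posting_age_h": 5.0}
live  = {"proposal_count": 12.0, "client_spend": 3400.0, "posting_age_h": 4.0}

print(feature_mismatches(batch, live))  # -> ['posting_age_h']
```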
22. • Microservices can lead to data fragmentation and high downstream
processing overhead
• Structured streaming latency when number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data
• Differences between trained vs. deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflow and
artefacts
• Databricks notebooks very convenient for collaboration, documentation,
code sharing and reuse, results dissemination
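A minimal sketch of the kind of model-serving microservice referred to above, using only the Python standard library. The linear "model", its weights, and the feature names are placeholders standing in for a real trained pipeline, not Upwork's actual service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: a tiny linear scorer standing in for a real
# Spark/TensorFlow model loaded at service startup.
WEIGHTS = {"proposal_count": 0.02, "client_spend": 0.0001}
BIAS = 0.1

def score(features):
    """Dot product of known features with the model weights, plus bias."""
    return BIAS + sum(WEIGHTS.get(k, 0.0) * v for k, v in features.items())

class ScoreHandler(BaseHTTPRequestHandler):
    """POST a JSON feature dict to receive {"score": <float>} back."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        features = json.loads(body or b"{}")
        payload = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve (blocking call, not run here):
# HTTPServer(("127.0.0.1", 8080), ScoreHandler).serve_forever()
```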
23.
Interested in search & recommendations, multi-sided matching or
online labor marketplace optimization? We are hiring!
Interested in doing work only when you want work?
Join Upwork as a contractor!