Data Science Infrastructure Team, Thanh Tran
Upwork
How to Rebuild Data and ML Platform using Kinesis, S3, Spark, MLlib, Databricks, Airflow and Upwork
Introduction
Nikolay Melnik, Lead ML Engineer, Ukraine
Dimitris Manikis, Senior Data Engineer, Greece
Artem Moskvin, Data/ML Engineer, Germany
Roman Tkachuk, Senior Data Engineer, Ukraine
Andrei Demus, Data/ML Engineer, Ukraine
Igor Korsunov, ML Engineer, Russia
Anna Lysak, Data/ML Engineer, Ukraine
Yongtao Ma, Senior ML Engineer, Germany
Giannis Koutsoubos, Lead Backend Engineer, Greece
With Upwork, our new hires AND I are better off!
Me
● QUALITY: Highest-skilled experts for the job
● COST/EARNING: Competitive/lower rate
● AGILITY: Mix of long-term and project-based staff
My Team
● QUALITY: Work on cutting-edge projects
● COST/EARNING: Happy with competitive compensation + flexibility in location and work hours
● AGILITY: Work only when they want to work
We believe significant welfare improvements can be achieved through data-science-driven optimization of the online labor marketplace.
We have the biggest closed-loop online dataset of jobs and job seekers in labor history.
Profiles (~10M)
Job Posts (~10M)
Proposals (~100M)
Messages (~100M)
Hiring decisions (~10M)
Contract progress (~1B)
Feedback (~10M)
Web site activity (~10B)
Money transactions (~100M)
What do we need to ship data
science products?
We need to support an agile data science workflow
to provide quick and validated improvements!
● Data Science analytics
○ Complete and cleansed data, single ground-truth
○ Tools for computing metrics, continuous validation
● Data Science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Detection of mismatches between batch and live data
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline)
○ Flexibility to accommodate a variety of ML frameworks
● Data Science model productionalization
○ Minimize differences between trained model and production code
○ Code modularized, tested, integrated into CI/CD workflow
○ Standardized model serving that is scalable, available, high throughput, low latency...
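One common way to minimize differences between the trained model and production code is to route both training and live scoring through the same feature function. The sketch below is only an illustration of that idea: the `Proposal` fields, the feature names, and the linear scoring step are hypothetical and are not taken from the Upwork platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Proposal:
    # Hypothetical fields, chosen only for illustration.
    freelancer_rate: float
    client_budget: float
    cover_letter: str


def compute_features(p: Proposal) -> Dict[str, float]:
    """Single feature definition shared by training and serving."""
    ratio = p.freelancer_rate / p.client_budget if p.client_budget > 0 else 0.0
    return {
        "rate_to_budget": ratio,
        "cover_letter_words": float(len(p.cover_letter.split())),
    }


def build_training_rows(history: List[Proposal]) -> List[Dict[str, float]]:
    """Batch path: features for historical proposals."""
    return [compute_features(p) for p in history]


def score_live(p: Proposal, weights: Dict[str, float]) -> float:
    """Live path: same features, scored with an assumed linear model."""
    feats = compute_features(p)
    return sum(weights[name] * value for name, value in feats.items())
```

Because both paths call `compute_features`, a change to a feature definition reaches training and serving at the same time.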
Upwork Data & ML Platform
● Kinesis and Spark Structured Streaming for high-throughput live event data processing
● Moving away from a traditional DWH solution to distributed Spark-based batch data processing to avoid performance issues and workload limitations
● Spark MLlib + TensorFlow as core ML libraries to balance the tradeoff between flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow
● Experiment management and model versioning as an integral part of the CI/CD workflow
● Adapt the engineering CI/CD workflow to data science using Jenkins, Databricks and Airflow: standalone model testing + live regression tests help to identify batch and live data mismatches
● Spark-based pipeline developed by data scientists is used directly for model scoring in the production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments and feature engineering code
● Graphite, ELK and PagerDuty for logging, monitoring and alerts
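A live regression test of the kind mentioned above could, in its simplest form, compare the feature values produced by the batch pipeline with those produced by the live pipeline for the same entity. The following is a minimal sketch under that assumption; `find_mismatches`, its tolerance parameter, and the feature names in the usage note are hypothetical and do not describe the team's actual test harness.

```python
import math
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[float], Optional[float]]


def find_mismatches(
    batch: Dict[str, float],
    live: Dict[str, float],
    rel_tol: float = 1e-6,
) -> List[Mismatch]:
    """Return (feature, batch_value, live_value) for every disagreement.

    A feature missing on one side is reported with None for that side.
    """
    mismatches: List[Mismatch] = []
    for name in sorted(set(batch) | set(live)):
        if name not in batch or name not in live:
            mismatches.append((name, batch.get(name), live.get(name)))
        elif not math.isclose(batch[name], live[name], rel_tol=rel_tol):
            mismatches.append((name, batch[name], live[name]))
    return mismatches
```

For example, `find_mismatches({"views_7d": 12.0}, {"views_7d": 11.0})` reports the `views_7d` feature, and a CI job could fail whenever the returned list is non-empty.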
Batch Data & ML Environment
Live Data & ML Pipeline
CI/CD Workflow
Pitfalls and Lessons Learned
• Microservices can lead to data fragmentation and high downstream processing overhead
• Structured Streaming latency is a problem when the number of Kinesis consumers is high
• Stream-to-stream and stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data
• Differences between the trained and the deployed ML pipeline can be minimized
• CI/CD needs to be customized to support the data science workflow and its artefacts
• Databricks notebooks are very convenient for collaboration, documentation, code sharing and reuse, and results dissemination
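One customization a data science CI/CD workflow often needs is a reproducible identifier for model artefacts. As an illustrative sketch only, the version can be derived from everything that determines the model: hyperparameters, code revision, and training data snapshot. The function name, the argument names, and the 12-character truncation are assumptions made for this example, not the scheme used on the platform.

```python
import hashlib
import json
from typing import Any, Dict


def model_version(
    params: Dict[str, Any],
    code_revision: str,
    data_snapshot: str,
) -> str:
    """Deterministic short id for a trained model artefact.

    Inputs are serialized with sorted keys so that the same configuration
    always yields the same id, regardless of dict ordering.
    """
    payload = json.dumps(
        {"params": params, "code": code_revision, "data": data_snapshot},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Retraining with identical inputs yields the same id, while a change to the parameters, the code, or the data produces a new one, which makes it easier to tell which artefact a deployment refers to.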
Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring!
Interested in working only when you want to work?
Join Upwork as a contractor!

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 

How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Thanh Tran

  • 1. How to Rebuild Data and ML Platform using Kinesis, S3, Spark, MLlib, Databricks, Airflow and Upwork. Thanh Tran, Data Science Infrastructure Team, Upwork
  • 3. Our team: Nikolay Melnik (Lead ML Engineer, Ukraine); Dimitris Manikis (Senior Data Engineer, Greece); Artem Moskvin (Data/ML Engineer, Germany); Roman Tkachuk (Senior Data Engineer, Ukraine); Andrei Demus (Data/ML Engineer, Ukraine); Igor Korsunov (ML Engineer, Russia); Anna Lysak (Data/ML Engineer, Ukraine); Yongtao Ma (Senior ML Engineer, Germany); Giannis Koutsoubos (Lead Backend Engineer, Greece)
  • 4. With Upwork, our new hires AND I are better off!
      ● Me
          ○ Quality: highest-skilled experts for the job
          ○ Cost: competitive/lower rate
          ○ Agility: mix of long-term and project-based staff
      ● My team
          ○ Quality: work on cutting-edge projects
          ○ Earning: happy with competitive compensation + flexibility in location and work hours
          ○ Agility: work only when they want work
  • 5. We believe significant welfare improvements can be achieved through data-science-driven optimization of the online labor marketplace.
  • 6. We have the biggest closed-loop online dataset of jobs and job seekers in labor history: Profiles (~10M), Job Posts (~10M), Proposals (~100M), Messages (~100M), Hiring decisions (~10M), Contract progress (~1B), Feedback (~10M), Web site activity (~10B), Money transactions (~100M)
  • 7. What do we need to ship data science products?
  • 8. We need to support an agile data science workflow to provide quick and validated improvements!
      ● Data Science analytics
          ○ Complete and cleansed data, single ground-truth
          ○ Tools for computing metrics, continuous validation
      ● Data Science model development
          ○ Business objects and UI event data
          ○ Scaling complex data processing and feature computation
          ○ Discoverability of data and features
          ○ Batch + live data mismatches
          ○ Managing, monitoring and versioning of models and experiments
          ○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline)
          ○ Flexibility to accommodate a variety of ML frameworks
      ● Data Science model productionalization
          ○ Minimize differences between trained model and production code
          ○ Code modularized, tested, integrated into CI/CD workflow
          ○ Standardized model serving that is scalable, available, high throughput, low latency...
  • 9. Upwork Data & ML Platform (slide 10 has no transcribed text)
  • 11.
      ● Kinesis and Spark Structured Streaming for high-throughput live event data processing
      ● Moving away from a traditional DWH solution to distributed Spark-based batch data processing to avoid performance issues and workload limitations
      ● Spark MLlib + TensorFlow as core ML libraries to balance the tradeoff between flexibility and standardized model engineering
      ● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow
      ● Experiment management and model versioning as an integral part of the CI/CD workflow
      ● Adapt the engineering CI/CD workflow to data science using Jenkins, Databricks and Airflow: standalone model testing + live regression tests help to identify batch and live data mismatches
      ● Spark-based pipeline developed by data scientists is used directly for model scoring in the production environment
      ● Microservices for streamlined model serving, scalability, availability...
      ● Extensive use of Databricks notebook-based documentation of models, experiments and feature engineering code
      ● Graphite, ELK and PagerDuty for logging, monitoring and alerts
  • 12. Batch Data & ML Environment (slides 13 and 14 have no transcribed text)
  • 15. Live Data & ML Pipeline (slides 16, 17, 19 and 20 have no transcribed text)
  • 21. Pitfalls and Lessons Learned
  • 22.
      ● Microservices can lead to data fragmentation and high downstream processing overhead
      ● Structured Streaming latency when the number of Kinesis consumers is high
      ● Stream-to-stream/stream-to-batch joins not suitable for real-time use cases yet
      ● Differences between live and batch data
      ● Differences between the trained and the deployed ML pipeline can be minimized
      ● CI/CD needs to be customized to support data science workflow and artefacts
      ● Databricks notebooks are very convenient for collaboration, documentation, code sharing and reuse, results dissemination
  • 23. Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring! Interested in doing work only when you want work? Join Upwork as contractors!
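
The slides name "batch and live data mismatches" as a recurring problem and mention a live regression test for catching them, without showing how it works. The sketch below is one hypothetical way such a check could look, not the team's implementation. It assumes features for the same entities have already been computed once by the batch pipeline and once by the live pipeline and collected into plain dictionaries; the function name, the data layout and the tolerance are all illustrative assumptions.

```python
from math import isclose


def find_feature_mismatches(batch_rows, live_rows, rel_tol=1e-6):
    """Compare batch and live feature values per entity.

    Both arguments map an entity id to a dict of feature name -> value
    (a hypothetical layout chosen for this sketch). Returns a list of
    (entity_id, feature_name, batch_value, live_value) tuples, one for
    every feature whose values disagree or that is missing on one side.
    """
    mismatches = []
    for entity_id in sorted(set(batch_rows) | set(live_rows)):
        batch_features = batch_rows.get(entity_id, {})
        live_features = live_rows.get(entity_id, {})
        for name in sorted(set(batch_features) | set(live_features)):
            batch_value = batch_features.get(name)
            live_value = live_features.get(name)
            both_numeric = isinstance(batch_value, (int, float)) and isinstance(
                live_value, (int, float)
            )
            if both_numeric:
                # Numeric features are compared with a relative tolerance
                # so harmless floating-point noise is not reported.
                equal = isclose(batch_value, live_value, rel_tol=rel_tol)
            else:
                # Non-numeric or missing values must match exactly.
                equal = batch_value == live_value
            if not equal:
                mismatches.append((entity_id, name, batch_value, live_value))
    return mismatches


# Illustrative data only: two job posts scored by both pipelines.
batch = {
    "job-1": {"proposals": 12, "avg_rate": 35.0},
    "job-2": {"proposals": 3, "avg_rate": 20.0},
}
live = {
    "job-1": {"proposals": 12, "avg_rate": 35.0},
    "job-2": {"proposals": 4, "avg_rate": 20.0},
}
report = find_feature_mismatches(batch, live)
```

In a CI/CD job, a non-empty report could fail the build, so a drift between the batch and live feature code is noticed before a model trained on one is served with the other.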