How to Build a ML Platform Efficiently Using Open-Source

Fast-growing startups usually face a common set of challenges when employing machine learning. Data scientists are expected to work on new products and develop new models as well as iterate on existing ones. Once in production, models should be continuously monitored and regularly maintained as the infrastructure evolves. Before too long, data scientists end up spending most of their time maintaining and firefighting existing models instead of creating new ones.

At GetYourGuide, we faced these challenges and decided to think about machine learning development holistically, which led us to build our machine learning platform. Our platform uses MLflow to keep track of our machine learning lifecycle and ease the development experience. To integrate our models into our production environment, we also need to deal with additional requirements like API specification, SLOs and monitoring. To empower our data scientists, we have built a templating system that takes care of the heavy lifting of going to production, leveraging software engineering tools and ML-specific ones like BentoML.

In this talk we will present:
▪ Our previous approaches for deploying models and their tradeoffs
▪ Our data science and platform principles
▪ The main functionalities of our platform
▪ A live demo to create a new service
▪ Our learnings in the process

Jean Carlo Machado
Theodore Meynard
GetYourGuide

Agenda
▪ Introduction
▪ ML at GetYourGuide
▪ Before the Platform
▪ ML Platform
▪ Demo
▪ Final words

Who Are We
Theo: Senior Data Scientist (theodore.meynard@getyourguide.com)
Jean: Senior Software Engineer (jean.machado@getyourguide.com)

We’ve built the world’s largest marketplace for travel activities…
▪ Millions of travelers use GetYourGuide every year
▪ We offer more than 40,000 activities worldwide
▪ We facilitate the transaction, connecting customers to suppliers around the world

Other Data Products
(Image: Amsterdam: Canal Cruise)
We also use ML for:
• Demand forecasting
• Inventory labeling
⇒ 20+ ML projects distributed across 2 teams, plus models delivered to other teams to maintain

Data Product Principles
Quality, Testing & Monitoring:
● We follow clean code principles
● PoCs are temporary
● We build solid, resilient deployment processes
● We know our models' health at every point in time
Engineering:
● We integrate into the engineering ecosystem and leverage open-source
● Data workflows are efficient and cost-effective
● We take reproducibility and modularity seriously
Stakeholder Engagement:
● We promote the Data Product mindset
● We deeply integrate with data stakeholders
Workflow:
● We actively manage the unknowns in our planning
● Data analytics dynamically informs our project plans
Strategy:
● Customer and business value over fancy solutions
● Exploration is one of our goals
Model:
● We value small iterations on existing models
● We value explainability over pure accuracy
● Performance is proven online
Our principles explained

How We Started
Pros
● Widely used by ML practitioners
● Good for starting new projects and prototyping
● Great visualization
Cons
● No proper version control
● No code reuse
● No automatic testing

A Major Improvement
Pros
● Tests included in library
● Version control with code review
● Maintainable projects
Cons
● No CI/CD
● No model tracking

ML Platform Key Features
(Image: From Amsterdam: Volendam, Marken and Windmills)
● CI/CD
● Model Tracking
● Batch & Online inference
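One way to read "Batch & Online inference" is that the same prediction logic can serve both paths. The sketch below illustrates that idea using only the Python standard library. The function names, the toy scoring rule, and the request shape are assumptions made for illustration and are not taken from GetYourGuide's platform.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# A predictor maps one feature vector to one score.
Predictor = Callable[[Sequence[float]], float]


def toy_predict(features: Sequence[float]) -> float:
    """Stand-in for a trained model: the mean of the feature values."""
    return sum(features) / len(features)


def batch_inference(rows: Iterable[Sequence[float]], predict: Predictor) -> List[float]:
    """Batch path: score every row of a dataset in one job."""
    return [predict(row) for row in rows]


def online_inference(request: Dict[str, Sequence[float]], predict: Predictor) -> Dict[str, float]:
    """Online path: score a single request and return a JSON-like response."""
    return {"prediction": predict(request["features"])}


if __name__ == "__main__":
    print(batch_inference([[1.0, 3.0], [2.0, 4.0]], toy_predict))
    print(online_inference({"features": [1.0, 3.0]}, toy_predict))
```

Passing the predictor in as an argument keeps both entry points thin, so a change to the model logic only has to be made in one place.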

ML Platform Principles
• Maximize data scientists’ model ownership
• Reproducible Machine Learning
• Reuse most of our existing infrastructure
• Build incrementally
• Use open-source and open standards

From MLflow to BentoML Example

iris_classifier.py

# Uses the BentoML 0.x BentoService API
import pandas as pd
from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact


@artifacts([SklearnModelArtifact("model")])
class IrisClassifier(BentoService):
    """A minimum prediction service"""

    @api(input=DataframeInput())
    def predict(self, df: pd.DataFrame):
        """An inference API named `predict`"""
        return self.artifacts.model.predict(df)

mlflow2bentoml.py

import mlflow.sklearn
from iris_classifier import IrisClassifier

# model_uri is not defined on the slide; it must point to a logged MLflow model
# Load mlflow model
mlflow_model = mlflow.sklearn.load_model(model_uri)
# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()
# Pack the newly trained model artifact
iris_classifier_service.pack("model", mlflow_model)
# Save the prediction service for model serving
saved_path = iris_classifier_service.save()
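The example above leaves `model_uri` undefined. MLflow accepts model URIs such as `runs:/<run_id>/<artifact_path>` for a model logged in a tracking run, and `models:/<name>/<version>` for a model in the Model Registry (a stage name can take the place of the version in MLflow releases that support stages). The sketch below only builds such strings. The helper names and the identifiers in the demo are hypothetical and do not come from the talk.

```python
"""Helpers for building MLflow model URIs (hypothetical names, standard library only)."""


def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model logged under `artifact_path` in a tracking run."""
    if not run_id:
        raise ValueError("run_id must not be empty")
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version_or_stage: str) -> str:
    """URI of a registered model, addressed by version number or stage name."""
    if not name:
        raise ValueError("name must not be empty")
    return f"models:/{name}/{version_or_stage}"


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(run_model_uri("0123456789abcdef"))
    print(registry_model_uri("iris-classifier", "1"))
```

Either string could then be passed as `model_uri` to `mlflow.sklearn.load_model`, provided the run or registered model exists on the configured tracking server.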

Learnings
● The integration with the existing architecture fosters proactive collaboration
● The tool space is very new, which makes exploration vital but time-consuming
● A design review is necessary to align everyone

Conclusion
(Image: Amsterdam: Moco Museum)
● As we grew, we needed to refine our data science process
● Good software engineering practices, plus a special twist for ML
● Our platform helps data scientists to
○ Build faster
○ Deploy safer
○ Document automatically


How to Build a ML Platform Efficiently Using Open-Source

1. How to Build a ML Platform Efficiently Using Open-Source
2. Agenda
3. Who Are We
4. Introduction
5. We’ve built the world’s largest marketplace for travel activities…
6. ML at GetYourGuide
7. Data Product: Ranking Service
8. Data Product: Recommendation Panels
9. Data Product: Paid Search
10. Other Data Products
11. Data Product Principles
12. Before the platform
13. How We Started
14. A Major Improvement
15. ML Platform
16. ML Platform Key Features
17. ML Platform Principles
18. Our Current Workflow
19. ML CI/CD
20. Training Path
21. Batch Inference Path
22. Online Inference Path
23. Online Inference With BentoML
24. From MLflow to BentoML Example
25. Demo
26. Final remarks
27. Learnings
28. Conclusion
29. Feedback