Asset Management for Machine Learning
Georg Hildebrand
DATA AI
$$$
???
The Big Picture:
ML = Knowledge Worker Productivity x 50 ??
Peter Drucker (source)
[Figure labels: increase of performance; acceleration through automation; the area of knowledge workers]
… the story of a model that was sent as an email attachment ...
… the story of features that were no longer available ...
Pick a complexity for your ML product:
* D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511.
source
System of ML silos
Productivity …
Hidden depth of machine learning
How to reach 50x productivity?
… make models and features human- and machine-readable!
Store context
- How was the input data chosen?
- Who trained the model? Is there a project paper?
- How to run the model?
- Store the log output of model training
- Framework used to build the model
- Chain model and feature preprocessing!
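The context items above could be kept in a small machine-readable record stored next to the model. The following is a minimal sketch using only the Python standard library; the class `ModelContext`, all field names, and all example values are hypothetical and not taken from the talk.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class ModelContext:
    """Context stored next to a trained model (all field names are hypothetical)."""
    model_name: str
    trained_by: str
    input_data_query: str        # how the input data was chosen
    framework: str               # framework used to build the model
    run_command: str             # how to run the model
    project_paper: str = ""      # link to a project paper, if any
    preprocessing_steps: List[str] = field(default_factory=list)
    training_log_path: str = ""  # where the training log output is stored

    def to_json(self) -> str:
        # sort_keys gives a stable, diff-friendly file for version control
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ModelContext":
        return cls(**json.loads(text))


# Example values are made up for illustration.
context = ModelContext(
    model_name="churn-classifier",
    trained_by="jane.doe",
    input_data_query="SELECT * FROM events WHERE day >= '2018-01-01'",
    framework="scikit-learn",
    run_command="python predict.py --model model.pkl",
    preprocessing_steps=["drop_nulls", "standard_scale"],
    training_log_path="logs/train.log",
)
serialized = context.to_json()
restored = ModelContext.from_json(serialized)
```

A JSON file like this is readable by people and by tooling, and it can be committed alongside the model artifact so the context travels with it.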
Version and test
- Snapshot input data / features
- Version the model (e.g. into S3 or a backend database)
- Link the container artifact
- Provide input and output tests
- Store performance metrics
- …
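One way to read the list above is sketched below: a content hash identifies the data snapshot, a record links model version, snapshot and container image, and a "golden" input guards the model's output. This is an illustration under assumptions, not a method from the talk; `predict` is a stand-in model, and the image name and version string are hypothetical.

```python
import hashlib
import json


def snapshot_id(rows):
    """Content hash of the input data; the same rows always give the same id."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def predict(features):
    """Stand-in model: a fixed linear score, used only for illustration."""
    return 0.5 * features["age"] + 2.0 * features["visits"]


def check_input(features):
    """Input test: required features are present and numeric."""
    for name in ("age", "visits"):
        if not isinstance(features.get(name), (int, float)):
            raise ValueError(f"missing or non-numeric feature: {name}")


training_rows = [{"age": 30, "visits": 2}, {"age": 41, "visits": 5}]

# Record linking the model version to its data snapshot and container image.
model_record = {
    "model_version": "1.0.0",                               # hypothetical
    "data_snapshot": snapshot_id(training_rows),
    "container_image": "registry.example.com/churn:1.0.0",  # hypothetical
    "metrics": {},
}

# Output test: a known input must keep producing the known output.
golden_input = {"age": 30, "visits": 2}
check_input(golden_input)
golden_output = predict(golden_input)  # 0.5 * 30 + 2.0 * 2 = 19.0
model_record["metrics"]["golden_output"] = golden_output
```

Storing the golden output in the record means a later rebuild of the model or its container can be checked against the value that was accepted at release time.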
Put it into practice
- There is no one-stop-shop solution :-/
Manage ML Deployment pipeline:
Example: SeldonIO, serving and integrating ML
Purpose: End to end deployment for serving ML and more, very advanced.
https://github.com/SeldonIO/
Tries to use a modularised approach!
May require data to move around.
Manage Models + Data and serve low latency:
Example: Vespa, storing model together with the data
Purpose: e.g. low-latency serving of ML
https://github.com/vespa-engine/vespa
[Architecture diagram labels: Query; Container node; Scatter-gather; Content node; Core sharding; Admin & Config; Application Package; Deploy (Configuration, Components, ML models); models on each content node]
Moves models to data!
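The idea of moving the model to the data can be sketched as scatter-gather in plain Python. This is not Vespa's API. The shards, the scoring function `model`, and the helper names are all hypothetical, and threads stand in for separate content nodes.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Each content node holds one shard of the documents (made-up data).
shards = [
    [{"id": "a", "clicks": 3}, {"id": "b", "clicks": 9}],
    [{"id": "c", "clicks": 7}],
    [{"id": "d", "clicks": 1}, {"id": "e", "clicks": 8}],
]


def model(doc):
    """Stand-in ranking model that is shipped to every content node."""
    return float(doc["clicks"])


def score_shard(shard, k):
    """Runs on the content node: score local documents, return only the top k."""
    scored = [(model(doc), doc["id"]) for doc in shard]
    return heapq.nlargest(k, scored)


def scatter_gather(k):
    """Runs on the container node: fan out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: score_shard(s, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]


top = scatter_gather(k=2)
```

Only the top-k hits of each shard cross the network, so the documents themselves never move. That is the trade-off the slide contrasts with pipelines that may require data to move around.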
Manage Project, Code, Data and more:
Purpose: Git for data scientists - manage your code and data together
Example: DVC (Data Version Control)
https://github.com/iterative/dvc
Input and output are tracked --> DAG !!!
$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python
* source, why Luigi and Airflow might not be enough...
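Why tracked inputs and outputs yield a DAG can be shown with a small sketch. It does not use DVC itself, only the standard-library `graphlib` module (Python 3.9+). Apart from `xml_to_tsv` and its two files, every step and file name is hypothetical.

```python
from graphlib import TopologicalSorter

# Each step declares the files it reads and writes.
steps = {
    "xml_to_tsv": {"inputs": ["data/Posts.xml"], "outputs": ["data/Posts.tsv"]},
    "split": {"inputs": ["data/Posts.tsv"],
              "outputs": ["data/train.tsv", "data/test.tsv"]},
    "train": {"inputs": ["data/train.tsv"], "outputs": ["model.pkl"]},
    "evaluate": {"inputs": ["model.pkl", "data/test.tsv"],
                 "outputs": ["metrics.json"]},
}


def build_dag(steps):
    """Map each step to the set of steps that produce its inputs."""
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    return {
        name: {producer[path] for path in s["inputs"] if path in producer}
        for name, s in steps.items()
    }


def downstream(dag, changed):
    """Steps that must be re-run when `changed` is re-run."""
    stale = {changed}
    # Topological order guarantees producers are visited before consumers.
    for name in TopologicalSorter(dag).static_order():
        if dag[name] & stale:
            stale.add(name)
    return stale


dag = build_dag(steps)
order = list(TopologicalSorter(dag).static_order())
```

Here `order` gives a valid execution order, and `downstream(dag, "train")` names the steps that become stale after retraining, which is the bookkeeping a data-aware tool can do without a hand-written workflow definition.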
Model tracking, snapshotting and reproducibility:
Purpose: Git for data scientists - turn any repository into a trackable task record with reusable environments and metrics logging.
Example: Datmo
https://github.com/datmo/datmo
Open source, but tied to a paid service!!!
Reproducible Model Pipelines + serving:
Purpose: Consistent, Immutable, Reproducible Model Runtimes
Example: PipelineAI
https://github.com/PipelineAI/pipeline
Models are stored in Docker containers!!!
Open Source solutions for model / workflow management

Open Source Project   K8s 1st class support   # Active Maintainers   # Commits   1st Commit   Comp
Seldon.IO             Yes + Kubeflow          3                      1131        2017-Feb     #6
H2o driverless-ai     Yes + Kubeflow          14                     22967       2014-Mar     #12
PipelineAI            Yes. No Kubeflow        1                      4846        2015-Jul     #273
Polyaxon              Yes. No Kubeflow        1                      2732        2017-May     #88
Vespa-engine          No, but possible        14                     18290       2016-Jun     #5854
IBM/FfDL              Yes. No Kubeflow        4                      bulk-init   hidden       #77
RiseML                Yes. No Kubeflow        3                      95          2017-Oct     #9
databricks/mlflow     No. No Kubeflow         5                      bulk-init   hidden       #58
datmo                 No, possible            4                      bulk-init   2018-04      #213
Data / Feature Management

Open Source Project          Note                          # Active Maintainers   # Commits   1st Commit   Comp
Data Version Control (DVC)   project and data management   2                      1364        2017-Mar     #readme
pachyderm                    Data Pipeline, management     6                      11652       2014-09      tbd
quiltdata/quilt              Data and package management   8                      1177        2017-01      tbd
Questions?
Georg Hildebrand, Zalando SE
https://github.com/georghildebrand
https://www.linkedin.com/in/georghildebrand/
Example Components
ML Journey
- Idea
- Explore
- Combine and Extract Data
- Analyse and Preprocess
- Model and train
- Test and Monitor
- Share and Improve
There is even more hidden depth:
● Models erode boundaries of modularisation: context and preprocessing dependencies
● Data dependencies cost more than code dependencies (unstable data dependencies, data quality monitoring etc.)
● Complex hardware requirements (GPUs on Kubernetes??)
● Feedback loops (model influences its feature data)
● ML-system anti-patterns (glue code, pipeline jungles …)
* Own experience
** D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511.
*** D. Sculley et al., “Machine Learning: The High-Interest Credit Card of Technical Debt,” p. 9.

2018 data engineering for ml asset management for features and models
