Data Provenance in Hopsworks
Alex Ormenisan
PhD student @KTH
Software Engineer @Logical Clocks
Hops ML Stockholm
September 12th 2019
What is Data Lineage?
A graph representation of:
• Data history
• Data evolution
• Data transformations
• Data dependencies
Systems get bigger.
Data pipelines get longer.
Finding the root cause of data problems gets harder.
Is Lineage relevant for ML?
• The Machine Learning development process is complex.
• It produces many artifacts.
[Diagram: pipeline stages: Data Prep, Feature Store, Train & Validate, Model Serving]
• The process is iterative.
[Diagram: artifacts: Raw Datasets, Training Datasets, Experiments, Models]
Problems
• What version of this Feature was used to create the training dataset I just used?
• There was a bug introduced in a training dataset – what models are affected?
Is Lineage relevant for ML?
Lineage records the directed acyclic graph (DAG) of artifacts generated by the pipeline.
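As a rough illustration of why a DAG of artifacts helps with the "what models are affected?" question, here is a minimal Python sketch. It is not Hopsworks code: the `LineageGraph` class and all artifact names are hypothetical, and edges simply point from an input artifact to an artifact derived from it.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage DAG: nodes are artifact names, edges go from input to derived artifact."""

    def __init__(self):
        self._children = defaultdict(set)

    def add_edge(self, source, derived):
        # Record that `derived` was produced from `source`.
        self._children[source].add(derived)

    def downstream(self, artifact):
        """Return every artifact derived, directly or transitively, from `artifact`."""
        seen = set()
        queue = deque([artifact])
        while queue:
            node = queue.popleft()
            for child in self._children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


# Hypothetical artifacts, loosely following raw data -> features -> training data -> model.
graph = LineageGraph()
graph.add_edge("raw/sales.csv", "features/sales_v2")
graph.add_edge("features/sales_v2", "training/sales_td_1")
graph.add_edge("training/sales_td_1", "experiments/run_17")
graph.add_edge("experiments/run_17", "models/forecast_v3")
graph.add_edge("features/sales_v2", "training/sales_td_2")

# A bug was found in training/sales_td_1: which models are affected?
affected = graph.downstream("training/sales_td_1")
affected_models = sorted(a for a in affected if a.startswith("models/"))
print(affected_models)
```

The same traversal run in the opposite direction (from a training dataset back to its inputs) would answer the other question on the slide, namely which feature version a training dataset was built from.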
Why track Lineage?
Data discovery:
Free text search for artifacts is not sufficient.
Provide more context on artifacts:
• What pipelines is it used in?
• Most frequent users?
• What features are based on it?
• When was it last used?
• Is there a newer version of it?
*Disrupting data discovery – May 15th 2019, SF-Big-Analytics, https://www.meetup.com/SF-Big-Analytics/events/260676161/
Complex/repetitive pipelines:
Analyse:
• Compare current model accuracy with previous models
Debug:
• Context in case of failure: what changed, root cause?
Recompute:
• Suggestion to rerun pipeline in case there are significant changes in input.
Garbage collection:
• Delete model that is no longer in use?
• What related data can be deleted?
• What related pipelines can be removed?
Data Governance:
• Traceability
• Compliance
Explicit vs. Implicit Provenance
Explicit:
Top-down tracking of lineage.
Requires changes to application or library code.
Metadata Store decoupled from the platform.
[Diagram: layers: Libraries, Application, Running platform, Metadata store]
Implicit:
Bottom-up tracking of lineage.
Requires changing the platform.
Uses naming conventions to link files to artifacts.
Metadata strongly consistent with storage platform.
Explicit provenance in practice:
● Wrap existing code in components that execute a stage in the pipeline.
● Interceptors in components inject metadata into the Metadata Store.
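The wrap-and-intercept pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the API of MLflow, TFX, or Hopsworks: `METADATA_STORE`, `pipeline_stage`, and the two stage functions are hypothetical names, and the store is a plain in-memory list.

```python
import functools
import time

# Hypothetical in-memory stand-in for a metadata store.
METADATA_STORE = []


def pipeline_stage(name):
    """Wrap a function as a pipeline stage and record each execution."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*inputs):
            started = time.time()
            output = func(*inputs)
            # The interceptor: inject one metadata record per execution.
            METADATA_STORE.append({
                "stage": name,
                "inputs": list(inputs),
                "output": output,
                "started": started,
                "finished": time.time(),
            })
            return output
        return wrapper
    return decorator


@pipeline_stage("data_prep")
def prepare(raw_path):
    # Placeholder transformation: only derives an output path.
    return raw_path.replace("raw/", "training/")


@pipeline_stage("train")
def train(training_path):
    # Placeholder training step: only derives a model path.
    return training_path.replace("training/", "models/")


model_path = train(prepare("raw/sales.csv"))
for record in METADATA_STORE:
    print(record["stage"], record["inputs"], "->", record["output"])
```

Note that only code routed through the wrapper is recorded. Anything that reads or writes the same files without it leaves no trace, which is the completeness gap that implicit provenance aims to close.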
Hooks in the platform layers:
● Filesystem*
● Execution engine
● Job scheduling*
● Pipeline orchestration
Explicit vs. Implicit Provenance: Pros and Cons
Explicit (MLflow, TFX):
(-) User needs to change code.
(+) Lineage data is more exact - API can be arbitrarily rich.
(-) Consistency issues between Metadata and Storage/Scheduling layer.
Implicit (Hopsworks*):
(+) Platform developer does the work.
(+) Lineage information is complete. All data accesses are logged, not just calls from one specific library.
(+) Can detect concurrent access to the same artifacts.
(-) Requires tagging to reach similar expressiveness to explicit provenance.
(+) Works even when artifacts are accessed across different ML frameworks.
Explicit provenance
MLflow:
● Executions:
mlflow.create_experiment(…)
mlflow.set_experiment(…)
mlflow.start_run(…)
mlflow.end_run(…)
● Artifacts:
mlflow.log_artifact(…)
mlflow.sklearn.log_model(…)
Hopsworks:
● Executions:
hops.experiment.grid_search(…)
hops.experiment.collective_allreduce(…)
● FeatureStore:
hops.featurestore.create_training_dataset(…)
● Saving and deploying models:
hops.serving.export(…)
hops.serving.deploy(…)
Explicit provenance - TFX
From TFX documentation website: https://www.tensorflow.org/tfx/guide/mlmd
MLMD client libraries intercept and inject metadata.
The MetadataStore contains information about ML artifacts and executions.
Potential Problem
Metadata and Storage layer need to be kept in sync.
Lineage in Hopsworks
Hopsworks Implicit Lineage
Lineage in Hopsworks:
• File as unit of data
• Application as transformations
• Jobs – an application is a run of a job
• Pipelines – orchestrations of jobs
[Diagram: platform layers: Pipeline Orchestration, Job Scheduler, Application, Execution engine, Libraries, Filesystem]
Filesystem Lineage
Track file operations:
• create/delete/read/append
Metadata store based on HDFS Extended Attributes operations (XAttributes):
• add/update/remove operations
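To make the idea of a filesystem hook concrete, the following Python sketch logs the file and extended-attribute operations listed above. It is a toy in-memory model, not HopsFS: the `ToyFilesystem` class, its method names, and the example path are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFilesystem:
    """In-memory stand-in for a filesystem that logs every operation it serves."""
    files: dict = field(default_factory=dict)
    xattrs: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def _record(self, op, path):
        # The hook: every operation appends one (operation, path) entry to the log.
        self.log.append((op, path))

    def create(self, path, data=b""):
        self.files[path] = data
        self.xattrs[path] = {}
        self._record("create", path)

    def read(self, path):
        self._record("read", path)
        return self.files[path]

    def append(self, path, data):
        self.files[path] += data
        self._record("append", path)

    def delete(self, path):
        del self.files[path]
        del self.xattrs[path]
        self._record("delete", path)

    def set_xattr(self, path, key, value):
        # Distinguish adding a new attribute from updating an existing one.
        op = "xattr_update" if key in self.xattrs[path] else "xattr_add"
        self.xattrs[path][key] = value
        self._record(op, path)

    def remove_xattr(self, path, key):
        del self.xattrs[path][key]
        self._record("xattr_remove", path)


fs = ToyFilesystem()
model = "/Projects/demo/Models/m1"  # hypothetical path
fs.create(model, b"weights")
fs.set_xattr(model, "accuracy", "0.91")
fs.set_xattr(model, "accuracy", "0.93")
fs.append(model, b"-v2")
fs.read(model)
ops = [op for op, _ in fs.log]
print(ops)
```

Because the log is written by the filesystem itself, the caller does not have to cooperate, and the metadata lives next to the file it describes.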
[Diagram: platform stack: Airflow, Jobs, Yarn, Application, Spark, HopsUtilPy, HopsFS]
Scheduler Lineage
Application context:
• Application states: start/finished/killed/failed
• Provides a time interval as a context for lineage file operations.
• X.509 certificates link applications and file system operations.
[Diagram: platform stack: Airflow, Jobs, Yarn, Application, Spark, HopsUtilPy, HopsFS, plus X.509 Certificate]
Orchestration Lineage
Job context:
• Provides a link between different job runs (applications).
• Provides configuration context, including dependencies and its environment (conda).
Pipeline context:
• Provides a link between the pipeline job components.
Library Lineage
Attach XAttributes to enrich the lineage information:
• Grouping artifacts together – e.g. model type.
• Artifact attributes (accuracy)
• Finer-grained links between input and output artifacts
Hopsworks - Lineage
[Diagram: components: HopsFS (File Operations Log), ePipe (ChangeDataCapture API), Elasticsearch; artifacts: Raw Datasets, Training Datasets, Experiments, Models]
*ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
Application context
[Diagram: operations over time on File A, File B and File C, grouped into the start/stop intervals of App1 and App2]
Provides time as context.
Provides a context for files accessed together.
Detects concurrent usage of artifacts.
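One way to picture how application context could surface concurrent usage is the Python sketch below. It assumes two inputs that the slides describe but do not specify in detail: a start/stop interval per application (as a scheduler might report it) and a file operations log already attributed to applications. All names and values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical start/stop times per application, as a scheduler might report them.
APP_INTERVALS = {"app1": (0, 7), "app2": (5, 13), "app3": (20, 25)}

# Hypothetical file operations log: (application, file, timestamp).
OPERATIONS = [
    ("app1", "File A", 1), ("app1", "File B", 4), ("app1", "File C", 6),
    ("app2", "File B", 6), ("app2", "File C", 9), ("app2", "File C", 12),
    ("app3", "File C", 21),
]


def files_by_application(operations):
    """Group the files touched by each application (files accessed together)."""
    files = defaultdict(set)
    for app, path, _timestamp in operations:
        files[app].add(path)
    return files


def concurrent_usage(intervals, operations):
    """Return (app, app, shared files) for applications that overlap in time."""
    files = files_by_application(operations)
    conflicts = []
    for a, b in combinations(sorted(intervals), 2):
        (a_start, a_stop), (b_start, b_stop) = intervals[a], intervals[b]
        shared = files[a] & files[b]
        # Two closed intervals overlap when each starts before the other stops.
        if shared and a_start <= b_stop and b_start <= a_stop:
            conflicts.append((a, b, sorted(shared)))
    return conflicts


conflicts = concurrent_usage(APP_INTERVALS, OPERATIONS)
print(conflicts)
```

In this toy data, app3 also touches File C but runs after both other applications have stopped, so it is not reported.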
Application context
[Diagram: App1, App2, App3; Application Footprint and Application Impact, in terms of Input Files, Output Files and Temporary Files]
Job Context and Tagging
Tagging with XAttributes for better expressiveness
You can:
• Manually tag a set of files as being an ML artifact.
• Query Lineage for similar artifacts and tag them also.
• Follow platform-level artifact location rules to automatically detect and tag them.
• Use explicit provenance, with our hops ml library.
[Diagram: App1 and App2 under Job1; similar artifacts]
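Automatic tagging from location rules, one of the options listed on this slide, might look roughly like the following Python sketch. The directory layout, the `artifact_type` attribute name, and the `auto_tag` helper are assumptions for illustration and do not describe the actual Hopsworks rules.

```python
import re

# Hypothetical location rules: path pattern -> artifact type.
LOCATION_RULES = [
    (re.compile(r"^/Projects/[^/]+/Models/[^/]+/[^/]+$"), "model"),
    (re.compile(r"^/Projects/[^/]+/Training_Datasets/[^/]+$"), "training_dataset"),
]


def auto_tag(paths, xattrs):
    """Attach an artifact_type attribute to every path matching a location rule."""
    for path in paths:
        for pattern, artifact_type in LOCATION_RULES:
            if pattern.match(path):
                xattrs.setdefault(path, {})["artifact_type"] = artifact_type
                break  # first matching rule wins
    return xattrs


tags = auto_tag(
    [
        "/Projects/demo/Models/forecast/3",
        "/Projects/demo/Training_Datasets/sales_td_1",
        "/Projects/demo/Logs/app1.log",
    ],
    {},
)
for path, attributes in tags.items():
    print(path, attributes)
```

Paths that match no rule, such as the log file here, stay untagged and would need manual or similarity-based tagging.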
Summary
Provenance is coming to Hopsworks
[Diagram: App1, App2, App3; Job1, Job2; Pipeline1; Tagging; Explicit Provenance]
File provenance at filesystem level, with integrated metadata store.
Application context collected from job scheduler.
Job and Pipeline context collected from orchestration layers.
Tagging support through the metadata store.
Explicit provenance in hops libraries.
Questions?
Backup slides
Explicit provenance
Hopsworks use case: running two ML frameworks that support explicit provenance.
[Diagram: Framework1 with MetadataStore1 and Framework2 with MetadataStore2; a single MetadataStore]
● Hopefully they allow pluggable Metadata Stores.
● Maybe there is one implementation that fits both.
● Most likely they have different models for tracking and querying lineage.
● Consistency issues on delete.
Implicit provenance
MetadataStore in the storage layer.
Track provenance at file system level.
Linking between files and ML artifacts:
● Direct tagging
● Automatic tagging by following location-based platform rules
● Augmented tagging support through provenance and artifact similarity

Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 

Data provenance in Hopsworks

  • 1. Data Provenance in Hopsworks Alex Ormenisan PhD student @KTH Software Engineer @Logical Clocks Hops ML Stockholm September 12th 2019
  • 2. What is Data Lineage? A graph representation of: • Data history • Data evolution • Data transformations • Data dependencies Systems get bigger. Data pipelines get longer. Finding the root cause of data problems gets harder.
  • 3. Is Lineage relevant for ML? • The Machine Learning development process is complex. • It produces many artifacts. • The process is iterative. (Diagram: pipeline stages Data Prep, Feature Store, Train & Validate, Model Serving; artifacts Raw Datasets, Training Datasets, Experiments, Models.)
  • 4. Is Lineage relevant for ML? Problems: • What version of this Feature was used to create the training dataset I just used? • There was a bug introduced in a training dataset: what models are affected?
  • 5. Is Lineage relevant for ML? Lineage records the directed acyclic graph (DAG) of artifacts generated by the pipeline.
  • 7. Data discovery: Free text search for artifacts is not sufficient. Provide more context on artifacts: • What pipelines is it used in? • Most frequent users? • What features are based on it? • When was it last used? • Is there a newer version of it? …search:
  • 8. *Disrupting data discovery – May 15th 2019, SF-Big-Analytics, https://www.meetup.com/SF-Big-Analytics/events/260676161/
  • 9. Complex/repetitive pipelines: Analyse: • Compare current model accuracy with previous models. Debug: • Context in case of failure: what changed, root cause? Recompute: • Suggestion to rerun pipeline in case there are significant changes in input.
  • 10. Garbage collection: • Delete model that is no longer in use? • What related data can be deleted? • What related pipelines can be removed?
  • 11. Data Governance: • Traceability • Compliance
  • 13. Explicit vs. Implicit Provenance. Explicit: Top-down tracking of lineage. Requires changes to application or library code. Metadata Store decoupled from the platform. Implicit: Bottom-up tracking of lineage. Requires changing the platform. Naming conventions, link files to artifacts. Metadata strongly consistent with storage platform. (Diagram layers: Libraries, Application, Running platform, Metadata store.)
  • 14. Explicit vs. Implicit Provenance. Explicit (MLFlow, TFX): ● Wrap existing code in components that execute a stage in the pipeline. ● Interceptors in components inject metadata into the Metadata Store. Implicit (Hopsworks*): Hooks in the platform layers: ● Filesystem* ● Execution engine ● Job scheduling* ● Pipeline orchestration
  • 15. Pros and Cons. Explicit: (-) User needs to change code. (+) Lineage data is more exact: the API can be arbitrarily rich. (-) Consistency issues between Metadata and Storage/Scheduling layer. Implicit: (+) Platform developer does the work. (+) Lineage information is complete. All data accesses are logged, not just calls from one specific library. (+) Can detect concurrent access to the same artifacts. (+) Works even when artifacts are accessed across different ML frameworks. (-) Requires tagging to reach similar expressiveness to explicit provenance.
  • 16. Explicit provenance. MLFlow: ● Executions: mlflow.create_experiment(…) mlflow.set_experiment(…) mlflow.start_run(…) mlflow.end_run(…) ● Artifacts: mlflow.log_artifact(…) mlflow.sklearn.log_model(…) Hopsworks: ● Executions: hops.experiment.grid_search(…) hops.experiment.collective_allreduce(…) ● FeatureStore: hops.featurestore.create_training_dataset(…) ● Saving and deploying models: hops.serving.export(…) hops.serving.deploy(…)
  • 17. Explicit provenance - TFX. MLMD client libraries intercept and inject metadata. The MetadataStore contains information about ML artifacts and executions. Potential problem: Metadata and Storage layer need to be kept in sync. From TFX documentation website: https://www.tensorflow.org/tfx/guide/mlmd
  • 19. Hopsworks Implicit Lineage. Lineage in Hopsworks: • File as unit of data • Application as transformations • Jobs: an application is a run of a job • Pipelines: orchestrations of jobs. (Diagram layers: Pipeline Orchestration, Job Scheduler, Application, Execution engine, Libraries, Filesystem.)
  • 20. Filesystem Lineage. Track file operations: • create/delete/read/append. Metadata store based on HDFS Extended Attributes (XAttributes) operations: • add/update/remove. (Stack diagram: Airflow, Jobs, Yarn, Application, Spark, HopsUtilPy, HopsFS.)
  • 21. Scheduler Lineage. Application context: • Application states: start/finished/killed/failed • Provides a time interval as a context for lineage file operations. • X.509 certificates link applications and file system operations.
  • 22. Orchestration Lineage. Job context: • Provides a link between different job runs (applications). • Provides configuration context, including dependencies and its environment (conda). Pipeline context: • Provides a link between the pipeline job components.
  • 23. Library Lineage. Attach XAttributes to enrich the lineage information: • Grouping artifacts together, e.g. model type. • Artifact attributes (accuracy) • Finer grained links between input and output artifacts
  • 24. Hopsworks - Lineage. (Diagram: HopsFS, holding Raw Datasets, Training Datasets, Experiments and Models, emits a File Operations Log; ePipe, the ChangeDataCapture API, forwards it to Elasticsearch.) *ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
  • 25. Application context. Provides time as context. Provides a context for files accessed together. Detects concurrent usage of artifacts. (Timeline diagram: operations by App1 and App2 on File A, File B and File C between each application's start and stop.)
  • 26. Application context. Application Footprint. Application Impact. Input Files, Output Files, Temporary Files. (Diagram: App1, App2, App3.)
  • 27. Job Context and Tagging. Tagging with XAttributes for better expressiveness. You can: • Manually tag a set of files as being an ML artifact. • Query Lineage for similar artifacts and tag them also. • Follow platform level artifact location rules to automatically detect and tag them. • Use explicit provenance, with our hops ml library. (Diagram: App1 and App2 under Job1; similar artifacts.)
  • 28. Summary: Provenance is coming to Hopsworks. File provenance at filesystem level, with integrated metadata store. Application context collected from job scheduler. Job and Pipeline context collected from orchestration layers. Tagging support through the metadata store. Explicit provenance in hops libraries. (Diagram: App1, App2, App3; Job1, Job2; Pipeline1; Tagging; Explicit Provenance.)
  • 31. Explicit provenance. Hopsworks use case: Running two ML frameworks that support explicit provenance (Framework1 with MetadataStore1, Framework2 with MetadataStore2). ● Hopefully they allow pluggable Metadata Stores. ● Maybe there is one implementation that fits both. ● Most likely they have different models for tracking and querying lineage. ● Consistency issues on delete.
  • 32. Implicit provenance. MetadataStore in Storage Layer. Track provenance at file system level. Linking between files and ML artifacts: ● Direct tagging ● Automatic tagging by following location-based platform rules ● Augmented tagging support through provenance and artifact similarity
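
The application context described in slides 21, 25 and 26 can be illustrated with a small sketch. It is not Hopsworks code: the `FileOp` record, the classification rule (created and deleted by the same application means temporary) and the example paths are all assumptions made for illustration. It shows how a file operations log, combined with each application's start/stop interval, is enough to group files into input, output and temporary sets and to flag artifacts touched by applications whose runs overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOp:
    """One entry of a hypothetical file operations log."""
    app_id: str  # application that issued the operation
    path: str    # file the operation touched
    op: str      # "create", "read", "append" or "delete"
    ts: int      # logical timestamp


def footprint(ops, app_id):
    """Classify the files one application touched (assumed rule, see lead-in)."""
    kinds_by_path = defaultdict(set)
    for o in ops:
        if o.app_id == app_id:
            kinds_by_path[o.path].add(o.op)
    result = {"input": set(), "output": set(), "temporary": set()}
    for path, kinds in kinds_by_path.items():
        if "create" in kinds and "delete" in kinds:
            result["temporary"].add(path)   # created and removed by the same app
        elif kinds & {"create", "append"}:
            result["output"].add(path)      # written and left in place
        elif "read" in kinds:
            result["input"].add(path)       # only read
    return result


def concurrent_access(ops, intervals):
    """Return (path, app_a, app_b) for files shared by apps with overlapping runs."""
    apps_by_path = defaultdict(set)
    for o in ops:
        apps_by_path[o.path].add(o.app_id)
    hits = set()
    for path, apps in apps_by_path.items():
        ordered = sorted(apps)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                (a_start, a_stop), (b_start, b_stop) = intervals[a], intervals[b]
                if a_start <= b_stop and b_start <= a_stop:
                    hits.add((path, a, b))
    return hits


# Made-up example data: app1 prepares a training dataset, app2 trains on it.
ops = [
    FileOp("app1", "/raw/a.csv", "read", 1),
    FileOp("app1", "/tmp/part-0", "create", 2),
    FileOp("app1", "/train/td.tfrecord", "create", 3),
    FileOp("app1", "/tmp/part-0", "delete", 4),
    FileOp("app2", "/train/td.tfrecord", "read", 4),
    FileOp("app2", "/models/m1", "create", 6),
]
intervals = {"app1": (1, 5), "app2": (4, 7)}  # (start, stop) from the scheduler

app1_files = footprint(ops, "app1")
shared = concurrent_access(ops, intervals)
```

Because the intervals come from the scheduler rather than from library calls, the same check works regardless of which ML framework issued the file operations, which is the advantage slide 15 attributes to implicit provenance.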

Editor's notes

  1. Change Data Capture API for HopsFS used to inject metadata to a metadata store as data flows through the pipeline
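
As a rough illustration of the note above, the following sketch applies a stream of change events to an in-memory metadata index. The event shape (`op`, `path`, `key`, `value`) and the function names are hypothetical and do not reflect ePipe's actual format; the dictionary stands in for a search index such as Elasticsearch. The point it shows is that when deletes flow through the same change log as creates and XAttribute updates, the metadata cannot outlive the file it describes.

```python
def apply_event(index, event):
    """Apply one hypothetical change event to the metadata index (a dict)."""
    op, path = event["op"], event["path"]
    if op == "create":
        index[path] = {}
    elif op == "delete":
        index.pop(path, None)  # metadata disappears together with the file
    elif op in ("xattr_add", "xattr_update"):
        index.setdefault(path, {})[event["key"]] = event["value"]
    elif op == "xattr_remove":
        index.get(path, {}).pop(event["key"], None)
    return index


def find(index, key, value):
    """Return the paths whose attribute `key` equals `value`, sorted."""
    return sorted(p for p, attrs in index.items() if attrs.get(key) == value)


# Made-up event stream: two models are tagged, then one is deleted.
events = [
    {"op": "create", "path": "/models/m1"},
    {"op": "xattr_add", "path": "/models/m1", "key": "model_type", "value": "dnn"},
    {"op": "xattr_add", "path": "/models/m1", "key": "accuracy", "value": 0.9},
    {"op": "create", "path": "/models/m2"},
    {"op": "xattr_add", "path": "/models/m2", "key": "model_type", "value": "dnn"},
    {"op": "delete", "path": "/models/m2"},
]

index = {}
for event in events:
    apply_event(index, event)

dnn_models = find(index, "model_type", "dnn")
```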