Data lineage has gained popularity in the machine learning community as a way to make models and datasets easier to interpret, and to help developers debug their ML pipelines by letting them trace from a model back to the dataset and user that produced it. Data provenance, or lineage, is the process of building up the history of how a data artifact came to be. This history of derivations and interactions provides better context for data discovery, debugging, and auditing. Google and Databricks, among others, have taken early steps in this area.
In the Hopsworks approach presented here, provenance information is collected implicitly, through unobtrusive instrumentation of Jupyter notebooks and Python code: what we call 'implicit provenance'.
2. What is Data Lineage?
A graph representation of:
• Data history
• Data evolution
• Data transformations
• Data dependencies
Systems get bigger.
Data pipelines get longer.
Finding the root cause of data problems gets harder.
3. Is Lineage relevant for ML?
• The Machine Learning development process is complex.
• It produces many artifacts.
• The process is iterative.
[Diagram: pipeline stages (Data Prep, Feature Store, Train & Validate, Model Serving) and the artifacts they produce (Raw Datasets, Training Datasets, Experiments, Models)]
4. Problems
• What version of this Feature was used to create the training dataset I just used?
• There was a bug introduced in a training dataset – what models are affected?
5. Lineage records the directed acyclic graph (DAG) of artifacts generated by the pipeline.
7. Data discovery:
Free text search for artifacts is not sufficient.
Provide more context on artifacts:
• What pipelines is it used in?
• Most frequent users?
• What features are based on it?
• When was it last used?
• Is there a newer version of it?
8. *Disrupting data discovery – May 15th 2019, SF-Big-Analytics, https://www.meetup.com/SF-Big-Analytics/events/260676161/
9. Complex/repetitive pipelines:
Analyse:
• Compare current model accuracy with previous models.
Debug:
• Context in case of failure: what changed, what is the root cause?
Recompute:
• Suggest rerunning the pipeline when there are significant changes in the input.
10. Garbage collection:
• Delete a model that is no longer in use?
• What related data can be deleted?
• What related pipelines can be removed?
11. Data Governance:
• Traceability
• Compliance
13. Explicit vs. Implicit Provenance
Explicit:
Top-down tracking of lineage.
Requires changes to application or library code.
Metadata store decoupled from the platform.
[Diagram: stack of Libraries / Application / Running platform, with a separate Metadata store]
Implicit:
Bottom-up tracking of lineage.
Requires changing the platform.
Naming conventions link files to artifacts.
Metadata strongly consistent with the storage platform.
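The defining cost of the explicit, top-down style is that the user must edit pipeline code to declare lineage. A toy illustration (not MLFlow's or TFX's actual API; the decorator and store below are invented for this sketch):

```python
# Stand-in for a remote metadata store, decoupled from where the data lives.
records = []

def tracked(inputs, outputs):
    """Decorator the *user* must apply to each pipeline stage: this manual
    step is what makes the provenance 'explicit'."""
    def wrap(fn):
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            # Record the declared inputs/outputs after the stage runs.
            records.append({"stage": fn.__name__,
                            "inputs": inputs,
                            "outputs": outputs})
            return result
        return inner
    return wrap

@tracked(inputs=["raw/clicks"], outputs=["training/td_v2"])
def data_prep():
    pass  # the real transformation code would go here

data_prep()
print(records)
```

Any file access not routed through `@tracked` is invisible to the store: that completeness gap is what the implicit, platform-level approach closes.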
14. Explicit vs. Implicit Provenance
Explicit (MLFlow, TFX):
● Wrap existing code in components that execute a stage in the pipeline.
● Interceptors in components inject metadata into the Metadata Store.
Implicit (Hopsworks*):
Hooks in the platform layers:
● Filesystem*
● Execution engine
● Job scheduling*
● Pipeline orchestration
15. Pros and Cons
Explicit:
(-) User needs to change code.
(+) Lineage data is more exact: the API can be arbitrarily rich.
(-) Consistency issues between the Metadata and Storage/Scheduling layers.
Implicit:
(+) The platform developer does the work.
(+) Lineage information is complete: all data accesses are logged, not just calls from one specific library.
(+) Can detect concurrent access to the same artifacts.
(-) Requires tagging to reach expressiveness similar to explicit provenance.
(+) Works even when artifacts are accessed across different ML frameworks.
17. Explicit provenance - TFX
From TFX documentation website: https://www.tensorflow.org/tfx/guide/mlmd
MLMD client libraries intercept calls and inject metadata.
The MetadataStore contains information about ML
artifacts and executions.
Potential Problem
Metadata and Storage layer need to be kept in sync.
19. Hopsworks Implicit Lineage
Lineage in Hopsworks:
• File as unit of data
• Application as transformations
• Jobs – an application is a run of a job
• Pipelines – orchestrations of jobs
[Diagram: platform layers hooked for lineage: Pipeline Orchestration, Job Scheduler, Application, Execution Engine, Libraries, Filesystem]
20. Filesystem Lineage
Track file operations:
• create/delete/read/append
Metadata store based on HDFS Extended Attribute (XAttribute) operations:
• add/update/remove
[Diagram: Airflow orchestrates Jobs; a Yarn Application runs Spark with HopsUtilPy on top of HopsFS]
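The implicit idea at this layer can be sketched without assuming anything about HopsFS internals: the (toy) filesystem itself records every operation, so applications need no instrumentation at all. Class and field names below are invented for illustration:

```python
import time

class TracedFS:
    """Toy filesystem that records an operations log, analogous to HopsFS
    logging create/delete/read/append plus XAttribute add/update/remove."""
    def __init__(self):
        self.files = {}    # path -> contents
        self.xattrs = {}   # path -> {attr name: attr value}
        self.ops_log = []  # the file operations log the metadata store consumes

    def _log(self, op, path, app_id):
        self.ops_log.append({"op": op, "path": path,
                             "app": app_id, "ts": time.time()})

    def create(self, path, data, app_id):
        self.files[path] = data
        self._log("create", path, app_id)

    def read(self, path, app_id):
        self._log("read", path, app_id)
        return self.files[path]

    def set_xattr(self, path, name, value, app_id):
        # XAttribute writes are themselves logged lineage events.
        self.xattrs.setdefault(path, {})[name] = value
        self._log("xattr_add", path, app_id)

fs = TracedFS()
fs.create("training/td_v2", b"...", app_id="app_1")
fs.read("training/td_v2", app_id="app_2")
fs.set_xattr("training/td_v2", "artifact_type", "training_dataset", app_id="app_2")
print([op["op"] for op in fs.ops_log])  # ['create', 'read', 'xattr_add']
```

Because the log is written by the storage layer, metadata and data cannot drift apart: the consistency issue flagged for explicit provenance does not arise.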
21. Scheduler Lineage
Application context:
• Application states: start/finished/killed/failed.
• Provides a time interval as a context for lineage file operations.
• X.509 certificates link applications and filesystem operations.
22. Orchestration Lineage
Job context:
• Provides a link between different job runs (applications).
• Provides configuration context, including dependencies and the job's environment (conda).
Pipeline context:
• Provides a link between the pipeline's job components.
23. Library Lineage
Attach XAttributes to enrich the lineage information:
• Grouping artifacts together, e.g. by model type.
• Artifact attributes (e.g. accuracy).
• Finer-grained links between input and output artifacts.
24. Hopsworks - Lineage
[Diagram: HopsFS emits a file operations log; ePipe, a Change Data Capture API, replays it into Elasticsearch, indexing Raw Datasets, Training Datasets, Experiments, and Models]
*ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
25. Application context
[Diagram: operations on Files A, B, and C over time, overlaid with the start/stop intervals of App1 and App2]
Provides time as context.
Provides a context for files accessed together.
Detects concurrent usage of artifacts.
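Given the scheduler's start/stop intervals and the filesystem's operations log, concurrent usage falls out of an interval-overlap check. A sketch with made-up data:

```python
# Application intervals, as reported by the job scheduler: app_id -> (start, stop).
apps = {
    "App1": (0, 10),
    "App2": (5, 20),
}
# File operations, as recorded by the filesystem log: (timestamp, app_id, path).
ops = [
    (2, "App1", "File A"),
    (7, "App1", "File B"),
    (8, "App2", "File B"),   # both applications are active at this point
    (15, "App2", "File C"),
]

def concurrent_accesses(apps, ops):
    """Files touched by two different apps whose run intervals overlap."""
    hits = set()
    for t1, a1, f1 in ops:
        for t2, a2, f2 in ops:
            if f1 == f2 and a1 != a2:
                s1, e1 = apps[a1]
                s2, e2 = apps[a2]
                if s1 <= e2 and s2 <= e1:  # the two intervals overlap
                    hits.add(f1)
    return hits

print(concurrent_accesses(apps, ops))  # {'File B'}
```

File A and File C are each touched by a single application, so only File B is flagged; this is the kind of signal implicit provenance gets for free from the platform.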
27. Job Context and Tagging
Tagging with XAttributes for better expressiveness. You can:
• Manually tag a set of files as being an ML artifact.
• Query lineage for similar artifacts and tag them as well.
• Follow platform-level artifact location rules to automatically detect and tag them.
• Use explicit provenance, with our hops ml library.
[Diagram: a job's applications, App1 and App2, producing similar artifacts]
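The "tag one artifact, then propagate to similar ones" workflow can be sketched as below. This is illustrative only, not the Hopsworks tagging API; paths and attribute names are invented:

```python
# Toy XAttribute store: path -> {attribute name: attribute value}.
xattrs = {
    "models/ranker_v7/model.pb": {"artifact": "model", "framework": "tf"},
    "models/ranker_v8/model.pb": {"framework": "tf"},
    "notes/readme.txt": {},
}

def tag(path, key, value):
    """Attach an XAttribute to a file."""
    xattrs.setdefault(path, {})[key] = value

def similar(path, key="framework"):
    """Other files sharing the given attribute value with `path`."""
    want = xattrs.get(path, {}).get(key)
    return [p for p, attrs in xattrs.items()
            if p != path and want is not None and attrs.get(key) == want]

# One file was manually tagged as a model; propagate that tag to
# similar files found through the metadata store.
for p in similar("models/ranker_v7/model.pb"):
    tag(p, "artifact", "model")

print(xattrs["models/ranker_v8/model.pb"])  # {'framework': 'tf', 'artifact': 'model'}
```

In Hopsworks the query over XAttributes would go through the metadata store's search index rather than an in-memory dict, but the tag-then-propagate pattern is the same.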
28. Summary
[Diagram: Pipeline1 orchestrates Job1 and Job2, whose runs are applications App1, App2, App3; tagging and explicit provenance annotate the graph]
Provenance is coming to Hopsworks:
• File provenance at the filesystem level, with an integrated metadata store.
• Application context collected from the job scheduler.
• Job and pipeline context collected from the orchestration layers.
• Tagging support through the metadata store.
• Explicit provenance in the hops libraries.
31. Explicit provenance
[Diagram: Framework1 with MetadataStore1, Framework2 with MetadataStore2]
Hopsworks use case: running two ML frameworks that each support explicit provenance.
● Hopefully they allow pluggable Metadata Stores.
● Maybe there is one implementation that fits both.
● Most likely they have different models for tracking and querying lineage.
● Consistency issues on delete.
32. Implicit provenance
[Diagram: the MetadataStore sits in the storage layer]
Track provenance at the filesystem level. Linking between files and ML artifacts:
● Direct tagging
● Automatic tagging by following location-based platform rules
● Augmented tagging support through provenance and artifact similarity
Editor's notes
Change Data Capture API for HopsFS, used to inject metadata into a metadata store as data flows through the pipeline.