Data lineage has gained popularity in the machine learning community as a way to make models and datasets easier to interpret, and to help developers debug their ML pipelines by letting them trace from a model back to the dataset and user that produced it. Data provenance, or lineage, is the process of building up the history of how a data artifact came to be. This history of derivations and interactions provides better context for data discovery, debugging, and auditing. Google and Databricks, among others, have taken early steps in this area.
In the Hopsworks approach presented here, provenance information is collected implicitly, through unobtrusive instrumentation of Jupyter notebooks and Python code: what we call 'implicit provenance'.
2. What is Data Lineage?
A graph representation of:
• Data history
• Data evolution
• Data transformations
• Data dependencies
Systems get bigger.
Data pipelines get longer.
Finding the root cause of data problems gets harder.
3. Is Lineage relevant for ML?
• The Machine Learning development process is complex.
• It produces many artifacts.
• The process is iterative.
[Diagram: pipeline stages (Data Prep, Feature Store, Train & Validate, Model Serving) and the artifacts they produce (Raw Datasets, Training Datasets, Experiments, Models)]
4. Problems
• What version of this Feature was used to create the training dataset I just used?
• There was a bug introduced in a training dataset – what models are affected?
5. Lineage records the directed acyclic graph (DAG) of artifacts generated by the pipeline.
7. Data discovery:
Free text search for artifacts is not sufficient.
Provide more context on artifacts:
• What pipelines is it used in?
• Most frequent users?
• What features are based on it?
• When was it last used?
• Is there a newer version of it?
8. *Disrupting data discovery – May 15th 2019, SF-Big-Analytics, https://www.meetup.com/SF-Big-Analytics/events/260676161/
9. Complex/repetitive pipelines:
Analyse:
• Compare current model accuracy with previous models.
Debug:
• Context in case of failure: what changed, what is the root cause?
Recompute:
• Suggest rerunning the pipeline when there are significant changes in the input.
10. Garbage collection:
• Delete a model that is no longer in use?
• What related data can be deleted?
• What related pipelines can be removed?
11. Data Governance:
• Traceability
• Compliance
13. Explicit vs. Implicit Provenance
Explicit:
Top-down tracking of lineage.
Requires changes to application or library code.
Metadata store decoupled from the platform.
[Diagram: stack of Libraries / Application / Running platform, with a separate Metadata store]
Implicit:
Bottom-up tracking of lineage.
Requires changing the platform.
Naming conventions link files to artifacts.
Metadata strongly consistent with the storage platform.
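The defining cost of the explicit, top-down style is that the user must edit pipeline code to declare lineage. A toy illustration (not MLFlow's or TFX's actual API; the decorator and store below are invented for this sketch):

```python
# Stand-in for a remote metadata store, decoupled from where the data lives.
records = []

def tracked(inputs, outputs):
    """Decorator the *user* must apply to each pipeline stage: this manual
    step is what makes the provenance 'explicit'."""
    def wrap(fn):
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            # Record the declared inputs/outputs after the stage runs.
            records.append({"stage": fn.__name__,
                            "inputs": inputs,
                            "outputs": outputs})
            return result
        return inner
    return wrap

@tracked(inputs=["raw/clicks"], outputs=["training/td_v2"])
def data_prep():
    pass  # the real transformation code would go here

data_prep()
print(records)
```

Any file access not routed through `@tracked` is invisible to the store: that completeness gap is what the implicit, platform-level approach closes.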
14. Explicit vs. Implicit Provenance
Explicit (MLFlow, TFX):
● Wrap existing code in components that execute a stage in the pipeline.
● Interceptors in components inject metadata into the Metadata Store.
Implicit (Hopsworks*):
Hooks in the platform layers:
● Filesystem*
● Execution engine
● Job scheduling*
● Pipeline orchestration
15. Pros and Cons
Explicit:
(-) User needs to change code.
(+) Lineage data is more exact: the API can be arbitrarily rich.
(-) Consistency issues between the Metadata and Storage/Scheduling layers.
Implicit:
(+) The platform developer does the work.
(+) Lineage information is complete: all data accesses are logged, not just calls from one specific library.
(+) Can detect concurrent access to the same artifacts.
(-) Requires tagging to reach expressiveness similar to explicit provenance.
(+) Works even when artifacts are accessed across different ML frameworks.
17. Explicit provenance - TFX
From TFX documentation website: https://www.tensorflow.org/tfx/guide/mlmd
MLMD client libraries intercept calls and inject metadata.
The MetadataStore contains information about ML
artifacts and executions.
Potential Problem
Metadata and Storage layer need to be kept in sync.
19. Hopsworks Implicit Lineage
Lineage in Hopsworks:
• File as unit of data
• Application as transformations
• Jobs – an application is a run of a job
• Pipelines – orchestrations of jobs
[Diagram: platform layers hooked for lineage: Pipeline Orchestration, Job Scheduler, Application, Execution Engine, Libraries, Filesystem]
20. Filesystem Lineage
Track file operations:
• create/delete/read/append
Metadata store based on HDFS Extended Attribute (XAttribute) operations:
• add/update/remove
[Diagram: Airflow orchestrates Jobs; a Yarn Application runs Spark with HopsUtilPy on top of HopsFS]
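The implicit idea at this layer can be sketched without assuming anything about HopsFS internals: the (toy) filesystem itself records every operation, so applications need no instrumentation at all. Class and field names below are invented for illustration:

```python
import time

class TracedFS:
    """Toy filesystem that records an operations log, analogous to HopsFS
    logging create/delete/read/append plus XAttribute add/update/remove."""
    def __init__(self):
        self.files = {}    # path -> contents
        self.xattrs = {}   # path -> {attr name: attr value}
        self.ops_log = []  # the file operations log the metadata store consumes

    def _log(self, op, path, app_id):
        self.ops_log.append({"op": op, "path": path,
                             "app": app_id, "ts": time.time()})

    def create(self, path, data, app_id):
        self.files[path] = data
        self._log("create", path, app_id)

    def read(self, path, app_id):
        self._log("read", path, app_id)
        return self.files[path]

    def set_xattr(self, path, name, value, app_id):
        # XAttribute writes are themselves logged lineage events.
        self.xattrs.setdefault(path, {})[name] = value
        self._log("xattr_add", path, app_id)

fs = TracedFS()
fs.create("training/td_v2", b"...", app_id="app_1")
fs.read("training/td_v2", app_id="app_2")
fs.set_xattr("training/td_v2", "artifact_type", "training_dataset", app_id="app_2")
print([op["op"] for op in fs.ops_log])  # ['create', 'read', 'xattr_add']
```

Because the log is written by the storage layer, metadata and data cannot drift apart: the consistency issue flagged for explicit provenance does not arise.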
21. Scheduler Lineage
Application context:
• Application states: start/finished/killed/failed.
• Provides a time interval as a context for lineage file operations.
• X.509 certificates link applications and filesystem operations.
22. Orchestration Lineage
Job context:
• Provides a link between different job runs (applications).
• Provides configuration context, including dependencies and the job's environment (conda).
Pipeline context:
• Provides a link between the pipeline's job components.
23. Library Lineage
Attach XAttributes to enrich the lineage information:
• Grouping artifacts together, e.g. by model type.
• Artifact attributes (e.g. accuracy).
• Finer-grained links between input and output artifacts.
24. Hopsworks - Lineage
[Diagram: HopsFS emits a file operations log; ePipe, a Change Data Capture API, replays it into Elasticsearch, indexing Raw Datasets, Training Datasets, Experiments, and Models]
*ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
25. Application context
[Diagram: operations on Files A, B, and C over time, overlaid with the start/stop intervals of App1 and App2]
Provides time as context.
Provides a context for files accessed together.
Detects concurrent usage of artifacts.
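Given the scheduler's start/stop intervals and the filesystem's operations log, concurrent usage falls out of an interval-overlap check. A sketch with made-up data:

```python
# Application intervals, as reported by the job scheduler: app_id -> (start, stop).
apps = {
    "App1": (0, 10),
    "App2": (5, 20),
}
# File operations, as recorded by the filesystem log: (timestamp, app_id, path).
ops = [
    (2, "App1", "File A"),
    (7, "App1", "File B"),
    (8, "App2", "File B"),   # both applications are active at this point
    (15, "App2", "File C"),
]

def concurrent_accesses(apps, ops):
    """Files touched by two different apps whose run intervals overlap."""
    hits = set()
    for t1, a1, f1 in ops:
        for t2, a2, f2 in ops:
            if f1 == f2 and a1 != a2:
                s1, e1 = apps[a1]
                s2, e2 = apps[a2]
                if s1 <= e2 and s2 <= e1:  # the two intervals overlap
                    hits.add(f1)
    return hits

print(concurrent_accesses(apps, ops))  # {'File B'}
```

File A and File C are each touched by a single application, so only File B is flagged; this is the kind of signal implicit provenance gets for free from the platform.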
27. Job Context and Tagging
Tagging with XAttributes for better expressiveness. You can:
• Manually tag a set of files as being an ML artifact.
• Query lineage for similar artifacts and tag them as well.
• Follow platform-level artifact location rules to automatically detect and tag them.
• Use explicit provenance, with our hops ml library.
[Diagram: a job's applications, App1 and App2, producing similar artifacts]
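The "tag one artifact, then propagate to similar ones" workflow can be sketched as below. This is illustrative only, not the Hopsworks tagging API; paths and attribute names are invented:

```python
# Toy XAttribute store: path -> {attribute name: attribute value}.
xattrs = {
    "models/ranker_v7/model.pb": {"artifact": "model", "framework": "tf"},
    "models/ranker_v8/model.pb": {"framework": "tf"},
    "notes/readme.txt": {},
}

def tag(path, key, value):
    """Attach an XAttribute to a file."""
    xattrs.setdefault(path, {})[key] = value

def similar(path, key="framework"):
    """Other files sharing the given attribute value with `path`."""
    want = xattrs.get(path, {}).get(key)
    return [p for p, attrs in xattrs.items()
            if p != path and want is not None and attrs.get(key) == want]

# One file was manually tagged as a model; propagate that tag to
# similar files found through the metadata store.
for p in similar("models/ranker_v7/model.pb"):
    tag(p, "artifact", "model")

print(xattrs["models/ranker_v8/model.pb"])  # {'framework': 'tf', 'artifact': 'model'}
```

In Hopsworks the query over XAttributes would go through the metadata store's search index rather than an in-memory dict, but the tag-then-propagate pattern is the same.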
28. Summary
[Diagram: Pipeline1 orchestrates Job1 and Job2, whose runs are applications App1, App2, App3; tagging and explicit provenance annotate the graph]
Provenance is coming to Hopsworks:
• File provenance at the filesystem level, with an integrated metadata store.
• Application context collected from the job scheduler.
• Job and pipeline context collected from the orchestration layers.
• Tagging support through the metadata store.
• Explicit provenance in the hops libraries.
31. Explicit provenance
[Diagram: Framework1 with MetadataStore1, Framework2 with MetadataStore2]
Hopsworks use case: running two ML frameworks that each support explicit provenance.
● Hopefully they allow pluggable Metadata Stores.
● Maybe there is one implementation that fits both.
● Most likely they have different models for tracking and querying lineage.
● Consistency issues on delete.
32. Implicit provenance
[Diagram: the MetadataStore sits in the storage layer]
Track provenance at the filesystem level. Linking between files and ML artifacts:
● Direct tagging
● Automatic tagging by following location-based platform rules
● Augmented tagging support through provenance and artifact similarity
Editor's notes
Change Data Capture API for HopsFS, used to inject metadata into a metadata store as data flows through the pipeline.