Apache Beam is a key technology for building scalable end-to-end ML pipelines: it is the data preparation and model analysis engine for TensorFlow Extended (TFX), a framework for horizontally scalable Machine Learning (ML) pipelines based on TensorFlow. In this talk, we present TFX on Hopsworks, a fully open-source platform for running TFX pipelines on any cloud or on-premises. Hopsworks is a project-based, multi-tenant platform for both data-parallel programming and horizontally scalable machine learning pipelines. Hopsworks supports Apache Flink as a runner for Beam jobs, and TFX pipelines are orchestrated with Airflow, which Hopsworks supports. We will demonstrate how to build an ML pipeline with TFX, Beam’s Python API, and the Flink Runner using Jupyter notebooks, explain how security is transparently enabled with short-lived TLS certificates, and walk through all the pipeline steps, from Data Validation to Transformation, Model Training with TensorFlow, Model Analysis, Model Serving, and Monitoring with Kubernetes.
To the best of our knowledge, Hopsworks is the first fully open-source on-premise platform that supports both TFX pipelines and Apache Beam.
2. BERLIN 2019
1. End-to-end ML pipelines
2. What is Hopsworks
3. Beam Portable Runner with Flink in Hopsworks
4. ML Pipelines with Beam and TensorFlow Extended
5. Demo
7.
“this one paper could repay your investment” … “HopsFS is a huge win.”
[Timeline slide, milestones 2017–2019:]
● World’s fastest Hadoop, published at USENIX FAST with Oracle and Spotify
● Winner of IEEE Scale Challenge 2017 with HopsFS (1.2M ops/sec)
● World’s first Hadoop platform to support GPUs-as-a-Resource
● World’s first Open Source Feature Store for Machine Learning
● World’s first distributed filesystem to store small files in metadata on NVMe disks
● World’s most scalable filesystem with Multi Data Center Availability
● World’s first open-source platform to support TensorFlow Extended (TFX) on Beam
9.
● Manage Hopsworks resources via the REST API
○ Projects
○ Datasets
○ Jobs
○ Users
○ FeatureStore
○ Kafka
○ ..
● Documented with Swagger and hosted on SwaggerHub
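As a sketch of how a client might address these resources (the path layout and host below are assumptions for illustration, not the documented Hopsworks API; see the Swagger docs for the real endpoints), resources hang off a project-scoped base URL:

```python
from urllib.parse import urljoin

# Hypothetical sketch: build the REST URL for a resource inside a project.
# The "hopsworks-api/api/project/..." layout is an assumption.
def resource_url(base, project_id, resource):
    """Return the URL for a project-scoped resource (jobs, datasets, ...)."""
    return urljoin(base, f"hopsworks-api/api/project/{project_id}/{resource}")

url = resource_url("https://hopsworks.example.com/", 42, "jobs")
# url == "https://hopsworks.example.com/hopsworks-api/api/project/42/jobs"
```

The same pattern would apply to the other resources listed above (datasets, users, featurestore, kafka).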
11.
[Diagram: the Beam portability architecture. Pipelines are built with the Beam Java SDK, the Beam Python SDK, or other language SDKs (Beam Model: Pipeline Construction), and executed through Fn Runners (Beam Model: Fn Runners) on Apache Flink, Apache Spark, or Cloud Dataflow.]
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
https://s.apache.org/apache-beam-project-overview
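To make the Beam model concrete without a Beam dependency, the two core primitives every runner has to execute, ParDo and GroupByKey, can be sketched in plain Python (a minimal word count; illustrative only, not Beam’s implementation):

```python
from collections import defaultdict

def par_do(pcollection, fn):
    """ParDo/Map: apply fn to every element, flattening the results."""
    out = []
    for element in pcollection:
        out.extend(fn(element))
    return out

def group_by_key(pcollection):
    """GroupByKey: gather all values that share a key."""
    groups = defaultdict(list)
    for key, value in pcollection:
        groups[key].append(value)
    return dict(groups)

words = par_do(["to be or not to be"], lambda line: line.split())
counts = {k: len(v) for k, v in group_by_key((w, 1) for w in words).items()}
# counts["to"] == 2, counts["be"] == 2
```

A runner’s job is to execute exactly these primitives, but partitioned and distributed across workers.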
12.
● Develop Beam pipelines in Python from Jupyter notebooks
● Tooling to simplify deployment and execution
● Manage lifecycle of Job Service
● SDK Workers (harness) with conda env
● Scalable execution on Flink clusters
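A minimal sketch of the pipeline options such a notebook passes to Beam’s portable runner (the Job Service endpoint and default environment type here are illustrative assumptions; in Hopsworks, hops-util-py fills these in):

```python
def portable_runner_args(job_endpoint="localhost:8099",
                         environment_type="DOCKER"):
    """argv-style options for beam.Pipeline(options=PipelineOptions(...)).

    job_endpoint points at the Beam Job Service started for the Flink
    session; environment_type selects how SDK workers (the harness) run,
    e.g. DOCKER or PROCESS. Values here are illustrative defaults.
    """
    return [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        f"--environment_type={environment_type}",
    ]
```

With Beam 2.13, these flags route the pipeline through the Job Service to the Flink cluster instead of the local DirectRunner.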
14.
● hops-util-py (Python) and HopsUtil (Java)
● Simplify development by:
○ Setting security config
○ Discovering cluster services
○ Helper methods for the Hopsworks REST API
○ ML Experiments
● Manage Beam Job Service
def start_beam_job_service(
        flink_session_name,
        artifacts_dir="Resources",
        job_server_path="hdfs:///user/flink/",
        job_server_jar="beam-runners-flink-1.8-job-server-2.13.0.jar",
        sdk_worker_parallelism=1)
https://github.com/logicalclocks/hops-util-py/ https://github.com/logicalclocks/hops-util
15.
● SDK Worker (Harness): SDK-provided program responsible for executing user code
● How to manage the user’s dependencies, libraries, …?
● Docker:
○ Build an image with all your dependencies
○ Update or modify? Build new containers
○ Additional infrastructure components
● Process:
○ Install dependencies on all servers
○ Management of dependencies?
○ Easy to update and modify libraries
○ Challenge? Multi-tenancy & keeping servers in sync
18.
● Execute a Beam Python pipeline
○ With the Python kernel either in a Docker container managed by Kubernetes or as a local Python process.
○ In a PySpark executor in the cluster.
25.
● Flink JobManager and TaskManager
● Beam Job service
○ Local mode - logs in project’s Jupyter staging dir
○ Cluster mode - logs in the PySpark container where the process is running.
● SDK Worker
○ Logs are in the Flink TaskManager container
● Collect and visualize with the ELK stack
○ Logs are accessible only by project members
40.
[Pipeline diagram: Raw Data / Event Data → Ingest → Data Prep (Feature Store / TFX Transform) → Experiment / Train → Model Analysis → Deploy → Serving → Monitor, with serving logs fed back as event data, an external FeatureStore, and a Metadata Store shared across the stages.]
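The stage ordering implied by the diagram can be sketched as a small dependency graph and checked with a topological sort (the edge list below is my reading of the figure, not an artifact of the platform):

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (illustrative only).
deps = {
    "Ingest":           {"Raw/Event Data"},
    "Data Prep":        {"Ingest"},            # Feature Store / TFX Transform
    "Experiment/Train": {"Data Prep"},
    "Model Analysis":   {"Experiment/Train"},
    "Deploy":           {"Model Analysis"},
    "Serving":          {"Deploy"},
    "Monitor":          {"Serving"},
}

# A valid execution order: ingestion first, monitoring last.
order = list(TopologicalSorter(deps).static_order())
```

The Metadata Store sits outside this ordering: every stage reads from and writes to it.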
41.
● Beam 2.13.0
● Flink 1.8.0
● TensorFlow 1.14.0
● TFX 0.14.0dev
● TensorFlow Model Analysis 0.13.2
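Pinned as a requirements file, the Python side of this stack might look as follows (a sketch; TFX 0.14.0dev was a pre-release and would need a dev build rather than a PyPI pin, and Flink 1.8.0 is a cluster dependency, not a pip package):

```
apache-beam==2.13.0
tensorflow==1.14.0
tensorflow-model-analysis==0.13.2
```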
43.
● Summary
○ Hopsworks v1.0 is the first on-prem, open-source, horizontally scalable platform to support the Beam Portable Runner with the Flink runner
○ Develop and manage the lifecycle of horizontally scalable End-to-End ML Pipelines with Beam and TFX
● Future Work
○ Add support for Spark Runner
○ Export metrics to InfluxDB and monitor them with Grafana