This document introduces Apache Beam, a unified model for batch and stream processing, and discusses its portability across languages and execution backends. It also introduces TFX (TensorFlow Extended), a platform for building end-to-end machine learning pipelines that addresses data ingestion, preprocessing, analysis, serving, and monitoring using components such as TensorFlow Transform and TensorFlow Model Analysis. A demo of TFX's model analysis capabilities on a Chicago taxi dataset is provided.
4. What is Apache Beam?
- A unified batch and stream distributed processing API
- A set of SDK frontends: Java, Python, Go, Scala, SQL, …
- A set of Runners that execute Beam jobs on various backends: Local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
5. Beam Vision
Provide a comprehensive portability framework for data processing pipelines, one that allows you to write your pipeline once in your language of choice and run it with minimal effort on the execution engine of choice.
6. The Beam Model: A Simple Pipeline
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
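The semantics of `Sum.integersPerKey()` can be sketched in plain Python without a Beam dependency. This is an illustrative model of what the transform computes over a PCollection of key/value pairs, not the Beam API itself (the function name below is ours):

```python
from collections import defaultdict

def sum_integers_per_key(pairs):
    """Sum the integer values for each key, mimicking the semantics of
    Beam's Sum.integersPerKey() over (key, value) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Example: per-team game scores.
scores = sum_integers_per_key([("red", 3), ("blue", 5), ("red", 4)])
print(scores)  # {'red': 7, 'blue': 5}
```

In Beam, the same grouping-and-combining happens in parallel across workers; the sketch only shows the per-key aggregation logic.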
7. The Beam Model: Asking the Right Questions
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()
                 .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                 .withLateFirings(AtCount(1)))
               .accumulatingFiredPanes())
  .apply(Sum.integersPerKey());
What, Where, When, How
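The "Where in event time" part of the model (fixed windows) can be sketched in plain Python: each record is assigned to a window based on its event timestamp, and sums are kept per key and window. Triggers and accumulation (the "When" and "How") are omitted here; the function name and record layout below are our own, not Beam API:

```python
from collections import defaultdict

def fixed_windows_sum(events, window_size):
    """Assign each (key, value, event_time) record to a fixed window of
    `window_size` seconds and sum values per (key, window_start)."""
    totals = defaultdict(int)
    for key, value, event_time in events:
        window_start = (event_time // window_size) * window_size
        totals[(key, window_start)] += value
    return dict(totals)

# Three events; the first two fall in window [0, 120), the third in [120, 240).
events = [("team", 5, 10), ("team", 7, 70), ("team", 3, 130)]
print(fixed_windows_sum(events, 120))  # {('team', 0): 12, ('team', 120): 3}
```

Note that windowing is driven by each record's event time, not by when the record arrives, which is what lets Beam give the same answer for batch and streaming inputs.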
8. The Beam Model: Consistent across languages
scores = (input
    | WindowInto(FixedWindows(120),
                 trigger=AfterWatermark(
                     early=AfterProcessingTime(60),
                     late=AfterCount(1)),
                 accumulation_mode=ACCUMULATING)
    | CombinePerKey(sum))
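The accumulation mode in the snippet above (the "How" question) determines what each trigger firing emits. The difference between accumulating and discarding panes can be sketched for a single window in plain Python (a toy model, not the Beam API):

```python
def fire_panes(values_in_order, fire_points, accumulating=True):
    """Illustrate accumulating vs. discarding fired panes for one window.
    `fire_points` are indexes into the value stream at which the trigger
    fires; each firing emits the sum of all values so far (accumulating)
    or only the values since the last firing (discarding)."""
    panes = []
    last_fired = 0
    for point in fire_points:
        if accumulating:
            panes.append(sum(values_in_order[:point]))
        else:
            panes.append(sum(values_in_order[last_fired:point]))
            last_fired = point
    return panes

values = [3, 4, 3]
print(fire_panes(values, [1, 2, 3], accumulating=True))   # [3, 7, 10]
print(fire_panes(values, [1, 2, 3], accumulating=False))  # [3, 4, 3]
```

With accumulating panes, each refinement includes everything emitted before, so downstream consumers can simply overwrite the previous result; with discarding panes they must add the deltas themselves.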
12. Realizing the Beam Vision
- Data processing frameworks usually support only a single language (e.g. Java)
- Portability in Beam means pipelines can be written and executed in any supported SDK (Java/Python/Go)
- Pipelines also contain language-specific code (e.g. map/reduce functions)
- Libraries of the language can be used (!)
[Diagram: the Beam Java, Beam Python, and Beam Go SDKs feed Pipeline Construction (Runner API); execution happens over the Fn API on backends such as Apache Flink, Apache Spark, and Cloud Dataflow.]
13. Without Portability
[Diagram: a language-specific SDK and Runner submit work to a backend (e.g. Flink) running Task 1 … Task n; all components are tied to a single language.]
14. With Portability
[Diagram: a language-specific SDK submits a portable job via the Job API to a language-agnostic Job Server and Runner (Runner API); on the backend (e.g. Flink, Task 1 … Task n), language-specific SDK Harnesses execute user code over the Fn API.]
18. TFX: TensorFlow Extended
Figure 1: High-level component overview of a machine learning platform.
[Diagram components: Data Ingestion; Data Analysis + Validation; TensorFlow Transform; Trainer; Tuner; TensorFlow Model Analysis; TensorFlow Serving; Logging. Cross-cutting layers: Shared Utilities for Garbage Collection and Data Access Controls; Pipeline Storage; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization.]
19. TFX: TensorFlow Extended
- TensorFlow Transform
  - preprocess/transform model training data
  - feature extraction and normalization
  - statistics and validation
  - avoid training/serving skew by applying the same transformations at training and serving time
- TensorFlow Model Analysis
  - analyze performance of TensorFlow models
  - slice and dice metrics across population subsets
  - discover and investigate bias (see ml-fairness.com)
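The slice-and-dice idea behind TensorFlow Model Analysis can be sketched in plain Python: compute a metric (here, accuracy) separately for each population subset defined by a feature. This is a toy illustration of the concept, not the TFMA API; the function name and record layout are our own:

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """Compute accuracy per population subset. `examples` are dicts with
    'label', 'prediction', and arbitrary feature columns; `slice_key`
    names the feature to slice on."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        s = ex[slice_key]
        total[s] += 1
        correct[s] += int(ex["label"] == ex["prediction"])
    return {s: correct[s] / total[s] for s in total}

examples = [
    {"label": 1, "prediction": 1, "hour": "am"},
    {"label": 0, "prediction": 1, "hour": "am"},
    {"label": 1, "prediction": 1, "hour": "pm"},
]
print(accuracy_by_slice(examples, "hour"))  # {'am': 0.5, 'pm': 1.0}
```

An overall accuracy of 2/3 here hides the fact that the "am" slice performs much worse than the "pm" slice; surfacing such gaps per subset is exactly what slicing metrics is for.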