Building unified Batch and Stream processing pipeline with Apache Beam
Oleksandr Saienko, Senior Software Engineer, PhD
What is Stream Data?
Unbounded data:
– A conceptually infinite set of data items / events
Unbounded data processing:
– A practically continuous stream of data that needs to be processed / analyzed
Low-latency, approximate, and/or speculative results:
– These types of results are most often associated with streaming engines
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Flickr Image: Binary Flow by Adrenalin
Streaming data sources
• The Internet of Things (IoT)
– Real-time sensor data collection, analysis & alerts
• Autonomous driving
– 1 GB of data per minute per car (all sensors)
• Traffic monitoring
– High event rates: millions of events / sec
– High query rates: thousands of queries / sec
• Pre-processing of sensor data
– CERN experiments generate ~1 PB of measurements per second.
– Infeasible to store or process directly; fast preprocessing is a must.
…
https://www.cohere-technologies.com/technology/overview/
Big Data vs Fast Data vs Big Fast Data
https://www.scads.de/images/Events/3rdSummerSchool/Talks/TRabl_StreamProcessing.pdf
Image by: Peter Pietzuch (axis label: Latency)
8 Requirements of Stream Processing
• Keep the data moving
• Declarative access
• Handle imperfections
• Predictable outcomes
• Integrate stored and streaming data
• Data safety and availability
• Automatic partitioning and scaling
• Instantaneous processing and response
http://cs.brown.edu/~ugur/8rulesSigRec.pdf
The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
Big Data Landscape 2018
http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png
Apache Streaming Technologies
https://databaseline.bitbucket.io/an-overview-of-apache-streaming-technologies/
Customer requirements:
• Unified solution that can be deployed in the cloud and on-premise (without major changes)
• Cloud agnostic: can run on GCP, AWS, Azure, etc.
• Can work in both batch and streaming mode
• Easy to find developers
• Easy to maintain
Typical solution:
• Build two (or more) stacks – one for batch, one for streaming
• Build two (or more) solutions – one for cloud (using cloud managed services), one for on-premise
Drawbacks:
• Extremely painful to maintain two different stacks
• Different programming models and languages
• Multiple implementation efforts
• Multiple operational efforts
• …
Distributed Streaming Processing APIs…
What is Apache Beam?
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
https://beam.apache.org/get-started/beam-overview/
Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.
https://beam.apache.org/get-started/beam-overview/
The Apache Beam Vision
The Beam abstraction Model:
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
https://beam.apache.org/get-started/beam-overview/
The Apache Beam Vision
● Multiple runners:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Samza
○ Apache Gearpump
● Programming languages:
○ Java
○ Python
○ Go
○ Scala* (Scio)
https://beam.apache.org/get-started/beam-overview/
Beam currently supports the following language-specific SDKs: Java, Python, Go
A Scala interface is also available as Scio
*Beam SQL
A Scala API for Apache Beam and Google Cloud Dataflow
Scio is a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding.
Features:
• Scala API close to that of Spark and Scalding core APIs
• Unified batch and streaming programming model
• Integration with Algebird and Breeze
https://github.com/spotify/scio
The Cloud Dataflow Service
A great place for executing Beam pipelines which provides:
● Fully managed, no-ops execution environment
● Integration with Google Cloud Platform
https://beam.apache.org/get-started/beam-overview/
Beam Processing pipeline
In Beam, a big data processing pipeline is a DAG (directed acyclic graph) of parallel operations called PTransforms processing data from PCollections
https://beam.apache.org/get-started/beam-overview/
PipelineRunner
PipelineRunner: specifies where and how the pipeline should execute.
The Spark Runner executes Beam pipelines on top of Apache Spark, providing:
• Batch and streaming (and combined) pipelines.
• The same fault-tolerance guarantees as provided by RDDs and DStreams.
• Native support for Beam side-inputs via Spark’s Broadcast variables.
$ mvn compile exec:java -Dexec.mainClass=com.examples.WordCount \
    -Dexec.args="--runner=SparkRunner
    …."
Options options =
    PipelineOptionsFactory.fromArgs(args)
PCollection
• Parallel collection of timestamped elements
• Can be bounded or unbounded.
• Immutable. Once created, you cannot add, remove, or change individual elements.
• Does not support random access to individual elements.
• Belongs to the pipeline in which it is created. You cannot share a PCollection between Pipeline objects.
Built-in I/O Support
Messaging:
• Amazon Kinesis
• AMQP
• Apache Kafka
• Google Cloud PubSub
• JMS
• MQTT
File-based:
• Apache HDFS, Amazon S3, Google GCS, local filesystems
• FileIO (general-purpose)
• AvroIO
• ParquetIO
• TextIO
• TFRecordIO
• XmlIO
• TikaIO
Database:
• Apache Cassandra
• Apache Hadoop InputFormat
• Apache HBase
• Apache Hive (HCatalog)
• Apache Solr
• Elasticsearch
• Google BigQuery
• Google Cloud Bigtable
• Google Cloud Datastore
• JDBC
• MongoDB
• Redis
https://beam.apache.org/documentation/io/built-in/
In-Progress I/O ...
https://beam.apache.org/documentation/io/built-in/
Core Beam PTransforms
https://beam.apache.org/documentation/programming-guide/
Beam Processing pipeline
[Output PCollection] = [Input PCollection].apply([Transform])
[Final Output PCollection] =
[Initial Input PCollection].apply([First Transform])
.apply([Second Transform])
.apply([Third Transform])
https://beam.apache.org/documentation/programming-guide/#applying-transforms
Element Wise Transforms: ParDo
Performs a user-provided transformation on each element of a PCollection independently
ParDo can output 0, 1, or many values for each input element
ParDo can be used for many different operations...
https://beam.apache.org/documentation/programming-guide/#applying-transforms
Element-Wise Transforms (map)
Apache Beam SDK includes other Element Wise Transforms for convenience:
• ParDo: General; 1-input to (0,1,many)-outputs; side-inputs and side-outputs
• Filter: 1-input to (0 or 1)-outputs
• MapElements: 1-input to 1-output
• FlatMapElements: 1-input to (0,1,many)-outputs
• WithKeys: value -> KV(f(value), value)
• Keys: KV(key, value) -> key
• Values: KV(key, value) -> value
https://beam.apache.org/documentation/programming-guide/
Element Wise Transforms
You can use Java 8 lambda functions with several other Beam transforms, including Filter, FlatMapElements, and Partition
https://beam.apache.org/documentation/programming-guide/
What your (Java) Code Looks Like
[Diagram labels: ReadFile, ExtractWords, Filter, ToLowerCase, Count, WriteFile; File, Predictions]
// PipelineOptions is an interface, so it is created through the factory.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply("ReadFile", TextIO.read().from("..."))
    // Split each line into words (the split pattern is elided in the original).
    .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.<String>asList(line.split("..."))))
    .apply("Filter", Filter.by((String word) -> word.length() > 1))
    .apply("ToLowerCase", MapElements.into(TypeDescriptors.strings())
        .via((String word) -> word.toLowerCase()))
    .apply("CountWords", Count.perElement())
    // ... (steps elided in the original)
    .apply(TextIO.write().to("..."));
p.run();
https://beam.apache.org/documentation/programming-guide/
Grouping Transforms: GroupByKey
The input to GroupByKey is a collection of key/value pairs; you use GroupByKey to collect all of the values associated with each unique key.
https://beam.apache.org/documentation/programming-guide/
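The grouping behaviour described above can be illustrated without the Beam SDK. The following is a minimal plain-Java sketch of the semantics only; the class name, the kv helper, and the sample data are hypothetical and are not part of Beam.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of GroupByKey semantics; it does not use the Beam SDK.
final class GroupByKeySketch {

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Collects every value that shares a key into one list per unique key.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>(); // sorted only to make printing stable
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Sample input: (word, line number) pairs.
    static Map<String, List<Integer>> demo() {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
            kv("cat", 1), kv("dog", 5), kv("cat", 9), kv("and", 1));
        return groupByKey(pairs);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real pipeline the grouping runs in parallel and, for unbounded data, per window; the sketch only shows the shape of the result.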
Grouping Transforms: CoGroupByKey
CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type.
https://beam.apache.org/documentation/programming-guide/
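As a rough illustration of that join, the sketch below groups two keyed inputs by their shared key in plain Java. It does not use the Beam SDK; the Joined class, the field names, and the sample data are assumptions made for the example. A key that is missing from one input still appears in the result, with an empty list on that side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of CoGroupByKey semantics for two inputs; not the Beam API.
final class CoGroupByKeySketch {

    // Holds the values found for one key in each of the two inputs.
    static final class Joined {
        final List<String> emails = new ArrayList<>();
        final List<String> phones = new ArrayList<>();

        @Override
        public String toString() {
            return "emails=" + emails + ", phones=" + phones;
        }
    }

    // Each input element is a two-element array: {key, value}.
    static Map<String, Joined> coGroupByKey(List<String[]> emails, List<String[]> phones) {
        Map<String, Joined> result = new TreeMap<>();
        for (String[] kv : emails) {
            result.computeIfAbsent(kv[0], k -> new Joined()).emails.add(kv[1]);
        }
        for (String[] kv : phones) {
            result.computeIfAbsent(kv[0], k -> new Joined()).phones.add(kv[1]);
        }
        return result;
    }

    static Map<String, Joined> demo() {
        List<String[]> emails = Arrays.asList(
            new String[] {"amy", "amy@example.com"},
            new String[] {"carl", "carl@example.com"});
        List<String[]> phones = Arrays.asList(
            new String[] {"amy", "111-222-3333"},
            new String[] {"james", "222-333-4444"});
        return coGroupByKey(emails, phones);
    }

    public static void main(String[] args) {
        demo().forEach((key, joined) -> System.out.println(key + ": " + joined));
    }
}
```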
Combine
Combine is a Beam transform for combining collections of elements or values in your data.
When you apply a Combine transform, you must provide the function that contains the logic for combining the elements or values.
The combining function should be commutative and associative.
https://beam.apache.org/documentation/programming-guide/
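One way to see why the combining function should be commutative and associative: a runner may pre-combine arbitrary subsets of the input and merge the partial results in any order. The plain-Java sketch below computes a mean with the four steps that Beam's CombineFn defines (createAccumulator, addInput, mergeAccumulators, extractOutput). It mirrors those method names but does not use the Beam SDK, and the way the input is split into bundles is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of a mean combiner; method names mirror CombineFn, but this is not Beam code.
final class MeanCombineSketch {

    // Accumulator: a running sum and count, which can be merged in any order.
    static final class Accum {
        long sum;
        long count;
    }

    static Accum createAccumulator() {
        return new Accum();
    }

    static Accum addInput(Accum acc, int input) {
        acc.sum += input;
        acc.count++;
        return acc;
    }

    static Accum mergeAccumulators(List<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum acc : accums) {
            merged.sum += acc.sum;
            merged.count += acc.count;
        }
        return merged;
    }

    static double extractOutput(Accum acc) {
        return acc.count == 0 ? 0.0 : (double) acc.sum / acc.count;
    }

    // Simulates a runner: combine each bundle separately, then merge the partial results.
    static double combine(List<List<Integer>> bundles) {
        List<Accum> partials = new ArrayList<>();
        for (List<Integer> bundle : bundles) {
            Accum acc = createAccumulator();
            for (int value : bundle) {
                acc = addInput(acc, value);
            }
            partials.add(acc);
        }
        return extractOutput(mergeAccumulators(partials));
    }

    // The same five values, split and ordered in two different ways.
    static double demoTwoBundles() {
        return combine(Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5)));
    }

    static double demoThreeBundles() {
        return combine(Arrays.asList(Arrays.asList(5), Arrays.asList(3, 1), Arrays.asList(4, 2)));
    }

    public static void main(String[] args) {
        System.out.println(demoTwoBundles() + " " + demoThreeBundles());
    }
}
```

Averaging the per-bundle means directly would not have this property, which is why the accumulator carries a sum and a count rather than a mean.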
Partition
Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections.
https://beam.apache.org/documentation/programming-guide/
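The splitting logic can be sketched in plain Java as follows. The PartitionFn interface here is a local stand-in whose method name mirrors Beam's partitioning function (it receives the element and the number of partitions and returns an index); the quartile example and all other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of Partition semantics; it does not use the Beam SDK.
final class PartitionSketch {

    // Local stand-in for a partitioning function: returns an index in [0, numPartitions).
    interface PartitionFn<T> {
        int partitionFor(T element, int numPartitions);
    }

    // Splits one list into a fixed number of smaller lists.
    static <T> List<List<T>> partition(List<T> input, int numPartitions, PartitionFn<T> fn) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (T element : input) {
            int index = fn.partitionFor(element, numPartitions);
            if (index < 0 || index >= numPartitions) {
                throw new IllegalArgumentException("partition index out of range: " + index);
            }
            parts.get(index).add(element);
        }
        return parts;
    }

    // Hypothetical example: split percentile scores (0..99) into four quartile buckets.
    static List<List<Integer>> demo() {
        return partition(Arrays.asList(5, 30, 55, 80, 99, 42), 4,
            (score, n) -> score * n / 100);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The number of partitions is fixed when the pipeline is built; only the assignment of elements to partitions depends on the data.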
https://beam.apache.org/documentation/pipelines/design-your-pipeline/
A branching pipeline
Flatten
Flatten is a Beam transform for PCollection objects that store the same data type.
Flatten merges multiple PCollection objects into a single logical PCollection.
https://beam.apache.org/documentation/pipelines/design-your-pipeline/
Composite transforms (reusable combinations)
Transforms can have a nested structure, where a complex transform performs multiple simpler transforms (such as more than one ParDo, Combine, GroupByKey, or even other composite transforms).
Nesting multiple transforms inside a single composite transform can make your code more modular and easier to understand.
https://beam.apache.org/documentation/programming-guide/
Requirements for writing user code for Beam transforms
In general, your user code must fulfill at least these requirements:
• Your function object must be serializable.
• Your function object must be thread-compatible, and be aware that the Beam SDKs are not thread-safe.
https://beam.apache.org/documentation/programming-guide/
Immutability requirements
• You should not in any way modify an element returned by ProcessContext.element() or ProcessContext.sideInput() (the incoming elements from the input collection).
• Once you output a value using ProcessContext.output() or ProcessContext.sideOutput(), you should not modify that value in any way.
https://beam.apache.org/documentation/programming-guide/
Side inputs
Side inputs – a global view of a PCollection used for broadcast / joins.
ParDo can receive extra inputs “on the side”.
For example, broadcast the count of elements to the processing of each element.
Side inputs are computed (and accessed) per-window.
https://beam.apache.org/documentation/programming-guide/
Side Outputs
ParDos can produce multiple outputs. For example:
• A main output containing all the successfully processed results
• A side output containing all the elements that failed to be processed
[Diagram: input elements → ParDo(SomeDoFn) → Main Output (Continue Pipeline) and Bogus Inputs (Write Out)]
https://beam.apache.org/documentation/programming-guide/
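The main-output / side-output split can be pictured with a small plain-Java sketch: one pass over the input, with each element routed to exactly one of two outputs. This is not Beam code; in the Beam Java SDK the same pattern is expressed with tagged outputs on a ParDo. The class name, the parsing task, and the sample data below are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a ParDo with a main output and a side output; not the Beam API.
final class SideOutputSketch {

    // The two outputs produced by one pass over the input.
    static final class Outputs {
        final List<Integer> main = new ArrayList<>();   // successfully processed results
        final List<String> failed = new ArrayList<>();  // elements that could not be processed
    }

    // Tries to parse each element; failures go to the side output instead of aborting the run.
    static Outputs parseAll(List<String> inputs) {
        Outputs out = new Outputs();
        for (String raw : inputs) {
            try {
                out.main.add(Integer.parseInt(raw.trim()));
            } catch (NumberFormatException e) {
                out.failed.add(raw);
            }
        }
        return out;
    }

    static Outputs demo() {
        return parseAll(Arrays.asList("1", "two", " 3 ", ""));
    }

    public static void main(String[] args) {
        Outputs out = demo();
        System.out.println("main=" + out.main + " failed=" + out.failed);
    }
}
```

Keeping the failed elements as their own collection lets the rest of the pipeline continue on the main output while the bad inputs are written out for inspection.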
Beam SQL
Beam SQL lets you query bounded and unbounded PCollections with SQL statements.
Your SQL query is translated to a PTransform, an encapsulated segment of a Beam pipeline.
PCollection<Row> filteredNames = items.apply(
    BeamSql.query("SELECT appId, description, rowtime "
        + "FROM PCOLLECTION WHERE id=1"));
https://beam.apache.org/documentation/dsls/sql/overview/
http://calcite.apache.org/
Windowing
Windowing partitions data based on the timestamps associated with events.
[Figure: fixed, sliding, and session windows, each shown across Key 1, Key 2, and Key 3]
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
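How a timestamp maps to fixed, sliding, and session windows can be sketched with simple arithmetic. The plain-Java code below is an illustration under stated assumptions, not Beam's implementation: timestamps are plain long values in arbitrary units, windows are half-open intervals [start, start + size), and a new session starts when the gap to the previous event is at least the gap duration. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration of window assignment; it does not use the Beam SDK.
final class WindowingSketch {

    // Fixed windows: each timestamp belongs to exactly one window [start, start + size).
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding windows: a timestamp belongs to every window of length `size`
    // that starts on a multiple of `period` and contains it (latest start first).
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sessions: events (per key) are grouped until a gap of at least `gap` separates them.
    static List<List<Long>> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0;
        for (long timestamp : sorted) {
            if (current == null || timestamp - previous >= gap) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(timestamp);
            previous = timestamp;
        }
        return result;
    }

    static List<List<Long>> demoSessions() {
        return sessions(Arrays.asList(30L, 1L, 11L, 2L, 10L), 5);
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));
        System.out.println(slidingWindowStarts(7, 10, 5));
        System.out.println(demoSessions());
    }
}
```

Fixed and sliding windows depend only on the timestamp, while session windows depend on the other events seen for the same key, which is why they are built by merging.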
Windowing
[Figure: No Windowing vs. Windowing]
https://beam.apache.org/documentation/programming-guide/
Windowing
Unbounded, out-of-order streams
[Figure: stream timeline marked at 8:00]
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Processing Time vs Event Time
[Figure: processing time plotted against event time; labels: Realtime, Delay]
Triggers
Triggers allow you to deal with late-arriving data or to provide early results.
A trigger determines when to emit the results of aggregation as unbounded data arrives.
Triggers
When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(
            AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
https://beam.apache.org/documentation/programming-guide/
Triggers
Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(dt)
Composite Triggers:
AfterEndOfWindow()
.withEarlyFirings(A)
.withLateFirings(B)
AfterAny(A,B)
AfterAll(A,B)
Repeat(A)
Sequence(A,B)
https://beam.apache.org/documentation/programming-guide/
Watermarks
• Event time is determined by the timestamp on the data element itself.
• The watermark is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline.
• Data that arrives after the watermark has passed the end of its window is considered late data.
https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data
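The relationship between event time, the watermark, and late data can be made concrete with a small plain-Java sketch. It assumes fixed windows, long timestamps in arbitrary units, and an allowed-lateness setting after which late elements are dropped; the class, the enum, and the exact boundary conditions are illustrative assumptions rather than Beam's implementation.

```java
// Plain-Java illustration of on-time vs. late data; it does not use the Beam SDK.
final class LateDataSketch {

    enum Status { ON_TIME, LATE, DROPPED }

    // Classifies an element by comparing the watermark at arrival with the end of the
    // element's fixed window. `allowedLateness` is how long late data is still accepted.
    static Status classify(long eventTime, long watermarkAtArrival,
                           long windowSize, long allowedLateness) {
        long windowStart = eventTime - Math.floorMod(eventTime, windowSize);
        long windowEnd = windowStart + windowSize;
        if (watermarkAtArrival < windowEnd) {
            return Status.ON_TIME;   // the window is not yet considered complete
        }
        if (watermarkAtArrival < windowEnd + allowedLateness) {
            return Status.LATE;      // window already complete, but still within tolerance
        }
        return Status.DROPPED;       // too late to be included at all
    }

    public static void main(String[] args) {
        // Event at t=65 falls in the window [60, 120); allowed lateness is 30.
        System.out.println(classify(65, 100, 60, 30));
        System.out.println(classify(65, 130, 60, 30));
        System.out.println(classify(65, 200, 60, 30));
    }
}
```

Note that lateness depends on when the element arrives relative to the watermark, not on the element's own timestamp alone.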
Note: Managing late data is not supported in the Beam SDK for Python
Beam Capability Matrix
https://beam.apache.org/documentation/runners/capability-matrix/
Pros of Apache Beam
• Abstraction over different execution backends and programming languages.
• Clean and simple programming model (easy to understand, implement and maintain).
• Same data pipeline for batch processing as well as for stream processing.
Apache Beam https://beam.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Why Apache Beam? A Google Perspective
http://goo.gl/eWTLH1
Thank you!

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Dernier (20)

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline with Apache Beam (RUS)” — Oleksandr Saienko, Tech Leader/ Senior Software Engineer at SoftServe

  • 1. Building unified Batch and Stream processing pipeline with Apache Beam Senior Software Engineer, PhD Oleksandr Saienko
  • 2. What is Stream Data? Unbounded data: a conceptually infinite set of data items / events. Unbounded data processing: a practically continuous stream of data that needs to be processed / analyzed. Low-latency, approximate, and/or speculative results: these types of results are most often associated with streaming engines. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 Flickr Image: Binary Flow by Adrenalin
  • 3. Streaming data sources • The Internet of Things (IoT) – real-time sensor data collection, analysis & alerts • Autonomous driving – 1 GB of data per minute per car (all sensors) • Traffic monitoring – high event rates: millions of events / sec – high query rates: thousands of queries / sec • Pre-processing of sensor data – CERN experiments generate ~1 PB of measurements per second. – Infeasible to store or process directly; fast preprocessing is a must. … https://www.cohere-technologies.com/technology/overview/
  • 4. Big Data vs Fast Data vs Big Fast Data (figure; axis label: Latency). Image by: Peter Pietzuch https://www.scads.de/images/Events/3rdSummerSchool/Talks/TRabl_StreamProcessing.pdf
  • 5. 8 Requirements of Stream Processing • Keep the data moving • Declarative access • Handle imperfections • Predictable outcomes • Integrate stored and streaming data • Data safety and availability • Automatic partitioning and scaling • Instantaneous processing and response http://cs.brown.edu/~ugur/8rulesSigRec.pdf The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
  • 6. Big Data Landscape 2018 http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png
  • 9. Customer requirements: • Unified solution that can be deployed in the cloud and on-premise (without major changes) • Cloud agnostic: can run on GCP, AWS, Azure, etc. • Can work in both batch and streaming modes • Easy to find developers • Easily maintainable
  • 10. Typical solution: • Build two (or more) stacks – one for batch, one for streaming • Build two (or more) solutions – one for cloud (using cloud managed services), one for on-premise. Drawbacks: • Extremely painful to maintain two different stacks • Different programming models and languages • Multi implementation effort • Multi operational effort • …
  • 12. What is Apache Beam? Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines https://beam.apache.org/get-started/beam-overview/
  • 14. Why Apache Beam? Unified - One model handles batch and streaming use cases. Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in. Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors. https://beam.apache.org/get-started/beam-overview/
  • 15. What is Apache Beam? https://beam.apache.org/get-started/beam-overview/
  • 16. The Apache Beam Vision The Beam abstraction Model: ● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling ● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed ● Scalability for Developers: Clean APIs allow developers to contribute modules independently https://beam.apache.org/get-started/beam-overview/
  • 17. The Apache Beam Vision ● Multiple runners: ○ Apache Apex ○ Apache Flink ○ Apache Spark ○ Google Dataflow ○ Apache Samza ○ Apache Gearpump ● Programming lang: ○ Java ○ Python ○ Go ○ Scala* (Scio) https://beam.apache.org/get-started/beam-overview/
  • 18. The Apache Beam Vision. Beam currently supports the following language-specific SDKs: Java, Go, Python. A Scala interface is also available as Scio. *Beam SQL https://beam.apache.org/get-started/beam-overview/
  • 19. A Scala API for Apache Beam and Google Cloud Dataflow. Scio is a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding. Features: • Scala API close to that of Spark and Scalding core APIs • Unified batch and streaming programming model • Integration with Algebird and Breeze https://github.com/spotify/scio
  • 20. The Cloud Dataflow Service A great place for executing Beam pipelines which provides: ● Fully managed, no-ops execution environment ● Integration with Google Cloud Platform https://beam.apache.org/get-started/beam-overview/
  • 21. Beam Processing pipeline. In Beam, a big data processing pipeline is a DAG (directed acyclic graph) of parallel operations called PTransforms, which process data from PCollections. https://beam.apache.org/get-started/beam-overview/
  • 22. PipelineRunner: specifies where and how the pipeline should execute. The Spark Runner executes Beam pipelines on top of Apache Spark, providing: • Batch and streaming (and combined) pipelines. • The same fault-tolerance guarantees as provided by RDDs and DStreams. • Native support for Beam side-inputs via Spark’s Broadcast variables. $ mvn compile exec:java -Dexec.mainClass=com.examples.WordCount -Dexec.args="--runner=SparkRunner …. Options options = PipelineOptionsFactory.fromArgs(args)
  • 23. PCollection • Parallel collection of timestamped elements • Could be bounded or unbounded. • Immutable. Once created, you cannot add, remove, or change individual elements. • Does not support random access to individual elements. • Belongs to the pipeline in which it is created. You cannot share a PCollection between Pipeline objects.
  • 24. Built-in I/O Support. Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud PubSub, JMS, MQTT. File-based: Apache HDFS, Amazon S3, Google GCS, local filesystems; FileIO (general-purpose), AvroIO, ParquetIO, TextIO, TFRecordIO, XmlIO, TikaIO. Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, Elasticsearch, Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, JDBC, MongoDB, Redis. https://beam.apache.org/documentation/io/built-in/
  • 27. Beam Processing pipeline [Output PCollection] = [Input PCollection].apply([Transform]) [Final Output PCollection] = [Initial Input PCollection].apply([First Transform]) .apply([Second Transform]) .apply([Third Transform]) https://beam.apache.org/documentation/programming-guide/#applying-transforms
  • 28. Element Wise Transforms: ParDo. ParDo performs a user-provided transformation on each element of a PCollection independently. ParDo can output 0, 1, or many values for each input element. ParDo can be used for many different operations... https://beam.apache.org/documentation/programming-guide/#applying-transforms
  • 29. Element Wise Transforms (map). The Apache Beam SDK includes other element-wise transforms for convenience: • ParDo – general; 1-input to (0,1,many)-outputs; side-inputs and side-outputs • Filter – 1-input to (0 or 1)-outputs • MapElements – 1-input to 1-output • FlatMapElements – 1-input to (0,1,many)-outputs • WithKeys – value -> KV(f(value), value) • Keys – KV(key, value) -> key • Values – KV(key, value) -> value https://beam.apache.org/documentation/programming-guide/
  • 30. Element Wise Transforms. The Apache Beam SDK includes other element-wise transforms for convenience: ParDo, Filter, MapElements, FlatMapElements, WithKeys, Keys, Values. You can use Java 8 lambda functions with several other Beam transforms, including Filter, FlatMapElements, and Partition. https://beam.apache.org/documentation/programming-guide/
  • 31. What your (Java) Code Looks Like. Stages: ReadFile → ExtractWords → Filter → ToLowerCase → Count → WriteFile (diagram labels: File, Predictions). Pipeline p = Pipeline.create(PipelineOptionsFactory.create()); p.apply("ReadFile", TextIO.read().from("... .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings()).via((String word) -> Arrays.<String>asList(word.split("... .apply("Filter", Filter.by((String word) -> word.length() > 1)) .apply("ToLowerCase", MapElements.into(TypeDescriptors.strings()).via((String word) -> word.toLowerCase())) .apply("CountWords", Count.perElement()) ... .apply(TextIO.write().to("... p.run(); https://beam.apache.org/documentation/programming-guide/
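The input/output shapes listed for the element-wise transforms can be illustrated without a Beam dependency. The following is a minimal sketch in plain Java that models those shapes on in-memory lists; it is not the Beam API, and the class and method names (`ElementWiseSketch`, `flatMapWords`, and so on) are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java model of element-wise transform shapes (assumption: lists stand in for PCollections).
final class ElementWiseSketch {

    // FlatMapElements shape: one input line -> zero, one, or many words.
    static List<String> flatMapWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")).filter(w -> !w.isEmpty()))
                .collect(Collectors.toList());
    }

    // Filter shape: one input -> zero or one output.
    static List<String> filterLongerThanOne(List<String> words) {
        return words.stream().filter(w -> w.length() > 1).collect(Collectors.toList());
    }

    // MapElements shape: one input -> exactly one output.
    static List<String> mapToLowerCase(List<String> words) {
        return words.stream().map(String::toLowerCase).collect(Collectors.toList());
    }

    // WithKeys shape: value -> KV(f(value), value).
    static <K, V> List<Map.Entry<K, V>> withKeys(List<V> values, Function<V, K> keyFn) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (V v : values) {
            out.add(new SimpleEntry<>(keyFn.apply(v), v));
        }
        return out;
    }

    // Keys shape: KV(key, value) -> key.
    static <K, V> List<K> keys(List<Map.Entry<K, V>> pairs) {
        List<K> out = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            out.add(p.getKey());
        }
        return out;
    }

    // Values shape: KV(key, value) -> value.
    static <K, V> List<V> values(List<Map.Entry<K, V>> pairs) {
        List<V> out = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            out.add(p.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = flatMapWords(Arrays.asList("A Beam b", "  "));
        List<String> lower = mapToLowerCase(filterLongerThanOne(words));
        List<Map.Entry<Integer, String>> keyed = withKeys(lower, String::length);
        System.out.println(words + " -> " + lower + " -> " + keyed);
    }
}
```

In a real pipeline each of these steps would be a transform applied to a PCollection and executed in parallel by the runner; the sketch only shows how many outputs each kind of transform may produce per input.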
  • 32. Grouping Transforms: GroupByKey The input to GroupByKey is a collection of key/value pairs, you use GroupByKey to collect all of the values associated with each unique key. https://beam.apache.org/documentation/programming-guide/
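The grouping behaviour described for GroupByKey can be modelled in a few lines of plain Java. This is a sketch of the semantics only, assuming an in-memory list of key/value pairs in place of a PCollection; `GroupByKeySketch` and `kv` are hypothetical names, and the sample data is made up.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of GroupByKey semantics; not the Beam API.
final class GroupByKeySketch {

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Collects all values that share a key into one list per unique key.
    // This model keeps input order within each list; a distributed runner
    // generally makes no such ordering promise.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<V>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                kv("cat", 1), kv("dog", 5), kv("cat", 9), kv("and", 2), kv("dog", 2));
        System.out.println(groupByKey(pairs));
    }
}
```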
  • 33. Grouping Transforms: CoGroupByKey CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type: https://beam.apache.org/documentation/programming-guide/
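The relational join performed by CoGroupByKey can be pictured as grouping two keyed collections side by side. The sketch below models that for exactly two inputs in plain Java; it is not the Beam API, and `CoGroupByKeySketch`, `Joined`, `kv`, and the sample names and values are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a two-input CoGroupByKey; not the Beam API.
final class CoGroupByKeySketch {

    // Per-key result: every value seen for the key in each input.
    static final class Joined<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();

        @Override
        public String toString() {
            return "{left=" + left + ", right=" + right + "}";
        }
    }

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Every key that appears in either input gets one Joined entry;
    // a key missing from one side simply has an empty list for that side.
    static <K extends Comparable<K>, A, B> Map<K, Joined<A, B>> coGroupByKey(
            List<Map.Entry<K, A>> left, List<Map.Entry<K, B>> right) {
        Map<K, Joined<A, B>> result = new TreeMap<>();
        for (Map.Entry<K, A> e : left) {
            result.computeIfAbsent(e.getKey(), k -> new Joined<A, B>()).left.add(e.getValue());
        }
        for (Map.Entry<K, B> e : right) {
            result.computeIfAbsent(e.getKey(), k -> new Joined<A, B>()).right.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emails = Arrays.asList(
                kv("amy", "amy@example.com"), kv("carl", "carl@example.com"));
        List<Map.Entry<String, String>> phones = Arrays.asList(
                kv("amy", "111"), kv("james", "222"));
        System.out.println(coGroupByKey(emails, phones));
    }
}
```

Because unmatched keys are kept with an empty side, downstream code can derive inner, left, right, or full outer join behaviour by deciding which entries to keep.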
  • 35. Combine Combine is a Beam transform for combining collections of elements or values in your data. When you apply a Combine transform, you must provide the function that contains the logic for combining the elements or values. The combining function should be commutative and associative https://beam.apache.org/documentation/programming-guide/
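The reason the combining function should be commutative and associative is that a runner may combine arbitrary subsets of the data first and merge the partial results later. A small plain-Java sketch of that idea follows; `CombineSketch` and the notion of a "bundle" as a list are assumptions made for illustration, not the Beam API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java model of combining in bundles and merging partial results; not the Beam API.
final class CombineSketch {

    // Reduces every (non-empty) bundle on its own, then reduces the partial results.
    static int combine(List<List<Integer>> bundles, BinaryOperator<Integer> fn) {
        return bundles.stream()
                .map(bundle -> bundle.stream().reduce(fn).orElseThrow(IllegalArgumentException::new))
                .reduce(fn)
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        BinaryOperator<Integer> sum = Integer::sum;
        BinaryOperator<Integer> minus = (a, b) -> a - b;

        // Addition is associative and commutative: the bundling does not matter.
        System.out.println(combine(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4, 5)), sum));
        System.out.println(combine(Arrays.asList(Arrays.asList(5, 3), Arrays.asList(1), Arrays.asList(4, 2)), sum));

        // Subtraction is neither: the same elements give different answers per bundling.
        System.out.println(combine(Arrays.asList(Arrays.asList(10, 3), Arrays.asList(2)), minus));
        System.out.println(combine(Arrays.asList(Arrays.asList(10), Arrays.asList(3, 2)), minus));
    }
}
```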
  • 36. Partition Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections. https://beam.apache.org/documentation/programming-guide/
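Splitting one collection into a fixed number of smaller ones comes down to a function that maps each element to a partition index. The sketch below models this in plain Java; `PartitionSketch` and `PartitionFunction` are hypothetical names, and the percentile-style split of scores is an invented example, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of splitting a collection into a fixed number of partitions; not the Beam API.
final class PartitionSketch {

    // Hypothetical callback: returns an index in [0, numPartitions) for an element.
    interface PartitionFunction<T> {
        int partitionFor(T element, int numPartitions);
    }

    static <T> List<List<T>> partition(List<T> input, int numPartitions, PartitionFunction<T> fn) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<T>());
        }
        for (T element : input) {
            int index = fn.partitionFor(element, numPartitions);
            if (index < 0 || index >= numPartitions) {
                throw new IllegalArgumentException("partition index out of range: " + index);
            }
            parts.get(index).add(element);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Scores in 0..99 split into 4 equal-width ranges.
        List<Integer> scores = Arrays.asList(5, 35, 99, 50, 0);
        System.out.println(partition(scores, 4, (score, n) -> score * n / 100));
    }
}
```

The number of partitions is fixed up front in this model, which matches the slide's "fixed number of smaller collections"; the index function only decides where each element goes.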
  • 39. Flatten. Flatten is a Beam transform for PCollection objects that store the same data type. Flatten merges multiple PCollection objects into a single logical PCollection.
  • 41. Composite transforms (reusable combinations). Transforms can have a nested structure, where a complex transform performs multiple simpler transforms (such as more than one ParDo, Combine, GroupByKey, or even other composite transforms). Nesting multiple transforms inside a single composite transform can make your code more modular and easier to understand. https://beam.apache.org/documentation/programming-guide/
  • 43. Requirements for writing user code for Beam transforms In general, your user code must fulfill at least these requirements: • Your function object must be serializable. • Your function object must be thread-compatible, and be aware that the Beam SDKs are not thread-safe. https://beam.apache.org/documentation/programming-guide/
  • 44. Immutability requirements • You should not in any way modify an element returned by ProcessContext.element() or ProcessContext.sideInput() (the incoming elements from the input collection). • Once you output a value using ProcessContext.output() or ProcessContext.sideOutput(), you should not modify that value in any way. https://beam.apache.org/documentation/programming-guide/
  • 45. Side inputs. A side input is a global view of a PCollection used for broadcasts / joins. ParDo can receive extra inputs “on the side”, for example broadcasting the count of elements to the processing of each element. Side inputs are computed (and accessed) per window. https://beam.apache.org/documentation/programming-guide/
  • 46. Side Outputs. ParDos can produce multiple outputs, for example: a main output containing all the successfully processed results, and a side output containing all the elements that failed to be processed. (Diagram labels: input elements, ParDo(SomeDoFn), Main Output, Continue Pipeline, Bogus Inputs, Write Out.) https://beam.apache.org/documentation/programming-guide/
  • 47. Beam SQL. Beam SQL lets you query bounded and unbounded PCollections with SQL statements. Your SQL query is translated to a PTransform, an encapsulated segment of a Beam pipeline. https://beam.apache.org/documentation/dsls/sql/overview/ http://calcite.apache.org/ PCollection<Row> filteredNames = items.apply( BeamSql.query( "SELECT appId, description, rowtime FROM PCOLLECTION WHERE id=1"));
  • 48. Windowing partitions data based on the timestamps associated with events. (Figure: fixed, sliding, and session windows shown for Key 1, Key 2, and Key 3.) https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • 51. Unbounded, out-of-order streams (figure; label: 8:00) https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • 52. Processing Time vs Event Time (figure; axes: Processing Time, Event Time; labels: Realtime, Delay)
  • 53. Triggers. Triggers determine when to emit the results of aggregation as unbounded data arrives; they allow you to deal with late-arriving data or to provide early results. When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window. input .apply(Window.into(FixedWindows.of(...)) .triggering( AfterWatermark.pastEndOfWindow())) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...)) https://beam.apache.org/documentation/programming-guide/
  • 55. Watermarks • Event time is determined by the timestamp on the data element itself. • The watermark is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. • Data that arrives after the watermark has passed the end of its window is considered late data. https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data Note: Managing late data is not supported in the Beam SDK for Python
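The main-output / failed-elements split described for side outputs can be sketched with two plain lists. This is only a model of the idea, assuming string inputs that should parse as integers; `SideOutputSketch`, `Outputs`, and `parseAll` are hypothetical names and none of this is the Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of one processing step with a main output and a side output; not the Beam API.
final class SideOutputSketch {

    static final class Outputs {
        final List<Integer> parsed = new ArrayList<>(); // main output: successfully processed results
        final List<String> failed = new ArrayList<>();  // side output: elements that could not be processed
    }

    // Each input goes to exactly one of the two outputs.
    static Outputs parseAll(List<String> input) {
        Outputs out = new Outputs();
        for (String raw : input) {
            try {
                out.parsed.add(Integer.parseInt(raw.trim()));
            } catch (NumberFormatException e) {
                out.failed.add(raw);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Outputs out = parseAll(Arrays.asList("1", " 42 ", "x7", ""));
        System.out.println("main: " + out.parsed + ", failed: " + out.failed);
    }
}
```

Routing bad records to their own output instead of throwing keeps the rest of the pipeline running and leaves the failures available to be written out and inspected.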
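How a timestamp maps to fixed, sliding, and session windows can be worked through with simple arithmetic. The following plain-Java sketch assumes timestamps and durations are plain `long` values in the same unit and windows are half-open intervals `[start, end)`; `WindowingSketch` and its method names are hypothetical, and this is a model rather than Beam's own windowing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of window assignment by event timestamp; not the Beam API.
final class WindowingSketch {

    // Fixed windows: every timestamp belongs to exactly one window of length `size`.
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding windows: a new window of length `size` starts every `period`,
    // so one timestamp can fall into several overlapping windows.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sessions: each timestamp opens [ts, ts + gap); overlapping windows are merged.
    static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = Math.max(last[1], ts + gap); // extend the current session
            } else {
                result.add(new long[] {ts, ts + gap}); // start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));
        System.out.println(slidingWindowStarts(125, 60, 30));
        for (long[] s : sessions(Arrays.asList(1L, 3L, 20L, 22L, 60L), 10)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

In the slide's figure the same assignment happens independently per key, so sessions for Key 1 never merge with sessions for Key 2.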
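The relationship between event time, the watermark, and late data can be made concrete with a small classification function. This is a deliberately simplified plain-Java model: it assumes fixed windows, `long` timestamps in one unit, and an "allowed lateness" budget after the window end, and it ignores boundary details of real runners. `WatermarkSketch`, `Status`, and `classify` are hypothetical names, not the Beam API.

```java
// Simplified plain-Java model of on-time vs late data relative to a watermark; not the Beam API.
final class WatermarkSketch {

    enum Status { ON_TIME, LATE, DROPPED }

    // eventTimestamp: timestamp carried by the element itself (event time).
    // watermark: the system's current estimate of event-time progress when the element arrives.
    static Status classify(long eventTimestamp, long watermark, long windowSize, long allowedLateness) {
        long windowEnd = eventTimestamp - Math.floorMod(eventTimestamp, windowSize) + windowSize;
        if (watermark < windowEnd) {
            return Status.ON_TIME;   // the window is still considered open
        }
        if (watermark < windowEnd + allowedLateness) {
            return Status.LATE;      // window end has passed, but lateness is still tolerated
        }
        return Status.DROPPED;       // too late to be included in the window
    }

    public static void main(String[] args) {
        // An element stamped 125 belongs to the fixed window [120, 180) when windowSize is 60.
        System.out.println(classify(125, 170, 60, 30));
        System.out.println(classify(125, 200, 60, 30));
        System.out.println(classify(125, 215, 60, 30));
    }
}
```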
  • 59. Pros of Apache Beam • Abstraction over different execution backends and programming languages. • Clean and Simple programming model. (easy to understand, implement and maintain) • Same data pipeline for batch processing as well as for stream processing.
  • 60. Apache Beam https://beam.apache.org The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Why Apache Beam? A Google Perspective http://goo.gl/eWTLH1