Google slide version of this slide can be accessed from: https://docs.google.com/presentation/d/1Ws73JxlVH39HiKiYuF3vW903j8wFzxPQihXz4CQ_HZM/edit?usp=sharing
2. Imre Nagi
Ping me @imrenagi
Previously:
Software Engineer @ CERN, eBay Inc.
Currently:
Software Engineer @ Traveloka Data
Docker Community Leader, Indonesia
5. Apache Beam ...
A set of SDKs that define the programming model you use to build your stream and batch processing pipelines.
Cloud Dataflow
A fully managed distributed service that runs and optimizes your Beam pipelines.
6. Jakarta
Dataflow for ...
ETL
● Move
● Filter
● Enrich
I/O Operations
● Connecting to Cloud Pub/Sub
● Read and Write to BigQuery, Bigtable, etc.
Analytics
● Streaming Computing
● Batch Computing
● Machine Learning
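These three categories can be sketched as one small Beam pipeline run on Dataflow. A minimal, hedged example: the bucket paths are hypothetical, and the upper-casing step is just a stand-in for a real enrichment.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class EtlSketch {
  // Filter out empty lines, then "enrich" by upper-casing (stand-in transform).
  static PCollection<String> etl(PCollection<String> raw) {
    return raw
        .apply(Filter.by(line -> !line.isEmpty()))
        .apply(MapElements.into(TypeDescriptors.strings())
                          .via(line -> line.toUpperCase()));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Move: read from one location, write the transformed result to another.
    etl(p.apply(TextIO.read().from("gs://example-bucket/input/*")))
        .apply(TextIO.write().to("gs://example-bucket/output/result"));
    p.run().waitUntilFinish();
  }
}
```

Running the same `etl` step under Dataflow only requires swapping the runner in the pipeline options; the transform code does not change.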
12. Beam Pipeline
Represents a graph of data processing transformations.
PCollections flow through the pipeline.
Can have multiple I/O sources and sinks at the beginning and end of the pipeline.
14. import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
// Define the pipeline options
PipelineOptions options = PipelineOptionsFactory.create();
// Create the pipeline
Pipeline p = Pipeline.create(options);
15. Data Model
PCollection<T> is a collection of elements of data type T.
May be bounded or unbounded in size.
Each element may have an implicit or explicit timestamp.
16. // Create the PCollection 'lines' by applying a 'Read' transform.
PCollection<String> lines = p.apply(TextIO.read().from("/path/to/some/inputData.txt"));
PCollection<String> linesGCS = p.apply(TextIO.read().from("gs://deeptech/*"));
static final List<String> LINES = Arrays.asList(
    "This is the first line",
    "You will say this one is the second",
    "But it's not. ");
// Create a PCollection from in-memory data
PCollection<String> memLines = p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());
// Generate a bounded PCollection
PCollection<Long> bounded = p.apply(GenerateSequence.from(0).to(1000));
// Generate an unbounded PCollection
PCollection<Long> unbounded = p.apply(GenerateSequence.from(0));
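Since each element can carry an explicit event-time timestamp, one way to attach timestamps when building a PCollection is `Create.timestamped`. A small sketch; the element values and millisecond timestamps are arbitrary.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Instant;

public class TimestampSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    // Each element is paired with an explicit event-time timestamp;
    // windowing later in the pipeline is computed from these timestamps.
    PCollection<String> stamped = p.apply(Create.timestamped(Arrays.asList(
        TimestampedValue.of("first", new Instant(1000L)),
        TimestampedValue.of("second", new Instant(2000L)))));
    p.run().waitUntilFinish();
  }
}
```

For elements read from a source, `WithTimestamps.of(...)` can derive the timestamp from the element itself instead.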
35. The Lambda Architecture asserts that stream processing alone CAN'T produce accurate analytics results. Thus, batch processing is necessary to correct the inaccuracy of the stream processing.
39. What is windowing?
Windowing divides data into finite chunks based on event time. It is often required when doing aggregations over unbounded data.
[Figure: fixed vs. sliding windows, elements keyed 1-3 laid out along event time]
A windowing function computes which window(s) an element belongs to. Temporal windowing functions can be parameterized with duration and frequency.
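In the Beam Java SDK, fixed and sliding windows are applied with `Window.into`; the one-minute duration and thirty-second slide frequency below are illustrative only.

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowSketch {
  // Fixed windows: non-overlapping one-minute chunks of event time.
  static PCollection<String> fixedWindows(PCollection<String> input) {
    return input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
  }

  // Sliding windows: two-minute windows, a new one starting every 30 seconds,
  // so each element can fall into several overlapping windows.
  static PCollection<String> slidingWindows(PCollection<String> input) {
    return input.apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(2))
                      .every(Duration.standardSeconds(30))));
  }
}
```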
40. What about data-dependent windowing?
[Figure: session windows per key along event time]
Sessions are unique per key: you can't know a priori when a session ends, so the windowing function is now also parameterized by state.
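In the Java SDK this is `Sessions.withGapDuration`: a per-key session closes once the gap since the last event exceeds the given duration. The ten-minute gap below is an assumed value for illustration.

```java
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SessionSketch {
  // A new session window starts per key after a 10-minute gap with no events;
  // events closer together than the gap are merged into one session.
  static PCollection<KV<String, Long>> sessionize(PCollection<KV<String, Long>> events) {
    return events.apply(Window.<KV<String, Long>>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));
  }
}
```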
44. Trigger
A trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal. Triggers provide flexibility in choosing when outputs should be emitted, and they make it possible to observe the output for a window multiple times as it evolves.
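A common trigger shape in the Java SDK combines early, on-time, and late firings on one window; the durations below (30-second early delay, 5-minute allowed lateness) are illustrative, not recommendations.

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class TriggerSketch {
  // Emit a speculative (early) pane 30s after the first element arrives,
  // the on-time pane when the watermark passes the end of the window, and
  // a late pane for each late element within 5 minutes of allowed lateness.
  static PCollection<String> windowed(PCollection<String> input) {
    return input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(5))
            .accumulatingFiredPanes());
  }
}
```

`accumulatingFiredPanes()` means each successive pane contains the full window contents so far; `discardingFiredPanes()` would emit only the elements new since the last firing.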
54. Credits to:
1. Jelena Pjesivac-Grbovic, http://streamingsystems.org/Presentations/Jelena%20Pjesivac-grbovic.pdf
2. Stream Analytics with Google Cloud Dataflow: Use Cases & Patterns, Gaurav Anand
3. Streaming 101 & 102, Tyler Akidau
4. https://streamingbook.net
5. Apache Beam Documentation