In this InfluxDays NYC 2019 talk, you will get an overview of Google's data pipeline products and some use cases for infrastructure monitoring and IoT. In addition, we will share some common solutions that can be deployed on GCP, including using the InfluxDB time series database for Kubernetes monitoring and IoT.
Building Modern Data Pipelines for Time Series Data on GCP with InfluxData by Ron Pantofaro
1. Data pipelines on Google Cloud
Ron Pantofaro, Solutions Architect, Google
@panto
2. Business goal: Respond to business events as they happen
[Diagram: transactional and device data from endpoint clients (IoT, mobile, web) and databases flows through Ingest → Transform → Analyze stages to data consumers and the user experience.]
3. IT goal: Simplify ETL architecture
[Diagram: a file-based pipeline, in which data producers write files that data consumers must collect, vs. an event-based pipeline, in which producers publish to Pub/Sub and consumers subscribe.]
Files vs. events:
Files:
● Applications must persist millions of small files
● Every file must arrive as a precondition to job completion, delaying processing
● Unit of access is different in application logic
Events:
● No persistence of files required
● Every message is guaranteed to be delivered to every reader
● Unit of access is the same in application logic
6. What is Cloud Pub/Sub?
Cloud Pub/Sub is a fully managed, real-time messaging service that allows you to send and receive messages between independent applications.
[Diagram: a publisher sends messages to a topic; a subscription delivers them to a subscriber.]
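A minimal sketch of that flow with the google-cloud-pubsub Java client; the project, topic, and subscription names below are placeholders:

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PubSubSketch {
  public static void main(String[] args) throws Exception {
    // Publisher side: send a message to a topic.
    Publisher publisher =
        Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("device-reading: 42"))
        .build();
    publisher.publish(message); // returns an ApiFuture with the message ID
    publisher.shutdown();

    // Subscriber side: an independent application receives via a subscription.
    MessageReceiver receiver = (PubsubMessage msg, AckReplyConsumer consumer) -> {
      System.out.println("Received: " + msg.getData().toStringUtf8());
      consumer.ack(); // acknowledge so Pub/Sub does not redeliver
    };
    Subscriber subscriber = Subscriber.newBuilder(
        ProjectSubscriptionName.of("my-project", "my-subscription"), receiver).build();
    subscriber.startAsync().awaitRunning();
  }
}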
7. What is Cloud Dataflow?
[Diagram: Dataflow sits between sources and sinks, processing both bounded and unbounded data over managed compute and storage. The service provides resource management, resource auto-scaling, dynamic work rebalancing, work scheduling, monitoring, log collection, graph optimization, auto-healing, and intelligent watermarking.]
8. A Unified Model on Google Cloud Platform
[Diagram: events from mobile devices flow through Cloud Pub/Sub and Cloud Dataflow into storage, at a scale of tens of thousands of events per second, tens of billions of events per month, and hundreds of billions of events per year.]
9. What is Cloud Dataflow?
Here’s a simple graphic showing how Dataflow can integrate and transform data from two sources.
[Diagram: Cloud Dataflow combines a bounded source (one discrete job) with an unbounded source (endless incoming data) in a single pipeline.]
10. Why Use Cloud Dataflow?
Deploy, schedule & monitor:
1. Fully-managed and auto-configured
2. Auto graph-optimized for best execution path
3. Autoscaling mid-job
4. Dynamic work rebalancing mid-job
11. Why Use Cloud Dataflow? (2) Auto graph-optimized for best execution path
[Diagram: adjacent pipeline steps C and D are fused by the optimizer into a single C+D stage.]
12. Why Use Cloud Dataflow? (3) Autoscaling mid-job
[Diagram: worker count tracks load as throughput swings from 800 RPS to 1,200 RPS to 5,000 RPS to 50 RPS; *means 100% cluster utilization by definition.]
13. Why Use Cloud Dataflow? (4) Dynamic work rebalancing mid-job
[Diagram: the same job takes 100 minutes without rebalancing vs. 65 minutes with it.]
14. Autoscaling at Work
Start off with 3 workers; things are looking okay, and the job should finish in about 10 minutes. Then re-estimation shows there’s orders of magnitude more work: at the current rate the job would take 3 days, and you need 100 workers. But having 100 workers doesn’t help if you don’t have 100 pieces of work: without redistribution, most of them sit idle. And that’s really the most important part.
18. Pub/Sub to BigQuery
Dataflow templates let you stage your job’s artifacts in Google Cloud Storage. Launch template jobs via the REST API or the Cloud Console.
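As a sketch, launching the Google-provided Pub/Sub-to-BigQuery template via the REST API could look like this; the project, topic, and table names are placeholders:

curl -X POST \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/templates:launch?gcsPath=gs://dataflow-templates/latest/PubSub_to_BigQuery" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "jobName": "pubsub-to-bigquery-example",
    "parameters": {
      "inputTopic": "projects/my-project/topics/my-topic",
      "outputTableSpec": "my-project:my_dataset.my_table"
    }
  }'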
20. Why analytics on Google Cloud Platform?
The integrated, open way to ingest, process, and analyze data that is also easy to adopt, scale, and manage:
1. Simpler operation via serverless infrastructure and fully-managed services
2. Faster development: a single code base for batch & streaming with Apache Beam
3. Lower cost with efficient scheduling, fast auto-scaling, and granular billing
4. Easier to get started with hybrid deployments via open, standard APIs
21. Simplify management and operations
All resources are provisioned automatically for nearly limitless scale:
● Ingest data from anywhere to anywhere at up to 100 GB/s with consistent performance
● Data processing worker nodes auto-scale for maximum utilization, with dynamic rebalancing
● Rely on encryption everywhere, policy-based access control, and HIPAA compliance
● End-to-end monitoring and alerting helps troubleshoot pipelines while they’re running
This tradeoff gives rise to the Lambda architecture. In the Lambda model, we create a streaming pipeline that handles data as it arrives, perhaps computing results one event at a time. We also create a batch pipeline that reprocesses the data once it’s complete, in order to true up the approximate results from the streaming path with results computed over the complete data. The problems: we potentially wait a long time for correct results, and, as in this picture, we usually end up with two different codebases to process what’s really the same data.
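Apache Beam, the model Dataflow executes, addresses the two-codebase problem: only the source differs between the batch and streaming pipelines, while the transforms are shared. A minimal sketch, with placeholder topic and bucket names:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class UnifiedModel {
  // The shared business logic: one codebase for both batch and streaming.
  static PCollection<String> normalize(PCollection<String> events) {
    return events.apply("Normalize",
        MapElements.into(TypeDescriptors.strings())
            .via((String e) -> e.trim().toLowerCase()));
  }

  public static void main(String[] args) {
    boolean useStreamingSource = false; // flip to read from Pub/Sub instead of files
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Only the source differs; everything downstream is identical.
    PCollection<String> events = useStreamingSource
        ? p.apply(PubsubIO.readStrings()
              .fromTopic("projects/my-project/topics/events")) // unbounded
        : p.apply(TextIO.read().from("gs://my-bucket/events/*.txt")); // bounded

    normalize(events); // downstream sinks omitted in this sketch
    p.run();
  }
}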
When the job starts, Dataflow automatically optimizes the pipeline, fusing some operations together and breaking others apart. This optimization resembles what a database execution engine does when it turns the SQL you provide into a physical execution plan. Dataflow might choose to fuse operations together in order to avoid costly processing or I/O.
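For example, two adjacent element-wise steps like the hypothetical C and D below are candidates for fusion, so that C’s intermediate output never needs to be materialized:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FusionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("a, 1", "b, 2"))
        // Step C: parse out the first field.
        .apply("C", MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.split(",")[0]))
        // Step D: clean it up. C and D are both element-wise, so the Dataflow
        // optimizer can fuse them into a single C+D stage that passes elements
        // along in memory.
        .apply("D", MapElements.into(TypeDescriptors.strings())
            .via((String field) -> field.trim()));
    p.run();
  }
}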
While the job runs, Dataflow monitors throughput and can automatically scale the worker count up in response to spikes and back down when workers are no longer needed. Dataflow also monitors the execution time of tasks within the job and automatically rebalances work across workers; this rebalancing ensures that stragglers or skewed data do not cause your job to run longer. In the picture, we see two examples. On the left is a job without rebalancing: stragglers take longer to complete their tasks, perhaps because a few workers are misbehaving, and you can see a handful of thin lines reaching up to the top while the majority finish earlier. With work rebalancing, the work from the stragglers is redistributed to other nodes, and the pipeline completes dramatically faster.
What does autoscaling do for you? Take an example where a pipeline starts with 3 workers and everything looks good. Some time later, the original completion-time estimate is revised from 10 minutes to 3 days! Dataflow decides it would like to use 100 workers to complete the job faster. So what does it do? It creates more workers, then takes the existing work and redistributes it across the nodes automatically.
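On the user’s side this is configuration rather than code: the Dataflow runner exposes the relevant knobs as pipeline options. A minimal sketch, where the 100-worker cap mirrors the example above:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    // Let the service scale the worker pool, up to a cap of 100 workers,
    // based on observed throughput and backlog.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(100);
    // Pass these options to Pipeline.create(options); the equivalent flags are
    // --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=100
  }
}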
So here’s an example of a very simple pipeline. We read from a text file specified on the command line, apply a CountWords transform to (you guessed it) count the words, format the counts, and write them to a file we specify on the command line.
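A sketch of that pipeline with Beam’s Java SDK, closely following the canonical WordCount example (the paths are hardcoded placeholders here; the real example takes them from the command line):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))
        // CountWords: split lines into words, then count each word.
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("CountWords", Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("gs://my-bucket/wordcounts"));
    p.run().waitUntilFinish();
  }
}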
A more complex example uses Pub/Sub to get its data and similarly counts things. One distinction here is how easy it is to take a count of things based on a window; the linked CountRides example counts taxi rides this way:
https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon/blob/master/dataflow/src/main/java/com/google/codelabs/dataflow/CountRides.java
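The core of that windowed-count pattern, sketched with a placeholder topic name (see the linked CountRides code for the full version):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WindowedCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/rides"))
        // Divide the endless stream into fixed one-minute windows...
        .apply("OneMinuteWindows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // ...after which counting per window is just the same Count transform.
        .apply("CountPerWindow", Count.perElement());
    p.run();
  }
}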