In this InfluxDays NYC 2019 talk, you will get an overview of Google's data pipeline products and some use cases for infrastructure monitoring and IoT. In addition, we will share some common solutions that can be deployed on GCP, including using the InfluxDB time series database for Kubernetes monitoring and IoT.
Building Modern Data Pipelines for Time Series Data on GCP with InfluxData by Ron Pantofaro
1. Data pipelines on Google Cloud
Ron Pantofaro, Solutions Architect, Google
@panto
2. Business goal: Respond to business events as they happen
[Diagram: transactional and device data from endpoint clients (IoT, mobile, web) and databases flows through Ingest → Transform → Analyze stages to data consumers and the user experience.]
3. IT goal: Simplify ETL architecture
[Diagram: a file-based pipeline, in which data producers write files that data consumers must collect, vs. an event-based pipeline, in which producers publish to Pub/Sub and consumers subscribe.]
Files vs. events:
Files:
● Applications must persist millions of small files
● Every file must arrive as a precondition to job completion, delaying processing
● Unit of access is different in application logic
Events:
● No persistence of files required
● Every message is guaranteed to be delivered to every reader
● Unit of access is the same in application logic
6. What is Cloud Pub/Sub?
Cloud Pub/Sub is a fully managed, real-time messaging service that allows you to send and receive messages between independent applications.
[Diagram: a publisher sends messages to a topic; a subscription delivers them to a subscriber.]
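A minimal sketch of that flow with the google-cloud-pubsub Java client; the project, topic, and subscription names below are placeholders:

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PubSubSketch {
  public static void main(String[] args) throws Exception {
    // Publisher side: send a message to a topic.
    Publisher publisher =
        Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("device-reading: 42"))
        .build();
    publisher.publish(message); // returns an ApiFuture with the message ID
    publisher.shutdown();

    // Subscriber side: an independent application receives via a subscription.
    MessageReceiver receiver = (PubsubMessage msg, AckReplyConsumer consumer) -> {
      System.out.println("Received: " + msg.getData().toStringUtf8());
      consumer.ack(); // acknowledge so Pub/Sub does not redeliver
    };
    Subscriber subscriber = Subscriber.newBuilder(
        ProjectSubscriptionName.of("my-project", "my-subscription"), receiver).build();
    subscriber.startAsync().awaitRunning();
  }
}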
7. What is Cloud Dataflow?
[Diagram: Dataflow sits between sources and sinks, processing both bounded and unbounded data over managed compute and storage. The service provides resource management, resource auto-scaling, dynamic work rebalancing, work scheduling, monitoring, log collection, graph optimization, auto-healing, and intelligent watermarking.]
8. A Unified Model on Google Cloud Platform
[Diagram: events from mobile devices flow through Cloud Pub/Sub and Cloud Dataflow into storage, at a scale of tens of thousands of events per second, tens of billions of events per month, and hundreds of billions of events per year.]
9. What is Cloud Dataflow?
Here’s a simple graphic showing how Dataflow can integrate and transform data from two sources.
[Diagram: Cloud Dataflow combines a bounded source (one discrete job) with an unbounded source (endless incoming data) in a single pipeline.]
10. Why Use Cloud Dataflow?
Deploy, schedule & monitor:
1. Fully-managed and auto-configured
2. Auto graph-optimized for best execution path
3. Autoscaling mid-job
4. Dynamic work rebalancing mid-job
11. Why Use Cloud Dataflow? (2) Auto graph-optimized for best execution path
[Diagram: adjacent pipeline steps C and D are fused by the optimizer into a single C+D stage.]
12. Why Use Cloud Dataflow? (3) Autoscaling mid-job
[Diagram: worker count tracks load as throughput swings from 800 RPS to 1,200 RPS to 5,000 RPS to 50 RPS; *means 100% cluster utilization by definition.]
13. Why Use Cloud Dataflow? (4) Dynamic work rebalancing mid-job
[Diagram: the same job takes 100 minutes without rebalancing vs. 65 minutes with it.]
14. Autoscaling at Work
Start off with 3 workers; things are looking okay, and the job should finish in about 10 minutes. Then re-estimation shows there’s orders of magnitude more work: at the current rate the job would take 3 days, and you need 100 workers. But having 100 workers doesn’t help if you don’t have 100 pieces of work: without redistribution, most of them sit idle. And that’s really the most important part.
18. Pub/Sub to BigQuery
Dataflow templates let you stage your job’s artifacts in Google Cloud Storage. Launch template jobs via the REST API or the Cloud Console.
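As a sketch, launching the Google-provided Pub/Sub-to-BigQuery template via the REST API could look like this; the project, topic, and table names are placeholders:

curl -X POST \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/templates:launch?gcsPath=gs://dataflow-templates/latest/PubSub_to_BigQuery" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "jobName": "pubsub-to-bigquery-example",
    "parameters": {
      "inputTopic": "projects/my-project/topics/my-topic",
      "outputTableSpec": "my-project:my_dataset.my_table"
    }
  }'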
20. Why analytics on Google Cloud Platform?
The integrated, open way to ingest, process, and analyze data that is also easy to adopt, scale, and manage:
1. Simpler operation via serverless infrastructure and fully-managed services
2. Faster development: a single code base for batch & streaming with Apache Beam
3. Lower cost with efficient scheduling, fast auto-scaling, and granular billing
4. Easier to get started with hybrid deployments via open, standard APIs
21. Simplify management and operations
All resources are provisioned automatically for nearly limitless scale:
● Ingest data from anywhere to anywhere at up to 100 GB/s with consistent performance
● Data processing worker nodes auto-scale for maximum utilization, with dynamic rebalancing
● Rely on encryption everywhere, policy-based access control, and HIPAA compliance
● End-to-end monitoring and alerting helps troubleshoot pipelines while they’re running
This tradeoff gives rise to the Lambda architecture. In the Lambda model, we create a streaming pipeline that handles data as it arrives, perhaps computing results one event at a time. We also create a batch pipeline that reprocesses the data once it’s complete, in order to true up the approximate results from the streaming path with results computed over the complete data. The problems: we potentially wait a long time for correct results, and, as in this picture, we usually end up with two different codebases to process what’s really the same data.
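Apache Beam, the model Dataflow executes, addresses the two-codebase problem: only the source differs between the batch and streaming pipelines, while the transforms are shared. A minimal sketch, with placeholder topic and bucket names:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class UnifiedModel {
  // The shared business logic: one codebase for both batch and streaming.
  static PCollection<String> normalize(PCollection<String> events) {
    return events.apply("Normalize",
        MapElements.into(TypeDescriptors.strings())
            .via((String e) -> e.trim().toLowerCase()));
  }

  public static void main(String[] args) {
    boolean useStreamingSource = false; // flip to read from Pub/Sub instead of files
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Only the source differs; everything downstream is identical.
    PCollection<String> events = useStreamingSource
        ? p.apply(PubsubIO.readStrings()
              .fromTopic("projects/my-project/topics/events")) // unbounded
        : p.apply(TextIO.read().from("gs://my-bucket/events/*.txt")); // bounded

    normalize(events); // downstream sinks omitted in this sketch
    p.run();
  }
}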
When the job starts, Dataflow automatically optimizes the pipeline, fusing some operations together and breaking others apart. This optimization resembles what a database execution engine does when it turns the SQL you provide into a physical execution plan. Dataflow might choose to fuse operations together in order to avoid costly processing or I/O.
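For example, two adjacent element-wise steps like the hypothetical C and D below are candidates for fusion, so that C’s intermediate output never needs to be materialized:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FusionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("a, 1", "b, 2"))
        // Step C: parse out the first field.
        .apply("C", MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.split(",")[0]))
        // Step D: clean it up. C and D are both element-wise, so the Dataflow
        // optimizer can fuse them into a single C+D stage that passes elements
        // along in memory.
        .apply("D", MapElements.into(TypeDescriptors.strings())
            .via((String field) -> field.trim()));
    p.run();
  }
}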
While the job runs, Dataflow monitors throughput and can automatically scale the worker count up in response to spikes and back down when workers are no longer needed. Dataflow also monitors the execution time of tasks within the job and automatically rebalances work across workers; this rebalancing ensures that stragglers or skewed data do not cause your job to run longer. In the picture, we see two examples. On the left is a job without rebalancing: stragglers take longer to complete their tasks, perhaps because a few workers are misbehaving, and you can see a handful of thin lines reaching up to the top while the majority finish earlier. With work rebalancing, the work from the stragglers is redistributed to other nodes, and the pipeline completes dramatically faster.
What does autoscaling do for you? Take an example where a pipeline starts with 3 workers and everything looks good. Some time later, the original completion-time estimate is revised from 10 minutes to 3 days! Dataflow decides it would like to use 100 workers to complete the job faster. So what does it do? It creates more workers, then takes the existing work and redistributes it across the nodes automatically.
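On the user’s side this is configuration rather than code: the Dataflow runner exposes the relevant knobs as pipeline options. A minimal sketch, where the 100-worker cap mirrors the example above:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    // Let the service scale the worker pool, up to a cap of 100 workers,
    // based on observed throughput and backlog.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(100);
    // Pass these options to Pipeline.create(options); the equivalent flags are
    // --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=100
  }
}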
So here’s an example of a very simple pipeline. We read from a text file specified on the command line, apply a CountWords transform to (you guessed it) count the words, format the counts, and write them to a file we specify on the command line.
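A sketch of that pipeline with Beam’s Java SDK, closely following the canonical WordCount example (the paths are hardcoded placeholders here; the real example takes them from the command line):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))
        // CountWords: split lines into words, then count each word.
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("CountWords", Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("gs://my-bucket/wordcounts"));
    p.run().waitUntilFinish();
  }
}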
A more complex example uses Pub/Sub to get its data and similarly counts things. One distinction here is how easy it is to take a count of things based on a window; the linked CountRides example counts taxi rides this way:
https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon/blob/master/dataflow/src/main/java/com/google/codelabs/dataflow/CountRides.java
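The core of that windowed-count pattern, sketched with a placeholder topic name (see the linked CountRides code for the full version):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WindowedCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/rides"))
        // Divide the endless stream into fixed one-minute windows...
        .apply("OneMinuteWindows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // ...after which counting per window is just the same Count transform.
        .apply("CountPerWindow", Count.perElement());
    p.run();
  }
}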