The document describes building an event analytics pipeline using Google Cloud services like BigQuery, Dataflow, Pub/Sub, and Kubernetes Engine. It discusses ingesting streaming and batch event data, processing the events to filter, aggregate and group them, analyzing billions of events at scale in BigQuery, and using cost analytics with reOptimize.io to optimize costs. The pipeline has a serverless architecture and is designed to be flexible, scalable, and low cost.
Build Your Own Event Analytics Pipeline Using BigQuery, Dataflow, and K8s
1. Build Your Own Event Analytics Pipeline Using BigQuery, Dataflow, and K8s
Aviv Laufer
Principal Reliability Engineer, DoiT International
@avivl
2. Google’s Premier MSP Partner helping startups around the globe with cloud engineering & cost optimization
Autoscaling Hadoop and Spark on top of Google Dataproc
Opinionated Event Analytics Pipeline built on top of Dataflow
Park non-production instances and save ±60% on Google Compute Engine
Collaborate with peers and other teams on configuration changes in Google Cloud
The most advanced cost-optimization platform for Google Cloud
4. Event analytics pipeline v2.0
Flexible: Unlimited aggregations and joins on our own data with the BI tool of our choice
Lower cost at scale: Cost per event should decrease as we stream more events to the system
Global: Short latencies for most users regardless of their location
11. Global: latency distribution (95th percentile)
North America: 89ms
West Europe: 54ms
Without GKE cluster in asia-east1: 250ms
With GKE cluster in asia-east1: 61ms (75% improvement!)
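The improvement figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes "improvement" means the relative reduction in p95 latency; the function and variable names are illustrative and not taken from the talk.

```python
def percent_improvement(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


# p95 latencies from the slide, without and with a GKE cluster in asia-east1
without_cluster_ms = 250
with_cluster_ms = 61

improvement = percent_improvement(without_cluster_ms, with_cluster_ms)
print(f"{improvement:.1f}% lower p95 latency")
```

Under that assumption the reduction works out to about 75.6%, in line with the 75% quoted on the slide.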
12. Managed real-time messaging: Google Cloud Endpoints
Google Cloud Endpoints helps to protect and monitor our APIs:
Authentication
Rate control
Monitoring
Diagram labels: Clients (Android, iOS, Web), Cloud Endpoints, Events API, Kubernetes Engine
13. Managed real-time messaging: Google Cloud Pub/Sub
Cloud Pub/Sub delivers each event to every subscription at least once.
Diagram labels: Publisher, Message, Topic, Cloud Pub/Sub Subscription, Message (pull or push), Subscriber, Ack
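At-least-once delivery means a subscriber may see the same message more than once, so processing has to tolerate duplicates. The following is a minimal in-memory sketch of an idempotent subscriber, not the pipeline's actual code: it does not use the Pub/Sub client library, and the `Message` and `IdempotentSubscriber` names are hypothetical. A real deployment would need a durable deduplication store instead of a Python set.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str  # unique ID per published message (assumed stable across redeliveries)
    data: dict


@dataclass
class IdempotentSubscriber:
    seen_ids: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def handle(self, message: Message) -> bool:
        """Process a message at most once; return True to signal an ack."""
        if message.message_id not in self.seen_ids:
            self.seen_ids.add(message.message_id)
            self.processed.append(message.data)
        # Duplicates are acked too, so they are not redelivered again.
        return True


subscriber = IdempotentSubscriber()
deliveries = [
    Message("m1", {"event": "login"}),
    Message("m2", {"event": "purchase"}),
    Message("m1", {"event": "login"}),  # simulated redelivery of m1
]
acks = [subscriber.handle(m) for m in deliveries]
print(len(subscriber.processed))  # 2: the redelivered message is processed only once
```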
20. The life of an event
Some events are immutable. Some data may change.
Diagram labels: Data ingest; Async messaging (Cloud Pub/Sub); Dataflow/Beam (Cloud Dataflow); Immutable data (BigQuery); Mutable events (Cloud Bigtable); Relocate, Dataflow/Beam (Cloud Dataflow); Mutable data
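The split between immutable and mutable data can be sketched as a routing step. This example reads the slide's labels as pairing immutable data with BigQuery and mutable events with Cloud Bigtable; the event types, the `type` field, and the function names are hypothetical, and the sinks are plain lists rather than real storage clients.

```python
from typing import Dict, List

# Hypothetical event types whose records may be updated after ingestion.
MUTABLE_EVENT_TYPES = {"user_profile", "subscription_status"}


def choose_sink(event: dict) -> str:
    """Return the name of the store an event would be written to."""
    if event.get("type") in MUTABLE_EVENT_TYPES:
        return "bigtable"  # mutable events: keyed store that supports updates
    return "bigquery"      # immutable events: append-only analytics store


def route(events: List[dict]) -> Dict[str, List[dict]]:
    """Group events by destination sink."""
    sinks: Dict[str, List[dict]] = {"bigquery": [], "bigtable": []}
    for event in events:
        sinks[choose_sink(event)].append(event)
    return sinks


routed = route([
    {"type": "page_view", "user_id": 1},
    {"type": "user_profile", "user_id": 1, "plan": "pro"},
    {"type": "click", "user_id": 2},
])
print({name: len(batch) for name, batch in routed.items()})
```

In a Beam pipeline the same decision would typically be expressed as a branching transform; the plain-Python version only illustrates the routing rule.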
28. Open sourcing Banias
Opinionated serverless event analytics pipeline
github.com/doitintl/banias
Deployable in just 1 hour
Elastic schemas
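One way to picture "elastic schemas" is a schema that grows when an event carries a field it has not seen before. The sketch below is an assumption about the general idea, not Banias's actual implementation: `infer_type` and `evolve_schema` are hypothetical helpers, and the schema is a plain dict mapping column names to BigQuery-style type names.

```python
def infer_type(value) -> str:
    """Map a JSON value to a column type name."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def evolve_schema(schema: dict, event: dict) -> list:
    """Add a column for every unseen field; return the names that were added."""
    added = []
    for name, value in event.items():
        if name not in schema:
            schema[name] = infer_type(value)
            added.append(name)
    return added


schema = {"event_name": "STRING", "user_id": "INTEGER"}
added = evolve_schema(
    schema, {"event_name": "purchase", "user_id": 7, "amount": 9.99}
)
print(added, schema["amount"])
```

Existing columns are never altered here, so the sketch only covers additive changes; type conflicts and nested fields would need extra handling.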
29. References
Suggested reading
Building a Mobile Gaming Analytics Platform - a Reference Architecture
How to handle mutating JSON schemas in a streaming pipeline
Google Cloud Analytics with reoptimize.io
github.com/doitintl/banias
blog.doit-intl.com