3. Managed Data Services - Focus on Insight vs Infrastructure
PB+ Scale, No-Ops, Batch & Streaming of Data
Insights/Analytics, Resource Provisioning, Performance Tuning, Monitoring, Reliability, Deployment & Configuration, Handling Growing Scale, Utilization Improvements
Insights/Analytics
4. 15 Years of Tackling Big Data Problems
Timeline: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015, 2016
Google Papers: GFS, MapReduce, BigTable, Dremel, Flume Java, Spanner, Millwheel, Dataflow, Tensorflow
Open Source
Google Cloud Products
6. “Google is living a few years in the future and sending the rest of us messages.”
Doug Cutting, Hadoop Co-Creator
7. 15 Years of Tackling Big Data Problems
Timeline: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015, 2016
Google Papers: GFS, MapReduce, Flume Java, BigTable, Dremel, Spanner, Millwheel, Tensorflow, Dataflow
Open Source
Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable, ML
8. Serverless platform, auto-optimized usage, across the entire data lifecycle
Stages: Capture, Store, Process (Stream, Batch), Analyze, Use
Services: Google Analytics Premium, Cloud Pub/Sub, Google Stackdriver, Firebase, Storage Transfer Service, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Dataflow, Cloud Dataproc, BigQuery Analytics, Cloud Datalab, Cloud ML
Use: Data Scientists, Business Analysts, Real-time analytics, Real-time dashboard, Real-time alerts
14. Cloud Dataflow
Serverless, autoscaling, pay for what you use.
SOURCE, SINK
Compute and Storage: Unbounded, Bounded
Resource Management: Resource Auto-scaler, Dynamic Work Rebalancer, Work Scheduler
Monitoring, Log Collection, Graph Optimization, Auto-Healing, Intelligent Watermarking
15. “We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark.”
Paul Clarke, Director of Technology, Ocado
16. “From traditional batch processing to rock-solid event delivery to the nearly magical abilities of BigQuery, building on Google’s data infrastructure provides us with a significant advantage where it matters the most.”
Nicholas Harteau, VP of Engineering and Infrastructure
18. Cloud Dataproc
Ready-to-use Spark and Hadoop clusters in 90 seconds
● Native Spark and Hadoop: Run Spark and Hadoop applications out of the box without modification.
● Cloud Integrated: Integrated with Cloud Storage, Cloud Logging, Cloud Monitoring, and more.
● Minute-by-Minute Billing: While active, Dataproc clusters are billed minute-by-minute.
● Preemptible VMs: Dataproc clusters can make use of low-cost preemptible Compute Engine VMs.
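To make the billing points concrete, here is a minimal sketch of how per-minute billing and preemptible workers affect the cost of a short job. The prices, the rounding rule (runtime rounded up to whole minutes, no minimum charge), and all function names are assumptions chosen for illustration; they are not actual Google Cloud rates or APIs.

```python
import math

# Hypothetical prices in dollars per VM-hour; NOT actual Google Cloud rates.
STANDARD_VM_HOURLY = 0.10
PREEMPTIBLE_VM_HOURLY = 0.03


def billed_minutes(runtime_seconds):
    """Round the runtime up to whole minutes (assumed rounding rule)."""
    return math.ceil(runtime_seconds / 60)


def hourly_rate(standard_vms, preemptible_vms):
    """Combined hourly rate of all VMs in the cluster."""
    return (standard_vms * STANDARD_VM_HOURLY
            + preemptible_vms * PREEMPTIBLE_VM_HOURLY)


def cost_per_minute_billing(runtime_seconds, standard_vms, preemptible_vms):
    """Estimated cost of one cluster run when billed minute-by-minute."""
    rate = hourly_rate(standard_vms, preemptible_vms)
    return billed_minutes(runtime_seconds) * rate / 60


def cost_per_hour_billing(runtime_seconds, standard_vms, preemptible_vms):
    """The same run if usage were rounded up to whole hours, for comparison."""
    rate = hourly_rate(standard_vms, preemptible_vms)
    return math.ceil(runtime_seconds / 3600) * rate


if __name__ == "__main__":
    runtime = 25 * 60 + 30  # a job that runs for 25 minutes 30 seconds
    print("per-minute billing:", cost_per_minute_billing(runtime, 4, 6))
    print("per-hour billing:  ", cost_per_hour_billing(runtime, 4, 6))
```

With these assumed numbers, a job that finishes in under half an hour is charged for 26 minutes rather than a full hour, and swapping standard workers for preemptible ones lowers the hourly rate further.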
19. Cloud Dataproc
Demonstrably more cost-effective
Source: Michael Li & Ariel M'ndange-Pfupfu on O’Reilly: https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp
22. Overview:
Data to process: data in the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events
Challenges:
How to process the CAT and organize 100 billion market events into an “order lifecycle” in a 4-hour window
Store 6 years (~30 PB) of data
Cloud Bigtable to process and run queries and tolerate volume increases
6 billion market events written per hour
1.7 GB per second
6 TB per hour
Bursts: 10 billion written per hour
1.7 GB per second
10 TB per hour
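The figures on this slide can be cross-checked with simple unit conversions. The sketch below is illustrative only: the function names are made up for this example and decimal units (1 TB = 1000 GB) are assumed. It shows that 1.7 GB per second is about 6.1 TB per hour, in line with the 6 TB figure, and that fitting 100 billion events into a 4-hour window means handling 25 billion events per hour.

```python
SECONDS_PER_HOUR = 3600
GB_PER_TB = 1000  # decimal units assumed; the slide does not say which it uses


def events_per_hour_needed(total_events, window_hours):
    """Average hourly rate needed to finish all events within the window."""
    return total_events / window_hours


def events_per_second(events_per_hour):
    """Convert an hourly event rate to events per second."""
    return events_per_hour / SECONDS_PER_HOUR


def tb_per_hour(gb_per_second):
    """Convert a throughput in GB/s to TB/hour."""
    return gb_per_second * SECONDS_PER_HOUR / GB_PER_TB


def pb_per_year(total_pb, years):
    """Average yearly data growth implied by a retention target."""
    return total_pb / years


if __name__ == "__main__":
    print("events/hour for the 4-hour window:",
          events_per_hour_needed(100_000_000_000, 4))
    print("events/second at 6 billion per hour:",
          round(events_per_second(6_000_000_000)))
    print("TB/hour at 1.7 GB/s:", tb_per_hour(1.7))
    print("PB/year for ~30 PB over 6 years:", pb_per_year(30, 6))
```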
24. – Mattias P Johansson, Software Engineer, Spotify
“With Google Cloud Platform, we benefitted by having a virtual supercomputer on demand, without having to deal with all the usual space, power, cooling and networking issues. Just a few years ago, we would have needed to use the largest supercomputers on the planet to do what we’re now able to do with Google.”
– Mark Johnson, CEO, Descartes Labs
“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery.”
– Gary Sanders, Head of Digital Analytics, Lloyds Banking Group
“Everyone involved unanimously picked GCP. It came down to this: we believe the core technology is better.”
– Peter Bakkam, Platform Lead, Quizlet
Do you feel this way about your Data Warehouse?
25. BigQuery
Now with full support for Enterprise Data Warehouse
● Flat-rate Pricing
● Standard SQL
● ODBC Connector
● DML (Beta)
● Identity and Access Management
26. Use your own data to train models
Services: Cloud Storage, Google BigQuery, Cloud Dataflow, Cloud Datalab, Cloud Machine Learning
Workflow: Develop/Model/Test, Train, Predict
Status labels: BETA, BETA, GA, GA, GA
27. Cloud Spanner
Features:
● Fully Managed NewSQL database service with relational semantics and global consistency
● Global replication options for low-latency reads across the globe
● Consistent transactions across millions of rows
Use Cases:
● Large application workloads with very high write volumes or large datasets (3 TB+)
● Geographically distributed control planes with global consistency guarantees
Pricing:
● Pay per hour of node compute for throughput
● Pay per GB/month of data stored
Google Cloud Platform
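The two pricing dimensions above (node-hours plus stored GB per month) can be combined into a rough monthly estimate. This is a sketch under stated assumptions: the prices passed in are placeholders rather than published Cloud Spanner rates, 730 hours is used as an approximate month length, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # approximate average month length, an assumption


def spanner_monthly_cost(nodes, stored_gb, price_per_node_hour,
                         price_per_gb_month, hours=HOURS_PER_MONTH):
    """Rough monthly estimate: node compute plus storage."""
    compute = nodes * hours * price_per_node_hour
    storage = stored_gb * price_per_gb_month
    return compute + storage


if __name__ == "__main__":
    # Placeholder prices for illustration only; check current pricing.
    estimate = spanner_monthly_cost(nodes=3, stored_gb=3000,
                                    price_per_node_hour=1.0,
                                    price_per_gb_month=0.5)
    print("estimated monthly cost:", estimate)
```

The example uses 3000 GB to match the 3 TB+ dataset size mentioned in the use cases; compute scales with node count and hours, while storage scales with the data kept.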
Editor's notes
a lot of info in pictures
23 billion words in Wikipedia
40 billion textual lines in StreetView
Make the point that we didn’t realize how valuable the pictures were originally and later we revisited and extracted all this additional value.
Willing to make a bet all the audience have similarly valuable data.
What really tipped the scales towards Google for Spotify was their experience with Google’s data platform and tools. “Good infrastructure isn’t just about keeping things up and running, it’s about making all of our teams more efficient and more effective, and Google’s data stack does that for us in spades. From traditional batch processing with Dataproc, to rock-solid event delivery with Pub/Sub, to the nearly magical abilities of BigQuery, building on Google’s data infrastructure provides us with a significant advantage where it matters the most.”
We have a large and complex backend, so this is a large and complex project that will take us some time to complete. We’re looking forward to sharing our experiences with you as we go, so watch our engineering blog for more information on what we learn, build and break along the way. We’re pretty excited about our Googley future and hope you’ll find it interesting too.