Introduction to Google BigQuery. Slides used at the first GDG Cloud meetup in Brussels, about big data on Google Cloud Platform. (http://www.meetup.com/GDG-Cloud-Belgium/events/228206131)
2. About myself
Matthias Feys
work @Datatonic:
- big data (with Google Cloud)
- machine learning
- data visualizations (Tableau/Spotfire)
Google Qualified Cloud Developer
contact:
- @FsMatt
- matthias@datatonic.com
3. About Datatonic
Datatonic is a team of data science experts that help corporations unleash
the power of data. They use Google Cloud Platform, data visualisation
technologies (like Tableau or Spotfire) and machine learning to build
breakthrough solutions. Either as expert advisors, or built as a fully
managed solution with support end-to-end (A3S).
Some references:
@teamdatatonic
4. ● What is BigQuery?
● How does it scale/work?
● How does it compare to:
- NoSQL datastores
- MapReduce
● Demo
● Pricing Model
● Best Practices
This talk
5. What is BigQuery?
“BigQuery is a fully-managed
and cloud-based interactive
query service for
massive datasets.”
It’s the externalization
of Dremel, one of
Google’s core
technologies
6. What is BigQuery? (2)
BigQuery Service is available via:
● Web UI (bigquery.cloud.google.com)
● console (gcloud)
● API (+ client libraries)
● external tools (Tableau, Excel, …)
● ODBC connector
8. Dremel Architecture
Data Model/Storage:
- Columnar Storage
- Nested/Repeated Fields
- No Index!
-> Single Full Table Scan (from disk)
Query Execution:
- Tree Architecture
- Using tens of thousands
machines over fast Google
network (+1Petabit/s)
9. Columnar Storage
● Traffic minimization:
○ only read selected
columns
● Higher Compression Ratio:
○ Similar values in the
same column
○ From 1:3 → 1:10
10. Tree Architecture
- root server:
->receives query + reads table metadata
->rewrites the query(s)
->sends queries to the next level
<-returns final query results
- intermediate servers:
->(similar steps)
<-parallel partial aggregation
- leaf servers:
->actually scan (parts) of the table
<-send data to intermediate servers
11. NoSQL Datastore vs. BigQuery?
NoSQL Datastore
● Index based
(expected queries)
● Read-write
BigQuery
● Non-index based
(ad hoc queries)
● Read-only (append-
only)
12. MapReduce vs. BigQuery?
MapReduce
● High latency
● Flexible (complex)
batch processing
● Unstructured data
BigQuery
● Low latency
● SQL-like queries
● Structured data
17. Denormalize / Pre-Join Where Possible
● Best performance
● Only pay for the columns you need
● Nested/repeated fields!
Relational Database Design Denormalized Nested/Repeated (JSON)
18. Table Sharding
● You pay for what you read
→ Read less, pay less
● Table wildcards allow for easy reading over multiple tables
https://cloud.google.com/bigquery/query-reference#tablewildcardfunctions
19. Optimize for Query vs. Storage Costs
Common Queries?
- Materialized views
(save intermediate results in tables)
with pre-aggregated data:
→ faster + cheaper queries
- Store data in multiple tables:
- table for daily data
- table for weekly data
- table for monthly data
20. Narrow the Table Scans
You only pay for the columns you read
Don’t use “SELECT *” !!!
21. Table Decorators
Only way to avoid doing full table scans!
Allows undeleting tables
options:
● snapshot decorator + range decorator
● relative value + absolute values
https://cloud.google.com/bigquery/table-decorators