You might be paying too much
for BigQuery
Ryuji Tamagawa @ Osaka, Japan
Agenda
about me
about BigQuery
Basics
Advanced
Tips & Tricks
Agenda
about me
about BigQuery
Basics
Advanced
Tips & Tricks
What you pay for when
using BigQuery
How BigQuery runs your
queries
Agenda
about me
about BigQuery
Basics
Advanced
Tips & Tricks
Selecting columns
Table decorators
Dividing tables
How query cache works
Agenda
about me
about BigQuery
Basics
Advanced
Tips & Tricks
CPUs & Network are
FOR FREE
Subqueries optimized
Repeated Fields
About me
Software engineer working for an
ISV, from architecture design to
troubleshooting in the field
Translator working with O’Reilly
Japan
‘Google BigQuery Analytics’ is
my 25th book
Active in GCPUG, especially in
#bq_sushi
A bed for 6 cats
About BigQuery (for those who don’t know yet)
Fully managed structured data
store, queryable with SQL
Very easy to use
Fast: almost no slowdown
even with Big Data
Cost-effective
Built on Google’s
infrastructure components
About BigQuery (for those who don’t know yet)
Basic operations are available
within Web UI
You can ‘dry run’ from the Web
UI to check the amount of
data to be scanned (see the sketch after this list)
You can use the command line
interface (bq) to integrate
BigQuery into your workflow
APIs are provided for Python
and Java
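A minimal sketch of the same dry run from the bq CLI (the table name is hypothetical):

# Validate the query and report the bytes it would scan, without running it
bq query --dry_run 'SELECT C1, C3 FROM sample.T20150501'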
BigQuery is for analytics
Essentially, the data model is the same as in relational databases,
with some extensions
BigQuery is for analytics, not for transaction processing
You can insert rows (batch or streaming), but cannot update
or delete them
There’s no index - tables are always read by ‘full scan’
You can insert rows from GCS or via HTTP in CSV or JSON
format (see the sketch below)
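As a hedged sketch, a batch load of a CSV file from GCS might look like this (the bucket, table, and schema are hypothetical):

# Load a CSV file from Google Cloud Storage into a table, appending rows
bq load --source_format=CSV sample.logs gs://my-bucket/logs-20150508.csv ts:TIMESTAMP,type:INTEGER,value:STRING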
Basics
You might be paying too much for BigQuery
What you pay for
Storage - $0.020 per GB / month
Queries - $5 per TB processed (scanned)
Streaming inserts - $0.01 per 100,000 rows until July
20, 2015. After July 20, 2015, $0.01 per 200 MB, with
individual rows calculated using a 1 KB minimum size.
What matters is the Storage
A simple example
Load 1 TB of data to a new table every day; keep each table for
a month
Query the daily data 5 times every day for aggregation
For storage:
30 tables × 1,000 GB × $0.020 = $600 / month
For queries:
1 TB × 5 queries × 30 days × $5/TB = $750 / month
How your data is stored
Your data is stored
1. on thousands of disks (depending on its size)
2. in columnar format (ColumnIO or similar)
3. compressed
(However, the cost is based on the uncompressed size)
How BigQuery runs your query
Requested data is read from the DFS and sent to compute nodes
Compute nodes (could be thousands) form a processing tree on the fly
Results are written back to DFS as a table (anonymous or named)
[Diagram: distributed file storage layer (tables) feeding many compute nodes arranged in a tree, with the results written back]
How BigQuery runs your query
When doing a JOIN between large tables or a GROUP BY on a large dataset,
keys need to be hashed, and the associated data is sent to nodes depending
on the hash value, for in-memory join or grouping.
[Diagram: compute nodes over the distributed file storage layer (tables), exchanging data by hash value across a ‘Shuffle’ stage before producing the results]
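A hedged legacy-SQL sketch of a query that needs such a shuffle, using the deck’s sample table names; the EACH keyword (covered later in Tips & Tricks) asks BigQuery to hash-partition the keys across nodes:

-- Large-table join: both sides are hashed on the join key and
-- redistributed across compute nodes before the in-memory join
SELECT l.id, s.desc
FROM samples.log l
JOIN EACH samples.subTable1 s
ON l.value = s.subid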
Advanced
You might be paying too much for BigQuery
Narrowing your scan is the key
BigQuery does not have
indexes - it always does a full scan
BigQuery uses columnar
storage; selecting only the
columns you need lowers
the cost
[Illustration: a table with columns C1-C4 and rows R1-R8]
Narrowing your scan is the key
SELECT C1, C2, C3, C4
FROM t
You pay for every scanned
cell (red-filled)
[Illustration: every cell in columns C1-C4 is marked ‘Scanned’]
Narrowing your scan is the key
SELECT C1, C3 FROM t
You’ll pay only for C1 & C3
[Illustration: only the cells in columns C1 and C3 are marked ‘Scanned’]
You shouldn’t ‘SELECT *’
unintentionally
Narrowing your scan is the key
BigQuery’s tables can have
virtually any number of rows
Watch out: all those rows will
be scanned, no matter what
‘WHERE’ clause you use in your queries
There are 2 ways to work
around this:
table decorators
dividing tables
[Illustration: a table with rows R1 through R99999999994]
Narrowing your scan is the key
Snapshot decorators:
you can limit your scan within a snapshot
of the table at a given time
SELECT … FROM t@1430665200000
Time-range decorators:
you can limit your scan between a given
time range
SELECT … FROM t@-1430751600000
You can pass any time within the last 7 days
[Illustration: a table where rows were added on 4/1, 5/3, 5/5, and 5/8]
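A minimal sketch, assuming the decorator times are Unix epoch milliseconds as in the examples above:

-- Snapshot: read the table as it was at the given time
SELECT C1, C3 FROM t@1430665200000
-- Time range: read only the data added between the two times
SELECT C1, C3 FROM t@1430665200000-1430751600000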
Table arrangement (estimated)
Each batch insert creates a
‘block’
Recently inserted blocks (within
the last 7 days) are kept separate from
the ‘main’ block of the table
Blocks older than 7 days are merged
into the ‘main’ block of the table
Streaming-inserted rows are not
stored in blocks but in BigTable
[Illustration: the table as of 5/8, with separate blocks for rows added on 5/3, 5/5, and 5/8]
This is my estimate: as far as I know, Google hasn’t
officially said anything about this.
[Main block | block of 5/3 | block of 5/5 | block of 5/8]
As of 2015/5/8
Table arrangement (estimated)
Each batch insert creates a
‘block’
Recently inserted blocks (within
the last 7 days) are kept separate from
the ‘main’ block of the table
Blocks older than 7 days are merged
into the ‘main’ block of the table
Streaming-inserted rows are not
stored in blocks but in BigTable
[Illustration: the same table as of 5/11; the block of 5/3 has now been merged into the main block]
[Main block | block of 5/5 | block of 5/8]
As of 2015/5/11
If you focus on the last 7 days,
decorators are very useful for saving
costs
Narrowing your scan is the key
Tables are often split by date in BigQuery
You can easily union them within the FROM
clause, separated with commas (a BQ-specific
notation; see the sketch after this slide)
The TABLE_DATE_RANGE function is
useful, e.g.:
SELECT … FROM
(TABLE_DATE_RANGE(sample.T,
  TIMESTAMP('2015-05-01'),
  TIMESTAMP('2015-05-10')))
[Illustration: daily tables T20150401, T20150501, and T20150510, each with columns C1-C3 and the time each row was added]
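The comma notation mentioned above looks like the following minimal sketch; in legacy BigQuery SQL, a comma in the FROM clause means UNION ALL, not a join:

-- Scans only the three daily tables, not the whole history
SELECT C1, C2 FROM sample.T20150501, sample.T20150502, sample.T20150503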
Narrowing your scan is the key
With a traditional RDB, you usually
don’t split tables like this: expensive
‘Enterprise’ editions support similar
features, but they cost you time in
design, operation, and maintenance
In BigQuery, splitting tables like this
sometimes even makes your queries
faster
The difference is architectural:
BigQuery is designed from the
ground up to read and process data
from many disks with many
compute nodes
[Illustration: daily tables T20150401, T20150503, and T20150503-1]
Narrowing your scan is the key
[Diagram: daily tables T20150501 through T20150510 on the DFS layer, read by many compute nodes that produce the results]
Actually, any single table is stored on many disks,
and the data from a table is read by many nodes.
Using query cache
The result of a query is written to an anonymous dataset,
with a name generated from the names of the queried tables,
their last update timestamps, and the query text.
When a query is executed, BigQuery first checks whether a
cached result exists.
If the query returns the cached result, it costs
nothing.
Query cache is free
Applications like dashboards, which run almost the same
queries again and again, can save costs by utilizing the
query cache
You could write code that saves query results somewhere
for later use to avoid re-running the same query, but
often you don’t have to bother - the query
cache does the same for you
Query cache is enabled when:
The query is deterministic (e.g. without NOW() )
The table does NOT have a streaming buffer
The result of the query was not saved to a named table
Note that a large result (>128 MB) cannot be cached,
because in that case you have to specify
‘allowLargeResults’, and the result must then be saved
to a named table.
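A hedged sketch of working with the cache from the bq CLI (the job ID is hypothetical; treat the flag names as a sketch):

# Bypass the cache and force a fresh (billed) run
bq query --nouse_cache 'SELECT COUNT(*) FROM sample.T20150501'
# Inspect a finished job; statistics.query.cacheHit shows whether the cache served it
bq show --format=prettyjson -j job_20150508_1234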
Tips & Tricks
You might be paying too much for BigQuery
Trade offs - time & cost
Generally, normalizing your data model:
makes the data smaller, which in BigQuery means
you pay less
but may use more CPU time and network traffic,
especially when you run complex queries between
large tables
Trade offs - time & cost
On-premise, you tend to think of ‘cost’ in terms of CPU,
network, and storage
When using BigQuery:
You don’t pay money for CPU or network
Queries that consume a lot of CPU and/or network take
time to run - e.g. queries using the EACH keyword
If you don’t have to run queries interactively, they can
run in batch mode at less cost, with an ‘appropriate’
schema (see the sketch below).
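A minimal sketch of submitting a batch-priority query with the bq CLI:

# Batch priority: the query may queue until resources are available
bq query --batch 'SELECT type, COUNT(*) FROM samples.log GROUP EACH BY type'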
[Diagram: compute nodes and a shuffle stage over the distributed file storage layer (tables)]
[Diagram: the same compute-node and shuffle stage, labeled FREE (免費)]
Subqueries optimized
For example, if you have several types of log entries in one
table and you want to join them with different tables
depending on the type, you don’t have to worry
SELECT id, desc FROM
  (SELECT l.id AS id, s1.desc AS desc
   FROM samples.log l
   JOIN samples.subTable1 s1
   ON l.value = s1.subid
   WHERE l.type = 0) t1,
  (SELECT l.id AS id, s2.desc AS desc
   FROM samples.log l
   JOIN samples.subTable2 s2
   ON l.value = s2.subid
   WHERE l.type = 1) t2
This query scans the log table only once
[Diagram: compute nodes reading subTable1, Log, and subTable2 from the DFS layer]
Repeated fields
You can store array-like data in a row
This is not a standard SQL feature
It’s like a ‘materialized view’ or a pre-joined table - it can be compact to
store and fast to query
You should have a good understanding of the logic, or you will get
unexpected results
Do not use overly complex schemas (e.g. deeply nested repeated fields)
The functions for repeated fields are useful, but watch out for
combinatorial explosion (e.g. FLATTEN); see the sketch below
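A hedged legacy-SQL sketch of querying a repeated field (the persons/children schema is hypothetical):

-- children is a repeated record; FLATTEN emits one row per
-- (person, child) pair, so output row counts can explode combinatorially
SELECT fullName, children.name
FROM FLATTEN([sample.persons], children)
WHERE children.age > 10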
Thank you for listening.
Questions?