The 'macro view' of BigQuery:
We start with an overview and some typical uses, then move on to project hierarchy, access control and security.
At the end we touch on tools and demos.
BigQuery Basics
Topics we cover in this lesson
● BigQuery Overview
● Typical Uses
● Project Hierarchy
● Access Control and Security
● Datasets and Tables
● Tools
● Demos
How does BigQuery fit in the analytics landscape?
● MapReduce based analysis can be slow for ad-hoc queries
● Managing data centers and tuning software takes time & money
● Analytics tools should be services
Why BigQuery?
● Generating big data reports requires expensive servers and skilled database administrators
● Interacting with big data has been expensive, slow and inefficient
● BigQuery changes all that
○ Reducing time and expense to query data
What's BigQuery?
● Service for interactive analysis of massive datasets (TBs)
○ Query billions of rows: seconds to write, seconds to return
○ Uses a SQL-style query syntax
○ It's a service, accessed by a RESTful API
● Reliable and secure
○ Replicated across multiple sites
○ Secured through Access Control Lists
● Scalable
○ Store hundreds of terabytes
○ Pay only for what you use
● Fast (really)
○ Run ad hoc queries on multi-terabyte data sets in seconds
Typical Uses
Another way to analyze query results with Google Spreadsheets
○ greenido.wordpress.com/2013/12/16/big-query-and-google-spreadsheet-intergration/
○ greenido.wordpress.com/2013/07/24/big-query-power-with-javascript/
BigQuery Use Cases
● Log Analysis. Making sense of computer generated records
● Retailer. Using data to forecast product sales
● Ads Targeting. Targeting proper customer sections
● Sensor Data. Collect and visualize ambient data
● Data Mashup. Query terabytes of heterogeneous data
Some Customer Case Studies
● Uses BigQuery to hone ad targeting and gain insights into their business
● Dashboards using BigQuery to analyze booking and inventory data
● Use BigQuery to provide their customers ways to expand game engagement and find new channels for monetization
● Used BigQuery, App Engine and the Visualization API to build a business intelligence solution
Project Hierarchy
● Project. All data in BigQuery belongs inside a project
○ Set of users, APIs, authentication, billing information
● Dataset. Holds one or more tables
○ Lowest access control unit (to which ACLs are applied)
● Table. Row-column structure that contains actual data
● Job. Used to start potentially long running queries
Datasets and Tables
Table name is represented as follows:
● Current Project
<dataset>.<table name>
● Different Project
<project>:<dataset>.<table>
e.g. publicdata:samples.wikipedia
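The naming rules above can be sketched as a small helper. This is only an illustration in Python; the function name `qualified_table_name` is hypothetical and not part of any BigQuery library.

```python
from typing import Optional


def qualified_table_name(dataset: str, table: str, project: Optional[str] = None) -> str:
    """Build a table name in the <project>:<dataset>.<table> form.

    When project is None the table is assumed to live in the current
    project, so only <dataset>.<table> is returned.
    """
    name = f"{dataset}.{table}"
    if project is not None:
        name = f"{project}:{name}"
    return name


# Table in the current project.
current = qualified_table_name("samples", "wikipedia")
# Table in a different project (the example from the slide).
other = qualified_table_name("samples", "wikipedia", project="publicdata")
print(current)
print(other)
```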
Data Types
● String
○ UTF-8 encoded, <64kB
● Integer
○ 64 bit signed
● Float
● Boolean
○ "true" or "false", case insensitive
● Timestamp
○ String format
■ YYYY-MM-DD HH:MM:SS[.sssss] [+/-][HH:MM]
○ Numeric format (seconds from UNIX epoch)
■ 1234567890, 1.234567890123456E9
(*) Max row size: 64kB
Date type is supported as timestamp
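To make the two timestamp formats concrete, here is a minimal Python sketch that reads both into the same kind of value. The helper names are hypothetical, and the string parser assumes UTC with no fractional seconds or offset, which is narrower than the full format listed above.

```python
from datetime import datetime, timezone


def parse_numeric_timestamp(value: str) -> datetime:
    """Interpret a number of seconds since the UNIX epoch as a UTC datetime."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def parse_string_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' and treat it as UTC (assumption: no offset given)."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# The two numeric examples from the slide.
a = parse_numeric_timestamp("1234567890")
b = parse_numeric_timestamp("1.234567890123456E9")
# The same instant written in the string format.
c = parse_string_timestamp("2009-02-13 23:31:30")
print(a.isoformat())
print(a == c)
```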
Data Format
BigQuery supports the following formats for loading data:
1. Comma Separated Values (CSV)
2. JSON
a. BigQuery can load data faster if your data contains embedded newlines.
b. Supports nested/repeated data fields
Repeated and Nested Fields
Loading data with repeated and nested fields is supported by JSON data format only

Schema example
[
  {
    "fields": [
      {
        "mode": "nullable",
        "name": "country",
        "type": "string"
      },
      {
        "mode": "nullable",
        "name": "city",
        "type": "string"
      }
    ],
    "mode": "repeated",
    "name": "location",
    "type": "record"
  },
  ...........
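As a rough illustration of how such a schema pairs with data, the Python sketch below builds only the `location` part of the schema and one matching row. The row values are made up, and the remaining fields elided on the slide are not reconstructed here.

```python
import json

# Schema fragment mirroring the slide: a repeated record "location"
# containing two nullable string fields.
schema = [
    {
        "name": "location",
        "type": "record",
        "mode": "repeated",
        "fields": [
            {"name": "country", "type": "string", "mode": "nullable"},
            {"name": "city", "type": "string", "mode": "nullable"},
        ],
    }
]

# One hypothetical row: a repeated field holds a list of nested records.
row = {
    "location": [
        {"country": "US", "city": "San Francisco"},
        {"country": "JP", "city": None},  # nullable field left empty
    ]
}

# Newline-delimited JSON puts exactly one JSON object on each line.
ndjson_line = json.dumps(row)
print(json.dumps(schema, indent=2))
print(ndjson_line)
```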
Accessing BigQuery
● BigQuery Web browser
○ Imports/exports data, runs queries
● bq command line tool
○ Performs operations from the command line
● Service API
○ RESTful API to access BigQuery programmatically
○ Requires authorization by OAuth2
○ Google client libraries for Python, Java, JavaScript, PHP, ...
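For the Service API route, a request could look roughly like the sketch below. It only prepares an HTTP request against the v2 `queries` endpoint using the Python standard library and does not send it. The project ID and access token are placeholders, and obtaining a real OAuth2 token is out of scope here.

```python
import json
import urllib.request

PROJECT_ID = "my-project"          # hypothetical project ID
ACCESS_TOKEN = "ya29.placeholder"  # hypothetical OAuth2 access token


def build_query_request(project_id: str, token: str, sql: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the BigQuery v2 queries endpoint."""
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_query_request(
    PROJECT_ID, ACCESS_TOKEN, "SELECT title FROM [publicdata:samples.wikipedia] LIMIT 5"
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)  (needs a real token)
```

In practice the Google client libraries wrap these calls, so hand-built requests like this are mainly useful for understanding what goes over the wire.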
Example of Visualization Tools
Using commercial visualization tools to graph the query results
Loading Data Using the Web Browser
● Upload from local disk or from Cloud Storage
● Start the Web browser
● Select Dataset
● Create table and follow the wizard steps
Loading Data Using bq Tool
"bq load" command
Syntax
bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema
● If not specified, the default file format is CSV (comma separated values)
● The files can also use newline delimited JSON format
● Schema
○ Either a filename or a comma-separated list of column_name:datatype pairs that describe the file format.
● Data source may be on local machine or on Cloud Storage
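The syntax above can be assembled programmatically, for example when loading many files from a script. The Python sketch below builds the argument list without running it. The table name, bucket URI and column names are hypothetical, and only the `--source_format` flag shown on the slide is used.

```python
import shlex
from typing import List


def bq_load_command(destination_table: str, data_source_uri: str,
                    table_schema: str, source_format: str = "CSV") -> List[str]:
    """Assemble the argument list for 'bq load' following the slide's syntax."""
    if source_format not in ("CSV", "NEWLINE_DELIMITED_JSON"):
        raise ValueError(f"unsupported source format: {source_format}")
    return [
        "bq", "load",
        f"--source_format={source_format}",
        destination_table,
        data_source_uri,
        table_schema,
    ]


cmd = bq_load_command(
    "mydataset.earthquakes",                        # hypothetical destination table
    "gs://my-bucket/earthquakes.json",              # hypothetical Cloud Storage URI
    "place:string,magnitude:float,time:timestamp",  # inline column_name:datatype pairs
    source_format="NEWLINE_DELIMITED_JSON",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)  (requires the bq tool)
```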
Load Limitations
● 1,000 import jobs per table per day
● 10,000 import jobs per project per day
● File size (for both CSV and JSON)
○ 1GB for compressed file
○ 1TB for uncompressed
■ 4GB for uncompressed CSV with newlines in strings
● 10,000 files per import job
● 1TB per import job
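A pre-flight check against some of these limits might look like the following Python sketch. It covers only the per-file and per-job size limits and the file count; the daily job quotas and the 4GB case for CSV with embedded newlines are left out. The function name is hypothetical, and decimal units (1GB = 10^9 bytes) are an assumption, since the slide does not say which definition applies.

```python
from typing import List

GB = 10**9   # assumption: decimal units; the slide does not specify
TB = 10**12

MAX_FILES_PER_JOB = 10_000
MAX_BYTES_PER_JOB = 1 * TB
MAX_COMPRESSED_FILE = 1 * GB
MAX_UNCOMPRESSED_FILE = 1 * TB


def check_import_job(file_sizes: List[int], compressed: bool) -> List[str]:
    """Return a list of limit violations for one import job (empty if none)."""
    problems = []
    if len(file_sizes) > MAX_FILES_PER_JOB:
        problems.append("more than 10,000 files in one import job")
    if sum(file_sizes) > MAX_BYTES_PER_JOB:
        problems.append("more than 1TB in one import job")
    per_file_limit = MAX_COMPRESSED_FILE if compressed else MAX_UNCOMPRESSED_FILE
    for index, size in enumerate(file_sizes):
        if size > per_file_limit:
            problems.append(f"file {index} exceeds the per-file limit")
    return problems


# A 2GB file is too large when compressed but fine when uncompressed.
print(check_import_job([2 * GB], compressed=True))
print(check_import_job([2 * GB], compressed=False))
```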
A Few Best Practices
CSV/JSON must be split into chunks less than 1TB
● "split" command with --line-bytes option
● Split to smaller files
○ Easier error recovery
○ Split into smaller data units (day or month instead of year)
● Uploading to Cloud Storage is recommended
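The idea behind splitting with `--line-bytes` is that chunks are capped by size but never cut a record in half. The Python sketch below shows that grouping logic on in-memory lines; it is an illustration of the technique, not a replacement for the `split` command, and it handles an over-long line differently, as noted in the docstring.

```python
from typing import Iterable, Iterator, List


def split_by_line_bytes(lines: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group whole lines into chunks of at most max_bytes each.

    A single line longer than max_bytes is emitted as its own chunk
    (simplification: GNU split would break such a line apart).
    """
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        # Start a new chunk if adding this line would exceed the cap.
        if chunk and size + len(line) > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


records = [b"a,1\n", b"bb,2\n", b"ccc,3\n", b"d,4\n"]
chunks = list(split_by_line_bytes(records, max_bytes=10))
for number, chunk in enumerate(chunks):
    print(number, b"".join(chunk))
```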
A Few Best Practices
● Split Tables by Dates
○ Minimize cost of data scanned
○ Minimize query time
● Upload Multiple Files to Cloud Storage
○ Allows parallel upload into BigQuery
● Denormalize your data
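Splitting tables by date means a query only needs to name the days it cares about. The Python sketch below generates one table name per day and joins them into a query. The `<prefix>_YYYYMMDD` naming pattern, the dataset name and the prefix are assumptions made for illustration; the slide does not prescribe a naming scheme.

```python
from datetime import date, timedelta
from typing import List


def daily_table_names(dataset: str, prefix: str, start: date, end: date) -> List[str]:
    """List one table name per day in [start, end], e.g. logs.events_20131230."""
    names = []
    day = start
    while day <= end:
        names.append(f"{dataset}.{prefix}_{day:%Y%m%d}")
        day += timedelta(days=1)
    return names


tables = daily_table_names("logs", "events", date(2013, 12, 30), date(2014, 1, 1))
# In the bracketed SQL dialect used in this lesson, a comma-separated
# table list in FROM is treated as a union of the tables.
query = "SELECT COUNT(*) FROM " + ", ".join(f"[{name}]" for name in tables)
print(query)
```

Only the three listed tables are scanned, which is where the cost and time savings come from.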
Exercise
Work through BigQuery Exercise 1 -- Basics
● Use the BigQuery UI
● Use the bq command line tool
● Upload a dataset
You will query the public sample GSOD (global summary of
day) weather dataset.
You will get and upload earthquake data.
Questions
● What are the different ways to load data into BigQuery?
● What is the maximum size of data in a BigQuery table?
● How can we import data into BigQuery?
○ What's the limitation?
○ What formats does BigQuery accept?
Google I/O Data Sensing
● Start the BigQuery Web browser
● Click on Display Project in the project chooser dialog window
● Enter data-sensing-lab when prompted
● In the dataset data-sensing-lab:io_sensor_data, select the table moscone_io13
● In the New Query box, enter the following query:
SELECT * FROM [data-sensing-lab:io_sensor_data.moscone_io13] LIMIT 10
● Click Run Query button
● Scroll to see relevant results
Data Structure
● Define table schema when creating table
● Data is stored in per-column structure
● Each column is handled separately and only combined when
necessary
Advantages of this data structure:
● No need to set index in advance
● Load only the relevant columns
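A toy model of per-column storage helps show why only the relevant columns need to be read. The Python sketch below is a conceptual illustration with made-up sensor data; it says nothing about how BigQuery actually lays out data internally.

```python
from typing import Dict, List

# Row-oriented view: each record keeps all of its fields together.
rows = [
    {"station": "A", "temp": 12.5, "humidity": 40},
    {"station": "B", "temp": 15.0, "humidity": 55},
    {"station": "C", "temp": 9.5, "humidity": 70},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Store each field as its own list, one list per column."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)
# A query such as "average temp" touches only the temp column;
# station and humidity are never read.
average_temp = sum(columns["temp"]) / len(columns["temp"])
print(average_temp)
```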