This document provides an overview of BigQuery and how to get started with it. BigQuery is a fully managed data warehouse offered by Google Cloud Platform. It offers a serverless, fast, and scalable solution for data analysis. The document discusses what BigQuery is, its key features and concepts like datasets, tables, and billing. It also covers best practices for query performance and cost optimization. The presenter provides their contact details and links to their website and YouTube channel for additional resources.
1. Getting started with BigQuery
Pradeep Bhadani
Founder, Cloud Native Technologies
cntek.io
pbhadani.com
linkedin.com/in/pradeepbhadani
linkedin.com/company/cloudnativetech
22nd August 2020, Google Next OnAir Extended
2. About Me
IT Consultant with 9 years of experience in Big Data, Cloud & DevOps
GDE (Google Developers Expert) - Cloud
Google Cloud Authorized Trainer
HashiCorp Ambassador
Blog: pbhadani.com
Cloud Native Technologies - cntek.io
3. Services
● Big Data Consultancy
● Cloud & DevOps Consultancy
● Tailored Training and Workshops
4. Agenda
● Overview
○ What is a Data Warehouse?
○ Choosing a Data Warehouse Option
● Introduction to BigQuery
○ What is BigQuery?
○ Why BigQuery?
○ Concepts
● Best Practices
● Interacting with BigQuery
● Demo
6. What is a Data Warehouse?
A data warehouse is a critical component of a Business Intelligence
solution that enables an organization to make better decisions.
A data warehouse offers:
● Scheduled and ad-hoc reporting
● Ad-hoc analysis
● Integration with visualization tools
10. What is BigQuery?
BigQuery is a fully managed, enterprise-grade, modern data warehouse
offering on Google Cloud Platform.
cloud.google.com/bigquery
11. Why BigQuery?
● Serverless
● Fast
● SQL
● Security
● Scalable
● Data Encryption
● Managed Storage
● Flexible Pricing
● Advanced Features
12. Advanced Features
● BigQuery ML
● BigQuery GIS
● BigQuery Omni (private alpha)
● Data QnA (private alpha)
18. GCP Project
A GCP Project is a top-level logical container that organizes Google Cloud
Platform resources such as Storage and BigQuery.
19. BigQuery Datasets
A dataset is a logical container that organizes BigQuery tables.
Diagram: a GCP Project containing Dataset A and Dataset B.
20. BigQuery Tables
A BigQuery table contains the data and the schema that describes the data.
A table is addressed as:
<project_id>.<dataset_id>.<table>
Diagram: a GCP Project containing Dataset A and Dataset B, each holding its own tables (Table 1, Table 2).
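The `<project_id>.<dataset_id>.<table>` form above can be assembled and split programmatically. The Python sketch below is illustrative only: the `TableRef` class and the example names (`my-project`, `sales`, `orders`) are hypothetical and not part of any BigQuery library, and it assumes that dataset and table names contain no dots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Hypothetical helper for the <project_id>.<dataset_id>.<table> form."""

    project_id: str
    dataset_id: str
    table: str

    def full_id(self) -> str:
        # Join the three parts in the order shown on the slide.
        return f"{self.project_id}.{self.dataset_id}.{self.table}"

    @classmethod
    def parse(cls, full_id: str) -> "TableRef":
        # Split from the right so a project ID that itself contains dots
        # still ends up whole in the first part.
        parts = full_id.rsplit(".", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected <project_id>.<dataset_id>.<table>, got {full_id!r}"
            )
        return cls(*parts)
```

For example, `TableRef("my-project", "sales", "orders").full_id()` gives `my-project.sales.orders`, and `TableRef.parse` reverses that step.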
22. Slots
A BigQuery slot is a combination of CPU, memory, and network resources.
BigQuery automatically calculates the number of slots required to execute a
query based on query size and complexity.
23. Service Limits
● Interactive queries: 100 concurrent queries
● Query execution time limit: 6 hours
● Load jobs per table per day: 1,500 (including failures)
● Maximum columns per table: 10,000
● Copy jobs per destination table per day: 1,000 (including failures)
● Number of datasets per project: No limit
● Number of tables per dataset: No limit
● Maximum number of table operations per day: 1,500
● Maximum number of partitions per partitioned table: 4,000
Please refer to cloud.google.com/bigquery/quotas for the latest service limits.
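To make the partition limit above concrete, the Python sketch below counts how many daily partitions a date range would need and compares that with the 4,000-partition figure from the slide. The function names are hypothetical, daily partitioning is an assumption, and the limit is the value quoted in the talk, which may since have changed.

```python
from datetime import date

# Limit as quoted on the slide; check the quotas page for the current value.
MAX_PARTITIONS_PER_TABLE = 4_000


def daily_partitions(start: date, end: date) -> int:
    """Number of daily partitions needed to cover start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    return (end - start).days + 1


def fits_partition_limit(start: date, end: date) -> bool:
    """True if a daily-partitioned table over this range stays within the limit."""
    return daily_partitions(start, end) <= MAX_PARTITIONS_PER_TABLE
```

Under these assumptions, one year of daily partitions is far below the limit, while a range of roughly eleven years exceeds it.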
24. Pricing
● On-Demand
○ $5 per TB processed
○ First 1 TB per month is free
● Flat Rate
○ Monthly - $2,000 per 100 slots
○ Annual - $1,700 per 100 slots
Please refer to cloud.google.com/bigquery/pricing for the latest pricing.
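The on-demand figures above lend themselves to a quick estimate. The Python sketch below uses only the prices quoted on the slide ($5 per TB, first 1 TB per month free, $2,000 per month for 100 flat-rate slots). The function names are hypothetical, and real bills depend on current pricing and on how billed bytes are measured.

```python
# Prices as quoted on the slide; see the pricing page for current values.
ON_DEMAND_USD_PER_TB = 5.0
FREE_TB_PER_MONTH = 1.0
FLAT_RATE_MONTHLY_USD_PER_100_SLOTS = 2000.0


def on_demand_cost(tb_processed_in_month: float) -> float:
    """Estimated monthly on-demand cost in USD after the free tier."""
    billable_tb = max(0.0, tb_processed_in_month - FREE_TB_PER_MONTH)
    return billable_tb * ON_DEMAND_USD_PER_TB


def break_even_tb(flat_rate_usd: float = FLAT_RATE_MONTHLY_USD_PER_100_SLOTS) -> float:
    """Monthly TB at which on-demand cost equals the given flat-rate fee."""
    return FREE_TB_PER_MONTH + flat_rate_usd / ON_DEMAND_USD_PER_TB
```

With these numbers, processing 11 TB in a month costs $50 on demand, and on-demand spending reaches the monthly 100-slot flat rate at about 401 TB per month.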
31. Query Performance
● Avoid “SELECT *”
● Use partitions
● Denormalize data
● Use wildcards on tables appropriately
● Use external data sources appropriately
● Reduce the amount of data before a JOIN
● Avoid repetitive data transformation using SQL queries
● Use nested and repeated fields
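The first two query-performance tips (avoid "SELECT *" and use partitions) can be sketched as a small query builder. This Python example is a hypothetical illustration: the `build_query` function, the table `my-project.sales.orders`, and the column names are all assumptions, and it only produces a SQL string rather than calling BigQuery. Production code should pass dates as query parameters instead of formatting them into the SQL text.

```python
from typing import Sequence


def build_query(
    table: str,
    columns: Sequence[str],
    partition_column: str,
    start_date: str,
    end_date: str,
) -> str:
    """Build a query that names its columns and filters on the partition column."""
    if not columns:
        # Refuse to fall back to SELECT *, which reads every column.
        raise ValueError("name the columns you need instead of using SELECT *")
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{table}`\n"
        # Filtering on the partition column lets the engine skip other partitions.
        f"WHERE {partition_column} BETWEEN DATE '{start_date}' AND DATE '{end_date}'"
    )


query = build_query(
    table="my-project.sales.orders",      # hypothetical table
    columns=["order_id", "amount"],       # hypothetical columns
    partition_column="order_date",        # assumed DATE partition column
    start_date="2020-08-01",
    end_date="2020-08-07",
)
```

The resulting string selects two named columns and restricts the scan to one week of partitions.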
32. Cost Optimization
● Use table expiration
● Avoid data duplication
● Avoid full table scans
● Only scan required columns
● Use the caching feature
● Use partitions
● Use clustering
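The effect of scanning only required columns and only required partitions can be illustrated with a toy model. The Python sketch below is not a BigQuery API: the column sizes, the 30 equally sized daily partitions, and the `scanned_gb` function are hypothetical assumptions chosen to show the arithmetic.

```python
from typing import Iterable

# Hypothetical per-column sizes in GB for a table with 30 daily partitions
# that are assumed to be equally sized.
COLUMN_SIZES_GB = {"user_id": 40, "event_date": 40, "country": 20, "payload": 900}
TOTAL_DAYS = 30


def scanned_gb(columns: Iterable[str], days_scanned: int = TOTAL_DAYS) -> float:
    """Estimate GB scanned when reading the given columns over some partitions."""
    if not 0 <= days_scanned <= TOTAL_DAYS:
        raise ValueError("days_scanned must be between 0 and TOTAL_DAYS")
    column_gb = sum(COLUMN_SIZES_GB[name] for name in columns)
    # Only the selected share of the daily partitions is read.
    return column_gb * days_scanned / TOTAL_DAYS


full_scan = scanned_gb(COLUMN_SIZES_GB)                         # every column, every day
narrow = scanned_gb(["user_id", "country"])                     # two columns, every day
pruned = scanned_gb(["user_id", "country"], days_scanned=3)     # two columns, three days
```

In this model the full scan reads 1,000 GB, selecting two columns reads 60 GB, and adding a three-day partition filter reads 6 GB.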