1. 1
YOUR DATA, NO LIMITS
Kent Graziano
Senior Technical Evangelist
Snowflake Computing
Changing the Game with
Cloud Data Warehousing
@KentGraziano
2. 2
My Bio
•Senior Technical Evangelist, Snowflake Computing
•Oracle ACE Director (DW/BI)
•OakTable
•Blogger – The Data Warrior
•Certified Data Vault Master and DV 2.0 Practitioner
•Former Member: Boulder BI Brain Trust (#BBBT)
•Member: DAMA Houston & DAMA International
•Data Architecture and Data Warehouse Specialist
•30+ years in IT
•25+ years of Oracle-related work
•20+ years of data warehousing experience
•Author & Co-Author of a bunch of books (Amazon)
•Past-President of ODTUG and Rocky Mountain Oracle
User Group
3. 3
Agenda
•Data Challenges
•What is a Cloud Data Warehouse?
•What can a Cloud DW do for me?
•Cool Features of Snowflake
•Other Cloud DW – Redshift, Azure, BigQuery
•Real Metrics
5. 5
Scenarios with affinity for cloud
Gartner 2016 Predictions: By 2018, six billion connected things will be requesting support.
Connecting applications, devices, and
“things”
Reaching employees, business partners,
and consumers
Anytime, anywhere mobility
On demand, unlimited scale
Understanding behavior; generating,
retaining, and analyzing data
7. 7
It’s not the data itself
it’s how you take full advantage of the insight it provides
Web, ERP, 3rd party apps, Enterprise apps, IoT, Mobile
8. 8
All possible data All possible actions
Most firms don’t consistently turn data into
action
73% of firms aspire to be data-driven.
29% of firms are good at turning data into action.
Source: Forrester
9. 9
New possibilities with the cloud
•More & more data “born in the cloud”
•Natural integration point for data
•Capacity on demand
•Low-cost, scalable storage
•Compute nodes
11. 11
The evolution of data platforms
1980s: Relational database (Oracle, DB2, SQL Server)
1990s: Data warehouse appliance (Teradata)
2000s: Data warehouse & platform software (Vertica, Greenplum, Paraccel, Hadoop, Redshift)
2010s: Cloud-native data warehouse (Snowflake)
12. 12
What is a Cloud-Native DW?
•DW- Data Warehouse
•Relational database
•Uses standard SQL
•Optimized for fast loads and analytic queries
•aaS – As a Service
•Like SaaS (e.g. Salesforce.com)
•No infrastructure set up
•Minimal to no administration
•Managed for you by the vendor
•Pay as you go, for what you use
13. 13
Goals of Cloud DW
•Make your life easier
•So you can load and use your data faster
•Support business
•Make data accessible to more people
•Reduce time to insights
•Handle big data too!
•Schema-less ingestion
14. 14
Common customer scenarios
Data warehouse for SaaS offerings
Use Cloud DW as back-end data warehouse supporting data-driven SaaS products
noSQL replacement
Replace use of noSQL system (e.g. Hadoop) for transformation and SQL analytics of multi-structured data
Data warehouse modernization
Consolidate legacy datamarts and support new projects
15. 15
Over 250 customers demonstrate what’s possible
Up to 200x faster reports that enable analysts to make
decisions in minutes rather than days
Load and update data in near real time by replacing legacy
data warehouse + Hadoop clusters
Developing new applications that provide secure (HIPAA) access
to analytics to 11,000+ pharmacies
17. 17
About Snowflake
Experienced, accomplished leadership team
2012: Founded by industry veterans with over 120 database patents
Vision: A world with no limits on data
First data warehouse built for the cloud
Over 230 enterprise customers in one year since GA
18. 18
The 1st Data Warehouse Built for the Cloud
Data Warehousing…
• SQL relational database
• Optimized storage & processing
• Standard connectivity – BI, ETL, …
•Existing SQL skills and tools
•“Load and go” ease of use
•Cloud-based elasticity to fit any scale
Data
scientists
SQL
users &
tools
…for Everyone
19. 19
The Snowflake difference
Performance: Multi petabyte-scale, up to 200x faster performance and 1/10th the cost
Concurrency: Multiple groups access data simultaneously with no performance degradation
Simplicity: Fully managed with a pay-as-you-go model. Works on any data
21. 21
#10 – Persistent Result Sets
•No setup
•In Query History
•By Query ID
•24 Hours
•No re-execution
•No Cost for Compute
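As a rough mental model of the bullets above (not Snowflake's actual implementation), a persisted result set behaves like a cache keyed by query ID with a 24-hour lifetime: fetching by ID returns the saved rows without re-running the query. The class and method names below are hypothetical.

```python
import time


class ResultCache:
    """Toy model: results are kept by query ID and expire after 24 hours."""

    RETENTION_SECONDS = 24 * 60 * 60  # the "24 Hours" bullet

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable clock keeps the example testable
        self._results = {}    # query_id -> (stored_at, rows)

    def store(self, query_id, rows):
        """Save the rows produced by a query under its ID."""
        self._results[query_id] = (self._clock(), list(rows))

    def fetch(self, query_id):
        """Return saved rows without re-execution, or None if expired or unknown."""
        entry = self._results.get(query_id)
        if entry is None:
            return None
        stored_at, rows = entry
        if self._clock() - stored_at > self.RETENTION_SECONDS:
            del self._results[query_id]
            return None
        return rows
```

A caller would `store()` once after the query runs and `fetch()` by query ID afterwards; once 24 hours have passed, the toy cache returns `None` and the query would have to run again.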
22. 22
#9 Connect with JDBC & ODBC
Data Sources
Custom & Packaged
Applications
Interfaces: ODBC, Web UI, JDBC
Java
Scripting
Reporting &
Analytics
Data Modeling,
Management &
Transformation
SDDM
SPARK too!
https://github.com/snowflakedb/
JDBC drivers available via Maven.org
Python driver available via PyPI
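A minimal sketch of using the Python driver, assuming the `snowflake-connector-python` package from PyPI is installed. The function name and the account, user, and password values are placeholders, not details from the slides.

```python
def fetch_snowflake_version(account, user, password):
    """Connect with the Python driver and return the server version string."""
    # Imported lazily so this module still loads where the driver is not installed.
    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(account=account, user=user, password=password)
    try:
        cur = conn.cursor()
        try:
            cur.execute("SELECT CURRENT_VERSION()")
            return cur.fetchone()[0]
        finally:
            cur.close()
    finally:
        conn.close()
```

Calling `fetch_snowflake_version("<account>", "<user>", "<password>")` with real credentials should return the version as a string. The same connect/cursor/execute pattern applies to any other SQL statement.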
23. 23
#8 - UNDROP
UNDROP TABLE <table name>
UNDROP SCHEMA <schema name>
UNDROP DATABASE <db name>
Part of Time Travel feature: AWESOME!
24. 24
#7 Fast Clone (Zero-Copy)
•Instant copy of table, schema, or
database:
CREATE OR REPLACE
TABLE MyTable_V2
CLONE MyTable
• With Time Travel:
CREATE SCHEMA
mytestschema_clone_restore
CLONE testschema
BEFORE (TIMESTAMP =>
TO_TIMESTAMP(40*365*86400));
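The argument `40*365*86400` in the Time Travel example is easier to read once evaluated. Assuming the integer is interpreted as seconds since the Unix epoch (1970-01-01 UTC), a quick check of the arithmetic looks like this:

```python
from datetime import datetime, timezone

# 40 * 365 * 86400 is a count of seconds: 40 "years" of 365 days each.
epoch_seconds = 40 * 365 * 86400

# Interpreted as seconds since 1970-01-01 00:00:00 UTC (leap days are not added).
restore_point = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(epoch_seconds)              # 1261440000
print(restore_point.isoformat())  # 2009-12-22T00:00:00+00:00
```

The result lands ten days short of 2010-01-01 because the ten leap days between 1970 and 2010 are not counted. In practice you would supply a timestamp inside your own Time Travel retention window.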
26. 26
#5 – Standard SQL w/Analytic Functions
Complete SQL database
• Data definition language (DDLs)
• Query (SELECT)
• Updates, inserts and deletes (DML)
• Role based security
• Multi-statement transactions
select Nation, Customer, Total
from (select
n.n_name Nation,
c.c_name Customer,
sum(o.o_totalprice) Total,
rank() over (partition by n.n_name
order by sum(o.o_totalprice) desc)
customer_rank
from orders o,
customer c,
nation n
where o.o_custkey = c.c_custkey
and c.c_nationkey = n.n_nationkey
group by 1, 2)
where customer_rank <= 3
order by 1, customer_rank
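To see what the ranking query returns, it can be run against a tiny made-up dataset. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine (an assumption for illustration; it needs SQLite 3.25 or later for window functions), with hypothetical table contents.

```python
import sqlite3

QUERY = """
select Nation, Customer, Total
from (select
        n.n_name Nation,
        c.c_name Customer,
        sum(o.o_totalprice) Total,
        rank() over (partition by n.n_name
                     order by sum(o.o_totalprice) desc) customer_rank
      from orders o, customer c, nation n
      where o.o_custkey = c.c_custkey
        and c.c_nationkey = n.n_nationkey
      group by 1, 2)
where customer_rank <= 3
order by 1, customer_rank
"""


def top_customers():
    """Load hypothetical sample rows and return the top 3 customers per nation."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table nation   (n_nationkey integer, n_name text);
        create table customer (c_custkey integer, c_name text, c_nationkey integer);
        create table orders   (o_custkey integer, o_totalprice real);
        insert into nation values (1, 'FRANCE'), (2, 'UNITED STATES');
        insert into customer values
            (10, 'Cust#10', 1), (11, 'Cust#11', 1),
            (20, 'Cust#20', 2), (21, 'Cust#21', 2),
            (22, 'Cust#22', 2), (23, 'Cust#23', 2);
        insert into orders values
            (10, 30), (11, 40),
            (20, 100), (20, 50), (21, 200), (22, 75), (23, 10);
    """)
    rows = db.execute(QUERY).fetchall()
    db.close()
    return rows
```

With this sample data, the nation with four customers keeps only its three highest spenders. The two orders for `Cust#20` are summed before ranking, because the window function is evaluated after the `group by`.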
27. 27
#4 – Separation of Storage & Compute
Snowflake’s multi-cluster, shared data architecture
Layers: Service, Compute, Storage
Centralized storage
Instant, automatic scalability & elasticity
28. 28
#3 – Support Multiple Workloads
Scale compute to support any workload: Scale processing horsepower up and down on-the-fly, with zero downtime or disruption
Scale concurrency without performance impact: Multi-cluster “virtual warehouse” architecture scales concurrent users & workloads without contention
Accelerate the data pipeline: Run loading & analytics at any time, concurrently, to get data to users faster
29. 29
#2 – Secure by Design with Automatic Encryption of Data!
Authentication
Embedded
multi-factor authentication
Federated authentication
available
Access control
Role-based access
control model
Granular privileges on all
objects & actions
Data encryption
All data encrypted, always,
end-to-end
Encryption keys managed
automatically
External validation
Certified against enterprise-class requirements
HIPAA Certified!
30. 30
#1 - Automatic Query Optimization
•Fully managed with no knobs or tuning required
•No indexes, distribution keys, partitioning, vacuuming,…
•Zero infrastructure costs
•Zero admin costs
32. 32
Amazon Redshift
•Amazon's data warehousing offering in AWS
• First announced in fall 2012 and GA in early 2013
• Derived from ParAccel (Postgres-based) and moved to the cloud
•Pluses
• Maturity: based on ParAccel, on the market for almost 10 years.
• Ecosystem: Deeper integration with other AWS products
• Amazon backing
33. 33
Amazon Redshift
•Challenges (vs Snowflake)
• Semi-structured data: Redshift cannot natively handle flexible-
schema data (e.g. JSON, Avro, XML) at scale.
• Concurrency: Redshift architecture means that there is a hard
concurrency limit that cannot be addressed short of creating a second,
independent Redshift cluster.
• Scaling: Scaling a cluster means read-only mode for hours while data
is redistributed. Every new cluster has a complete copy of data,
multiplying costs for dev, test, staging, and production as well as for
datamarts created to address concurrency scaling limitations.
• Management overhead: Customers report spending hours, often
every week, doing maintenance such as reloading data, vacuuming,
updating metadata.
34. 34
Customer Analysis – Snowflake vs
Redshift
•Ability to provision compute and storage separately –
•Storage might grow exponentially but compute needs may not
•No need to add nodes to cluster to accommodate storage
•Compute capacity can be provisioned during business hours
and shut down when not required (saving $$$)
•Predictable/ exact processing power for user queries with
dedicated separate warehouses
•No concurrency issues between warehouses
•No constraints on completing the ETL run before business hours
•Analytical workload and ETL can run in parallel
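The "saving $$$" point about running compute only during business hours can be made concrete with simple arithmetic. The schedule below (10 hours a day, 5 days a week) is a hypothetical example, not a figure from the customer analysis.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours if compute is left on all week


def weekly_compute_hours(hours_per_day, days_per_week):
    """Hours a warehouse is running (and billed) under a fixed schedule."""
    return hours_per_day * days_per_week


def savings_fraction(hours_per_day, days_per_week):
    """Share of always-on compute hours avoided by shutting down off-hours."""
    used = weekly_compute_hours(hours_per_day, days_per_week)
    return 1 - used / HOURS_PER_WEEK


# Hypothetical schedule: 10 hours a day, Monday to Friday.
print(weekly_compute_hours(10, 5))        # 50
print(round(savings_fraction(10, 5), 3))  # 0.702
```

Under that assumed schedule the warehouse runs 50 of 168 hours, roughly 70% fewer compute hours than an always-on cluster, while storage is billed separately either way.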
35. 35
Customer Analysis
•0 maintenance/ 100% managed
•No tuning (distkey, sortkey, vacuum, analyze)
•100% uptime, no backups
•Can restore at transaction level through time travel feature
•3X- 5X better compression compared to Redshift
•Auto compressed
•Data at rest is encrypted by default
•With Redshift, performance is degraded by 2x-3x
36. 36
Customer Analysis
•Supports cross database joins
•With cloning feature, we can spin up Dev/ test by
cloning entire prod database in seconds → run tests→
and drop the clone
•There is no charge for a clone. Only incremental updates on
storage are charged
•Instant Resize (scale up)
•NO 20+ hrs read only mode like Redshift!
•Resize also allows provisioning higher compute capacity for faster
processing when required
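Why a clone can be created in seconds, and why only incremental updates add storage, is easier to see with a toy copy-on-write model. This is an illustration under simplifying assumptions, not Snowflake's internal design; the `Table` class and its methods are hypothetical.

```python
class Table:
    """Toy copy-on-write table: data lives in immutable blocks referenced by ID."""

    def __init__(self, store, block_ids=()):
        self.store = store                # shared dict: block_id -> rows
        self.block_ids = list(block_ids)  # this table's metadata only

    def append(self, rows):
        """Write a new block to shared storage and reference it from this table."""
        block_id = len(self.store)
        self.store[block_id] = tuple(rows)
        self.block_ids.append(block_id)

    def clone(self):
        """Copy the list of block references, not the blocks themselves."""
        return Table(self.store, self.block_ids)

    def rows(self):
        return [row for b in self.block_ids for row in self.store[b]]


store = {}
prod = Table(store)
prod.append([1, 2, 3])
prod.append([4, 5])

dev = prod.clone()  # metadata-only: the store still holds 2 blocks
dev.append([99])    # only this incremental block adds storage
```

After the clone, `prod` and `dev` share the first two blocks. The extra block written through `dev` is the only new storage, and `prod` never sees it.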
37. 37
Microsoft Azure DW
•Based on Analytics Platform System
•Emerged from Microsoft’s 2008 acquisition of DATAllegro, an on-premises MPP
data warehouse
•Pluses
• Maturity: very mature (on-prem) database for over 20 years
• Ecosystem: Deep integration with Azure and SQL Server ecosystem
• Separation of compute and storage: can scale compute without
the need to unload, load, or move the underlying data at the database
level.
• Integration w/ Big Data & Hadoop: allows querying 'semi-
structured/unstructured' data, such as Hadoop file formats ORC, RC,
and Parquet. This works via the external table concept
38. 38
Microsoft Azure DW
•Challenges (vs Snowflake)
• Concurrency: Azure architecture means there is a hard concurrency
limit (currently 32 users) per DWU (Data Warehouse Unit) that cannot
be addressed short of creating a second warehouse and copy of the
data.
• Scaling: Cannot scale clusters easily and automatically.
• Management overhead: Azure is difficult to manage, especially with
considerations around data distribution, statistics, indices, encryption,
metadata operations, replication (for disaster recovery) and more.
• Security: lack of end-to-end encryption.
• Support for modern programmability: Lacks support for a wide
range of APIs due to commitments to its own ecosystem.
39. 39
Google BigQuery
•Query processing service offered in the Google Cloud, first
launched in 2010
• Follow-up to the Dremel service, which was developed for internal use at
Google only
•Pluses
• Google-scale horsepower: Runs jobs across a boatload of servers.
As a result, BigQuery can be very fast on many individual queries.
• Absolutely zero management: You submit your job and wait for it
to return–that's it. No management of infrastructure, no management
of database configuration, no management of compute horsepower,
etc.
40. 40
Google BigQuery
•Challenges (vs Snowflake)
• BigQuery is not a data warehouse: Does not implement key features that
are expected in a relational database, which means existing database
workloads will not work in BigQuery without non-trivial change
• Performance degradation for JOINs: Limits on how many tables can be
joined
• BigQuery is a black box: You submit your job and it finishes when it
finishes–users have no ability to control SLAs nor performance.
• Lots of usage limitations: Quotas on how many concurrent jobs can be
run, how many queries can run per day, how much data can be processed at
once, etc.
• Obscure pricing: Prices per query (based on amount of data processed by
a query), making it difficult to know what it will cost and making costs add up
quickly
• BigQuery only recently introduced full SQL support
42. 42
Simplifying the data pipeline
Existing pipeline: Event Data → Kafka → Hadoop → SQL Database → Analysts & BI Tools (diagram also shows an Import Processor and a Key-value Store)
With Snowflake: Event Data → Kafka → Amazon S3 → Snowflake → Analysts & BI Tools
Scenario
• Evaluating event data from various sources
Pain Points
• 2+ hours to make new data available for
analytics
• Significant management overhead
• Expensive infrastructure
Solution
Send data from Kafka to S3 to Snowflake with
schemaless ingestion and easy querying
Snowflake Value
• Eliminate external pre-processing
• Fewer systems to maintain
• Concurrency without contention & performance
impact
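"Schemaless ingestion" here means events are loaded as they arrive and fields are picked out at query time, so no pre-processing step has to flatten them first. A small plain-Python sketch of that idea follows; the event payloads and helper names are made up for illustration and say nothing about how Snowflake stores the data.

```python
import json

# Hypothetical event payloads; the second has fields the others lack.
RAW_EVENTS = [
    '{"type": "click", "user": {"id": 1}}',
    '{"type": "purchase", "user": {"id": 2, "plan": "pro"}, "amount": 19.5}',
    '{"type": "click", "user": {"id": 2}, "page": "/home"}',
]


def get_path(doc, path, default=None):
    """Follow a dotted path such as "user.id" through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def select(raw_events, *paths):
    """Project the requested paths from each event; missing fields become None."""
    return [tuple(get_path(json.loads(e), p) for p in paths) for e in raw_events]


rows = select(RAW_EVENTS, "type", "user.id", "amount")
```

Events with different shapes sit side by side, and a field that is absent from an event comes back as `None` instead of breaking the load.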
43. 43
Simplifying the Data pipeline
EDW
Game Event Data
Internal Data
Third-party Data
Analysts & BI
Tools
Staging
noSQL
Database
Existing
EDW
Game Event Data
Internal Data
Third-party Data
Analysts & BI
Tools
Kinesis, Snowflake
Cleanse, Normalize, Transform
Existing: 11-24 hours; with Snowflake: 15 minutes
Scenario
Complex pipeline slowing down analytics
Pain Points
• Fragile data pipeline
• Delays in getting updated data
• High cost and complexity
• Limited data granularity
Solution
Send data from Kinesis to S3 to Snowflake with
schemaless ingestion and easy querying
Snowflake Value
• >50x faster data updates
• 80% lower costs
• Nearly eliminated pipeline failures
• Able to retain full data granularity
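The ">50x faster data updates" figure can be checked against the two latencies on the slide, reading "11-24 hours" as the existing pipeline and "15 minutes" as the Snowflake pipeline (that pairing is an assumption drawn from the diagram layout).

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the new latency is than the old one."""
    return before_minutes / after_minutes


AFTER_MINUTES = 15

low = speedup(11 * 60, AFTER_MINUTES)   # 44.0
high = speedup(24 * 60, AFTER_MINUTES)  # 96.0

# Old latency at which the speedup is exactly 50x.
breakeven_hours = 50 * AFTER_MINUTES / 60  # 12.5
```

Under that reading the improvement ranges from 44x to 96x, and it exceeds 50x whenever the old pipeline took longer than 12.5 hours.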
44. 44
Delivering compelling results
Simpler data pipeline
Replace noSQL database with Snowflake for storing & transforming JSON event data
noSQL database: 8 hours to prepare data
Snowflake: 1.5 minutes
Faster analytics
Replace on-premises data warehouse with Snowflake for analytics workload
Data warehouse appliance: 20+ hours
Snowflake: 45 minutes
Significantly lower cost
Improved performance while adding new workloads at a fraction of the cost
Data warehouse appliance: $5M+ to expand
Snowflake: added 2 new workloads for $50K
45. 45
“The fact that we don’t need
to do any configuration or
tuning is great because we
can focus on analyzing data
instead of on managing
and tuning a data
warehouse.”
Craig Lancaster, CTO
“The combination of
Snowflake and Looker gives
our business users a
powerful, self-service tool to
explore and analyze diverse
data, like JSON, quickly and
with ease.”
“We went from an obstacle
and cost-center to a value-
added partner providing
business intelligence and
unified data warehousing
for our global web property
business lines.”
JP Lester, CTO
Music & film distribution
Publishers of Ask.com, Dictionary.com, About.com,
and other premium websites
Internet access service company to
over 30 million smartphone users
• Replaced MySQL
• Substituted Snowflake
native JSON handling
and queries with SQL
in place of MapReduce
• Integrated with Hadoop
repository
• Consolidated global
web data pipelining
• Replaced 36-node data
warehouse
• Replaced 100-node
Hadoop cluster
Erika Baske, Head of BI
• Eliminated bottlenecks
and scaling pain-points
• Consolidated multiple
cloud data marts
• Now handling larger
datasets and higher
concurrency with ease
47. 47
Steady growth in data processing
•Over 20 PB loaded to date!
•Multiple customers with >1PB
•Multiple customers averaging >1M
jobs / week
•>1PB / day processed
•Experiencing 4X data processing
growth over last six months
Jobs / day
48. 48
What does a Cloud-native DW enable?
Cost-effective storage and analysis of GBs, TBs, or even PBs
Lightning fast query performance
Continuous data loading without impacting query performance
Unlimited user concurrency
Full SQL relational support of both structured and
semi-structured data
Support for the tools and languages you already use (ODBC, JDBC, Java, scripting)
50. 50
As easy as 1-2-3!
Discover the performance, concurrency,
and simplicity of Snowflake
1 Visit Snowflake.net
2 Click “Try for Free”
3 Sign up & register
Snowflake is the only data warehouse built for the cloud. You can
automatically scale compute up, out, or down, independent of storage.
Plus, you have the power of a complete SQL database, with zero
management, that can grow with you to support all of your data and all
of your users. With Snowflake On Demand™, pay only for what you use.
Sign up and receive
$400 worth of free
usage for 30 days!