SlideShare une entreprise Scribd logo
1  sur  40
Télécharger pour lire hors ligne
1
The Snowflake Elastic Data Warehouse
SIGMOD 2016 and beyond
Ashish Motivala, Jiaqi Yan
2
Our Product
•The Snowflake Elastic Data Warehouse, or “Snowflake”
•Built for the cloud
•Multi-tenant, transactional, secure, highly scalable, elastic
•Implemented from scratch (no Hadoop, Postgres etc.)
•Currently runs on AWS and Azure
•Serves tens of millions of queries per day over hundreds
petabytes of data
•1000+ active customers, growing fast
3
Talk Outline
•Motivation and Vision
•Storage vs. Compute or the Perils of Shared-Nothing
•Architecture
•Feature Highlights
•Lessons Learned
4
Why Cloud?
•Amazing platform for building distributed systems
•Virtually unlimited, elastic compute and storage
•Pay-per-use model (with strong economies of scale)
•Efficient access from anywhere
•Software as a Service (SaaS)
•No need for complex IT organization and infrastructure
•Pay-per-use model
•Radically simplified software delivery, update, and user support
•See “Lessons Learned”
5
Data Warehousing in the Cloud
•Traditional DW systems pre-date the cloud
•Designed for small, fixed clusters of machines
•But to reap benefits of the cloud, software needs to be elastic!
•Traditional DW systems rely on complex ETL
(extract-transform-load) pipelines and physical tuning
•Fundamentally assume predictable, slow-moving, easily categorized
data from internal sources (OLTP, ERP, CRM…)
•Cloud data increasingly stems from changing, external sources
•Logs, click streams, mobile devices, social media, sensor data
•Often arrives in schema-less, semi-structured form (JSON, XML, Avro)
6
What about Big Data?
•Hive, Spark, BigQuery, Impala, Blink…
•Batch and/or stream processing at datacenter scale
•Various SQL’esque front-ends
•Increasingly popular alternative for high-end use cases
•Drawbacks
•Lack efficiency and feature set of traditional DW technology
•Security? Backups? Transactions? …
•Require significant engineering effort to roll out and use
7
Our Vision for a Cloud Data Warehouse
Data warehouse
as a service
No infrastructure to
manage, no knobs to tune
Multidimensional
elasticity
On-demand scalability
data, queries, users
All business
data
Native support for
relational +
semi-structured data
8
Shared-nothing Architecture
•Tables are horizontally partitioned across nodes
•Every node has its own local storage
•Every node is only responsible for its local table partitions
•Elegant and easy to reason about
•Scales well for star-schema queries
•Dominant architecture in data warehousing
•Teradata, Vertica, Netezza…
9
The Perils of Coupling
•Shared-nothing couples compute and storage resources
•Elasticity
•Resizing compute cluster requires redistributing (lots of) data
•Cannot simply shut off unused compute resources → no pay-per-use
•Limited availability
•Membership changes (failures, upgrades) significantly impact
performance and may cause downtime
•Homogeneous resources vs. heterogeneous workload
•Bulk loading, reporting, exploratory analysis
10
Multi-cluster, shared data architecture
Databases
Virtual
Warehouse
Virtual
Warehouse
ETL & Data
Loading
Virtual
Warehouse
Finance
Virtual
Warehouse
Dev, Test,
QA
Dashboards
Virtual
Warehouse
Marketing
Data Science
Clone
• No data silos
Storage decoupled from compute
• Any data
Native for structured & semi-structured
• Unlimited scalability
Along many dimensions
• Low cost
Compute on demand
• Instantly cloning
Isolate production from DEV & QA
• Highly available
11 9’s durability, 4 9’s availability
11
Data Storage
Multi-cluster Shared-data Architecture
• All data in one place
• Independently scale
storage and compute
• No unload / reload to
shut off compute
• Every virtual warehouse
can access all data
Cloud
Services
Transaction
Manager
Security
Optimizer
Infrastructure
manager
Authentication & access control
Virtual
Warehouse
Cache
Virtual
Warehouse
Cache
Virtual
Warehouse
Cache
Virtual
Warehouse
Cache
Rest (JDBC/ODBC/Python)
Metadata
12
Data Storage Layer
•Stores table data and query results
•Table is a set of immutable micro-partitions
•Uses tiered storage with Amazon S3 at the bottom
•Object store (key-value) with HTTP(S) PUT/GET/DELETE interface
•High availability, extreme durability (11-9)
•Some important differences w.r.t. local disks
•Performance (sure…)
•No update-in-place, objects must be written in full
•But: can read parts (byte ranges) of objects
•Strong influence on table micro-partition format and
concurrency control
13
Table Files
•Snowflake uses PAX [Ailamaki01] aka
hybrid columnar storage
•Tables horizontally partitioned into
immutable mirco-partitions (~16 MB)
•Updates add or remove entire files
•Values of each column grouped together
and compressed
•Queries read header + columns they need
14
Other Data
•Tiered storage also used for temp data and query results
•Arbitrarily large queries, never run out of disk
•New forms of client interaction
•No server-side cursors
•Retrieve and reuse previous query results
•Metadata stored in a transactional key-value store (not S3)
•Which table consists of which S3 objects
•Optimizer statistics, lock tables, transaction logs etc.
•Part of Cloud Services layer (see later)
15
Virtual Warehouse
•warehouse = Cluster of EC2 instances called worker nodes
•Pure compute resources
•Created, destroyed, resized on demand
•Users may run multiple warehouses at same time
•Each warehouse has access to all data but isolated performance
•Users may shut down all warehouses when they have nothing to run
•T-Shirt sizes: XS to 4XL
•Users do not know which type or how many EC2 instances
•Service and pricing can evolve independent of cloud platform
16
Worker Nodes
•Worker processes are ephemeral and idempotent
•Worker node forks new worker process when query arrives
•Do not modify micro-partitions directly but queue removal or addition
of micro-partitions
•Each worker node maintains local table cache
•Collection of table files i.e. S3 objects accessed in past
•Shared across concurrent and subsequent worker processes
•Assignment of micro-partitions to nodes using consistent hashing, with
deterministic stealing.
17
Execution Engine
•Columnar [MonetDB, C-Store, many more]
•Effective use of CPU caches, SIMD instructions, and compression
•Vectorized [Zukowski05]
•Operators handle batches of a few thousand rows in columnar format
•Avoids materialization of intermediate results
•Push-based [Neumann11 and many before that]
•Operators push results to downstream operators (no Volcano iterators)
•Removes control logic from tight loops
•Works well with DAG-shaped plans
•No transaction management, no buffer pool
•But: most operators (join, group by, sort) can spill to disk and recurse
18
• Adaptive
• Self-tuning
• Do no harm!
• Automatic
• Default
18
Self Tuning & Self Healing
Automatic
Memory
Management
Automatic
Workload
Management
Automatic
Distribution
Method
Automatic
Degree of
Parallelism
Automatic
Fault
Handling
19
•
•
•
•
•
Example: Automatic Skew Avoidance
Execution Plan
2
scan
join
scan
filter
1
1
2
20
Cloud Services
•Collection of services
•Access control, query optimizer, transaction manager etc.
•Heavily multi-tenant (shared among users) and always on
•Improves utilization and reduces administration
•Each service replicated for availability and scalability
•Hard state stored in transactional key-value store
21
Concurrency Control
•Designed for analytic workloads
•Large reads, bulk or trickle inserts, bulk updates
•Snapshot Isolation (SI) [Berenson95]
•SI based on multi-version concurrency control (MVCC)
•DML statements (insert, update, delete, merge) produce new table
versions of tables by adding or removing whole files
•Natural choice because table files on S3 are immutable
•Additions and removals tracked in metadata (key-value store)
•Versioned snapshots used also for time travel and cloning
22
Pruning
•Database adage: The fastest way to process data? Don’t.
•Limiting access only to relevant data is key aspect of query processing
•Traditional solution: B+
-trees and other indices
•Poor fit for us: random accesses, high load time, manual tuning
•Snowflake approach: pruning
•AKA small materialized aggregates [Moerkotte98], zone maps
[Netezza], data skipping [IBM]
•Per file min/max values, #distinct values, #nulls, bloom filters etc.
•Use metadata to decide which files are relevant for a given query
•Smaller than indices, more load-friendly, no user input required
23
Pure SaaS Experience
•Support for various standard interfaces and third-party tools
•ODBC, JDBC, Python PEP-0249
•Tableau, Informatica, Looker
•Feature-rich web UI
•Worksheet, monitoring, user management,
usage information etc.
•Dramatically reduces time to onboard users
•Focus on ease-of-use and service exp.
•No tuning knobs
•No physical design
•No storage grooming
24
Continuous Availability
•Storage and cloud services replicated across datacenters
•Snowflake remains available even if a whole datacenter fails
•Weekly Online Upgrade
•No downtime, no performance degradation!
•Tremendous effect on pace of development and bug resolution time
•Magic sauce: stateless services
•All state is versioned and stored in common key-value store
•Multiple versions of a service can run concurrently
•Load balancing layer routes new queries to new service version, until
old version finished all its queries
25
Semi-Structured and Schema-Less Data
•Three new data types: VARIANT, ARRAY, OBJECT
•VARIANT: holds values of any standard SQL type + ARRAY + OBJECT
•ARRAY: offset-addressable collection of VARIANT values
•OBJECT: dictionary that maps strings to VARIANT values
•Like JavaScript objects or MongoDB documents
•Self-describing, compact binary serialization
•Designed for fast key-value lookup, comparison, and hashing
•Supported by all SQL operators (joins, group by, sort…)
26
Post-relational Operations
•Extraction from VARIANTs using path syntax
•Flattening (pivoting) a single OBJECT or ARRAY into multiple rows
SELECT p.contact.name.first AS "first_name",
p.contact.name.last AS "last_name",
(f.value.type || ': ' || f.value.contact) AS "contact"
FROM person p,
LATERAL FLATTEN(input => p.contact) f;
------------+-----------+---------------------+
first_name | last_name | contact |
------------+-----------+---------------------+
“John" | “Doe" | email: john@doe.xyz |
“John" | “Doe" | phone: 555-123-4567 |
“John" | “Doe" | phone: 555-666-7777 |
------------+-----------+---------------------+
SELECT sensor.measure.value, sensor.measure.unit
FROM sensor_events
WHERE sensor.type = ‘THERMOMETER’;
27
Schema-Less Data
•Cloudera Impala, Google BigQuery/Dremel
•Columnar storage and processing of semi-structured data
•But: full schema required up front!
•Snowflake introduces automatic type inference and columnar storage for
schema-less data (VARIANT)
•Frequently common paths are detected, projected out, and stored in separate (typed
and compressed) columns in table file
•Collect metadata on these columns for use by optimizer → pruning
•Independent for each micro-partition → schema evolution
28
Automatic Columnarization of
semi-structured data
> SELECT … FROM …
Semi-structured data
(e.g. JSON, Avro, XML)
Structured data
(e.g. CSV, TSV, …)
Native support
Loaded in raw form (e.g.
JSON, Avro, XML)
Optimized storage
Optimized data type, no fixed schema or
transformation required
Optimized SQL querying
Full benefit of database optimizations
(pruning, filtering, …)
29
Schema-Less Performance
30
ETL vs. ELT
•ETL = Extract-Transform-Load
•Classic approach: extract from source systems, run through some
transformations (perhaps using Hadoop), then load into relational DW
•ELT = Extract-Load-Transform
•Schema-later or schema-never: extract from source systems, leave in
or convert to JSON or XML, load into DW, transform there if desired
•Decouples information producers from information consumers
•Snowflake: ELT with speed and expressiveness of RDBMS
31
Time Travel and Cloning
•Previous versions of data
automatically retained
•Same metadata as Snapshot Isolation
•Accessed via SQL extensions
•UNDROP recovers from accidental
deletion
•SELECT AT for point-in-time selection
•CLONE [AT] to recreate past versions
> SELECT * FROM mytable
AT T0
New
data
Modified
data
T0 T1 T2
32
Security
•Encrypted data import and export
•Encryption of table data using NIST 800-57 compliant
hierarchical key management and key lifecycle
•Root keys stored in hardware security module (HSM)
•Integration of S3 access policies
•Role-based access control (RBAC) within SQL
•Two-factor authentication and federated authentication
33
Post-SIGMOD ‘16 Features
•Data sharing
•Serverless ingestion of data
•Reclustering of data
•Spark connector with pushdown
•Support for Azure Cloud
•Lots more connectors
34
Lessons Learned
•Building a relational DW was a controversial decision in 2012
•But turned out correct; Hadoop did not replace RDBMSs
•Multi-cluster, shared-data architecture game changer for org
•Business units can provision warehouses on-demand
•Fewer data silos
•Dramatically lower load times and higher load frequency
•Semi-structured extensions were a bigger hit than expected
•People use Snowflake to replace Hadoop clusters
35
Lessons Learned (2)
•SaaS model dramatically helped speed of development
•Only one platform to develop for
•Every user running the same version
•Bugs can be analyzed, reproduced, and fixed very quickly
•Users love “no tuning” aspect
•But creates continuous stream of hard engineering challenges…
•Core performance less important than anticipated
•Elasticity matters more in practice
36
Ongoing Challenges
•SaaS and multi-tenancy are big challenges
•Support tens of thousands of concurrent users, some of which do
weird things, and need protection for themselves.
•Metadata layer has become huge
•Categorizing and handling failures automatically is hard, but
•Automation is key to keeping operations lean
•Lots of work left to do
•SQL performance improvements, better skew handling etc.
•Cloud platform enables a slew of new classes of features.
37
Future work
•Advisors
•Materialized Views
•Stored procedures
•Data Lake support
•Streaming
•Time series
•Multi-cloud
•Global Snowflake
•Replication
38
Who We Are
•Founded: August 2012
•Mission in 2012: Build an enterprise data warehouse as a cloud
service
•HQ in downtown San Mateo (south of San Francisco), Engr
Office #2 in Seattle
•400+ employees, 80 engrs and hiring…
•Founders: Benoit Dageville, Thierry Cruanes, Marcin Zukowski
•CEO: Bob Muglia
•Raised $283M in 2018
39
Summary
•Snowflake is an enterprise-ready data warehouse as a service
•Novel multi-cluster, shared-data architecture
•Highly elastic and available
•Semi-structured and schema-less data at the speed of relational data
•Pure SaaS experience
•Rapidly growing user base and data volume
•Lots of challenging work left to do
40

Contenu connexe

Tendances

Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
12. oracle database architecture
12. oracle database architecture12. oracle database architecture
12. oracle database architectureAmrit Kaur
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and PerformanceMineaki Motohashi
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeKent Graziano
 
Data Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the CloudData Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the CloudMichael Rainey
 
Managing users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageManaging users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageNR Computer Learning Center
 
Presentation implementing oracle asm successfully
Presentation    implementing oracle asm successfullyPresentation    implementing oracle asm successfully
Presentation implementing oracle asm successfullyxKinAnx
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau ArchitectureVivek Mohan
 
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction Mark Ginnebaugh
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeSnowflake Computing
 
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, SuccessesSQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, SuccessesArnon Shimoni
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2inovex GmbH
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKnoldus Inc.
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMark Kromer
 

Tendances (20)

Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
12. oracle database architecture
12. oracle database architecture12. oracle database architecture
12. oracle database architecture
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and Performance
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 
Data Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the CloudData Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the Cloud
 
SSIS Presentation
SSIS PresentationSSIS Presentation
SSIS Presentation
 
Managing users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageManaging users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise Manage
 
Presentation implementing oracle asm successfully
Presentation    implementing oracle asm successfullyPresentation    implementing oracle asm successfully
Presentation implementing oracle asm successfully
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau Architecture
 
Tableau Server Basics
Tableau Server BasicsTableau Server Basics
Tableau Server Basics
 
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with Snowflake
 
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, SuccessesSQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
 

Similaire à 25 snowflake

TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...
TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...
TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...NetApp
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Kent Graziano
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople
 
Suning OpenStack Cloud and Heat
Suning OpenStack Cloud and HeatSuning OpenStack Cloud and Heat
Suning OpenStack Cloud and HeatQiming Teng
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
TECHunplugged Austin 2016
TECHunplugged Austin 2016TECHunplugged Austin 2016
TECHunplugged Austin 2016Chris Evans
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIBM_Info_Management
 
4. (mjk) extreme performance 2
4. (mjk) extreme performance 24. (mjk) extreme performance 2
4. (mjk) extreme performance 2Doina Draganescu
 
high performance databases
high performance databaseshigh performance databases
high performance databasesmahdi_92
 
Cloudstack for beginners
Cloudstack for beginnersCloudstack for beginners
Cloudstack for beginnersJoseph Amirani
 
MySQL: Know more about open Source Database
MySQL: Know more about open Source DatabaseMySQL: Know more about open Source Database
MySQL: Know more about open Source DatabaseMahesh Salaria
 
The move-to-hybrid-cloud-itsmf-april2015
The move-to-hybrid-cloud-itsmf-april2015The move-to-hybrid-cloud-itsmf-april2015
The move-to-hybrid-cloud-itsmf-april2015Eduserv
 
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!Webinar Slides: Geo-Distributed MySQL Clustering Done Right!
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!Continuent
 
Scalable relational database with SQL Azure
Scalable relational database with SQL AzureScalable relational database with SQL Azure
Scalable relational database with SQL AzureShy Engelberg
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Stay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithStay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithMarkus Eisele
 
RedisDay London 2018 - How Redis Powers BBC Online's Biggest Pages
RedisDay London 2018 - How Redis Powers BBC Online's Biggest PagesRedisDay London 2018 - How Redis Powers BBC Online's Biggest Pages
RedisDay London 2018 - How Redis Powers BBC Online's Biggest PagesRedis Labs
 
SQL Server 2019 CTP2.4
SQL Server 2019 CTP2.4SQL Server 2019 CTP2.4
SQL Server 2019 CTP2.4Gianluca Hotz
 
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014Ryusuke Kajiyama
 

Similaire à 25 snowflake (20)

TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...
TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...
TechTarget Event - Storage Architectures for the Modern Data Center - Howard ...
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
Suning OpenStack Cloud and Heat
Suning OpenStack Cloud and HeatSuning OpenStack Cloud and Heat
Suning OpenStack Cloud and Heat
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
TECHunplugged Austin 2016
TECHunplugged Austin 2016TECHunplugged Austin 2016
TECHunplugged Austin 2016
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_Capabilities
 
4. (mjk) extreme performance 2
4. (mjk) extreme performance 24. (mjk) extreme performance 2
4. (mjk) extreme performance 2
 
high performance databases
high performance databaseshigh performance databases
high performance databases
 
Cloudstack for beginners
Cloudstack for beginnersCloudstack for beginners
Cloudstack for beginners
 
MySQL: Know more about open Source Database
MySQL: Know more about open Source DatabaseMySQL: Know more about open Source Database
MySQL: Know more about open Source Database
 
The move-to-hybrid-cloud-itsmf-april2015
The move-to-hybrid-cloud-itsmf-april2015The move-to-hybrid-cloud-itsmf-april2015
The move-to-hybrid-cloud-itsmf-april2015
 
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!Webinar Slides: Geo-Distributed MySQL Clustering Done Right!
Webinar Slides: Geo-Distributed MySQL Clustering Done Right!
 
Scalable relational database with SQL Azure
Scalable relational database with SQL AzureScalable relational database with SQL Azure
Scalable relational database with SQL Azure
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Stay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithStay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolith
 
RedisDay London 2018 - How Redis Powers BBC Online's Biggest Pages
RedisDay London 2018 - How Redis Powers BBC Online's Biggest PagesRedisDay London 2018 - How Redis Powers BBC Online's Biggest Pages
RedisDay London 2018 - How Redis Powers BBC Online's Biggest Pages
 
SQL Server 2019 CTP2.4
SQL Server 2019 CTP2.4SQL Server 2019 CTP2.4
SQL Server 2019 CTP2.4
 
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014
 

Dernier

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 

25 snowflake

  • 1. 1 The Snowflake Elastic Data Warehouse SIGMOD 2016 and beyond Ashish Motivala, Jiaqi Yan
  • 2. 2 Our Product •The Snowflake Elastic Data Warehouse, or “Snowflake” •Built for the cloud •Multi-tenant, transactional, secure, highly scalable, elastic •Implemented from scratch (no Hadoop, Postgres etc.) •Currently runs on AWS and Azure •Serves tens of millions of queries per day over hundreds petabytes of data •1000+ active customers, growing fast
  • 3. 3 Talk Outline •Motivation and Vision •Storage vs. Compute or the Perils of Shared-Nothing •Architecture •Feature Highlights •Lessons Learned
  • 4. 4 Why Cloud? •Amazing platform for building distributed systems •Virtually unlimited, elastic compute and storage •Pay-per-use model (with strong economies of scale) •Efficient access from anywhere •Software as a Service (SaaS) •No need for complex IT organization and infrastructure •Pay-per-use model •Radically simplified software delivery, update, and user support •See “Lessons Learned”
  • 5. 5 Data Warehousing in the Cloud •Traditional DW systems pre-date the cloud •Designed for small, fixed clusters of machines •But to reap benefits of the cloud, software needs to be elastic! •Traditional DW systems rely on complex ETL (extract-transform-load) pipelines and physical tuning •Fundamentally assume predictable, slow-moving, easily categorized data from internal sources (OLTP, ERP, CRM…) •Cloud data increasingly stems from changing, external sources •Logs, click streams, mobile devices, social media, sensor data •Often arrives in schema-less, semi-structured form (JSON, XML, Avro)
  • 6. 6 What about Big Data? •Hive, Spark, BigQuery, Impala, Blink… •Batch and/or stream processing at datacenter scale •Various SQL’esque front-ends •Increasingly popular alternative for high-end use cases •Drawbacks •Lack efficiency and feature set of traditional DW technology •Security? Backups? Transactions? … •Require significant engineering effort to roll out and use
  • 7. 7 Our Vision for a Cloud Data Warehouse Data warehouse as a service No infrastructure to manage, no knobs to tune Multidimensional elasticity On-demand scalability data, queries, users All business data Native support for relational + semi-structured data
  • 8. 8 Shared-nothing Architecture •Tables are horizontally partitioned across nodes •Every node has its own local storage •Every node is only responsible for its local table partitions •Elegant and easy to reason about •Scales well for star-schema queries •Dominant architecture in data warehousing •Teradata, Vertica, Netezza…
  • 9. 9 The Perils of Coupling •Shared-nothing couples compute and storage resources •Elasticity •Resizing compute cluster requires redistributing (lots of) data •Cannot simply shut off unused compute resources → no pay-per-use •Limited availability •Membership changes (failures, upgrades) significantly impact performance and may cause downtime •Homogeneous resources vs. heterogeneous workload •Bulk loading, reporting, exploratory analysis
  • 10. 10 Multi-cluster, shared data architecture Databases Virtual Warehouse Virtual Warehouse ETL & Data Loading Virtual Warehouse Finance Virtual Warehouse Dev, Test, QA Dashboards Virtual Warehouse Marketing Data Science Clone • No data silos Storage decoupled from compute • Any data Native for structured & semi-structured • Unlimited scalability Along many dimensions • Low cost Compute on demand • Instantly cloning Isolate production from DEV & QA • Highly available 11 9’s durability, 4 9’s availability
  • 11. 11 Data Storage Multi-cluster Shared-data Architecture • All data in one place • Independently scale storage and compute • No unload / reload to shut off compute • Every virtual warehouse can access all data Cloud Services Transaction Manager Security Optimizer Infrastructure manager Authentication & access control Virtual Warehouse Cache Virtual Warehouse Cache Virtual Warehouse Cache Virtual Warehouse Cache Rest (JDBC/ODBC/Python) Metadata
  • 12. 12 Data Storage Layer •Stores table data and query results •Table is a set of immutable micro-partitions •Uses tiered storage with Amazon S3 at the bottom •Object store (key-value) with HTTP(S) PUT/GET/DELETE interface •High availability, extreme durability (11-9) •Some important differences w.r.t. local disks •Performance (sure…) •No update-in-place, objects must be written in full •But: can read parts (byte ranges) of objects •Strong influence on table micro-partition format and concurrency control
  • 13. 13 Table Files •Snowflake uses PAX [Ailamaki01] aka hybrid columnar storage •Tables horizontally partitioned into immutable mirco-partitions (~16 MB) •Updates add or remove entire files •Values of each column grouped together and compressed •Queries read header + columns they need
  • 14. 14 Other Data •Tiered storage also used for temp data and query results •Arbitrarily large queries, never run out of disk •New forms of client interaction •No server-side cursors •Retrieve and reuse previous query results •Metadata stored in a transactional key-value store (not S3) •Which table consists of which S3 objects •Optimizer statistics, lock tables, transaction logs etc. •Part of Cloud Services layer (see later)
  • 15. 15 Virtual Warehouse •warehouse = Cluster of EC2 instances called worker nodes •Pure compute resources •Created, destroyed, resized on demand •Users may run multiple warehouses at same time •Each warehouse has access to all data but isolated performance •Users may shut down all warehouses when they have nothing to run •T-Shirt sizes: XS to 4XL •Users do not know which type or how many EC2 instances •Service and pricing can evolve independent of cloud platform
  • 16. 16 Worker Nodes •Worker processes are ephemeral and idempotent •Worker node forks new worker process when query arrives •Do not modify micro-partitions directly but queue removal or addition of micro-partitions •Each worker node maintains local table cache •Collection of table files i.e. S3 objects accessed in past •Shared across concurrent and subsequent worker processes •Assignment of micro-partitions to nodes using consistent hashing, with deterministic stealing.
  • 17. 17 Execution Engine •Columnar [MonetDB, C-Store, many more] •Effective use of CPU caches, SIMD instructions, and compression •Vectorized [Zukowski05] •Operators handle batches of a few thousand rows in columnar format •Avoids materialization of intermediate results •Push-based [Neumann11 and many before that] •Operators push results to downstream operators (no Volcano iterators) •Removes control logic from tight loops •Works well with DAG-shaped plans •No transaction management, no buffer pool •But: most operators (join, group by, sort) can spill to disk and recurse
  • 18. 18 • Adaptive • Self-tuning • Do no harm! • Automatic • Default 18 Self Tuning & Self Healing Automatic Memory Management Automatic Workload Management Automatic Distribution Method Automatic Degree of Parallelism Automatic Fault Handling
  • 19. 19 • • • • • Example: Automatic Skew Avoidance Execution Plan 2 scan join scan filter 1 1 2
  • 20. 20 Cloud Services •Collection of services •Access control, query optimizer, transaction manager etc. •Heavily multi-tenant (shared among users) and always on •Improves utilization and reduces administration •Each service replicated for availability and scalability •Hard state stored in transactional key-value store
  • 21. 21 Concurrency Control •Designed for analytic workloads •Large reads, bulk or trickle inserts, bulk updates •Snapshot Isolation (SI) [Berenson95] •SI based on multi-version concurrency control (MVCC) •DML statements (insert, update, delete, merge) produce new table versions of tables by adding or removing whole files •Natural choice because table files on S3 are immutable •Additions and removals tracked in metadata (key-value store) •Versioned snapshots used also for time travel and cloning
  • 22. 22 Pruning •Database adage: The fastest way to process data? Don’t. •Limiting access only to relevant data is key aspect of query processing •Traditional solution: B+ -trees and other indices •Poor fit for us: random accesses, high load time, manual tuning •Snowflake approach: pruning •AKA small materialized aggregates [Moerkotte98], zone maps [Netezza], data skipping [IBM] •Per file min/max values, #distinct values, #nulls, bloom filters etc. •Use metadata to decide which files are relevant for a given query •Smaller than indices, more load-friendly, no user input required
  • 23. 23 Pure SaaS Experience •Support for various standard interfaces and third-party tools •ODBC, JDBC, Python PEP-0249 •Tableau, Informatica, Looker •Feature-rich web UI •Worksheet, monitoring, user management, usage information etc. •Dramatically reduces time to onboard users •Focus on ease-of-use and service exp. •No tuning knobs •No physical design •No storage grooming
  • 24. 24 Continuous Availability •Storage and cloud services replicated across datacenters •Snowflake remains available even if a whole datacenter fails •Weekly Online Upgrade •No downtime, no performance degradation! •Tremendous effect on pace of development and bug resolution time •Magic sauce: stateless services •All state is versioned and stored in common key-value store •Multiple versions of a service can run concurrently •Load balancing layer routes new queries to new service version, until old version finished all its queries
  • 25. 25 Semi-Structured and Schema-Less Data •Three new data types: VARIANT, ARRAY, OBJECT •VARIANT: holds values of any standard SQL type + ARRAY + OBJECT •ARRAY: offset-addressable collection of VARIANT values •OBJECT: dictionary that maps strings to VARIANT values •Like JavaScript objects or MongoDB documents •Self-describing, compact binary serialization •Designed for fast key-value lookup, comparison, and hashing •Supported by all SQL operators (joins, group by, sort…)
  • 26. 26 Post-relational Operations •Extraction from VARIANTs using path syntax •Flattening (pivoting) a single OBJECT or ARRAY into multiple rows SELECT p.contact.name.first AS "first_name", p.contact.name.last AS "last_name", (f.value.type || ': ' || f.value.contact) AS "contact" FROM person p, LATERAL FLATTEN(input => p.contact) f; ------------+-----------+---------------------+ first_name | last_name | contact | ------------+-----------+---------------------+ “John" | “Doe" | email: john@doe.xyz | “John" | “Doe" | phone: 555-123-4567 | “John" | “Doe" | phone: 555-666-7777 | ------------+-----------+---------------------+ SELECT sensor.measure.value, sensor.measure.unit FROM sensor_events WHERE sensor.type = ‘THERMOMETER’;
  • 27. 27 Schema-Less Data •Cloudera Impala, Google BigQuery/Dremel •Columnar storage and processing of semi-structured data •But: full schema required up front! •Snowflake introduces automatic type inference and columnar storage for schema-less data (VARIANT) •Frequently common paths are detected, projected out, and stored in separate (typed and compressed) columns in table file •Collect metadata on these columns for use by optimizer → pruning •Independent for each micro-partition → schema evolution
  • 28. 28 Automatic Columnarization of semi-structured data > SELECT … FROM … Semi-structured data (e.g. JSON, Avro, XML) Structured data (e.g. CSV, TSV, …) Native support Loaded in raw form (e.g. JSON, Avro, XML) Optimized storage Optimized data type, no fixed schema or transformation required Optimized SQL querying Full benefit of database optimizations (pruning, filtering, …)
  • 30. 30 ETL vs. ELT •ETL = Extract-Transform-Load •Classic approach: extract from source systems, run through some transformations (perhaps using Hadoop), then load into relational DW •ELT = Extract-Load-Transform •Schema-later or schema-never: extract from source systems, leave in or convert to JSON or XML, load into DW, transform there if desired •Decouples information producers from information consumers •Snowflake: ELT with speed and expressiveness of RDBMS
  • 31. 31 Time Travel and Cloning •Previous versions of data automatically retained •Same metadata as Snapshot Isolation •Accessed via SQL extensions •UNDROP recovers from accidental deletion •SELECT AT for point-in-time selection •CLONE [AT] to recreate past versions > SELECT * FROM mytable AT T0 New data Modified data T0 T1 T2
  • 32. 32 Security •Encrypted data import and export •Encryption of table data using NIST 800-57 compliant hierarchical key management and key lifecycle •Root keys stored in hardware security module (HSM) •Integration of S3 access policies •Role-based access control (RBAC) within SQL •Two-factor authentication and federated authentication
  • 33. 33 Post-SIGMOD ‘16 Features •Data sharing •Serverless ingestion of data •Reclustering of data •Spark connector with pushdown •Support for Azure Cloud •Lots more connectors
  • 34. 34 Lessons Learned •Building a relational DW was a controversial decision in 2012 •But turned out correct; Hadoop did not replace RDBMSs •Multi-cluster, shared-data architecture game changer for org •Business units can provision warehouses on-demand •Fewer data silos •Dramatically lower load times and higher load frequency •Semi-structured extensions were a bigger hit than expected •People use Snowflake to replace Hadoop clusters
  • 35. 35 Lessons Learned (2) •SaaS model dramatically helped speed of development •Only one platform to develop for •Every user running the same version •Bugs can be analyzed, reproduced, and fixed very quickly •Users love “no tuning” aspect •But creates continuous stream of hard engineering challenges… •Core performance less important than anticipated •Elasticity matters more in practice
  • 36. 36 Ongoing Challenges •SaaS and multi-tenancy are big challenges •Support tens of thousands of concurrent users, some of which do weird things, and need protection for themselves. •Metadata layer has become huge •Categorizing and handling failures automatically is hard, but •Automation is key to keeping operations lean •Lots of work left to do •SQL performance improvements, better skew handling etc. •Cloud platform enables a slew of new classes of features.
  • 37. 37 Future work •Advisors •Materialized Views •Stored procedures •Data Lake support •Streaming •Time series •Multi-cloud •Global Snowflake •Replication
  • 38. 38 Who We Are •Founded: August 2012 •Mission in 2012: Build an enterprise data warehouse as a cloud service •HQ in downtown San Mateo (south of San Francisco), Engr Office #2 in Seattle •400+ employees, 80 engrs and hiring… •Founders: Benoit Dageville, Thierry Cruanes, Marcin Zukowski •CEO: Bob Muglia •Raised $283M in 2018
  • 39. 39 Summary •Snowflake is an enterprise-ready data warehouse as a service •Novel multi-cluster, shared-data architecture •Highly elastic and available •Semi-structured and schema-less data at the speed of relational data •Pure SaaS experience •Rapidly growing user base and data volume •Lots of challenging work left to do
  • 40. 40