Introduction to the Snowflake Data Warehouse and Its Architecture for a Big Data Company. Centralized data management. Snowpipe and the COPY INTO command for data loading. Stream loading and batch processing.
2. Introduction
• Snowflake is a true SaaS offering.
• No Hardware to select
• No Software to install, configure or manage.
• Ongoing maintenance, management, upgrades and tuning are handled by SF.
• No private cloud deployment; runs on public clouds only (AWS / Azure / GCP).
• Not a traditional relational database (PK / FK constraints are not enforced).
• Insert, Update, Delete, Views, Materialized Views, ACID Transactions.
• Analytical Aggregation, Windowing and Hierarchical Queries.
• Query Language -> SQL (SnowSQL)
• DDL/DML
• SQL Functions
• UDFs / Stored Procedures (JavaScript); a small sketch follows this list.
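As an illustration of the JavaScript UDF support above, here is a minimal sketch (the function double_it and its logic are hypothetical, not from the slides). Note that inside the JS body the SQL argument is referenced in upper case:

create or replace function double_it(x float)
returns float
language javascript
as
$$
  // SQL arguments are exposed to JavaScript in upper case
  return X * 2;
$$;

select double_it(21);  -- returns 42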
3. Integration Support
• Data Integration (Informatica, Talend)
• Self-service BI Tools (Tableau, QlikView)
• Big Data Tools (Kafka, Spark, Databricks etc.)
• JDBC/ODBC Drivers
• Native Language Connectors (Python / Go / Node.js)
• SQL Interface & Client
• Snowflake Web Interface
• Snowflake CLI + DBeaver
4. Snowflake Architecture
• Separation of storage and compute.
• High-level architecture: a services layer on top of decoupled storage and compute layers.
• Compute uses configurable VMs (virtual warehouses); see the sketch after this list.
• Data is stored in S3; you pay only for the storage used.
• DDL/DML queries are billed as compute.
• Pricing covers only what you use:
• Storage billed separately (per TB/GB).
• Query processing billed separately (per compute-minute).
• Service Layer | Compute Layer | Storage Layer
• The Service Layer comes at a fixed price and covers:
• Metadata
• Security
• Optimiser
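To make the storage/compute split concrete, here is a minimal sketch of creating and using a virtual warehouse (the warehouse, database, and table names are hypothetical). Storage is billed regardless; compute is billed only while the warehouse is running:

-- auto_suspend stops compute billing when the warehouse sits idle
create warehouse if not exists analytics_wh
  warehouse_size = 'XSMALL'
  auto_suspend = 60    -- suspend after 60 seconds of inactivity
  auto_resume = true;  -- resume automatically when a query arrives

use warehouse analytics_wh;
select count(*) from mydb.public.mytable;  -- billed as compute time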
5. What Makes SF Unique?
• Scalability (Storage and Compute)
• Few knobs to tune in the database.
• No indexing needed.
• No manual performance tuning.
• No partitioning.
• No physical storage design.
• Security, Data Governance and Protection.
• Simplification and Automation
• Balance and Scale.
6. Virtual Warehouses
• Warehouse instances are cloud VMs (e.g., EC2 instances on AWS).
• Normally just called a warehouse in SF.
• You never interact with the underlying instances directly.
• Sizes
• X-Small: single node -> light analytical tasks
• Small: two nodes
• Medium: four nodes
• Large: eight nodes -> data loading
• X-Large: sixteen nodes -> high-performance query execution
• Concurrent queries can execute on a warehouse.
• Additional queries are queued and wait until resources free up.
• With a multi-cluster warehouse we can avoid this queuing; a sketch follows below.
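A minimal sketch of a multi-cluster warehouse that absorbs concurrency spikes instead of queuing queries (the name bi_wh is hypothetical; multi-cluster warehouses require Enterprise edition or above):

create warehouse if not exists bi_wh
  warehouse_size = 'MEDIUM'
  min_cluster_count = 1
  max_cluster_count = 3         -- extra clusters start when queries begin to queue
  scaling_policy = 'STANDARD';  -- favors starting a cluster over queuing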
7. Micro Partitions
• Tables are automatically divided into micro-partitions.
• Contiguous units of storage (50 MB to 500 MB of uncompressed data).
• The actual stored size is much smaller than that due to compression.
• SF determines the most efficient compression algorithm for each column.
• The columnar scanning feature gives quick responses.
8. How Does Micro-Partitioning Work?
• Data is read from S3; the high availability and durability of S3 are leveraged here.
• The S3 API allows reading parts of files.
• Files are broken into small partitions (micro-partitions).
• The data is reorganized to make it columnar (column values within a partition are stored together).
• Only the column values are compressed, each column individually.
• A header with metadata (column offsets) is added to each micro-partition.
• Micro-partitions are stored in S3 as files.
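Micro-partitioning is fully automatic, but its effect can be inspected. A small sketch using Snowflake's SYSTEM$CLUSTERING_INFORMATION function (the table and column names are hypothetical):

-- returns JSON describing partition counts and overlap depth for the column
select system$clustering_information('mytable', '(order_date)');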
10. Data Loading
Based on the volume and frequency, there are mainly two options:
1. Bulk Loading
2. Continuous Loading
Bulk Loading
- Uses the COPY INTO command.
- Loads batch data from files in cloud storage, or copies (stages) local files first.
- Relies on a user-provided virtual warehouse.
- Supports transformations during a load (a sketch follows this list):
• Column ordering
• Column omission, etc.
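A minimal sketch of a transforming COPY, reordering and omitting columns by selecting positional fields from the staged files (the table, stage, and column names are hypothetical):

copy into mytable (id, amount)
from (select t.$1, t.$3 from @my_stage t)  -- keep fields 1 and 3, skip field 2
file_format = (type = 'CSV');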
Continuous Loading
- Uses Snowpipe; a sketch follows below.
- Designed to load small volumes of data continuously.
- Loads within minutes after files are added.
- Uses compute resources provided by SF (no user-managed warehouse).
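A minimal Snowpipe sketch (the pipe, table, and stage names are hypothetical; auto_ingest also requires S3 event notifications to be configured on the bucket):

create or replace pipe my_pipe
  auto_ingest = true  -- ingest new files as S3 event notifications arrive
as
  copy into mytable
  from @my_s3_stage
  file_format = (type = 'CSV');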
Other than that, SF provides several connectors for loading data, e.g. the SF Connector for Kafka.
11. Preparing to Load Data from S3
• Check the file type for the data load: JSON, Avro, ORC, etc.
STEP 1: Create a stage
STEP 2: Execute the COPY command over the stage
• Instead of embedding credentials, you can use AWS IAM role ARNs for authentication.
• LOAD_HISTORY gives you the history of data loading (a query sketch follows the example below).
create or replace stage my_s3_stage
  url = 's3://mybucket/encrypted_files/'
  credentials = (aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
  encryption = (master_key = 'eSxX0jzYfIamtnBKOEOwq80Au6NbSgPH5r4BDDwOaO8=')
  file_format = my_csv_format;
copy into mytable
from @my_s3_stage
pattern = '.*sales.*\.csv';
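To verify what was loaded, the LOAD_HISTORY view mentioned above can be queried; a small sketch (the table name is hypothetical):

select file_name, row_count, status, last_load_time
from information_schema.load_history
where table_name = 'MYTABLE'
order by last_load_time desc;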
13. USE CASE: Data Monitoring
• The client asked for a separate data-monitoring tool.
• Decoupled the monitoring database from SF; used a MySQL DB instead.
• A UI tool visualizes the metadata for each day/month.
15. COMPARE
ELASTICSEARCH
- Cost is high.
- Management is not easy.
- Development is not easy.
AWS Neptune
- Cost is high.
- Not an all-in-one package.
- In the end we did not see any graph requirements.
CASSANDRA
- Key constraints (queries are restricted by partition key).
- Cannot execute arbitrary DML.