Taming The Data Load/Unload in Snowflake
Sample Code and Best Practice
(Faysal Shaarani)

Loading Data Into Your Snowflake Database(s) From Raw Data Files

[1. CREATE YOUR APPLICABLE FILE FORMAT]:
The syntax in this section allows you to create your CSV file format if you are loading
data from CSV files. Please note that the backslash escapes are coded for use from the
sfsql command line, not from the UI. If this FILE FORMAT is to be used from the
Snowflake UI, the \\ occurrences should be changed to \.
[CSV FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.CSV
-- Comma field delimiter and \\n record terminator
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = false
ERROR_ON_COLUMN_COUNT_MISMATCH = true
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');
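To confirm that the file format was created with the options you intended, you can inspect it before referencing it in a COPY command. Below is a minimal sketch using the standard SHOW and DESCRIBE commands:
-- Inspect the file format definition before using it in a COPY command.
SHOW FILE FORMATS LIKE 'CSV' IN SCHEMA DEMO_DB.PUBLIC;
DESC FILE FORMAT DEMO_DB.PUBLIC.CSV;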
	
The syntax in the section below allows you to create a JSON file format if you are loading
JSON data into Snowflake.
	
[JSON FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.JSON
TYPE ='JSON'
COMPRESSION = 'AUTO'
ENABLE_OCTAL = false
ALLOW_DUPLICATE = false
STRIP_OUTER_ARRAY = false;
[2. CREATE YOUR DESTINATION TABLE]:
Pre-create your table before loading the CSV data into that table.
CREATE OR REPLACE TABLE exhibit
(Id STRING
,Title STRING
,Year NUMBER
,Collection_Id STRING
,Timeline_Id STRING
,Depth INT);
CREATE OR REPLACE TABLE timelines
(Id STRING
,Title VARCHAR
,Regime STRING
,FromYear NUMBER
,ToYear NUMBER
,Height NUMBER
,Timeline_Id STRING
,Collection_Id STRING
,ForkNode NUMBER
,Depth NUMBER
,SubtreeSize NUMBER);
	
If you are using data files that have been staged in the Snowflake customer account S3
bucket assigned to your company:
	
Run COPY Command To Load Data From Raw CSV Files
Load the data from your CSV file into the pre-created EXHIBIT table. If a data error is
encountered on any of the records, the load continues with whatever it can. If you do not
specify ON_ERROR, the default is to skip the file on the first error it encounters on any of
the records in that file. The example below loads whatever it can, skipping any bad
records in the file.
	
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name regular expression shown in the sample code below:
	
COPY INTO exhibit
FROM @~/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
To list the files under the CleanData subdirectory of the @~ staging area for your
Snowflake Beta Customer account, use the following command from the sfsql command
line:
ls @~/CleanData
To list all files whose names match the regular expression specified in the PATTERN
parameter, use the command below:
ls @~/CleanData PATTERN='.*0[1-2].txt.gz';
	
Verify that the data was loaded successfully into the EXHIBIT table.
	
	 Select * from EXHIBIT;
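In addition to selecting from the table, you may want to see which staged files were loaded and how many rows each contributed. Below is a hedged sketch using the INFORMATION_SCHEMA LOAD_HISTORY view; the view's availability and exact column names may differ by Snowflake version:
-- A sketch, assuming the LOAD_HISTORY Information Schema view is available:
SELECT file_name, status, row_count, first_error_message
FROM demo_db.information_schema.load_history
WHERE table_name = 'EXHIBIT'
ORDER BY last_load_time DESC;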
Run COPY Command to Load/Parse JSON Data Raw Staged Files
[1. Upload JSON File Into The Customer Account's S3 Staging Area]
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @~/json/;
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION='@~'
STAGE_FILE_FORMAT=demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @~/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
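The bracketed-index syntax above reads a single element of the root array. If you want one row per array element instead, Snowflake's LATERAL FLATTEN table function can expand the array. Below is a minimal sketch, assuming the same json_data_table_ext table and the JSON shape shown above:
-- Expand the root array into one row per element, then project a few attributes.
SELECT person.value:fullName::string AS full_name,
person.value:age::number AS age,
person.value:kind::string AS kind
FROM public.json_data_table_ext,
LATERAL FLATTEN(input => json_data:root) person;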
If you are using data files that have been staged on your own company’s Amazon S3
bucket:
Run COPY Command To Load Data From Raw CSV Files
The syntax below is needed to create a stage ONLY if you are using your own company's
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
Create the staging database object.
	
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.SAMPLE_STAGE
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
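Once the stage object exists, you can confirm its definition and list the files it can see, the same way the ls command is used against the @~ area elsewhere in this document. A minimal sketch:
-- Confirm the stage definition and list the files it points to.
DESC STAGE DEMO_DB.PUBLIC.SAMPLE_STAGE;
ls @DEMO_DB.PUBLIC.SAMPLE_STAGE;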
	
Load the data from your CSV file into the pre-created EXHIBIT table.
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name regular expression shown in the sample code below.
	
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
Verify that the data was loaded successfully into the EXHIBIT table.
Select * from EXHIBIT;
Run COPY Command to Load/Parse JSON Data Raw Staged Files
	
[1. Create a Stage Database Object Pointing to The Location of The JSON File]
The syntax below is needed to create a stage ONLY if you are using your own company's
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.STAGE_JSON
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
Place your file in the staging location defined by the above command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @stage_JSON/json/;
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION=demo_db.public.stage_json
STAGE_FILE_FORMAT= demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @stage_json/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
	
Using Snowflake to Validate Your Data Files
	
In this section, we will go over validating the raw data files before performing the actual data
load. To illustrate this, we will attempt to load raw data files that contain errors, intentionally
making the load into Snowflake fail. The VALIDATION_MODE option on the COPY command
processes the data without loading it into the destination table in Snowflake.
In the following example, the PUT command stages the exhibit*.txt and timelines.txt files
at the default S3 staging location set up for the Beta Customer Account in Snowflake, and also
illustrates how files can be loaded from a sub-directory below the root staging directory of a
Snowflake Beta Customer Account.
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit*.txt @~/errorsExhibit/;
	
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
source | target | source_size | target_size | source_compression | target_compression | status | details |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
exhibit_03.txt | exhibit_03.txt.gz | 8353 | 3734 | NONE | GZIP | UPLOADED | |
exhibit_01.txt | exhibit_01.txt.gz | 14733 | 6207 | NONE | GZIP | UPLOADED | |
exhibit_02.txt | exhibit_02.txt.gz | 14730 | 6106 | NONE | GZIP | UPLOADED | |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
3 rows in result (first row: 1.501 sec; total: 1.504 sec)	
Below are three possible raw data validation scenarios and sample code:
	
1. The following example allows previewing 10 records from the first raw data
file, exhibit_01.txt. This file does not have any errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_01.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_10_rows';
2. The following example simulates the scenario of having an extra delimiter in
a record and shows how the errors would be displayed.
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
	
3. The following example simulates the scenario of having a column value of the
wrong data type and shows what the error output looks like after running the
COPY command below:
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
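VALIDATION_MODE also accepts a RETURN_ALL_ERRORS option, which additionally reports errors in files that were only partially loaded in an earlier run with ON_ERROR='continue'. Below is a hedged sketch against the same staged files:
COPY INTO exhibit
FROM @~/errorsExhibit
PATTERN='.*0[1-3].txt.gz'
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_all_errors';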
	
Using Snowflake to Unload Your Snowflake Data to Files
	
To create data files from a table in the Snowflake database, use the command below:
[COPY INTO S3 FROM THE EXHIBIT TABLE]
COPY INTO @~/giant_file/ from exhibit;
OR to overwrite the existing files in the same directory, use the OVERWRITE option as in the
command below:
COPY INTO @~/giant_file/ from exhibit overwrite=true;
Please note that, by default, Snowflake unloads the data from the table into multiple files of
approximately 16 MB each. If you want your data to be unloaded to a single file, you need to
use the SINGLE option on the COPY command, as in the example below:
COPY INTO @~/giant_file/ from exhibit
Single=true
overwrite=true;
Please note that AWS S3 has a 5 GB limit on the file size you can stage on S3. You can
use the optional MAX_FILE_SIZE option (in bytes) to change the Snowflake default file size.
Use the command below if you want to specify a bigger or smaller file size than the Snowflake
default, as long as you do not exceed the AWS S3 maximum file size. For example, the
command below unloads the data in the EXHIBIT table into files of roughly 50 MB each:
COPY INTO @~/giant_file/ from exhibit
max_file_size= 50000000 overwrite=true;
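The unload form of COPY also accepts an inline FILE_FORMAT and, in newer Snowflake versions, a HEADER option, if you want the output files delimited differently or to include a header row. Below is a hedged sketch; the HEADER option may not be available in all versions:
COPY INTO @~/giant_file/ FROM exhibit
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' COMPRESSION = 'GZIP')
HEADER = true
OVERWRITE = true;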
Using Snowflake to Split Your Data Files Into Smaller Files
If you are using data files that have been staged in the Snowflake customer account S3
bucket assigned to your company:
When loading data into Snowflake, it is recommended to split the raw data into as many
files as possible to maximize the parallelization of the data loading process and thus
complete the data load in the shortest amount of time possible. If your raw data is in one
raw data file, you can use Snowflake to split that large data file into multiple files before
loading the data into Snowflake. Below are the steps for achieving this:
• Place the Snowflake sample giant_file from your local machine's directory into the
@~/giant_file/ S3 bucket using the following command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/giant_file_sample.csv.gz @~/giant_file/;
	
• Create a single-column file format for examining the data in the data file.
CREATE OR REPLACE FILE FORMAT single_column_rows
TYPE='CSV'
SKIP_HEADER=0
RECORD_DELIMITER='\\n'
TRIM_SPACE=false
DATE_FORMAT='AUTO'
TIMESTAMP_FORMAT='AUTO'
FIELD_DELIMITER='NONE'
FIELD_OPTIONALLY_ENCLOSED_BY='NONE'
ESCAPE_UNENCLOSED_FIELD='\\134'
NULL_IF=('')
COMMENT='copy each line into single-column row';
	
• Create an external table in the Snowflake database specifying the staging area and file
format to be used:
	
CREATE OR REPLACE TABLE GiantFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/giant_file/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='GiantFile preview table';
	
• Run the COPY command below to create small files while limiting the file size to 2 MB.
This splits the data from a single original data file across multiple small files of up to
2 MB each.
	
COPY INTO @~/giant_file_parts/
FROM (SELECT * FROM
table(stage(GiantFile_ext ,
pattern => '.*giant_file_sample.csv.gz')))
max_file_size= 2000000;
• Verify the files of the data you unloaded:
ls @~/giant_file_parts;
To place a copy of the S3 giant file parts onto your local machine after they have been split
into several files of 2 MB each, use the below command:
get @~/giant_file_parts/ file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/;
To remove all the files at the staging bucket location you want to clean up, use the following
command:
	
remove @~/giant_file_parts;
	
To remove a specific set of files from the giant file directory whose names match a regular
expression (e.g., remove all files whose names end with .csv.gz), use the following
command:
remove @~/giant_file pattern='.*.csv.gz';
	
	
Recommended Approach to Debug and Resolve Data Load Issues
Related to Data Problems
	
[WHAT IF YOU HAVE DATA FILES THAT HAVE PROBLEMS]:
This section suggests a recommended flow for iterating through fixing the data file and
loading the data into Snowflake via the COPY command. Snowflake’s COPY command
syntax supports several parameters that are helpful for debugging or bypassing bad data files
that cannot be loaded due to various data problems, which may need to be fixed before the
data file can be read and loaded.
[FIRST PASS] LOAD DATA WITH ONE OF THE THREE OPTIONS BELOW:
SKIPPING BAD DATA FILES:
1. Attempt to load with the ON_ERROR = 'SKIP_FILE' error handling parameter. With
this error handling parameter setting, files with errors will be skipped and will not be
loaded.
	
[ON_ERROR='SKIP_FILE']
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file';
OR
	
SKIPPING BAD DATA FILES IF ERRORS EXCEED A SPECIFIED LIMIT:
1. Attempt to load with more tolerant error handling
ON_ERROR=SKIP_FILE_[error_limit]. With this option for error handling, files with
errors could be partially loaded as long as the number of errors does not exceed
the stated limit. The file is skipped when the number of errors exceeds the stated
error limit.
[ON_ERROR='SKIP_FILE_[error_limit]']
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file_10';
OR
	
PERFORM PARTIAL LOAD FROM THE BAD DATA FILES:
1. Attempt to load with more tolerant error handling using ON_ERROR='CONTINUE'.
With this option for error handling, files with errors could be partially loaded.
[ON_ERROR='CONTINUE']
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='continue';
	
[SECOND PASS] RETURN THE DATA ERRORS:
Validate the files that were skipped or failed to load during the first pass. This time,
attempt to load the bad data files with VALIDATION_MODE='RETURN_ERRORS'. This
allows the COPY command to return the list of errors within each data file and the position of
those errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
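As an alternative to re-running COPY in validation mode, Snowflake also provides a VALIDATE table function that returns the rejected records from a previous COPY job by its query ID. Below is a hedged sketch; the function and the '_last' shorthand may not be available in all versions:
-- Return the records rejected by the most recent COPY into EXHIBIT in this session.
SELECT * FROM table(validate(exhibit, job_id => '_last'));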
[FIX THE BAD RECORDS]
A. Download the bad data files containing the bad records from the staging area to
your local drive:
get @~/errorsExhibit/exhibit_02.txt.gz file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData;
[PREVIEW RECORDS FROM YOUR BAD DATA FILE(S)]
To get visibility into the data file and a few of its records, use an external
table and read a few records from the data file to see what a sample record looks like.
CREATE OR REPLACE TABLE PreviewFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/errorsExhibit/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='Bad data file preview table';
SELECT *
FROM table(stage(PreviewFile_ext ,
pattern => '.*exhibit_02.txt.gz')) LIMIT 10;
Fix the bad records manually and write them to a new data file, or regenerate a new
data file from the data source containing only the bad records that did not load (as
applicable).
B. Upload the fixed bad data file(s) into the staging area for re-loading and attempt
reloading from that fixed file:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit_02.txt.gz @~/errorsExhibit/;
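C. After the fixed file is staged, re-run the COPY command against it. Below is a hedged sketch; the FORCE option is shown because Snowflake's load metadata may otherwise skip a file name it considers already loaded (for example, after an earlier partial load with ON_ERROR='continue'). Note that FORCE reloads the file regardless of that metadata, so remove or deduplicate any rows already loaded from the earlier attempt, as applicable.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
FORCE = true;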