© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
John Loughlin, Solution Architect
Eric Ferreira, Principal Database Engineer
July 22, 2015
Best Practices: Amazon Redshift
Migration and Loading Data
Amazon Redshift – Resources
Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
Agenda
Common Migration Patterns
Copy Command
Automation Options
Near real time loading
ETL Options with Partners
Common Migration Patterns
Common Migration Patterns
Data from a variety of relational OLTP systems
structure lends itself to SQL schemas
Data from logs, devices, sensors…
data is less structured
Structured Data Loading
Data is often being loaded into another warehouse
existing ETL process
Temptation is to ‘lift & shift’ workload.
Resist temptation. Instead consider:
What do I really want to do?
What do I need?
Ingesting Less Structured Data
Some data does not lend itself to a relational schema
Common pattern is to use EMR:
impose structure
import into Redshift
Other solutions are often homegrown scripting
applications.
Loading Data
Load to an empty Redshift database.
Load changes captured in the source system to Redshift
Truncate and Load
This is by far the easiest option:
Move the data to Amazon Simple Storage Service
multi-part upload
import/export service
direct connect
COPY the data into Redshift, a table at a time.
Load Changes
Identify changes in source systems
Move data to Amazon S3
Load changes
‘Upsert process’
Partner ETL tools
Partner ETL
Amazon Redshift is supported by a variety of ETL vendors
Many simplify the process of data loading
Visit http://aws.amazon.com/redshift/partners
There are a variety of vendors offering a free trial of their
products, allowing you to evaluate and choose the one that
suits your needs.
Upsert
The goal is to insert new rows and update changed rows in
Redshift.
Load data into a temporary staging table
Join the staging with production and delete the common
rows.
Copy the new data into the production table.
See “Updating and Inserting New Data” in the Amazon Redshift
Database Developer Guide
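The staging-table steps above can be sketched as a small Python helper that only assembles the SQL text and never connects to a cluster. The table names, key column, S3 path, and IAM role are hypothetical placeholders. Using `IAM_ROLE` for authorization is an assumption; the deck's own example uses `CREDENTIALS`.

```python
def build_upsert_statements(target, staging, key, s3_path, iam_role):
    """Return the SQL statements for a delete-then-insert upsert.

    All identifiers and the IAM role are caller-supplied placeholders;
    nothing here is validated or executed.
    """
    return [
        # 1. Staging table with the same columns as the production table.
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        # 2. Bulk-load the new and changed rows into staging.
        f"COPY {staging} FROM '{s3_path}' IAM_ROLE '{iam_role}';",
        # 3. Delete the common rows and insert the staged rows in one transaction.
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "END;",
        # 4. Clean up the staging table.
        f"DROP TABLE {staging};",
    ]
```

Wrapping the delete and insert in a single transaction is a design choice in this sketch, so that readers of the production table do not see the rows missing between the two steps.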
Checkpoint
We’ve talked about common migration patterns
Sources of data and data structure
Methods of getting data to AWS
Options for loading data
COPY
Amazon Redshift Architecture
Leader Node
• SQL endpoint, JDBC/ODBC
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3
• Load from Amazon DynamoDB or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
A Closer Look
Each node is split into slices
• One slice per core
Each slice is allocated
memory, CPU, and disk space
Each slice processes a piece
of the workload in parallel
COPY command
COMPUPDATE ON when running on an empty table
Use the COPY command.
Each slice can load one file at a time.
Partition input files so every slice can load in parallel.
Use a Manifest file.
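One way to partition input so every slice has a file to load is to deal rows out round-robin into as many parts as the cluster has slices. This is a minimal sketch: the function name is hypothetical, and the slice count is something you would look up for your own cluster.

```python
def split_round_robin(rows, num_files):
    """Distribute rows across num_files lists, one list per output file."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    parts = [[] for _ in range(num_files)]
    for i, row in enumerate(rows):
        # Row i goes to part i mod num_files, so part sizes differ by at most one.
        parts[i % num_files].append(row)
    return parts
```

Each part would then be written to its own object in Amazon S3 and listed in the manifest.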
Use multiple input files to maximize throughput
Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
Use multiple input files to maximize throughput
Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
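The arithmetic behind these two slides can be written out as a simplified model. It assumes a node's load rate is divided evenly across its slices and that each slice ingests at most one file at a time; real throughput will vary.

```python
def effective_throughput_mb_s(node_mb_s, slices_per_node, files):
    """Estimate the load rate of one node for a given number of input files."""
    per_slice = node_mb_s / slices_per_node
    # Slices without a file to read sit idle.
    busy_slices = min(files, slices_per_node)
    return per_slice * busy_slices
```

With the deck's figures (100 MB/s per node, 16 slices), one file gives `effective_throughput_mb_s(100, 16, 1)`, which is 6.25, and 16 files give the full 100.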
Primary keys and manifest files
Amazon Redshift doesn’t enforce primary key constraints
• If you load data multiple times, Amazon Redshift won’t complain
• If you declare primary keys in your DDL, the optimizer will
expect the data to be unique
Use manifest files to control exactly what is loaded and
how to respond if input files are missing
• Define a JSON manifest on Amazon S3
• Ensures the cluster loads exactly what you want
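A manifest is a JSON document with an `entries` list, where each entry has a `url` and a `mandatory` flag. The sketch below builds one from a list of S3 URLs; the function name and the example bucket are hypothetical.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return COPY manifest JSON listing exactly the files to load."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)
```

Setting `mandatory` to true asks COPY to fail when a listed file is missing rather than skip it silently. The manifest is uploaded to Amazon S3 and referenced by a COPY command that includes the `MANIFEST` option.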
Analyze sort/dist key columns after every load
Amazon Redshift’s query
optimizer relies on up-to-date
statistics
Maximize performance by
updating stats on sort/dist key
columns after every load
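As an illustration, a post-load step could emit an `ANALYZE` statement limited to the sort and distribution key columns. The helper below only formats the statement; the table and column names are placeholders.

```python
def build_analyze(table, key_columns=()):
    """Return an ANALYZE statement, optionally limited to the given columns."""
    if not key_columns:
        return f"ANALYZE {table};"
    return f"ANALYZE {table} ({', '.join(key_columns)});"
```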
Automatic compression
Better performance, lower costs
COPY samples data automatically when loading into an empty
table
• Samples up to 100,000 rows and picks optimal encoding
If you have a regular ETL process and you use temp tables or
staging tables, turn off automatic compression
• Use analyze compression to determine the right encodings
• Bake those encodings into your DDL
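A sketch of the two statements this slide implies: `ANALYZE COMPRESSION` run once to get encoding recommendations, and a recurring COPY with `COMPUPDATE OFF` so staging loads skip the sampling pass. The table name, S3 path, and IAM role are hypothetical, and `IAM_ROLE` is an assumed authorization method.

```python
def build_analyze_compression(table):
    """Statement that reports suggested column encodings for a loaded table."""
    return f"ANALYZE COMPRESSION {table};"


def build_staging_copy(table, s3_path, iam_role):
    """COPY statement with automatic compression analysis turned off."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' COMPUPDATE OFF;"
    )
```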
Checking STL_LOAD_COMMITS
SELECT query, trim(filename) as filename, curtime, status
FROM stl_load_commits
WHERE filename LIKE '%table name%'
ORDER BY query;
After the load operation is complete, query the
STL_LOAD_COMMITS system table to verify that the
expected files were loaded.
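Once the query above has returned its rows, the verification itself is a set difference. This sketch assumes each row is a tuple in the query's column order (query, filename, curtime, status) and that the expected file list comes from your manifest; the function name is hypothetical.

```python
def missing_files(expected, loaded_rows):
    """Return the expected files that do not appear in the load-commit rows."""
    # Column 1 is the filename; strip padding in case it was not trimmed.
    loaded = {row[1].strip() for row in loaded_rows}
    return sorted(set(expected) - loaded)
```

An empty result means every expected file shows up in STL_LOAD_COMMITS.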
COPY and 18 inserts
COPY country FROM
's3://…country.txt' CREDENTIALS …
1.57s
then
insert into country (country_name)
values ('Slovakia'),('Slovenia'),('South Africa'),('South Korea'),('Spain');
5.44s
Insert vs Copy
Commit info
COPY best practice
Use it.
Avoid inserts, which will not run in parallel.
If you are moving data from one table to another, use the
deep copy features:
1. Use the original CREATE TABLE DDL and then
INSERT INTO … SELECT
2. CREATE TABLE AS
3. CREATE TABLE LIKE
4. Create a temporary table and truncate the
original.
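Option 3 (CREATE TABLE LIKE) can be sketched as the statement sequence below. The helper only formats SQL text, and the `_copy` suffix for the intermediate table is an arbitrary choice made for this example.

```python
def build_deep_copy(table, suffix="_copy"):
    """Return the statements for a deep copy using CREATE TABLE ... LIKE."""
    new_table = f"{table}{suffix}"
    return [
        # New table inherits the original's column definitions.
        f"CREATE TABLE {new_table} (LIKE {table});",
        # Bulk insert rewrites the rows into the new table.
        f"INSERT INTO {new_table} (SELECT * FROM {table});",
        # Swap the new table into place under the original name.
        f"DROP TABLE {table};",
        f"ALTER TABLE {new_table} RENAME TO {table};",
    ]
```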
Automation Options
Automating Data Ingestion
Many customers run custom scripts on EC2 instances to
load data into Redshift.
Another option is to use the AWS Data Pipeline
automation tool.
AWS Lambda-based Amazon Redshift Loader
Create a Data Pipeline
Create a Data Pipeline
Review Results
Execution Details
Using the Lambda based Redshift Loader
Offers the ability to drop files
into S3 and load them into any
number of database tables in
multiple Amazon Redshift
clusters automatically, with no
servers to maintain.
Configure the sample loader
johnlou$ ./configureSample.sh more.ohno.us-east-1.redshift.amazonaws.com 8192 mydb johnlou us-east-1
Password for user johnlou:
create user test_lambda_load_user password 'Change-me1!';
CREATE USER
create table lambda_redshift_sample(
column_a int,
column_b int,
column_c int
);
CREATE TABLE
Enter the Region for the Redshift Load Configuration > us-east-1
Enter the S3 Bucket to use for the Sample Input > johnlou-ohno/loader-demo-data
Enter the Access Key used by Redshift to get data from S3 > nope
Enter the Secret Key used by Redshift to get data from S3 > nope
Creating Tables in Dynamo DB if Required
Configuration for johnlou-ohno/loader-demo-data/input successfully written in us-east-1
View Logs
Near Real Time Loading
Micro-batch loading
Ideal for time series data
Balance input files
Pre-configure column encoding
Reduce frequency of statistics calculation
Load in sort key order
Use SSD instances
Consider using the ‘Load Stream’ architecture that HasOffers
developed.
ETL Options with Partners
Data Loading Options
Parallel upload to Amazon S3
AWS Direct Connect
AWS Import/Export
Amazon Kinesis
Systems integrators
Data Integration Systems Integrators
Resources on the AWS Big Data Blog
Best Practices for Micro-Batch Loading on Amazon
Redshift
Using Attunity Cloudbeam at UMUC to Replicate Data
to Amazon RDS and Amazon Redshift
A Zero-Administration Amazon Redshift Database
Loader
Best Practices References
Best Practices for Designing Tables
Best Practices for Designing Queries
Best Practices for Loading Data
Thank you!

AWS July Webinar Series: Amazon redshift migration and load data 20150722

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    John Loughlin, Solution Architect
    Eric Ferreira, Principal Database Engineer
    July 22, 2015
    Best Practices: Amazon Redshift Migration and Loading Data
  • 2. Amazon Redshift – Resources
    Getting Started – June Webinar Series: https://www.youtube.com/watch?v=biqBjWqJi-Q
    Best Practices – July Webinar Series:
      Optimizing Performance – July 21, 2015
      Migration and Data Loading – July 22, 2015
      Reporting and Advanced Analytics – July 23, 2015
  • 3. Agenda
    Common Migration Patterns
    Copy Command
    Automation Options
    Near real time loading
    ETL Options with Partners
  • 5. Common Migration Patterns
    Data from a variety of relational OLTP systems
      structure lends itself to SQL schemas
    Data from logs, devices, sensors…
      data is less structured
  • 6. Structured Data Loading
    Data is often being loaded into another warehouse
      existing ETL process
    Temptation is to ‘lift & shift’ workload.
    Resist temptation. Instead consider:
      What do I really want to do?
      What do I need?
  • 7. Ingesting Less Structured Data
    Some data does not lend itself to a relational schema
    Common pattern is to use EMR:
      impose structure
      import into Redshift
    Other solutions are often home-grown scripting applications.
  • 8. Loading Data
    Load to an empty Redshift database.
    Load changes captured in the source system to Redshift
  • 9. Truncate and Load
    This is by far the easiest option:
    Move the data to Amazon Simple Storage Service
      multi-part upload
      import/export service
      direct connect
    COPY the data into Redshift, a table at a time.
  • 10. Load Changes
    Identify changes in source systems
    Move data to Amazon S3
    Load changes
      ‘Upsert process’
      Partner ETL tools
  • 11. Partner ETL
    Amazon Redshift is supported by a variety of ETL vendors
    Many simplify the process of data loading
    Visit http://aws.amazon.com/redshift/partners
    There are a variety of vendors offering a free trial of their products, allowing you to evaluate and choose the one that suits your needs.
  • 12. Upsert
    The goal is to insert new rows and update changed rows in Redshift.
    1. Load data into a temporary staging table.
    2. Join the staging table with production and delete the common rows.
    3. Copy the new data into the production table.
    See "Updating and Inserting New Data" in the developer’s guide.
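The three upsert steps above can be sketched as follows. This is only an illustration of the staging-table pattern, not the presenters' code: it uses Python's built-in `sqlite3` as a local stand-in for Redshift, and the `production` table, its `id` and `name` columns, and the sample rows are all hypothetical.

```python
import sqlite3


def upsert_via_staging(conn, new_rows):
    """Merge (id, name) rows into `production` through a staging table."""
    cur = conn.cursor()
    # 1. Load the incoming data into a temporary staging table.
    cur.execute("CREATE TEMP TABLE staging (id INTEGER, name TEXT)")
    cur.executemany("INSERT INTO staging VALUES (?, ?)", new_rows)
    # 2. Delete the production rows that also appear in staging.
    cur.execute("DELETE FROM production WHERE id IN (SELECT id FROM staging)")
    # 3. Copy every staged row (new and changed) into production.
    cur.execute("INSERT INTO production SELECT id, name FROM staging")
    cur.execute("DROP TABLE staging")
    conn.commit()


# Hypothetical sample data: one row changes (id 2), one row is new (id 3).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO production VALUES (?, ?)", [(1, "Spain"), (2, "Slovakia")]
)
conn.commit()

upsert_via_staging(conn, [(2, "Slovak Republic"), (3, "Slovenia")])
rows = conn.execute("SELECT id, name FROM production ORDER BY id").fetchall()
print(rows)
```

On a real cluster the staging table would normally be filled with COPY rather than row inserts, and the delete and insert would run inside a single transaction so readers never see the table with rows missing.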
  • 13. Checkpoint
    We’ve talked about common migration patterns:
    Sources of data and data structure
    Methods of getting data to AWS
    Options for loading data
  • 14. COPY
  • 15. Amazon Redshift Architecture
    Leader Node
      • SQL endpoint, JDBC/ODBC
      • Stores metadata
      • Coordinates query execution
    Compute Nodes
      • Local, columnar storage
      • Execute queries in parallel
      • Load, backup, restore via Amazon S3
      • Load from Amazon DynamoDB or SSH
    Two hardware platforms
      • Optimized for data processing
      • DS2: HDD; scale from 2TB to 2PB
      • DC1: SSD; scale from 160GB to 326TB
    [Diagram labels: 10 GigE (HPC); Ingestion, Backup, Restore; JDBC/ODBC]
  • 16. A Closer Look
    Each node is split into slices
      • One slice per core
    Each slice is allocated memory, CPU, and disk space
    Each slice processes a piece of the workload in parallel
  • 17. COPY command
    Use the COPY command.
    Each slice can load one file at a time.
    Partition input files so every slice can load in parallel.
    Use a manifest file.
    Use COMPUPDATE ON when running on an empty table.
  • 18. Use multiple input files to maximize throughput
    Use the COPY command
    Each slice can load one file at a time
    A single input file means only one slice is ingesting data
    Instead of 100MB/s, you’re only getting 6.25MB/s
  • 19. Use multiple input files to maximize throughput
    Use the COPY command
    You need at least as many input files as you have slices
    With 16 input files, all slices are working so you maximize throughput
    Get 100MB/s per node; scale linearly as you add nodes
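The arithmetic behind the 6.25MB/s figure, and one simple way to balance input across files, can be sketched as below. The 100MB/s per node and 16 files come from the slides; treating that as 16 slices sharing one node's throughput is an assumption, and both function names are hypothetical helpers, not part of any AWS tool.

```python
def per_slice_throughput(node_mb_per_s, slices):
    """Throughput one slice contributes when the node total is split evenly."""
    return node_mb_per_s / slices


def split_round_robin(lines, num_files):
    """Distribute lines across num_files buckets of near-equal size."""
    buckets = [[] for _ in range(num_files)]
    for i, line in enumerate(lines):
        buckets[i % num_files].append(line)
    return buckets


# One input file keeps a single slice busy: 100 / 16 MB/s.
single_file_rate = per_slice_throughput(100, 16)
print(single_file_rate)

# Splitting 100 hypothetical rows into 16 parts gives every slice a file.
parts = split_round_robin([f"row{i}" for i in range(100)], 16)
print([len(p) for p in parts])
```

Round-robin splitting keeps the parts within one row of each other, so no slice finishes much later than the rest.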
  • 20. Primary keys and manifest files
    Amazon Redshift doesn’t enforce primary key constraints
      • If you load data multiple times, Amazon Redshift won’t complain
      • If you declare primary keys in your DDL, the optimizer will expect the data to be unique
    Use manifest files to control exactly what is loaded and how to respond if input files are missing
      • Define a JSON manifest on Amazon S3
      • Ensures the cluster loads exactly what you want
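A manifest is a small JSON document listing each file to load and whether its absence should fail the load. The sketch below builds one in the `entries` / `url` / `mandatory` layout that COPY reads; the bucket name and file paths are hypothetical, and `build_manifest` is an illustrative helper rather than an AWS API.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return a COPY manifest: one entry per S3 object to load."""
    return {"entries": [{"url": u, "mandatory": mandatory} for u in urls]}


# Hypothetical bucket and object names, for illustration only.
urls = [f"s3://example-bucket/load/part-{i:02d}.gz" for i in range(4)]
manifest = build_manifest(urls)
print(json.dumps(manifest, indent=2))
```

The resulting JSON would be uploaded to Amazon S3 and named in the COPY statement's FROM clause together with the MANIFEST option. With `mandatory` set to true, a missing file makes the load fail instead of being skipped silently.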
  • 21. Analyze sort/dist key columns after every load
    Amazon Redshift’s query optimizer relies on up-to-date statistics
    Maximize performance by updating stats on sort/dist key columns after every load
  • 22. Automatic compression
    Better performance, lower costs
    COPY samples data automatically when loading into an empty table
      • Samples up to 100,000 rows and picks optimal encoding
    If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression
      • Use ANALYZE COMPRESSION to determine the right encodings
      • Bake those encodings into your DDL
  • 23. Checking STL_LOAD_COMMITS
    SELECT query, trim(filename) AS filename, curtime, status
    FROM stl_load_commits
    WHERE filename LIKE '%table name%'
    ORDER BY query;
    After the load operation is complete, query the STL_LOAD_COMMITS system table to verify that the expected files were loaded.
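Once the filenames from STL_LOAD_COMMITS are in hand, the verification itself is a set difference against the files you meant to load. The sketch below assumes the filenames have already been fetched by some client; `missing_files` and the sample paths are hypothetical.

```python
def missing_files(expected, loaded_filenames):
    """Return expected files that do not appear among the loaded filenames."""
    # The filename column is fixed-width, so strip padding before comparing.
    loaded = {name.strip() for name in loaded_filenames}
    return sorted(set(expected) - loaded)


# Hypothetical values: three files expected, two reported as committed.
expected = [
    "s3://example-bucket/load/part-00.gz",
    "s3://example-bucket/load/part-01.gz",
    "s3://example-bucket/load/part-02.gz",
]
loaded_filenames = [
    "s3://example-bucket/load/part-00.gz   ",
    "s3://example-bucket/load/part-02.gz   ",
]
not_loaded = missing_files(expected, loaded_filenames)
print(not_loaded)
```

An empty result means every expected file was committed; anything else names the files to investigate or reload.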
  • 24. Insert vs Copy: COPY and 18 inserts
    COPY country FROM 's3://…country.txt' CREDENTIALS …    (1.57s)
    then
    insert into country (country_name) values ('Slovakia'),('Slovenia'),('South Africa'),('South Korea'),('Spain');    (5.44s)
    [Screenshot: commit info]
  • 25. COPY best practice
    Use it. Avoid inserts, which will not run in parallel.
    If you are moving data from one table to another, use the deep copy features:
    1. Use the original CREATE TABLE DDL and then INSERT INTO … SELECT
    2. CREATE TABLE AS
    3. CREATE TABLE LIKE
    4. Create a temporary table and truncate the original.
  • 27. Automating Data Ingestion
    Many customers run custom scripts on EC2 instances to load data into Redshift.
    Another option is to use the AWS Data Pipeline automation tool.
    AWS Lambda-based Amazon Redshift Loader
  • 28. Create a Data Pipeline
  • 29. Create a Data Pipeline
  • 32. Using the Lambda-based Redshift Loader
    Offers the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
  • 33. Configure the sample loader
    johnlou$ ./configureSample.sh more.ohno.us-east-1.redshift.amazonaws.com 8192 mydb johnlou us-east-1
    Password for user johnlou:
    create user test_lambda_load_user password 'Change-me1!';
    CREATE USER
    create table lambda_redshift_sample( column_a int, column_b int, column_c int );
    CREATE TABLE
    Enter the Region for the Redshift Load Configuration > us-east-1
    Enter the S3 Bucket to use for the Sample Input > johnlou-ohno/loader-demo-data
    Enter the Access Key used by Redshift to get data from S3 > nope
    Enter the Secret Key used by Redshift to get data from S3 > nope
    Creating Tables in Dynamo DB if Required
    Configuration for johnlou-ohno/loader-demo-data/input successfully written in us-east-1
  • 35. Near Real Time Loading
  • 36. Micro-batch loading
    Ideal for time series data
    Balance input files
    Pre-configure column encoding
    Reduce frequency of statistics calculation
    Load in sort key order
    Use SSD instances
    Consider using the ‘Load Stream’ architecture HasOffers developed.
  • 37. ETL Options with Partners
  • 38. Data Loading Options
    Parallel upload to Amazon S3
    AWS Direct Connect
    AWS Import/Export
    Amazon Kinesis
    Systems integrators
    [Diagram labels: Data Integration; Systems Integrators]
  • 39. Resources on the AWS Big Data Blog
    Best Practices for Micro-Batch Loading on Amazon Redshift
    Using Attunity Cloudbeam at UMUC to Replicate Data to Amazon RDS and Amazon Redshift
    A Zero-Administration Amazon Redshift Database Loader
  • 40. Best Practices References
    Best Practices for Designing Tables
    Best Practices for Designing Queries
    Best Practices for Loading Data