© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daniel Bento, Solutions Architect, AWS
November 2016 | São Paulo, Brazil
Workshop
Amazon Redshift
Agenda
• Introduction
• Architecture and Use Cases
• Data Loading and Migration
• Table and Schema Design
• Operations
We start with the customer… and innovate
Managing databases is painful & difficult
SQL DBs do not perform well at scale
Hadoop is difficult to deploy and manage
DWs are complex, costly, and slow
Commercial DBs are punitive & expensive
Streaming data is difficult to capture & analyze
BI Tools are expensive and hard to manage
Customers told us… We created…
✓ Amazon RDS
✓ Amazon DynamoDB
✓ Amazon EMR
✓ Amazon Redshift
✓ Amazon Aurora
✓ Amazon Kinesis
✓ Amazon QuickSight
AWS Big Data Portfolio

Collect: Kinesis; Kinesis Firehose (new); Import Export; Direct Connect; Database Migration (new)
Store: S3; Glacier; DynamoDB; RDS, Aurora
Analyze: EMR; EC2; Redshift; Machine Learning; ElasticSearch (launched); Data Pipeline; CloudSearch; QuickSight (new); SQL over Streams (new)
Global Footprint
14 Regions; 33 Availability Zones; 54 Edge Locations
Redshift
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Free Tier – 2 months
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The legacy view of data warehousing ...
Multi-year commitment
Multi-year deployments
Multi-million dollar deals
… Leads to dark data
This is a narrow view
Small companies also have big data
(mobile, social, gaming, adtech, IoT)
Long cycles, high costs, administrative
complexity all stifle innovation
[Chart: total Enterprise Data vs. Data in Warehouse]
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools,
Hadoop, Machine Learning,
Streaming
Analysis in-line with process
flows
Pay as you go, grow as you
need
Managed availability & DR
Enterprise Big Data SaaS
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester
Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed
spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in
the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4
2015
Selected Amazon Redshift Customers
NTT Docomo | Telecom FINRA | Financial Svcs Philips | Healthcare Yelp | Technology NASDAQ | Financial Svcs
The Weather Company | Media Nokia | Telecom Pinterest | Technology Foursquare | Technology Coursera | Education
Coinbase | Bitcoin Amazon | E-Commerce Etix | Entertainment Spuul | Entertainment Vivaki | Ad Tech
Z2 | Gaming Neustar | Ad Tech SoundCloud | Technology BeachMint | E-Commerce Civis | Technology
Amazon Redshift Security
Petabyte-Scale Data Warehousing Service
Amazon Redshift Architecture
Amazon Redshift Architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads,
backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
Ingestion/Backup
Backup
Restore
JDBC/ODBC
10 GigE
(HPC)
Benefit #1: Amazon Redshift is fast
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
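The delta encodings in the output above (delta, delta32k) store differences between successive values rather than the raw values. A minimal Python sketch of the idea, purely illustrative and not Redshift's actual implementation:

```python
def delta_encode(values):
    """Store the first value plus successive differences.

    Sorted, dense ID columns (like listid above) produce tiny deltas
    that fit in far fewer bits than the raw values.
    """
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ids = [1000001, 1000002, 1000003, 1000007, 1000009]
encoded = delta_encode(ids)
assert encoded == [1000001, 1, 1, 4, 2]   # small deltas after the first
assert delta_decode(encoded) == ids       # lossless round trip
```

This is why sort order and compression interact: sorted columns yield small, regular deltas.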
[Diagram: column blocks with per-block min/max zone-map entries]
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
MIN: 01-JUNE-2016
MAX: 20-JUNE-2016
MIN: 08-JUNE-2016
MAX: 30-JUNE-2016
MIN: 12-JUNE-2016
MAX: 20-JUNE-2016
MIN: 02-JUNE-2016
MAX: 25-JUNE-2016
Unsorted Table
MIN: 01-JUNE-2016
MAX: 06-JUNE-2016
MIN: 07-JUNE-2016
MAX: 12-JUNE-2016
MIN: 13-JUNE-2016
MAX: 18-JUNE-2016
MIN: 19-JUNE-2016
MAX: 24-JUNE-2016
Sorted By Date
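The sorted-vs-unsorted comparison above can be sketched as a block-pruning check: a block is read only when the predicate value can fall inside its min/max range. The block ranges mirror the slide; the code is illustrative:

```python
from datetime import date

# Per-block zone-map (min, max) entries, taken from the slide.
unsorted_blocks = [(date(2016, 6, 1),  date(2016, 6, 20)),
                   (date(2016, 6, 8),  date(2016, 6, 30)),
                   (date(2016, 6, 12), date(2016, 6, 20)),
                   (date(2016, 6, 2),  date(2016, 6, 25))]
sorted_blocks   = [(date(2016, 6, 1),  date(2016, 6, 6)),
                   (date(2016, 6, 7),  date(2016, 6, 12)),
                   (date(2016, 6, 13), date(2016, 6, 18)),
                   (date(2016, 6, 19), date(2016, 6, 24))]

def blocks_to_scan(blocks, value):
    """Indices of blocks whose [min, max] range could contain value."""
    return [i for i, (lo, hi) in enumerate(blocks) if lo <= value <= hi]

target = date(2016, 6, 9)   # WHERE DATE = '09-JUNE-2016'
assert blocks_to_scan(unsorted_blocks, target) == [0, 1, 3]  # 3 of 4 blocks read
assert blocks_to_scan(sorted_blocks, target) == [1]          # 1 of 4 blocks read
```

Sorting by the filtered column tightens each block's range, so most blocks are skipped.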
Benefit #1: Amazon Redshift is fast
Sort Keys and Zone Maps
Benefit #1: Amazon Redshift is fast
Parallel and Distributed
Query
Load
Export
Backup
Restore
Resize
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail
Benefit #1: Amazon Redshift is fast
Distribution Keys
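A hypothetical sketch of what a distribution key does: hashing the key decides which slice a row lands on, so rows with equal keys are co-located. Redshift's real hash function is internal; MD5 here is only for illustration:

```python
import hashlib

NUM_SLICES = 4  # e.g., two nodes with two slices each

def slice_for(key):
    """Map a distribution-key value to a slice deterministically."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

rows = [(1, "John Smith"), (2, "Jane Jones"), (3, "Peter Black"),
        (4, "Pat Partridge"), (5, "Sarah Cyan"), (6, "Brian Snail")]

placement = {}
for row_id, name in rows:
    placement.setdefault(slice_for(row_id), []).append(name)

# The same key value always maps to the same slice, which is what
# makes co-located joins on the distribution key possible.
assert slice_for(1) == slice_for(1)
assert all(0 <= s < NUM_SLICES for s in placement)
```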
Benefit #1: Amazon Redshift is fast
H/W optimized for I/O intensive workloads
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Example: Our new Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: same as our prior generation!
Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), price per hour for a DS2.XL single node / effective annual price per TB compressed:
On-Demand: $0.850/hour, $3,725/TB/year
1-Year Reservation: $0.500/hour, $2,190/TB/year
3-Year Reservation: $0.228/hour, $999/TB/year

DC1 (SSD), price per hour for a DC1.L single node / effective annual price per TB compressed:
On-Demand: $0.250/hour, $13,690/TB/year
1-Year Reservation: $0.161/hour, $8,795/TB/year
3-Year Reservation: $0.100/hour, $5,500/TB/year
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No up front costs
Pay as you go
Amazon Redshift lets you start small and grow big
Dense Storage (DS2.XL)
2 TB HDD, 31 GB RAM, 2 slices/4 cores
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Dense Storage (DS2.8XL)
16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
Cluster 2-128 Nodes (32 TB – 2 PB)
Note: Nodes not to scale
Benefit #2: Amazon Redshift is inexpensive
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to S3
Continuous and incremental backups
across regions
Streaming restore
Amazon S3
Amazon S3
Region 1
Region 2
Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/Region level disasters
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3
encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS
support
• Audit logging and AWS CloudTrail
integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
Benefit #5: We innovate quickly
2013 | 2014 | 2015
•Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo, Singapore, Sydney) Regions
•MANIFEST option for the COPY & UNLOAD commands
•SQL Functions: Most recent queries
•Resource-level IAM, CRC32
•Data Pipeline
•Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
•Cross-Region Snapshot Copy
•Audit features, cursor support, 500 concurrent client connections
•EIP Address for VPC Cluster
•New system views to tune table design and track WLM query queues
•Custom ODBC/JDBC drivers; Query Visualization
•Mobile Analytics auto export
•KMS for GovCloud Region; HIPAA BAA
•Interleaved sort keys
•New Dense Storage Nodes (DS2) with better RAM and CPU.
•New Reserved Storage Nodes: No, Partial & All Upfront Options
•Cross-region backups for KMS encrypted clusters
•Scalar UDFs in Python
•AVRO Ingestion; Kinesis Firehose; Database Migration Service (DMS)
•Modify Cluster Dynamically
•Tag-based permissions and BZIP2
•System Tables for query Tuning
•Dense Compute Nodes
•Gzip & Lzop; JSON, RegEx, Cursors
•EMR Data Loading & Bootstrap Action with COPY command; WLM concurrency limit
to 50; support for the ECDH cipher suites for SSL connections; FedRAMP
•Cross-region ingestion
•Free trials & price reductions in Asia Pacific
•CloudWatch Alarm for Disk Usage
•AES 128-bit encryption; UTF-16; KMS Integration
•EU (Frankfurt); GovCloud Regions
•S3 Server-side encryption support for UNLOAD
•Tagging Support for Cost-allocation
•WLM Queue-Hopping for timed-out
queries
•Append rows & Export to BZIP-2
•Lambda for Clusters in VPC; Data
Schema Conversion Support from
ML Console
•US West (N. California) Region.
2016
100+ new features added since launch
Release every two weeks
Automatic patching
Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
Amazon ML
Benefit #7: Amazon Redshift has a large ecosystem
Business Intelligence | Data Integration | Systems Integrators
Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon
Redshift
Amazon Kinesis
Machine
Learning
Data Pipeline
CloudSearch
Mobile
Analytics
Use cases
Analyzing Twitter Firehose
Amazon
Redshift
Starts at
$0.25/hour
EC2
Starts at
$0.02/hour
S3
$0.030/GB-Mo
Amazon Glacier
$0.010/GB-Mo
Amazon Kinesis
$0.015/shard 1MB/s in; 2MB/out
$0.028/million puts
Analyzing Twitter Firehose
500MM tweets/day = ~ 5,800 tweets/sec
2k/tweet is ~12MB/sec (~1TB/day)
$0.015/hour per shard, $0.028/million PUTS
Amazon Kinesis cost is $0.765/hour
Amazon Redshift cost is $0.850/hour (for a 2TB node)
S3 cost is $1.28/hour (no compression)
Total: $2.895/hour
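The Kinesis figure above can be re-derived from the stated assumptions (2 KB/tweet, $0.015/shard-hour with 1 MB/s ingress per shard, $0.028 per million PUTs); under these assumptions the result lands within a cent of the slide's $0.765/hour:

```python
import math

TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 2048            # "2k/tweet"
SHARD_PRICE_HR = 0.015            # $/shard-hour, 1 MB/s ingress each
PUT_PRICE = 0.028 / 1_000_000     # $ per PUT

# Ingress rate decides the shard count.
mb_per_sec = TWEETS_PER_DAY / 86_400 * BYTES_PER_TWEET / 1_000_000
shards = math.ceil(mb_per_sec)

shard_cost_hr = shards * SHARD_PRICE_HR
put_cost_hr = TWEETS_PER_DAY / 24 * PUT_PRICE
kinesis_hr = shard_cost_hr + put_cost_hr

assert shards == 12                      # ~11.85 MB/s rounds up to 12 shards
assert abs(kinesis_hr - 0.765) < 0.01    # matches the slide's ~$0.765/hour
```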
Data warehouses
can be
inexpensive
and
powerful
Use only the services you need
Scale only the services you need
Pay for what you use
~40% discount with 1 year commitment
~70% discounts with 3 year commitment
Data warehouses
can be
inexpensive
and
powerful
Amazon.com – Weblog analysis
Web log analysis for Amazon.com
1PB+ workload, 2TB/day, growing 67% YoY
Largest table: 400 TB
Want to understand customer behavior
Query 15 months of data (1PB) in 14 minutes
Load 5B rows in 10 minutes
21B rows joined with 10B rows – 3 days (Hive) to 2 hours
Load pipeline: 90 hours (Oracle) to 8 hours
64 clusters
800 total nodes
13PB provisioned storage
2 DBAs
Data warehouses
can be
fast
and
simple
Sushiro – Real-time streaming from IoT & analysis
Sushiro – Real-time streaming & analysis
Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift
380 stores stream live data from
Sushi plates
Inventory information combined
with consumption information
near real-time
Forecast demand by store,
minimize food waste, and
improve efficiencies
Amazon
Big data does not mean batch
Can be streamed in
Can be processed in near real time
Can be used to respond quickly to requests
You can mix and match
On premises and cloud
Custom development and managed services
Infrastructure with managed scaling, security
Data warehouses
can support
real-time data
Amazon Redshift Data Loading and Migration
Petabyte-Scale Data Warehousing Service
Data Loading Process
Data Source Extraction Transformation Loading
Amazon
Redshift
Target
Focus
Amazon Redshift Loading Data Overview
Corporate data center and AWS Cloud

Sources: logs / files and source DBs, connected over a VPN connection or AWS Direct Connect
Ingestion paths (by data volume): S3 multipart upload, AWS Import/Export, EC2 or on-premises hosts (using SSH), AWS Database Migration Service, Kinesis, AWS Lambda, AWS Data Pipeline
Staging and targets: Amazon S3, Amazon DynamoDB, Amazon EMR, and Amazon RDS feed Amazon Redshift; Amazon Glacier for archival
Uploading Files to Amazon S3
Client.txt is uploaded from the corporate data center to the mydata bucket on Amazon S3, then loaded into Amazon Redshift.

• Ensure that your data resides in the same Region as your Redshift clusters
• Split the data into multiple files (Client.txt.1 … Client.txt.4) to facilitate parallel processing
• Files should be individually compressed using GZIP or LZOP
• Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption
Loading Data From Amazon S3
Preparing Input Data Files
Uploading files to Amazon S3
Using COPY to load data from Amazon S3
Splitting Data Files
Client.txt.1 loads into Node 0, Slice 0
Client.txt.2 loads into Node 0, Slice 1
Client.txt.3 loads into Node 1, Slice 0
Client.txt.4 loads into Node 1, Slice 1
(2 XL compute nodes, two slices each)
copy customer from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';
Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
Loading – Use multiple input files to maximize
throughput
Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
Loading – Use multiple input files to maximize
throughput
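One way to produce one input file per slice before upload; a hypothetical round-robin line splitter (any scheme that yields roughly equal, line-aligned chunks works):

```python
def split_lines(lines, num_slices):
    """Distribute lines round-robin into num_slices chunks so every
    slice gets a roughly equal share of the COPY work."""
    chunks = [[] for _ in range(num_slices)]
    for i, line in enumerate(lines):
        chunks[i % num_slices].append(line)
    return chunks

lines = [f"record-{i}" for i in range(100)]
chunks = split_lines(lines, 4)   # e.g., Client.txt.1 .. Client.txt.4

assert sum(len(c) for c in chunks) == 100   # nothing lost
assert all(len(c) == 25 for c in chunks)    # even work per slice
```

Each chunk would then be written out, compressed (GZIP/LZOP), and uploaded as a separate S3 object.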
Amazon Confidential – Slides not intended for redistribution.
• Start your first migration in 10 minutes or less
• Keep your applications running during the migration
• Move data to the same engine or to a different one
AWS Database Migration
Service
Customer
Premises
Application Users
AWS
Internet
VPN
Start a replication instance
Connect to source and target databases
Select tables, schemas, or databases
Let AWS Database Migration Service
create tables, load data, and keep
them in sync
Switch applications over to the target
at your convenience
AWS DMS - Keep your apps running during the
migration
AWS
Database Migration
Service
Loading Data from an Amazon DynamoDB Table
Differences between Amazon DynamoDB and Amazon Redshift:

Table names:
• DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash) characters; case-sensitive
• Redshift: limited to 127 characters; can't contain dots or dashes; not case-sensitive; can't conflict with any Amazon Redshift reserved words

NULL:
• DynamoDB does not support the SQL concept of NULL; you must specify how Amazon Redshift interprets empty or blank attribute values in Amazon DynamoDB

Data types:
• STRING and NUMBER data types are supported; BINARY and SET data types are not
Loading Data from Amazon Elastic MapReduce
Load data from Amazon EMR in parallel using COPY
Specify Amazon EMR cluster ID and HDFS file path/name
Amazon EMR must be running until COPY completes.
copy sales from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
Loading Data using Lambda
The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
Blog post
https://blogs.aws.amazon.com/bigdata/post/
Tx24VJ6XF1JVJAA/A-Zero-Administration-
Amazon-Redshift-Database-Loader
GitHub http://github.com/awslabs/aws-
lambda-redshift-loader
Remote Loading using SSH
The Redshift COPY command can reach out to remote locations (EC2 or on-premises hosts) to load data over secure shell (SSH).
To Remote Load follow this process:
• Add cluster's public key to the
remote host's authorized keys file
• Configure remote host to accept
connections from all cluster IP
addresses
• Create manifest file in JSON format
and upload to S3 bucket
• Issue a COPY command, including
a reference to the manifest file
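The manifest referenced in the last two steps is a small JSON document listing the hosts and commands that produce the data. A sketch with placeholder endpoint, command, and username values (fields assumed from the documented SSH-manifest shape):

```python
import json

# Placeholder values: the endpoint, command, and username below are
# hypothetical and would be replaced with your own remote host details.
manifest = {
    "entries": [
        {
            "endpoint": "ec2-12-345-678-90.compute-1.amazonaws.com",
            "command": "cat /data/client.txt",   # emits the rows to load
            "mandatory": True,                    # fail COPY if host is down
            "username": "ec2-user",
        }
    ]
}

doc = json.dumps(manifest, indent=2)
assert json.loads(doc)["entries"][0]["mandatory"] is True
```

This document is what gets uploaded to the S3 bucket and referenced by the COPY command.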
Vacuuming Tables
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas
Column 0
Column 1 Column 2
Amazon Redshift serializes all of the
values of a column together.
Vacuuming Tables
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas
Column 0 Column 1 Column 2
Delete customer where column_0 = 3;
x xxx xxxxxxxxxxxxxxx
Vacuuming Tables
1,2,4 RFK,JFK,GWB 900 Columbus,800 Washington,600 Kansas
VACUUM Customer;
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansasx xxx xxxxxxxxxxxxxxx
Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance and increases available storage.
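A toy model of why VACUUM is needed: a delete only marks values dead in the column blocks, and the rewrite reclaims the space. Illustrative only, not Redshift internals:

```python
# Sentinel standing in for a tombstoned (deleted-but-not-reclaimed) value.
TOMBSTONE = object()

# Two columns of the CUSTOMER example above.
ids = [1, 2, 3, 4]
names = ["RFK", "JFK", "LBJ", "GWB"]

# DELETE customer WHERE id = 3: mark the row dead in every column,
# but do not shrink the storage.
for col in (ids, names):
    col[2] = TOMBSTONE

def vacuum(column):
    """Rewrite a column without tombstones, reclaiming their space."""
    return [v for v in column if v is not TOMBSTONE]

assert len(ids) == 4                       # space still occupied pre-VACUUM
assert vacuum(ids) == [1, 2, 4]            # reclaimed after VACUUM
assert vacuum(names) == ["RFK", "JFK", "GWB"]
```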
Analyzing Tables
ANALYZE command
The ANALYZE command obtains a sample of rows, does some calculations, and saves the resulting column statistics. It can target the entire current database, a single table, or one or more specific columns in a single table. You do not need to analyze all columns in all tables regularly.
Analyze the columns that are frequently
used in the following:
• Sorting and grouping operations
• Joins
• Query Predicates
To maintain current statistics for tables:
• Run the ANALYZE command before
running queries
• Run the ANALYZE command against
the database routinely at the end of
every regular load or update cycle
• Run the ANALYZE command against new tables
Statistics
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE
Operations into the same table
Transaction 1
Transaction 2
CUSTOMER
copy customer from 's3://mydata/client.txt' …;
copy customer from 's3://mydata/client.txt' …;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table.
Transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE
Operations into the same table
Transaction 1
Transaction 2
CUSTOMER
Begin;
Delete one row from
CUSTOMERS;
Copy…;
Select count(*) from
CUSTOMERS;
End;
Begin;
Delete one row from
CUSTOMERS;
Copy…;
Select count(*) from
CUSTOMERS;
End;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table.
Transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Potential deadlock situation for
concurrent write transactions
Transaction 1
Transaction 2
CUSTOMER
Begin;
Delete 3000 rows from
CUSTOMERS;
Copy…;
Delete 5000 rows from PARTS;
End;
Begin;
Delete 3000 rows from PARTS;
Copy…;
Delete 5000 rows from
CUSTOMERS;
End;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table
PARTS
Transaction 2 takes the write lock on the PARTS table
Each transaction then waits on the lock the other holds: deadlock.
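The classic fix for this deadlock is to take write locks in one global order in every transaction, for example alphabetically by table name. A minimal sketch of that convention:

```python
def lock_order(tables):
    """Normalize a transaction's lock acquisitions to one global order
    (alphabetical by table name), so no two transactions can hold locks
    in opposite orders."""
    return sorted(tables)

txn1_tables = ["CUSTOMERS", "PARTS"]   # slide's transaction 1: CUSTOMERS first
txn2_tables = ["PARTS", "CUSTOMERS"]   # slide's transaction 2: PARTS first

# After normalization both sessions acquire in the same order, so one
# simply waits for the other instead of deadlocking.
assert lock_order(txn1_tables) == lock_order(txn2_tables) == ["CUSTOMERS", "PARTS"]
```

Redshift itself detects and aborts deadlocked transactions, but consistent ordering avoids the abort in the first place.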
Amazon Redshift Table and Schema Design
Goals of distribution
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations
Distribution styles:
Key: same key to same location
All: full table data on the first slice of every node
Even: round-robin distribution across slices
Choosing a distribution style
Key
• Large FACT tables
• Rapidly changing tables used
in joins
• Localize columns used within
aggregations
All
• Have slowly changing data
• Reasonable size (a few million rows, not hundreds of millions)
• No common distribution key
for frequent joins
• Typical use case: joined
dimension table without a
common distribution key
Even
• Tables not frequently joined or
aggregated
• Large tables without
acceptable candidate keys
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records
5M records
1M records
4M records
Poor key choices lead to uneven distribution of records…
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records
5M records
1M records
4M records
Unevenly distributed data causes processing imbalances!
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records2M records 2M records 2M records
Evenly distributed data improves query performance
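Skew like the 2M/5M/1M/4M example above can be quantified as the ratio of the busiest slice's row count to the average; a small helper (the 1.6 threshold in the assertion is arbitrary, for illustration):

```python
def skew(rows_per_slice):
    """Ratio of the busiest slice to the average slice.

    1.0 means perfectly even; the higher the ratio, the longer the
    slowest slice gates every parallel query.
    """
    avg = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / avg

# The slide's uneven layout (millions of records per slice) vs. an even one.
assert skew([2, 5, 1, 4]) > 1.6    # one slice does far more than its share
assert skew([3, 3, 3, 3]) == 1.0   # all slices finish together
```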
Single Column
Compound
Interleaved
Sort Keys
Goals of sorting
• Physically sort data within blocks and throughout a table
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data
Choosing a SORTKEY
INTERLEAVED
• Edge cases
• Large tables (>billion rows)
• No common filter criteria
• Non time-series data
• Primarily as a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose same as distribution key column for most
efficient joins (merge join)
Table is sorted by 1 column
[ SORTKEY ( date ) ]
Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Single Column
• Table is sorted by 1st column, then 2nd column, etc.
[ SORTKEY COMPOUND ( date, region, country) ]
• Best for:
• Queries that use 1st column as primary filter, then other cols
• Can speed up joins and group bys
• Slower to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Compound
• Equal weight is given to each column.
[ SORTKEY INTERLEAVED ( date, region, country) ]
• Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter (up to 8)
• Slowest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Interleaved
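A compound sort key behaves like tuple ordering: rows sort by the first column, ties break on the second, and so on. A sketch using the slide's (date, region, country) columns; an interleaved key would instead weight all three columns equally:

```python
# Rows as (date, region, country) tuples, matching the slide's table.
rows = [
    ("2015-06-03", "Asia", "Korea"),
    ("2015-06-02", "Oceania", "New Zealand"),
    ("2015-06-02", "Asia", "Singapore"),
    ("2015-06-03", "Europe", "Germany"),
    ("2015-06-02", "Asia", "Hong Kong"),
]

# SORTKEY COMPOUND (date, region, country) is exactly lexicographic
# tuple ordering: date first, then region, then country.
compound_sorted = sorted(rows)

assert compound_sorted[0] == ("2015-06-02", "Asia", "Hong Kong")
assert compound_sorted[-1][0] == "2015-06-03"   # latest date sorts last
```

This is why compound keys serve leading-column filters best: a predicate on date alone maps to one contiguous run of rows, while a predicate only on country does not.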
Compressing data
• COPY automatically analyzes and compresses data
when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and
proposes optimal compression algorithms for each
column
• Changing column encoding requires a table rebuild
Automatic compression is a good thing (mostly)
• From the zone maps we know which blocks contain the range and which row offsets to scan
• Highly compressed SORTKEYs mean many rows per block and large row offsets
• Therefore, skip compression on just the leading column of the compound SORTKEY
During queries and ingestion, the system allocates buffers based on column width. Wider-than-needed columns waste memory: fewer rows fit into memory, and the likelihood of queries spilling to disk increases.
DDL – Make Columns as narrow as possible
Amazon Redshift Operations
Backup and Restore
Backups to Amazon S3 are
automatic, continuous &
incremental
Configurable system snapshot
retention period
User-defined snapshots are
on-demand
Streaming restores enable you
to resume querying faster
Redshift cluster nodes (128 GB RAM, 16 TB disk, 16 cores each)
Amazon S3
Snapshots
Automated Snapshots
Redshift enables automated snapshots by default. Snapshots are incremental. Redshift retains the incremental backup data required to restore a cluster from a manual or automatic snapshot, for clusters that haven't reached the snapshot retention period. Amazon Redshift provides free storage for snapshots equal to the storage capacity of your cluster.
Manual Snapshots
The system will never delete a manual snapshot
You can authorize access to an existing manual
snapshot for as many as 20 AWS customer
accounts
Shared access between accounts means you
can move data between dev/test/prod without
reloading
Cross-Region Snapshots
Copy cluster snapshots to another region using the management console
Can't modify the destination region without disabling cross-region snapshots and re-enabling with a new destination region and retention period
Streaming Restore
During restore, your node is provisioned in less than two minutes. You can query the provisioned node immediately, as data is automatically streamed from the S3 snapshot.
― Cluster is put into read-only mode
― New cluster is provisioned according to resizing needs
― Node-to-node parallel data copy
― Only charged for source cluster
Resize – Phase 1
Resize - Phase 2
― Automatic SQL endpoint switchover via DNS
― Decommission the source cluster
Resizing a
Redshift Data
Warehouse via
the AWS
Console
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
Resizing without Production Impact
When you resize, your cluster is read-only:
• You can query the cluster throughout resizing
• There will be a short outage at the end of the resize
Parameter Groups
Allow you to set a variety of configuration
options:
• date style, search path and query timeout
• user activity logging
• whether the leader node should accept only
SSL protected connections
Apply to every database in a cluster
View events using the console, the SDK, or the
CLI/API
Use SET to change some properties
temporarily, for example:
set wlm_query_slot_count to 3;
Some changes to the parameter group may
require a restart to take effect
Default Parameter Values
http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
Events
Event information is available for a
period of several weeks
Source types:
• Cluster
• Parameter Group
• Security Group
• Snapshot
Categories:
• Configuration
• Management
• Monitoring
• Security
Subscribe to events to receive SNS
notifications
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
Resource Tagging
Define custom tags for your cluster
• provide metadata about a resource
• categorize billing reports for cost
allocation
• Activate tags in the billing and cost
management service
A cluster resize, or a restore from a snapshot in the original region, preserves tags; cross-region snapshots do not.
http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
Concurrent Query Execution
Long-running queries may cause short-running queries to wait in a queue
• As a result users may experience unpredictable performance
By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max)
• 5 slots means 5 queries can execute concurrently; additional queries queue until resources free up
Running
Default queue
Workload Management
Workload management is about creating queues for different workloads
User Group A
Long-running queue | Short-running queue
Short
Query Group
Long
Query Group
Workload Management
Workload Management
Don't set concurrency to more than you need
set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;
reset query_group;
Redshift Utils
Script Purpose
top_queries.sql Get the top 50 most time-consuming statements in the last 7 days.
perf_alerts.sql Get the top occurrences of alerts; join with table scans.
filter_used.sql Get the filter applied to tables on scans. To aid on choosing sortkey.
commit_stats.sql Get information on consumption of cluster resources through COMMIT statements.
current_session_info.sql Get information about sessions with currently running queries.
missing_table_stats.sql Get EXPLAIN plans that flagged missing statistics on underlying tables.
queuing_queries.sql Get queries that are waiting on a WLM query slot.
table_info.sql Get table storage information (size, skew, etc.).
• Redshift Admin Scripts provide diagnostic information for tuning and
troubleshooting
• https://github.com/awslabs/amazon-redshift-utils
Redshift Utils
View Purpose
v_check_data_distribution.sql Get data distribution across slices.
v_constraint_dependency.sql Get the foreign key constraints between tables.
v_generate_group_ddl.sql Get the DDL for a group.
v_generate_schema_ddl.sql Get the DDL for schemas.
v_generate_tbl_ddl.sql Get the DDL for a table. This will contain the distkey, sortkey, and constraints.
v_generate_unload_copy_cmd.sql Generate unload and copy commands for an object.
v_generate_user_object_permissions.sql Get the DDL for a user's permissions to tables and views.
v_generate_view_ddl.sql Get the DDL for a view.
v_get_obj_priv_by_user.sql Get the table/views that a user has access to.
v_get_schema_priv_by_user.sql Get the schema that a user has access to.
v_get_tbl_priv_by_user.sql Get the tables that a user has access to.
v_get_users_in_group.sql Get all users in a group.
v_get_view_priv_by_user.sql Get the views that a user has access to.
v_object_dependency.sql Merge the different dependency views.
v_space_used_per_tbl.sql Get space used per table.
v_view_dependency.sql Get the names of the views that are dependent on other tables/views.
• Redshift Admin Views provide information about user and group access, table constraints, object and view dependencies, data distribution across slices, and space used per table
Open source tools
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin scripts
Collection of utilities for running diagnostics on your cluster
Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established
schema with data already loaded
Q&A
PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime
"With AWS services we could pace our initial investments and forecast the costs of future expansion"

"Redshift allowed us to turn data into self-service information"
- Wanderley Paiva
Database Specialist

The Challenge
• Scalability
• Availability
• Centralized data
• Reduced costs, preferably pay-as-you-go

Solution
Solution v2

Benefícios e melhores práticas no uso do Amazon Redshift

  • 9. The Amazon Redshift view of data warehousing
10x cheaper: easy to provision; higher DBA productivity
10x faster: no programming; easily leverage BI tools, Hadoop, machine learning, streaming
Analysis in-line with process flows; pay as you go, grow as you need; managed availability & DR; enterprise big data SaaS
  • 10. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. The Forrester Wave™: Enterprise Data Warehouse, Q4 2015
  • 11. Selected Amazon Redshift Customers
NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology
  • 12. Amazon Redshift Security Petabyte-Scale Data Warehousing Service Amazon Redshift Architecture
  • 13. Amazon Redshift Architecture
Leader node: simple SQL endpoint; stores metadata; optimizes the query plan; coordinates query execution
Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, and resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
(Diagram labels: ingestion, backup/restore, JDBC/ODBC client access, 10 GigE (HPC) interconnect)
  • 14. Benefit #1: Amazon Redshift is fast
Dramatically less I/O: column storage, data compression, zone maps, direct-attached storage, large data block sizes
analyze compression listing;
Table | Column | Encoding
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
(Diagram: per-block min/max values form the zone map used to skip blocks during scans.)
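The encodings that ANALYZE COMPRESSION suggests can also be declared explicitly when a table is created. A minimal sketch for a TICKIT-style listing table; the column types here are assumptions, not from the deck:

```sql
-- Suggest encodings for an existing, already-loaded table:
ANALYZE COMPRESSION listing;

-- Encodings can then be applied explicitly at table creation time:
CREATE TABLE listing_encoded (
    listid         INTEGER       ENCODE delta,
    sellerid       INTEGER       ENCODE delta32k,
    eventid        INTEGER       ENCODE delta32k,
    dateid         SMALLINT      ENCODE bytedict,
    numtickets     SMALLINT      ENCODE bytedict,
    priceperticket DECIMAL(8,2)  ENCODE delta32k,
    totalprice     DECIMAL(8,2)  ENCODE mostly32,
    listtime       TIMESTAMP     ENCODE raw
);
```

The ColumnEncodingUtility mentioned earlier automates this analyze-then-rebuild cycle for tables that already hold data.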
  • 15. Benefit #1: Amazon Redshift is fast - sort keys and zone maps
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table: block min/max ranges overlap (01-JUNE to 20-JUNE, 08-JUNE to 30-JUNE, 12-JUNE to 20-JUNE, 02-JUNE to 25-JUNE), so most blocks must be scanned.
Sorted by date: block ranges are disjoint (01-06 JUNE, 07-12 JUNE, 13-18 JUNE, 19-24 JUNE), so only the single block containing the predicate value is scanned.
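In DDL, the sort key that makes this block-skipping possible is declared on the table; the table and column names below are hypothetical:

```sql
-- A date sort key lets zone maps skip blocks outside the filtered range:
CREATE TABLE logs (
    log_date  DATE,
    message   VARCHAR(256)
)
SORTKEY (log_date);

-- With data sorted by log_date, this scan only touches blocks whose
-- min/max range includes the predicate value:
SELECT COUNT(*) FROM logs WHERE log_date = '2016-06-09';
```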
  • 16. Benefit #1: Amazon Redshift is fast - parallel and distributed query, load, export, backup, restore, and resize
  • 17. Benefit #1: Amazon Redshift is fast - distribution keys
A six-row table (1 John Smith, 2 Jane Jones, 3 Peter Black, 4 Pat Partridge, 5 Sarah Cyan, 6 Brian Snail) is spread across slices by the distribution key: one slice holds rows 1 and 4, another rows 2 and 5, another rows 3 and 6.
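The distribution choice is declared per table in DDL; the schema below is a hypothetical illustration, not from the deck:

```sql
-- DISTKEY places rows with the same key value on the same slice, so
-- joins on that column avoid redistributing data over the network:
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER DISTKEY,
    total       DECIMAL(10,2)
);

-- Small dimension tables can instead be replicated to every node:
CREATE TABLE countries (
    country_id  SMALLINT,
    name        VARCHAR(64)
)
DISTSTYLE ALL;
```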
  • 18. Benefit #1: Amazon Redshift is fast
H/W optimized for I/O-intensive workloads; choice of storage type and instance size; regular cadence of auto-patched improvements.
Example: our new Dense Storage (HDD) instance type improved memory 2x, compute 2x, disk throughput 1.5x. Cost: same as our prior generation!
  • 19. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), DS2.XL single node (price per hour / effective annual price per TB compressed):
On-Demand: $0.850 / $3,725
1 Year Reservation: $0.500 / $2,190
3 Year Reservation: $0.228 / $999
DC1 (SSD), DC1.L single node (price per hour / effective annual price per TB compressed):
On-Demand: $0.250 / $13,690
1 Year Reservation: $0.161 / $8,795
3 Year Reservation: $0.100 / $5,500
Pricing is simple: number of nodes x price/hour; no charge for the leader node; no upfront costs; pay as you go.
  • 20. Benefit #2: Amazon Redshift is inexpensive - start small and grow big
Dense Storage (DS2.XL): 2 TB HDD, 31 GB RAM, 2 slices/4 cores; single node (2 TB) or cluster of 2-32 nodes (4 TB - 64 TB)
Dense Storage (DS2.8XL): 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE; cluster of 2-128 nodes (32 TB - 2 PB)
(Note: nodes not to scale)
  • 21. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups: multiple copies within the cluster; continuous and incremental backups to S3; continuous and incremental backups across regions; streaming restore.
(Diagram: backups flow to Amazon S3 in Region 1 and are copied to Amazon S3 in Region 2.)
  • 22. Benefit #3: Amazon Redshift is fully managed
Fault tolerance covers disk failures, node failures, network failures, and Availability Zone/Region level disasters.
(Diagram: cross-region copies in Amazon S3 provide the disaster-recovery path.)
  • 23. Benefit #4: Security is built-in
• Load encrypted from S3; SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest: all blocks on disks & in Amazon S3 encrypted; block key, cluster key, master key (AES-256); on-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
(Diagram: customer VPC and internal VPC; ingestion, backup/restore, JDBC/ODBC over 10 GigE (HPC))
  • 24. Benefit #5: We innovate quickly (2013-2016)
100+ new features added since launch; release every two weeks; automatic patching. Highlights include:
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL functions: most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-Region snapshot copy
• Audit features, cursor support, 500 concurrent client connections
• EIP address for VPC clusters
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; query visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage nodes (DS2) with better RAM and CPU
• New reserved storage nodes: No, Partial & All Upfront options
• Cross-region backups for KMS-encrypted clusters
• Scalar UDFs in Python
• AVRO ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify cluster dynamically
• Tag-based permissions and BZIP2
• System tables for query tuning
• Dense Compute nodes
• Gzip & Lzop; JSON, RegEx, cursors
• EMR data loading & bootstrap action with the COPY command; WLM concurrency limit to 50; support for ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch alarm for disk usage
• AES 128-bit encryption; UTF-16; KMS integration
• EU (Frankfurt) and GovCloud Regions
• S3 server-side encryption support for UNLOAD
• Tagging support for cost allocation
• WLM queue-hopping for timed-out queries
• Append rows & export to BZIP2
• Lambda for clusters in VPC; data schema conversion support from the ML console
• US West (N. California) Region
  • 25. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning (Amazon ML)
• Data science
  • 26. Benefit #7: Amazon Redshift has a large ecosystem: data integration, systems integrators, business intelligence
  • 27. Benefit #8: Service-oriented architecture - integrates with DynamoDB, EMR, S3, EC2/SSH, RDS/Aurora, Amazon Kinesis, Machine Learning, Data Pipeline, CloudSearch, and Mobile Analytics
  • 30. Analyzing the Twitter firehose
Amazon Redshift: starts at $0.25/hour
EC2: starts at $0.02/hour
S3: $0.030/GB-month
Amazon Glacier: $0.010/GB-month
Amazon Kinesis: $0.015/shard (1 MB/s in; 2 MB/s out); $0.028/million PUTs
  • 31. Data warehouses can be inexpensive and powerful
500MM tweets/day = ~5,800 tweets/sec; at 2 KB/tweet that is ~12 MB/sec (~1 TB/day)
At $0.015/hour per shard and $0.028/million PUTs, Amazon Kinesis cost is $0.765/hour
Amazon Redshift cost is $0.850/hour (for a 2 TB node)
S3 cost is $1.28/hour (no compression)
Total: $2.895/hour
  • 32. Data warehouses can be inexpensive and powerful
Use only the services you need; scale only the services you need; pay for what you use.
~40% discount with a 1-year commitment; ~70% discount with a 3-year commitment.
  • 33. Amazon.com - weblog analysis
Web log analysis for Amazon.com: 1 PB+ workload, 2 TB/day, growing 67% YoY; largest table: 400 TB. Goal: understand customer behavior.
  • 34. Data warehouses can be fast and simple
Query 15 months of data (1 PB) in 14 minutes; load 5B rows in 10 minutes
21B rows joined with 10B rows: from 3 days (Hive) to 2 hours
Load pipeline: from 90 hours (Oracle) to 8 hours
64 clusters, 800 total nodes, 13 PB provisioned storage, 2 DBAs
  • 35. Sushiro – Real-time streaming from IoT & analysis
  • 36. Sushiro - real-time streaming & analysis
Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift. 380 stores stream live data from sushi plates; inventory information is combined with consumption information in near real time to forecast demand by store, minimize food waste, and improve efficiencies.
  • 37. Data warehouses can support real-time data
Big data does not mean batch: it can be streamed in, processed in near real time, and used to respond quickly to requests. You can mix and match on-premises and cloud, custom development and managed services, and infrastructure with managed scaling and security.
  • 38. Amazon Redshift Data Loading and Migration Petabyte-Scale Data Warehousing Service
  • 39. Data loading process: data source, extraction, transformation, loading, Amazon Redshift target. (This section focuses on the loading step.)
  • 40. Amazon Redshift loading data - overview
Sources in the corporate data center (logs/files, source DBs) reach AWS over a VPN connection or AWS Direct Connect. Within the AWS cloud, data arrives via S3 multipart upload, AWS Import/Export, EC2 or on-prem hosts (using SSH), Database Migration Service, Kinesis, AWS Lambda, or AWS Data Pipeline, and can flow from Amazon DynamoDB, Amazon S3, Amazon EMR, Amazon RDS, or Amazon Glacier into Amazon Redshift.
  • 41. Uploading files to Amazon S3
Ensure that your data resides in the same Region as your Redshift cluster
Split the data into multiple files (e.g. client.txt.1 through client.txt.4) to facilitate parallel processing
Files should be individually compressed using GZIP or LZOP
Optionally, encrypt your data using Amazon S3 server-side or client-side encryption
  • 42. Loading data from Amazon S3: preparing input data files; uploading files to Amazon S3; using COPY to load data from Amazon S3
  • 43. Splitting data files
With 2 XL compute nodes (two slices each), the four files client.txt.1 through client.txt.4 load in parallel, one per slice:
copy customer from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';
  • 44. Use the COPY command Each slice can load one file at a time A single input file means only one slice is ingesting data Instead of 100MB/s, you’re only getting 6.25MB/s Loading – Use multiple input files to maximize throughput
  • 45. Use the COPY command You need at least as many input files as you have slices With 16 input files, all slices are working so you maximize throughput Get 100MB/s per node; scale linearly as you add nodes Loading – Use multiple input files to maximize throughput
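The parallel-load pattern above can be sketched as a single COPY against a common key prefix. This is a minimal, hypothetical example — the bucket, table, and file names are illustrative; a COPY whose FROM clause is a prefix loads every matching file, one per slice.

```sql
-- client.txt was split into client.txt.1 … client.txt.4 and uploaded
-- under the same S3 prefix; COPY loads all matching files in parallel.
copy customer
from 's3://mydata/client.txt.'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip;  -- assumes each part was individually compressed with GZIP
```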
• 46. Amazon Confidential – Slides not intended for redistribution. • Start your first migration in 10 minutes or less • Keep your applications running during the migration • Move data to the same engine or to a different one AWS Database Migration Service
  • 47. Customer Premises Application Users AWS Internet VPN Start a replication instance Connect to source and target databases Select tables, schemas, or databases Let AWS Database Migration Service create tables, load data, and keep them in sync Switch applications over to the target at your convenience AWS DMS - Keep your apps running during the migration AWS Database Migration Service
• 48. Loading Data from an Amazon DynamoDB Table Differences: Table Names – Amazon DynamoDB: up to 255 characters, may contain '.' (dot) and '-' (dash) characters, case-sensitive; Amazon Redshift: limited to 127 characters, can't contain dots or dashes, NOT case-sensitive, can't conflict with any Amazon Redshift reserved words. NULL – Amazon DynamoDB does not support the SQL concept of NULL; you must specify how Amazon Redshift interprets empty or blank attribute values. Data Types – STRING and NUMBER data types are supported; BINARY and SET data types are not.
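A DynamoDB load follows the same COPY pattern; this sketch uses an illustrative table name. READRATIO limits how much of the DynamoDB table's provisioned read capacity the COPY may consume, and EMPTYASNULL is one way to tell Redshift how to treat empty attribute values.

```sql
-- Hypothetical example: load the DynamoDB table "Customer" into Redshift,
-- using at most 50% of its provisioned read throughput.
copy customer
from 'dynamodb://Customer'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
readratio 50
emptyasnull;  -- map empty DynamoDB attribute values to SQL NULL
```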
  • 49. Loading Data from Amazon Elastic MapReduce Load data from Amazon EMR in parallel using COPY Specify Amazon EMR cluster ID and HDFS file path/name Amazon EMR must be running until COPY completes. copy sales from 'emr:// j-1H7OUO3B52HI5/myoutput/part*' credentials 'aws_access_key_id=<access-key id>; aws_secret_access_key=<secret-access-key>';
  • 50. Loading Data using Lambda AWS Lambda-based Amazon Redshift loader to offer you the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain. Blog post https://blogs.aws.amazon.com/bigdata/post/ Tx24VJ6XF1JVJAA/A-Zero-Administration- Amazon-Redshift-Database-Loader GitHub http://github.com/awslabs/aws- lambda-redshift-loader
• 51. Remote Loading using SSH The Redshift COPY command can reach out to remote locations (EC2 and on-premises) to load data over a secure shell (SSH) connection To Remote Load follow this process: • Add the cluster's public key to the remote host's authorized keys file • Configure the remote host to accept connections from all cluster IP addresses • Create a manifest file in JSON format and upload it to an S3 bucket • Issue a COPY command, including a reference to the manifest file
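The last two steps above can be sketched as follows; host, path, and bucket names are hypothetical. The manifest entry names the remote endpoint and the command whose standard output COPY will ingest, and the SSH keyword tells COPY that the S3 object is a manifest rather than data.

```sql
-- Manifest (uploaded to S3 beforehand), e.g.:
--   { "entries": [ { "endpoint": "ec2-12-34-56-78.compute.amazonaws.com",
--                    "command": "cat /data/client.txt",
--                    "mandatory": true,
--                    "username": "ec2-user" } ] }
copy customer
from 's3://mydata/ssh_manifest.json'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
ssh;
```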
  • 52. Vacuuming Tables 1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING 2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE 3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE 4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINER 1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas Column 0 Column 1 Column 2 Amazon Redshift serializes all of the values of a column together.
• 53. Vacuuming Tables 1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING 2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE 3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE 4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINER 1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas Column 0 Column 1 Column 2 delete from customer where column_0 = 3; (deleted values are only marked for deletion, not physically removed)
• 54. Vacuuming Tables VACUUM customer; 1,2,4 RFK,JFK,GWB 900 Columbus,800 Washington,600 Kansas Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance as well as increasing available storage.
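VACUUM has variants that let you reclaim space, re-sort, or both; a quick sketch (table name illustrative):

```sql
vacuum customer;              -- full vacuum: reclaims space and re-sorts rows
vacuum delete only customer;  -- reclaims deleted-row space without re-sorting
vacuum sort only customer;    -- re-sorts rows without reclaiming space
```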
• 55. Analyzing Tables ANALYZE command The entire current database A single table One or more specific columns in a single table The ANALYZE command obtains a sample of rows, does some calculations, and saves resulting column statistics. You do not need to analyze all columns in all tables regularly Analyze the columns that are frequently used in the following: • Sorting and grouping operations • Joins • Query predicates To maintain current statistics for tables: • Run the ANALYZE command before running queries • Run the ANALYZE command against the database routinely at the end of every regular load or update cycle • Run the ANALYZE command against any new tables Statistics
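The three ANALYZE scopes listed above look like this (table and column names are illustrative):

```sql
analyze;                              -- the entire current database
analyze customer;                     -- a single table
analyze customer (c_custkey, c_name); -- specific columns in a table
```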
• 56. Managing Concurrent Write Operations Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table Transaction 1 Transaction 2 CUSTOMER copy customer from 's3://mydata/client.txt'… ; copy customer from 's3://mydata/client.txt'… ; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table Transaction 2 waits until transaction 1 releases the write lock
• 57. Managing Concurrent Write Operations Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table Transaction 1 Transaction 2 CUSTOMER Begin; Delete one row from CUSTOMERS; Copy…; Select count(*) from CUSTOMERS; End; Begin; Delete one row from CUSTOMERS; Copy…; Select count(*) from CUSTOMERS; End; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table Transaction 2 waits until transaction 1 releases the write lock
• 58. Managing Concurrent Write Operations Potential deadlock situation for concurrent write transactions Transaction 1 Transaction 2 CUSTOMER Begin; Delete 3000 rows from CUSTOMERS; Copy…; Delete 5000 rows from PARTS; End; Begin; Delete 3000 rows from PARTS; Copy…; Delete 5000 rows from CUSTOMERS; End; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table PARTS Transaction 2 takes a write lock on the PARTS table Wait Wait
• 59. Amazon Redshift Table and Schema Design Petabyte-Scale Data Warehousing Service
• 60. Goals of distribution • Distribute data evenly for parallel processing • Minimize data movement • Co-located joins • Localized aggregations Distribution key All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Full table data on first slice of every node Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round-robin distribution
  • 61. Choosing a distribution style Key • Large FACT tables • Rapidly changing tables used in joins • Localize columns used within aggregations All • Have slowly changing data • Reasonable size (i.e., few millions but not 100s of millions of rows) • No common distribution key for frequent joins • Typical use case: joined dimension table without a common distribution key Even • Tables not frequently joined or aggregated • Large tables without acceptable candidate keys
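The three styles above map directly onto table DDL. A minimal sketch, with hypothetical table and column names:

```sql
-- Large fact table: distribute on the column used in joins/aggregations
create table sales (
  sale_id bigint,
  cust_id int,
  amount  decimal(12,2))
diststyle key distkey (cust_id);

-- Small, slowly changing dimension: full copy on every node
create table calendar (
  cal_date date,
  holiday  boolean)
diststyle all;

-- No acceptable candidate key: round-robin distribution
create table web_logs (
  log_line varchar(400))
diststyle even;
```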
  • 62. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Poor key choices lead to uneven distribution of records…
  • 63. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Unevenly distributed data cause processing imbalances!
  • 64. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records2M records 2M records 2M records Evenly distributed data improves query performance
  • 66. Goals of sorting • Physically sort data within blocks and throughout a table • Optimal SORTKEY is dependent on: • Query patterns • Data profile • Business requirements
  • 67. COMPOUND • Most common • Well-defined filter criteria • Time-series data Choosing a SORTKEY INTERLEAVED • Edge cases • Large tables (>billion rows) • No common filter criteria • Non time-series data • Primarily as a query predicate (date, identifier, …) • Optionally, choose a column frequently used for aggregates • Optionally, choose same as distribution key column for most efficient joins (merge join)
  • 68. Table is sorted by 1 column [ SORTKEY ( date ) ] Best for: • Queries that use 1st column (i.e. date) as primary filter • Can speed up joins and group bys • Quickest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Single Column
  • 69. • Table is sorted by 1st column , then 2nd column etc. [ SORTKEY COMPOUND ( date, region, country) ] • Best for: • Queries that use 1st column as primary filter, then other cols • Can speed up joins and group bys • Slower to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Compound
  • 70. • Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ] • Best for: • Queries that use different columns in filter • Queries get faster the more columns used in the filter (up to 8) • Slowest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Interleaved
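The compound and interleaved variants shown in the last three slides differ only in the SORTKEY clause; a sketch with illustrative names:

```sql
-- Compound: sorted by event_date first, then region, then country
create table events_compound (
  event_date date,
  region     varchar(20),
  country    varchar(30))
compound sortkey (event_date, region, country);

-- Interleaved: equal weight to each column, for varied filter patterns
create table events_interleaved (
  event_date date,
  region     varchar(20),
  country    varchar(30))
interleaved sortkey (event_date, region, country);
```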
  • 71. Compressing data • COPY automatically analyzes and compresses data when loading into empty tables • ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column • Changing column encoding requires a table rebuild
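Both paths described above are one statement each; the table, columns, and chosen encodings here are illustrative:

```sql
-- Ask Redshift to propose optimal encodings for an existing table
analyze compression customer;

-- Or set encodings explicitly at creation time
create table customer_v2 (
  c_custkey int         encode delta,
  c_name    varchar(30) encode lzo,
  c_city    varchar(30) encode bytedict);
```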
  • 72. Automatic compression is a good thing (mostly) • From the zone maps we know: • Which blocks contain the range • Which row offsets to scan • Highly compressed SORTKEYs: • Many rows per block • Large row offset Skip compression on just the leading column of the compound SORTKEY
• 73. During queries and ingestion, the system allocates buffers based on column width Columns that are wider than needed waste memory Fewer rows fit into memory; increased likelihood of queries spilling to disk DDL – Make columns as narrow as possible
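A quick illustration of the narrow-columns rule (types chosen for the hypothetical data, not a prescription):

```sql
-- Prefer the narrowest type that fits the data
create table clicks (
  url     varchar(400),  -- not varchar(65535)
  country char(2),       -- ISO country code, not varchar(100)
  hits    int);          -- not bigint if counts stay small
```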
• 74. Amazon Redshift Operations Petabyte-Scale Data Warehousing Service
  • 75. Backup and Restore Backups to Amazon S3 are automatic, continuous & incremental Configurable system snapshot retention period User-defined snapshots are on-demand Streaming restores enable you to resume querying faster 128GB RAM 16TB disk 16 coresRedshift Cluster Node 128GB RAM 16TB disk 16 coresRedshift Cluster Node 128GB RAM 16TB disk 16 coresRedshift Cluster Node Amazon S3
  • 76. Snapshots Automated Snapshots Redshift enables automated snapshots by default Snapshots are incremental Redshift retains incremental backup data required to restore cluster using manual snapshot or automatic snapshot Only for clusters that haven’t reached the snapshot retention period Amazon Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster. Manual Snapshots The system will never delete a manual snapshot You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts Shared access between accounts means you can move data between dev/test/prod without reloading
• 77. Cross-Region Snapshots Copy cluster snapshots to another region using the management console Can't modify the destination region without disabling cross-region snapshots and re-enabling with a new destination region and retention period
• 78. Streaming Restore During restore your node is provisioned in less than two minutes Query the provisioned node immediately as data is automatically streamed from the S3 snapshot
  • 79. ― Cluster is put into read-only mode ― New cluster is provisioned according to resizing needs ― Node-to-node parallel data copy ― Only charged for source cluster Resize – Phase 1
  • 80. Resize - Phase 2 ― Automatic SQL endpoint switchover via DNS ― Decommission the source cluster
  • 81. Resizing a Redshift Data Warehouse via the AWS Console
  • 82. Resizing a Redshift Data Warehouse via the AWS Console http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
• 83. Resizing without Production Impact While you resize, your cluster is read-only • Query the cluster throughout resizing • There will be a short outage at the end of cluster resizing
• 84. Parameter Groups Allow you to set a variety of configuration options: • date style, search path and query timeout • user activity logging • whether the leader node should accept only SSL-protected connections Apply to every database in a cluster View events using the console, the SDK, or the CLI/API Use SET to change some properties temporarily, for example: set wlm_query_slot_count to 3; Some changes to the parameter group may require a restart to take effect Default Parameter Values http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
  • 85. Events Event information is available for a period of several weeks Source types: • Cluster • Parameter Group • Security Group • Snapshot Categories: • Configuration • Management • Monitoring • Security Subscribe to events to receive SNS notifications http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
• 86. Resource Tagging Define custom tags for your cluster • provide metadata about a resource • categorize billing reports for cost allocation • Activate tags in the billing and cost management service Cluster resize or a restore from a snapshot in the original region preserves tags, but cross-region snapshot copies do not http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
• 87. Concurrent Query Execution Long-running queries may cause short-running queries to wait in a queue • As a result users may experience unpredictable performance By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max) • 5 slots means 5 queries can execute concurrently while additional queries will queue until resources free up Running Default queue
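A session can temporarily claim more than one slot from its queue for a memory-hungry statement; a hypothetical sketch:

```sql
-- Use 3 of the queue's 5 slots for the next statement in this session
set wlm_query_slot_count to 3;
vacuum customer;                -- illustrative memory-intensive operation
set wlm_query_slot_count to 1;  -- return to the default
```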
  • 88. Workload Management Workload management is about creating queues for different workloads User Group A Short-running queueLong-running queue Short Query Group Long Query Group
• 90. Workload Management Don't set concurrency to more than you need set query_group to allqueries; select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000; reset query_group;
  • 91. Redshift Utils Script Purpose top_queries.sql Get the top 50 most time-consuming statements in the last 7 days. perf_alerts.sql Get the top occurrences of alerts; join with table scans. filter_used.sql Get the filter applied to tables on scans. To aid on choosing sortkey. commit_stats.sql Get information on consumption of cluster resources through COMMIT statements. current_session_info.sql Get information about sessions with currently running queries. missing_table_stats.sql Get EXPLAIN plans that flagged missing statistics on underlying tables. queuing_queries.sql Get queries that are waiting on a WLM query slot. table_info.sql Get table storage information (size, skew, etc.). • Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting • https://github.com/awslabs/amazon-redshift-utils
• 92. Redshift Utils View Purpose v_check_data_distribution.sql Get data distribution across slices. v_constraint_dependency.sql Get the foreign key constraints between tables. v_generate_group_ddl.sql Get the DDL for a group. v_generate_schema_ddl.sql Get the DDL for schemas. v_generate_tbl_ddl.sql Get the DDL for a table. This will contain the distkey, sortkey, and constraints. v_generate_unload_copy_cmd.sql Generate unload and copy commands for an object. v_generate_user_object_permissions.sql Get the DDL for a user's permissions to tables and views. v_generate_view_ddl.sql Get the DDL for a view. v_get_obj_priv_by_user.sql Get the tables/views that a user has access to. v_get_schema_priv_by_user.sql Get the schemas that a user has access to. v_get_tbl_priv_by_user.sql Get the tables that a user has access to. v_get_users_in_group.sql Get all users in a group. v_get_view_priv_by_user.sql Get the views that a user has access to. v_object_dependency.sql Merge the different dependency views. v_space_used_per_tbl.sql Get space used per table. v_view_dependency.sql Get the names of the views that are dependent on other tables/views. • Redshift Admin Views provide information about user and group access, various table constraints, object and view dependencies, data distribution across slices, and space used per table
  • 93. Open source tools https://github.com/awslabs/amazon-redshift-utils https://github.com/awslabs/amazon-redshift-monitoring https://github.com/awslabs/amazon-redshift-udfs Admin scripts Collection of utilities for running diagnostics on your cluster Admin views Collection of utilities for managing your cluster, generating schema DDL, etc. ColumnEncodingUtility Gives you the ability to apply optimal column encoding to an established schema with data already loaded
  • 94. Q&A
  • 95. PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime
• 96. “With AWS services we were able to pace our initial investments and project the costs of future expansions” “Redshift allowed us to turn data into self-service information” - Wanderley Paiva, Database Specialist
• 97. The Challenge • Scalability • Availability • Data centralization • Reduced costs, preferably spread over time