Amazon Redshift is a fast, managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data using the business intelligence tools you already own. Start small for just $0.25 per hour with no commitments, and scale up to petabytes for $1,000 per terabyte per year, less than a tenth the cost of traditional solutions. Customers typically report 3x compression, which reduces their cost to $333 per uncompressed terabyte per year.
3. We start with the customer… and innovate
Customers told us… / We created…
• Managing databases is painful & difficult → Amazon RDS
• SQL DBs do not perform well at scale → Amazon DynamoDB
• Hadoop is difficult to deploy and manage → Amazon EMR
• DWs are complex, costly, and slow → Amazon Redshift
• Commercial DBs are punitive & expensive → Amazon Aurora
• Streaming data is difficult to capture & analyze → Amazon Kinesis
• BI tools are expensive and hard to manage → Amazon QuickSight
4. AWS Big Data Portfolio
Collect: AWS Import/Export, AWS Direct Connect, Amazon Kinesis, Kinesis Firehose (new)
Store: Amazon S3, Amazon Glacier, DynamoDB, RDS, Aurora
Analyze: EMR, EC2, Redshift, Machine Learning, Data Pipeline, CloudSearch, Elasticsearch (launched), Database Migration (new), QuickSight (new), SQL over streams (new)
6. Amazon Redshift: relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• Free tier – 2 months
A lot faster, a lot simpler, a lot cheaper.
7. The legacy view of data warehousing ...
Multi-year commitment
Multi-year deployments
Multi-million dollar deals
8. … Leads to dark data
This is a narrow view: small companies also have big data (mobile, social, gaming, adtech, IoT). Long cycles, high costs, and administrative complexity all stifle innovation.
[Chart: enterprise data generated vs. data actually in the warehouse]
9. The Amazon Redshift view of data warehousing
• 10x cheaper: easy to provision; higher DBA productivity
• 10x faster: no programming; easily leverage BI tools, Hadoop, machine learning, streaming
• Enterprise big data SaaS: analysis in line with process flows; pay as you go, grow as you need; managed availability & DR
10. The Forrester Wave™: Enterprise Data Warehouse, Q4 2015
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
11. Selected Amazon Redshift Customers
NTT Docomo | Telecom FINRA | Financial Svcs Philips | Healthcare Yelp | Technology NASDAQ | Financial Svcs
The Weather Company | Media Nokia | Telecom Pinterest | Technology Foursquare | Technology Coursera | Education
Coinbase | Bitcoin Amazon | E-Commerce Etix | Entertainment Spuul | Entertainment Vivaki | Ad Tech
Z2 | Gaming Neustar | Ad Tech SoundCloud | Technology BeachMint | E-Commerce Civis | Technology
15. Benefit #1: Amazon Redshift is fast – Sort Keys and Zone Maps
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table (any block could contain 09-JUNE, so every block must be scanned):
• Block 1: MIN 01-JUNE-2016, MAX 20-JUNE-2016
• Block 2: MIN 08-JUNE-2016, MAX 30-JUNE-2016
• Block 3: MIN 12-JUNE-2016, MAX 20-JUNE-2016
• Block 4: MIN 02-JUNE-2016, MAX 25-JUNE-2016
Sorted by date (the zone maps exclude all but one block):
• Block 1: MIN 01-JUNE-2016, MAX 06-JUNE-2016
• Block 2: MIN 07-JUNE-2016, MAX 12-JUNE-2016
• Block 3: MIN 13-JUNE-2016, MAX 18-JUNE-2016
• Block 4: MIN 19-JUNE-2016, MAX 24-JUNE-2016
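A minimal DDL sketch of the idea above; the `logs` table and its columns are hypothetical, chosen only to mirror the query on the slide:

```sql
-- Hypothetical table; declaring the date column as the sort key keeps
-- rows physically ordered by date, so zone maps (per-block min/max
-- values) let Redshift skip blocks whose range excludes the predicate.
CREATE TABLE logs (
    log_date DATE NOT NULL,
    source   VARCHAR(64),
    message  VARCHAR(512)
)
SORTKEY (log_date);

-- Only the block(s) whose min/max range covers this date are scanned:
SELECT COUNT(*) FROM logs WHERE log_date = '2016-06-09';
```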
16. Benefit #1: Amazon Redshift is fast
Parallel and distributed: query, load, export, backup, restore, resize.
17. Benefit #1: Amazon Redshift is fast – Distribution Keys
Source table (ID, Name): 1 John Smith; 2 Jane Jones; 3 Peter Black; 4 Pat Partridge; 5 Sarah Cyan; 6 Brian Snail
Distributed by key across three slices:
• Slice 1: 1 John Smith, 4 Pat Partridge
• Slice 2: 2 Jane Jones, 5 Sarah Cyan
• Slice 3: 3 Peter Black, 6 Brian Snail
18. Benefit #1: Amazon Redshift is fast
• Hardware optimized for I/O-intensive workloads
• Choice of storage type, instance size
• Regular cadence of auto-patched improvements
• Example: our new Dense Storage (HDD) instance type improved memory 2x, compute 2x, and disk throughput 1.5x, at the same cost as the prior generation
19. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD): price per hour for a DS2.XL single node / effective annual price per TB compressed
• On-Demand: $0.850/hr, $3,725/TB/yr
• 1-Year Reservation: $0.500/hr, $2,190/TB/yr
• 3-Year Reservation: $0.228/hr, $999/TB/yr
DC1 (SSD): price per hour for a DC1.L single node / effective annual price per TB compressed
• On-Demand: $0.250/hr, $13,690/TB/yr
• 1-Year Reservation: $0.161/hr, $8,795/TB/yr
• 3-Year Reservation: $0.100/hr, $5,500/TB/yr
Pricing is simple: number of nodes x price/hour; no charge for the leader node; no upfront costs; pay as you go.
20. Benefit #2: Amazon Redshift is inexpensive – start small and grow big
• Dense Storage (DS2.XL): 2 TB HDD, 31 GB RAM, 2 slices/4 cores; single node (2 TB) or cluster of 2-32 nodes (4 TB to 64 TB)
• Dense Storage (DS2.8XL): 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE; cluster of 2-128 nodes (32 TB to 2 PB)
21. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups:
• Multiple copies within the cluster
• Continuous and incremental backups to S3
• Continuous and incremental backups across regions
• Streaming restore
22. Benefit #3: Amazon Redshift is fully managed
Fault tolerance:
• Disk failures
• Node failures
• Network failures
• Availability Zone/Region-level disasters
23. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest: all blocks on disks & in Amazon S3 encrypted; block key, cluster key, master key (AES-256); on-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
24. Benefit #5: We innovate quickly (2013-2016)
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL functions: most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-region snapshot copy
• Audit features, cursor support, 500 concurrent client connections
• EIP address for VPC clusters
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; query visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage nodes (DS2) with better RAM and CPU
• New reserved node options: No, Partial & All Upfront
• Cross-region backups for KMS-encrypted clusters
• Scalar UDFs in Python
• AVRO ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify cluster dynamically
• Tag-based permissions and BZIP2
• System tables for query tuning
• Dense Compute nodes
• Gzip & Lzop; JSON, regex, cursors
• EMR data loading & bootstrap action with COPY command; WLM concurrency limit raised to 50; support for ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch alarm for disk usage
• AES 128-bit encryption; UTF-16; KMS integration
• EU (Frankfurt) and GovCloud Regions
• S3 server-side encryption support for UNLOAD
• Tagging support for cost allocation
• WLM queue hopping for timed-out queries
• Append rows & export to BZIP2
• Lambda for clusters in VPC; data schema conversion support from the ML console
• US West (N. California) Region
100+ new features added since launch. Release every two weeks. Automatic patching.
25. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning (Amazon ML)
• Data science
26. Benefit #7: Amazon Redshift has a large ecosystem
Business intelligence, data integration, and systems integrator partners.
27. Benefit #8: Service-oriented architecture
Amazon Redshift integrates with DynamoDB, EMR, S3, EC2/SSH, RDS/Aurora, Amazon Kinesis, Machine Learning, Data Pipeline, CloudSearch, and Mobile Analytics.
31. 500MM tweets/day ≈ 5,800 tweets/sec; at ~2 KB per tweet that is ~12 MB/sec (~1 TB/day)
• Amazon Kinesis: $0.015/hour per shard, $0.028/million PUTs; cost is $0.765/hour
• Amazon Redshift: $0.850/hour (for a 2 TB node)
• S3: $1.28/hour (no compression)
Total: $2.895/hour
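A quick back-of-the-envelope check of the figures above, using only the slide's stated inputs (500M tweets/day, ~2 KB per tweet, and the three hourly rates):

```python
# Verify the streaming sizing and cost arithmetic from the slide.
tweets_per_day = 500_000_000
tweets_per_sec = tweets_per_day / 86_400        # ~5,800 tweets/sec
bytes_per_sec = tweets_per_sec * 2_000          # ~12 MB/sec
tb_per_day = bytes_per_sec * 86_400 / 1e12      # ~1 TB/day

kinesis = 0.765   # $/hour (shards + PUT payload units, per the slide)
redshift = 0.850  # $/hour, one 2 TB node
s3 = 1.28         # $/hour equivalent, uncompressed
total = kinesis + redshift + s3

print(round(tweets_per_sec), round(bytes_per_sec / 1e6, 1), round(tb_per_day, 1), round(total, 3))
```

Running it reproduces the slide's rounded numbers: roughly 5,800 tweets/sec, ~12 MB/sec, ~1 TB/day, and $2.895/hour in total.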
Data warehouses can be inexpensive and powerful.
32. Use only the services you need
Scale only the services you need
Pay for what you use
~40% discount with a 1-year commitment
~70% discount with a 3-year commitment
Data warehouses can be inexpensive and powerful.
33. Amazon.com – Weblog analysis
Web log analysis for Amazon.com
1PB+ workload, 2TB/day, growing 67% YoY
Largest table: 400 TB
Want to understand customer behavior
34. Query 15 months of data (1 PB) in 14 minutes
Load 5B rows in 10 minutes
21B rows joined with 10B rows: from 3 days (Hive) to 2 hours
Load pipeline: from 90 hours (Oracle) to 8 hours
64 clusters, 800 total nodes, 13 PB provisioned storage, 2 DBAs
Data warehouses can be fast and simple.
36. Sushiro – Real-time streaming & analysis
Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift
• 380 stores stream live data from sushi plates
• Inventory information combined with consumption information in near real time
• Forecast demand by store, minimize food waste, and improve efficiencies
37. Big data does not mean batch
Big data can be streamed in, processed in near real time, and used to respond quickly to requests.
You can mix and match: on-premises and cloud; custom development and managed services; infrastructure with managed scaling and security.
Data warehouses can support real-time data.
38. Amazon Redshift Data Loading and Migration
Petabyte-scale data warehousing service
40. Amazon Redshift Loading Data Overview
From the corporate data center, logs/files and source DBs reach the AWS cloud over a VPN connection, AWS Direct Connect, S3 multipart upload, or AWS Import/Export. Within AWS, Amazon Redshift loads data from Amazon S3, Amazon DynamoDB, Amazon EMR, Amazon RDS, EC2 or on-premises hosts (using SSH), the Database Migration Service, Kinesis, AWS Lambda, and AWS Data Pipeline, with Amazon Glacier for archival storage.
41. Uploading Files to Amazon S3
• Ensure that your data resides in the same region as your Redshift clusters
• Split the data into multiple files to facilitate parallel processing (e.g., Client.txt split into Client.txt.1 through Client.txt.4)
• Files should be individually compressed using GZIP or LZOP
• Optionally, you can encrypt your data using Amazon S3 server-side or client-side encryption
42. Loading Data From Amazon S3
Preparing Input Data Files
Uploading files to Amazon S3
Using COPY to load data from Amazon S3
44. Use the COPY command
Each slice can load one file at a time. A single input file means only one slice is ingesting data, so instead of 100 MB/s you are only getting 6.25 MB/s.
Loading: use multiple input files to maximize throughput.
45. Use the COPY command
You need at least as many input files as you have slices. With 16 input files, all slices are working, so you maximize throughput: 100 MB/s per node, scaling linearly as you add nodes.
Loading: use multiple input files to maximize throughput.
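A hedged example of such a parallel load; the bucket and credential placeholders are illustrative. A COPY with a key prefix picks up every matching object (e.g. client.txt.1.gz through client.txt.16.gz) and spreads them across the slices:

```sql
-- Loads every S3 object whose key starts with the prefix, in parallel.
COPY customer
FROM 's3://mydata/client.txt.'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
GZIP;
```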
46. AWS Database Migration Service
• Start your first migration in 10 minutes or less
• Keep your applications running during the migration
• Move data to the same engine or a different one
47. AWS DMS – Keep your apps running during the migration
• Start a replication instance
• Connect to source and target databases
• Select tables, schemas, or databases
• Let AWS Database Migration Service create tables, load data, and keep them in sync
• Switch applications over to the target at your convenience
48. Loading Data from an Amazon DynamoDB Table
Table names:
• Amazon DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash); case-sensitive
• Amazon Redshift: limited to 127 characters; can't contain dots or dashes; not case-sensitive; can't conflict with any Amazon Redshift reserved words
NULL: DynamoDB does not support the SQL concept of NULL, so you must specify how Amazon Redshift interprets empty or blank attribute values in Amazon DynamoDB.
Data types: STRING and NUMBER are supported; BINARY and SET are not.
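A sketch of the corresponding COPY; the table names and read ratio are illustrative. READRATIO caps how much of the DynamoDB table's provisioned read throughput the load may consume:

```sql
-- Load a Redshift table directly from a DynamoDB table,
-- using at most 50% of its provisioned read capacity.
COPY favoritemovies
FROM 'dynamodb://ProductCatalog'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
READRATIO 50;
```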
49. Loading Data from Amazon Elastic MapReduce
Load data from Amazon EMR in parallel using COPY
Specify Amazon EMR cluster ID and HDFS file path/name
Amazon EMR must be running until COPY completes.
copy sales from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
50. Loading Data Using Lambda
The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
Blog post: https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader
GitHub: http://github.com/awslabs/aws-lambda-redshift-loader
51. Remote Loading Using SSH
The Redshift COPY command can reach out to remote locations (EC2 and on-premises) and load data over a secure shell (SSH) connection.
To remote load, follow this process:
• Add the cluster's public key to the remote host's authorized keys file
• Configure the remote host to accept connections from all cluster IP addresses
• Create a manifest file in JSON format and upload it to an S3 bucket
• Issue a COPY command that references the manifest file
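A sketch of what such a manifest might look like; host names, commands, and user names are illustrative. Each entry names a host and the command COPY runs there over SSH; the `mandatory` flag controls whether the load fails if that connection fails:

```json
{
  "entries": [
    {
      "endpoint": "ec2-184-72-204-112.compute-1.amazonaws.com",
      "command": "cat /home/ec2-user/data/sales1.txt",
      "mandatory": true,
      "username": "ec2-user"
    },
    {
      "endpoint": "onprem-host.example.com",
      "command": "cat /data/sales2.txt",
      "mandatory": false,
      "username": "loader"
    }
  ]
}
```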
52. Vacuuming Tables
Sample rows:
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINER
Columnar layout: Column 0 = 1,2,3,4; Column 1 = RFK,JFK,LBJ,GWB; Column 2 = 900 Columbus, 800 Washington, 700 Foxborough, 600 Kansas; …
Amazon Redshift serializes all of the values of a column together.
53. Vacuuming Tables
DELETE FROM customer WHERE column_0 = 3;
After the delete, row 3's values in each column are only marked as deleted, not physically removed; the blocks still hold the dead values.
54. Vacuuming Tables
VACUUM customer;
After VACUUM: Column 0 = 1,2,4; Column 1 = RFK,JFK,GWB; Column 2 = 900 Columbus, 800 Washington, 600 Kansas
Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance as well as increasing available storage.
55. Analyzing Tables
The ANALYZE command obtains a sample of rows, does some calculations, and saves the resulting column statistics. It can target the entire current database, a single table, or one or more specific columns in a single table.
You do not need to analyze all columns in all tables regularly. Analyze the columns that are frequently used in:
• Sorting and grouping operations
• Joins
• Query predicates
To maintain current statistics for tables:
• Run ANALYZE before running queries
• Run ANALYZE against the database routinely at the end of every regular load or update cycle
• Run ANALYZE against any new tables
56. Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE operations against the same table:
Session A (transaction 1): COPY customer FROM 's3://mydata/client.txt' …;
Session B (transaction 2): COPY customer FROM 's3://mydata/client.txt' …;
Transaction 1 takes the write lock on the CUSTOMER table; transaction 2 waits until transaction 1 releases the write lock.
57. Managing Concurrent Write Operations
Both sessions run the same transaction against the CUSTOMER table:
BEGIN; delete one row from CUSTOMERS; COPY …; SELECT COUNT(*) FROM customers; END;
Transaction 1 takes the write lock on the CUSTOMER table; transaction 2 waits until transaction 1 releases the write lock.
58. Managing Concurrent Write Operations
Potential deadlock situation for concurrent write transactions:
Session A (transaction 1): BEGIN; delete 3000 rows from CUSTOMERS; COPY …; delete 5000 rows from PARTS; END;
Session B (transaction 2): BEGIN; delete 3000 rows from PARTS; COPY …; delete 5000 rows from CUSTOMERS; END;
Transaction 1 takes the write lock on the CUSTOMER table while transaction 2 takes the write lock on the PARTS table; each then waits on the lock the other holds.
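One common remedy, sketched under the scenario above: have every transaction touch the shared tables in the same order, so neither can hold one lock while waiting on the other (table names are from the slide; the predicates and file path are placeholders):

```sql
-- Both sessions write CUSTOMERS first, then PARTS: no circular wait.
BEGIN;
DELETE FROM customers WHERE ...;   -- CUSTOMER write lock acquired first
COPY parts FROM 's3://mydata/parts.txt' CREDENTIALS '...';
DELETE FROM parts WHERE ...;       -- PARTS lock acquired second, in both sessions
END;
```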
60. Goals of distribution
• Distribute data evenly for parallel processing
• Minimize data movement: co-located joins, localized aggregations
Distribution styles:
• Distribution key: same key to same location
• All: full table data on the first slice of every node
• Even: round-robin distribution
61. Choosing a distribution style
Key:
• Large fact tables
• Rapidly changing tables used in joins
• Localize columns used within aggregations
All:
• Slowly changing data
• Reasonable size (i.e., a few million but not hundreds of millions of rows)
• No common distribution key for frequent joins
• Typical use case: a joined dimension table without a common distribution key
Even:
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
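The three styles, sketched as DDL; the tables and columns are illustrative:

```sql
-- KEY: rows with the same customer_id land on the same slice,
-- enabling co-located joins on customer_id.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: a full copy on every node; suits small, slowly changing dimensions.
CREATE TABLE region (
    region_id INT,
    name      VARCHAR(64)
)
DISTSTYLE ALL;

-- EVEN: round-robin; suits large tables without an acceptable candidate key.
CREATE TABLE raw_events (
    payload VARCHAR(4096)
)
DISTSTYLE EVEN;
```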
62. Data Distribution and Distribution Keys
Example: CloudFront log records (uri, user_id, …) spread over 2 nodes / 4 slices, with the slices holding 2M, 5M, 1M, and 4M records respectively.
Poor key choices lead to uneven distribution of records…
63. Data Distribution and Distribution Keys
The same skewed layout (2M, 5M, 1M, and 4M records per slice): unevenly distributed data causes processing imbalances!
64. Data Distribution and Distribution Keys
With a better key choice, each of the four slices holds 2M records: evenly distributed data improves query performance.
66. Goals of sorting
• Physically sort data within blocks and throughout a table
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
67. Choosing a SORTKEY
COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data
INTERLEAVED
• Edge cases
• Large tables (> billion rows)
• No common filter criteria
• Non-time-series data
In general:
• Choose a column used primarily as a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose the same column as the distribution key for the most efficient joins (merge join)
68. Sort Keys – Single Column
Table is sorted by 1 column: [ SORTKEY ( date ) ]
Best for:
• Queries that use the 1st column (i.e., date) as the primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
Example (sorted by Date):
Date | Region | Country
2-JUN-2015 | Oceania | New Zealand
2-JUN-2015 | Asia | Singapore
2-JUN-2015 | Africa | Zaire
2-JUN-2015 | Asia | Hong Kong
3-JUN-2015 | Europe | Germany
3-JUN-2015 | Asia | Korea
69. Sort Keys – Compound
Table is sorted by the 1st column, then the 2nd column, etc.: [ COMPOUND SORTKEY ( date, region, country ) ]
Best for:
• Queries that use the 1st column as the primary filter, then other columns
• Can speed up joins and group bys
• Slower to VACUUM
(Same sample table as above.)
70. Sort Keys – Interleaved
Equal weight is given to each column: [ INTERLEAVED SORTKEY ( date, region, country ) ]
Best for:
• Queries that use different columns in the filter
• Queries get faster the more columns used in the filter (up to 8)
• Slowest to VACUUM
(Same sample table as above.)
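The two declarations side by side; the tables and columns mirror the example above but the names are illustrative:

```sql
-- Compound: sorted by date, then region, then country; best when
-- most filters lead with date (e.g. time-series queries).
CREATE TABLE events_compound (
    event_date DATE,
    region     VARCHAR(32),
    country    VARCHAR(32)
)
COMPOUND SORTKEY (event_date, region, country);

-- Interleaved: equal weight per column; best when filters vary
-- across date, region, and country.
CREATE TABLE events_interleaved (
    event_date DATE,
    region     VARCHAR(32),
    country    VARCHAR(32)
)
INTERLEAVED SORTKEY (event_date, region, country);
```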
71. Compressing data
• COPY automatically analyzes and compresses data when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column
• Changing column encoding requires a table rebuild
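The second bullet in practice (table name illustrative); the command samples the table and reports a suggested encoding per column:

```sql
-- Reports, per column, the recommended encoding and the estimated
-- percentage reduction versus the current encoding.
ANALYZE COMPRESSION customer;
```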
72. Automatic compression is a good thing (mostly)
From the zone maps we know which blocks contain the range and which row offsets to scan. Highly compressed SORTKEYs mean many rows per block and large row offsets. Therefore, skip compression on just the leading column of the compound SORTKEY.
73. DDL – Make columns as narrow as possible
During queries and ingestion, the system allocates buffers based on column width. Wider-than-needed columns waste memory: fewer rows fit into memory, increasing the likelihood of queries spilling to disk.
75. Backup and Restore
• Backups to Amazon S3 are automatic, continuous & incremental
• Configurable system snapshot retention period
• User-defined snapshots are on demand
• Streaming restores enable you to resume querying faster
76. Snapshots
Automated snapshots:
• Redshift enables automated snapshots by default
• Snapshots are incremental
• Redshift retains the incremental backup data required to restore the cluster using a manual or automatic snapshot, but only for clusters that haven't reached the snapshot retention period
• Amazon Redshift provides free storage for snapshots equal to the storage capacity of your cluster
Manual snapshots:
• The system will never delete a manual snapshot
• You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts
• Shared access between accounts means you can move data between dev/test/prod without reloading
77. Cross-Region Snapshots
Copy cluster snapshots to another region using the management console. You can't modify the destination region without disabling cross-region snapshots and re-enabling them with a new destination region and retention period.
78. Streaming Restore
During a restore, your node is provisioned in less than two minutes. You can query the provisioned node immediately, as data is automatically streamed from the S3 snapshot.
79. Resize – Phase 1
• Cluster is put into read-only mode
• New cluster is provisioned according to resizing needs
• Node-to-node parallel data copy
• Only charged for the source cluster
80. Resize – Phase 2
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
82. Resizing a Redshift data warehouse via the AWS Console
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
83. Resizing without Production Impact
While you resize, your cluster is read-only:
• You can query the cluster throughout the resize
• There will be a short outage at the end of the resize
84. Parameter Groups
Parameter groups let you set a variety of configuration options:
• date style, search path, and query timeout
• user activity logging
• whether the leader node should accept only SSL-protected connections
They apply to every database in a cluster. View events using the console, the SDK, or the CLI/API.
Use SET to change some properties temporarily, for example:
SET wlm_query_slot_count TO 3;
Some changes to the parameter group may require a restart to take effect.
Default parameter values: http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
85. Events
Event information is available for a period of several weeks.
Source types: cluster, parameter group, security group, snapshot.
Categories: configuration, management, monitoring, security.
Subscribe to events to receive SNS notifications.
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
86. Resource Tagging
Define custom tags for your cluster to:
• provide metadata about a resource
• categorize billing reports for cost allocation (activate the tags in the billing and cost management service)
A cluster resize, or a restore from a snapshot in the original region, preserves tags; cross-region snapshots do not.
http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
87. Concurrent Query Execution
Long-running queries may cause short-running queries to wait in a queue, so users may experience unpredictable performance.
By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max): 5 slots means 5 queries can execute concurrently, while additional queries queue until resources free up.
88. Workload Management
Workload management is about creating queues for different workloads: for example, a short-running queue for a "short" query group and a long-running queue for a "long" query group, with queries routed by user group or query group.
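A sketch of how such queues might be declared in the cluster's WLM JSON configuration (the `wlm_json_configuration` parameter); the group names and concurrency values are illustrative:

```json
[
  { "query_group": ["short"], "query_concurrency": 10 },
  { "query_group": ["long"],  "query_concurrency": 2 },
  { "query_concurrency": 5 }
]
```

The last entry, with no group, defines the default queue that catches everything else.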
90. Workload Management
Don’t set concurrency to more that you need
set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;
reset query_group;
91. Redshift Utils
Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting: https://github.com/awslabs/amazon-redshift-utils
• top_queries.sql – the top 50 most time-consuming statements in the last 7 days
• perf_alerts.sql – the top occurrences of alerts, joined with table scans
• filter_used.sql – filters applied to tables on scans, to aid in choosing a sortkey
• commit_stats.sql – consumption of cluster resources through COMMIT statements
• current_session_info.sql – sessions with currently running queries
• missing_table_stats.sql – EXPLAIN plans that flagged missing statistics on underlying tables
• queuing_queries.sql – queries that are waiting on a WLM query slot
• table_info.sql – table storage information (size, skew, etc.)
92. Redshift Utils
Redshift Admin Views provide information about user and group access, table constraints, object and view dependencies, data distribution across slices, and space used per table.
• v_check_data_distribution.sql – data distribution across slices
• v_constraint_dependency.sql – the foreign key constraints between tables
• v_generate_group_ddl.sql – the DDL for a group
• v_generate_schema_ddl.sql – the DDL for schemas
• v_generate_tbl_ddl.sql – the DDL for a table, including the distkey, sortkey, and constraints
• v_generate_unload_copy_cmd.sql – generate unload and copy commands for an object
• v_generate_user_object_permissions.sql – the DDL for a user's permissions to tables and views
• v_generate_view_ddl.sql – the DDL for a view
• v_get_obj_priv_by_user.sql – the tables/views a user has access to
• v_get_schema_priv_by_user.sql – the schemas a user has access to
• v_get_tbl_priv_by_user.sql – the tables a user has access to
• v_get_users_in_group.sql – all users in a group
• v_get_view_priv_by_user.sql – the views a user has access to
• v_object_dependency.sql – merge the different dependency views
• v_space_used_per_tbl.sql – space used per table
• v_view_dependency.sql – the names of views that depend on other tables/views
96. "With AWS services we were able to pace our initial investments and project the costs of future expansions."
"Redshift allowed us to turn data into self-service information."
– Wanderley Paiva, Database Specialist
97. The Challenge
• Scalability
• Availability
• Centralized data
• Reduced costs, preferably spread out over time