© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daniel Bento, Solutions Architect, AWS
November 2016 | São Paulo, Brazil
Workshop
Amazon Redshift
Agenda
• Introduction
• Architecture and Use Cases
• Data Loading and Migration
• Table and Schema Design
• Operations
We start with the customer… and innovate
Managing databases is painful & difficult
SQL DBs do not perform well at scale
Hadoop is difficult to deploy and manage
DWs are complex, costly, and slow
Commercial DBs are punitive & expensive
Streaming data is difficult to capture & analyze
BI Tools are expensive and hard to manage
Customers told us… We created…
✓ Amazon RDS
✓ Amazon DynamoDB
✓ Amazon EMR
✓ Amazon Redshift
✓ Amazon Aurora
✓ Amazon Kinesis
✓ Amazon QuickSight
AWS Big Data Portfolio

Collect: Kinesis; Kinesis Firehose (new); Import Export; Direct Connect; Database Migration (new)
Store: S3; Glacier; DynamoDB; RDS, Aurora
Analyze: EMR; EC2; Redshift; Machine Learning; ElasticSearch (launched); Data Pipeline; CloudSearch; QuickSight (new); SQL over Streams (new)
Global Footprint
14 Regions; 33 Availability Zones; 54 Edge Locations
Redshift
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Free Tier – 2 months
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The legacy view of data warehousing ...
Multi-year commitment
Multi-year deployments
Multi-million dollar deals
… Leads to dark data
This is a narrow view
Small companies also have big data
(mobile, social, gaming, adtech, IoT)
Long cycles, high costs, administrative
complexity all stifle innovation
[Chart: total Enterprise Data vs. Data in Warehouse]
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools,
Hadoop, Machine Learning,
Streaming
Analysis in-line with process
flows
Pay as you go, grow as you
need
Managed availability & DR
Enterprise Big Data SaaS
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester
Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed
spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in
the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4
2015
Selected Amazon Redshift Customers
NTT Docomo | Telecom FINRA | Financial Svcs Philips | Healthcare Yelp | Technology NASDAQ | Financial Svcs
The Weather Company | Media Nokia | Telecom Pinterest | Technology Foursquare | Technology Coursera | Education
Coinbase | Bitcoin Amazon | E-Commerce Etix | Entertainment Spuul | Entertainment Vivaki | Ad Tech
Z2 | Gaming Neustar | Ad Tech SoundCloud | Technology BeachMint | E-Commerce Civis | Technology
Amazon Redshift Security
Petabyte-Scale Data Warehousing Service
Amazon Redshift Architecture
Amazon Redshift Architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads,
backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
Ingestion/Backup
Backup
Restore
JDBC/ODBC
10 GigE
(HPC)
Benefit #1: Amazon Redshift is fast
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
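The delta encodings in the output above (delta, delta32k) store differences between successive values rather than the raw values. A minimal Python sketch of the idea, purely illustrative and not Redshift's actual implementation:

```python
def delta_encode(values):
    """Store the first value plus successive differences.

    Sorted, dense ID columns (like listid above) produce tiny deltas
    that fit in far fewer bits than the raw values.
    """
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ids = [1000001, 1000002, 1000003, 1000007, 1000009]
encoded = delta_encode(ids)
assert encoded == [1000001, 1, 1, 4, 2]   # small deltas after the first
assert delta_decode(encoded) == ids       # lossless round trip
```

This is why sort order and compression interact: sorted columns yield small, regular deltas.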
[Diagram: column blocks with per-block min/max zone-map entries]
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
MIN: 01-JUNE-2016
MAX: 20-JUNE-2016
MIN: 08-JUNE-2016
MAX: 30-JUNE-2016
MIN: 12-JUNE-2016
MAX: 20-JUNE-2016
MIN: 02-JUNE-2016
MAX: 25-JUNE-2016
Unsorted Table
MIN: 01-JUNE-2016
MAX: 06-JUNE-2016
MIN: 07-JUNE-2016
MAX: 12-JUNE-2016
MIN: 13-JUNE-2016
MAX: 18-JUNE-2016
MIN: 19-JUNE-2016
MAX: 24-JUNE-2016
Sorted By Date
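The sorted-vs-unsorted comparison above can be sketched as a block-pruning check: a block is read only when the predicate value can fall inside its min/max range. The block ranges mirror the slide; the code is illustrative:

```python
from datetime import date

# Per-block zone-map (min, max) entries, taken from the slide.
unsorted_blocks = [(date(2016, 6, 1),  date(2016, 6, 20)),
                   (date(2016, 6, 8),  date(2016, 6, 30)),
                   (date(2016, 6, 12), date(2016, 6, 20)),
                   (date(2016, 6, 2),  date(2016, 6, 25))]
sorted_blocks   = [(date(2016, 6, 1),  date(2016, 6, 6)),
                   (date(2016, 6, 7),  date(2016, 6, 12)),
                   (date(2016, 6, 13), date(2016, 6, 18)),
                   (date(2016, 6, 19), date(2016, 6, 24))]

def blocks_to_scan(blocks, value):
    """Indices of blocks whose [min, max] range could contain value."""
    return [i for i, (lo, hi) in enumerate(blocks) if lo <= value <= hi]

target = date(2016, 6, 9)   # WHERE DATE = '09-JUNE-2016'
assert blocks_to_scan(unsorted_blocks, target) == [0, 1, 3]  # 3 of 4 blocks read
assert blocks_to_scan(sorted_blocks, target) == [1]          # 1 of 4 blocks read
```

Sorting by the filtered column tightens each block's range, so most blocks are skipped.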
Benefit #1: Amazon Redshift is fast
Sort Keys and Zone Maps
Benefit #1: Amazon Redshift is fast
Parallel and Distributed
Query
Load
Export
Backup
Restore
Resize
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail
Benefit #1: Amazon Redshift is fast
Distribution Keys
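A hypothetical sketch of what a distribution key does: hashing the key decides which slice a row lands on, so rows with equal keys are co-located. Redshift's real hash function is internal; MD5 here is only for illustration:

```python
import hashlib

NUM_SLICES = 4  # e.g., two nodes with two slices each

def slice_for(key):
    """Map a distribution-key value to a slice deterministically."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

rows = [(1, "John Smith"), (2, "Jane Jones"), (3, "Peter Black"),
        (4, "Pat Partridge"), (5, "Sarah Cyan"), (6, "Brian Snail")]

placement = {}
for row_id, name in rows:
    placement.setdefault(slice_for(row_id), []).append(name)

# The same key value always maps to the same slice, which is what
# makes co-located joins on the distribution key possible.
assert slice_for(1) == slice_for(1)
assert all(0 <= s < NUM_SLICES for s in placement)
```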
Benefit #1: Amazon Redshift is fast
H/W optimized for I/O intensive workloads
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Example: Our new Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: same as our prior generation!
Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), price per hour for a DS2.XL single node / effective annual price per TB compressed:
On-Demand: $0.850/hour, $3,725/TB/year
1-Year Reservation: $0.500/hour, $2,190/TB/year
3-Year Reservation: $0.228/hour, $999/TB/year

DC1 (SSD), price per hour for a DC1.L single node / effective annual price per TB compressed:
On-Demand: $0.250/hour, $13,690/TB/year
1-Year Reservation: $0.161/hour, $8,795/TB/year
3-Year Reservation: $0.100/hour, $5,500/TB/year
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No up front costs
Pay as you go
Amazon Redshift lets you start small and grow big
Dense Storage (DS2.XL)
2 TB HDD, 31 GB RAM, 2 slices/4 cores
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Dense Storage (DS2.8XL)
16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
Cluster 2-128 Nodes (32 TB – 2 PB)
Note: Nodes not to scale
Benefit #2: Amazon Redshift is inexpensive
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to S3
Continuous and incremental backups
across regions
Streaming restore
Amazon S3
Amazon S3
Region 1
Region 2
Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/Region level disasters
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3
encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS
support
• Audit logging and AWS CloudTrail
integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
Benefit #5: We innovate quickly
2013 | 2014 | 2015
•Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo, Singapore, Sydney) Regions
•MANIFEST option for the COPY & UNLOAD commands
•SQL Functions: Most recent queries
•Resource-level IAM, CRC32
•Data Pipeline
•Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
•Cross-Region Snapshot Copy
•Audit features, cursor support, 500 concurrent client connections
•EIP Address for VPC Cluster
•New system views to tune table design and track WLM query queues
•Custom ODBC/JDBC drivers; Query Visualization
•Mobile Analytics auto export
•KMS for GovCloud Region; HIPAA BAA
•Interleaved sort keys
•New Dense Storage Nodes (DS2) with better RAM and CPU.
•New Reserved Storage Nodes: No, Partial & All Upfront Options
•Cross-region backups for KMS encrypted clusters
•Scalar UDFs in Python
•AVRO Ingestion; Kinesis Firehose; Database Migration Service (DMS)
•Modify Cluster Dynamically
•Tag-based permissions and BZIP2
•System Tables for query Tuning
•Dense Compute Nodes
•Gzip & Lzop; JSON, RegEx, Cursors
•EMR Data Loading & Bootstrap Action with COPY command; WLM concurrency limit
to 50; support for the ECDH cipher suites for SSL connections; FedRAMP
•Cross-region ingestion
•Free trials & price reductions in Asia Pacific
•CloudWatch Alarm for Disk Usage
•AES 128-bit encryption; UTF-16; KMS Integration
•EU (Frankfurt); GovCloud Regions
•S3 Server-side encryption support for UNLOAD
•Tagging Support for Cost-allocation
•WLM Queue-Hopping for timed-out
queries
•Append rows & Export to BZIP-2
•Lambda for Clusters in VPC; Data
Schema Conversion Support from
ML Console
•US West (N. California) Region.
2016
100+ new features added since launch
Release every two weeks
Automatic patching
Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
Amazon ML
Benefit #7: Amazon Redshift has a large ecosystem
Business Intelligence | Data Integration | Systems Integrators
Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon
Redshift
Amazon Kinesis
Machine
Learning
Data Pipeline
CloudSearch
Mobile
Analytics
Use cases
Analyzing Twitter Firehose
Amazon
Redshift
Starts at
$0.25/hour
EC2
Starts at
$0.02/hour
S3
$0.030/GB-Mo
Amazon Glacier
$0.010/GB-Mo
Amazon Kinesis
$0.015/shard 1MB/s in; 2MB/out
$0.028/million puts
Analyzing Twitter Firehose
500MM tweets/day = ~ 5,800 tweets/sec
2k/tweet is ~12MB/sec (~1TB/day)
$0.015/hour per shard, $0.028/million PUTS
Amazon Kinesis cost is $0.765/hour
Amazon Redshift cost is $0.850/hour (for a 2TB node)
S3 cost is $1.28/hour (no compression)
Total: $2.895/hour
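The Kinesis figure above can be re-derived from the stated assumptions (2 KB/tweet, $0.015/shard-hour with 1 MB/s ingress per shard, $0.028 per million PUTs); under these assumptions the result lands within a cent of the slide's $0.765/hour:

```python
import math

TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 2048            # "2k/tweet"
SHARD_PRICE_HR = 0.015            # $/shard-hour, 1 MB/s ingress each
PUT_PRICE = 0.028 / 1_000_000     # $ per PUT

# Ingress rate decides the shard count.
mb_per_sec = TWEETS_PER_DAY / 86_400 * BYTES_PER_TWEET / 1_000_000
shards = math.ceil(mb_per_sec)

shard_cost_hr = shards * SHARD_PRICE_HR
put_cost_hr = TWEETS_PER_DAY / 24 * PUT_PRICE
kinesis_hr = shard_cost_hr + put_cost_hr

assert shards == 12                      # ~11.85 MB/s rounds up to 12 shards
assert abs(kinesis_hr - 0.765) < 0.01    # matches the slide's ~$0.765/hour
```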
Data warehouses
can be
inexpensive
and
powerful
Use only the services you need
Scale only the services you need
Pay for what you use
~40% discount with 1 year commitment
~70% discounts with 3 year commitment
Data warehouses
can be
inexpensive
and
powerful
Amazon.com – Weblog analysis
Web log analysis for Amazon.com
1PB+ workload, 2TB/day, growing 67% YoY
Largest table: 400 TB
Want to understand customer behavior
Query 15 months of data (1PB) in 14 minutes
Load 5B rows in 10 minutes
21B rows joined with 10B rows – 3 days (Hive) to 2 hours
Load pipeline: 90 hours (Oracle) to 8 hours
64 clusters
800 total nodes
13PB provisioned storage
2 DBAs
Data warehouses
can be
fast
and
simple
Sushiro – Real-time streaming from IoT & analysis
Sushiro – Real-time streaming & analysis
Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift
380 stores stream live data from
Sushi plates
Inventory information combined
with consumption information
near real-time
Forecast demand by store,
minimize food waste, and
improve efficiencies
Amazon
Big data does not mean batch
Can be streamed in
Can be processed in near real time
Can be used to respond quickly to requests
You can mix and match
On premises and cloud
Custom development and managed services
Infrastructure with managed scaling, security
Data warehouses
can support
real-time data
Amazon Redshift Data Loading and Migration
Petabyte-Scale Data Warehousing Service
Data Loading Process
Data Source Extraction Transformation Loading
Amazon
Redshift
Target
Focus
Amazon Redshift Loading Data Overview
Corporate data center and AWS Cloud

Sources: logs / files and source DBs, connected over a VPN connection or AWS Direct Connect
Ingestion paths (by data volume): S3 multipart upload, AWS Import/Export, EC2 or on-premises hosts (using SSH), AWS Database Migration Service, Kinesis, AWS Lambda, AWS Data Pipeline
Staging and targets: Amazon S3, Amazon DynamoDB, Amazon EMR, and Amazon RDS feed Amazon Redshift; Amazon Glacier for archival
Uploading Files to Amazon S3
Client.txt is uploaded from the corporate data center to the mydata bucket on Amazon S3, then loaded into Amazon Redshift.

• Ensure that your data resides in the same Region as your Redshift clusters
• Split the data into multiple files (Client.txt.1 … Client.txt.4) to facilitate parallel processing
• Files should be individually compressed using GZIP or LZOP
• Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption
Loading Data From Amazon S3
Preparing Input Data Files
Uploading files to Amazon S3
Using COPY to load data from Amazon S3
Splitting Data Files
Client.txt.1 loads into Node 0, Slice 0
Client.txt.2 loads into Node 0, Slice 1
Client.txt.3 loads into Node 1, Slice 0
Client.txt.4 loads into Node 1, Slice 1
(2 XL compute nodes, two slices each)
copy customer from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';
Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
Loading – Use multiple input files to maximize
throughput
Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
Loading – Use multiple input files to maximize
throughput
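One way to produce one input file per slice before upload; a hypothetical round-robin line splitter (any scheme that yields roughly equal, line-aligned chunks works):

```python
def split_lines(lines, num_slices):
    """Distribute lines round-robin into num_slices chunks so every
    slice gets a roughly equal share of the COPY work."""
    chunks = [[] for _ in range(num_slices)]
    for i, line in enumerate(lines):
        chunks[i % num_slices].append(line)
    return chunks

lines = [f"record-{i}" for i in range(100)]
chunks = split_lines(lines, 4)   # e.g., Client.txt.1 .. Client.txt.4

assert sum(len(c) for c in chunks) == 100   # nothing lost
assert all(len(c) == 25 for c in chunks)    # even work per slice
```

Each chunk would then be written out, compressed (GZIP/LZOP), and uploaded as a separate S3 object.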
Amazon Confidential – Slides not intended for redistribution.
• Start your first migration in 10 minutes or less
• Keep your applications running during the migration
• Move data to the same engine or to a different one
AWS Database Migration
Service
Customer
Premises
Application Users
AWS
Internet
VPN
Start a replication instance
Connect to source and target databases
Select tables, schemas, or databases
Let AWS Database Migration Service
create tables, load data, and keep
them in sync
Switch applications over to the target
at your convenience
AWS DMS - Keep your apps running during the
migration
AWS
Database Migration
Service
Loading Data from an Amazon DynamoDB Table
Differences between Amazon DynamoDB and Amazon Redshift:

Table names:
• DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash) characters; case-sensitive
• Redshift: limited to 127 characters; can't contain dots or dashes; not case-sensitive; can't conflict with any Amazon Redshift reserved words

NULL:
• DynamoDB does not support the SQL concept of NULL; you must specify how Amazon Redshift interprets empty or blank attribute values in Amazon DynamoDB

Data types:
• STRING and NUMBER data types are supported; BINARY and SET data types are not
Loading Data from Amazon Elastic MapReduce
Load data from Amazon EMR in parallel using COPY
Specify Amazon EMR cluster ID and HDFS file path/name
Amazon EMR must be running until COPY completes.
copy sales from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
Loading Data using Lambda
The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
Blog post
https://blogs.aws.amazon.com/bigdata/post/
Tx24VJ6XF1JVJAA/A-Zero-Administration-
Amazon-Redshift-Database-Loader
GitHub http://github.com/awslabs/aws-
lambda-redshift-loader
Remote Loading using SSH
The Redshift COPY command can reach out to remote locations (EC2 or on-premises hosts) to load data over secure shell (SSH).
To Remote Load follow this process:
• Add cluster's public key to the
remote host's authorized keys file
• Configure remote host to accept
connections from all cluster IP
addresses
• Create manifest file in JSON format
and upload to S3 bucket
• Issue a COPY command, including
a reference to the manifest file
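The manifest referenced in the last two steps is a small JSON document listing the hosts and commands that produce the data. A sketch with placeholder endpoint, command, and username values (fields assumed from the documented SSH-manifest shape):

```python
import json

# Placeholder values: the endpoint, command, and username below are
# hypothetical and would be replaced with your own remote host details.
manifest = {
    "entries": [
        {
            "endpoint": "ec2-12-345-678-90.compute-1.amazonaws.com",
            "command": "cat /data/client.txt",   # emits the rows to load
            "mandatory": True,                    # fail COPY if host is down
            "username": "ec2-user",
        }
    ]
}

doc = json.dumps(manifest, indent=2)
assert json.loads(doc)["entries"][0]["mandatory"] is True
```

This document is what gets uploaded to the S3 bucket and referenced by the COPY command.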
Vacuuming Tables
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas
Column 0
Column 1 Column 2
Amazon Redshift serializes all of the
values of a column together.
Vacuuming Tables
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas
Column 0 Column 1 Column 2
Delete customer where column_0 = 3;
x xxx xxxxxxxxxxxxxxx
Vacuuming Tables
1,2,4 RFK,JFK,GWB 900 Columbus,800 Washington,600 Kansas
VACUUM Customer;
1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansasx xxx xxxxxxxxxxxxxxx
Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance and increases available storage.
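A toy model of why VACUUM is needed: a delete only marks values dead in the column blocks, and the rewrite reclaims the space. Illustrative only, not Redshift internals:

```python
# Sentinel standing in for a tombstoned (deleted-but-not-reclaimed) value.
TOMBSTONE = object()

# Two columns of the CUSTOMER example above.
ids = [1, 2, 3, 4]
names = ["RFK", "JFK", "LBJ", "GWB"]

# DELETE customer WHERE id = 3: mark the row dead in every column,
# but do not shrink the storage.
for col in (ids, names):
    col[2] = TOMBSTONE

def vacuum(column):
    """Rewrite a column without tombstones, reclaiming their space."""
    return [v for v in column if v is not TOMBSTONE]

assert len(ids) == 4                       # space still occupied pre-VACUUM
assert vacuum(ids) == [1, 2, 4]            # reclaimed after VACUUM
assert vacuum(names) == ["RFK", "JFK", "GWB"]
```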
Analyzing Tables
ANALYZE command
The ANALYZE command obtains a sample of rows, does some calculations, and saves the resulting column statistics. It can target the entire current database, a single table, or one or more specific columns in a single table. You do not need to analyze all columns in all tables regularly.
Analyze the columns that are frequently
used in the following:
• Sorting and grouping operations
• Joins
• Query Predicates
To maintain current statistics for tables:
• Run the ANALYZE command before
running queries
• Run the ANALYZE command against
the database routinely at the end of
every regular load or update cycle
• Run the ANALYZE command against new tables
Statistics
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE
Operations into the same table
Transaction 1
Transaction 2
CUSTOMER
copy customer from 's3://mydata/client.txt' …;
copy customer from 's3://mydata/client.txt' …;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table.
Transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Concurrent COPY/INSERT/DELETE/UPDATE
Operations into the same table
Transaction 1
Transaction 2
CUSTOMER
Begin;
Delete one row from
CUSTOMERS;
Copy…;
Select count(*) from
CUSTOMERS;
End;
Begin;
Delete one row from
CUSTOMERS;
Copy…;
Select count(*) from
CUSTOMERS;
End;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table.
Transaction 2 waits until transaction 1 releases the write lock.
Managing Concurrent Write Operations
Potential deadlock situation for
concurrent write transactions
Transaction 1
Transaction 2
CUSTOMER
Begin;
Delete 3000 rows from
CUSTOMERS;
Copy…;
Delete 5000 rows from PARTS;
End;
Begin;
Delete 3000 rows from PARTS;
Copy…;
Delete 5000 rows from
CUSTOMERS;
End;
Session A
Session B
Transaction 1 takes the write lock on the CUSTOMER table
PARTS
Transaction 2 takes the write lock on the PARTS table
Each transaction then waits on the lock the other holds: deadlock.
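The classic fix for this deadlock is to take write locks in one global order in every transaction, for example alphabetically by table name. A minimal sketch of that convention:

```python
def lock_order(tables):
    """Normalize a transaction's lock acquisitions to one global order
    (alphabetical by table name), so no two transactions can hold locks
    in opposite orders."""
    return sorted(tables)

txn1_tables = ["CUSTOMERS", "PARTS"]   # slide's transaction 1: CUSTOMERS first
txn2_tables = ["PARTS", "CUSTOMERS"]   # slide's transaction 2: PARTS first

# After normalization both sessions acquire in the same order, so one
# simply waits for the other instead of deadlocking.
assert lock_order(txn1_tables) == lock_order(txn2_tables) == ["CUSTOMERS", "PARTS"]
```

Redshift itself detects and aborts deadlocked transactions, but consistent ordering avoids the abort in the first place.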
Amazon Redshift Table and Schema Design
Goals of distribution
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations
Distribution styles:
Key: same key to same location
All: full table data on the first slice of every node
Even: round-robin distribution across slices
Choosing a distribution style
Key
• Large FACT tables
• Rapidly changing tables used
in joins
• Localize columns used within
aggregations
All
• Have slowly changing data
• Reasonable size (a few million rows, not hundreds of millions)
• No common distribution key
for frequent joins
• Typical use case: joined
dimension table without a
common distribution key
Even
• Tables not frequently joined or
aggregated
• Large tables without
acceptable candidate keys
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records
5M records
1M records
4M records
Poor key choices lead to uneven distribution of records…
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records
5M records
1M records
4M records
Unevenly distributed data causes processing imbalances!
Data Distribution and Distribution Keys
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
cloudfront
uri = /games/g1.exe
user_id=1234
…
cloudfront
uri = /imgs/ad1.png
user_id=2345
…
cloudfront
uri=/games/g10.exe
user_id=4312
…
cloudfront
uri = /img/ad_5.img
user_id=1234
…
2M records2M records 2M records 2M records
Evenly distributed data improves query performance
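Skew like the 2M/5M/1M/4M example above can be quantified as the ratio of the busiest slice's row count to the average; a small helper (the 1.6 threshold in the assertion is arbitrary, for illustration):

```python
def skew(rows_per_slice):
    """Ratio of the busiest slice to the average slice.

    1.0 means perfectly even; the higher the ratio, the longer the
    slowest slice gates every parallel query.
    """
    avg = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / avg

# The slide's uneven layout (millions of records per slice) vs. an even one.
assert skew([2, 5, 1, 4]) > 1.6    # one slice does far more than its share
assert skew([3, 3, 3, 3]) == 1.0   # all slices finish together
```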
Single Column
Compound
Interleaved
Sort Keys
Goals of sorting
• Physically sort data within blocks and throughout a table
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data
Choosing a SORTKEY
INTERLEAVED
• Edge cases
• Large tables (>billion rows)
• No common filter criteria
• Non time-series data
• Primarily as a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose same as distribution key column for most
efficient joins (merge join)
Table is sorted by 1 column
[ SORTKEY ( date ) ]
Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Single Column
• Table is sorted by 1st column, then 2nd column, etc.
[ SORTKEY COMPOUND ( date, region, country) ]
• Best for:
• Queries that use 1st column as primary filter, then other cols
• Can speed up joins and group bys
• Slower to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Compound
• Equal weight is given to each column.
[ SORTKEY INTERLEAVED ( date, region, country) ]
• Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter (up to 8)
• Slowest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Interleaved
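A compound sort key behaves like tuple ordering: rows sort by the first column, ties break on the second, and so on. A sketch using the slide's (date, region, country) columns; an interleaved key would instead weight all three columns equally:

```python
# Rows as (date, region, country) tuples, matching the slide's table.
rows = [
    ("2015-06-03", "Asia", "Korea"),
    ("2015-06-02", "Oceania", "New Zealand"),
    ("2015-06-02", "Asia", "Singapore"),
    ("2015-06-03", "Europe", "Germany"),
    ("2015-06-02", "Asia", "Hong Kong"),
]

# SORTKEY COMPOUND (date, region, country) is exactly lexicographic
# tuple ordering: date first, then region, then country.
compound_sorted = sorted(rows)

assert compound_sorted[0] == ("2015-06-02", "Asia", "Hong Kong")
assert compound_sorted[-1][0] == "2015-06-03"   # latest date sorts last
```

This is why compound keys serve leading-column filters best: a predicate on date alone maps to one contiguous run of rows, while a predicate only on country does not.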
Compressing data
• COPY automatically analyzes and compresses data
when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and
proposes optimal compression algorithms for each
column
• Changing column encoding requires a table rebuild
Automatic compression is a good thing (mostly)
• From the zone maps we know which blocks contain the range and which row offsets to scan
• Highly compressed SORTKEYs mean many rows per block and large row offsets
• Therefore, skip compression on just the leading column of the compound SORTKEY
During queries and ingestion, the system allocates buffers based on column width. Wider-than-needed columns waste memory: fewer rows fit into memory, and the likelihood of queries spilling to disk increases.
DDL – Make Columns as narrow as possible
Amazon Redshift Operations
Backup and Restore
Backups to Amazon S3 are
automatic, continuous &
incremental
Configurable system snapshot
retention period
User-defined snapshots are
on-demand
Streaming restores enable you
to resume querying faster
Redshift cluster nodes (128 GB RAM, 16 TB disk, 16 cores each)
Amazon S3
Snapshots
Automated Snapshots
Redshift enables automated snapshots by default. Snapshots are incremental. Redshift retains the incremental backup data required to restore a cluster from a manual or automatic snapshot, for clusters that haven't reached the snapshot retention period. Amazon Redshift provides free storage for snapshots equal to the storage capacity of your cluster.
Manual Snapshots
The system will never delete a manual snapshot
You can authorize access to an existing manual
snapshot for as many as 20 AWS customer
accounts
Shared access between accounts means you
can move data between dev/test/prod without
reloading
Cross-Region Snapshots
Copy cluster snapshots to another region using the management console
Can't modify the destination region without disabling cross-region snapshots and re-enabling with a new destination region and retention period
Streaming Restore
During restore, your node is provisioned in less than two minutes. You can query the provisioned node immediately, as data is automatically streamed from the S3 snapshot.
― Cluster is put into read-only mode
― New cluster is provisioned according to resizing needs
― Node-to-node parallel data copy
― Only charged for source cluster
Resize – Phase 1
Resize - Phase 2
― Automatic SQL endpoint switchover via DNS
― Decommission the source cluster
Resizing a
Redshift Data
Warehouse via
the AWS
Console
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
Resizing without Production Impact
When you resize, your cluster is read-only:
• You can query the cluster throughout resizing
• There will be a short outage at the end of the resize
Parameter Groups
Allow you to set a variety of configuration
options:
• date style, search path and query timeout
• user activity logging
• whether the leader node should accept only
SSL protected connections
Apply to every database in a cluster
View events using the console, the SDK, or the
CLI/API
Use SET to change some properties
temporarily, for example:
set wlm_query_slot_count to 3;
Some changes to the parameter group may
require a restart to take effect
Default Parameter Values
http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
Events
Event information is available for a
period of several weeks
Source types:
• Cluster
• Parameter Group
• Security Group
• Snapshot
Categories:
• Configuration
• Management
• Monitoring
• Security
Subscribe to events to receive SNS
notifications
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
Resource Tagging
Define custom tags for your cluster
• provide metadata about a resource
• categorize billing reports for cost
allocation
• Activate tags in the billing and cost
management service
A cluster resize, or a restore from a snapshot in the original region, preserves tags; cross-region snapshots do not.
http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
Concurrent Query Execution
Long-running queries may cause short-running queries to wait in a queue
• As a result users may experience unpredictable performance
By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max)
• 5 slots means 5 queries can execute concurrently; additional queries queue until resources free up
Running
Default queue
Workload Management
Workload management is about creating queues for different workloads
User Group A
Long-running queue | Short-running queue
Short
Query Group
Long
Query Group
Workload Management
Workload Management
Don't set concurrency to more than you need
set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;
reset query_group;
Redshift Utils
Script Purpose
top_queries.sql Get the top 50 most time-consuming statements in the last 7 days.
perf_alerts.sql Get the top occurrences of alerts; join with table scans.
filter_used.sql Get the filter applied to tables on scans. To aid on choosing sortkey.
commit_stats.sql Get information on consumption of cluster resources through COMMIT statements.
current_session_info.sql Get information about sessions with currently running queries.
missing_table_stats.sql Get EXPLAIN plans that flagged missing statistics on underlying tables.
queuing_queries.sql Get queries that are waiting on a WLM query slot.
table_info.sql Get table storage information (size, skew, etc.).
• Redshift Admin Scripts provide diagnostic information for tuning and
troubleshooting
• https://github.com/awslabs/amazon-redshift-utils
Redshift Utils
View Purpose
v_check_data_distribution.sql Get data distribution across slices.
v_constraint_dependency.sql Get the foreign key constraints between tables.
v_generate_group_ddl.sql Get the DDL for a group.
v_generate_schema_ddl.sql Get the DDL for schemas.
v_generate_tbl_ddl.sql Get the DDL for a table. This will contain the distkey, sortkey, and constraints.
v_generate_unload_copy_cmd.sql Generate unload and copy commands for an object.
v_generate_user_object_permissions.sql Get the DDL for a user's permissions to tables and views.
v_generate_view_ddl.sql Get the DDL for a view.
v_get_obj_priv_by_user.sql Get the table/views that a user has access to.
v_get_schema_priv_by_user.sql Get the schema that a user has access to.
v_get_tbl_priv_by_user.sql Get the tables that a user has access to.
v_get_users_in_group.sql Get all users in a group.
v_get_view_priv_by_user.sql Get the views that a user has access to.
v_object_dependency.sql Merge the different dependency views.
v_space_used_per_tbl.sql Get space used per table.
v_view_dependency.sql Get the names of the views that are dependent on other tables/views.
• Redshift Admin Views provide information about user and group access, table constraints, object and view dependencies, data distribution across slices, and space used per table
Open source tools
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin scripts
Collection of utilities for running diagnostics on your cluster
Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established
schema with data already loaded
Q&A
PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime
"With AWS services we could pace our initial investments and forecast the costs of future expansion"

"Redshift allowed us to turn data into self-service information"
- Wanderley Paiva
Database Specialist

The Challenge
• Scalability
• Availability
• Centralized data
• Reduced costs, preferably pay-as-you-go

Solution
Solution v2

Benefícios e melhores práticas no uso do Amazon Redshift

  • 9. The Amazon Redshift view of data warehousing
10x cheaper: easy to provision; higher DBA productivity
10x faster: no programming; easily leverage BI tools, Hadoop, machine learning, streaming
Analysis in-line with process flows; pay as you go, grow as you need; managed availability & DR; enterprise big data SaaS
  • 10. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. The Forrester Wave™: Enterprise Data Warehouse, Q4 2015
  • 11. Selected Amazon Redshift Customers
NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology
  • 12. Amazon Redshift Security Petabyte-Scale Data Warehousing Service Amazon Redshift Architecture
  • 13. Amazon Redshift Architecture
Leader node: simple SQL endpoint; stores metadata; optimizes the query plan; coordinates query execution
Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, and resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
(Diagram labels: ingestion, backup/restore, JDBC/ODBC client access, 10 GigE (HPC) interconnect)
  • 14. Benefit #1: Amazon Redshift is fast
Dramatically less I/O: column storage, data compression, zone maps, direct-attached storage, large data block sizes
analyze compression listing;
Table | Column | Encoding
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
(Diagram: per-block min/max values form the zone map used to skip blocks during scans.)
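The encodings that ANALYZE COMPRESSION suggests can also be declared explicitly when a table is created. A minimal sketch for a TICKIT-style listing table; the column types here are assumptions, not from the deck:

```sql
-- Suggest encodings for an existing, already-loaded table:
ANALYZE COMPRESSION listing;

-- Encodings can then be applied explicitly at table creation time:
CREATE TABLE listing_encoded (
    listid         INTEGER       ENCODE delta,
    sellerid       INTEGER       ENCODE delta32k,
    eventid        INTEGER       ENCODE delta32k,
    dateid         SMALLINT      ENCODE bytedict,
    numtickets     SMALLINT      ENCODE bytedict,
    priceperticket DECIMAL(8,2)  ENCODE delta32k,
    totalprice     DECIMAL(8,2)  ENCODE mostly32,
    listtime       TIMESTAMP     ENCODE raw
);
```

The ColumnEncodingUtility mentioned earlier automates this analyze-then-rebuild cycle for tables that already hold data.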
  • 15. Benefit #1: Amazon Redshift is fast - sort keys and zone maps
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table: block min/max ranges overlap (01-JUNE to 20-JUNE, 08-JUNE to 30-JUNE, 12-JUNE to 20-JUNE, 02-JUNE to 25-JUNE), so most blocks must be scanned.
Sorted by date: block ranges are disjoint (01-06 JUNE, 07-12 JUNE, 13-18 JUNE, 19-24 JUNE), so only the single block containing the predicate value is scanned.
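In DDL, the sort key that makes this block-skipping possible is declared on the table; the table and column names below are hypothetical:

```sql
-- A date sort key lets zone maps skip blocks outside the filtered range:
CREATE TABLE logs (
    log_date  DATE,
    message   VARCHAR(256)
)
SORTKEY (log_date);

-- With data sorted by log_date, this scan only touches blocks whose
-- min/max range includes the predicate value:
SELECT COUNT(*) FROM logs WHERE log_date = '2016-06-09';
```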
  • 16. Benefit #1: Amazon Redshift is fast - parallel and distributed query, load, export, backup, restore, and resize
  • 17. Benefit #1: Amazon Redshift is fast - distribution keys
A six-row table (1 John Smith, 2 Jane Jones, 3 Peter Black, 4 Pat Partridge, 5 Sarah Cyan, 6 Brian Snail) is spread across slices by the distribution key: one slice holds rows 1 and 4, another rows 2 and 5, another rows 3 and 6.
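The distribution choice is declared per table in DDL; the schema below is a hypothetical illustration, not from the deck:

```sql
-- DISTKEY places rows with the same key value on the same slice, so
-- joins on that column avoid redistributing data over the network:
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER DISTKEY,
    total       DECIMAL(10,2)
);

-- Small dimension tables can instead be replicated to every node:
CREATE TABLE countries (
    country_id  SMALLINT,
    name        VARCHAR(64)
)
DISTSTYLE ALL;
```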
  • 18. Benefit #1: Amazon Redshift is fast
H/W optimized for I/O-intensive workloads; choice of storage type and instance size; regular cadence of auto-patched improvements.
Example: our new Dense Storage (HDD) instance type improved memory 2x, compute 2x, disk throughput 1.5x. Cost: same as our prior generation!
  • 19. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), DS2.XL single node (price per hour / effective annual price per TB compressed):
On-Demand: $0.850 / $3,725
1 Year Reservation: $0.500 / $2,190
3 Year Reservation: $0.228 / $999
DC1 (SSD), DC1.L single node (price per hour / effective annual price per TB compressed):
On-Demand: $0.250 / $13,690
1 Year Reservation: $0.161 / $8,795
3 Year Reservation: $0.100 / $5,500
Pricing is simple: number of nodes x price/hour; no charge for the leader node; no upfront costs; pay as you go.
  • 20. Benefit #2: Amazon Redshift is inexpensive - start small and grow big
Dense Storage (DS2.XL): 2 TB HDD, 31 GB RAM, 2 slices/4 cores; single node (2 TB) or cluster of 2-32 nodes (4 TB - 64 TB)
Dense Storage (DS2.8XL): 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE; cluster of 2-128 nodes (32 TB - 2 PB)
(Note: nodes not to scale)
  • 21. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups: multiple copies within the cluster; continuous and incremental backups to S3; continuous and incremental backups across regions; streaming restore.
(Diagram: backups flow to Amazon S3 in Region 1 and are copied to Amazon S3 in Region 2.)
  • 22. Benefit #3: Amazon Redshift is fully managed
Fault tolerance covers disk failures, node failures, network failures, and Availability Zone/Region level disasters.
(Diagram: cross-region copies in Amazon S3 provide the disaster-recovery path.)
  • 23. Benefit #4: Security is built-in
• Load encrypted from S3; SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest: all blocks on disks & in Amazon S3 encrypted; block key, cluster key, master key (AES-256); on-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
(Diagram: customer VPC and internal VPC; ingestion, backup/restore, JDBC/ODBC over 10 GigE (HPC))
  • 24. Benefit #5: We innovate quickly (2013-2016)
100+ new features added since launch; release every two weeks; automatic patching. Highlights include:
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL functions: most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-Region snapshot copy
• Audit features, cursor support, 500 concurrent client connections
• EIP address for VPC clusters
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; query visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage nodes (DS2) with better RAM and CPU
• New reserved storage nodes: No, Partial & All Upfront options
• Cross-region backups for KMS-encrypted clusters
• Scalar UDFs in Python
• AVRO ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify cluster dynamically
• Tag-based permissions and BZIP2
• System tables for query tuning
• Dense Compute nodes
• Gzip & Lzop; JSON, RegEx, cursors
• EMR data loading & bootstrap action with the COPY command; WLM concurrency limit to 50; support for ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch alarm for disk usage
• AES 128-bit encryption; UTF-16; KMS integration
• EU (Frankfurt) and GovCloud Regions
• S3 server-side encryption support for UNLOAD
• Tagging support for cost allocation
• WLM queue-hopping for timed-out queries
• Append rows & export to BZIP2
• Lambda for clusters in VPC; data schema conversion support from the ML console
• US West (N. California) Region
  • 25. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning (Amazon ML)
• Data science
  • 26. Benefit #7: Amazon Redshift has a large ecosystem: data integration, systems integrators, business intelligence
  • 27. Benefit #8: Service-oriented architecture - integrates with DynamoDB, EMR, S3, EC2/SSH, RDS/Aurora, Amazon Kinesis, Machine Learning, Data Pipeline, CloudSearch, and Mobile Analytics
  • 30. Analyzing the Twitter firehose
Amazon Redshift: starts at $0.25/hour
EC2: starts at $0.02/hour
S3: $0.030/GB-month
Amazon Glacier: $0.010/GB-month
Amazon Kinesis: $0.015/shard (1 MB/s in; 2 MB/s out); $0.028/million PUTs
  • 31. Data warehouses can be inexpensive and powerful
500MM tweets/day = ~5,800 tweets/sec; at 2 KB/tweet that is ~12 MB/sec (~1 TB/day)
At $0.015/hour per shard and $0.028/million PUTs, Amazon Kinesis cost is $0.765/hour
Amazon Redshift cost is $0.850/hour (for a 2 TB node)
S3 cost is $1.28/hour (no compression)
Total: $2.895/hour
  • 32. Data warehouses can be inexpensive and powerful
Use only the services you need; scale only the services you need; pay for what you use.
~40% discount with a 1-year commitment; ~70% discount with a 3-year commitment.
  • 33. Amazon.com - weblog analysis
Web log analysis for Amazon.com: 1 PB+ workload, 2 TB/day, growing 67% YoY; largest table: 400 TB. Goal: understand customer behavior.
  • 34. Data warehouses can be fast and simple
Query 15 months of data (1 PB) in 14 minutes; load 5B rows in 10 minutes
21B rows joined with 10B rows: from 3 days (Hive) to 2 hours
Load pipeline: from 90 hours (Oracle) to 8 hours
64 clusters, 800 total nodes, 13 PB provisioned storage, 2 DBAs
  • 35. Sushiro – Real-time streaming from IoT & analysis
  • 36. Sushiro - real-time streaming & analysis
Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift. 380 stores stream live data from sushi plates; inventory information is combined with consumption information in near real time to forecast demand by store, minimize food waste, and improve efficiencies.
  • 37. Data warehouses can support real-time data
Big data does not mean batch: it can be streamed in, processed in near real time, and used to respond quickly to requests. You can mix and match on-premises and cloud, custom development and managed services, and infrastructure with managed scaling and security.
  • 38. Amazon Redshift Data Loading and Migration Petabyte-Scale Data Warehousing Service
  • 39. Data loading process: data source, extraction, transformation, loading, Amazon Redshift target. (This section focuses on the loading step.)
  • 40. Amazon Redshift loading data - overview
Sources in the corporate data center (logs/files, source DBs) reach AWS over a VPN connection or AWS Direct Connect. Within the AWS cloud, data arrives via S3 multipart upload, AWS Import/Export, EC2 or on-prem hosts (using SSH), Database Migration Service, Kinesis, AWS Lambda, or AWS Data Pipeline, and can flow from Amazon DynamoDB, Amazon S3, Amazon EMR, Amazon RDS, or Amazon Glacier into Amazon Redshift.
  • 41. Uploading files to Amazon S3
Ensure that your data resides in the same Region as your Redshift cluster
Split the data into multiple files (e.g. client.txt.1 through client.txt.4) to facilitate parallel processing
Files should be individually compressed using GZIP or LZOP
Optionally, encrypt your data using Amazon S3 server-side or client-side encryption
  • 42. Loading data from Amazon S3: preparing input data files; uploading files to Amazon S3; using COPY to load data from Amazon S3
  • 43. Splitting data files
With 2 XL compute nodes (two slices each), the four files client.txt.1 through client.txt.4 load in parallel, one per slice:
copy customer from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';
  • 44. Use the COPY command Each slice can load one file at a time A single input file means only one slice is ingesting data Instead of 100MB/s, you’re only getting 6.25MB/s Loading – Use multiple input files to maximize throughput
  • 45. Use the COPY command You need at least as many input files as you have slices With 16 input files, all slices are working so you maximize throughput Get 100MB/s per node; scale linearly as you add nodes Loading – Use multiple input files to maximize throughput
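The parallel-load pattern above can be sketched as a single COPY against a common key prefix. This is a minimal, hypothetical example — the bucket, table, and file names are illustrative; a COPY whose FROM clause is a prefix loads every matching file, one per slice.

```sql
-- client.txt was split into client.txt.1 … client.txt.4 and uploaded
-- under the same S3 prefix; COPY loads all matching files in parallel.
copy customer
from 's3://mydata/client.txt.'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip;  -- assumes each part was individually compressed with GZIP
```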
• 46. Amazon Confidential – Slides not intended for redistribution. • Start your first migration in 10 minutes or less • Keep your applications running during the migration • Move data to the same engine or to a different one AWS Database Migration Service
  • 47. Customer Premises Application Users AWS Internet VPN Start a replication instance Connect to source and target databases Select tables, schemas, or databases Let AWS Database Migration Service create tables, load data, and keep them in sync Switch applications over to the target at your convenience AWS DMS - Keep your apps running during the migration AWS Database Migration Service
• 48. Loading Data from an Amazon DynamoDB Table Differences: Table Names – Amazon DynamoDB: up to 255 characters, may contain '.' (dot) and '-' (dash) characters, case-sensitive; Amazon Redshift: limited to 127 characters, can't contain dots or dashes, NOT case-sensitive, can't conflict with any Amazon Redshift reserved words. NULL – Amazon DynamoDB does not support the SQL concept of NULL; you must specify how Amazon Redshift interprets empty or blank attribute values. Data Types – STRING and NUMBER data types are supported; BINARY and SET data types are not.
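A DynamoDB load follows the same COPY pattern; this sketch uses an illustrative table name. READRATIO limits how much of the DynamoDB table's provisioned read capacity the COPY may consume, and EMPTYASNULL is one way to tell Redshift how to treat empty attribute values.

```sql
-- Hypothetical example: load the DynamoDB table "Customer" into Redshift,
-- using at most 50% of its provisioned read throughput.
copy customer
from 'dynamodb://Customer'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
readratio 50
emptyasnull;  -- map empty DynamoDB attribute values to SQL NULL
```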
  • 49. Loading Data from Amazon Elastic MapReduce Load data from Amazon EMR in parallel using COPY Specify Amazon EMR cluster ID and HDFS file path/name Amazon EMR must be running until COPY completes. copy sales from 'emr:// j-1H7OUO3B52HI5/myoutput/part*' credentials 'aws_access_key_id=<access-key id>; aws_secret_access_key=<secret-access-key>';
  • 50. Loading Data using Lambda AWS Lambda-based Amazon Redshift loader to offer you the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain. Blog post https://blogs.aws.amazon.com/bigdata/post/ Tx24VJ6XF1JVJAA/A-Zero-Administration- Amazon-Redshift-Database-Loader GitHub http://github.com/awslabs/aws- lambda-redshift-loader
• 51. Remote Loading using SSH The Redshift COPY command can reach out to remote locations (EC2 and on-premises) to load data over a secure shell (SSH) connection To Remote Load follow this process: • Add the cluster's public key to the remote host's authorized keys file • Configure the remote host to accept connections from all cluster IP addresses • Create a manifest file in JSON format and upload it to an S3 bucket • Issue a COPY command, including a reference to the manifest file
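The last two steps above can be sketched as follows; host, path, and bucket names are hypothetical. The manifest entry names the remote endpoint and the command whose standard output COPY will ingest, and the SSH keyword tells COPY that the S3 object is a manifest rather than data.

```sql
-- Manifest (uploaded to S3 beforehand), e.g.:
--   { "entries": [ { "endpoint": "ec2-12-34-56-78.compute.amazonaws.com",
--                    "command": "cat /data/client.txt",
--                    "mandatory": true,
--                    "username": "ec2-user" } ] }
copy customer
from 's3://mydata/ssh_manifest.json'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
ssh;
```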
  • 52. Vacuuming Tables 1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING 2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE 3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE 4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINER 1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas Column 0 Column 1 Column 2 Amazon Redshift serializes all of the values of a column together.
• 53. Vacuuming Tables 1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING 2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE 3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE 4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINER 1,2,3,4 RFK,JFK,LBJ,GWB 900 Columbus,800 Washington, 700 Foxborough,600 Kansas Column 0 Column 1 Column 2 delete from customer where column_0 = 3; (deleted values are only marked for deletion, not physically removed)
• 54. Vacuuming Tables VACUUM customer; 1,2,4 RFK,JFK,GWB 900 Columbus,800 Washington,600 Kansas Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables. The VACUUM command reclaims space following deletes, which improves performance as well as increasing available storage.
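VACUUM has variants that let you reclaim space, re-sort, or both; a quick sketch (table name illustrative):

```sql
vacuum customer;              -- full vacuum: reclaims space and re-sorts rows
vacuum delete only customer;  -- reclaims deleted-row space without re-sorting
vacuum sort only customer;    -- re-sorts rows without reclaiming space
```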
• 55. Analyzing Tables ANALYZE command The entire current database A single table One or more specific columns in a single table The ANALYZE command obtains a sample of rows, does some calculations, and saves resulting column statistics. You do not need to analyze all columns in all tables regularly Analyze the columns that are frequently used in the following: • Sorting and grouping operations • Joins • Query predicates To maintain current statistics for tables: • Run the ANALYZE command before running queries • Run the ANALYZE command against the database routinely at the end of every regular load or update cycle • Run the ANALYZE command against any new tables Statistics
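The three ANALYZE scopes listed above look like this (table and column names are illustrative):

```sql
analyze;                              -- the entire current database
analyze customer;                     -- a single table
analyze customer (c_custkey, c_name); -- specific columns in a table
```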
• 56. Managing Concurrent Write Operations Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table Transaction 1 Transaction 2 CUSTOMER copy customer from 's3://mydata/client.txt'… ; copy customer from 's3://mydata/client.txt'… ; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table Transaction 2 waits until transaction 1 releases the write lock
• 57. Managing Concurrent Write Operations Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table Transaction 1 Transaction 2 CUSTOMER Begin; Delete one row from CUSTOMERS; Copy…; Select count(*) from CUSTOMERS; End; Begin; Delete one row from CUSTOMERS; Copy…; Select count(*) from CUSTOMERS; End; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table Transaction 2 waits until transaction 1 releases the write lock
• 58. Managing Concurrent Write Operations Potential deadlock situation for concurrent write transactions Transaction 1 Transaction 2 CUSTOMER Begin; Delete 3000 rows from CUSTOMERS; Copy…; Delete 5000 rows from PARTS; End; Begin; Delete 3000 rows from PARTS; Copy…; Delete 5000 rows from CUSTOMERS; End; Session A Session B Transaction 1 takes a write lock on the CUSTOMER table PARTS Transaction 2 takes a write lock on the PARTS table Wait Wait
• 59. Amazon Redshift Table and Schema Design Petabyte-Scale Data Warehousing Service
• 60. Goals of distribution • Distribute data evenly for parallel processing • Minimize data movement • Co-located joins • Localized aggregations Distribution key All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Full table data on first slice of every node Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round-robin distribution
  • 61. Choosing a distribution style Key • Large FACT tables • Rapidly changing tables used in joins • Localize columns used within aggregations All • Have slowly changing data • Reasonable size (i.e., few millions but not 100s of millions of rows) • No common distribution key for frequent joins • Typical use case: joined dimension table without a common distribution key Even • Tables not frequently joined or aggregated • Large tables without acceptable candidate keys
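The three styles above map directly onto table DDL. A minimal sketch, with hypothetical table and column names:

```sql
-- Large fact table: distribute on the column used in joins/aggregations
create table sales (
  sale_id bigint,
  cust_id int,
  amount  decimal(12,2))
diststyle key distkey (cust_id);

-- Small, slowly changing dimension: full copy on every node
create table calendar (
  cal_date date,
  holiday  boolean)
diststyle all;

-- No acceptable candidate key: round-robin distribution
create table web_logs (
  log_line varchar(400))
diststyle even;
```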
  • 62. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Poor key choices lead to uneven distribution of records…
  • 63. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Unevenly distributed data cause processing imbalances!
  • 64. Data Distribution and Distribution Keys Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records2M records 2M records 2M records Evenly distributed data improves query performance
  • 66. Goals of sorting • Physically sort data within blocks and throughout a table • Optimal SORTKEY is dependent on: • Query patterns • Data profile • Business requirements
  • 67. COMPOUND • Most common • Well-defined filter criteria • Time-series data Choosing a SORTKEY INTERLEAVED • Edge cases • Large tables (>billion rows) • No common filter criteria • Non time-series data • Primarily as a query predicate (date, identifier, …) • Optionally, choose a column frequently used for aggregates • Optionally, choose same as distribution key column for most efficient joins (merge join)
  • 68. Table is sorted by 1 column [ SORTKEY ( date ) ] Best for: • Queries that use 1st column (i.e. date) as primary filter • Can speed up joins and group bys • Quickest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Single Column
  • 69. • Table is sorted by 1st column , then 2nd column etc. [ SORTKEY COMPOUND ( date, region, country) ] • Best for: • Queries that use 1st column as primary filter, then other cols • Can speed up joins and group bys • Slower to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Compound
  • 70. • Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ] • Best for: • Queries that use different columns in filter • Queries get faster the more columns used in the filter (up to 8) • Slowest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Interleaved
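The compound and interleaved variants shown in the last three slides differ only in the SORTKEY clause; a sketch with illustrative names:

```sql
-- Compound: sorted by event_date first, then region, then country
create table events_compound (
  event_date date,
  region     varchar(20),
  country    varchar(30))
compound sortkey (event_date, region, country);

-- Interleaved: equal weight to each column, for varied filter patterns
create table events_interleaved (
  event_date date,
  region     varchar(20),
  country    varchar(30))
interleaved sortkey (event_date, region, country);
```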
  • 71. Compressing data • COPY automatically analyzes and compresses data when loading into empty tables • ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column • Changing column encoding requires a table rebuild
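Both paths described above are one statement each; the table, columns, and chosen encodings here are illustrative:

```sql
-- Ask Redshift to propose optimal encodings for an existing table
analyze compression customer;

-- Or set encodings explicitly at creation time
create table customer_v2 (
  c_custkey int         encode delta,
  c_name    varchar(30) encode lzo,
  c_city    varchar(30) encode bytedict);
```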
  • 72. Automatic compression is a good thing (mostly) • From the zone maps we know: • Which blocks contain the range • Which row offsets to scan • Highly compressed SORTKEYs: • Many rows per block • Large row offset Skip compression on just the leading column of the compound SORTKEY
• 73. During queries and ingestion, the system allocates buffers based on column width Columns that are wider than needed waste memory Fewer rows fit into memory; increased likelihood of queries spilling to disk DDL – Make columns as narrow as possible
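A quick illustration of the narrow-columns rule (types chosen for the hypothetical data, not a prescription):

```sql
-- Prefer the narrowest type that fits the data
create table clicks (
  url     varchar(400),  -- not varchar(65535)
  country char(2),       -- ISO country code, not varchar(100)
  hits    int);          -- not bigint if counts stay small
```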
• 74. Amazon Redshift Operations Petabyte-Scale Data Warehousing Service
  • 75. Backup and Restore Backups to Amazon S3 are automatic, continuous & incremental Configurable system snapshot retention period User-defined snapshots are on-demand Streaming restores enable you to resume querying faster 128GB RAM 16TB disk 16 coresRedshift Cluster Node 128GB RAM 16TB disk 16 coresRedshift Cluster Node 128GB RAM 16TB disk 16 coresRedshift Cluster Node Amazon S3
  • 76. Snapshots Automated Snapshots Redshift enables automated snapshots by default Snapshots are incremental Redshift retains incremental backup data required to restore cluster using manual snapshot or automatic snapshot Only for clusters that haven’t reached the snapshot retention period Amazon Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster. Manual Snapshots The system will never delete a manual snapshot You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts Shared access between accounts means you can move data between dev/test/prod without reloading
• 77. Cross-Region Snapshots Copy cluster snapshots to another region using the management console Can't modify the destination region without disabling cross-region snapshots and re-enabling with a new destination region and retention period
• 78. Streaming Restore During restore your node is provisioned in less than two minutes Query the provisioned node immediately as data is automatically streamed from the S3 snapshot
  • 79. ― Cluster is put into read-only mode ― New cluster is provisioned according to resizing needs ― Node-to-node parallel data copy ― Only charged for source cluster Resize – Phase 1
  • 80. Resize - Phase 2 ― Automatic SQL endpoint switchover via DNS ― Decommission the source cluster
  • 81. Resizing a Redshift Data Warehouse via the AWS Console
  • 82. Resizing a Redshift Data Warehouse via the AWS Console http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro
• 83. Resizing without Production Impact While you resize, your cluster is read-only • Query the cluster throughout resizing • There will be a short outage at the end of cluster resizing
• 84. Parameter Groups Allow you to set a variety of configuration options: • date style, search path and query timeout • user activity logging • whether the leader node should accept only SSL-protected connections Apply to every database in a cluster View events using the console, the SDK, or the CLI/API Use SET to change some properties temporarily, for example: set wlm_query_slot_count to 3; Some changes to the parameter group may require a restart to take effect Default Parameter Values http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
  • 85. Events Event information is available for a period of several weeks Source types: • Cluster • Parameter Group • Security Group • Snapshot Categories: • Configuration • Management • Monitoring • Security Subscribe to events to receive SNS notifications http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
• 86. Resource Tagging Define custom tags for your cluster • provide metadata about a resource • categorize billing reports for cost allocation • Activate tags in the billing and cost management service Cluster resize or a restore from a snapshot in the original region preserves tags, but cross-region snapshot copies do not http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
• 87. Concurrent Query Execution Long-running queries may cause short-running queries to wait in a queue • As a result users may experience unpredictable performance By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max) • 5 slots means 5 queries can execute concurrently while additional queries will queue until resources free up Running Default queue
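A session can temporarily claim more than one slot from its queue for a memory-hungry statement; a hypothetical sketch:

```sql
-- Use 3 of the queue's 5 slots for the next statement in this session
set wlm_query_slot_count to 3;
vacuum customer;                -- illustrative memory-intensive operation
set wlm_query_slot_count to 1;  -- return to the default
```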
  • 88. Workload Management Workload management is about creating queues for different workloads User Group A Short-running queueLong-running queue Short Query Group Long Query Group
• 90. Workload Management Don't set concurrency to more than you need set query_group to allqueries; select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000; reset query_group;
  • 91. Redshift Utils Script Purpose top_queries.sql Get the top 50 most time-consuming statements in the last 7 days. perf_alerts.sql Get the top occurrences of alerts; join with table scans. filter_used.sql Get the filter applied to tables on scans. To aid on choosing sortkey. commit_stats.sql Get information on consumption of cluster resources through COMMIT statements. current_session_info.sql Get information about sessions with currently running queries. missing_table_stats.sql Get EXPLAIN plans that flagged missing statistics on underlying tables. queuing_queries.sql Get queries that are waiting on a WLM query slot. table_info.sql Get table storage information (size, skew, etc.). • Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting • https://github.com/awslabs/amazon-redshift-utils
• 92. Redshift Utils View Purpose v_check_data_distribution.sql Get data distribution across slices. v_constraint_dependency.sql Get the foreign key constraints between tables. v_generate_group_ddl.sql Get the DDL for a group. v_generate_schema_ddl.sql Get the DDL for schemas. v_generate_tbl_ddl.sql Get the DDL for a table. This will contain the distkey, sortkey, and constraints. v_generate_unload_copy_cmd.sql Generate unload and copy commands for an object. v_generate_user_object_permissions.sql Get the DDL for a user's permissions to tables and views. v_generate_view_ddl.sql Get the DDL for a view. v_get_obj_priv_by_user.sql Get the tables/views that a user has access to. v_get_schema_priv_by_user.sql Get the schemas that a user has access to. v_get_tbl_priv_by_user.sql Get the tables that a user has access to. v_get_users_in_group.sql Get all users in a group. v_get_view_priv_by_user.sql Get the views that a user has access to. v_object_dependency.sql Merge the different dependency views. v_space_used_per_tbl.sql Get space used per table. v_view_dependency.sql Get the names of the views that are dependent on other tables/views. • Redshift Admin Views provide information about user and group access, various table constraints, object and view dependencies, data distribution across slices, and space used per table
  • 93. Open source tools https://github.com/awslabs/amazon-redshift-utils https://github.com/awslabs/amazon-redshift-monitoring https://github.com/awslabs/amazon-redshift-udfs Admin scripts Collection of utilities for running diagnostics on your cluster Admin views Collection of utilities for managing your cluster, generating schema DDL, etc. ColumnEncodingUtility Gives you the ability to apply optimal column encoding to an established schema with data already loaded
  • 94. Q&A
  • 95. PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime
• 96. “With AWS services we were able to pace our initial investments and project the costs of future expansions” “Redshift allowed us to turn data into self-service information” - Wanderley Paiva, Database Specialist
• 97. The Challenge • Scalability • Availability • Data centralization • Reduced costs, preferably spread over time