2. What is a Data Warehouse?
• Large data volumes (TB to PB)
• Queries are complex and I/O intensive
• Data typically loaded in batches
• Integrates with Business Intelligence tools for reporting and analysis
3. DW - Existing AWS landscape

Feature                        | RDS | DynamoDB | EMR/Hadoop
-------------------------------+-----+----------+-----------
Scale Out                      |     | X        | X
Fully SQL Compatible           | X   |          |
Optimized data import & export |     |          | X
Efficient Aggregates & Joins   | X   |          | ½
Local storage                  |     | X        | X
No single point of failure     |     | X        |
4. DW - Existing AWS landscape

Feature                        | RDS | DynamoDB | EMR/Hadoop | Redshift
-------------------------------+-----+----------+------------+---------
Scale Out                      |     | X        | X          | X
Fully SQL Compatible           | X   |          |            | X
Optimized data import & export |     |          | X          | X
Efficient Aggregates & Joins   | X   |          | ½          | X
Local storage                  |     | X        | X          | X
No single point of failure     |     | X        |            | X
5. Introducing Amazon Redshift
• Fully managed database service
• Built from the ground up for DW
• Secure & Reliable – Fault tolerant, automatic backup, encryption
• Fast – Scale out, specialized hardware, columnar storage
• Inexpensive – 1/10th the cost of alternatives, pay as you go
• Easy to Use – Provision & resize with a few clicks
• Compatible – JDBC/ODBC, mostly PostgreSQL compatible
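Because the endpoint is largely PostgreSQL compatible, standard SQL should run unchanged through existing JDBC/ODBC clients; a minimal sketch (the table and its columns are hypothetical):

```sql
-- Ordinary PostgreSQL-flavored SQL, submitted through any JDBC/ODBC client.
-- Table "sales" and its columns are assumed for illustration.
select state, sum(amount) as total
from sales
group by state
order by total desc
limit 10;
```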
6. Why did we call it Amazon Redshift?
Edwin Hubble (1889–1953)
7. >> How much storage is provisioned by Redshift customers?
>> How many Redshift clusters were created in the first 10 weeks?
8. Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Single node version available
[Diagram: SQL clients/BI tools connect via JDBC/ODBC to the Leader Node, which coordinates Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each) over 10 GigE (HPC) networking; ingestion, backup, and restore flow through Amazon S3.]
9. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
• With row storage, you do unnecessary I/O
• To get the total amount, you have to read everything
10. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
• With column storage, you only
read the data you need
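For the sample table above, the difference shows up in a simple aggregate; a sketch (the table name is hypothetical):

```sql
-- Columnar layout: this query touches only the Amount column's blocks.
-- A row store would have to read ID, Age, State, and Amount for every row.
select sum(amount) from sample;
```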
11. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Columnar compression saves
space & reduces I/O
• Amazon Redshift analyzes and
compresses your data
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
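The same encodings can also be declared explicitly in the DDL; a hedged sketch (column types are assumptions, encodings copied from the output above):

```sql
-- Illustrative only: types are guesses, encodings mirror the analyze output.
create table listing (
  listid         integer      encode delta,
  sellerid       integer      encode delta32k,
  eventid        integer      encode delta32k,
  dateid         smallint     encode bytedict,
  numtickets     smallint     encode bytedict,
  priceperticket decimal(8,2) encode delta32k,
  totalprice     decimal(8,2) encode mostly32,
  listtime       timestamp    encode raw
);
```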
12. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Tracks the minimum and maximum value for each block
• Skips over blocks that don’t contain the data needed for a given query
• Minimizes unnecessary I/O
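A sort key helps zone maps by keeping each block's value range narrow; a sketch (table, columns, and values are hypothetical):

```sql
-- Illustrative: sorting by dateid keeps each block's min/max dateid range
-- narrow, so a date-restricted scan can skip most blocks entirely.
create table sales (
  dateid smallint not null,
  amount decimal(8,2)
) sortkey (dateid);

-- Only blocks whose [min, max] dateid overlaps the predicate are read.
select sum(amount) from sales where dateid between 1827 and 1831;
```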
13. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Use direct-attached storage to
maximize throughput
• Hardware optimized for high
performance data processing
• Large block sizes to make the
most of each read
• Amazon Redshift manages
durability for you
14. Amazon Redshift runs on optimized hardware
HS1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed customer storage
• Optimized for I/O intensive workloads
• High disk density
• Runs in HPC - fast network
• HS1.8XL available on Amazon EC2
• Need to leverage all the nodes
16. Amazon Redshift parallelizes and distributes everything
• Load in parallel from Amazon S3
or Amazon DynamoDB
• Data automatically distributed and
sorted according to DDL
• Scales linearly with number of
nodes
[Diagram: data loads in parallel from Amazon S3/DynamoDB into the Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each).]
Parallelized operations: Query • Load • Backup/Restore • Resize
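A parallel load from Amazon S3 or DynamoDB might look like the following COPY statements (bucket, table names, and credentials are placeholders):

```sql
-- Illustrative: COPY splits the input files across compute node slices
-- and loads them in parallel.
copy listing
from 's3://mybucket/tickit/listings/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '|';

-- Parallel load from DynamoDB; READRATIO caps the consumed read capacity.
copy listing
from 'dynamodb://ListingTable'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
readratio 50;
```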
17. Amazon Redshift parallelizes and distributes everything
• Backups to Amazon S3 are
automatic, continuous and
incremental
• Configurable system snapshot
retention period
• Take user snapshots on-demand
• Streaming restores enable you to
resume querying faster
[Diagram: the Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each) stream backups to, and restores from, Amazon S3.]
18. Amazon Redshift parallelizes and distributes everything
• Resize while remaining online
• Provision a new cluster in the
background
• Copy data in parallel from node to
node
• Only charged for source cluster
[Diagram: SQL clients/BI tools stay connected to the source cluster while a new target cluster (Leader Node plus Compute Nodes, 128 GB RAM, 48 TB disk, 16 cores each) is provisioned in the background and data is copied from node to node in parallel.]
19. Amazon Redshift parallelizes and distributes everything
[Diagram: after the copy completes, SQL clients/BI tools are switched over to the new cluster's Leader Node (Compute Nodes: 128 GB RAM, 48 TB disk, 16 cores each).]
• Automatic SQL endpoint switchover
via DNS
• Decommission the source cluster
• Simple operation via AWS Console or
API