A quick tour in 16 slides of Amazon's Redshift clustered, massively parallel database.
Find out what differentiates it from the other database products Amazon has, including SimpleDB, DynamoDB and RDS (MySQL, SQL Server and Oracle).
Learn how it stores data on disk in a columnar format and how this relates to performance and interesting compression techniques.
Contrast Redshift with a MySQL instance and discover how the clustered architecture may help to dramatically reduce query time.
2. What is Redshift?
• Amazon product
• Data warehouse service
• Fully managed
• Fast
• Petabyte scale (1 PB == 1 billion MB)
• 1/10 the cost of a traditional DW
As the universe expands, the wavelength of radiation from objects moving away
from an observer shifts towards the red end of the electromagnetic spectrum.
Redshift is a consequence of an expanding universe.
3. Where does Redshift sit within the Amazon database product suite?
SimpleDB (non-relational)
• Query flexibility
• Web-services interface
• Smaller workloads
• 1MB response size
• 10GB hard limit
DynamoDB (non-relational, NoSQL service)
• High availability
• High scalability
• Run off SSDs
• Integrates with Redshift
• Provisioned throughput
RDS (MySQL / Oracle / SQL Server): relational database service
• High availability
• Referential integrity
• DB-dependent feature-set (Multi-AZ)
• Online Transaction Processing
• Provisioned throughput
Redshift (PostgreSQL base): data warehouse service
• High availability
• Cluster architecture
• Relational database
• Horizontal scalability: add more nodes
• Analytics
4. What differentiates Redshift from, say, a MySQL RDS instance?
• Cluster architecture
• Columnar storage
• Read optimised
• No RI (referential integrity) by design
5. What differentiates Redshift from, say, a MySQL RDS instance?
(i) Cluster architecture
a) Clients connect via existing protocols
to the Leader Node.
b) Leader node develops a query plan
and may generate and compile C++
code to be executed by the compute
nodes
c) Leader node will distribute work
across compute nodes using
Distribution Keys (more later)
d) Compute nodes receive work from
leader node and may transmit data
amongst themselves to answer the
query
e) Leader aggregates the results and
returns to client
f) Leader can distribute bulk data loads
across compute nodes (I have loaded
3 GB of raw data, gzipped to 500 MB,
on a single node in under 3 minutes)
source: http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
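Steps (c) to (e) follow a scatter-gather pattern. The sketch below is a toy model only: threads stand in for compute nodes, and the function names and sample data are hypothetical rather than anything Redshift exposes.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_partial(rows):
    """Work done on one 'compute node': a partial (sum, count) over its rows."""
    return sum(rows), len(rows)


def leader_average(node_data):
    """'Leader node': fan work out to every node, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(compute_node_partial, node_data))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical bedroom counts, already spread over three compute nodes.
node_data = [[3, 4, 5], [2, 2], [4, 4, 4, 2]]
avg = leader_average(node_data)
print(avg)
```

Each node only ever touches its own rows; the leader sees three small partial results rather than the raw data.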
6. What differentiates Redshift from, say, a MySQL RDS instance?
(ii) Column-store database
a) Relational databases tend to
store data on a tuple by
tuple basis.
b) When querying the data, the
engine needs to read more
blocks of data, discarding
much of the data just read in
order to return columns
being queried
c) A column-store stores
columns contiguously in the
same block
d) Result: the number of IO
operations involved in a
query can be significantly
reduced, dependent on the
shape of the data
source: http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
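A back-of-envelope sketch of point (d), under assumed figures: the row width, column width and row count below are made up for illustration, and compression is ignored.

```python
import math


def blocks_read_row_store(n_rows, row_bytes, block_bytes):
    """Blocks scanned when whole rows are stored together."""
    rows_per_block = block_bytes // row_bytes
    return math.ceil(n_rows / rows_per_block)


def blocks_read_column_store(n_rows, column_bytes, block_bytes):
    """Blocks scanned when a single column is stored contiguously."""
    values_per_block = block_bytes // column_bytes
    return math.ceil(n_rows / values_per_block)


# Hypothetical table: 1,000,000 rows of 100 bytes; the query needs one 4-byte column.
N_ROWS, ROW_BYTES, COL_BYTES = 1_000_000, 100, 4
BLOCK_BYTES = 1024 * 1024  # 1 MB blocks

row_blocks = blocks_read_row_store(N_ROWS, ROW_BYTES, BLOCK_BYTES)
col_blocks = blocks_read_column_store(N_ROWS, COL_BYTES, BLOCK_BYTES)
print(row_blocks, col_blocks)  # 96 blocks vs 4 blocks
```

The gap narrows as a query asks for more columns, which is the "dependent on the shape of the data" caveat.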
7. What differentiates Redshift from, say, a MySQL RDS instance?
(iii) Optimised for Read performance
a) Contrast block sizes with other databases:
i) Default MySQL installs on ext3 file-systems use 4k blocks
ii) Default NTFS partitions use 4k blocks, so SQL Server on
NTFS defaults to 4k blocks as well.
b) Redshift’s focus on Data Warehousing (and hence read
optimisation) allows them to use a 1,024KB block size
c) Under a column-store architecture each block holds the same
kind of data, so datatype-specific compression enables even
more data to be stored per block, further reducing disk space
and IO
d) Reduced disk space and IO helps improve inter-node data
sharing and replication, where compute nodes may
redistribute data based on a table’s distribution key (more
later on that)
All your blox are
belong to me!
8. What differentiates Redshift from, say, a MySQL RDS instance?
(iv) No referential integrity
Ok, sounds g… WHAT?!!
• No primary key
• No foreign key
• No index support
• No sequences
• No user defined functions
• No stored procedures
• No common table expressions
• No exotic data types – no arrays, JSON, Geospatial types, etc.
• No ‘alter column’ syntax – drop and reload
Do tell Redshift about primary, foreign keys and
column uniqueness. It won’t enforce them, but it
will use these hints to better understand queries.
9. How does Redshift locate data?
The Sort Key
• Each table can have a single Sort Key – a compound key, comprised of 1 to 400 columns from the
table
• Redshift will store data on disk in Sort Key order – so think of it as the single clustered index for
the table
• Sort keys should be selected based on how the table is used:
• Columns that are frequently used to join to other tables should be included in the sort key
• Date and timestamp columns that are used in filtering operations should be included
• Redshift stores metadata about each data block, including the min and max of each column value
– using this, Redshift can skip entire blocks when answering a query
• After data loads or inserts, the ANALYZE command should be run
• ANALYZE updates the table metadata that is used by the query planner – very important for
column-based data storage and ongoing query performance
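The min/max block metadata can be pictured with a small model. This is a sketch only: the block layout, the dictionary fields and the function names are illustrative assumptions, not Redshift's on-disk format.

```python
def build_blocks(sorted_values, block_size):
    """Split a sorted column into blocks, recording min and max per block."""
    blocks = []
    for i in range(0, len(sorted_values), block_size):
        chunk = sorted_values[i:i + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def range_scan(blocks, low, high):
    """Return the matching values and how many blocks had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # the metadata alone rules this block out
        blocks_read += 1
        matches.extend(v for v in block["values"] if low <= v <= high)
    return matches, blocks_read


# Hypothetical sort-key column holding 1..100, stored 10 values per block.
blocks = build_blocks(list(range(1, 101)), block_size=10)
matches, blocks_read = range_scan(blocks, 35, 52)
print(len(blocks), blocks_read)  # 10 blocks on disk, 3 actually read
```

Because the column is stored in sort order, the matching values sit in neighbouring blocks; on an unsorted column the min/max ranges overlap and far fewer blocks can be skipped.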
10. How does Redshift locate data?
The Distribution Key
• Redshift will distribute and replicate data between compute nodes in order to get best use of
the parallelism available in the cluster
• By default, data will be spread evenly across all compute nodes (EVEN distribution)
• A node is further broken down into slices – one slice per CPU core
• Each slice participates in the parallel execution of a job sent from the Leader node, so the
even distribution of data across the nodes is vital to ensuring consistent query
performance
• If data is denormalised and does not participate in joins, then an EVEN distribution won’t
be problematic
• Alternatively a Distribution key can be provided (KEY distribution)
• The Distribution key is important, in that it helps define which data is kept together on a
given node.
• The objective is to choose a key that helps distribute data across a node’s slices, but not
across the cluster’s nodes
• Similarly to the Sort Key, the Distribution key is defined on a per-table basis, but unlike a
Sort Key, the Distribution Key is comprised of only a single column
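A toy comparison of EVEN and KEY distribution. Assumptions are labelled loudly: the slice count, the table contents and the use of CRC32 as the hash are all stand-ins, since the slides do not say which hash Redshift uses.

```python
import zlib
from collections import defaultdict

N_SLICES = 4  # hypothetical cluster: e.g. 2 nodes x 2 slices


def slice_for_key(key, n_slices=N_SLICES):
    """KEY distribution: hash the distribution column to pick a slice.
    CRC32 is a stand-in hash chosen because it is deterministic."""
    return zlib.crc32(str(key).encode("utf-8")) % n_slices


def distribute_even(rows, n_slices=N_SLICES):
    """EVEN distribution: round-robin, ignoring row content."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return slices


def distribute_by_key(rows, key_column, n_slices=N_SLICES):
    """KEY distribution: rows sharing a key value share a slice."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for_key(row[key_column], n_slices)].append(row)
    return slices


# Two hypothetical tables that join on property_id.
states = ["NSW", "VIC", "QLD", "NSW", "VIC", "NSW", "WA", "QLD"]
properties = [{"property_id": i, "state": s} for i, s in enumerate(states)]
sales = [{"property_id": i, "price": 100_000 * (i + 1)} for i in range(8)]

even_slices = distribute_even(properties)
props_by_slice = distribute_by_key(properties, "property_id")
sales_by_slice = distribute_by_key(sales, "property_id")
```

With both tables keyed on the join column, every matching pair of rows lands on the same slice, so the join needs no data movement between nodes. EVEN distribution gives perfectly balanced slices but no such guarantee.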
11. What typical RDBMS features does Redshift have?
Features
• Transactions
• Reasonable number of windowing functions
• Rank, First, Last, Lag, Sum, Nth and so on
• Most types of relational joins
• Inner, Left, Right, Full, Cross
• Correlated sub-queries are supported, but only where
the query planner can decorrelate them for
performance (sub-queries during a join are a no-go)
• Views
• Excellent locking and concurrent write capabilities
• Thanks PostgreSQL!
• Schema management
• Identity columns (auto_increment)
DataTypes (complete list):
• SmallInt
• Integer
• Bigint
• Decimal
• Real
• Double precision
• Boolean
• Char
• Varchar
• Date
• Timestamp
12. Other features?
• Close integration with S3 and DynamoDB
• Our test instance was primed from S3:
COPY <tableName> FROM 's3://bucket/file.csv.gz'
CREDENTIALS '<aws-credentials>'
DELIMITER ',' IGNOREHEADER 1
GZIP
• COPY command is central to the import process – can load data in parallel, using what it
knows about the structure of the target table to assign work to individual compute nodes
• UNLOAD will export data from a Redshift table out to an S3 bucket
• Excellent set of database system tables that allow one to monitor pretty much everything
that’s going on:
• Loads
• Queries
• Chatter between compute nodes
• Sort and distribution keys
13. Other features (cont)?
• Column compression
• Each column can have an optionally assigned compression algorithm, including:
• BYTEDICT – essentially a key-value lookup for up to 256 values. Useful for repeating
data, such as “State” in a property record
• DELTA – stores the initial value of a column as per its data type, and then stores only
the difference between each value and the one before it. Very useful for dates
• RUNLENGTH – stores the value of a column and the number of times the value is
repeated. Useful when the data is stored consecutively – relevant for sort-key columns
• MOSTLY8/16/32 – uses traditional numeric compression, but allows for outliers
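Simplified sketches of three of these encodings, to show the idea rather than Redshift's actual byte layout. Assumptions: delta here means "difference from the previous value", and the byte-dictionary simply rejects a 257th distinct value instead of handling overflow.

```python
from itertools import groupby


def runlength_encode(values):
    """Collapse consecutive repeats into (value, repeat_count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


def runlength_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]


def delta_encode(values):
    """Keep the first value, then only the change from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


def bytedict_encode(values):
    """Replace each value with a one-byte index into a small dictionary."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            if len(dictionary) == 256:
                raise ValueError("more than 256 distinct values")
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return list(dictionary), bytes(codes)


# Hypothetical columns: a sorted "State" column and dates as day numbers.
states = ["NSW", "NSW", "NSW", "VIC", "VIC", "QLD"]
day_numbers = [15706, 15707, 15707, 15710]

print(runlength_encode(states))   # [('NSW', 3), ('VIC', 2), ('QLD', 1)]
print(delta_encode(day_numbers))  # [15706, 1, 0, 3]
```

Run-length only pays off when equal values sit next to each other, which is why it pairs well with sort-key columns.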
15. How well does Redshift perform?
We ran some rudimentary queries over a realistic data set…
“Return the current list of all valid properties within a selected list
of states and tell me what the current number of bedrooms,
bathrooms, car spaces, land size, floor size and year built is.”
…and found that Redshift
outperformed our existing database
by a factor of 2.5 – 3.5.
However, this is not a particularly useful comparison, as these
were different machines with different hardware specifications.
Correct selection of a SORT KEY in Redshift is vital.
Any filtering or joins on a non-sortkey column will
result in a (slow) table scan. In our example, this
reduced performance by 30%.