© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Extending Analytics Beyond the Data Warehouse,
ft. Warner Bros. Analytics
ANT301
Ippokratis Pandis
Principal Engineer
Amazon Redshift
Kurt Larson
Tech Director
Warner Bros. Analytics
There is more data than people think
Data grows >10x every 5 years
Data platforms need to live for 15 years
Data platforms need to scale 1,000x
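The three figures on this slide are tied together by compound growth. A minimal sketch of that arithmetic, assuming exactly 10x per 5-year period (the slide says ">10x", so this is a lower bound) and hypothetical variable names:

```python
# Compound growth: a 10x increase every 5 years, sustained over a 15-year platform lifetime.
growth_per_period = 10          # ">10x" on the slide; 10 is used as the lower bound
period_years = 5
platform_lifetime_years = 15

periods = platform_lifetime_years // period_years   # number of 5-year periods
total_growth = growth_per_period ** periods         # growth compounds per period

print(f"{periods} periods of {growth_per_period}x growth -> {total_growth:,}x")
```

Three periods of 10x growth multiply out to the 1,000x scale target.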
Evolving around Amazon S3
Amazon
Kinesis
Social Web
Sensors Devices
LOB CRM
ERP OLTP
AWS
IAM
AWS
KMS
Data
Catalog
Amazon
Athena
Amazon
EMR
Amazon
Redshift
Amazon
Elasticsearch Service
AI Services
Amazon
QuickSight
Amazon Redshift integrates seamlessly with the S3 Data Lake
Spectrum
Node 1
Spectrum
Node 2
Spectrum
Node N
S3
Leader
Node
Compute
Node 1
Compute
Node 2
Spectrum
Node 3
Leader
Node
Compute
Node 1
Compute
Node 2
Leader
Node
Compute
Node 1
Compute
Node 2
Compute
Node 3
Compute
Node 4
Redshift Cluster 1 Redshift Cluster 2 Redshift Cluster 3
Glue Catalog
or Hive
Metastore
Amazon S3
SELECT
Answering a business query
Find the 10 best selling products of German merchants in the past 1 week
SELECT o.pid, count(o.items)
FROM Orders o, Merchants m, Products p
WHERE p.pid = m.pid
AND o.pid = p.pid
AND m.country = 'DE'
AND o.date BETWEEN (getdate() + INTERVAL '-7 day')
AND getdate()
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;
10s Billions, Millions, 100s Thousands
Executing this query in Amazon Redshift
HASH JOIN
AGG
HASH JOIN
SCAN
products
SCAN + FILTER
merchants
SCAN + FILTER
orders
10s Billions
SORT + LIMIT
Leader Node
Compute
Node 1
Compute
Node 2
Compute
Node 3
Redshift Cluster
Compute
Node N
Millions 100s Thousands
S3
Leader Node
Compute
Node 1
Compute
Node 2
Compute
Node 3
Redshift Cluster
Compute
Node N
Moving the Big Data to Amazon S3
S3
Leader Node
Compute
Node 2
Compute
Node 3
Redshift Cluster
Moving the Big Data to Amazon S3 & Sizing on Compute Needs
?
Fast queries with data in Amazon S3
1. High Bandwidth
Parallelism (many small straws)
2. Reduce the amount of data to send back
Computation push-down
3. Minimize the amount of data to read
Avoid doing unnecessary work
Columnar formats & compression
4. Avoid expensive joins with Nested Data
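Item 2 above (computation push-down) can be illustrated with a toy scatter-gather aggregation. This is only a sketch of the general idea, not Redshift Spectrum's implementation: the function names, the `(pid, country)` row shape, and the data are all hypothetical.

```python
from collections import Counter

def spectrum_node_partial_count(rows):
    """Hypothetical scan-side work: apply the filter, then pre-aggregate one split."""
    counts = Counter()
    for pid, country in rows:
        if country == "DE":      # predicate evaluated next to the data
            counts[pid] += 1     # partial COUNT per product
    return counts

def cluster_final_count(partials):
    """Hypothetical cluster-side work: merge the small partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)    # Counter.update adds counts together
    return total

# Three toy splits of (pid, country) rows; every value here is made up.
splits = [
    [(1, "DE"), (2, "FR"), (1, "DE")],
    [(2, "DE"), (3, "US")],
    [(1, "DE"), (2, "DE"), (3, "DE")],
]

partials = [spectrum_node_partial_count(s) for s in splits]
final = cluster_final_count(partials)

rows_scanned = sum(len(s) for s in splits)     # rows read at the scan layer
rows_returned = sum(len(p) for p in partials)  # rows shipped back for the final merge
print(final, rows_scanned, rows_returned)
```

Only the per-split partial counts travel back to the cluster, so the volume returned depends on the number of distinct groups rather than the number of rows scanned.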
Spectrum
Node 1
Spectrum
Node 2
Spectrum
Node N
S3
Leader Node
Compute
Node 1
Compute
Node 2
Spectrum
Node 3
Redshift Spectrum Execution Layer
10s of Redshift nodes
1000s of
Spectrum
nodes
HASH JOIN
AGG
HASH JOIN
SCAN
products
SCAN + FILTER
merchants
SCAN + FILTER
orders
SORT + LIMIT
Executing this query in Redshift Spectrum
100s of Spectrum
nodes
10s of Redshift nodes
10s Billions
Leader Node
Compute
Node 1
Compute
Node 2
Millions 100s Thousands
Executing this query in Redshift Spectrum
HASH JOIN
AGG
HASH JOIN
SCAN
products
SCAN + FILTER
merchants
SCAN + FILTER
orders
SORT + LIMIT
AGG
100s of Spectrum
nodes
10s of Redshift nodes
10s Billions
Leader Node
Compute
Node 1
Compute
Node 2
Millions 100s Thousands
Partitioning Pruning – Static and Dynamic
Partitioning Scheme:
s3://Orders/ [YYYY] / [MM] / [DD] / [Country] /
SELECT country, count(*)
FROM s3.Orders o, local.Countries c
WHERE o.CountryID = c.ID
AND c.Continent = 'South America'
AND o.YYYY = 2017 AND o.MM = 12
AND o.DD = 24
GROUP BY 1;
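Plan operators: Hash Join, Agg, S3 Orders, Agg, Local Countries
Partitions to process:
Pruning          | Years | Days | Countries | Partitions
-----------------|-------|------|-----------|-----------
None             | 20    | 365  | 195       | 1,423,500
Static           | 1     | 1    | 195       | 195
Static + Dynamic | 1     | 1    | 12        | 12
Reduction: 118,625x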
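The partition counts on this slide follow from multiplying out the partitioning scheme. A small sketch of that arithmetic, reading the slide's figures as 20 years, 365 days per year, and 195 countries (the variable names are illustrative, and the 12 surviving countries are taken from the slide as given):

```python
# Partition counts for s3://Orders/[YYYY]/[MM]/[DD]/[Country]/
years, days_per_year, countries = 20, 365, 195

# No pruning: every year/day/country combination is a partition to process.
all_partitions = years * days_per_year * countries

# Static pruning: the literal predicates on YYYY, MM and DD pin a single day.
static_pruned = 1 * 1 * countries

# Dynamic pruning: the join with the filtered Countries table leaves 12 countries.
matching_countries = 12
dynamic_pruned = 1 * 1 * matching_countries

reduction = all_partitions // dynamic_pruned
print(all_partitions, static_pruned, dynamic_pruned, reduction)
```

Static pruning comes from constants in the query text, while dynamic pruning depends on the result of the join side evaluated at run time.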
Executing this query in Redshift Spectrum
HASH JOIN
AGG
HASH JOIN
SCAN
products
SCAN + FILTER
merchants
SCAN
orders
AGG
Partition Loop
SCAN + FILTER
partitions of
orders
SORT + LIMIT
10s Billions 10s Thousands
100s of Spectrum
nodes
10s of Redshift nodes
Leader Node
Compute
Node 1
Compute
Node 2
Millions 100s Thousands
EXPLAIN <query>;
* Since re:Invent 2017
Improvements in scale*
Integrate seamlessly with your data lake
Support for
DATE data type
Support for
Enhanced VPC
Routing
Improved performance of
IN-list predicate processing
in Spectrum scans
Improved performance for queries
with expressions on the partition
column of external tables
Ability to query external tables
during a resize operation
Specify the root of an S3 bucket as
the source for an existing table
Performance improvements
for Spectrum queries with
aggregations on partition columns
Support for
renaming external
table columns
Added a table property to
specify the file compression
type for external tables
Additional functionality
pushdown to Spectrum,
enhancing performance
Support for map
datatypes in Spectrum
to contain arrays
Query support for nested
data has been extended to
support arrays of arrays and
arrays of maps
Tail-latency reductions
4x Improvement in selective scans
2x Improvement in scans of small files
Multibyte character support
Integrating Amazon Redshift
seamlessly with your data lake
Unload to
Parquet
Redshift
Spectrum
Request
Accelerator
Redshift Spectrum Request Accelerator
HASH JOIN
AGG
HASH JOIN
SCAN
products
SCAN + FILTER
merchants
SCAN
orders
AGG
Partition Loop
SCAN + FILTER
partitions of
orders
SORT + LIMIT
10s Billions 10s Thousands
100s of Spectrum
nodes
10s of Redshift nodes
Leader Node
Compute
Node 1
Compute
Node 2
Millions 100s Thousands
Incremental Result Caching
Coming Soon!
UNLOAD to Parquet
The most popular open columnar format
Coming Soon!
Unload TPCH-1TB Lineitem
4-node dc2.8xlarge
[Chart: Time to UNLOAD, in seconds; four series, names not recoverable; bar labels as extracted: 34.52, 11.46, 13.82, 9.07]
[Chart: Size, in GB; four series, names not recoverable; bar labels as extracted: 152.4, 278.7, 787.7, 224.9]
Querying Parquet vs Text (NUVIAD)
[Chart: Data Scanned (GBs); two bars, labels as extracted: 135.5, 2.8]
[Charts: Run Time (Secs) for a Simple Query and a Complex Query; two bars each, labels as extracted: 44.8, 12.6 and 71.1, 43.1]
Amazon Redshift
is now
>3x faster
than 6 months ago
Normalized Queries Per Hour (QPH), as a % of Amazon Redshift 6 months ago (higher is better)
Assuming Amazon Redshift's QPH 6 months ago = 100%
MAY 2018: 100%
JUN 2018: 115%
JUL 2018: 181%
AUG 2018: 237%
SEP 2018: 284%
OCT 2018: 350%
*Since re:Invent 2017
Support for lateral
column alias reference
Improved resource
management for
memory-intensive queries
Performance improvements for
joins involving large numbers of
NULL values in a join key column
Performance improvements for
queries with intermediate subquery
results that can be distributed
Improved cluster
resize operations
Performance improvements for
queries that refer to stable
functions with constant expressions
Performance improvements
for queries operating over CHAR
and VARCHAR columns
Performance
improvements for
single-row inserts
Improvements in speed
Performance improvements for queries
with expressions on the partition
columns of external tables
Performance
improvements for
complex EXCEPT
subqueries
Doubled the
number of tables
you can create
in a cluster
Improvements
to hash join
performance
Improvements for the
COPY operation when
ingesting data from Parquet
and ORC formats
Performance improvement for
queries that refer to stable functions
over constant expressions
Performance improvements via query rewrites
that pushdown selective joins into a subquery
Performance improvements by
optimizing the data
redistribution strategy during
query planning
Thank you!
Ippokratis Pandis
ippo@amazon.com
ANALYTICS
WB Analytics
Examples of our work...
Mobile Console / PC
WB Analytics
Many teams work to publish a game...
Each team brings specialized tools…
We combine these tools with client
data to create a consistent,
actionable view of each game.
What we do...
WB Analytics
Where we started...
Server Telemetry
SQL
Challenges:
• Delayed Data
• Resource Constraints /
Scaling
• Multi-year CapEx
• SQL limited to RDBMS
24 hr
24 hr
24 hr
Popular
Column-based
RDBMS
Client Telemetry
Demographics
WB Analytics
Picking the right tools...
Integration
Tech
Insights Modeling
Client
Server
Integrations
● Enforce Schemas
● Schema Lineage
● Auto Schema Merge
Ingestion
● Maintain consistent
API(s)
● Spark: Micro-batch
● Amazon EC2
autoscaling group
● Airflow: Batch
Storage
Data Lake Query
Engine(s)
● Amazon S3: Raw data
● Amazon Redshift: fast /
modest
● Spectrum/Amazon S3:
Large-sized & multi
cluster
Reasoning
Putting it all together
Ingestion Modeling Visuals & Automation
Schema Management
Amazon Redshift Loader
Client
Events
Server
Events
API
Kafka
Schema
Storage
Batch Daily
Loads
Sales
Social
Market
…
S3 Analysis Lake
Extracts
Parquet
Analyst Services
Processing
S3 Raw Lake
Data Lake
Profile
Processing
Spark
Client
Server
Data Models
High Frequency
Consolidated
Cluster
Spectrum
Transform / Load
Spectrum
Game Cluster
r
Server
Events
EC2 ASG
WB Analytics
Our Amazon Redshift Fleet
● ~30 Clusters
● Dedicated ingest pipeline and Redshift cluster per game
● Storage:
○ Amazon Redshift - 150 TB
○ Data Lake - 1 PB
Environment
● Peak sustained: 100k events / sec both event streams
● 40 - 300 tables / game
● 3-10 minute micro-batches
● Spectrum (scanned/mo) - ↑ 1PB
Targets
WB Analytics
Customer Experience / Operational Flexibility
Amazon Redshift Wins
- Budget - Manage OpEx based on lifecycle
- Recovery - Faster resolution to data delays
- Scaling - Hours instead of weeks
- Managed - Not in the hardware business
- Modeling - More modeling done in warehouse
(enabling tools like Looker)
- Tools - Same data assets for multiple tools (Spectrum +
Amazon S3 + Parquet)
- Portability - Rapid sharing of common data assets
across Amazon Redshift clusters
WB Analytics
Tips / Observations
More Amazon Redshift Wins!
❏ Schema Merge / Evolution
❏ Data Retention Strategy
❏ Load at different frequencies
❏ Spectrum as warm storage tier
❏ Everything big in columnar format
❏ Learn Spectrum pushdown
❏ Use Glue Data Catalog
❏ Other query engines fit some use cases
❏ Compact many small files
❏ Communicate to service teams
1. Compute vs Storage
(++ with Spectrum)
2. Instance Types (++ with DC2)
3. Resize Speed (++ with Elastic Resize)
4. Storage Tiers (coming...)
5. Faster (coming...)
WB Analytics
Challenges Revisited
Challenges:
• Delayed Data
• Resource Constraints /
Scaling
• Multi-year CapEx
• SQL limited to RDBMS
WB Analytics
… and now the Chalk Talk
● Cap Amazon Redshift cost by limiting cluster growth
● Size clusters for compute not storage
● Hot/warm storage tiers
● Maintain query SLAs.
Goals
● Unload to Parquet
● Spectrum Accelerator
● Elastic Resize
Features
WB Analytics
Elastic Resize Performance
❏ Spectrum 6 node dc2.xlarge cluster @ 2 TB per node => 12 TB cluster, 50% full
❏ Scale up 2x with “Classic resize” 18-24 hrs before read/write available
❏ Scale up 2x with “Elastic Resize” 7 min!
❏ 4 min prep phase
❏ 3 min resize phase - cluster is read/write available now
❏ Post resize data copy phase ~ 30 min
❏ Scale down 2x from 12 node with “Elastic Resize” 8 min!
❏ 4 min prep phase
❏ 4 min resize phase - cluster is read/write available now.
❏ Post resize data copy phase ~90min
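The resize timings above can be turned into rough speedup factors. A minimal sketch, assuming the classic-resize window of 18 to 24 hours also applies to the scale-down case (the slide only quotes it for scaling up) and comparing time until the cluster is read/write available; the helper name is hypothetical:

```python
# Time until the cluster is read/write available, taken from the bullets above.
classic_hours = (18, 24)                                   # classic resize: 18-24 hrs
elastic_minutes = {"scale up 2x": 7, "scale down 2x": 8}   # elastic resize

def speedup(classic_hrs: float, elastic_min: float) -> float:
    """Ratio of classic resize time to elastic resize time, in the same unit."""
    return classic_hrs * 60 / elastic_min

for label, minutes in elastic_minutes.items():
    low = speedup(classic_hours[0], minutes)
    high = speedup(classic_hours[1], minutes)
    print(f"{label}: {low:.0f}x to {high:.0f}x faster")
```

With the 18-hour lower bound and the 8-minute elastic resize, the ratio works out to 135, the same size as the "~135x" figure in the recap, although the deck does not say how that figure was derived.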
WB Analytics
UNLOAD to Parquet Performance
❏ Unload 215 daily partitions to Parquet in S3
❏ 10 node dc2.8xlarge cluster => 160 slices
❏ UNLOAD … TO PARQUET … PARALLEL
❏ 99.8th percentile slice unload time = 1.3 sec
❏ Remaining 0.2% of slice unloads = ~40 sec
❏ 215 daily partitions UNLOAD … TO PARQUET ~ 44 min
❏ Same UNLOAD to delimited text ~40 min
❏ Good enough already!
WB Analytics
Spectrum Accelerator Performance
WB Analytics
Elastic Resize Recap
~135x speedup!
WB Analytics
UNLOAD to Parquet Recap
Observations
1. Queryable from other query engines -
including TIMESTAMP!
2. Small/modest parallelism unload
performance is fast - many times faster
than text
3. Highly parallel or many concurrent unloads are slower,
but within 20% of delimited text.
4. Small fraction of slower Parquet writes
are long poles
Tips
❏ Use Hive sub-directory name format.
❏ Discover (or map) partitions onto S3
data.
❏ Reassemble with UNION view.
❏ UNLOAD/COPY is faster than
INSERT… SELECT for remainder in
Amazon Redshift and some use cases
too
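The first tip, Hive sub-directory naming, means encoding each partition column as a `key=value` path segment. A minimal sketch of building such a prefix for daily partitions; the bucket name, the `year`/`month`/`day` column names, and the helper function are assumptions for illustration, not taken from the deck:

```python
from datetime import date

def hive_partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style key=value prefix for one daily partition."""
    return (
        f"{base.rstrip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )

# Hypothetical bucket and table path.
prefix = hive_partition_prefix("s3://example-bucket/events", date(2018, 11, 27))
print(prefix)
```

Because the partition values are encoded in the path itself, catalog tools that understand this layout can discover partitions from the S3 listing rather than needing each one mapped by hand.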
WB Analytics
Spectrum Accelerator Recap
Observations
1. Fast when data reduction happens
2. Varied speedup based on pushdown
predicates
3. Only happens when it’s worth it.
4. No performance regression
5. System view svl_s3requests is the key
to understanding caching
6. Speedup not yet predictable
Tips
❏ Know your query workload
❏ Ask for more predicate pushdown
❏ Track S3_scanned/returned ratio in
svl_s3requests.
❏ Look at first query execution vs. later
executions
❏ Engage support when speedup is less
than expected
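The scanned/returned ratio mentioned in the tips is a simple measure of how much data reduction happens at the Spectrum layer. A small sketch of computing it, assuming the byte counts have already been read from the system view; the deck does not list the view's columns, so the query names and numbers here are entirely hypothetical:

```python
def reduction_ratio(scanned_bytes: int, returned_bytes: int) -> float:
    """Bytes scanned in S3 per byte returned to the cluster."""
    if returned_bytes == 0:
        return float("inf")   # nothing returned: treat as unbounded reduction
    return scanned_bytes / returned_bytes

# Hypothetical measurements: (bytes scanned, bytes returned) per query.
queries = {
    "selective_filter": (50_000_000_000, 25_000_000),
    "wide_select_star": (50_000_000_000, 40_000_000_000),
}

ratios = {name: reduction_ratio(s, r) for name, (s, r) in queries.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.2f}x reduction")
```

A high ratio indicates that filters and aggregations are being applied close to the data, which matches the first observation above that speedup appears when data reduction happens.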
WB Analytics
Questions?
Kurt Larson
Technical Director
klarson@wbgames.com
We’re Hiring: https://careers.wbgames.com/
Nested Data Support
SELECT c.id, o.date
FROM spectrum.customers c, c.orders o;
id| date
--|----------------------
1 |2018-03-01 11:59:59
1 |2018-03-01 09:10:00
3 |2018-03-02 08:02:15
(3 rows)
Unnest array by
joining each array
element with its
parent row
Customer 2 is missing
Customer 1 has two rows
Support Nested Parquet, ORC, JSON, Amazon Ion
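The unnesting behavior on this slide (customer 2 missing, customer 1 appearing twice) is the same as an inner join between each row and its own array. A toy sketch of those semantics in plain Python, using made-up in-memory rows rather than Spectrum itself:

```python
# Toy rows shaped like a nested customers table; the values are illustrative.
customers = [
    {"id": 1, "orders": [{"date": "2018-03-01 11:59:59"},
                         {"date": "2018-03-01 09:10:00"}]},
    {"id": 2, "orders": []},                      # customer with no orders
    {"id": 3, "orders": [{"date": "2018-03-02 08:02:15"}]},
]

def unnest_inner(rows):
    """Pair each parent row with each element of its array (inner-join semantics)."""
    return [(c["id"], o["date"]) for c in rows for o in c["orders"]]

result = unnest_inner(customers)
for row in result:
    print(row)
```

A parent with an empty array contributes no output rows, which is why customer 2 does not appear.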
Redshift Elastic Resize (GA)
Adds
additional
nodes
to Redshift
cluster
Distributes
data
across new
configuration
Near-zero
downtime
for reads/writes
Quickly scale
for varying
workload
demands
Scale up and down in minutes
New!
Redshift
Cluster
Redshift Managed S3
JDBC/ODBC
1
2
3
Leader Node
Backup
Redshift Spectrum: Extends the Redshift data
warehouse to exabytes of data in S3 Data Lake
Redshift
query engine
Query across
Redshift and S3
Redshift
data
S3
data lake
Coming Soon!
• Unload to Parquet
• Spectrum Request Accelerator
No loading required
Scale compute and storage separately
Parquet, ORC, JSON, Avro, Grok, CSV data formats
Nested data support
Security is built-in
Compliance certifications
10 GigE (HPC)
Customer
VPC
Internal
VPC
JDBC/ODBC
Compute
Nodes
Leader
Node
End-to-end encryption
Integration with AWS Key
Management Service
Speed Scale Security
Amazon Redshift
The 4 things that matter most
Simplicity
Caching Layer
Auto-scaling resources for bursts
of user activity (Preview)
Creates
more
clusters
automatically
on-demand
Consistently
fast
performance
even with
thousands of
concurrent queries
No
advance
hydration
required
Handles
unpredictable
volumes of
concurrent users
New!
Backup
Redshift Managed S3
1
2 3
Results with Auto-Scaling Concurrency
Higher is better
99% of users will
never see a charge
for auto-scale
resources
Auto-scaling resources for bursts of user activity
Fleet telemetry on query wait times
1 2 3 4
87% of Redshift customers
don’t have significant wait
times
Remaining 13% have
bursts of activity
averaging 10 minutes at a
time
OrderID CustomerID OrderTime ShipMode
5 23 10.00 12.50
8 32 1.00 5.60
OrdersWithItems
ItemID Quantity Price
23 10.00 12.50
16 1.00 1.99
32 1.00 5.60
24 5.00 26.50
OrderItems
OrderID ItemID Quantity Price
5 23 10.00 12.50
8 32 1.00 5.60
5 16 1.00 1.99
8 24 5.00 26.50
OrderID CustomerID OrderTime ShipMode
5 23 10.00 12.50
8 32 1.00 5.60
Orders
OrderItems
The Orders table includes
OrdersWithItems
as a nested column,
which avoids the expensive
join
Avoid expensive joins with nested data
Redshift Query Editor
Query data
directly from
the AWS console
Results are instantly
visible within the console
No need to install
and setup an external
JDBC/ODBC client
Launched in October!
Spectrum
Node 1
Spectrum
Node 2
Spectrum
Node 3
Spectrum
Node …
Spectrum
Node N
Amazon S3
Exabyte-scale object storage
Leader Node
Compute
Node 1
Compute
Node 2
Compute
Node 3
Amazon
Redshift
Cluster
JDBC / ODBC
Glue Catalog
Apache Hive
Metastore
Life of a query
1. Query:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Life of a query
2. Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum.
Life of a query
3. Query plan is sent to all compute nodes.
Life of a query
4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions.
Life of a query
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
Life of a query
6. Amazon Redshift Spectrum nodes scan your S3 data.
Life of a query
7. Amazon Redshift Spectrum projects, filters, joins and aggregates.
Life of a query
8. Final aggregations and joins with local Amazon Redshift tables done in-cluster.
Life of a query
9. Result is sent back to client.
Unnesting using LEFT Joins
SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price
FROM spectrum.customers c LEFT JOIN c.orders o ON true;
id | given | family | shipdate | price
----|---------|---------|----------------------|--------
1 | John | Smith | 2018-03-01 11:59:59 | 100.5
1 | John | Smith | 2018-03-01 09:10:00 | 99.12
2 | Jenny | Doe | |
3 | Andy | Jones | 2018-03-02 08:02:15 | 13.5
(4 rows)
Customer without orders
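The LEFT JOIN form keeps parents whose array is empty and fills the child columns with nulls. A toy sketch of those semantics in plain Python; the rows are made-up data loosely modeled on the example, and the helper name is hypothetical:

```python
# Toy nested rows; values are illustrative only.
customers = [
    {"id": 1, "orders": [{"shipdate": "2018-03-01 11:59:59", "price": 100.5},
                         {"shipdate": "2018-03-01 09:10:00", "price": 99.12}]},
    {"id": 2, "orders": []},                      # customer without orders
    {"id": 3, "orders": [{"shipdate": "2018-03-02 08:02:15", "price": 13.5}]},
]

def unnest_left(rows):
    """Pair each parent with its array elements, keeping parents with empty arrays."""
    out = []
    for c in rows:
        if c["orders"]:
            out.extend((c["id"], o["shipdate"], o["price"]) for o in c["orders"])
        else:
            out.append((c["id"], None, None))     # null-padded row, as in a LEFT JOIN
    return out

result = unnest_left(customers)
for row in result:
    print(row)
```

Compared with the plain comma-style unnest, the only difference is the null-padded row for the customer with no orders.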
Aggregating nested data with subqueries
SELECT c.name.given, c.name.family,
(SELECT COUNT(*) FROM c.orders o) AS ordercount
FROM spectrum.customers c;
given | family | ordercount
--------|----------|--------------
Jenny | Doe | 0
John | Smith | 2
Andy | Jones | 1
(3 rows)
Numbers of orders for
each customer
Comparable performance with smaller footprint
[Chart: execution time (s) for queries 1 to 7, Series1 vs. Series2; series names not recoverable]
“Small data” & low-latency queries
1TB TPC-H dataset, TPC-H Q1-like
[Chart: Predicate on Partitioning Column; execution time (s), Series1 vs. Series2; series names not recoverable]
“Small data” & low-latency queries
1TB TPC-H dataset, TPC-H Q1-like
[Charts: Predicate on Partitioning Column and Predicate on Non-partitioning Column; execution time (s), Series1 to Series3; series names not recoverable]
For low-latency & frequent queries Redshift is a great option

Contenu connexe

Tendances

Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dwelephantscale
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Migrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudMigrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudAmazon Web Services
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxSwathiPonugumati
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesDeep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesAmazon Web Services
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftAmazon Web Services
 
Operationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksOperationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksDatabricks
 
SRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraSRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraAmazon Web Services
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architectureanicewick
 
Data Catalog as the Platform for Data Intelligence
Data Catalog as the Platform for Data IntelligenceData Catalog as the Platform for Data Intelligence
Data Catalog as the Platform for Data IntelligenceAlation
 
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon Web Services Korea
 

Tendances (20)

Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Migrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQLMigrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQL
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Migrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudMigrating On-Premises Databases to Cloud
Migrating On-Premises Databases to Cloud
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesDeep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
 
Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28
 
Operationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksOperationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at Starbucks
 
SRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraSRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon Aurora
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
 
Data Catalog as the Platform for Data Intelligence
Data Catalog as the Platform for Data IntelligenceData Catalog as the Platform for Data Intelligence
Data Catalog as the Platform for Data Intelligence
 
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
 

Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (ANT301) - AWS re:Invent 2018

  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (ANT301). Ippokratis Pandis, Principal Engineer, Amazon Redshift. Kurt Larson, Tech Director, Warner Bros. Analytics.
  • 3. There is more data than people think. Data grows >10x every 5 years, and data platforms need to live for 15 years, so they need to scale 1,000x (10 × 10 × 10 over three 5-year periods).
  • 4. Evolving around Amazon S3. [Diagram labels: Amazon Kinesis; Social, Web, Sensors, Devices; LOB, CRM, ERP, OLTP; AWS IAM; AWS KMS; Data Catalog; Amazon Athena; Amazon EMR; Amazon Redshift; Amazon Elasticsearch Service; AI Services; Amazon QuickSight.]
  • 5. Amazon Redshift integrates seamlessly with the S3 Data Lake. [Diagram: Redshift Clusters 1, 2, and 3, each with a leader node and two to four compute nodes; Spectrum Nodes 1 to N; Amazon S3; Glue Catalog or Hive Metastore; SELECT.]
  • 6. Answering a business query: find the 10 best-selling products of German merchants in the past week. SELECT o.pid, count(o.items) FROM Orders o, Merchants m, Products p WHERE p.pid = m.pid AND o.pid = p.pid AND m.country = 'DE' AND o.date BETWEEN (getdate() + INTERVAL '-7 day') AND getdate() GROUP BY 1 ORDER BY 2 DESC LIMIT 10; Table-size annotations on the slide: 10s of billions, millions, 100s of thousands.
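The query on this slide can be tried on toy data. The following is a minimal sketch, not the presenters' setup: it uses SQLite through Python's `sqlite3` module instead of Amazon Redshift, so `getdate()` is replaced by an explicit `:today` parameter and SQLite's `date()` function. The schema, the `order_date` column name (standing in for `o.date`), and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy schema mirroring the slide's table names.
SCHEMA = """
CREATE TABLE products  (pid INTEGER PRIMARY KEY);
CREATE TABLE merchants (pid INTEGER, country TEXT);
CREATE TABLE orders    (pid INTEGER, items INTEGER, order_date TEXT);
"""

# Same join and filter shape as the slide; DESC puts best sellers first.
TOP_PRODUCTS_SQL = """
SELECT o.pid, COUNT(o.items) AS n_orders
FROM orders o, merchants m, products p
WHERE p.pid = m.pid
  AND o.pid = p.pid
  AND m.country = 'DE'
  AND o.order_date BETWEEN date(:today, '-7 day') AND :today
GROUP BY o.pid
ORDER BY n_orders DESC, o.pid
LIMIT 10
"""


def build_demo_db():
    """Create an in-memory database with a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO products VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO merchants VALUES (?, ?)",
        [(1, "DE"), (2, "DE"), (3, "FR")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 5, "2018-11-25"),
            (1, 2, "2018-11-26"),
            (2, 1, "2018-11-24"),
            (3, 9, "2018-11-25"),  # French merchant, filtered out
            (2, 4, "2018-10-01"),  # older than one week, filtered out
        ],
    )
    return conn


def top_products(conn, today):
    """Return (pid, order count) pairs for the week ending on `today`."""
    return conn.execute(TOP_PRODUCTS_SQL, {"today": today}).fetchall()
```

Calling `top_products(build_demo_db(), "2018-11-27")` ranks product 1 ahead of product 2 and drops the French merchant's product.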
  • 7. Executing this query in Amazon Redshift. [Query plan on one Redshift cluster (leader node, compute nodes 1 to N): SCAN products, SCAN + FILTER merchants, SCAN + FILTER orders (10s of billions), HASH JOIN (×2), AGG, SORT + LIMIT. Other size labels: millions, 100s of thousands.]
  • 8. Moving the Big Data to Amazon S3. [Diagram: Redshift cluster (leader node, compute nodes 1 to N) and S3.]
  • 9. Moving the Big Data to Amazon S3 & Sizing on Compute Needs? [Diagram: Redshift cluster (leader node, compute nodes 2 and 3) and S3.]
  • 10. Fast queries with data in Amazon S3: 1. High bandwidth: parallelism (many small straws). 2. Reduce the amount of data to send back: computation push-down. 3. Minimize the amount of data to read: avoid doing unnecessary work; columnar formats & compression. 4. Avoid expensive joins with nested data.
  • 11. Redshift Spectrum Execution Layer. [Diagram: leader node and compute nodes 1 and 2 (10s of Redshift nodes); Spectrum nodes 1 to N (1000s of Spectrum nodes); S3.]
  • 12. Executing this query in Redshift Spectrum. [Query plan split across 100s of Spectrum nodes and 10s of Redshift nodes (leader node, compute nodes 1 and 2): SCAN products, SCAN + FILTER merchants, SCAN + FILTER orders (10s of billions), HASH JOIN (×2), AGG, SORT + LIMIT. Other size labels: millions, 100s of thousands.]
  • 13. Executing this query in Redshift Spectrum. [Same query plan as the previous slide with one additional AGG operator: SCAN products, SCAN + FILTER merchants, SCAN + FILTER orders (10s of billions), HASH JOIN (×2), AGG (×2), SORT + LIMIT, split across 100s of Spectrum nodes and 10s of Redshift nodes.]
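The extra AGG operator on this slide is consistent with the "computation push-down" point from slide 10: aggregate close to the data so that less is sent back. The following is a minimal sketch of that general idea (partial aggregation followed by a merge), not a description of how Redshift Spectrum is implemented. The function names, the thread pool standing in for Spectrum nodes, and the row layout `(pid, order_date)` are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_filter_aggregate(rows, start, end):
    """Work done near the data: filter one file's rows, then pre-aggregate.

    Returns a small per-file Counter instead of the matching rows.
    """
    partial = Counter()
    for pid, order_date in rows:
        if start <= order_date <= end:
            partial[pid] += 1
    return partial


def merge_partials(partials):
    """Final aggregation: sum the per-file partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


def count_orders(files, start, end, workers=4):
    """Scan every file in parallel and merge the partial aggregates."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda rows: scan_filter_aggregate(rows, start, end), files)
        )
    return merge_partials(partials)
```

Only one small counter per file crosses from the scan step to the merge step, which is the point of pushing the aggregation down.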
  • 14. Partition Pruning: Static and Dynamic. Partitioning scheme: s3://Orders/[YYYY]/[MM]/[DD]/[Country]/. SELECT country, count(*) FROM s3.Orders o, local.Countries c WHERE o.CountryID = c.ID AND c.Continent = 'South America' AND o.YYYY = 2017 AND o.MM = 12 AND o.DD = 24 GROUP BY 1; Partitions to process: without pruning, 20 × 365 × 195 = 1,423,500; with static pruning, 1 × 1 × 195 = 195; with dynamic pruning, 1 × 1 × 12 = 12, a 118,625x reduction. [Query plan: Agg over a Hash Join of S3 Orders (with its own Agg) and Local Countries.]
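The partition counts on this slide can be reproduced with a small sketch. Static pruning keeps the partitions that match constant predicates on partition columns; dynamic pruning additionally keeps only the partition values that survive the join with the filtered dimension table. The helper names, the dictionary layout, and the toy country list below are assumptions for illustration.

```python
from itertools import product


def list_partitions(years, month_days, countries):
    """Enumerate partitions of the scheme [YYYY]/[MM]/[DD]/[Country]."""
    return [
        {"yyyy": y, "mm": m, "dd": d, "country": c}
        for y, (m, d), c in product(years, month_days, countries)
    ]


def static_prune(partitions, **equals):
    """Keep partitions whose columns equal the constants in the query."""
    return [p for p in partitions if all(p[k] == v for k, v in equals.items())]


def dynamic_prune(partitions, dimension_rows, continent):
    """Keep partitions whose country survives the dimension-table filter."""
    wanted = {r["id"] for r in dimension_rows if r["continent"] == continent}
    return [p for p in partitions if p["country"] in wanted]


# Toy universe: 2 years x 2 days x 4 countries = 16 partitions.
COUNTRIES = [
    {"id": "BR", "continent": "South America"},
    {"id": "AR", "continent": "South America"},
    {"id": "DE", "continent": "Europe"},
    {"id": "FR", "continent": "Europe"},
]
ALL_PARTITIONS = list_partitions(
    years=[2016, 2017],
    month_days=[(12, 23), (12, 24)],
    countries=[c["id"] for c in COUNTRIES],
)
AFTER_STATIC = static_prune(ALL_PARTITIONS, yyyy=2017, mm=12, dd=24)
AFTER_DYNAMIC = dynamic_prune(AFTER_STATIC, COUNTRIES, "South America")

# The slide's own arithmetic, for comparison.
SLIDE_TOTAL = 20 * 365 * 195      # partitions without pruning
SLIDE_STATIC = 1 * 1 * 195        # one day, all countries
SLIDE_DYNAMIC = 1 * 1 * 12        # one day, South American countries only
SLIDE_REDUCTION = SLIDE_TOTAL // SLIDE_DYNAMIC
```

In the toy universe the counts go 16, then 4, then 2; on the slide they go 1,423,500, then 195, then 12.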
  • 15. Executing this query in Redshift Spectrum. [Query plan split across 100s of Spectrum nodes and 10s of Redshift nodes (leader node, compute nodes 1 and 2): SCAN products, SCAN + FILTER merchants, SCAN orders, Partition Loop with SCAN + FILTER partitions of orders, AGG (×2), HASH JOIN (×2), SORT + LIMIT. Size labels: 10s of billions, 10s of thousands, millions, 100s of thousands.] EXPLAIN <query>;
  • 16. Improvements in scale* / Integrate seamlessly with your data lake (*since re:Invent 2017): support for DATE data type; support for Enhanced VPC Routing; improved performance of IN-list predicate processing in Spectrum scans; improved performance for queries with expressions on the partition column of external tables; ability to query external tables during a resize operation; specify the root of an S3 bucket as the source for an existing table; performance improvements for Spectrum queries with aggregations on partition columns; support for renaming external table columns; added a table property to specify the file compression type for external tables; additional functionality pushdown to Spectrum, enhancing performance; support for map datatypes in Spectrum to contain arrays; query support for nested data extended to arrays of arrays and arrays of maps; tail-latency reductions; 4x improvement in selective scans; 2x improvement in scans of small files; multibyte character support.
  • 17. Integrating Amazon Redshift seamlessly with your data lake: Unload to Parquet; Redshift Spectrum Request Accelerator.
  • 18. Redshift Spectrum Request Accelerator (coming soon!): incremental result caching. [Query plan split across 100s of Spectrum nodes and 10s of Redshift nodes: SCAN products, SCAN + FILTER merchants, SCAN orders, Partition Loop with SCAN + FILTER partitions of orders, AGG (×2), HASH JOIN (×2), SORT + LIMIT.]
  • 19. UNLOAD to Parquet, the most popular open columnar format (coming soon!). Unload of TPCH-1TB Lineitem on a 4-node dc2.8xlarge cluster. [Chart "Time to UNLOAD" (sec), Series1 to Series4: 34.52, 11.46, 13.82, 9.07. Chart "Size" (GB), Series1 to Series4: 152.4, 278.7, 787.7, 224.9. Series labels were not preserved.]
  • 20. NUVIAD: Querying Parquet vs Text. [Chart "Data Scanned" (GB), bars 1 and 2: 135.5, 2.8. Charts "Run Time" (secs) for a simple query and a complex query, bars 1 and 2: 44.8, 12.6 and 71.1, 43.1. Bar labels were not preserved.]
  • 21. Amazon Redshift is now >3x faster than 6 months ago. Normalized queries per hour (QPH) as a % of Amazon Redshift's QPH 6 months ago (higher is better): MAY 2018 100%; JUN 2018 115%; JUL 2018 181%; AUG 2018 237%; SEP 2018 284%; OCT 2018 350%.
  • 22. Improvements in speed (since re:Invent 2017): support for lateral column alias reference; improved resource management for memory-intensive queries; performance improvements for joins involving large numbers of NULL values in a join key column; performance improvements for queries with intermediate subquery results that can be distributed; improved cluster resize operations; performance improvements for queries that refer to stable functions with constant expressions; performance improvements for queries operating over CHAR and VARCHAR columns; performance improvements for single-row inserts; performance improvements for queries with expressions on the partition columns of external tables; performance improvements for complex EXCEPT subqueries; doubled the number of tables you can create in a cluster; improvements to hash join performance; improvements for the COPY operation when ingesting data from Parquet and ORC formats; performance improvements via query rewrites that push down selective joins into a subquery; performance improvements by optimizing the data redistribution strategy during query planning.
  • 23. Thank you! Ippokratis Pandis, ippo@amazon.com
  • 25. WB Analytics: Examples of our work... Mobile; Console / PC.
  • 26. WB Analytics: What we do... Many teams work to publish a game, and each team brings specialized tools. We combine these tools with client data to create a consistent, actionable view of each game.
  • 27. WB Analytics: Where we started... [Diagram: Server Telemetry, Client Telemetry, and Demographics, each with a 24 hr label, feeding a popular column-based RDBMS queried with SQL.] Challenges: delayed data; resource constraints / scaling; multi-year CapEx; SQL limited to RDBMS.
  • 28. WB Analytics: Picking the right tools... [Slide labels: Integration, Tech, Insights, Modeling, Client, Server, Reasoning.] Integrations: enforce schemas; schema lineage; auto schema merge. Ingestion: maintain consistent API(s); Spark: micro-batch; Amazon EC2 Auto Scaling group; Airflow: batch. Storage, Data Lake, Query Engine(s): Amazon S3: raw data; Amazon Redshift: fast / modest; Spectrum/Amazon S3: large-sized & multi-cluster.
  • 29. Putting it all together. [Architecture diagram; labels: Ingestion, Modeling, Visuals & Automation, Schema Management, Amazon Redshift Loader, Client Events, Server Events, API, Kafka, Schema Storage, Batch Daily Loads (Sales, Social, Market, …), S3 Analysis Lake, Extracts, Parquet, Analyst Services, Processing, S3 Raw Lake, Data Lake, Profile Processing, Spark, Client and Server Data Models, High Frequency, Consolidated Cluster, Spectrum, Transform / Load, Game Cluster, EC2 ASG.]
  • 30. WB Analytics: Our Amazon Redshift Fleet. Environment: ~30 clusters; dedicated ingest pipeline and Redshift cluster per game; storage: Amazon Redshift 150 TB, data lake 1 PB. Targets: peak sustained 100k events / sec, both event streams; 40 - 300 tables / game; 3-10 minute micro-batches; Spectrum (scanned/mo): ↑ 1 PB.
  • 31. WB Analytics: Amazon Redshift Wins. Operational flexibility: Budget (manage OpEx based on lifecycle); Recovery (faster resolution to data delays); Scaling (hours instead of weeks); Managed (not in the hardware business). Customer experience: Modeling (more modeling done in the warehouse, enabling tools like Looker); Tools (same data assets for multiple tools: Spectrum + Amazon S3 + Parquet); Portability (rapid sharing of common data assets across Amazon Redshift clusters).
  • 32. WB Analytics: More Amazon Redshift Wins! Observations: 1. Compute vs storage (++ with Spectrum). 2. Instance types (++ with DC2). 3. Resize speed (++ with Elastic Resize). 4. Storage tiers (coming...). 5. Faster (coming...). Tips: schema merge / evolution; data retention strategy; load at different frequencies; Spectrum as warm storage tier; everything big in columnar format; learn Spectrum pushdown; use Glue Data Catalog; other query engines fit some use cases; compact many small files; communicate to service teams.
  • 33. WB Analytics: Challenges Revisited. Challenges: delayed data; resource constraints / scaling; multi-year CapEx; SQL limited to RDBMS.
  • 34. WB Analytics: … and now the Chalk Talk. Goals: cap Amazon Redshift cost by limiting cluster growth; size clusters for compute, not storage; hot/warm storage tiers; maintain query SLAs. Features: Unload to Parquet; Spectrum Accelerator; Elastic Resize.
  • 35. WB Analytics: Elastic Resize Performance. Test cluster: Spectrum 6-node dc2.xlarge cluster @ 2 TB per node => 12 TB cluster, 50% full. Scale up 2x with “Classic resize”: 18-24 hrs before read/write available. Scale up 2x with “Elastic Resize”: 7 min! (4 min prep phase + 3 min resize phase, after which the cluster is read/write available; post-resize data copy phase ~30 min.) Scale down 2x from 12 nodes with “Elastic Resize”: 8 min! (4 min prep phase + 4 min resize phase, after which the cluster is read/write available; post-resize data copy phase ~90 min.)
  • 36. WB Analytics UNLOAD to Parquet Performance ❏ Unload 215 daily partitions to Parquet in S3 ❏ 10 node dc2.8xlarge cluster => 160 slices ❏ UNLOAD … TO PARQUET … PARALLEL ❏ 99.8 percentile slice unload time = 1.3 sec ❏ Remaining 0.2% of slices: unload time = ~ 40 sec ❏ 215 daily partitions UNLOAD … TO PARQUET ~ 44 min ❏ Same UNLOAD to delimited text ~ 40 min ❏ Good enough already!
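The per-slice timings on this slide can be tied to the overall ~44 min figure with a simple model. This is only a sketch under assumptions that are not from the talk: slices behave independently, a partition finishes when its slowest slice finishes, partitions are unloaded one after another, and the 1.3 sec percentile figure stands in for every fast slice.

```python
# Rough "long pole" model built from the slide's numbers.
# Assumptions (NOT from the talk): independent slices, a partition takes as
# long as its slowest slice, and partitions are unloaded sequentially.
SLICES = 160            # 10 node dc2.8xlarge cluster
FAST_SEC = 1.3          # 99.8 percentile slice unload time
SLOW_SEC = 40.0         # unload time of the remaining slow slices
SLOW_FRACTION = 0.002   # the remaining 0.2%
PARTITIONS = 215        # daily partitions unloaded

# Probability that at least one of the 160 slices hits the slow path.
p_any_slow = 1 - (1 - SLOW_FRACTION) ** SLICES

# Expected wall-clock time for one partition under the model.
expected_partition_sec = (1 - p_any_slow) * FAST_SEC + p_any_slow * SLOW_SEC

estimated_total_min = PARTITIONS * expected_partition_sec / 60

print(f"P(at least one slow slice per partition) = {p_any_slow:.2f}")
print(f"Estimated total unload time = {estimated_total_min:.0f} min")
```

With these inputs the model lands at about 43 minutes, close to the ~44 min reported, which is consistent with a small fraction of slow Parquet writes dominating the total.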
  • 38. WB Analytics Elastic Resize Recap ~135x speedup!
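One way to arrive at a figure like ~135x is to compare the time until the cluster is read/write available, using the timings from the Elastic Resize Performance slide. The slide does not show the calculation, so the pairing of values below is an assumption.

```python
# Time until read/write is available, from the Elastic Resize Performance slide.
CLASSIC_HOURS = (18, 24)     # "Classic resize": 18-24 hrs
ELASTIC_MINUTES = (7, 8)     # "Elastic Resize": 7 min (up), 8 min (down)


def speedup(classic_hours: float, elastic_minutes: float) -> float:
    """Ratio of classic resize time to elastic resize time."""
    return classic_hours * 60 / elastic_minutes


# Assumed pairing: fastest classic run against slowest elastic run, and vice versa.
conservative = speedup(min(CLASSIC_HOURS), max(ELASTIC_MINUTES))
optimistic = speedup(max(CLASSIC_HOURS), min(ELASTIC_MINUTES))
print(f"{conservative:.0f}x to {optimistic:.0f}x")
```

The conservative pairing (18 hrs against 8 min) gives 135x. The post-resize data copy phase is excluded because the cluster is already available for reads and writes during it.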
  • 39. WB Analytics UNLOAD to Parquet Recap Observations 1. Queryable from other query engines - including TIMESTAMP! 2. Small/modest parallelism unload performance is fast - many times faster than text 3. Highly parallel or many unloads are slower, but within 20% of delimited text 4. A small fraction of slower Parquet writes are the long poles Tips ❏ Use Hive sub-directory name format. ❏ Discover (or map) partitions onto S3 data. ❏ Reassemble with UNION view. ❏ UNLOAD/COPY is faster than INSERT… SELECT for remainder in Amazon Redshift and some use cases too
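The first two tips (Hive sub-directory names, then mapping partitions onto the S3 data) can be sketched as a small generator. The bucket, table, and partition key names are hypothetical, and the emitted `ALTER TABLE … ADD IF NOT EXISTS PARTITION … LOCATION` statements are an assumption about how the partitions would be registered on an external table; the slides do not show the exact statements used.

```python
from datetime import date, timedelta


def hive_partition_prefix(base_prefix: str, key: str, value: str) -> str:
    """Return a Hive-style 'key=value/' prefix under base_prefix."""
    return f"{base_prefix.rstrip('/')}/{key}={value}/"


def add_partition_sql(table: str, key: str, value: str, location: str) -> str:
    """Build a statement that maps one partition onto an S3 location.

    Illustration only: values are interpolated without escaping.
    """
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({key}='{value}') LOCATION '{location}';"
    )


# Hypothetical names, not taken from the talk.
BASE = "s3://example-bucket/events"
TABLE = "spectrum.events"

start = date(2018, 11, 1)
for offset in range(3):
    day = (start + timedelta(days=offset)).isoformat()
    location = hive_partition_prefix(BASE, "dt", day)
    print(add_partition_sql(TABLE, "dt", day, location))
```

Keeping the `key=value` directory layout means the same S3 data can also be discovered by a crawler or by other engines that understand Hive partitioning.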
  • 40. WB Analytics Spectrum Accelerator Recap Observations 1. Fast when data reduction happens 2. Varied speedup based on pushdown predicates 3. Only happens when it’s worth it. 4. No performance regression 5. System view svl_s3requests is the key to understanding caching 6. Speedup not yet predictable Tips ❏ Know your query workload ❏ Ask for more predicate pushdown ❏ Track S3_scanned/returned ratio in svl_s3requests. ❏ Look at first query execution vs. later executions ❏ Engage support when speedup is less than expected
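The scanned/returned ratio mentioned in the tips is a simple quotient: the higher it is, the more data reduction happened before results reached the cluster. The sketch below only shows the arithmetic on hypothetical byte counts; it does not query any system view and makes no assumption about that view's column names.

```python
def reduction_ratio(scanned_bytes: int, returned_bytes: int) -> float:
    """Bytes scanned in S3 per byte returned to the cluster."""
    if returned_bytes <= 0:
        raise ValueError("returned_bytes must be positive")
    return scanned_bytes / returned_bytes


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical measurements for two query shapes (not from the talk).
samples = {
    "selective_filter": (50 * GIB, 200 * MIB),
    "wide_select": (50 * GIB, 45 * GIB),
}

for name, (scanned, returned) in samples.items():
    print(f"{name}: {reduction_ratio(scanned, returned):.1f}x reduction")
```

Tracking this ratio per query over time makes it easier to spot workloads where little reduction happens and a speedup is therefore less likely.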
  • 41. WB Analytics Questions? Kurt Larson Technical Director klarson@wbgames.com We’re Hiring: https://careers.wbgames.com/
  • 43. Nested Data Support. Unnest array by joining each array element with its parent row:
      SELECT c.id, o.date
      FROM spectrum.customers c, c.orders o;

      id | date
      ---|--------------------
      1  | 2018-03-01 11:59:59
      1  | 2018-03-01 09:10:00
      3  | 2018-03-02 08:02:15
      (3 rows)
    Customer 1 has two rows; Customer 2 is missing. Supports nested Parquet, ORC, JSON, Amazon Ion.
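The semantics of this unnesting join can be mimicked in plain Python. This is only an illustration of the row-level behavior, not of how Redshift Spectrum executes the query; the customer records are reconstructed from the result shown on the slide.

```python
# Customers with a nested "orders" array, reconstructed from the slide's result.
customers = [
    {"id": 1, "orders": [{"date": "2018-03-01 11:59:59"},
                         {"date": "2018-03-01 09:10:00"}]},
    {"id": 2, "orders": []},
    {"id": 3, "orders": [{"date": "2018-03-02 08:02:15"}]},
]


def unnest_inner(rows):
    """Pair each parent row with each element of its nested array.

    Parents with an empty array produce no output rows, like an inner join.
    """
    return [(c["id"], o["date"]) for c in rows for o in c["orders"]]


for row in unnest_inner(customers):
    print(row)
```

Customer 2 disappears from the output because there is no array element to pair it with, which matches the "Customer 2 is missing" callout.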
  • 44. Redshift Elastic Resize (GA) New! ● Adds additional nodes to Redshift cluster ● Distributes data across new configuration ● Near-zero downtime for reads/writes ● Quickly scale for varying workload demands ● Scale up and down in minutes [Diagram: JDBC/ODBC client, Redshift cluster with leader node, backup to Redshift-managed S3]
  • 45. Redshift Spectrum: Extends the Redshift data warehouse to exabytes of data in S3 Data Lake ● Redshift query engine: query across Redshift data and the S3 data lake ● No loading required ● Scale compute and storage separately ● Parquet, ORC, JSON, Avro, Grok, CSV data formats ● Nested data support. Coming Soon! ● Unload to Parquet ● Spectrum Request Accelerator
  • 46. Security is built-in ● Compliance certifications ● End-to-end encryption ● Integration with AWS Key Management Service [Diagram: JDBC/ODBC client in Customer VPC; leader node and compute nodes in Internal VPC, connected by 10 GigE (HPC)]
  • 47. Amazon Redshift: The 4 things that matter most: Speed, Scale, Security, Simplicity
  • 48. Auto-scaling resources for bursts of user activity (Preview) New! ● Creates more clusters automatically on-demand ● Consistently fast performance even with thousands of concurrent queries ● No advance hydration required ● Handles unpredictable volumes of concurrent users [Diagram: caching layer, backup, Redshift-managed S3]
  • 49. Auto-scaling resources for bursts of user activity: Results with Auto-Scaling Concurrency (higher is better). 99% of users will never see a charge for auto-scale resources.
  • 50. Fleet telemetry on query wait times: 87% of Redshift customers don’t have significant wait times. The remaining 13% have bursts of activity averaging 10 minutes at a time.
  • 51. Avoid expensive joins with nested data. The Orders table includes the OrderItems as a nested column (OrdersWithItems), avoiding the expensive join.
      Orders
      OrderID | CustomerID | OrderTime | ShipMode
      5       | 23         | 10.00     | 12.50
      8       | 32         | 1.00      | 5.60

      OrderItems
      OrderID | ItemID | Quantity | Price
      5       | 23     | 10.00    | 12.50
      8       | 32     | 1.00     | 5.60
      5       | 16     | 1.00     | 1.99
      8       | 24     | 5.00     | 26.50

      OrdersWithItems (Orders columns plus a nested OrderItems column of ItemID, Quantity, Price)
      OrderID | CustomerID | OrderTime | ShipMode | OrderItems
      5       | 23         | 10.00     | 12.50    | (23, 10.00, 12.50), (16, 1.00, 1.99)
      8       | 32         | 1.00      | 5.60     | (32, 1.00, 5.60), (24, 5.00, 26.50)
  • 56. Redshift Query Editor (launched in October!) ● Query data directly from the AWS console ● Results are instantly visible within the console ● No need to install and set up an external JDBC/ODBC client
  • 57. Life of a query [Diagram, repeated on slides 57-65: JDBC/ODBC client; Amazon Redshift cluster with leader node and compute nodes 1-3; Spectrum nodes 1 … N; Amazon S3 (exabyte-scale object storage); Glue Catalog / Apache Hive Metastore]. Step 1: Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
  • 58. Life of a query, step 2: Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
  • 59. Life of a query, step 3: Query plan is sent to all compute nodes
  • 60. Life of a query, step 4: Compute nodes obtain partition info from Data Catalog; dynamically prune partitions
  • 61. Life of a query, step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
  • 62. Life of a query, step 6: Amazon Redshift Spectrum nodes scan your S3 data
  • 63. Life of a query, step 7: Amazon Redshift Spectrum projects, filters, joins and aggregates
  • 64. Life of a query, step 8: Final aggregations and joins with local Amazon Redshift tables done in-cluster
  • 65. Life of a query, step 9: Result is sent back to client
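The partition pruning in step 4 can be illustrated with a toy catalog. This is a conceptual sketch only: the catalog contents, the `dt` partition key, and the S3 paths are hypothetical, and it does not reflect how Amazon Redshift implements pruning internally.

```python
from datetime import date

# Hypothetical catalog: partition value -> S3 location of that partition.
partitions = {
    date(2018, 11, 1): "s3://example-bucket/events/dt=2018-11-01/",
    date(2018, 11, 2): "s3://example-bucket/events/dt=2018-11-02/",
    date(2018, 11, 3): "s3://example-bucket/events/dt=2018-11-03/",
    date(2018, 11, 4): "s3://example-bucket/events/dt=2018-11-04/",
}


def prune(catalog, start, end):
    """Keep only the locations whose partition value falls in [start, end].

    Everything else is never listed or scanned, which is the point of pruning.
    """
    return [loc for dt, loc in sorted(catalog.items()) if start <= dt <= end]


to_scan = prune(partitions, date(2018, 11, 2), date(2018, 11, 3))
print(f"Scanning {len(to_scan)} of {len(partitions)} partitions")
for location in to_scan:
    print(location)
```

A predicate on the partitioning column lets whole S3 prefixes be skipped, whereas a predicate on any other column still requires reading the files to evaluate it.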
  • 66. Unnesting using LEFT Joins
      SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price
      FROM spectrum.customers c LEFT JOIN c.orders o ON true;

      id | given | family | shipdate            | price
      ---|-------|--------|---------------------|------
      1  | John  | Smith  | 2018-03-01 11:59:59 | 100.5
      1  | John  | Smith  | 2018-03-01 09:10:00 | 99.12
      2  | Jenny | Doe    |                     |
      3  | Andy  | Jones  | 2018-03-02 08:02:15 | 13.5
      (4 rows)
    The row for Jenny Doe is the customer without orders: it is kept, with empty shipdate and price.
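The difference a LEFT JOIN makes when unnesting can be shown in a few lines of Python. This is a sketch of the row-level semantics only, using records reconstructed from the slide's result; `None` stands in for SQL NULL.

```python
# Customer records reconstructed from the slide's result set.
customers = [
    {"id": 1, "given": "John", "family": "Smith", "orders": [
        {"shipdate": "2018-03-01 11:59:59", "price": 100.5},
        {"shipdate": "2018-03-01 09:10:00", "price": 99.12},
    ]},
    {"id": 2, "given": "Jenny", "family": "Doe", "orders": []},
    {"id": 3, "given": "Andy", "family": "Jones", "orders": [
        {"shipdate": "2018-03-02 08:02:15", "price": 13.5},
    ]},
]


def unnest_left(rows):
    """Unnest orders but keep parents with no orders, padding with None."""
    out = []
    for c in rows:
        # An empty array contributes a single placeholder element.
        for o in c["orders"] or [None]:
            out.append((
                c["id"], c["given"], c["family"],
                o["shipdate"] if o else None,
                o["price"] if o else None,
            ))
    return out


for row in unnest_left(customers):
    print(row)
```

Compared with the comma-style unnesting on the Nested Data Support slide, the only change is that a parent with an empty array still yields one row.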
  • 67. Aggregating nested data with subqueries: number of orders for each customer
      SELECT c.name.given, c.name.family,
             (SELECT COUNT(*) FROM c.orders o) AS ordercount
      FROM spectrum.customers c;

      given | family | ordercount
      ------|--------|-----------
      Jenny | Doe    | 0
      John  | Smith  | 2
      Andy  | Jones  | 1
      (3 rows)
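The correlated subquery counts the elements of each parent row's own array, so every customer yields exactly one output row. A minimal Python analogue, with records reconstructed from the slide's result and order contents elided as empty placeholders:

```python
# Only the array lengths matter here, so each order is an empty placeholder.
customers = [
    {"given": "John", "family": "Smith", "orders": [{}, {}]},
    {"given": "Jenny", "family": "Doe", "orders": []},
    {"given": "Andy", "family": "Jones", "orders": [{}]},
]


def order_counts(rows):
    """One output row per customer, with the size of its nested array."""
    return [(c["given"], c["family"], len(c["orders"])) for c in rows]


for given, family, count in order_counts(customers):
    print(f"{given} {family}: {count}")
```

Unlike the unnesting joins, this never multiplies parent rows, and customers without orders appear with a count of 0. The output order differs from the slide because SQL does not guarantee row order without ORDER BY.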
  • 68. Comparable performance with smaller footprint [Chart: execution time (s), 0-700, for categories 1-7; two series]
  • 69. “Small data” & low-latency queries [Chart: 1TB TPC-H dataset, TPC-H Q1-like query, predicate on partitioning column; execution time (s), 0-2, for categories 1-3; two series]
  • 70. “Small data” & low-latency queries: for low-latency & frequent queries Redshift is a great option [Chart: 1TB TPC-H dataset, TPC-H Q1-like query; panels for predicate on partitioning column and predicate on non-partitioning column; execution time (s), 0-12, for categories 1-3; three series]