Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
26. WB Analytics
What we do...
● Many teams work to publish a game.
● Each team brings specialized tools.
● We combine these tools with client data to create a consistent, actionable view of each game.
27. WB Analytics
Where we started...
[Diagram: Server Telemetry, Client Telemetry, and Demographics each loaded on a 24 hr cycle into a popular column-based RDBMS, queried with SQL]
Challenges:
• Delayed data
• Resource constraints / scaling
• Multi-year CapEx
• SQL limited to the RDBMS
28. WB Analytics
Picking the right tools...
[Diagram: client and server data flow through integration, tech, modeling, insights, and reasoning layers]

Integrations
● Enforce schemas
● Schema lineage
● Auto schema merge

Ingestion
● Maintain consistent API(s)
● Spark: micro-batch on an Amazon EC2 autoscaling group
● Airflow: batch

Storage / Data Lake Query Engine(s)
● Amazon S3: raw data
● Amazon Redshift: fast / modest-sized queries
● Spectrum/Amazon S3: large-sized & multi-cluster queries
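The Spectrum/Amazon S3 tier above depends on registering lake data with the query engine. A minimal sketch of that registration in Redshift Spectrum, using the documented `CREATE EXTERNAL SCHEMA` / `CREATE EXTERNAL TABLE` syntax — all schema, table, column, bucket, and IAM role names here are hypothetical:

```sql
-- Register the Glue Data Catalog database as an external schema
-- (hypothetical names throughout).
CREATE EXTERNAL SCHEMA telemetry_lake
FROM DATA CATALOG
DATABASE 'telemetry'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Expose Parquet data in S3 as a partitioned external table.
CREATE EXTERNAL TABLE telemetry_lake.client_events (
    event_id   VARCHAR(64),
    event_name VARCHAR(128),
    event_ts   TIMESTAMP
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/analysis-lake/client_events/';
```

Once registered, the same S3 data is visible both to Redshift (via Spectrum) and to other catalog-aware engines.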
29. Putting it all together
[Architecture diagram: Ingestion → Modeling → Visuals & Automation]
● Client events and server events arrive through API(s) into Kafka, governed by schema management and schema storage.
● Spark on an Amazon EC2 autoscaling group processes the event streams into the S3 raw lake.
● Profile processing produces Parquet extracts in the S3 analysis lake (the data lake).
● Batch daily loads bring sales, social, market, and other data in through the Amazon Redshift loader.
● Spectrum transform/load feeds each game cluster and a high-frequency consolidated cluster.
● Data models and analyst services are built on top.
30. WB Analytics
Our Amazon Redshift Fleet

Environment
● ~30 clusters
● Dedicated ingest pipeline and Redshift cluster per game
● Storage:
  ○ Amazon Redshift - 150 TB
  ○ Data Lake - 1 PB

Targets
● Peak sustained: 100k events/sec across both event streams
● 40 - 300 tables per game
● 3-10 minute micro-batches
● Spectrum (scanned/mo) - ↑ 1 PB
31. WB Analytics
Amazon Redshift Wins

Operational Flexibility
- Budget - Manage OpEx based on lifecycle
- Recovery - Faster resolution to data delays
- Scaling - Hours instead of weeks
- Managed - Not in the hardware business

Customer Experience
- Modeling - More modeling done in the warehouse (enabling tools like Looker)
- Tools - Same data assets for multiple tools (Spectrum + Amazon S3 + Parquet)
- Portability - Rapid sharing of common data assets across Amazon Redshift clusters
32. WB Analytics
More Amazon Redshift Wins!

Tips
❏ Schema merge / evolution
❏ Data retention strategy
❏ Load at different frequencies
❏ Spectrum as warm storage tier
❏ Everything big in columnar format
❏ Learn Spectrum pushdown
❏ Use Glue Data Catalog
❏ Other query engines fit some use cases
❏ Compact many small files
❏ Communicate to service teams

Observations
1. Compute vs storage (++ with Spectrum)
2. Instance types (++ with DC2)
3. Resize speed (++ with Elastic Resize)
4. Storage tiers (coming...)
5. Faster (coming...)
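To illustrate the "learn Spectrum pushdown" tip: Spectrum prunes partitions from the partition-key filter and pushes row filters into the S3 scan layer, so only matching rows travel back to the cluster. A sketch against a hypothetical partitioned external table:

```sql
-- Hypothetical table and columns. The dt filter prunes S3 partitions
-- before any files are read; the event_name filter is pushed down into
-- the Spectrum scan so only matching rows return to the Redshift nodes.
SELECT event_name, COUNT(*) AS events
FROM telemetry_lake.client_events
WHERE dt BETWEEN '2018-11-01' AND '2018-11-07'
  AND event_name = 'session_start'
GROUP BY event_name;
```

With columnar Parquet underneath, Spectrum also reads only the referenced columns, which is why "everything big in columnar format" pays off.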
34. WB Analytics
… and now the Chalk Talk

Goals
● Cap Amazon Redshift cost by limiting cluster growth
● Size clusters for compute, not storage
● Hot/warm storage tiers
● Maintain query SLAs

Features
● Unload to Parquet
● Spectrum Accelerator
● Elastic Resize
35. WB Analytics
Elastic Resize Performance
❏ Test cluster: 6-node dc2.xlarge @ 2 TB per node => 12 TB cluster, 50% full
❏ Scale up 2x with "Classic Resize": 18-24 hrs before read/write available
❏ Scale up 2x with "Elastic Resize": 7 min!
  ❏ 4 min prep phase
  ❏ 3 min resize phase - cluster is read/write available now
  ❏ Post-resize data copy phase: ~30 min
❏ Scale down 2x from 12 nodes with "Elastic Resize": 8 min!
  ❏ 4 min prep phase
  ❏ 4 min resize phase - cluster is read/write available now
  ❏ Post-resize data copy phase: ~90 min
36. WB Analytics
UNLOAD to Parquet Performance
❏ Unload 215 daily partitions to Parquet in S3
❏ 10-node dc2.8xlarge cluster => 160 slices
❏ UNLOAD … TO PARQUET … PARALLEL
❏ 99.8th-percentile slice unload time = 1.3 sec
❏ Remaining 0.2% of slices: unload time ~40 sec
❏ 215 daily partitions UNLOAD to Parquet: ~44 min
❏ Same UNLOAD to delimited text: ~40 min
❏ Good enough already!
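The UNLOAD shorthand on the slide corresponds to Redshift's documented `FORMAT AS PARQUET` syntax. A sketch of unloading one daily partition to a Hive-style `dt=` prefix (table, bucket, and IAM role names are hypothetical):

```sql
-- Unload one day of data as Parquet to a Hive-style partition prefix
-- (hypothetical table, bucket, and role). PARALLEL ON (the default)
-- lets each slice write its own files.
UNLOAD ('SELECT * FROM client_events WHERE event_date = ''2018-11-26''')
TO 's3://example-bucket/analysis-lake/client_events/dt=2018-11-26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
FORMAT AS PARQUET
PARALLEL ON;
```

Writing to a `dt=YYYY-MM-DD/` prefix sets up the partition-mapping tips on the recap slide that follows.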
39. WB Analytics
UNLOAD to Parquet Recap

Observations
1. Queryable from other query engines - including TIMESTAMP!
2. Small/modest-parallelism unload performance is fast - many times faster than text
3. Highly parallel or many concurrent unloads are slower, but within 20% of delimited text
4. A small fraction of slower Parquet writes are the long poles

Tips
❏ Use Hive sub-directory name format
❏ Discover (or map) partitions onto S3 data
❏ Reassemble with a UNION view
❏ UNLOAD/COPY is faster than INSERT … SELECT for the remainder in Amazon Redshift, and for some other use cases too
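The partition and UNION-view tips above might combine like this sketch (all names hypothetical; `WITH NO SCHEMA BINDING` is required by Redshift for views that reference external tables):

```sql
-- Map an unloaded day onto the external table (hypothetical names).
ALTER TABLE telemetry_lake.client_events
ADD IF NOT EXISTS PARTITION (dt = '2018-11-26')
LOCATION 's3://example-bucket/analysis-lake/client_events/dt=2018-11-26/';

-- Reassemble hot (local) and warm (Spectrum/S3) data behind one view.
CREATE VIEW all_client_events AS
SELECT event_id, event_name, event_ts
FROM public.client_events            -- hot: recent data, local storage
UNION ALL
SELECT event_id, event_name, event_ts
FROM telemetry_lake.client_events    -- warm: older data, Parquet in S3
WITH NO SCHEMA BINDING;
```

Queries hit the view; the hot/warm split stays invisible to analysts and BI tools.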
40. WB Analytics
Spectrum Accelerator Recap

Observations
1. Fast when data reduction happens
2. Varied speedup based on pushdown predicates
3. Only happens when it's worth it
4. No performance regression
5. The system view svl_s3requests is the key to understanding caching
6. Speedup not yet predictable

Tips
❏ Know your query workload
❏ Ask for more predicate pushdown
❏ Track the S3 scanned/returned ratio in svl_s3requests
❏ Compare first query execution vs. later executions
❏ Engage support when speedup is less than expected
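The `svl_s3requests` view named on the slide is not in the public Redshift documentation; the documented `SVL_S3QUERY_SUMMARY` view exposes comparable scanned/returned counters, so a tracking query might look like this (column names taken from that documented view):

```sql
-- Scanned-vs-returned ratio per query, highest scan volume first.
-- A high ratio means Spectrum is filtering heavily at the S3 layer,
-- which is where the speedup comes from.
SELECT query,
       SUM(s3_scanned_bytes)       AS scanned_bytes,
       SUM(s3query_returned_bytes) AS returned_bytes,
       SUM(s3_scanned_bytes)::FLOAT8
         / NULLIF(SUM(s3query_returned_bytes), 0) AS scan_to_return_ratio
FROM svl_s3query_summary
GROUP BY query
ORDER BY scanned_bytes DESC
LIMIT 20;
```

Running it after a first execution and again after repeats is one way to follow the "first vs. later executions" tip.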