Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
26. WB Analytics
What we do...
● Many teams work to publish a game.
● Each team brings specialized tools.
● We combine these tools with client data to create a consistent, actionable view of each game.
27. WB Analytics
Where we started...
[Diagram: Server Telemetry, Client Telemetry, and Demographics each loaded on a 24 hr cycle into a popular column-based RDBMS, queried with SQL]
Challenges:
• Delayed data
• Resource constraints / scaling
• Multi-year CapEx
• SQL limited to the RDBMS
28. WB Analytics
Picking the right tools...
[Diagram: client and server data flow through integration, tech, modeling, insights, and reasoning layers]

Integrations
● Enforce schemas
● Schema lineage
● Auto schema merge

Ingestion
● Maintain consistent API(s)
● Spark: micro-batch on an Amazon EC2 autoscaling group
● Airflow: batch

Storage / Data Lake Query Engine(s)
● Amazon S3: raw data
● Amazon Redshift: fast / modest-sized queries
● Spectrum/Amazon S3: large-sized & multi-cluster queries
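The Spectrum/Amazon S3 tier above depends on registering lake data with the query engine. A minimal sketch of that registration in Redshift Spectrum, using the documented `CREATE EXTERNAL SCHEMA` / `CREATE EXTERNAL TABLE` syntax — all schema, table, column, bucket, and IAM role names here are hypothetical:

```sql
-- Register the Glue Data Catalog database as an external schema
-- (hypothetical names throughout).
CREATE EXTERNAL SCHEMA telemetry_lake
FROM DATA CATALOG
DATABASE 'telemetry'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Expose Parquet data in S3 as a partitioned external table.
CREATE EXTERNAL TABLE telemetry_lake.client_events (
    event_id   VARCHAR(64),
    event_name VARCHAR(128),
    event_ts   TIMESTAMP
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/analysis-lake/client_events/';
```

Once registered, the same S3 data is visible both to Redshift (via Spectrum) and to other catalog-aware engines.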
29. Putting it all together
[Architecture diagram: Ingestion → Modeling → Visuals & Automation]
● Client events and server events arrive through API(s) into Kafka, governed by schema management and schema storage.
● Spark on an Amazon EC2 autoscaling group processes the event streams into the S3 raw lake.
● Profile processing produces Parquet extracts in the S3 analysis lake (the data lake).
● Batch daily loads bring sales, social, market, and other data in through the Amazon Redshift loader.
● Spectrum transform/load feeds each game cluster and a high-frequency consolidated cluster.
● Data models and analyst services are built on top.
30. WB Analytics
Our Amazon Redshift Fleet

Environment
● ~30 clusters
● Dedicated ingest pipeline and Redshift cluster per game
● Storage:
  ○ Amazon Redshift - 150 TB
  ○ Data Lake - 1 PB

Targets
● Peak sustained: 100k events/sec across both event streams
● 40 - 300 tables per game
● 3-10 minute micro-batches
● Spectrum (scanned/mo) - ↑ 1 PB
31. WB Analytics
Amazon Redshift Wins

Operational Flexibility
- Budget - Manage OpEx based on lifecycle
- Recovery - Faster resolution to data delays
- Scaling - Hours instead of weeks
- Managed - Not in the hardware business

Customer Experience
- Modeling - More modeling done in the warehouse (enabling tools like Looker)
- Tools - Same data assets for multiple tools (Spectrum + Amazon S3 + Parquet)
- Portability - Rapid sharing of common data assets across Amazon Redshift clusters
32. WB Analytics
More Amazon Redshift Wins!

Tips
❏ Schema merge / evolution
❏ Data retention strategy
❏ Load at different frequencies
❏ Spectrum as warm storage tier
❏ Everything big in columnar format
❏ Learn Spectrum pushdown
❏ Use Glue Data Catalog
❏ Other query engines fit some use cases
❏ Compact many small files
❏ Communicate to service teams

Observations
1. Compute vs storage (++ with Spectrum)
2. Instance types (++ with DC2)
3. Resize speed (++ with Elastic Resize)
4. Storage tiers (coming...)
5. Faster (coming...)
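To illustrate the "learn Spectrum pushdown" tip: Spectrum prunes partitions from the partition-key filter and pushes row filters into the S3 scan layer, so only matching rows travel back to the cluster. A sketch against a hypothetical partitioned external table:

```sql
-- Hypothetical table and columns. The dt filter prunes S3 partitions
-- before any files are read; the event_name filter is pushed down into
-- the Spectrum scan so only matching rows return to the Redshift nodes.
SELECT event_name, COUNT(*) AS events
FROM telemetry_lake.client_events
WHERE dt BETWEEN '2018-11-01' AND '2018-11-07'
  AND event_name = 'session_start'
GROUP BY event_name;
```

With columnar Parquet underneath, Spectrum also reads only the referenced columns, which is why "everything big in columnar format" pays off.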
34. WB Analytics
… and now the Chalk Talk

Goals
● Cap Amazon Redshift cost by limiting cluster growth
● Size clusters for compute, not storage
● Hot/warm storage tiers
● Maintain query SLAs

Features
● Unload to Parquet
● Spectrum Accelerator
● Elastic Resize
35. WB Analytics
Elastic Resize Performance
❏ Test cluster: 6-node dc2.xlarge @ 2 TB per node => 12 TB cluster, 50% full
❏ Scale up 2x with "Classic Resize": 18-24 hrs before read/write available
❏ Scale up 2x with "Elastic Resize": 7 min!
  ❏ 4 min prep phase
  ❏ 3 min resize phase - cluster is read/write available now
  ❏ Post-resize data copy phase: ~30 min
❏ Scale down 2x from 12 nodes with "Elastic Resize": 8 min!
  ❏ 4 min prep phase
  ❏ 4 min resize phase - cluster is read/write available now
  ❏ Post-resize data copy phase: ~90 min
36. WB Analytics
UNLOAD to Parquet Performance
❏ Unload 215 daily partitions to Parquet in S3
❏ 10-node dc2.8xlarge cluster => 160 slices
❏ UNLOAD … TO PARQUET … PARALLEL
❏ 99.8th-percentile slice unload time = 1.3 sec
❏ Remaining 0.2% of slices: unload time ~40 sec
❏ 215 daily partitions UNLOAD to Parquet: ~44 min
❏ Same UNLOAD to delimited text: ~40 min
❏ Good enough already!
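The UNLOAD shorthand on the slide corresponds to Redshift's documented `FORMAT AS PARQUET` syntax. A sketch of unloading one daily partition to a Hive-style `dt=` prefix (table, bucket, and IAM role names are hypothetical):

```sql
-- Unload one day of data as Parquet to a Hive-style partition prefix
-- (hypothetical table, bucket, and role). PARALLEL ON (the default)
-- lets each slice write its own files.
UNLOAD ('SELECT * FROM client_events WHERE event_date = ''2018-11-26''')
TO 's3://example-bucket/analysis-lake/client_events/dt=2018-11-26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
FORMAT AS PARQUET
PARALLEL ON;
```

Writing to a `dt=YYYY-MM-DD/` prefix sets up the partition-mapping tips on the recap slide that follows.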
39. WB Analytics
UNLOAD to Parquet Recap

Observations
1. Queryable from other query engines - including TIMESTAMP!
2. Small/modest-parallelism unload performance is fast - many times faster than text
3. Highly parallel or many concurrent unloads are slower, but within 20% of delimited text
4. A small fraction of slower Parquet writes are the long poles

Tips
❏ Use Hive sub-directory name format
❏ Discover (or map) partitions onto S3 data
❏ Reassemble with a UNION view
❏ UNLOAD/COPY is faster than INSERT … SELECT for the remainder in Amazon Redshift, and for some other use cases too
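The partition and UNION-view tips above might combine like this sketch (all names hypothetical; `WITH NO SCHEMA BINDING` is required by Redshift for views that reference external tables):

```sql
-- Map an unloaded day onto the external table (hypothetical names).
ALTER TABLE telemetry_lake.client_events
ADD IF NOT EXISTS PARTITION (dt = '2018-11-26')
LOCATION 's3://example-bucket/analysis-lake/client_events/dt=2018-11-26/';

-- Reassemble hot (local) and warm (Spectrum/S3) data behind one view.
CREATE VIEW all_client_events AS
SELECT event_id, event_name, event_ts
FROM public.client_events            -- hot: recent data, local storage
UNION ALL
SELECT event_id, event_name, event_ts
FROM telemetry_lake.client_events    -- warm: older data, Parquet in S3
WITH NO SCHEMA BINDING;
```

Queries hit the view; the hot/warm split stays invisible to analysts and BI tools.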
40. WB Analytics
Spectrum Accelerator Recap

Observations
1. Fast when data reduction happens
2. Varied speedup based on pushdown predicates
3. Only happens when it's worth it
4. No performance regression
5. The system view svl_s3requests is the key to understanding caching
6. Speedup not yet predictable

Tips
❏ Know your query workload
❏ Ask for more predicate pushdown
❏ Track the S3 scanned/returned ratio in svl_s3requests
❏ Compare first query execution vs. later executions
❏ Engage support when speedup is less than expected
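The `svl_s3requests` view named on the slide is not in the public Redshift documentation; the documented `SVL_S3QUERY_SUMMARY` view exposes comparable scanned/returned counters, so a tracking query might look like this (column names taken from that documented view):

```sql
-- Scanned-vs-returned ratio per query, highest scan volume first.
-- A high ratio means Spectrum is filtering heavily at the S3 layer,
-- which is where the speedup comes from.
SELECT query,
       SUM(s3_scanned_bytes)       AS scanned_bytes,
       SUM(s3query_returned_bytes) AS returned_bytes,
       SUM(s3_scanned_bytes)::FLOAT8
         / NULLIF(SUM(s3query_returned_bytes), 0) AS scan_to_return_ratio
FROM svl_s3query_summary
GROUP BY query
ORDER BY scanned_bytes DESC
LIMIT 20;
```

Running it after a first execution and again after repeats is one way to follow the "first vs. later executions" tip.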