Learn how Amazon’s enterprise data warehouse, one of the world's largest, managing petabytes of data, leverages Amazon Redshift. This session covers the team's best practices and solutions, and how they use Amazon Redshift to handle design and scale challenges.
4. Amazon Data Warehouse
• Authoritative repository of data for all Amazon
• Petabytes of data
• Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift
• Owns and manages the hardware and software infrastructure
– Apart from the Oracle DB, it is all Amazon IP
• Not part of AWS
5. Introducing the Elephant…
• Mission: Provide customers the best value
– Leverage AWS only if it provides the best value
– We aren’t moving 100% to Amazon Redshift
• Publish best practices
– If AWS isn’t the best, we’ll say so
• There is a conflict of interest
7. Amazon Data Warehouse – Growth Story
• Petabytes of data
• Growth of data volume – YoY storage requirements have grown 67%
• Growth of processing volume – YoY processing demand has grown 47%
10. Amazon Data Warehouse – Cost per Job
• Our main efficiency metric – Cost per Job (CPJ)
CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
11. What Drives Cost per Job…
Up?
• Number of disks – Data gets bigger!
• Number of servers
• Short-sighted negotiations – 4th year support…
• Data Center costs (power, rent)
• Software (e.g. DBM)
Down?
• Bidding – 2+ vendors
• Moore’s Law – Vendors fight this!
• Data design
12. Current State and Problems
• Existing EDW
– Multiple multi-petabyte clusters (redundancy and jobs)
– Why not <x>? CPJ not lower
• Data stored in SANs (not Exadata)
• Performs poorly on scans of 10 TB+
• Long procurement cycles (3 month minimum)
13. Amazon Data Warehouse and Amazon Redshift
Integration Project
• Spent 2013 evaluating Amazon Redshift for the Amazon data warehouse
– Where does Amazon Redshift provide a better CPJ?
– Can Amazon Redshift solve some pain (without introducing new pain)?
• Picked 10K jobs and 275 tables to copy
14. Current State of Affairs
• Biggest cluster size: 20+1 8XL
• Peak daily jobs: 7211 (using all 4 clusters)
• 4159 extracts
• 3052 loads
15. Some Results
• Benchmarking for 4159 jobs
– Outperforming: 2719
– Underperforming: 1440
– Avg. runtime: 4:43 mins in Amazon Redshift vs. 17:38 mins in the existing EDW
• LOADs are slower
• EXTRACTs are faster
Job Type   RS Performance Category   Job Count by Category
EXTRACT    10X Faster                 945
EXTRACT    5X Faster                  487
EXTRACT    3X Faster                  393
EXTRACT    2X Faster                  301
EXTRACT    1X or same                 480
EXTRACT    2X Slower                 1150
LOAD       10X Faster                   7
LOAD       5X Faster                   15
LOAD       3X Faster                   23
LOAD       2X Faster                   23
LOAD       1X or same                  45
LOAD       2X Slower                  290
17. Amazon Redshift Integration Best Practices
• Integrating via Amazon S3 (Manifests)
• Primary key enforcement
• Idempotent loads
– MERGE via INSERT/UPDATE
– Mimic Trunc-Load [Backfills]
• Trunc-partition using sort keys
• Administration automation
• Ensuring data correctness
18. Integrating via Amazon S3
• S3 in the US Standard Region is eventually consistent!
• An S3 LIST might not return the entire set of data files right after you save them (this WILL eventually happen to you!)
• Amazon Redshift loads everything it sees in a bucket
– You may see all the data files, Amazon Redshift may not, which can cause missing data
19. Best Practices – Using Amazon S3
• Read/COPY
– System table validation – STL_LOAD_ERRORS
– Verify the files loaded are the ‘intended’ files
• Write/UNLOAD
– System table validation – STL_UNLOAD_LOG
– Verify all files that hold the data are on S3
• Manifests
– Metadata to know exactly what to read from S3 (see the COPY sketch below)
– Provides an authoritative reference to the data
– Powerful in terms of user metadata format, encryption, etc.
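A minimal sketch of the manifest-driven COPY and the system-table checks above; the bucket, manifest path, table name, and credential placeholders are hypothetical and not from the talk.

    -- Hypothetical manifest at s3://dw-staging/orders/2013-11-12/load.manifest:
    -- {
    --   "entries": [
    --     {"url": "s3://dw-staging/orders/2013-11-12/part-0000.gz", "mandatory": true},
    --     {"url": "s3://dw-staging/orders/2013-11-12/part-0001.gz", "mandatory": true}
    --   ]
    -- }
    -- COPY only the files named in the manifest, rather than everything under a
    -- prefix, so an eventually consistent S3 LIST cannot silently drop files.
    COPY analytics.orders
    FROM 's3://dw-staging/orders/2013-11-12/load.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST
    GZIP
    DELIMITER '|';

    -- Validate the COPY: parse and conversion failures land in STL_LOAD_ERRORS.
    SELECT starttime, filename, line_number, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;

    -- Validate an UNLOAD: STL_UNLOAD_LOG lists every file Redshift wrote, so the
    -- paths can be checked against what actually exists on S3.
    SELECT query, path, line_count
    FROM stl_unload_log
    ORDER BY start_time DESC
    LIMIT 20;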
20. Primary Key Enforcement
• Amazon Redshift does not enforce primary keys
– You will need to do this yourself to ensure data quality
• Best practice (see the sketch below)
– Introduce a temp table to check for duplicates in the incoming data
– Validate against the incoming data to catch offenders
– Put the data in the target table and validate the target data in the same transaction, before commit
• Yes, this IS a lot of overhead
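A sketch of that flow, assuming a hypothetical analytics.orders table keyed on order_id; the real tables and keys are not given in the talk.

    BEGIN;

    -- Stage the incoming batch; Redshift itself will not reject duplicate keys.
    CREATE TEMP TABLE stage_orders (LIKE analytics.orders);
    COPY stage_orders
    FROM 's3://dw-staging/orders/2013-11-12/load.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST;

    -- Duplicates inside the incoming batch itself.
    SELECT order_id, COUNT(*) AS cnt
    FROM stage_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;

    -- Incoming keys that already exist in the target table.
    SELECT s.order_id
    FROM stage_orders s
    JOIN analytics.orders t ON t.order_id = s.order_id;

    -- If both checks return zero rows, load the target and re-validate it in the
    -- same transaction before committing.
    INSERT INTO analytics.orders SELECT * FROM stage_orders;

    SELECT order_id
    FROM analytics.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;

    COMMIT;  -- or ROLLBACK if any validation query returned rows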
21. Idempotent Loads
• Idempotent Loads – doing a load 2+ times is the same as doing one load
– Needed to manage load failures
• MERGE – leverages primary key, row at a time
• TRUNC / INSERT – load a partition at a time
22. MERGE
• No native Amazon Redshift MERGE support
• Merge is implemented as a multi-step process (see the sketch below)
– Load the data into a temp table
– Figure out the inserts and load them
– Figure out the updates and modify the target table
– Validate for duplicates
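A sketch of those steps, reusing the hypothetical analytics.orders / stage_orders tables keyed on order_id; the column names are illustrative.

    BEGIN;

    -- Step 1: the incoming data is already staged in stage_orders (see above).
    -- Step 2: apply updates for keys that already exist in the target.
    UPDATE analytics.orders
    SET order_status = s.order_status,
        updated_at   = s.updated_at
    FROM stage_orders s
    WHERE orders.order_id = s.order_id;

    -- Step 3: insert rows whose keys are not yet in the target.
    INSERT INTO analytics.orders
    SELECT s.*
    FROM stage_orders s
    LEFT JOIN analytics.orders t ON t.order_id = s.order_id
    WHERE t.order_id IS NULL;

    -- Step 4: validate for duplicates before committing.
    SELECT order_id
    FROM analytics.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;

    COMMIT;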
23. TRUNC - INSERT
• Solution (see the sketch below)
– Distribute randomly
– Use sort keys to align data (mimics a partition)
– Selectively delete and insert
• Issues
– Inserts go into an “unsorted” bucket – performance degrades without periodic VACUUM
– Very slow (effectively row at a time)
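A sketch of the pattern, assuming the hypothetical fact table is sorted on order_date; the date range and names are illustrative.

    BEGIN;

    -- Delete only the "partition" being backfilled; restricting the predicate to
    -- the sort key keeps the delete from scanning the whole table.
    DELETE FROM analytics.orders
    WHERE order_date BETWEEN '2013-11-01' AND '2013-11-30';

    -- Reinsert the rebuilt partition from the staging table.
    INSERT INTO analytics.orders
    SELECT *
    FROM stage_orders
    WHERE order_date BETWEEN '2013-11-01' AND '2013-11-30';

    COMMIT;

    -- The inserted rows land in the unsorted region, so re-sort and reclaim the
    -- deleted space periodically or scan performance degrades.
    VACUUM analytics.orders;
    ANALYZE analytics.orders;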
24. Other Temp Table Uses
• Partial column data load
• Filtered data load
• Column transformations (see the sketch below)
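One possible shape of these temp-table uses; the stage_customers columns and the LOWER()/NVL() transforms are purely illustrative, not from the talk.

    -- Partial-column, filtered load with transformations applied via a temp table.
    CREATE TEMP TABLE stage_customers (
        customer_id BIGINT,
        email       VARCHAR(256),
        country     VARCHAR(64)
    );

    COPY stage_customers
    FROM 's3://dw-staging/customers/2013-11-12/load.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST;

    -- Filter rows and transform columns on the way into the target table.
    INSERT INTO analytics.customers (customer_id, email, country)
    SELECT customer_id,
           LOWER(email),
           NVL(country, 'UNKNOWN')
    FROM stage_customers
    WHERE email IS NOT NULL;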
25. Automating Administration
• Stored procs / Oracle workflows were used to do admin tasks like retention, stats, etc.
• Solution
– We introduced a software layer that prepares the administrative task statements based on defined inputs (examples below)
– Execute them using a JDBC connection
– Can schedule work like stats collection, vacuum, etc.
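The talk does not show the software layer itself; the statements it generates would look roughly like these (the retention window and table names are hypothetical).

    -- Retention: expire rows older than a configured window.
    DELETE FROM analytics.clickstream
    WHERE event_date < DATEADD(day, -400, CURRENT_DATE);

    -- Reclaim space and re-sort after deletes, then refresh optimizer statistics.
    VACUUM analytics.clickstream;
    ANALYZE analytics.clickstream;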
26. 2013 Results
• CPJ is 55% less on Amazon Redshift in general
– We can’t share the math, sorry – YMMV
– Between Amazon Redshift and the Amazon data warehouse, known improvements get us to ~66%
– Big wins are in big queries
– Loads are slow and expensive
• Moved ~10K jobs to ~60 8XLs (4 clusters)
• We could move at most 45% of our work to Amazon Redshift with minimal changes
27. 2014 Plan
• Focus on big tables (100 TB+)
– Need to solve data expiry and backfill challenges
• Solve problems with CPU-bound workloads
• Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)
28. Please give us your feedback on this presentation – DAT306
As a thank you, we will select prize winners daily for completed surveys!