This advanced session targets Amazon Simple Storage Service (Amazon S3) technical users. We will discuss the impact of object naming conventions and parallelism on S3 performance, provide real-world examples and code that implements best practices for naming objects and parallelizing both PUTs and GETs, cover multipart uploads and byte-range downloads, and introduce GNU parallel as a quick and easy way to improve S3 performance.
5. Choosing a Region
• Performance
– Proximity to your users
– Co-locating with compute, other AWS resources
• Other things to think about
– Legal and regulatory requirements
– Costs vary by region
6. Pay Attention to Your Naming Scheme If:
• You want consistent performance from a bucket
• You want a bucket capable of routinely exceeding 100 TPS
http://amzn.to/18oF5LC
7. Transactions Per Second (TPS)
100/8 = 12.5 events/sec
100,000 users @ 10 events an hour ≈ 278 TPS
8. Distributing Key Names
• Don’t do this
<my_bucket>/2013_11_13-164533125.jpg
<my_bucket>/2013_11_13-051033564.jpg
<my_bucket>/2013_11_13-061133789.jpg
<my_bucket>/2013_11_13-051033458.jpg
<my_bucket>/2013_11_12-063433125.jpg
<my_bucket>/2013_11_12-021033564.jpg
<my_bucket>/2013_11_12-065533789.jpg
<my_bucket>/2013_11_12-011033458.jpg
<my_bucket>/2013_11_11-022333125.jpg
<my_bucket>/2013_11_11-153433564.jpg
<my_bucket>/2013_11_11-065233789.jpg
<my_bucket>/2013_11_11-065633458.jpg
9. Distributing Key Names
• Add randomness to the beginning of the key name
<my_bucket>/521335461-2013_11_13.jpg
<my_bucket>/465330151-2013_11_13.jpg
<my_bucket>/987331160-2013_11_13.jpg
<my_bucket>/465765461-2013_11_13.jpg
<my_bucket>/125631151-2013_11_13.jpg
<my_bucket>/934563160-2013_11_13.jpg
<my_bucket>/532132341-2013_11_13.jpg
<my_bucket>/565437681-2013_11_13.jpg
<my_bucket>/234567460-2013_11_13.jpg
<my_bucket>/456767561-2013_11_13.jpg
<my_bucket>/345565651-2013_11_13.jpg
<my_bucket>/431345660-2013_11_13.jpg
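A minimal sketch of the randomized-prefix idea above (the function name and 9-digit width are illustrative): prepending a fixed-width random number means sequential uploads no longer share an ever-growing common prefix.

```python
import random

def randomized_key(date_stamp: str, ext: str = "jpg") -> str:
    """Prepend a fixed-width random number so sequential uploads
    spread across S3's index instead of clustering by date."""
    prefix = random.randint(0, 999_999_999)
    return f"{prefix:09d}-{date_stamp}.{ext}"

print(randomized_key("2013_11_13"))  # e.g. 521335461-2013_11_13.jpg
```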
10. Other Techniques for Distributing Key Names
• Store objects as a hash of their name
– add the original name as metadata
• “deadmau5_mix.mp3” 0aa316fb000eae52921aab1b4697424958a53ad9
– watch for duplicate names!
– prepend keyname with short hash
• 0aa3-deadmau5_mix.mp3
• Epoch time (reverse)
– 5321354831-deadmau5_mix.mp3
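Both techniques above can be sketched as follows (function names are illustrative). Note that reversing the digits of epoch 1384531235 yields exactly the 5321354831 prefix shown in the example, putting the fastest-changing digit first.

```python
import hashlib

def short_hash_key(name: str) -> str:
    """Prefix with the first 4 hex chars of a hash of the name;
    keeping the full original name avoids duplicate-name collisions."""
    return f"{hashlib.sha1(name.encode()).hexdigest()[:4]}-{name}"

def reversed_epoch_key(name: str, epoch: int) -> str:
    """Reverse the epoch's digit string so the fastest-changing
    digit leads the key, spreading writes across the index."""
    return f"{str(epoch)[::-1]}-{name}"

print(short_hash_key("deadmau5_mix.mp3"))
print(reversed_epoch_key("deadmau5_mix.mp3", 1384531235))
# second line prints: 5321354831-deadmau5_mix.mp3
```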
11. Randomness in a Key Name Can Be an Anti-Pattern
• Lifecycle policies
• LISTs with prefix filters
• Maintaining thumbnails of images
– craig.jpg -> stored as orig-09329jed0fc
– thumb-09329jed0fc
• When you need to recover a file with its original name
12. Solving for the Anti-Pattern
• Add additional prefixes to help sorting
<my_bucket>/images/521335461-2013_11_13.jpg
<my_bucket>/images/465330151-2013_11_13.jpg
<my_bucket>/movies/293924440-2013_11_13.jpg
<my_bucket>/movies/987331160-2013_11_13.jpg
<my_bucket>/thumbs-small/838434842-2013_11_13.jpg
<my_bucket>/thumbs-small/342532454-2013_11_13.jpg
<my_bucket>/thumbs-small/345233453-2013_11_13.jpg
<my_bucket>/thumbs-small/345453454-2013_11_13.jpg
• Amazon S3 maintains keys lexicographically in its internal indices
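Because keys are kept in lexicographic order, a prefix filter selects one of these groups without touching the rest. A self-contained sketch, using a plain sorted list to stand in for the bucket index (a real call would use `list_objects_v2` with its `Prefix` parameter):

```python
def keys_with_prefix(sorted_keys: list[str], prefix: str) -> list[str]:
    """Stand-in for an S3 LIST with a Prefix filter: in a sorted
    index, matching keys form one contiguous run."""
    return [k for k in sorted_keys if k.startswith(prefix)]

index = sorted([
    "images/521335461-2013_11_13.jpg",
    "movies/293924440-2013_11_13.jpg",
    "thumbs-small/838434842-2013_11_13.jpg",
    "thumbs-small/342532454-2013_11_13.jpg",
])

print(keys_with_prefix(index, "thumbs-small/"))
# ['thumbs-small/342532454-2013_11_13.jpg', 'thumbs-small/838434842-2013_11_13.jpg']
```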
13. Distributing Your Key Names Is Always a Good Idea!
It can take some time for improvements to manifest.
Open a support case if you need an immediate bump or if you’ve got any questions!
http://amzn.to/18oF5LC
15. Using Amazon CloudFront for Distribution
• Caches objects from Amazon S3
• Reduces the number of Amazon S3 GETs
• Low latency with multiple endpoints
• High transfer rate
• Two flavors:
– Web distribution (static content)
– RTMP distribution (on-demand streaming of media)
16. Multipart Upload Provides Parallelism
• Allows faster, more flexible uploads
• Allows you to upload a single object as a set of parts
• Upon upload, Amazon S3 then presents all parts as a single object
• Enables parallel uploads, pausing and resuming an object upload, and beginning uploads before you know the total object size
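A sketch of the mechanics, with the network call stubbed out (in a real client, `upload_part` would wrap S3's UploadPart operation; part numbering starts at 1, as in S3 multipart uploads):

```python
from concurrent.futures import ThreadPoolExecutor

def part_ranges(total_size: int, part_size: int):
    """Yield (part_number, offset, length) tuples covering the object."""
    for number, offset in enumerate(range(0, total_size, part_size), start=1):
        yield number, offset, min(part_size, total_size - offset)

def multipart_upload(data: bytes, part_size: int, upload_part):
    """Upload parts in parallel; S3 presents them as one object
    once the multipart upload is completed."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [
            pool.submit(upload_part, number, data[offset:offset + length])
            for number, offset, length in part_ranges(len(data), part_size)
        ]
        return [f.result() for f in futures]
```

In practice, boto3's `TransferConfig` (with `multipart_threshold` and `multipart_chunksize`) passed to `upload_file` performs this splitting and parallelism for you.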
17. Choose the Right Part Size
• Strike a balance between part size and number of parts
– Lots of small parts increase connection overhead, negating the benefits of parallelism
– Too few large parts forfeit the benefits of multipart and the resiliency to network errors
• We recommend parts of 25–50 MB on higher-bandwidth networks and parts of 10 MB on mobile networks
18. You Can Parallelize Your GETs, Too
• Use range-based GETs to get multithreaded performance when downloading objects
• Compensates for unreliable networks
• Benefits of multithreaded parallelism
• Align your ranges with your parts!
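A sketch of building the Range header values, one per parallel GET; setting `part_size` to the upload's part size aligns ranges with parts, as the last bullet suggests. Note that S3 byte ranges are inclusive on both ends.

```python
def range_headers(total_size: int, part_size: int) -> list[str]:
    """HTTP Range header values covering the whole object,
    with inclusive start and end offsets."""
    return [
        f"bytes={start}-{min(start + part_size, total_size) - 1}"
        for start in range(0, total_size, part_size)
    ]

print(range_headers(100, 40))  # ['bytes=0-39', 'bytes=40-79', 'bytes=80-99']
```

Each header would then be sent on its own GET (e.g. the `Range` parameter of a GetObject call) from a separate thread.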
19. If you’re using SSL and parallelizing…
• You’re likely to become CPU-constrained because encryption is CPU-intensive
• Amazon S3 recommends using AES-256 to optimize for security and performance
• You can leverage AES-NI hardware on your host to improve your performance
20. If Your Application Relies on LIST…
• Getting the objects your customers have stored
• Seeing sets of files (all animations, videos)
• Getting logs
• Viewing inventories
• Sorting keys based on metadata
21. What Should You Do?
• Parallelize LIST when you need a sequential list of your keys
• You should build a secondary index of your keys, such as with Amazon DynamoDB, to get a faster alternative to LIST when a sequential list isn’t sufficient
– Sorting by metadata
– Looking up by category
– Objects by time stamp
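Parallelizing LIST can be sketched as one prefix-filtered LIST per leading character of the randomized key space, merged into a single sorted sequence. The list function is injected here with an in-memory stand-in; a real one would call `list_objects_v2` with a `Prefix`.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_list(list_by_prefix, prefixes):
    """Issue one prefix-filtered LIST per shard concurrently,
    then merge into a single sorted key sequence."""
    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        shards = pool.map(list_by_prefix, prefixes)
    return sorted(key for shard in shards for key in shard)

# In-memory stand-in for the bucket's prefix-filtered LIST:
index = ["0a1-a.jpg", "0ff-d.jpg", "19c-c.jpg", "1b2-b.jpg"]
fake_list = lambda prefix: [k for k in index if k.startswith(prefix)]
print(parallel_list(fake_list, ["0", "1"]))
```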
22. LIST Operations with Amazon DynamoDB
• Maintain metadata in DynamoDB
– Keep data about what’s in your buckets in DynamoDB
• On PUTs, enter data about your objects in DynamoDB
• On GETs, use DynamoDB to assist in your search for
specific objects
• You can use DynamoDB to give you “LIST” based on
specific criteria
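A sketch of the pattern with an in-memory dict standing in for the DynamoDB table; real code would call `put_item` right after each S3 PUT and `query` to serve a criteria-based "LIST" (class and method names here are illustrative).

```python
from collections import defaultdict

class MetadataIndex:
    """In-memory stand-in for a DynamoDB table that is kept in
    sync with the bucket on every PUT."""

    def __init__(self):
        self._by_category = defaultdict(list)

    def record_put(self, key: str, category: str, timestamp: int):
        # Real code: Table.put_item(...) right after the S3 PUT.
        self._by_category[category].append((timestamp, key))

    def list_category(self, category: str):
        # Real code: Table.query(...) - a "LIST" by criteria,
        # without paginating through the whole bucket.
        return [key for _, key in sorted(self._by_category[category])]

idx = MetadataIndex()
idx.record_put("465330151-2013_11_13.jpg", "images", 1384531235)
idx.record_put("293924440-2013_11_13.jpg", "movies", 1384531236)
print(idx.list_category("images"))
```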
23. Wrap up: Maximizing Amazon S3 Performance
• Architecture
– Choosing a region
– Building a naming scheme
– Considering LISTs
• Optimizing PUTs
– Multipart upload
• Optimizing GETs
– Using CloudFront
– Range-based GETs
24. Please give us your feedback on this presentation
STG304
As a thank you, we will select prize winners daily for completed surveys!