More and more startups and companies are deploying their infrastructure directly and exclusively in EC2 or a similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.
This talk focuses on how to combine the infrastructure services AWS provides with great (affordable) third-party services and solid open source software to create an operations environment that scales with you, is as reliable as it can be, and gives you and your peers all the data you need to make good decisions and support (rapid) changes, while letting you sleep through the night. And all that with a tiny operations team.
It may make you coffee in the morning too.
Reliability & Scale in AWS while letting you sleep through the night
1. ONE MAN OPS
Jos Boumans - @jiboumans
http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
17. AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
18. THE RULES HAVE CHANGED
You're not in Kansas anymore
http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
19. NETWORK WILL PARTITION
And it will happen often
http://thevinylvillain.blogspot.com/2010_04_01_archive.html
20. DISK IO WILL FLUCTUATE
On a good day, it's mediocre
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
21. IP ADDRESSES WILL CHANGE
IP lease is 8 hours
DNS TTL is 60 seconds
www.fantom-xp.com
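A minimal sketch of what a 60 second DNS TTL means for client code: never hold on to a resolved IP for longer than the TTL. The class name, the injectable resolver and clock, and the hostname in the usage note are all illustrative assumptions, not something from the talk.

```python
import socket
import time


class ReResolvingHost:
    """Resolve a hostname again once the cached answer is older than ttl seconds."""

    def __init__(self, hostname, ttl=60,
                 resolver=socket.gethostbyname, clock=time.monotonic):
        self.hostname = hostname
        self.ttl = ttl
        self._resolver = resolver      # injectable so it can be faked in tests
        self._clock = clock
        self._address = None
        self._resolved_at = None

    def address(self):
        now = self._clock()
        expired = self._address is None or now - self._resolved_at >= self.ttl
        if expired:
            # The instance behind the name may have been replaced; ask DNS again.
            self._address = self._resolver(self.hostname)
            self._resolved_at = now
        return self._address
```

Usage would look like ReResolvingHost("db.internal.example").address() right before each new connection, instead of resolving once at startup.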
22. INSTANCES WILL DIE
And it will always be your Database Master
http://room57.deviantart.com/art/Hangman-188353196
24. EMBRACE FAILURE
Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
25. ADJUST YOUR STRATEGY
Don't bring a knife to a gun fight
http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
26. DATA STORES
Some work better than others
http://gustavhoiland.com/2010/03/10/stacked-boxes/
27. CAP THEOREM
Your choice: sacrifice availability or consistency.
Orange is a lie.
[Diagram labels: RDBMS, CouchDB, BigTable based, Dynamo based, Master/Slave based]
28. MYSQL / ORACLE VS RDS
See: Network partitioning & instances dying
29. BIGTABLE BASED STORES
HBase, Accumulo, Hypertable
Still suffer when network partitioning happens
http://www.cloudera.com/cdh4/
30. DYNAMO BASED STORES
Cassandra, Riak, DynamoDB
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
31. GO HOSTED?
CouchDB, MongoDB, Riak, Cassandra, HBase
Your Latency May Vary
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
32. CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
33. FILE STORES
EBS vs Instance Store
http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
34. SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
35. TRAFFIC SHAPING
Control every part of the request
http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
36. STAY LOCAL IF YOU CAN
Going off box exposes you to risks you need to mitigate
http://southshorewoman.com/issue/june-2010/article/local-character
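One way to mitigate the off-box risk, sketched under assumptions: wrap every remote call in a bounded number of retries with jittered exponential backoff, and fall back to a local answer when the network stays partitioned. The function and parameter names are hypothetical, and remote_call is expected to apply its own timeout.

```python
import random
import time


def call_with_retries(remote_call, fallback, attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Try remote_call a few times, then give up and use the local fallback."""
    for attempt in range(attempts):
        try:
            return remote_call()
        except OSError:
            # OSError covers connection resets, refused connections and timeouts.
            if attempt < attempts - 1:
                # Full jitter: wait a random time up to base_delay * 2^attempt.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()
```

The fallback can be as simple as returning the last known good value or a sensible default; the point is that the caller never waits forever on a box it cannot reach.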
37. CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://theoatmeal.com/blog/charity_money
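A small in-process cache for DB query results could look like the sketch below. It assumes hashable positional arguments and a short time to live; the decorator name and the load_profile example are made up for illustration, and a shared cache such as memcached would follow the same pattern.

```python
import functools
import time


def cached(ttl_seconds, clock=time.monotonic):
    """Cache a function's results per argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (stored_at, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # fresh enough: skip the expensive call
            value = func(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```

Decorating a hypothetical load_profile(user_id) with @cached(ttl_seconds=30) means at most one query per user every 30 seconds from this process.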
38. USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
39. USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
41. USE A CDN
Critical items should always be available
http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
42. MEASURE EVERYTHING
Find outliers, deviants & trends before they cause trouble
http://www.themoviedb.org/movie/629-the-usual-suspects
43. GRAPHITE, STATSD & COLLECTD
Use StatsD & collectd for application/system metrics
Use Graphite to store, aggregate & visualize
http://hostedgraphite.com/
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
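Emitting an application metric to StatsD takes very little code. The sketch below formats a metric in the StatsD line protocol and sends it over UDP; it assumes a StatsD daemon on localhost port 8125, and the metric names are invented for the example.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric as <name>:<value>|<type>, e.g. 'c' counter, 'ms' timer."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire and forget over UDP, so a dead metrics host never blocks the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

For example, send_metric(statsd_line("web.requests", 1, "c")) counts a request and send_metric(statsd_line("web.latency", 320, "ms")) records a 320 ms response time.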
44. GRAPH EVENTS
Deployments, outages, CDN reconfigurations, failed builds, etc
Anything that's important to the health of your ecosystem
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
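One simple way to get events onto a graph, sketched here as an assumption rather than the speaker's setup: record each event as a data point of value 1 through Graphite's plaintext protocol. The metric path, host name and default carbon port 2003 are illustrative.

```python
import socket
import time


def graphite_event_line(event_path, timestamp=None):
    """One plaintext-protocol line: '<path> <value> <unix timestamp>' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{event_path} 1 {int(timestamp)}\n"


def send_event(event_path, host="graphite.example.com", port=2003):
    """Send the event to carbon over TCP; call this from your deploy tooling."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_event_line(event_path).encode("ascii"))
```

Rendering the series with drawAsInfinite(events.deploy.web) then draws a vertical line at each deployment on top of any other graph.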
45. COMPARE WEEK TO WEEK
Overlay week to week graphs using timeShift()
Quickly identifies trends and deviations from trends
http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
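The week-over-week overlay is just two targets in one render call. The helper below builds such a URL; the Graphite base URL and metric name are placeholders, and the 24 hour window is an arbitrary choice.

```python
from urllib.parse import urlencode


def week_over_week_url(graphite_base, metric, window="-24h"):
    """Render URL that overlays a metric with the same metric from 7 days ago."""
    params = [
        ("target", metric),
        # Without a sign, timeShift moves the series back in time.
        ("target", f'timeShift({metric},"7d")'),
        ("from", window),
        ("format", "json"),
    ]
    return f"{graphite_base}/render?{urlencode(params)}"
```

Calling week_over_week_url("http://graphite.example.com", "stats.web.requests") gives a URL you can drop into a dashboard or fetch from a script.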
46. FORECASTING
Use Holt-Winters confidence bands
Verify that your metrics are within normal tolerance
https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
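Graphite ships holtWintersConfidenceBands() and holtWintersAberration() as render functions. A rough sketch of using them from a script: build the targets, fetch the series, then check which points fall outside the band. The helper names are hypothetical and the fetching step is left out.

```python
def holt_winters_targets(metric):
    """Render targets for the metric, its confidence bands and its aberration."""
    return [
        metric,
        f"holtWintersConfidenceBands({metric})",
        f"holtWintersAberration({metric})",
    ]


def out_of_band(observed, lower, upper):
    """Indexes where a data point falls outside [lower, upper]; gaps are skipped."""
    return [
        i for i, (value, low, high) in enumerate(zip(observed, lower, upper))
        if None not in (value, low, high) and not low <= value <= high
    ]
```

Any index returned by out_of_band() is a point worth looking at before it turns into a page.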
47. FIND INDIVIDUAL OUTLIERS
Absolute numbers mean very little
Use mean & standard deviation
http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
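A minimal sketch of finding the black sheep in a pool of hosts: compare each host to the pool mean and flag anything more than a chosen number of standard deviations away. The two sigma cutoff and the host names are assumptions for the example.

```python
from statistics import mean, pstdev


def find_outliers(values_by_host, max_sigma=2.0):
    """Return the hosts whose value is more than max_sigma deviations from the mean."""
    values = list(values_by_host.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # every host reports the same value, nothing stands out
    return {
        host: value
        for host, value in values_by_host.items()
        if abs(value - mu) > max_sigma * sigma
    }
```

With latencies of roughly 100 ms on five hosts and 250 ms on a sixth, only the sixth is returned, whatever the absolute numbers happen to be that day.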
48. ALERT ON TRENDS
Once you go over a threshold, it's too late
Alert on unwanted trends and preemptively fix
http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
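Alerting on a trend can be as simple as fitting a line through recent samples and asking when it will cross the threshold. This is a sketch with a plain least-squares fit; the function names are hypothetical and real metrics may need smoothing first.

```python
def slope(samples):
    """Least-squares slope of (timestamp, value) samples, in units per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom


def seconds_until(samples, threshold):
    """Projected seconds until the threshold is hit, or None if not heading there."""
    rate = slope(samples)
    _, latest_value = samples[-1]
    if latest_value >= threshold:
        return 0.0
    if rate <= 0:
        return None
    return (threshold - latest_value) / rate
```

A disk that grows from 70% to 74% over two hours is projected to hit 90% in about eight hours, which is a much friendlier alert than "disk full" at 4 am.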
50. SHOUT OUT: NEW RELIC
Python, Ruby, .NET, Java, PHP support
In-depth profiling of your app for performance & errors.
51. CONFIGURATION MANAGEMENT
Unique snowflakes are bad
http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
52. PUPPET VS CHEF
Yes.
http://puppetlabs.com/
http://www.opscode.com/chef
53. INFRASTRUCTURE AS CODE
Use different environments
Measure and report on it
http://americansingercanary.com/green.htm
54. SHOUT OUT: UBUNTU
Ubuntu + cloud-init + boto = awesome*
*I am biased
http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
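To make the combination concrete, here is a sketch that generates a cloud-config document for cloud-init. The package and command are examples only, not the speaker's setup. The resulting string is what you would hand to boto's run_instances() as user_data when launching an Ubuntu AMI.

```python
import json


def cloud_config(packages, commands):
    """Build a cloud-config document that installs packages and runs commands at boot."""
    lines = ["#cloud-config", "packages:"]
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines += [f"  - {json.dumps(package)}" for package in packages]
    lines.append("runcmd:")
    lines += [f"  - {json.dumps(command)}" for command in commands]
    return "\n".join(lines) + "\n"
```

For example, cloud_config(["puppet"], ["puppet agent --test"]) produces user data that installs Puppet on first boot and runs the agent once.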
55. DEV = PRODUCTION
"I dunno, it worked on my laptop"
Instead, use Vagrant
http://vagrantup.com/
56. ROLL YOUR OWN AMIS
Instantly boot up new deployments
Reduce Time to Respond
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
57. CONFIDENT DEPLOYS
That human error could be yours
http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
58. CONTINUOUS INTEGRATION
Ours: GitHub + Jenkins + FPM + apt::s3
From commit to deployable in one command
http://github.com/
http://jenkins-ci.org/
https://github.com/thekad/apt-s3
https://github.com/jordansissel/fpm/wiki/
59. ONE CLICK DEPLOYMENTS
Deployments should not be exciting.
Don't create a checklist; automate & track
http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://checkmarkable.com/
60. DARK LAUNCHES
Exercise the code without impacting the user experience
http://www.kissmetrics.com/
http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
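A dark launch, sketched under assumptions: a stable hash puts each user in a bucket from 0 to 99, and for users inside the rollout percentage the new code path runs but its result is thrown away. The function names and the feature name in the usage note are hypothetical.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Deterministically place a user in a 0-99 bucket and compare to percent."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def handle_request(user_id, feature, percent, current_path, new_path,
                   log_error=lambda exc: None):
    """Always serve the current code; exercise the new code in the dark."""
    result = current_path()
    if in_rollout(user_id, feature, percent):
        try:
            new_path()            # result discarded, the user never sees it
        except Exception as exc:  # a bug in dark code must not reach the user
            log_error(exc)
    return result
```

Start with something like handle_request(user_id, "new-search", 1, ...) and raise the percentage while watching your graphs for load and errors.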
61. SHADOW TRAFFIC
Test new code against live traffic
http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
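The idea behind shadow traffic, as a rough sketch that is not taken from the linked gist: answer every request from the production backend, and replay a copy against the new code in the background where failures and differences are only recorded. The primary and shadow callables stand in for real backends.

```python
import threading


def serve_with_shadow(request, primary, shadow, on_mismatch=None):
    """Answer from primary; mirror the request to shadow without affecting the user."""
    response = primary(request)

    def mirror():
        try:
            shadow_response = shadow(request)
        except Exception:
            return  # shadow failures never reach the user
        if on_mismatch is not None and shadow_response != response:
            on_mismatch(request, response, shadow_response)

    worker = threading.Thread(target=mirror, daemon=True)
    worker.start()
    # The worker is returned only so callers and tests can wait for it.
    return response, worker
```

Feeding the mismatches into your metrics gives you a graph of how often the new code disagrees with production before a single user depends on it.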