Five Years of EC2 Distilled

Grig Gheorghiu

Silicon Valley Cloud Computing Meetup, Feb. 19th 2013

@griggheo
agiletesting.blogspot.com

whoami

• Dir of Technology at Reliam (managed
hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal

EC2 creds

• Started with personal m1.small instance in
2008
• Still around!
• UPTIME:
• 5:13:52 up 438 days, 23:33, 1 user, load average:
0.03, 0.09, 0.08

EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
• terminated DB server by mistake
• in ideal world naming doesn’t matter
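
The accidental DB termination above is the kind of mistake a naming convention can help catch before it happens. A minimal sketch, assuming a hypothetical `<env>-<role>-<index>` naming scheme and a hypothetical set of protected roles (neither comes from the talk):

```python
import re

# Hypothetical naming convention: <env>-<role>-<index>, e.g. "prod-db-01".
NAME_PATTERN = re.compile(r"^(?P<env>[a-z]+)-(?P<role>[a-z]+)-(?P<index>\d+)$")

# Hypothetical roles that need an explicit override before termination.
PROTECTED_ROLES = {"db"}


def can_terminate(name, force=False):
    """Return True only if the name parses and the role may be terminated."""
    match = NAME_PATTERN.match(name)
    if match is None:
        # Names that do not follow the convention are treated as unsafe,
        # even with force=True, because their role is unknown.
        return False
    if match.group("role") in PROTECTED_ROLES and not force:
        return False
    return True
```

A management tool could call `can_terminate("prod-db-01")` before issuing any terminate request and refuse unless the operator passes `force=True`.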

EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and
horizontal scaling
• Hard to scale at all layers at the same time
(scaling app server layer can overwhelm DB
layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back

• Automation and configuration management
become critical
• Used little-known tool - ‘slack’
• Rolled own EC2 management tool in
Python, wrapped around EC2 Java API
• Testing deployments is critical (one
mistake can get propagated everywhere)

• Hard to scale at the DB layer (MySQL)
• mysql-proxy for r/w split
• slaves behind HAProxy for reads
• HAProxy for LB, then ELB
• ELB melted initially, had to be gradually
warmed up
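
The read/write split above can be illustrated in a few lines. This is a sketch of the routing idea only, not of mysql-proxy or HAProxy themselves; the class name and the rule "SELECT and SHOW go to replicas" are assumptions made for illustration:

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas in round-robin order, everything else to the master."""

    READ_STATEMENTS = ("SELECT", "SHOW")

    def __init__(self, master, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() in self.READ_STATEMENTS:
            return next(self._replicas)
        # Writes (and anything unrecognized) go to the master to stay safe.
        return self.master
```

A real split also has to account for replication lag and for reads inside transactions, which this sketch ignores.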

EC2 at Evite

• Sharded MySQL at DB layer; application
very write-intensive
• Didn’t do proper capacity planning/dark
launching; had to move quickly from data
center to EC2 to scale horizontally
• Engaged Percona at the same time
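
The talk does not say how the shards were keyed. As a sketch of one common approach, hash-based sharding, assuming a hypothetical `shard_for` helper and string-convertible keys:

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # md5 is used only for a stable, evenly spread hash, not for security.
    # Python's built-in hash() is randomized per process for strings,
    # so it would not give the same shard across restarts.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

With a plain modulo scheme like this, changing `num_shards` remaps most keys, which is one reason adding capacity to a write-heavy sharded database takes planning.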

EC2 at Evite (cont.)
• Started with EBS volumes (separate for
data, transaction logs, temp files)
• EBS horror stories
• CPU Wait up to 100%, instances AWOL
• I/O very inconsistent, unpredictable
• Striping EBS volumes in RAID0 helps with
performance but not with reliability
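
Why striping does not help reliability: a RAID0 array is lost as soon as any one member is lost. A small worked calculation, assuming volumes fail independently (an optimistic assumption) and using an illustrative 1% failure probability that is not a measured EBS figure:

```python
def raid0_failure_probability(per_volume_failure, num_volumes):
    """Probability that a RAID0 stripe is lost, given independent members."""
    if not 0.0 <= per_volume_failure <= 1.0:
        raise ValueError("per_volume_failure must be between 0 and 1")
    if num_volumes < 1:
        raise ValueError("num_volumes must be at least 1")
    # The array survives only if every member survives.
    return 1.0 - (1.0 - per_volume_failure) ** num_volumes


# Illustrative numbers only: 4 volumes at 1% each give about 3.9%.
four_volume_risk = raid0_failure_probability(0.01, 4)
```

Under these assumptions, each volume added to the stripe raises the chance of losing the whole array.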

• EBS apocalypse in April 2011

• Hit us even with masters and slaves in diff.
availability zones (but all in single region -
mistake!)

• IMPORTANT: rebuilding redundancy into your
system is HARD

• For DB servers, reloading data on new server is
a lengthy process

• General operation: very frequent failures
(once a week); nightmare for pager duty
• Got very good at disaster recovery!
• Failover of master to slave

• Rebuilding of slave from master (xtrabackup)

• Local disks striped in RAID0 better than
EBS

• Ended up moving DB servers back to data
center
• Bare metal (Dell C2100, 144 GB RAM,
RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning!
(Zynga does the same)

• Relational databases are not ready for the
cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): “We chose well-understood
technologies so we could better predict capacity needs
and rely on our existing monitoring and operational
tool kits.”

• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic Map Reduce,
S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it

EC2 at NastyGal
• VPC - really good idea!
• Extension of data center infrastructure
• Currently using it for dev/staging + some
internal backend production
• Challenging to set up VPN tunnels to
various firewall vendors (Cisco, Fortinet)
- not much debugging on VPC side

Interacting with AWS
• AWS API (mostly Java based, but also Ruby
and Python)
• Multi-cloud libraries: jclouds (Java), libcloud
(Python), deltacloud (Ruby)
• Chef knife
• Vagrant EC2 provider
• Roll your own

Proper infrastructure care
and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored
and graphed
• Monitoring is for ops what testing is for
dev
• Great way to learn a new infrastructure
• Dev and ops on pager

Proper infrastructure care and feeding (cont.)
• Going from #monitoringsucks to
#monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
• Sensu, Graphite, Boundary, Server
Density, New Relic, Papertrail, Pingdom,
Dead Man’s Snitch

Proper infrastructure care and feeding (cont.)
• Dashboards!

• Mission Control page with graphs based on
Graphite and Google Visualization API

• Correlate spikes and dips in graphs with errors
(external and internal monitoring)

• Akamai HTTP 500 alerts correlated with Web
server 500 errors and DB server I/O wait
increase

Proper infrastructure care and feeding (cont.)

• HTTP 500 errors as a percentage of all HTTP
requests across all app servers in the last 60
minutes
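
A sketch of how such a dashboard number could be computed, assuming each app server reports a hypothetical `(errors_500, total_requests)` pair for the 60-minute window:

```python
def error_percentage(per_server_counts):
    """HTTP 500s as a percentage of all requests across all servers.

    per_server_counts maps a server name to (errors_500, total_requests)
    for the time window of interest.
    """
    total_errors = sum(errors for errors, _ in per_server_counts.values())
    total_requests = sum(total for _, total in per_server_counts.values())
    if total_requests == 0:
        return 0.0
    # Sum first, then divide: averaging per-server percentages would give
    # lightly loaded servers too much weight.
    return 100.0 * total_errors / total_requests
```

For example, `error_percentage({"app1": (5, 1000), "app2": (15, 1000)})` gives 1.0.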

Proper infrastructure care and feeding (cont.)
• Expect failures and recover quickly

• Capacity planning
• Dark launching

• Measure baselines

• Correlate external symptoms (HTTP 500) with
metrics (CPU I/O Wait) then keep metrics
under certain thresholds by adding resources
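
One way to make the "correlate, then hold metrics under a threshold" step concrete is a plain Pearson correlation plus a threshold check. This is a sketch; the function names and the use of a simple average are assumptions, not the speaker's tooling:

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long series of samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def needs_more_capacity(iowait_samples, threshold):
    """True if the average of the recent I/O wait samples exceeds the threshold."""
    if not iowait_samples:
        raise ValueError("need at least one sample")
    return sum(iowait_samples) / len(iowait_samples) > threshold
```

If `pearson(http_500_rate, db_iowait)` is close to 1 over a baseline period, I/O wait becomes a reasonable metric to watch, and `needs_more_capacity` marks the point at which to add resources.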

Proper infrastructure care and feeding (cont.)
• Automate, automate, automate! - Chef, Puppet,
CFEngine, Jenkins, Capistrano, Fabric

• Chef - can be single source of truth for
infrastructure
• Running chef-client continuously on nodes
requires discipline

• Logging into remote node is anti-pattern (hard!)

Proper infrastructure care and feeding (cont.)
• Chef best practices

• Use knife - no snowflakes!

• Deploy new nodes, don’t do massive updates
in place

• BUT! beware of OS monoculture

• kernel bug after 200+ days

• leapocalypse

Is the cloud worth the
hype?
• It’s a game changer, but it’s not magical; try before
you buy! (benchmarks could surprise you)

• Cloud expert? Carry pager or STFU

• Forces you to think about failure recovery,
horizontal scalability, automation

• Something to be said about abstracting away the
physical network - the most obscure bugs are
network-related (ARP caching, routing tables)

So...when should I use
the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that
contain many identical nodes and that are
forgiving of node failures (web farms,
Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)

If you still want to use
the cloud
• Watch that monthly bill!

• Use multiple cloud vendors
• Design your infrastructure to scale horizontally
and to be portable across cloud vendors

• Shared nothing

• No SAN, NAS

If you still want to use
the cloud (cont.)
• Don’t get locked into vendor-proprietary
services
• EC2, S3, Route 53, EMR are OK

• Data stores are not OK (DynamoDB)

• OpsWorks - debatable (based on Chef, but still
locks you in)

• Wrap services in your own RESTful endpoints
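
The wrapping idea can be sketched in-process: application code depends on a small interface you own, and a vendor-specific backend sits behind it. All names here (`BlobStore`, `InMemoryBlobStore`, `archive_report`) are hypothetical, and the same interface could equally sit behind a REST endpoint:

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The only storage interface application code is allowed to use."""

    @abstractmethod
    def put(self, key, data):
        """Store bytes under a key."""

    @abstractmethod
    def get(self, key):
        """Return the bytes stored under a key."""


class InMemoryBlobStore(BlobStore):
    """Stand-in backend; a vendor-backed class would implement the same methods."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


def archive_report(store, name, body):
    """Application code: knows about BlobStore, not about any vendor."""
    key = "reports/" + name
    store.put(key, body.encode("utf-8"))
    return key
```

Switching vendors then means writing one new `BlobStore` subclass rather than touching every caller.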

Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or
smaller, with fewer features (no names named)
• Perception matters - not a contender unless
featured on High Scalability blog
• APIs matter less (can use multi-cloud libs)

Does EC2 have rivals? (cont.)
• OpenStack, CloudStack, Eucalyptus all seem
promising
• Good approach: private infrastructure (bare
metal, private cloud) for performance/
reliability + extension into public cloud for
elasticity/agility (EC2 VPC, Rack Connect)

• How about PaaS?
• Personally: too hard to relinquish control
