The AWS Cloud : Leveraging the State of the Art
1. The AWS Cloud
Leveraging the State of the Art
Sid Anand (@r39132)
SAP Cloud Inside Track 2012
Thursday, February 16, 2012
2. What is the AWS Cloud?
A Real World Scenario
3. A Real World Scenario
Question
If you were to build your own website today, what would you need?
Answer
You need a machine!
For simplicity, we will assume that your web server and application server code run
on the same box!
AWS offers EC2 instances (i.e. virtual machines) to host your code
- Various sizes (e.g. IOPS, # of spindles, CPUs, memory, network bandwidth)
- Various configurations (e.g. Virtual Private Cloud, High Performance Cluster)
- Various pricing schemes (e.g. on-demand, reserved, Spot, etc.)
4. A Real World Scenario
Question
Is one machine enough to handle traffic
from all of your users?
What if that machine were to fall over or
need maintenance (i.e. a restart)?
Answer
Add many machines!
5. A Real World Scenario
Question
This handles more traffic, but what if your
servers were to fall over or need maintenance?
Answer
AWS offers Auto Scaling Groups (a.k.a. ASGs)!
You can deploy your servers under the protection of
an ASG with a min and max pool size set.
The ASG replaces machines when they die, so your
"min" pool size is always maintained.
ASGs monitor the health of your machines by polling
an HTTP port on each machine.
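The replace-on-failure behavior described above can be sketched as a small in-memory simulation. This is only an illustration of the idea, not the AWS API: the class name `AutoScalingGroupSketch`, the `reconcile` method, and the instance IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical instance ID generator


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True  # stands in for the result of the HTTP health poll


@dataclass
class AutoScalingGroupSketch:
    min_size: int
    max_size: int
    instances: list = field(default_factory=list)

    def launch(self):
        # Never grow the pool beyond the configured maximum.
        if len(self.instances) >= self.max_size:
            raise RuntimeError("pool is already at max_size")
        inst = Instance(f"i-{next(_ids):04d}")
        self.instances.append(inst)
        return inst

    def reconcile(self):
        # Drop instances that failed their health check, then refill to min_size.
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.min_size:
            self.launch()


asg = AutoScalingGroupSketch(min_size=3, max_size=6)
asg.reconcile()                    # initial fill up to min_size
asg.instances[0].healthy = False   # simulate a machine that died
asg.reconcile()                    # the dead machine is replaced
healthy_count = len(asg.instances)
```

After the second `reconcile()` the pool is back at its minimum size, which is the guarantee the slide describes.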
6. A Real World Scenario
Question
How do you distribute traffic to all of your
machines evenly?
Answer
Deploy your favorite software load balancer!
And write some custom code to register/deregister
your machine instances with the load balancer
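As a rough sketch of what the "custom code" might look like, here is a toy round-robin balancer with register/deregister hooks. The class `RoundRobinBalancer` and the IP addresses are hypothetical; a real software load balancer would expose its own registration mechanism.

```python
class RoundRobinBalancer:
    """Toy balancer: hands out registered backends in rotation."""

    def __init__(self):
        self._backends = []
        self._next = 0

    def register(self, backend):
        if backend not in self._backends:
            self._backends.append(backend)

    def deregister(self, backend):
        if backend in self._backends:
            self._backends.remove(backend)
            self._next = 0  # restart the rotation after a membership change

    def pick(self):
        if not self._backends:
            raise RuntimeError("no backends registered")
        backend = self._backends[self._next % len(self._backends)]
        self._next = (self._next + 1) % len(self._backends)
        return backend


lb = RoundRobinBalancer()
for host in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    lb.register(host)
first_round = [lb.pick() for _ in range(6)]   # each host is picked twice

lb.deregister("10.0.0.2")                     # e.g. the machine was replaced
second_round = [lb.pick() for _ in range(4)]  # traffic goes to the remaining two
```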
7. A Real World Scenario
Question
What if the load balancer were to fall over or to need maintenance or
to become a traffic choke point?
Answer
Add multiple servers and deploy them under an ASG!
This is not ideal for a few reasons
- Need to register/deregister your Load Balancer instances with DNS
- Need to sync with the ASG's view of what is alive and dead, being
added or removed, etc.
8. A Real World Scenario
Answer
AWS offers Elastic Load Balancers (i.e. ELB)
- Conceptually similar to having many LBs in an ASG, with some
additional features:
- Provides a DNS hostname (e.g. mysite-11111111.us-east-1.elb.amazonaws.com)
- Maps all of the load balancer instances to this hostname
- Takes care of maintenance of the load balancer machines and
the requisite DNS registrations/deregistrations
- Syncs with the ASG -- if the ASG replaces one of your
instances, the ELB will also remove that instance
- Letʼs see how it works in action!
10. A Real World Scenario
Question
What about a DB to persist my data?
Answer
Multiple AWS hosted/managed options!
- DynamoDB (the new SimpleDB replacement) offers key-value
semantics
  - Netflix replaced Oracle with SimpleDB and ran on it in 2010-2011
  - 4.5 billion user-facing requests a day
- S3 offers key-value semantics for very large files (up to 5 TB).
Typically for Map-Reduce files, media files, or Oracle BLOBs/CLOBs
- RDS : hosted Oracle or MySQL if you need relations and complex
queries
11. A Real World Scenario
Question
What if I have high-volume writes, but don't
care when they are written (e.g. event
streams)?
Answer
Simple Queue Service
- Think Enterprise Message Bus
- Highly available, infinitely scalable
- Handles application/system monitoring
event traffic and social graph events at
Netflix
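The "write now, persist later" pattern can be illustrated with an in-process queue. This sketch uses Python's standard `queue` and `threading` modules as a stand-in for SQS; it does not call the SQS API, and the event shape (`play_started`, `seq`) is hypothetical.

```python
import queue
import threading

events = queue.Queue()   # stand-in for the message queue
stored = []              # stand-in for the slower backing store


def writer():
    # Consumer: drains the queue and persists events at its own pace.
    while True:
        event = events.get()
        if event is None:        # sentinel: no more events
            events.task_done()
            break
        stored.append(event)
        events.task_done()


worker = threading.Thread(target=writer)
worker.start()

# Producer: enqueueing is cheap, so high-volume writers are never blocked
# on the backing store.
for n in range(5):
    events.put({"type": "play_started", "seq": n})

events.put(None)
events.join()    # wait until every queued event has been processed
worker.join()
```

The producer only pays the cost of an enqueue; when the events actually land in the store is decided by the consumer.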
12. A Real World Scenario
Question
What if the whole Data Center goes
down? How do I keep my service
available?
Answer
Amazon Data Center = Availability Zone
13. A Real World Scenario
Answer
Always deploy your code in
multiple Availability Zones!
- Netflix deploys in 3 AZs in
Virginia
- Best Practice : Always deploy
enough capacity in each AZ to
handle losing one AZ during
peak
- Netflix follows this best
practice!
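The best practice above implies a simple sizing rule: if one of N zones can disappear, the remaining N - 1 zones must carry the whole peak. A worked sketch follows; the peak figure of 90 instances is a made-up number for illustration.

```python
import math


def instances_per_az(peak_instances, az_count):
    """Instances needed in each AZ so that peak load survives the loss of one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to tolerate losing one")
    return math.ceil(peak_instances / (az_count - 1))


# Hypothetical: peak traffic needs 90 instances, deployed across 3 AZs.
per_az = instances_per_az(peak_instances=90, az_count=3)
total = per_az * 3
```

With these numbers each AZ runs 45 instances and the fleet totals 135, i.e. 50% more than peak strictly requires. That headroom is the price of surviving a zone outage at peak.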
14. A Real World Scenario
Question
What if your Asian and European customers complain of slow response times?
Recall : Higher Response times, lower scalability
Answer
AWS has 8 global regions! Each region has between 3 and 4 AZs
- Netflix's launch in the UK and Ireland was out of the AWS EU-West Region
15. A Real World Scenario
16. A Real World Scenario
Other AWS Services:
- Elastic MapReduce : Map-Reduce as a Service for analytics. Supports Pig and Hive
- ElastiCache : A hosted cache service (think Memcached as a Service)
Whatʼs Missing (or coming soon)?:
- Discovery & Load Balancing for N-tier applications!
- In effect, weʼd like ELB for internal traffic
- Crypto as a Service
- Currently, none of the services are cross-region! Itʼs left to the user to transfer data or proxy requests between
regions
17. Who Uses AWS?
Netflix’s Cloud Architecture
18. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components
- Many (~100) applications, organized in clusters (a.k.a. ASGs)
- Clusters can be at different levels in the call stack
- Clusters can call each other
19. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Levels
- NES : Netflix Edge Services
- NMTS : Netflix Mid-tier Services
- NBES : Netflix Back-end Services
- IAAS : AWS IAAS Services
- Discovery : Helps services discover NMTS and NBES services
20. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NES)
Overview
- Any service that browsers and streaming devices connect to over the internet
- They sit behind AWS Elastic Load Balancers (a.k.a. ELB)
- They call clusters at lower levels
21. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NES)
Examples
API Servers
- Support the video browsing experience
- Also allow users to modify their Q (movie queue)
- Serve 1.4 billion calls/day
Streaming Control Servers
- Support streaming video playback
- Authenticate your Wii, PS3, etc.
- Download DRM to the Wii, PS3, etc.
- Return a list of CDN URLs to the Wii, PS3, etc.
22. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NMTS)
Overview
- Can call services at the same or lower levels
  - Other NMTS
  - NBES, IAAS
  - Not NES
- Exposed through our Discovery service
23. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NMTS)
Examples
Netflix Queue Servers
- Modify items in the users' movie queue
Viewing History Servers
- Record and track all streaming movie watching
SIMS Servers
- Compute and serve user-to-user and movie-to-movie similarities
24. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NBES)
Overview
- A back-end, usually 3rd party, open-source service
- Leaf in the call tree. Cannot call anything else
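The calling rules from the NES, NMTS and NBES overviews can be written down as a small lookup table. This is a sketch of the rules as stated on the slides, not Netflix code; the names `ALLOWED_CALLS` and `may_call` are hypothetical, and treating IAAS as a leaf is an assumption the slides do not state explicitly.

```python
# Which levels each level is allowed to call, per the overviews:
# NES calls lower levels; NMTS calls the same or lower levels, never NES;
# NBES is a leaf. IAAS is assumed to be a leaf as well.
ALLOWED_CALLS = {
    "NES": {"NMTS", "NBES", "IAAS"},
    "NMTS": {"NMTS", "NBES", "IAAS"},
    "NBES": set(),
    "IAAS": set(),
}


def may_call(caller, callee):
    """Return True if a cluster at level `caller` may call one at level `callee`."""
    return callee in ALLOWED_CALLS[caller]


checks = {
    "NES -> NMTS": may_call("NES", "NMTS"),
    "NMTS -> NMTS": may_call("NMTS", "NMTS"),
    "NMTS -> NES": may_call("NMTS", "NES"),
    "NBES -> NMTS": may_call("NBES", "NMTS"),
}
```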
25. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (NBES)
Examples
Cassandra Clusters
- Our new cloud database is Cassandra and stores all sorts of data to support application needs
Zookeeper Clusters
- Our distributed lock service and sequence generator
Memcached Clusters
- Typically caches things that we store in S3 but need to access quickly or often
26. Netflix’s Cloud Architecture
[Diagram: ELB, NES, NMTS, NBES and IAAS layers, top to bottom, with Discovery]
Components (IAAS)
Examples
AWS S3
- Large-sized data (e.g. video encodes, application logs, etc.) is stored here, not Cassandra
AWS SQS
- Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS)
27. Netflix’s Cloud Architecture
Architecture Pros
Horizontally scalable at every level
Should give us maximum availability
Architecture Cons
A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation
Latency can be a concern
EC2 instances in AWS can die at any time!
A lot of moving parts
28. Dealing with the Cons!
We have a little help
29. Simian Army
Prevention (& Early Detection) is the best
medicine
30. Simian Army
• Chaos Monkey
• Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scaling Group)
• Similar to how EC2 instances can be killed by AWS with little warning
• Tests Netflix's ability to gracefully deal with broken connections, interrupted calls, etc.
• Verifies that all services are running within the protection of AWS Auto Scaling Groups, which
reincarnate killed instances
• If not, the Chaos Monkey will win!
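A toy version of one Chaos Monkey round might look like the sketch below. It is not the Netflix tool: the function `chaos_round`, the fleet layout, and the group names `api` and `batch` are hypothetical, and the "ASG" is just a refill loop.

```python
import random


def chaos_round(fleet, rng, kills_per_group=1):
    """Kill random instances in every group; only ASG-protected groups recover."""
    report = {}
    for name, group in fleet.items():
        victims = rng.sample(
            group["instances"], min(kills_per_group, len(group["instances"]))
        )
        group["instances"] = [i for i in group["instances"] if i not in victims]
        if group["in_asg"]:
            # The ASG launches replacements until the minimum pool size is met.
            while len(group["instances"]) < group["min_size"]:
                group["launched"] += 1
                group["instances"].append(f"{name}-replacement-{group['launched']}")
        report[name] = len(group["instances"])
    return report


fleet = {
    "api": {"instances": ["api-1", "api-2", "api-3"],
            "min_size": 3, "in_asg": True, "launched": 0},
    "batch": {"instances": ["batch-1", "batch-2"],
              "min_size": 2, "in_asg": False, "launched": 0},
}
report = chaos_round(fleet, random.Random(42))
```

The `api` group, which runs inside an ASG, ends the round at full strength. The unprotected `batch` group is left one instance short, which is exactly the gap the monkey is meant to expose.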
31. Simian Army
• Latency Monkey
• Simulates soft failures -- i.e. a service gets slower
• Injects random delays in servers!
• Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder
problem of delays
• Delays cause Thundering Herds (outside of the scope of this talk!)
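The soft-failure case can be sketched as a simulation in which each call receives a random injected delay and the caller degrades gracefully when the delay exceeds its timeout. Nothing here actually sleeps; the function `call_with_latency`, the `similar_titles` service, and the 500 ms / 200 ms figures are all hypothetical.

```python
import random


def call_with_latency(service, rng, max_delay_ms, timeout_ms, fallback):
    """Simulate one call with an injected delay; fall back if it would time out."""
    injected_ms = rng.uniform(0, max_delay_ms)
    if injected_ms > timeout_ms:
        return fallback          # graceful degradation: serve a default instead
    return service()


def similar_titles():
    # Stand-in for a mid-tier service call.
    return ["title-a", "title-b"]


rng = random.Random(7)
results = [
    call_with_latency(similar_titles, rng, max_delay_ms=500, timeout_ms=200, fallback=[])
    for _ in range(1000)
]
degraded = sum(1 for r in results if r == [])
```

Counting `degraded` responses shows how often the caller had to fall back, which is the kind of signal a latency test is looking for.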
32. Simian Army
Does this solve all of our issues?
33. Simian Army
The infinite cloud is infinite when your needs are
moderate!
To ensure fairness among tenants, AWS meters or limits every resource
Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to
turn around and raise the limit -- a few hours!
34. Simian Army
• Limits Monkey
• Checks once an hour whether we are approaching one of our limits and triggers alerts for us to
proactively reach out to AWS!
• Conformity & Janitor Monkeys
• Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG,
unreferenced security groups, ELBs, ASGs, etc.) to increase head-room
• Buy us more time before we run out of resources and also save us $$$$
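Both checks reduce to simple comparisons, sketched below. The function names, the 80% alert threshold, and all resource counts are assumptions made for illustration; a real implementation would read usage and limits from AWS.

```python
def approaching_limits(usage, limits, threshold=0.8):
    """Limits check: resources whose usage is at or above `threshold` of the limit."""
    return sorted(r for r, used in usage.items() if used >= threshold * limits[r])


def orphaned_instances(all_instances, asg_members):
    """Janitor check: instances that do not belong to any ASG."""
    return sorted(set(all_instances) - set(asg_members))


# Hypothetical account state.
usage = {"ec2_instances": 850, "elbs": 4, "security_groups": 95}
limits = {"ec2_instances": 1000, "elbs": 20, "security_groups": 100}
alerts = approaching_limits(usage, limits)

orphans = orphaned_instances(
    all_instances=["i-01", "i-02", "i-03", "i-04"],
    asg_members=["i-01", "i-03"],
)
```

Here `alerts` flags the two resources that are within 20% of their limit, and `orphans` lists the instances a janitor pass could reclaim to free up head-room.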