This presentation covers our work, starting with Acme Air at web scale and transitioning to operational lessons learned in HA, automatic recovery, continuous delivery, and operational visibility. It shows the port of the NetflixOSS cloud platform to IBM's cloud (SoftLayer) and the use of RightScale.
Cloud Services Powered by IBM SoftLayer and NetflixOSS
1. Public Cloud Services using
IBM Cloud and Netflix OSS
Jan 2014
Andrew Spyker
@aspyker
2. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
2
3. About me …
• IBM STSM, Performance Architecture and Strategy
• Eleven years of performance work on WebSphere
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Works in the Emerging Technology Institute, CTO Office
– Now cloud service operations
• Email: aspyker@us.ibm.com
– Blog: http://ispyker.blogspot.com/
– Linkedin: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– Github: http://www.github.com/aspyker
• RTP dad that enjoys technology as well as running, wine and poker
3
4. Develop or maintain a service today?
• Develop – yes
• Maintain – starting
• So far
– Multiple services inside of IBM
– Other services for use in our PaaS environment
4
5. What qualifies me to talk?
• My monkey?
• Of the Cloud Prize's ~40 entrants
– Best example mash-up sample: nomination and win
– Best portability enhancement: nomination
– More on this coming …
• Other nominees - http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
• Other winners - http://techblog.netflix.com/2013/11/netflix-open-source-software-cloud.html
5
6. Seriously, how did I get here?
• Experience with performance and scale on
standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand tuned for fixed size
– Out of date on modern architecture for mobile/cloud
• Created Acme Air
– http://bit.ly/acmeairblog
• Demonstrated that we could achieve (web) scale runs
– 4B+ Mobile/Browser request/day
– With modern mobile and cloud best practices
6
7. What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via
nmon collection and post-run visualization
• True operational visibility – nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling – nope
7
8. What next?
• Went looking for the best industry practices for
devops and high availability at web scale
– Many have documented via research papers and on
highscalability.com – Google, Twitter, Facebook, Linkedin,
etc.
• Why Netflix?
– Documented not only on their tech blog, but also have
released working OSS on github
– Also, given dependence on Amazon, they are a clear
bellwether of web scale public cloud availability
8
9. Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS
runtime components
• Worked to implement a NetflixOSS devops and high
availability setup around Acme Air (on EC2) run at previous
levels of scale and performance on IBM middleware
• Worked to port NetflixOSS runtime and devops/high
availability servers to IBM Cloud (SoftLayer) and RightScale
• Through public collaboration with Netflix technical team
– Google groups, github and meetups
9
10. Why?
• To prove that advanced cloud high availability
and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud
platforms for our customers
• To understand how we can host our IBM
public cloud services better
10
11. Another Cloud Portability work of note
• This presentation focuses on portability across public clouds
• What about applicability to private cloud?
• PayPal (Project Aurora) ported the cloud management system to OpenStack and Heat
– https://github.com/paypal/aurora
• Additional work is required to port the runtime aspects as we did in public cloud
11
12. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
12
13. My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make best content deals based on real data/analysis
• Technology wise
– Have the most availability possible
– “Stream starts per unit of time” is the KPI measured for the entire business
– Deliver features to customers first in market
• Requiring high velocity of IT change
– Do all of this at web scale
• Culture wise
– Create a high performance delivery culture that attracts top talent
13
14. Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big
– Embraced cloud architecture paying only for what they need
• Open Source
– Many parts of runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.
• Requires top technical talent and OSS committers
– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would
eventually be open sourced - NetflixOSS
14
15. NetflixOSS on Github
• “Technical indigestion as a service” – Adrian Cockcroft
• netflix.github.io
– 40+ OSS projects
– Expanding every day
• Focusing more on interactive mid-tier server technology today …
15
16. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
16
17. High Availability Thoughts
• Three of every part of your architecture
– EVERYTHING in your architecture (including IaaS components)
– Likely more via clustering/partitioning
– One = SPOF
– Two = slow active/standby recovery
– Three = where you get zero downtime when failures occur
• All parts of application should fail independently
– No one part should take down entire application
– When linked, highest availability is limited to lowest availability component
– Apply circuit breaker pattern to isolate systems
• If a part of the system results in total end user failure
– Use partitioning to ensure only some smaller percentage of users are affected
17
18. Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions
– Software issues
• Operating system, servers, application code
– Surrounding services
• Other application services, DNS, user registries, etc.
• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
18
19. Overview of IaaS HA
• Launch instances into availability zones
– Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
– Availability zones are isolated from each other
– Availability zones are connected with low-latency links
– Regions contain availability zones
– Regions are independent of each other
– Regions have higher latency to each other
• This gives a high level of resilience to outages
– Unlikely to affect multiple availability zones or regions
• Cloud providers require the customer to be aware of this topology to take advantage of its benefits within their application
[Diagram: a Region (Dallas) and a Second Region, each containing multiple Datacenters/Availability Zones, connected via the Internet]
19
20. Acme Air As A Sample
[Diagram: ELB → Web App Front End (REST services) → App Service (Authentication) → Data Tier]
Greatly simplified …
20
21. Micro-services architecture
• Decompose system into isolated services that can be developed
separately
• Why?
– They can fail independently vs. fail together monolithically
– They can be developed and released at different velocities by different teams
• To show this we created separate “auth service” for Acme Air
• In a typical customer facing application any single front end
invocation could spawn 20-30 calls to services and data sources
21
22. How do services advertise themselves?
• Upon web app startup, Karyon server is started
– Karyon will configure (via Archaius) the application
– Karyon will register the location of the instance with Eureka
• Others can know of the existence of the service
• Lease based so instances continue to check in updating list of available instances
– Karyon will also expose a JMX console, healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance
[Diagram: the App Service (Authentication) app server, via Karyon, registers its name, port, IP address, and healthcheck URL with the Eureka server(s); configuration comes from config.properties / auth-service.properties or remote Archaius stores]
22
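For illustration, here is a minimal sketch of the registration Karyon performs at startup, written against the Eureka 1.x client directly; the service being registered and the comments about its configuration are assumptions modeled on the Acme Air auth service, not code from this deck.

import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

// Roughly what Karyon does for you on startup (hedged sketch, not Karyon's actual code)
public class AuthServiceRegistration {
    public static void main(String[] args) {
        // Reads instance settings (name, port, VIP address) and the Eureka server
        // URLs from the eureka/config properties files on the classpath
        DiscoveryManager.getInstance().initComponent(
                new MyDataCenterInstanceConfig(),
                new DefaultEurekaClientConfig());

        // ... start the app server, then mark the instance UP; a background
        // heartbeat renews the Eureka lease until shutdown
        ApplicationInfoManager.getInstance()
                .setInstanceStatus(InstanceInfo.InstanceStatus.UP);
    }
}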
23. How do consumers find services?
• Service consumers query eureka at startup and
periodically to determine location of dependencies
– Can query based on availability zone and cross
availability zone
[Diagram: the Web App Front End (REST services) app server, via its Eureka client, asks the Eureka server(s) “What auth-service instances exist?”]
23
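As a rough sketch (assuming the Eureka 1.x DiscoveryClient API), a consumer lookup of “auth-service” might look like the following; the VIP name and the URL construction are illustrative only.

import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.DiscoveryManager;

public class AuthServiceLookup {
    public static String findAuthServiceUrl() {
        // Round-robins over the instances Eureka currently knows about;
        // the client refreshes its local copy of the registry periodically
        InstanceInfo instance = DiscoveryManager.getInstance()
                .getDiscoveryClient()
                .getNextServerFromEureka("auth-service", false /* not secure */);
        return "http://" + instance.getHostName() + ":" + instance.getPort() + "/";
    }
}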
25. How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In-client load balancing -- does not require a separate LB tier
• Ribbon – REST client
– Pluggable load balancing scheme
– Built-in failure recovery support (retry next server, mark instance as failing, etc.)
• Other Eureka-enabled clients
– Custom code in non-Java systems, or Ribbon-enabled systems (Java or pure REST)
– More from Netflix
• Memcached (EVCache), Astyanax (Cassandra and Priam) coming
[Diagram: the Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, which balance across the App Service (Authentication) instances]
25
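A hedged sketch of such a call using the Ribbon RestClient API of that era; the client name and the Ribbon property keys shown are assumptions about how the Acme Air front end might be configured, not its actual code.

import java.net.URI;
import com.netflix.client.ClientFactory;
import com.netflix.client.http.HttpRequest;
import com.netflix.client.http.HttpResponse;
import com.netflix.config.ConfigurationManager;
import com.netflix.niws.client.http.RestClient;

public class AuthServiceCaller {
    public static void main(String[] args) throws Exception {
        // Tell Ribbon to source the server list for this named client from Eureka
        ConfigurationManager.getConfigInstance().setProperty(
                "auth-service.ribbon.NIWSServerListClassName",
                "com.netflix.niws.loadbalancer.DiscoveryEnabledNIWSServerList");
        ConfigurationManager.getConfigInstance().setProperty(
                "auth-service.ribbon.DeploymentContextBasedVipAddresses", "auth-service");

        RestClient client = (RestClient) ClientFactory.getNamedClient("auth-service");
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("/rest/api/login"))   // hypothetical path
                .build();

        // executeWithLoadBalancer picks an instance, retries on failure, and
        // marks bad instances according to the configured load balancer rules
        HttpResponse response = client.executeWithLoadBalancer(request);
        System.out.println("Status: " + response.getStatus());
    }
}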
26. PS. This is a common pattern
• Same idea, but different implementations
– Airbnb.com’s SmartStack
• Zookeeper/Synapse/Nerve/HAProxy
– Parse.com’s clustering
• Zookeeper/Nginx
26
27. How to deploy this with HA?
Instances?
• Asgard deploys across AZs
• Using auto scaling groups managed by Asgard
• More on Asgard later
Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find Eureka servers
– DNS TXT record for the domain lists AZ TXT records
– AZ TXT records have the list of Eureka servers
• For new Eureka servers
– Look for the list of Eureka server IPs for the AZ it's coming up in
– Look for unassigned elastic IPs, grab one and assign it to itself
– Sync with other already-assigned IPs that are likely hosting Eureka server instances
• Simpler configurations with less HA are available
27
28. Protect yourself from unhealthy services
• Wrap all calls to services with Hystrix command pattern
– Hystrix implements circuit breaker pattern
– Executes command using semaphore or separate thread pool to
guarantee return within finite time to caller
– If an unhealthy service is detected, start to call the fallback implementation
(broken circuit) and periodically check if the main implementation works
(reset circuit)
• Hystrix also provides caching, request collapsing with synchronous
and asynchronous (reactive via RxJava) invocation
[Diagram: the Web App Front End (REST services) calls “auth-service” through Hystrix, which executes the Ribbon REST client call against the App Service (Authentication) instances, or falls back to the fallback implementation]
28
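A minimal HystrixCommand sketch for the auth-service call; the command name, return type, and the placeholder Ribbon call are illustrative, not the Acme Air implementation.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class ValidateTokenCommand extends HystrixCommand<Boolean> {
    private final String token;

    public ValidateTokenCommand(String token) {
        // Commands in the same group share a thread pool by default
        super(HystrixCommandGroupKey.Factory.asKey("auth-service"));
        this.token = token;
    }

    @Override
    protected Boolean run() throws Exception {
        // Normal path: call auth-service via the Ribbon REST client (omitted here)
        return callAuthServiceViaRibbon(token);
    }

    @Override
    protected Boolean getFallback() {
        // Broken-circuit / timeout / error path: degrade gracefully, e.g. treat
        // the session as invalid rather than failing the whole request
        return Boolean.FALSE;
    }

    private Boolean callAuthServiceViaRibbon(String token) throws Exception {
        throw new UnsupportedOperationException("placeholder for the Ribbon call");
    }
}

// Usage: new ValidateTokenCommand(token).execute()  (synchronous)
//        new ValidateTokenCommand(token).queue()    (asynchronous Future)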
29. Denominator
• Most (simple) geographic (region) based disaster
recovery depends on front end DNS traffic switching
• Java Library and CLI for cross DNS configuration
• Allows for common, quicker (than using the various DNS
providers' UIs), and automated DNS updates
• Plugins have been developed by various DNS providers
29
30. Augmenting the ELB tier - Zuul
• Originally developed to do cross-region routing for regional HA
– Advanced geographic (region) based disaster recovery
• Zuul also adds devops support in the front-tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load shedding
– Debugging
• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for the ELB tier)
– Insight
• Through dynamically deployable filters (written in Groovy)
• Eureka aware, using Ribbon and Archaius as shown in the runtime section
[Diagram: in each of Region 1 and Region 2, load balancers route to a Zuul edge service that runs a chain of filters in front of the services]
30
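For a flavor of the filter model, here is a hypothetical pre-routing filter against the zuul-core ZuulFilter base class (Netflix typically deploys these as Groovy scripts; the header name and the behavior below are made up for illustration).

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

public class DebugHeaderFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";           // run before routing to the origin
    }

    @Override
    public int filterOrder() {
        return 10;              // relative order among "pre" filters
    }

    @Override
    public boolean shouldFilter() {
        // Only act when the (hypothetical) debug header is present
        return RequestContext.getCurrentContext()
                .getRequest().getHeader("X-Acme-Debug") != null;
    }

    @Override
    public Object run() {
        // Flag the request for extra logging/insight downstream
        RequestContext.getCurrentContext().setDebugRequest(true);
        return null;
    }
}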
31. HA in application architecture
• Stateless application design
– Legacy application design has state
– Temporal state should be pushed to caching servers
– Durable state should be pushed to partitioned data servers
– Trades off peak latency for uptime (sometimes no trade-off)
• Partitioned data servers
– Wealth of NoSQL servers available today
– Be careful of oversold “consistency” promises
• Look for third party “Jepsen-like” testing
– Be ready to deal with compensation-based approaches
– Consider differences in system of record vs. interaction data
stores
31
32. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
32
33. Automatic Recovery Thoughts
• Automatic recovery depends on elastic, ephemeral
instance cluster design powered by “auto scaling”
• If something fails once, it will fail again
• No repeated failure should be a pager call
– Instead should be email with automated recovery
information to be analyzed offline
• Test failure on your system before the system tests
your failure
33
34. Auto Scaling (for the masses)
• For many, auto scaling is more auto recovery
– Far more important to keep N instances running
than be able to scale automatically to 2N, 10N,
100N
• For many, automatic scaling isn’t appropriate
– First understand how the system can be elastically
scaled with operator expertise manually
34
35. Asgard
• Asgard is the console for automatic scaling and recovery
[Diagram: Asgard tells the IaaS to start these instances and keep this many instances running – Web App (REST services) and App Service (Authentication) instances spread across three Datacenters/Availability Zones in the Region (Dallas)]
35
36. Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configurations,
load balancers, security groups, scaling policies and images
• Adds missing concept to the IaaS domain model – “application”
– Apps clustering and application lifecycle vs. individually launched and
managed images
• Example
– Application – app1
– Cluster – app1-env
– Asgard group version n – app1-env-v009
– Asgard group version n+1 – app1-env-v010
36
37. When to test recovery (and HA)?
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure backup works before you have a crashed
system and find out your backup is broken
• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and
angry users
– Best to learn from failures and add automated tests
37
38. The Simian Army
• A bunch of automated “monkeys” that
perform automated system administration
tasks
• Anything that is done by a human more than
once can and should be automated
• Absolutely necessary at web scale
38
39. Bad Monkeys
• Open Sourced – Chaos Monkey
– Used to randomly terminate instances
– Now block network, burn cpu, kill
processes, fail amazon api, fail dns,
fail dynamo, fail s3, introduce network
errors/latency, detach volumes, fill disk,
burn I/O
• Not yet open sourced
– Chaos Gorilla
• Kill datacenter/availability zone instances
– Chaos Kong
• Kill all instances in an entire region
– Latency Monkey
• Introduce latency into service calls directly (ribbon server side)
– Split Brain Monkey
• Datacenters/availability zones continue to operate, but isolated from each other
39
40. Elastic Scale
• Basic elastic scale required to achieve high availability
– To run three or more of any component
• Front tier specific considerations
– Will likely need to scale far higher than micro-services
– Use distributed caching with TTL where appropriate
– Otherwise micro-service architecture could overload data
servers
• Scaling larger (or Web Scale) will find bottlenecks that
require changes to architecture and/or tuning
– Iterative process of improvement
40
41. Elastic scaling in application
architecture
• Clusters that replicate data within the cluster must
discover new peers (and timeout dead ones)
• Clusters that connect to other clusters must discover
new dependency instances (and timeout dead ones)
• Many legacy architectures contain static cluster
definitions that require “re-starts” to update
information
– Code changes required to leverage dynamic connectivity
41
42. Full Auto Scaling
• Eventually web scale will require auto scaling
based on policy
– Attach policy based on request latency, utilization,
queue depth, etc.
• Words of caution, be careful to
– Design policies to be proactive on scale up or risk
scaling that isn’t fast enough to keep up with demand
– Design policies to be generous on scale down or risk
over-scaling down and immediate need for scale up
42
43. Scaling Continues to Evolve
• Reactive auto scaling is “easy” but naïve
– Instances fail
– Unexpected spike in demand
• What if your traffic is “predictable”, consider
– User population follows a daily pattern
– User population known to follow different patterns each day (work
days vs. weekends)
– End of month influx of work
• Scryer is Netflix’s predictive analytics to not wait for reactive scaling
– Better end user experience, less over deployment (cheaper), more
consistent utilization (cheaper)
– Not yet open sourced
43
44. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• How to grade public cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
44
45. Thoughts on
Continuous Delivery
• Legacy waterfall habits are hard to break
– “Leaks” of old world continue to show
– Especially if product has to be released in “shrink
wrapped” form in parallel
• Netflix approach and technology assists breaking
these habits
– Provide the tools and proof points and the
organization will follow
45
46. Continuous Delivery Pipeline
• Developers
– Perform local testing before checking code into continuous build
• Continuous build
– Builds code, tests code and flags any breaks for immediate attention
– Builds packages ready for image installation
• Image bakery
– Builds images for deployment that then show up in Asgard
• Continuous deployment
– Images deployed through Asgard
– Instances are given image and environmental context from Asgard
• Same images should be used in production that are used in test
• Due to micro-services (API as contract) approach
– No need to co-ordinate typical deployments across teams
46
47. Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Ad hoc processes allowed, enforced through the Asgard model
• More coming using Glisten and workflow services
47
49. Ability to reconfigure - Archaius
• Using dynamic properties, you can easily change properties across a cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts, for example
– Custom dynamic props
• High throughput achieved by a polling approach
• HA of the configuration source is dependent on what source you use
– HTTP server, database, etc.
[Diagram: Archaius application property hierarchy – runtime, URL, JMX (Karyon console), persisted DB, application props, libraries, container]
DynamicIntProperty prop =
DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
49
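Building on the snippet above, a small hedged example (assuming the Archaius 1.x callback API) of reacting when an operator changes the property at runtime, e.g. through the Karyon JMX console or a remote property store; the property name and default are hypothetical.

import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class TimeoutConfig {
    private static final DynamicIntProperty TIMEOUT_MS =
            DynamicPropertyFactory.getInstance()
                    .getIntProperty("acmeair.auth.timeoutMs", 1000); // hypothetical name/default

    public static void init() {
        // Runs whenever the polled configuration source reports a new value
        TIMEOUT_MS.addCallback(new Runnable() {
            public void run() {
                System.out.println("auth timeout is now " + TIMEOUT_MS.get() + "ms");
            }
        });
    }
}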
50. Get baked!
• Caution: Flame/troll bait ahead!!
– Criticism – “Netflix is ruining the cloud”
• Overhead of images for every code version
• Ties to Amazon AMI’s (have proven this tie can be broken)
• Netflix takes the approach of baking images as part of build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)
• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system
configuration doesn’t provide value
– Speed of elastic scaling, boot and go
– Discourages ad hoc changes to server instances
50
51. AMInator
• Starting image/volume
– Foundational image created (maybe via
loopback), base AMI with common
software created/tested independently
• Aminator running – Bakery
– Bakery obtains a known EBS volume of
the base image from a pool
– Bakery mounts volume and provisions
the application (apt/deb or yum/rpm)
– Bakery snapshots and registers snapshot
• Recent work to add other provisioning
such as chef as plugins
51
52. Imaginator
• Implementation of Aminator
– For IBM SoftLayer cloud
• Creates image templates
– Starts from base OS and adds deb/rpm’s
• Snapshots images for later deployment
• Not yet open sourced
52
53. Good Monkeys
• Janitor Monkey
– Somewhat a mitigation for baking approach
– Will mark and sweep unused resources
(instances, volumes, snapshots, ASG’s,
launch configs, images, etc.)
– Owners notified, then removed
• Conformity Monkey
– Check instances are conforming to rules
around security, ASG/ELB, age, status/health
check, etc.
53
54. Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
54
55. Thoughts on Operational Visibility
• Programming model to expose metrics should be
simple
• Systems need to expose internals in a way that is
sensible to the owners and operators
• The tools that view the internals need to match the
level of abstraction developers care about
• The tools must give sufficient context when viewing
any single metric or alert
55
56. Monitoring - Servo
• Annotation based publishing through JMX of
application metrics
• Gauges, counters, and timers
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to metric collection servers
• Netflix exposes their metrics to Atlas
– The entire Netflix monitoring infrastructure hasn’t been
open sourced due to complexity and priority
56
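A minimal sketch of Servo's annotation-based metrics (assuming the Servo 0.x API); the metric names and the instrumented class are made up for illustration.

import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;
import com.netflix.servo.monitor.Monitors;
import java.util.concurrent.atomic.AtomicInteger;

public class BookingMetrics {
    // Exposed as a counter (rate of bookings created)
    @Monitor(name = "bookingsCreated", type = DataSourceType.COUNTER)
    private final AtomicInteger bookingsCreated = new AtomicInteger(0);

    // Exposed as a gauge (current in-flight requests)
    @Monitor(name = "inFlightRequests", type = DataSourceType.GAUGE)
    private final AtomicInteger inFlightRequests = new AtomicInteger(0);

    public BookingMetrics() {
        // Publishes the annotated fields via JMX; pollers/observers can then
        // forward them to a metrics collection backend such as Atlas
        Monitors.registerObject("bookingMetrics", this);
    }

    public void onBookingCreated() {
        bookingsCreated.incrementAndGet();
    }
}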
57. Back to Hystrix
• The main reason for Hystrix is to protect yourself from dependencies, but …
• Same layer of indirection to
services can provide
visualization
• You can aggregate the view
across clusters via Turbine
• Other alert system and
dashboards can read from
Turbine
57
58. Edda
• IaaS does not typically provide
– Historical views of the state of the system
– All views between components an operator might want to see
• Edda polls the current state and stores the data in a queryable
database
• Provides an ad hoc queryable view of all deployment aspects
• Provides a historical view
– For correlation of problems to changes
– Becoming a more commonplace feature in clouds
58
59. Ice
• Cloud spend and usage analytics
• Communicates with the billing API to give a bird's-eye view of cloud spend, with drill-down to region, availability zone, and service team through application groups
• Watches differently priced instances and
instance sizes to help optimize
• Not point in time
– Shows trends to help predict future
optimizations
59
60. Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?
• Let’s write some code!
60
61. Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix
• NetflixOSS as ported to IBM Cloud
– https://github.com/EmergingTechnologyInstitute
– SoftLayer Image Templates coming soon
• Acme Air, NetflixOSS AMI's
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs
• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog
• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix
Thanks! Questions?
61