Public Cloud Services using
IBM Cloud and Netflix OSS
Jan 2014

Andrew Spyker
@aspyker
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
2
About me …
• IBM STSM of Performance Architect and Strategy

• Eleven years in performance in WebSphere
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Works in Emerging Technology Institute, CTO Office
– Now cloud service operations

• Email: aspyker@us.ibm.com
– Blog: http://ispyker.blogspot.com/
– Linkedin: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– Github: http://www.github.com/aspyker

• RTP dad that enjoys technology as well as running, wine and poker
3
Develop or maintain a service today?
• Develop – yes
• Maintain – starting

• So far
– Multiple services inside of IBM
– Other services for use in our PaaS environment

Image: http://www.flickr.com/photos/stevendepolo/
4
What qualifies me to talk?
• My monkey?
• Out of ~40 Cloud Prize entrants
– Best example mash-up sample
• Nomination and win
– Best portability enhancement
• Nomination
– More on this coming …

• Other nominees - http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
• Other winners - http://techblog.netflix.com/2013/11/netflix-open-source-software-cloud.html

5
Seriously, how did I get here?
• Experience with performance and scale on
standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand tuned for fixed size

– Out of date on modern architecture for mobile/cloud

• Created Acme Air
– http://bit.ly/acmeairblog

• Demonstrated that we could achieve (web) scale runs
– 4B+ mobile/browser requests per day
– With modern mobile and cloud best practices

6
What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via nmon collection and post-run visualization
• True operational visibility - nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling - nope
7
What next?
• Went looking for existing industry best practices around
devops and high availability at web scale
– Many have documented via research papers and on
highscalability.com – Google, Twitter, Facebook, Linkedin,
etc.

• Why Netflix?
– Documented not only on their tech blog, but also have
released working OSS on github
– Also, given dependence on Amazon, they are a clear
bellwether of web scale public cloud availability
8
Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS
runtime components
• Worked to implement a NetflixOSS devops and high
availability setup around Acme Air (on EC2) run at previous
levels of scale and performance on IBM middleware
• Worked to port NetflixOSS runtime and devops/high
availability servers to IBM Cloud (SoftLayer) and RightScale
• Through public collaboration with Netflix technical team
– Google groups, github and meetups
9
Why?
• To prove that advanced cloud high availability
and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud
platforms for our customers
• To understand how we can host our IBM
public cloud services better
10
Other cloud portability work of note
• In this presentation, focused on portability across public clouds
• What about applicability to private cloud?
• Project Aurora: Paypal worked to port the cloud management system to OpenStack and Heat
– https://github.com/paypal/aurora
• Additional work required to port runtime aspects as we did in public cloud
11
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
12
My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make best content deals based on real data/analysis

• Technology wise
– Have the most availability possible
– “Stream starts per unit of time” is KPI measured for entire business
– Deliver features to customers first in market
• Requiring high velocity of IT change

– Do all of this at web scale

• Culture wise
– Create a high performance delivery culture that attracts top talent

13
Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big

– Embraced cloud architecture paying only for what they need

• Open Source
– Many parts of runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.
• Requires top technical talent and OSS committers

– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would eventually be open sourced - NetflixOSS

Image: http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg

14
NetflixOSS on Github
• “Technical indigestion as a service” – Adrian Cockcroft

• netflix.github.io
– 40+ OSS projects
– Expanding every day

• Focusing more on interactive midtier server technology today …
15
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
16
High Availability Thoughts
• Three of every part of your architecture
– EVERYTHING in your architecture (including IaaS components)
– Likely more via clustering/partitioning
– One = SPOF
– Two = slow active/standby recovery
– Three = where you get zero downtime when failures occur

• All parts of application should fail independently
– No one part should take down entire application
– When linked, highest availability is limited to lowest availability component
– Apply circuit breaker pattern to isolate systems

• If a part of the system results in total end user failure
– Use partitioning to ensure only some smaller percentage of users are affected
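
The points about linked components and running three of everything can be checked with a little arithmetic. The sketch below is only an illustration: it assumes component failures are independent, and the availability figures are made up for the example.

```java
/** Back-of-the-envelope availability arithmetic, assuming independent failures. */
final class Availability {
    /** Availability of a chain of components that must all be working. */
    static double inSeries(double... parts) {
        double result = 1.0;
        for (double a : parts) {
            result *= a;
        }
        return result;
    }

    /** Availability of redundant copies where any one surviving copy is enough. */
    static double redundant(double single, int copies) {
        return 1.0 - Math.pow(1.0 - single, copies);
    }

    public static void main(String[] args) {
        // Hypothetical figures: two parts at 99.9% and one part at 99%.
        System.out.printf("linked chain:       %.6f%n", inSeries(0.999, 0.999, 0.99));
        System.out.printf("one copy of 0.99:   %.6f%n", redundant(0.99, 1));
        System.out.printf("three copies:       %.6f%n", redundant(0.99, 3));
    }
}
```

The linked chain can never be better than its weakest part, while three copies of the weakest part lift that part well above the rest of the chain.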

17
Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions

– Software issues
• Operating system, servers, application code

– Surrounding services
• Other application services, DNS, user registries, etc.

• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
18
Overview of IaaS HA
• Launch instances into availability zones
– Instances of various sizes (compute, storage, etc.)

• Organized into regions and availability zones
– Availability zones are isolated from each other
– Availability zones are connected with low-latency links
– Regions contain availability zones
– Regions independent of each other
– Regions have higher latency to each other

• This gives a high level of resilience to outages
– Unlikely to affect multiple availability zones or regions

• Cloud providers require customer be aware of this topology to take advantage of its benefits within their application

[Diagram: Internet connected to a Region (Dallas) and a Second Region, each containing three Datacenter/Availability Zones]

19
Acme Air As A Sample

[Diagram: ELB, Web App Front End (REST services), App Service (Authentication), Data Tier]

Greatly simplified …

20
Micro-services architecture
• Decompose system into isolated services that can be developed
separately
• Why?
– They can fail independently vs. fail together monolithically
– They can be developed and released with different velocities by
different teams

• To show this we created separate “auth service” for Acme Air
• In a typical customer facing application any single front end
invocation could spawn 20-30 calls to services and data sources

21
How do services advertise themselves?
• Upon web app startup, Karyon server is started
– Karyon will configure (via Archaius) the application
– Karyon will register the location of the instance with Eureka
• Others can know of the existence of the service
• Lease based so instances continue to check in updating list of available instances

– Karyon will also expose a JMX console, healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance

[Diagram: App Service (Authentication) running in an App Server; Karyon sends name, port, IP address and healthcheck url to the Eureka server(s); configuration comes from config.properties, auth-service.properties or remote Archaius stores]
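
The lease-based registration described on this slide can be sketched in a few lines. This is a hypothetical in-memory registry written for illustration, not the Eureka API: the class name, method names and lease length are assumptions, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lease-based registry; an illustration only, not Eureka itself. */
final class LeaseRegistry {
    private final long leaseMillis;
    // service name -> (instance address -> time of last heartbeat)
    private final Map<String, Map<String, Long>> leases = new HashMap<>();

    LeaseRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Registers an instance or renews its lease; both are the same check-in. */
    void heartbeat(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, k -> new HashMap<>()).put(address, nowMillis);
    }

    /** Returns the instances whose lease has not expired, sorted for stable output. */
    List<String> instancesOf(String service, long nowMillis) {
        Map<String, Long> known = leases.getOrDefault(service, Collections.emptyMap());
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> entry : known.entrySet()) {
            if (nowMillis - entry.getValue() <= leaseMillis) {
                live.add(entry.getKey());
            }
        }
        Collections.sort(live);
        return live;
    }

    public static void main(String[] args) {
        LeaseRegistry registry = new LeaseRegistry(30_000);
        registry.heartbeat("auth-service", "10.0.0.1:8080", 0);
        registry.heartbeat("auth-service", "10.0.0.2:8080", 0);
        registry.heartbeat("auth-service", "10.0.0.1:8080", 25_000); // only the first instance renews
        System.out.println(registry.instancesOf("auth-service", 40_000));
    }
}
```

An instance that stops checking in simply drops out of the list once its lease runs out, so nobody has to deregister a crashed server by hand.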
22
How do consumers find services?
• Service consumers query eureka at startup and
periodically to determine location of dependencies
– Can query based on availability zone and cross
availability zone
[Diagram: Web App Front End (REST services) in an App Server uses the Eureka client to ask the Eureka server(s): What “auth-service” instances exist?]

23
Demo

24
How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In-client load balancing -- does not require separate LB tier

• Ribbon – REST client
– Pluggable load balancing scheme
– Built in failure recovery support (retry next server, mark instance as failing, etc.)

• Other eureka enabled clients
– Custom code in non-Java or Ribbon enabled systems (Java or pure REST)
– More from Netflix
• Memcached (EVCache), Astyanax (Cassandra and Priam) coming

[Diagram: Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, reaching several App Service (Authentication) instances]
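
In-client load balancing with "retry next server" and "mark instance as failing" can be sketched as below. This is a hypothetical round-robin client for illustration, not Ribbon's API. The server names are made up, and a real balancer would also re-check marked instances later, which this sketch leaves out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical client-side load balancer; an illustration only, not Ribbon. */
final class RoundRobinClient {
    private final List<String> servers;
    private final Set<String> markedDown = new HashSet<>();
    private int next = 0;

    RoundRobinClient(List<String> servers) {
        this.servers = servers;
    }

    /** Tries each server at most once, starting from the round-robin position. */
    String call(Function<String, String> request) {
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String server = servers.get(next);
            next = (next + 1) % servers.size();
            if (markedDown.contains(server)) {
                continue; // skip instances that already failed
            }
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                markedDown.add(server); // mark instance as failing, then retry next server
            }
        }
        throw new IllegalStateException("no healthy server available");
    }

    public static void main(String[] args) {
        RoundRobinClient client = new RoundRobinClient(Arrays.asList("auth-1", "auth-2", "auth-3"));
        Function<String, String> request = server -> {
            if (server.equals("auth-1")) {
                throw new RuntimeException("connection refused");
            }
            return "token from " + server;
        };
        System.out.println(client.call(request));
        System.out.println(client.call(request));
    }
}
```

The caller never sees the failure of the first instance; the retry happens inside the client, with no separate load balancer tier involved.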
25
PS. This is a common pattern
• Same idea, but different implementations
– Airbnb.com’s SmartStack
• Zookeeper/Synapse/Nerve/HAProxy

– Parse.com’s clustering
• Zookeeper/Nginx

26
How to deploy this with HA?
Instances?
• Asgard deploys across AZs
• Using auto scaling groups managed by Asgard
• More on Asgard later

Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find eureka servers
– DNS TXT record for domain lists AZ TXT records
– AZ TXT records have list of Eureka servers
• For new eureka servers
– Look for list of eureka servers IP’s for the AZ it’s coming up in
– Look for unassigned elastic IP’s, grab one and assign it to itself
– Sync with other already assigned IP’s that likely are hosting Eureka server instances
• Simpler configurations with less HA are available
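
The two-level TXT lookup that clients use to find Eureka servers can be sketched as follows. A plain map stands in for DNS so the example runs anywhere. The record names and addresses are hypothetical, and preferring servers in the caller's own zone is an assumption of this sketch rather than something stated on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Resolves Eureka servers through two levels of TXT records (a map stands in for DNS). */
final class EurekaTxtLookup {
    private final Map<String, List<String>> txtRecords;

    EurekaTxtLookup(Map<String, List<String>> txtRecords) {
        this.txtRecords = txtRecords;
    }

    private List<String> txt(String name) {
        return txtRecords.getOrDefault(name, Collections.emptyList());
    }

    /** Follows the domain record to the per-AZ records; own-zone servers come first. */
    List<String> serversFor(String domainRecord, String myZone) {
        List<String> own = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String zoneRecord : txt(domainRecord)) {
            List<String> target = zoneRecord.startsWith(myZone + ".") ? own : others;
            target.addAll(txt(zoneRecord));
        }
        own.addAll(others); // other zones remain available as fallback
        return own;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dns = new HashMap<>();
        // Hypothetical record names and addresses.
        dns.put("txt.eureka.example.com",
                Arrays.asList("zone-a.eureka.example.com", "zone-b.eureka.example.com"));
        dns.put("zone-a.eureka.example.com", Arrays.asList("10.0.1.10"));
        dns.put("zone-b.eureka.example.com", Arrays.asList("10.0.2.10", "10.0.2.11"));
        System.out.println(new EurekaTxtLookup(dns).serversFor("txt.eureka.example.com", "zone-b"));
    }
}
```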
27
Protect yourself from unhealthy services
• Wrap all calls to services with Hystrix command pattern
– Hystrix implements circuit breaker pattern
– Executes command using semaphore or separate thread pool to
guarantee return within finite time to caller
– If an unhealthy service is detected, start to call fallback implementation
(broken circuit) and periodically check if main implementation works
(reset circuit)

• Hystrix also provides caching, request collapsing with synchronous
and asynchronous (reactive via RxJava) invocation

[Diagram: Web App Front End (REST services) executes the auth-service call through Hystrix, which calls “auth-service” via the Ribbon REST client to the App Service (Authentication) instances, or runs the fallback implementation]
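
The broken circuit / reset circuit behaviour can be sketched with a small state machine. This is a hypothetical, single-threaded circuit breaker for illustration, not Hystrix: the names and thresholds are assumptions, and it omits the semaphore or thread pool isolation that bounds how long a call may take.

```java
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker; an illustration only, not Hystrix. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private boolean open = false;

    CircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    /** Runs the primary call, or the fallback while the circuit is broken. */
    <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (open && nowMillis - openedAt < retryAfterMillis) {
            return fallback.get(); // broken circuit: leave the unhealthy service alone
        }
        try {
            T result = primary.get(); // normal call, or the periodic trial call
            consecutiveFailures = 0;
            open = false; // reset circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(2, 1_000);
        Supplier<String> failing = () -> { throw new RuntimeException("auth-service down"); };
        Supplier<String> fallback = () -> "cached credentials";
        for (long now = 0; now < 3; now++) {
            System.out.println(breaker.execute(failing, fallback, now) + ", open=" + breaker.isOpen());
        }
    }
}
```

Once the circuit is open, callers get the fallback immediately and the struggling service gets time to recover before the next trial call.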
28
Denominator
• Most (simple) geographic (region) based disaster
recovery depends on front end DNS traffic switching
• Java Library and CLI for cross DNS configuration
• Allows for common, quicker (than using various DNS
provider UI) and automated DNS updates
• Plugins have been developed by various DNS providers

29
Augmenting the ELB tier - Zuul
• Originally developed to do cross region routing for regional HA
– Advanced geographic (region) based disaster recovery

• Zuul also adds devops support in the front tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load Shedding
– Debugging

• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for ELB tier)
– Insight

• Through dynamically deployable filters (written in Groovy)
• Eureka aware using ribbon, and archaius like shown in runtime section

[Diagram: Region 1 and Region 2, each with Load Balancers in front of Zuul instances (with Filters) that route to the Edge Service]
30
HA in application architecture
• Stateless application design
– Legacy application design has state
– Temporal state should be pushed to caching servers
– Durable state should be pushed to partitioned data servers
– Trades off peak latency for uptime (sometimes no trade off)

• Partitioned data servers
– Wealth of NoSQL servers available today
– Be careful of oversold “consistency” promises
• Look for third party “Jepsen-like” testing

– Be ready to deal with compensated approaches
– Consider differences in system of record vs. interaction data
stores
31
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
32
Automatic Recovery Thoughts
• Automatic recovery depends on elastic, ephemeral
instance cluster design powered by “auto scaling”
• If something fails once, it will fail again
• No repeated failure should be a pager call
– Instead should be email with automated recovery
information to be analyzed offline

• Test failure on your system before the system tests
your failure
33
Auto Scaling (for the masses)
• For many, auto scaling is more auto recovery
– Far more important to keep N instances running
than be able to scale automatically to 2N, 10N,
100N

• For many, automatic scaling isn’t appropriate
– First understand how the system can be elastically
scaled with operator expertise manually
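
Auto scaling used as auto recovery boils down to a reconciliation pass: compare what is healthy with what is desired and close the gap. The sketch below is a hypothetical illustration of that loop, not an IaaS API. Instance ids and action strings are made up, and it deliberately never scales down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical "keep N instances running" pass; an illustration only. */
final class KeepNRunning {
    /** Replaces unhealthy instances and tops the group up to the desired count. */
    static List<String> reconcile(int desired, Map<String, Boolean> healthByInstance) {
        List<String> actions = new ArrayList<>();
        int healthy = 0;
        // TreeMap gives a stable order so the resulting actions are deterministic.
        for (Map.Entry<String, Boolean> entry : new TreeMap<>(healthByInstance).entrySet()) {
            if (entry.getValue()) {
                healthy++;
            } else {
                actions.add("terminate " + entry.getKey());
            }
        }
        for (int i = healthy; i < desired; i++) {
            actions.add("launch replacement");
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Boolean> group = new TreeMap<>();
        group.put("i-001", true);
        group.put("i-002", false); // failed its health check
        group.put("i-003", true);
        System.out.println(reconcile(3, group));
    }
}
```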

34
ASGard
[Diagram: Asgard tells the IaaS to start these instances and keep this many instances running; a Region (Dallas) with three Datacenter/Availability Zones, each running Web App (REST Services) and App Service (Authentication) instances]

• Asgard is the console for automatic scaling and recovery
35
Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configurations,
load balancers, security groups, scaling policies and images

• Adds missing concept to the IaaS domain model – “application”
– Apps clustering and application lifecycle vs. individually launched and
managed images

• Example
– Application – app1
– Cluster – app1-env
– Asgard group version n – app1-env-v009
– Asgard group version n+1 – app1-env-v010
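
The naming scheme in the example above lends itself to a small helper that derives the next group name. This is a sketch of the convention as shown on the slide, not Asgard's code. Starting an unversioned cluster at v000 is an assumption made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for the app-env-vNNN naming convention; not Asgard's code. */
final class GroupNames {
    private static final Pattern VERSIONED = Pattern.compile("^(.+)-v(\\d{3})$");

    /** Returns the next group name in a cluster, for example app1-env-v009 to app1-env-v010. */
    static String next(String groupName) {
        Matcher matcher = VERSIONED.matcher(groupName);
        if (!matcher.matches()) {
            return groupName + "-v000"; // assumption: unversioned names start numbering at v000
        }
        int version = Integer.parseInt(matcher.group(2));
        return String.format("%s-v%03d", matcher.group(1), version + 1);
    }

    public static void main(String[] args) {
        System.out.println(next("app1-env-v009"));
    }
}
```

Because each push creates a new group next to the old one, both versions can run side by side until the old group is retired.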

36
When to test recovery (and HA)?
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure backup works before you have a crashed
system and find out your backup is broken

• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and
angry users
– Best to learn from failures and add automated tests
37
The Simian Army
• A bunch of automated “monkeys” that
perform automated system administration
tasks

• Anything that is done by a human more than
once can and should be automated
• Absolutely necessary at web scale
38
Bad Monkeys
• Open Sourced – Chaos Monkey
– Used to randomly terminate instances
– Now block network, burn cpu, kill
processes, fail amazon api, fail dns,
fail dynamo, fail s3, introduce network
errors/latency, detach volumes, fill disk,
burn I/O
http://www.flickr.com/photos/27261720@N00/132750805

• Not yet open sourced
– Chaos Gorilla
• Kill datacenter/availability zone instances

– Chaos Kong
• Kill all instances in an entire region

– Latency Monkey
• Introduce latency into service calls directly (ribbon server side)

– Split Brain Monkey
• Datacenters/availability zones continue to operate, but isolated from each other
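
The core of random instance termination fits in a few lines. The class below is a hypothetical sketch, not Chaos Monkey's code: it only picks a victim and leaves the actual termination call to the caller, and the probability parameter is an assumption of this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical victim picker; an illustration only, not Chaos Monkey. */
final class TinyChaosMonkey {
    private final Random random;
    private final double probability;

    TinyChaosMonkey(Random random, double probability) {
        this.random = random;
        this.probability = probability;
    }

    /** Picks at most one instance from the group; the caller does the termination. */
    Optional<String> pickVictim(List<String> instances) {
        if (instances.isEmpty() || random.nextDouble() >= probability) {
            return Optional.empty();
        }
        return Optional.of(instances.get(random.nextInt(instances.size())));
    }

    public static void main(String[] args) {
        TinyChaosMonkey monkey = new TinyChaosMonkey(new Random(), 0.5);
        List<String> group = Arrays.asList("i-001", "i-002", "i-003");
        System.out.println(monkey.pickVictim(group).map(id -> "terminate " + id).orElse("no chaos this run"));
    }
}
```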

39
Elastic Scale
• Basic elastic scale required to achieve high availability
– To run three or more of any component

• Front tier specific considerations
– Will likely need to scale far higher than micro-services
– Use distributed caching with TTL where appropriate
– Otherwise micro-service architecture could overload data
servers

• Scaling larger (or Web Scale) will find bottlenecks that
require changes to architecture and/or tuning
– Iterative process of improvement
40
Elastic scaling in application
architecture
• Clusters that replicate data within the cluster must
discover new peers (and timeout dead ones)
• Clusters that connect to other clusters must discover
new dependency instances (and timeout dead ones)
• Many legacy architectures contain static cluster
definitions that require “re-starts” to update
information
– Code changes required to leverage dynamic connectivity

41
Full Auto Scaling
• Eventually web scale will require auto scaling
based on policy
– Attach policy based on request latency, utilization,
queue depth, etc.

• Words of caution, be careful to
– Design policies to be proactive on scale up or risk
scaling that isn’t fast enough to keep up with demand
– Design policies to be generous on scale down or risk
over-scaling down and immediate need for scale up
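
The two words of caution translate into an asymmetric policy: scale up early and in big steps, scale down late and in small steps. The thresholds and step sizes below are hypothetical values chosen for the sketch, not recommendations from the talk.

```java
/** Hypothetical utilization-based policy with asymmetric scale up and scale down. */
final class ScalingPolicy {
    /** Returns the new desired instance count for an average utilization between 0.0 and 1.0. */
    static int decide(int current, double utilization, int min, int max) {
        int target = current;
        if (utilization > 0.60) {
            // proactive scale up: react well before saturation and add half again as many instances
            target = current + Math.max(1, current / 2);
        } else if (utilization < 0.20) {
            // generous scale down: only when clearly idle, and one instance at a time
            target = current - 1;
        }
        return Math.max(min, Math.min(max, target));
    }

    public static void main(String[] args) {
        System.out.println(decide(10, 0.70, 3, 100)); // busy group grows quickly
        System.out.println(decide(10, 0.10, 3, 100)); // idle group shrinks slowly
    }
}
```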
42
Scaling Continues to Evolve
• Reactive auto scaling is “easy” but naïve
– Instances fail
– Unexpected spike in demand

• What if your traffic is “predictable”, consider
– User population follows a daily pattern
– User population known to follow different patterns each day (work
days vs. weekends)
– End of month influx of work

• Scryer is Netflix’s predictive analytics to not wait for reactive scaling
– Better end user experience, less over deployment (cheaper), more
consistent utilization (cheaper)
– Not yet open sourced

43
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• How to grade public cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
44
Thoughts on Continuous Delivery
• Legacy waterfall habits are hard to break
– “Leaks” of old world continue to show
– Especially if product has to be released in “shrink wrapped” form in parallel

• Netflix approach and technology assists breaking
these habits
– Provide the tools and proof points and the
organization will follow
45
Continuous Delivery Pipeline
• Developers
– Perform local testing before checking code into continuous build

• Continuous build
– Builds code, tests code and flags any breaks for immediate attention
– Builds packages ready for image installation

• Image bakery
– Builds image for deployment that then shows up in Asgard

• Continuous deployment
– Images deployed through Asgard
– Instances are given image and environmental context from Asgard
• Same images should be used in production that are used in test

• Due to micro-services (API as contract) approach
– No need to co-ordinate typical deployments across teams

46
Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Adhoc processes allowed, enforced through Asgard model

• More coming using Glisten and workflow services

47
Demo

48
Ability to reconfigure - Archaius
• Using dynamic properties, can easily change properties across cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts for example
– Custom dynamic props

• High throughput achieved by polling approach
• HA of configuration source dependent on what source you use
– HTTP server, database, etc.

[Diagram: Application and its property hierarchy. Labels: Runtime, URL, JMX, Karyon Console, Persisted DB, Application Props, Libraries, Container]

DynamicIntProperty prop =
DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
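
Why polling gives high throughput is easier to see in a sketch: reads are served from an in-memory snapshot and only the poller touches the configuration source. The class below is a hypothetical illustration of that idea, not Archaius itself. Its names are assumptions, and a scheduler calling `poll()` at a fixed interval is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical polled property store; an illustration only, not Archaius. */
final class PolledProperties {
    private final Supplier<Map<String, String>> source;
    private volatile Map<String, String> snapshot;

    PolledProperties(Supplier<Map<String, String>> source) {
        this.source = source;
        poll();
    }

    /** Re-reads the whole source; a scheduler would call this at a fixed interval. */
    void poll() {
        snapshot = new HashMap<>(source.get());
    }

    /** Reads come from the snapshot, so they never wait on the configuration source. */
    int getInt(String name, int defaultValue) {
        String raw = snapshot.get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Map<String, String> backing = new HashMap<>();
        PolledProperties properties = new PolledProperties(() -> backing);
        System.out.println(properties.getInt("myProperty", 5));
        backing.put("myProperty", "42"); // change at the source
        properties.poll();               // picked up on the next poll
        System.out.println(properties.getInt("myProperty", 5));
    }
}
```

The trade-off is that a change only becomes visible after the next poll, so the polling interval bounds how stale a value can be.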

49
Get baked!
• Caution: Flame/troll bait ahead!!
– Criticism – “Netflix is ruining the cloud”
• Overhead of images for every code version
• Ties to Amazon AMI’s (have proven this tie can be broken)

• Netflix takes the approach of baking images as part of build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)

• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system
configuration doesn’t provide value
– Speed of elastic scaling, boot and go
– Discourages ad hoc changes to server instances
50
AMInator
• Starting image/volume
– Foundational image created (maybe via
loopback), base AMI with common
software created/tested independently

• Aminator running – Bakery
– Bakery obtains a known EBS volume of
the base image from a pool
– Bakery mounts volume and provisions
the application (apt/deb or yum/rpm)
– Bakery snapshots and registers snapshot

• Recent work to add other provisioning
such as chef as plugins
51
Imaginator
• Implementation of Aminator
– For IBM SoftLayer cloud

• Creates image templates
– Starts from base OS and adds deb/rpm’s

• Snapshots images for later deployment
• Not yet open sourced
52
Good Monkeys
• Janitor Monkey
– Somewhat a mitigation for baking approach
– Will mark and sweep unused resources
(instances, volumes, snapshots, ASG’s,
launch configs, images, etc.)
– Owners notified, then removed

• Conformity Monkey
– Check instances are conforming to rules around security, ASG/ELB, age, status/health check, etc.

Image: http://www.flickr.com/photos/sonofgroucho/5852049290

53
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility

• Get started yourself
54
Thoughts on Operational Visibility
• Programming model to expose metrics should be
simple
• Systems need to expose internals in a way that is
sensible to the owners and operators
• The tools that view the internals need to match the
level of abstraction developers care about
• The tools must give sufficient context when viewing
any single metric or alert
55
Monitoring - Servo
• Annotation based publishing through JMX of
application metrics
• Gauges, counters, and timers
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to metric collection servers

• Netflix exposes their metrics to Atlas
– The entire Netflix monitoring infrastructure hasn’t been
open sourced due to complexity and priority
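
Annotation based publishing can be illustrated with plain reflection. The `@Metric` annotation and the poller below are hypothetical and written for this sketch. Servo's own annotations and JMX publishing differ, so treat this only as the shape of the idea.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical marker annotation; Servo's own annotations differ. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Metric {
    String name();
}

/** Application class that exposes two counters simply by annotating fields. */
final class AuthServiceStats {
    @Metric(name = "loginCount")
    final AtomicLong logins = new AtomicLong();

    @Metric(name = "failedLoginCount")
    final AtomicLong failedLogins = new AtomicLong();
}

/** Poller that collects annotated counters; an observer could forward them to a metrics server. */
final class MetricPoller {
    static Map<String, Long> poll(Object monitored) {
        Map<String, Long> values = new TreeMap<>();
        for (Field field : monitored.getClass().getDeclaredFields()) {
            Metric metric = field.getAnnotation(Metric.class);
            if (metric == null || field.getType() != AtomicLong.class) {
                continue;
            }
            try {
                field.setAccessible(true);
                values.put(metric.name(), ((AtomicLong) field.get(monitored)).get());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read metric " + metric.name(), e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        AuthServiceStats stats = new AuthServiceStats();
        stats.logins.incrementAndGet();
        stats.logins.incrementAndGet();
        stats.failedLogins.incrementAndGet();
        System.out.println(poll(stats));
    }
}
```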
56
Back to Hystrix
• Main reason for Hystrix is
protect yourself from
dependencies, but …
• Same layer of indirection to
services can provide
visualization
• You can aggregate the view
across clusters via Turbine

• Other alert system and
dashboards can read from
Turbine
57
Edda
• IaaS does not typically provide
– Historical views of the state of the system
– All views between components an operator might want to see

• Edda polls current state and stores the data in a queryable
database
• Provides an ad hoc queryable view of all deployment aspects
• Provides a historical view
– For correlation of problems to changes
– Becoming a more commonplace feature in cloud

58
Ice
• Cloud spend and usage analytics
• Communicates with billing API to give
bird's-eye view of cloud spend with drill
down to region, availability zone, and
service team through application groups
• Watches differently priced instances and
instance sizes to help optimize
• Not point in time
– Shows trends to help predict future
optimizations

59
Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?

• Let’s write some code!
60
Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix

• NetflixOSS as ported to IBM Cloud
– https://github.com/EmergingTechnologyInstitute
– SoftLayer Image Templates coming soon

• Acme Air, NetflixOSS AMI’s
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs

• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog

• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix

Thanks!
Questions?

61

Contenu connexe

Tendances

Netflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesNetflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesMatt McCarthy
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopSudhir Tonse
 
Getting Started with Amazon AppStream
Getting Started with Amazon AppStreamGetting Started with Amazon AppStream
Getting Started with Amazon AppStreamAmazon Web Services
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Amazon WorkSpaces - Fully Managed Desktops in the Cloud
Amazon WorkSpaces - Fully Managed Desktops in the CloudAmazon WorkSpaces - Fully Managed Desktops in the Cloud
Amazon WorkSpaces - Fully Managed Desktops in the CloudAmazon Web Services
 
AWS Innovate: Moving Microsoft .Net applications one container at a time - Da...
AWS Innovate: Moving Microsoft .Net applications one container at a time - Da...AWS Innovate: Moving Microsoft .Net applications one container at a time - Da...
AWS Innovate: Moving Microsoft .Net applications one container at a time - Da...Amazon Web Services Korea
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Scalable Microservices at Netflix. Challenges and Tools of the Trade
Scalable Microservices at Netflix. Challenges and Tools of the TradeScalable Microservices at Netflix. Challenges and Tools of the Trade
Scalable Microservices at Netflix. Challenges and Tools of the TradeC4Media
 
Monitoring microservices platform
Monitoring microservices platformMonitoring microservices platform
Monitoring microservices platformBoyan Dimitrov
 
Embracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixEmbracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixJosh Evans
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connectAdrian Cockcroft
 
High Availability and Disaster Recovery
High Availability and Disaster RecoveryHigh Availability and Disaster Recovery
High Availability and Disaster RecoveryAkelios
 
Delivering Hybrid Cloud Solutions on Microsoft Azure
Delivering Hybrid Cloud Solutions on Microsoft AzureDelivering Hybrid Cloud Solutions on Microsoft Azure
Delivering Hybrid Cloud Solutions on Microsoft AzureKemp
 
#NetflixEverywhere Global Architecture
#NetflixEverywhere Global Architecture#NetflixEverywhere Global Architecture
#NetflixEverywhere Global ArchitectureJosh Evans
 
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018Derek Ashmore
 
Chris Munns, DevOps @ Amazon: Microservices, 2 Pizza Teams, & 50 Million Depl...
Chris Munns, DevOps @ Amazon: Microservices, 2 Pizza Teams, & 50 Million Depl...Chris Munns, DevOps @ Amazon: Microservices, 2 Pizza Teams, & 50 Million Depl...
Chris Munns, DevOps @ Amazon: Microservices, 2 Pizza Teams, & 50 Million Depl...TriNimbus
 
Introduction to RightScale
Introduction to RightScaleIntroduction to RightScale
Introduction to RightScaleAkelios
 
Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019J V
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAmazon Web Services Korea
 
Scaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWSScaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWSBoyan Dimitrov
 

Tendances (20)

Netflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesNetflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV Devices
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Cloud Services Powered by IBM SoftLayer and NetflixOSS

  • 1. Public Cloud Services using IBM Cloud and Netflix OSS • Jan 2014 • Andrew Spyker, @aspyker
  • 2. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  • 3. About me … • IBM STSM of Performance Architect and Strategy • Eleven years in performance in WebSphere – Led the App Server Performance team for years – Small sabbatical focused on IBM XML technology – Works in Emerging Technology Institute, CTO Office – Now cloud service operations • Email: aspyker@us.ibm.com – Blog: http://ispyker.blogspot.com/ – Linkedin: http://www.linkedin.com/in/aspyker – Twitter: http://twitter.com/aspyker – Github: http://www.github.com/aspyker • RTP dad that enjoys technology as well as running, wine and poker
  • 4. Develop or maintain a service today? • Develop – yes • Maintain – starting • So far – Multiple services inside of IBM – Other services for use in our PaaS environment • Photo: http://www.flickr.com/photos/stevendepolo/
  • 5. What qualifies me to talk? • My monkey? • Of ~40 Cloud Prize entrants – Best example mash-up sample • Nomination and win – Best portability enhancement • Nomination – More on this coming … • Other nominees - http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html • Other winners - http://techblog.netflix.com/2013/11/netflix-open-source-software-cloud.html
  • 6. Seriously, how did I get here? • Experience with performance and scale on standardized benchmarks (SPEC/TPC) – Not representative of how to (web) scale • Pinning, biggest monolithic DB “wins”, hand tuned for fixed size – Out of date on modern architecture for mobile/cloud • Created Acme Air – http://bit.ly/acmeairblog • Demonstrated that we could achieve (web) scale runs – 4B+ mobile/browser requests/day – With modern mobile and cloud best practices
  • 7. What was shown? • Peak performance and scale – You betcha! • Operational visibility – Only during the run via nmon collection and post-run visualization • True operational visibility – nope • Devops – nope • HA and DR – nope • Manual and automatic elastic scaling – nope
  • 8. What next? • Went looking for what best industry practices around devops and high availability at web scale existed – Many have documented via research papers and on highscalability.com – Google, Twitter, Facebook, Linkedin, etc. • Why Netflix? – Documented not only on their tech blog, but also have released working OSS on github – Also, given dependence on Amazon, they are a clear bellwether of web scale public cloud availability 8
  • 9. Steps to NetflixOSS understanding • Recoded Acme Air application to make use of NetflixOSS runtime components • Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2) run at previous levels of scale and performance on IBM middleware • Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale • Through public collaboration with Netflix technical team – Google groups, github and meetups 9
  • 10. Why? • To prove that advanced cloud high availability and devops platform wasn’t “tied” to Amazon • To understand how we can advance IBM cloud platforms for our customers • To understand how we can host our IBM public cloud services better 10
  • 11. Other cloud portability work of note • This presentation focuses on portability across public clouds • What about applicability to private cloud? • PayPal worked to port the cloud management system to OpenStack and Heat (Project Aurora) – https://github.com/paypal/aurora • Additional work required to port runtime aspects as we did in public cloud
  • 12. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself 12
  • 13. My view of Netflix goals • As a business – Be the best streaming media provider in the world – Make best content deals based on real data/analysis • Technology wise – Have the most availability possible – “Stream starts per unit of time” is KPI measured for entire business – Deliver features to customers first in market • Requiring high velocity of IT change – Do all of this at web scale • Culture wise – Create a high performance delivery culture that attracts top talent 13
  • 14. Standing on the shoulders of giants • Public Cloud (Amazon) – When adding streaming, Netflix decided they • Shouldn’t invest in building data centers worldwide • Had to plan for the streaming business to be very big – Embraced cloud architecture, paying only for what they need • Open Source – Many parts of runtime depend on open source • Linux, Apache Tomcat, Apache Cassandra, etc. • Requires top technical talent and OSS committers – Realized that Amazon wasn’t enough • Started a cloud platform on top that would eventually be open sourced - NetflixOSS • Image: http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg
  • 15. NetflixOSS on Github • “Technical indigestion as a service” – Adrian Cockcroft • netflix.github.io – 40+ OSS projects – Expanding every day • Focusing more on interactive midtier server technology today … 15
  • 16. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself 16
  • 17. High Availability Thoughts • Three of every part of your architecture – EVERYTHING in your architecture (including IaaS components) – Likely more via clustering/partitioning – One = SPOF – Two = slow active/standby recovery – Three = where you get zero downtime when failures occur • All parts of application should fail independently – No one part should take down entire application – When linked, highest availability is limited to lowest availability component – Apply circuit breaker pattern to isolate systems • If a part of the system results in total end user failure – Use partitioning to ensure only some smaller percentage of users are affected
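The "one, two, three" rule and the "limited to lowest availability component" point on this slide can be checked with a little arithmetic. The sketch below is an illustration, not something from the deck: it assumes component failures are independent (real outages are often correlated), and the class and method names are hypothetical.

```java
final class AvailabilityMath {
    private AvailabilityMath() {}

    /**
     * Availability of a chain of linked components that must all be up.
     * The product can never exceed the lowest individual availability.
     */
    static double serial(double... availabilities) {
        double result = 1.0;
        for (double a : availabilities) {
            result *= a;
        }
        return result;
    }

    /**
     * Availability of several redundant copies of one component,
     * assuming any single healthy copy can serve traffic and
     * copies fail independently of each other.
     */
    static double redundant(double availability, int copies) {
        return 1.0 - Math.pow(1.0 - availability, copies);
    }
}
```

Under these assumptions, `AvailabilityMath.serial(0.99, 0.99)` comes out near 0.98, while `AvailabilityMath.redundant(0.99, 3)` comes out near 0.999999, which is the intuition behind running three of everything.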
  • 18. Failure • What is failing? – Underlying IaaS problems • Instances, racks, availability zones, regions – Software issues • Operating system, servers, application code – Surrounding services • Other application services, DNS, user registries, etc. • How is a component failing? – Fails and disappears altogether – Intermittently fails – Works, but is responding slowly – Works, but is causing users a poor experience
  • 19. Overview of IaaS HA • Launch instances into availability zones – Instances of various sizes (compute, storage, etc.) • Organized into regions and availability zones – Availability zones are isolated from each other – Availability zones are connected with low-latency links – Regions contain availability zones – Regions independent of each other – Regions have higher latency to each other • This gives a high level of resilience to outages – Unlikely to affect multiple availability zones or regions • Cloud providers require customer be aware of this topology to take advantage of its benefits within their application • Diagram: the Internet in front of Region (Dallas) and a second region, each containing three datacenters/availability zones
  • 20. Acme Air As A Sample • Diagram (greatly simplified): ELB, Web App Front End (REST services), App Service (Authentication), Data Tier
  • 21. Micro-services architecture • Decompose system into isolated services that can be developed separately • Why? – They can fail independently vs. fail together monolithically – They can be developed and released with different velocities by different teams • To show this we created separate “auth service” for Acme Air • In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
  • 22. How do services advertise themselves? • Upon web app startup, Karyon server is started – Karyon will configure (via Archaius) the application – Karyon will register the location of the instance with Eureka • Others can know of the existence of the service • Lease based so instances continue to check in updating list of available instances – Karyon will also expose a JMX console, healthcheck URL • Devops can change things about the service via JMX • The system can monitor the health of the instance • Diagram: App Service (Authentication), with Karyon in its app server, registers name, port, IP address and healthcheck URL with the Eureka server(s); configuration comes from config.properties, auth-service.properties or remote Archaius stores
  • 23. How do consumers find services? • Service consumers query eureka at startup and periodically to determine location of dependencies – Can query based on availability zone and cross availability zone • Diagram: Web App Front End (REST services), through the Eureka client in its app server, asks the Eureka server(s) what “auth-service” instances exist
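The register, heartbeat and lookup cycle described on the last two slides can be sketched as a tiny lease-based registry. This is plain Java written for illustration and is not the Eureka API: the class name, method names and the 30 second lease used in the note below are all assumptions. Time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LeaseRegistry {
    private final long leaseMillis;
    // service name -> (instance address -> time of the last check-in)
    private final Map<String, Map<String, Long>> leases = new ConcurrentHashMap<>();

    LeaseRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Called at startup and on every later check-in; both simply refresh the lease. */
    void renew(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, k -> new ConcurrentHashMap<>()).put(address, nowMillis);
    }

    /** Returns the instances whose lease has not expired as of nowMillis. */
    List<String> lookup(String service, long nowMillis) {
        Map<String, Long> instances = leases.getOrDefault(service, Collections.emptyMap());
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> entry : instances.entrySet()) {
            if (nowMillis - entry.getValue() <= leaseMillis) {
                live.add(entry.getKey());
            }
        }
        return live;
    }
}
```

With a 30 second lease, an instance that stops checking in simply drops out of `lookup` results once its lease runs out, so consumers that re-query periodically stop seeing it without anyone deregistering it by hand.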
  • 25. How does the consumer call the service? • Protocol implementations have eureka aware load balancing support built in – In client load balancing -- does not require separate LB tier • Ribbon – REST client – Pluggable load balancing scheme – Built in failure recovery support (retry next server, mark instance as failing, etc.) • Other eureka enabled clients – Custom code in non-Java or Ribbon enabled systems (Java or pure REST) – More from Netflix • Memcached (EVCache), Astyanax (Cassandra and Priam) coming • Diagram: Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, reaching multiple App Service (Authentication) instances
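To make "in client load balancing" and "retry next server" concrete, here is a minimal sketch in plain Java. It is not Ribbon's API: the class name is hypothetical, the server list is assumed to come from a discovery lookup, and round robin is only one of the pluggable schemes the slide alludes to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class RoundRobinClient {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinClient(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers");
        }
        this.servers = new ArrayList<>(servers);
    }

    /**
     * Sends the request to the next server in rotation. If that server fails,
     * moves on to the following one, trying each server at most once.
     */
    <T> T call(Function<String, T> request) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), servers.size());
            try {
                return request.apply(servers.get(index));
            } catch (RuntimeException e) {
                last = e; // remember the failure and retry on the next server
            }
        }
        throw last; // every server failed; surface the most recent error
    }
}
```

Because the rotation and the retry both live in the caller, no separate load balancer tier sits between the front end and the service instances. A fuller client would also mark failing instances and skip them for a while.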
  • 26. PS. This is a common pattern • Same idea, but different implementations – Airbnb.com’s SmartStack • Zookeeper/Synapse/Nerve/HAProxy – Parse.com’s clustering • Zookeeper/Nginx
  • 27. How to deploy this with HA? • Instances? – Asgard deploys across AZs – Using auto scaling groups managed by Asgard – More on Asgard later • Eureka? – DNS and Elastic IP trickery – Deployed across AZs – For clients to find eureka servers • DNS TXT record for domain lists AZ TXT records • AZ TXT records have list of Eureka servers – For new eureka servers • Look for list of eureka server IPs for the AZ it’s coming up in • Look for unassigned elastic IPs, grab one and assign it to itself • Sync with other already assigned IPs that likely are hosting Eureka server instances – Simpler configurations with less HA are available
  • 28. Protect yourself from unhealthy services • Wrap all calls to services with Hystrix command pattern – Hystrix implements circuit breaker pattern – Executes command using semaphore or separate thread pool to guarantee return within finite time to caller – If an unhealthy service is detected, start to call fallback implementation (broken circuit) and periodically check if main implementation works (reset circuit) • Hystrix also provides caching, request collapsing with synchronous and asynchronous (reactive via RxJava) invocation • Diagram: Web App Front End (REST services) executes the “auth-service” call through Hystrix and the Ribbon REST client to multiple App Service (Authentication) instances, with a fallback implementation alongside
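The broken circuit / reset circuit behaviour can be sketched in a few lines of plain Java. This is not Hystrix and does not use its API; it leaves out the thread pool or semaphore isolation, timeouts, caching and request collapsing mentioned above. The class name, the consecutive-failure threshold and the explicit clock parameter are assumptions made for illustration.

```java
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    synchronized <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        boolean open = consecutiveFailures >= failureThreshold;
        if (open && nowMillis - openedAtMillis < retryAfterMillis) {
            return fallback.get(); // circuit is open: do not touch the unhealthy service
        }
        try {
            T result = primary.get(); // normal call, or a periodic trial call while open
            consecutiveFailures = 0;  // success closes (resets) the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAtMillis = nowMillis; // open the circuit, or restart the wait after a failed trial
            }
            return fallback.get();
        }
    }
}
```

The caller always gets an answer, either the real one or the fallback, and an unhealthy dependency stops receiving traffic until a trial call succeeds.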
  • 29. Denominator • Most (simple) geographic (region) based disaster recovery depends on front end DNS traffic switching • Java Library and CLI for cross DNS configuration • Allows for common, quicker (than using various DNS provider UI) and automated DNS updates • Plugins have been developed by various DNS providers 29
  • 30. Augmenting the ELB tier - Zuul • Originally developed to do cross region routing for regional HA – Advanced geographic (region) based disaster recovery • Zuul also adds devops support in the front tier routing – Stress testing (squeeze testing) – Canary testing – Dynamic routing – Load Shedding – Debugging • And some common function – Authentication – Security – Static response handling – Multi-region resiliency (DR for ELB tier) – Insight • Through dynamically deployable filters (written in Groovy) • Eureka aware using ribbon, and archaius like shown in runtime section • Diagram: in Region 1 and Region 2, load balancers front a Zuul edge service running filters
  • 31. HA in application architecture • Stateless application design – Legacy application design has state – Temporal state should be pushed to caching servers – Durable state should be pushed to partitioned data servers – Trades off peak latency for uptime (sometimes no trade off) • Partitioned data servers – Wealth of NoSQL servers available today – Be careful of oversold “consistency” promises • Look for third party “Jepsen-like” testing – Be ready to deal with compensated approaches – Consider differences in system of record vs. interaction data stores
  • 32. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself 32
  • 33. Automatic Recovery Thoughts • Automatic recovery depends on elastic, ephemeral instance cluster design powered by “auto scaling” • If something fails once, it will fail again • No repeated failure should be a pager call – Instead should be email with automated recovery information to be analyzed offline • Test failure on your system before the system tests your failure 33
  • 34. Auto Scaling (for the masses) • For many, auto scaling is more auto recovery – Far more important to keep N instances running than be able to scale automatically to 2N, 10N, 100N • For many, automatic scaling isn’t appropriate – First understand how the system can be elastically scaled with operator expertise manually 34
  • 35. Asgard • Asgard is the console for automatic scaling and recovery • Diagram: Asgard tells the IaaS to start these instances and keep this many instances running; Region (Dallas) has three datacenters/availability zones, each running Web App (REST Services) and App Service (Authentication) instances
  • 36. Asgard creates an “application” • Enforces common practices for deploying code – Common approach to linking auto scaling groups to launch configurations, load balancers, security groups, scaling policies and images • Adds missing concept to the IaaS domain model – “application” – Apps clustering and application lifecycle vs. individually launched and managed images • Example – Application – app1 – Cluster – app1-env – Asgard group version n – app1-env-v009 – Asgard group version n+1 – app1-env-v010
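The naming example on this slide (app1-env-v009 followed by app1-env-v010) suggests a simple rule for deriving the next group name in a cluster. The helper below is a hypothetical sketch of that rule only, inferred from the two names shown; it is not Asgard code, and it does not cover whatever Asgard does for the first group in a cluster or after v999.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNames {
    // Assumed shape: <cluster name>-v<three digit version>, as in app1-env-v009
    private static final Pattern VERSIONED = Pattern.compile("^(.+)-v(\\d{3})$");

    private GroupNames() {}

    /** Returns the name the next group in the same cluster would get. */
    static String next(String groupName) {
        Matcher m = VERSIONED.matcher(groupName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a versioned group name: " + groupName);
        }
        int version = Integer.parseInt(m.group(2));
        return String.format("%s-v%03d", m.group(1), version + 1);
    }
}
```

Keeping the version in the group name is what lets old and new groups of the same cluster exist side by side during a push.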
  • 37. When to test recovery (and HA)? • Failure is inevitable. Don’t try to avoid it! • How do you know if your backup is good? – Try to restore from your backup every so often – Better to ensure backup works before you have a crashed system and find out your backup is broken • How do you know if your system is HA? – Try to force failures every so often – Better to force those failures during office hours – Better to ensure HA before you have a down system and angry users – Best to learn from failures and add automated tests 37
  • 38. The Simian Army • A bunch of automated “monkeys” that perform automated system administration tasks • Anything that is done by a human more than once can and should be automated • Absolutely necessary at web scale 38
  • 39. Bad Monkeys • Open Sourced – Chaos Monkey – Used to randomly terminate instances – Now can also block network, burn cpu, kill processes, fail amazon api, fail dns, fail dynamo, fail s3, introduce network errors/latency, detach volumes, fill disk, burn I/O • Not yet open sourced – Chaos Gorilla • Kill datacenter/availability zone instances – Chaos Kong • Kill all instances in an entire region – Latency Monkey • Introduce latency into service calls directly (ribbon server side) – Split Brain Monkey • Datacenters/availability zones continue to operate, but isolated from each other • Image: http://www.flickr.com/photos/27261720@N00/132750805
  • 40. Elastic Scale • Basic elastic scale required to achieve high availability – To run three or more of any component • Front tier specific considerations – Will likely need to scale far higher than micro-services – Use distributed caching with TTL where appropriate – Otherwise micro-service architecture could overload data servers • Scaling larger (or Web Scale) will find bottlenecks that require changes to architecture and/or tuning – Iterative process of improvement 40
  • 41. Elastic scaling in application architecture • Clusters that replicate data within the cluster must discover new peers (and timeout dead ones) • Clusters that connect to other clusters must discover new dependency instances (and timeout dead ones) • Many legacy architectures contain static cluster definitions that require “re-starts” to update information – Code changes required to leverage dynamic connectivity 41
  • 42. Full Auto Scaling • Eventually web scale will require auto scaling based on policy – Attach policy based on request latency, utilization, queue depth, etc. • Words of caution, be careful to – Design policies to be proactive on scale up or risk scaling that isn’t fast enough to keep up with demand – Design policies to be generous on scale down or risk over-scaling down and immediate need for scale up 42
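The two words of caution amount to an asymmetric policy: add capacity early and in large steps, remove it late and in small steps. The sketch below shows one way that could look. The thresholds (60% and 30% utilization) and step sizes are made-up illustrative values, not numbers from the talk or from any cloud provider's scaling API.

```java
final class ScalingPolicy {
    private ScalingPolicy() {}

    /**
     * Returns the desired instance count for the observed average utilization (0.0 to 1.0).
     * Scale up is proactive: it triggers well before saturation and adds 50% at once.
     * Scale down is cautious: it needs low utilization and removes one instance at a time.
     */
    static int desiredInstances(int current, double utilization, int min, int max) {
        final double scaleUpAt = 0.60;   // assumed threshold, leaves headroom for boot time
        final double scaleDownAt = 0.30; // assumed threshold, far below the scale-up point
        int desired = current;
        if (utilization >= scaleUpAt) {
            desired = current + Math.max(1, current / 2);
        } else if (utilization <= scaleDownAt) {
            desired = current - 1;
        }
        return Math.max(min, Math.min(max, desired));
    }
}
```

The wide gap between the two thresholds is what keeps the group from scaling down and then immediately needing to scale up again. Setting `min` to three keeps the policy consistent with the earlier high availability advice.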
  • 43. Scaling Continues to Evolve • Reactive auto scaling is “easy” but naïve – Instances fail – Unexpected spike in demand • What if your traffic is “predictable”, consider – User population follows a daily pattern – User population known to follow different patterns each day (work days vs. weekends) – End of month influx of work • Scryer is Netflix’s predictive analytics to not wait for reactive scaling – Better end user experience, less over deployment (cheaper), more consistent utilization (cheaper) – Not yet open sourced 43
  • 44. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • How to grade public cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself 44
  • 45. Thoughts on Continuous Delivery • Legacy waterfall habits are hard to break – “Leaks” of old world continue to show – Especially if product has to be released in “shrink wrapped” form in parallel • Netflix approach and technology assists breaking these habits – Provide the tools and proof points and the organization will follow
  • 46. Continuous Delivery Pipeline • Developers – Perform local testing before checking code into continuous build • Continuous build – Builds code, tests code and flags any breaks for immediate attention – Builds packages ready for image installation • Image bakery – Builds images for deployment that then show up in Asgard • Continuous deployment – Images deployed through Asgard – Instances are given image and environmental context from Asgard • Same images should be used in production that are used in test • Due to micro-services (API as contract) approach – No need to co-ordinate typical deployments across teams
  • 47. Asgard devops procedures • Fast rollback • Canary testing • Red/Black pushes • More through REST interfaces – Adhoc processes allowed, enforced through Asgard model • More coming using Glisten and workflow services
  • 49. Ability to reconfigure - Archaius • Using dynamic properties, can easily change properties across cluster of applications, either – NetflixOSS named props • Hystrix timeouts for example – Custom dynamic props • High throughput achieved by polling approach • HA of configuration source dependent on what source you use • Diagram: runtime hierarchy of configuration sources (URL, JMX, Karyon Console, Persisted DB, Application Props) feeding the application, libraries and container (HTTP server, database, etc.) • Example: DynamicIntProperty prop = DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE); int value = prop.get(); // value will change over time based on configuration
  • 50. Get baked! • Caution: Flame/troll bait ahead!! – Criticism – “Netflix is ruining the cloud” • Overhead of images for every code version • Ties to Amazon AMI’s (have proven this tie can be broken) • Netflix takes the approach of baking images as part of build such that – Instance boot-up doesn’t depend on outside servers – Instance boot-up only starts servers already set to run – New code = new instances (never update instances in place) • Why? – Critical when launching hundreds of servers at a time – Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value – Speed of elastic scaling, boot and go – Discourages ad hoc changes to server instances 50
  • 51. AMInator • Starting image/volume – Foundational image created (maybe via loopback), base AMI with common software created/tested independently • Aminator running – Bakery – Bakery obtains a known EBS volume of the base image from a pool – Bakery mounts volume and provisions the application (apt/deb or yum/rpm) – Bakery snapshots and registers snapshot • Recent work to add other provisioning such as chef as plugins 51
  • 52. Imaginator • Implementation of Aminator – For IBM SoftLayer cloud • Creates image templates – Starts from base OS and adds deb/rpm’s • Snapshots images for later deployment • Not yet open sourced 52
  • 53. Good Monkeys • Janitor Monkey – Somewhat a mitigation for baking approach – Will mark and sweep unused resources (instances, volumes, snapshots, ASG’s, launch configs, images, etc.) – Owners notified, then removed • Conformity Monkey – Check instances are conforming to rules around security, ASG/ELB, age, status/health check, etc. • Image: http://www.flickr.com/photos/sonofgroucho/5852049290
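The "mark, notify owners, then remove" flow can be sketched as a two-phase sweep with a grace period. This is an illustration in plain Java, not Janitor Monkey's code or API: the class name, the single grace period and the way "unused" is decided by the caller are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class JanitorSketch {
    private final long graceMillis;
    // resource id -> time it was first seen unused (the "mark")
    private final Map<String, Long> markedAt = new HashMap<>();

    JanitorSketch(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /**
     * One pass over the currently unused resources. Newly unused resources are
     * marked (this is where an owner would be notified); resources that have
     * stayed unused for the whole grace period are returned for removal.
     */
    List<String> sweep(Set<String> unusedIds, long nowMillis) {
        markedAt.keySet().retainAll(unusedIds); // back in use: forget the mark
        List<String> toRemove = new ArrayList<>();
        for (String id : unusedIds) {
            Long firstSeen = markedAt.putIfAbsent(id, nowMillis);
            if (firstSeen != null && nowMillis - firstSeen >= graceMillis) {
                toRemove.add(id);
            }
        }
        markedAt.keySet().removeAll(toRemove);
        return toRemove;
    }
}
```

The grace period between mark and sweep gives an owner time to react to the notification, and a resource that comes back into use before then loses its mark.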
  • 54. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself 54
  • 55. Thoughts on Operational Visibility • Programming model to expose metrics should be simple • Systems need to expose internals in a way that is sensible to the owners and operators • The tools that view the internals need to match the level of abstraction developers care about • The tools must give sufficient context when viewing any single metric or alert 55
  • 56. Monitoring - Servo • Annotation based publishing through JMX of application metrics • Gauges, counters, and timers • Filters, Observers, and Pollers to publish metrics – Can export metrics to metric collection servers • Netflix exposes their metrics to Atlas – The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority 56
  • 57. Back to Hystrix • Main reason for Hystrix is protect yourself from dependencies, but … • Same layer of indirection to services can provide visualization • You can aggregate the view across clusters via Turbine • Other alert system and dashboards can read from Turbine 57
  • 58. Edda • IaaS does not typically provide – Historical views of the state of the system – All views between components an operator might want to see • Edda polls current state and stores the data in a queryable database • Provides an ad hoc queryable view of all deployment aspects • Provides a historical view – For correlation of problems to changes – Becoming a more commonplace feature in cloud
  • 59. Ice • Cloud spend and usage analytics • Communicates with billing API to give bird’s-eye view of cloud spend with drill down to region, availability zone, and service team through application groups • Watches differently priced instances and instance sizes to help optimize • Not point in time – Shows trends to help predict future optimizations
  • 60. Agenda • Blah, blah, blah • How can I learn more? • How do I play with this? • Let’s write some code! 60
  • 61. Want to play? • NetflixOSS blog and github – http://techblog.netflix.com – http://github.com/Netflix • NetflixOSS as ported to IBM Cloud – https://github.com/EmergingTechnologyInstitute – SoftLayer Image Templates coming soon • Acme Air, NetflixOSS AMI’s – Try Asgard/Eureka with a real application – http://bit.ly/aa-AMIs • See what we ported to IBM Cloud (video) – http://bit.ly/noss-sl-blog • Fork and submit pull requests to Acme Air – http://github.com/aspyker/acmeair-netflix • Thanks! Questions?