How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it most? LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from all four data centers at the same time has moved the company from a disaster-recovery model to a disaster-avoidance model, in which an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact on users.
As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services handling extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn't sufficient to provide confidence in a data center's capacity. To solve this problem, LinkedIn stresses its services site-wide with live traffic, shifting traffic between data centers to simulate a disaster every business day!
4. Michael Kehoe
/USR/BIN/WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland
5. Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Disaster Recovery Planning and Automation
• Incident Response and Automation
• Visibility Engineering
• Reliability Principles
6. LinkedIn
EVOLUTION OF THE INFRASTRUCTURE
Timeline (2003, 2010, 2011, 2013, 2014, 2015): Active & Passive → Active & Active → Multi-colo 3-way Active & Active → Multi-colo n-way Active & Active
9. What is Resilience Engineering?
• Projects that directly demand increased resilience from our applications and infrastructure
• Application Failure Injection (see the sketch after this list)
• Infrastructure Failure Injection
• Full Disaster-Recovery Tests
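Application-level failure injection can be as simple as wrapping downstream calls so that a configurable fraction of them fail or slow down. Below is a minimal, illustrative Python sketch of that idea; it is not LinkedIn's actual tooling, and the rates, the decorator, and fetch_profile are all hypothetical stand-ins.

# Minimal failure-injection sketch (illustrative only, not LinkedIn tooling).
# FAILURE_RATE, LATENCY_RATE, MAX_DELAY_S, and fetch_profile are hypothetical.
import random
import time

FAILURE_RATE = 0.05  # fail 5% of calls outright
LATENCY_RATE = 0.05  # slow another 5% of calls
MAX_DELAY_S = 2.0    # cap on injected latency, in seconds

class InjectedFailure(Exception):
    """Raised instead of a real downstream error during an injection test."""

def inject_failures(call):
    """Decorator: make a fraction of calls fail or respond slowly."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < FAILURE_RATE:
            raise InjectedFailure(f"injected failure calling {call.__name__}")
        if roll < FAILURE_RATE + LATENCY_RATE:
            time.sleep(random.uniform(0.0, MAX_DELAY_S))  # injected latency
        return call(*args, **kwargs)
    return wrapper

@inject_failures
def fetch_profile(member_id):
    # Stand-in for a real downstream service call.
    return {"member": member_id}

Running normal traffic through a wrapper like this shows whether callers degrade gracefully (timeouts, retries, fallbacks) long before a real outage does.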
11. How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it most?
12. Problem Statement
• How do we ensure that we always have disaster recovery ability without incident?
• How do we consistently test for disaster recovery ability without disrupting the company?
14. Project Overview
• Build a process (with automation) to facilitate disaster recovery
• Operate the process on a regular cadence (sketched below)
• Provide reporting on test outcomes to engineering executives
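As a rough illustration of the "regular cadence plus reporting" idea, the sketch below runs a drill on business days and emits a report. This is an assumed design, not LinkedIn's tool; run_drill and report_results are hypothetical placeholders.

# Rough sketch of a recurring DR drill runner (assumed design).
import datetime
import time

def run_drill() -> dict:
    """Placeholder for executing one automated traffic-shift drill."""
    return {"started": datetime.datetime.now().isoformat(), "status": "ok"}

def report_results(outcome: dict) -> None:
    """Placeholder for the report shared with engineering executives."""
    print(f"DR drill report: {outcome}")

def main() -> None:
    while True:
        now = datetime.datetime.now()
        # Run only on business days (Mon-Fri), once per day at 10:00 local.
        if now.weekday() < 5 and now.hour == 10:
            report_results(run_drill())
            time.sleep(3600)  # don't re-trigger within the same hour
        time.sleep(60)

if __name__ == "__main__":
    main()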
25. Benefits of Load-testing
CAPACITY PLANNING
• Through this process, we continuously validate our infrastructure capacity
• This is the best signal we can possibly get, since we're simulating a real disaster
26. Benefits of Load-testing
IDENTIFY BUGS
• Some bugs are only found at high load (under duress)
• Helps find inefficiencies that otherwise may not be found until it's too late
• Gives us clues on how to make our code more resilient to potential failure
27. Benefits of Load-testing
CONFIDENCE
• Through load-testing, we've built confidence in our disaster recovery strategy
• We understand exactly:
  • What process to follow
  • How long it takes to avert disaster
  • What risks are associated with a disaster incident
29. Key Takeaways
• Resilience Engineering is a must for LinkedIn
• Design infrastructure to facilitate disaster recovery
• Disaster-test regularly to avoid surprises
• Automate your testing/process to reduce engagement time
Anil
TrafficShift is a two-part application: a web application provides an easy way for engineers to create planned and emergency offline plans.
We leverage Couchbase as our key/value persistence store.
Python backend worker processes talk to the Salt Master via the Salt API and instruct the sticky-routing service to turn buckets online and offline.
We leverage this toolset to run load tests, or stress tests, of our data centers.
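To make that flow concrete, here is a heavily hedged sketch of one worker action. The plan schema, the Salt API URL and credentials, the drain state, and the sticky-routing endpoint are all hypothetical stand-ins; only the overall shape (read a stored plan, call the Salt API, flip buckets offline) comes from the description above.

# Hedged sketch of the worker flow described above; every identifier below
# is an assumption, not LinkedIn's real interface.
import requests

SALT_API = "https://salt-master.example.com:8000"      # hypothetical salt-api
STICKY_ROUTING = "https://stickyrouting.example.com"   # hypothetical service

def load_plan(plan_id: str) -> dict:
    # In the real tool the plan is read from Couchbase; a literal stands in.
    return {"target_hosts": "edge-us-west-*", "buckets": ["b-001", "b-002"]}

def execute_offline_plan(plan_id: str) -> None:
    plan = load_plan(plan_id)
    # salt-api's /run endpoint accepts a "lowstate" payload like this one.
    resp = requests.post(
        f"{SALT_API}/run",
        json=[{
            "client": "local",
            "tgt": plan["target_hosts"],
            "fun": "state.apply",
            "arg": ["trafficshift.drain"],   # assumed drain state
            "username": "trafficshift",      # assumed eauth credentials
            "password": "********",
            "eauth": "pam",
        }],
    )
    resp.raise_for_status()
    # Tell the sticky-routing service the buckets are now offline.
    for bucket in plan["buckets"]:
        requests.post(f"{STICKY_ROUTING}/buckets/{bucket}/offline").raise_for_status()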
Whew, that was a lot of talk about mitigating issues by doing a traffic shift. But if you observe closely, we are already migrating live traffic across data centers, so why not leverage the same mechanism to stress test a data center? How awesome is that? Not stress testing a single service, but stressing the whole system. I am going to talk about load testing next.
Anil
As you can see, by turning a precise number of buckets offline in US-West and US-East, we can reroute that extra traffic to the target data center.
We do this in a controlled manner, in steps, until the threshold level of 50% is reached. If for any reason an alert fires during the stress test, our TrafficShift tool acknowledges it, automatically rebalances the site traffic, and sends out the stress-test report to SREs.
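That stepped ramp with automatic rollback fits in a few lines. The sketch below captures the behaviour described here (shift traffic in steps, stop at 50%, abort and rebalance on any alert); the step size, soak time, starting share, and helper functions are assumptions, not LinkedIn's actual implementation.

# Sketch of a stepped load test with automatic abort (assumed design).
# STEP, SOAK_SECONDS, and the helpers are hypothetical; the 50% target and
# the abort-on-alert behaviour come from the talk.
import time

TARGET_SHARE = 0.50   # stop once the target DC serves 50% of site traffic
STEP = 0.05           # fraction of site traffic shifted per step
SOAK_SECONDS = 600    # observation window after each step

def shift_buckets(to_dc: str, fraction: float) -> None:
    """Placeholder: take enough buckets offline elsewhere to move traffic."""

def alerts_firing(dc: str) -> bool:
    """Placeholder: query the alerting system for the target DC."""
    return False

def rebalance_site() -> None:
    """Placeholder: restore the normal traffic distribution."""

def run_stress_test(target_dc: str, starting_share: float = 0.33) -> str:
    share = starting_share
    while share < TARGET_SHARE:
        shift_buckets(target_dc, STEP)
        share += STEP
        time.sleep(SOAK_SECONDS)          # let the new traffic level soak
        if alerts_firing(target_dc):
            rebalance_site()              # automatic abort and rollback
            return "aborted: alert fired, traffic rebalanced, report sent"
    return "completed: target DC held 50% of site traffic"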