Failure testing is a fundamental piece of Twitter’s reliability engineering. Over the years, we developed a rich toolchain that allows us to detect and fix scalability problems long before they happen. In this talk, we’ll cover some of the strategies we employ and discuss our always evolving approach to API stress testing and its “unit test” equivalent, redline testing.
Stress Testing at Twitter: a tale of New Year's Eves
1. Stress Testing as a Service
A tale of New Year’s Eves at Twitter
@herval
2. Twitter?
📈 Traffic is always growing
💥 Traffic is spiky
"The microphone for the masses"
3. Castle in the Sky, 2011
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
8k Tweets/second
The Spell of Destruction is cast with the word "balse" to bring down the magic city.
5. Japanese New Year, 2012
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
16k Tweets/second
Tweeting "happy new year!" exactly at midnight is a lucky charm.
7. Brazil vs Germany, 2014
https://www.theguardian.com/technology/2014/jul/15/twitter-world-cup-tweets-germany-brazil
35m Tweets during the game
30k Tweets/second on each of the 7 goals
10. About me
@herval
Engineering Manager - Insights & Reliability
• 2 years at Twitter
• Startup founder (3x), SoundCloud, Pivotal Labs
• Have a dog, a wife and way too many mechanical keyboards
Managing tooling to reduce the number of "oh no" moments.
11. "Embracing Failure 24/7" (2016)
https://www.wired.com/brandlab/2017/10/juniper-mazdak-hashemi/
If you don't break your site,
someone will break it for you
• How do you know the weak links?
• How do you convince people to let you break things?
• How do you automate that?
12. Good news!
Some spikes are foreseeable
• World Cup
• Super Bowl
• Oscars
• New Year's Eve
Some aren't, but you can prepare
Social network spikes: mostly breaking news or memes
Know your patterns!
13. "Breaking News" style traffic
Sudden, huge increases in traffic
Hard to react to even with auto-scaling (booting an AWS instance takes longer than the entire spike!)
Failures are sudden
Allocating infrastructure all the time for these events is wasteful
14. The "meme" traffic
Smaller in amplitude, longer
Auto-scaling systems handle this well
Leads to slow implosions (service A slowly running out of memory, service B accumulating logs, etc.)
18. The pyramid of reliability testing
Redline Testing → Stress Testing → Omni Testing (end-to-end, full service)
19. The Bikeshed of Stress Tests
👍 Stress testing
Hit a small number of APIs with predetermined load goals and duration
🧐 Redline testing
Test an API/service until it breaks (and measure why it broke)
🧐 Omni testing
Simulating an entire site load, for a predetermined duration
20. Building the first test suite
Initial dataset covering big historical scenarios
"Tweets with images"
Retweet storms
"Happy new year!!!!!!!!!!!"
Test a specific scenario that is bound to happen again
cat (traffic logs) → grep (the APIs you want to test) → sed (user IDs) → 💰
(not exactly that simple with 1PB of log files)
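The cat → grep → sed pipeline above can be sketched in code. This is a toy version under loud assumptions: a hypothetical log-line format (a `user_id=` query parameter) and illustrative endpoint names, nothing like the real 1PB Hadoop job:

```scala
// Toy version of the log pipeline (hypothetical log format and endpoints).
val targetApis = Set("/1.1/statuses/update", "/1.1/statuses/home_timeline")

// "grep": keep only the APIs we want to test
def matchesTarget(logLine: String): Boolean =
  targetApis.exists(api => logLine.contains(api))

// "sed": swap real user ids for a test-user placeholder
def sanitize(logLine: String): String =
  logLine.replaceAll("user_id=\\d+", "user_id=TEST_USER")

val logs = List(
  "GET /1.1/statuses/home_timeline?user_id=12345",
  "GET /1.1/friends/list?user_id=99999", // not a target API, dropped
  "POST /1.1/statuses/update?user_id=54321"
)

val replayable = logs.filter(matchesTarget).map(sanitize)
replayable.foreach(println)
```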
21. Building the first test suite
Generate a histogram of API calls, pick a slice (top 10 APIs)
• Do this often - traffic patterns change as the product changes
• Avoids over-testing things that aren't popular
• Sanitize the parameters (replace actual user ids with test users, fake products, etc.)
• Automated combinations may not make sense - some heuristics required here
Build a set of test users that realistically mirror site usage
• Model the entire population given a set of parameters
• By far the trickiest part on a social network (most people just hardcode these)
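The histogram step can be sketched minimally - endpoint names here are illustrative, and a top-3 slice stands in for the top 10:

```scala
// Build a histogram of API calls and keep the most popular slice.
val calls = List(
  "home_timeline", "home_timeline", "home_timeline",
  "update_status", "update_status",
  "user_lookup",
  "rarely_used_endpoint"
)

val histogram: Map[String, Int] =
  calls.groupBy(identity).map { case (api, hits) => api -> hits.size }

// pick the most popular endpoints - avoids over-testing unpopular ones
val topApis: List[String] =
  histogram.toList.sortBy { case (_, count) => -count }.map(_._1).take(3)

println(topApis)
```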
22. Building the first test suite
Test Suite automation
Jobs constantly evaluating the Top 10 APIs
into a (Hadoop) dataset of test targets
Jobs generating a few million ”fake users”
and their relationships (billions of
relationships)
23. Be extra extra extra extra careful not to affect actual users
Introduce the concept of test users in your infrastructure
• Isolates interactions (a regular user won't ever see a test user)
• Doesn't confuse business metrics
• Pass an HTTP header during tests to inform services you only want test users
• Very hard to introduce when you have a microservice jungle
Start with read APIs until you're absolutely comfortable with your heuristics
About those "fake users"...
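One way the header-based isolation could look, as a sketch - `X-Test-Traffic` is a hypothetical header name (not necessarily what Twitter used), and the request type is simplified:

```scala
// Sketch: services only let test traffic touch test users.
case class Request(headers: Map[String, String], userId: String)

val testUserIds = Set("TEST_USER_1", "TEST_USER_2")

// the stress tester sets this header on every request it sends
def isTestTraffic(req: Request): Boolean =
  req.headers.get("X-Test-Traffic").contains("true")

// test traffic may only reach test users; regular traffic is unaffected
def allowed(req: Request): Boolean =
  !isTestTraffic(req) || testUserIds.contains(req.userId)

val stressReq = Request(Map("X-Test-Traffic" -> "true"), "TEST_USER_1")
val badReq    = Request(Map("X-Test-Traffic" -> "true"), "real_user_42")
val normalReq = Request(Map.empty, "real_user_42")

println(allowed(stressReq)) // true
println(allowed(badReq))    // false
println(allowed(normalReq)) // true
```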
26. Staging versus production
Staging environments may not be practical
• Twitter, Google, Netflix, etc. - infra too big to have a full end-to-end staging → test in production with fake users
• Stress testing in staging won't detect production bottlenecks
27. Generating test load
Spin up load generators (deploy outside the DC/zone/area you are testing)
Load the datasets
Replay the traffic logs you generated
Collect metrics & results
Prototype - low thousands of RPS: a single service loading from a CSV
V1 - low millions of RPS: multiple instances, load events from a queue
V2 - millions to billions of RPS: multiple instances, multiple zones, multiple queues
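The "V1" shape - multiple load-generator instances draining events from a shared queue - can be sketched like this, with an in-process `ConcurrentLinkedQueue` and threads standing in for the real distributed queue and worker fleet:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch}
import java.util.concurrent.atomic.AtomicInteger

// Sketch: several workers drain load events from a shared queue.
val queue = new ConcurrentLinkedQueue[String]()
(1 to 100).foreach(i => queue.add(s"request-$i"))

val workers = 4
val done = new CountDownLatch(workers)
val sent = new AtomicInteger(0)

for (_ <- 1 to workers) {
  new Thread(() => {
    var event = queue.poll()
    while (event != null) {
      // here a real worker would replay the event against the target service
      sent.incrementAndGet()
      event = queue.poll()
    }
    done.countDown()
  }).start()
}

done.await()
println(s"replayed ${sent.get()} events")
```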
29. Generating test load
https://github.com/twitter/iago (Scala)
https://github.com/tsenart/vegeta (Golang)
https://jmeter.apache.org/ (Java)
// Make 1000 HTTP requests at a roughly constant rate of 10/sec
val client = ClientBuilder().codec(http()).hosts("twitter.com:80").build()
val transport = new FinagleTransport(FinagleService(client))
val consumer = new RequestConsumer(() => new PoissonProcess(10))
// add 1000 requests to the queue
for (i <- 1 to 1000) {
  consumer.offer(new ParrotRequest(uri = Uri("/jack/status/20", Nil)))
}
// start sending
transport.start()
consumer.start()
// wait for the consumer to exhaust the queue
while (consumer.size > 0) {
  Thread.sleep(100)
}
// shutdown
consumer.shutdown()
transport.close()
31. COLLECT ALL THE METRICS
Real-time collection is needed for circuit breaking and hitting load goals
Long-term storage of metrics is needed for postmortems & planning
34. PARENTAL ADVISORY: Circuit Break
Have a stop button at hand (not necessarily a physical button)
Cancelling traffic must be fast
😔 V1: de-scheduling Mesos jobs → ~1-5 minutes → downtime
😲 V2: DNS-level toggle (Zookeeper) → sub-second emergency brake
https://github.com/herval/groundcontrol
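A minimal in-process sketch of the emergency-brake idea. The real setup toggled a flag via DNS/ZooKeeper so thousands of distributed workers stop sub-second; here a shared `AtomicBoolean` stands in for that distributed toggle, and the simulated operator trip is purely illustrative:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// A kill switch every load worker checks before each request.
class KillSwitch {
  private val tripped = new AtomicBoolean(false)
  def trip(): Unit = tripped.set(true)
  def isTripped: Boolean = tripped.get()
}

val brake = new KillSwitch

def sendLoad(requests: List[String]): Int = {
  var sent = 0
  for (r <- requests if !brake.isTripped) {
    // transport.send(r) would go here
    sent += 1
    if (sent == 2) brake.trip() // simulate an operator hitting the button
  }
  sent
}

val sent = sendLoad(List("r1", "r2", "r3", "r4"))
println(s"sent $sent of 4 requests before the brake tripped")
```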
38. The test plan
Proper Stress Testing is disruptive by nature
Communication is key 🔑
• Avoid confusion among teams
• Prevent people from firefighting during the test
Timing is fundamental ⏰
• Testing too close to large events = surprises
• Testing too far = surprises
• Smaller, regular tests + planning schedule = win
39. The test plan
Get input and signoff from service owners on goals
Build a unified schedule
Build trust before building the software
40. The test plan
Communicate tests as they happen (and their results)
Follow up and fix bottlenecks
Build trust before building the software
42. What did we learn?
Build degradable systems to absorb spikes (automatic feature degradation, feature toggles, circuit breakers)
Build self-healing systems to prevent slow decay (auto-scaling, self-monitoring)
43. But did it work?
Japanese New Year (2015): no incidents
Japanese New Year (2016): no incidents
Japanese New Year (2017): no incidents
World Cup (2018): 🤞🇧🇷
44. Acknowledgments
Special thanks to the Reliability Team (past & present members)
- Ali Alzabarah
- Esteban Kuber
- Kyle Laplante
- Mazdak Hashemi
- Niranjan Baiju
- Pascal Borghino
- Ramya Krishnan
- Steve Salevan
And all the SRE heroes that save the day, every day.
In the film, the goal was to bring down the castle - in real life, it was Twitter that went down
Staging doesn't reflect production (unless you duplicate the entire infra)
Watch for I/O bottlenecks (you may need more than one instance for big tests)
2,000 workers, half a second
None of this matters if you're not good at planning and communicating