Failure testing is a fundamental piece of Twitter’s reliability engineering. Over the years, we developed a rich toolchain that allows us to detect and fix scalability problems long before they happen. In this talk, we’ll cover some of the strategies we employ and discuss our always evolving approach to API stress testing and its “unit test” equivalent, redline testing.
Stress Testing at Twitter: a tale of New Year's Eves
1. Stress Testing as a Service
A tale of New Year’s Eves at Twitter
@herval
2. Twitter?
📈 Traffic is always growing
💥 Traffic is spiky
"The microphone for the masses"
3. Castle in the Sky, 2011
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
8k Tweets/second
The Spell of Destruction is cast with the word "balse" to bring down the magic city.
5. Japanese New Year, 2012
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
16k Tweets/second
Tweeting "happy new year!" exactly at midnight is a lucky charm.
7. Brazil vs Germany, 2014
https://www.theguardian.com/technology/2014/jul/15/twitter-world-cup-tweets-germany-brazil
35m Tweets during the game
30k Tweets/second on each of the 7 goals
10. About me
@herval
Engineering Manager - Insights & Reliability
• 2 years at Twitter
• Startup founder (3x), SoundCloud, Pivotal Labs
• Have a dog, a wife and way too many mechanical keyboards
Managing tooling to reduce the number of "oh no" moments.
11. "Embracing Failure 24/7" (2016)
https://www.wired.com/brandlab/2017/10/juniper-mazdak-hashemi/
If you don't break your site,
someone will break it for you
• How do you know the weak links?
• How do you convince people to let you break things?
• How do you automate that?
12. Good news!
Some spikes are foreseeable
• World Cup
• Super Bowl
• Oscars
• New Year's Eve
Some aren't, but you can prepare
Social network spikes: mostly breaking news or memes
Know your patterns!
13. "Breaking News" style traffic
Sudden, huge increases in traffic
Hard to react to even with auto-scaling (booting an AWS instance takes longer than the entire spike!)
Failures are sudden
Allocating infrastructure all the time for these events is wasteful
14. The "meme" traffic
Smaller in amplitude, longer
Auto-scaling systems handle this well
Leads to slow implosions (service A slowly running out of memory, service B accumulating logs, etc.)
18. The pyramid of reliability testing
Redline Testing → Stress Testing → Omni Testing (end-to-end, full service)
19. The Bikeshed of Stress Tests
👍 Stress testing
Hit a small number of APIs with predetermined load goals and duration
🧐 Redline testing
Test an API/service until it breaks (and measure why it broke)
🧐 Omni testing
Simulating an entire site load, for a predetermined duration
20. Building the first test suite
Initial dataset covering big historical scenarios
"Tweets with images"
Retweet storms
"Happy new year!!!!!!!!!!!"
Test a specific scenario that is bound to happen again
cat (traffic logs) → grep (the APIs you want to test) → sed (user IDs) → 💰
(not exactly that simple with 1PB of log files)
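The cat → grep → sed pipeline above can be sketched in code. This is a toy version under loud assumptions: a hypothetical log-line format (a `user_id=` query parameter) and illustrative endpoint names, nothing like the real 1PB Hadoop job:

```scala
// Toy version of the log pipeline (hypothetical log format and endpoints).
val targetApis = Set("/1.1/statuses/update", "/1.1/statuses/home_timeline")

// "grep": keep only the APIs we want to test
def matchesTarget(logLine: String): Boolean =
  targetApis.exists(api => logLine.contains(api))

// "sed": swap real user ids for a test-user placeholder
def sanitize(logLine: String): String =
  logLine.replaceAll("user_id=\\d+", "user_id=TEST_USER")

val logs = List(
  "GET /1.1/statuses/home_timeline?user_id=12345",
  "GET /1.1/friends/list?user_id=99999", // not a target API, dropped
  "POST /1.1/statuses/update?user_id=54321"
)

val replayable = logs.filter(matchesTarget).map(sanitize)
replayable.foreach(println)
```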
21. Building the first test suite
Generate a histogram of API calls, pick a slice (top 10 APIs)
• Do this often - traffic patterns change as the product changes
• Avoids over-testing things that aren't popular
• Sanitize the parameters (replace actual user ids with test users, fake products, etc.)
• Automated combinations may not make sense - some heuristics required here
Build a set of test users that realistically mirror site usage
• Model the entire population given a set of parameters
• By far the trickiest part on a social network (most people just hardcode these)
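The histogram step can be sketched minimally - endpoint names here are illustrative, and a top-3 slice stands in for the top 10:

```scala
// Build a histogram of API calls and keep the most popular slice.
val calls = List(
  "home_timeline", "home_timeline", "home_timeline",
  "update_status", "update_status",
  "user_lookup",
  "rarely_used_endpoint"
)

val histogram: Map[String, Int] =
  calls.groupBy(identity).map { case (api, hits) => api -> hits.size }

// pick the most popular endpoints - avoids over-testing unpopular ones
val topApis: List[String] =
  histogram.toList.sortBy { case (_, count) => -count }.map(_._1).take(3)

println(topApis)
```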
22. Building the first test suite
Test Suite automation
Jobs constantly evaluating the Top 10 APIs
into a (Hadoop) dataset of test targets
Jobs generating a few million ”fake users”
and their relationships (billions of
relationships)
23. Be extra extra extra extra careful not to affect actual users
Introduce the concept of test users in your infrastructure
• Isolates interactions (a regular user won't ever see a test user)
• Doesn't confuse business metrics
• Pass an HTTP header during tests to inform services you only want test users
• Very hard to introduce when you have a microservice jungle
Start with read APIs until you're absolutely comfortable with your heuristics
About those "fake users"...
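One way the header-based isolation could look, as a sketch - `X-Test-Traffic` is a hypothetical header name (not necessarily what Twitter used), and the request type is simplified:

```scala
// Sketch: services only let test traffic touch test users.
case class Request(headers: Map[String, String], userId: String)

val testUserIds = Set("TEST_USER_1", "TEST_USER_2")

// the stress tester sets this header on every request it sends
def isTestTraffic(req: Request): Boolean =
  req.headers.get("X-Test-Traffic").contains("true")

// test traffic may only reach test users; regular traffic is unaffected
def allowed(req: Request): Boolean =
  !isTestTraffic(req) || testUserIds.contains(req.userId)

val stressReq = Request(Map("X-Test-Traffic" -> "true"), "TEST_USER_1")
val badReq    = Request(Map("X-Test-Traffic" -> "true"), "real_user_42")
val normalReq = Request(Map.empty, "real_user_42")

println(allowed(stressReq)) // true
println(allowed(badReq))    // false
println(allowed(normalReq)) // true
```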
26. Staging versus production
Staging environments may not be practical
• Twitter, Google, Netflix, etc. - infra too big to have a full end-to-end staging → test in production with fake users
• Stress testing in staging won't detect production bottlenecks
27. Generating test load
Spin up load generators (deploy outside the DC/zone/area you are testing)
Load the datasets
Replay the traffic logs you generated
Collect metrics & results
Prototype - low thousands of RPS: a single service loading from a CSV
V1 - low millions of RPS: multiple instances, load events from a queue
V2 - millions to billions of RPS: multiple instances, multiple zones, multiple queues
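The "V1" shape - multiple load-generator instances draining events from a shared queue - can be sketched like this, with an in-process `ConcurrentLinkedQueue` and threads standing in for the real distributed queue and worker fleet:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch}
import java.util.concurrent.atomic.AtomicInteger

// Sketch: several workers drain load events from a shared queue.
val queue = new ConcurrentLinkedQueue[String]()
(1 to 100).foreach(i => queue.add(s"request-$i"))

val workers = 4
val done = new CountDownLatch(workers)
val sent = new AtomicInteger(0)

for (_ <- 1 to workers) {
  new Thread(() => {
    var event = queue.poll()
    while (event != null) {
      // here a real worker would replay the event against the target service
      sent.incrementAndGet()
      event = queue.poll()
    }
    done.countDown()
  }).start()
}

done.await()
println(s"replayed ${sent.get()} events")
```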
29. Generating test load
https://github.com/twitter/iago (Scala)
https://github.com/tsenart/vegeta (Golang)
https://jmeter.apache.org/ (Java)
// Make 1000 HTTP requests at a roughly constant rate of 10/sec
val client = ClientBuilder().codec(http()).hosts("twitter.com:80").build()
val transport = new FinagleTransport(FinagleService(client))
val consumer = new RequestConsumer(() => new PoissonProcess(10))
// add 1000 requests to the queue
for (i <- 1 to 1000) {
  consumer.offer(new ParrotRequest(uri = Uri("/jack/status/20", Nil)))
}
// start sending
transport.start()
consumer.start()
// wait for the consumer to exhaust the queue
while (consumer.size > 0) {
  Thread.sleep(100)
}
// shutdown
consumer.shutdown()
transport.close()
31. COLLECT ALL THE METRICS
Real-time collection is needed for circuit breaking and hitting load goals
Long-term storage of metrics is needed for postmortems & planning
34. PARENTAL ADVISORY: Circuit Break
Have a stop button at hand (not necessarily a physical button)
Cancelling traffic must be fast
😔 V1: de-scheduling Mesos jobs → ~1-5 minutes → downtime
😲 V2: DNS-level toggle (Zookeeper) → sub-second emergency brake
https://github.com/herval/groundcontrol
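A minimal in-process sketch of the emergency-brake idea. The real setup toggled a flag via DNS/ZooKeeper so thousands of distributed workers stop sub-second; here a shared `AtomicBoolean` stands in for that distributed toggle, and the simulated operator trip is purely illustrative:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// A kill switch every load worker checks before each request.
class KillSwitch {
  private val tripped = new AtomicBoolean(false)
  def trip(): Unit = tripped.set(true)
  def isTripped: Boolean = tripped.get()
}

val brake = new KillSwitch

def sendLoad(requests: List[String]): Int = {
  var sent = 0
  for (r <- requests if !brake.isTripped) {
    // transport.send(r) would go here
    sent += 1
    if (sent == 2) brake.trip() // simulate an operator hitting the button
  }
  sent
}

val sent = sendLoad(List("r1", "r2", "r3", "r4"))
println(s"sent $sent of 4 requests before the brake tripped")
```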
38. The test plan
Proper Stress Testing is disruptive by nature
Communication is key 🔑
• Avoid confusion among teams
• Prevent people from firefighting during the test
Timing is fundamental ⏰
• Testing too close to large events = surprises
• Testing too far = surprises
• Smaller, regular tests + planning schedule = win
39. The test plan
Get input and signoff from service owners on goals
Build a unified schedule
Build trust before building the software
40. The test plan
Communicate tests as they happen (and their results)
Follow up and fix bottlenecks
Build trust before building the software
42. What did we learn?
Build degradable systems to absorb spikes (automatic feature degradation, feature toggles, circuit breakers)
Build self-healing systems to prevent slow decay (auto-scaling, self-monitoring)
43. But did it work?
Japanese New Year (2015): no incidents
Japanese New Year (2016): no incidents
Japanese New Year (2017): no incidents
World Cup (2018): 🤞🇧🇷
44. Acknowledgments
Special thanks to the Reliability Team (past & present members)
- Ali Alzabarah
- Esteban Kuber
- Kyle Laplante
- Mazdak Hashemi
- Niranjan Baiju
- Pascal Borghino
- Ramya Krishnan
- Steve Salevan
And all the SRE heroes that save the day, every day.
In the film, the goal was to bring down the castle - in real life, it was Twitter that went down
Staging doesn't reflect production (unless you duplicate the entire infra)
Watch for I/O bottlenecks (you may need more than one instance for big tests)
2,000 workers, half a second
None of this matters if you're not good at planning and communicating