Stress Testing as a Service
A tale of New Year’s Eves at Twitter
@herval
Twitter?
📈 Traffic is always growing
💥 Traffic is spiky
"The microphone for the masses"
Castle in the Sky, 2011
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
8k Tweets/second
The Spell of Destruction is cast with the word "balse" to bring down the magic city.
© Alex Norris http://webcomicname.com/
Japanese New Year, 2012
https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/
16k Tweets/second
Tweeting "happy new year!" exactly at midnight is a lucky charm.
Brazil vs Germany, 2014
https://www.theguardian.com/technology/2014/jul/15/twitter-world-cup-tweets-germany-brazil
35m Tweets during the game
30k Tweets/second on each of the 7 goals
Gooooooallllllllll
About me
@herval
Engineering Manager - Insights & Reliability
• 2 years at Twitter
• Startup founder (3x), SoundCloud, Pivotal Labs
• Have a dog, a wife and way too many mechanical keyboards
Managing tooling to reduce the number of "oh no" moments.
"Embracing Failure 24/7" (2016)
https://www.wired.com/brandlab/2017/10/juniper-mazdak-hashemi/
If you don't break your site, someone will break it for you
• How do you know the weak links?
• How do you convince people to let you break things?
• How do you automate that?
Good news!
Some spikes are foreseeable
• World Cup
• Super Bowl
• Oscars
• New Year's Eve
Some aren't, but you can prepare
• Social network spikes: mostly breaking news or memes
Know your patterns!
"Breaking News" style traffic
Sudden, huge increases in traffic
• Hard to react even with auto-scaling (booting an AWS instance takes longer than the entire spike!)
• Failures are sudden
• Allocating infrastructure all the time for these events is wasteful
The "meme" traffic
Smaller in amplitude, longer in duration
• Auto-scaling systems handle this well
• Leads to slow implosions (service A slowly running out of memory, service B accumulating logs, etc.)
Before we begin...
S.R.E.
https://landing.google.com/sre/book/chapters/part3.html
Adopt sane practices before you stress test anything
• Centralized Logging
• Monitoring
• Incident Response
• Postmortems
Deciding what to test
The pyramid of reliability testing
[Figure: pyramid with the tiers Redline Testing, Stress Testing and Omni Testing; scope labels: E2E, E2E, Full service]
The Bikeshed of Stress Tests
👍 Stress testing
• Hit a small number of APIs, predetermined load goals and duration
🧐 Redline testing
• Test an API/service until it breaks (and measure why it broke)
🧐 Omni testing
• Simulate an entire site load, for a predetermined duration
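The redline idea (raise the load until something breaks, then record where) can be sketched as a simple ramp loop. This is only an illustration, not the tooling described in the talk: `Redline`, `ramp` and the `probe` hook are hypothetical names, and `probe` is assumed to apply the given load and return the observed error rate.

```scala
object Redline {
  // Outcome of a ramp: the last rate that stayed healthy and the first one that did not (if any).
  final case class Result(lastHealthyRps: Int, firstFailingRps: Option[Int])

  // Step the load up from startRps in increments of stepRps until the observed
  // error rate exceeds maxErrorRate or maxRps is reached.
  // `probe` is a hypothetical hook: it applies the load and returns the error rate in [0, 1].
  def ramp(startRps: Int, stepRps: Int, maxRps: Int, maxErrorRate: Double)(probe: Int => Double): Result = {
    require(stepRps > 0, "step must be positive")
    @annotation.tailrec
    def loop(rps: Int, healthy: Int): Result =
      if (rps > maxRps) Result(healthy, None)
      else if (probe(rps) > maxErrorRate) Result(healthy, Some(rps))
      else loop(rps + stepRps, rps)
    loop(startRps, 0)
  }
}
```

A real run would also capture metrics at the failing step, since the point of a redline test is to learn why the service broke and not only where.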
Building the first test suite
Initial dataset covering big historical scenarios
• "Tweets with images"
• Retweet storms
• "Happy new year!!!!!!!!!!!"
Test a specific scenario that is bound to happen again
cat (traffic logs) → grep (the APIs you want to test) → sed (user ids) → 💰
(not exactly that simple with 1PB of log files)
Building the first test suite
Generate a histogram of API calls, pick a slice (top 10 APIs)
• Do this often, traffic patterns change as product changes
• Avoids over-testing things that aren't popular
• Sanitize the parameters (replace actual user ids with test users, fake products, etc)
• Automated combinations may not make sense - some heuristics required here
Build a set of test users that realistically mirror site usage
• Model the entire population given a set of parameters
• By far the trickiest part on a social network (most people just hardcode these)
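On a small scale, the histogram-and-sanitize steps above could look like the sketch below. The log format (`<endpoint> <userId>` per line) and all names (`TestTargets`, `Call`, `topEndpoints`, `sanitize`) are assumptions made for illustration; at the scale the slides describe this would run as batch jobs rather than in memory.

```scala
object TestTargets {
  // One parsed request from a traffic log (hypothetical, simplified shape).
  final case class Call(endpoint: String, userId: Long)

  // Parse lines of the assumed form "<endpoint> <userId>"; skip anything malformed.
  def parse(lines: Seq[String]): Seq[Call] =
    lines.flatMap { line =>
      line.trim.split("\\s+") match {
        case Array(endpoint, id) if id.nonEmpty && id.forall(_.isDigit) =>
          Some(Call(endpoint, id.toLong))
        case _ => None
      }
    }

  // Histogram of calls per endpoint, most popular first, truncated to the top n.
  def topEndpoints(calls: Seq[Call], n: Int): Seq[(String, Int)] =
    calls.groupBy(_.endpoint)
      .map { case (endpoint, cs) => (endpoint, cs.size) }
      .toSeq
      .sortBy { case (endpoint, count) => (-count, endpoint) }
      .take(n)

  // Keep only calls to the selected endpoints and swap every real user id
  // for a test user id, so that no real account is touched during replay.
  def sanitize(calls: Seq[Call], keep: Set[String], testUserIds: Vector[Long]): Seq[Call] = {
    require(testUserIds.nonEmpty, "need at least one test user")
    calls.filter(c => keep(c.endpoint))
      .map(c => c.copy(userId = testUserIds((c.userId % testUserIds.size).toInt)))
  }
}
```

Mapping real ids onto test ids with a modulo keeps the replay deterministic, but it does not preserve the follower graph; that is the part the slides call the trickiest.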
Building the first test suite
Test Suite automation
• Jobs constantly evaluating the Top 10 APIs into a (Hadoop) dataset of test targets
• Jobs generating a few million "fake users" and their relationships (billions of relationships)
About those "fake users"...
Be extra extra extra extra careful not to affect actual users
Introduce the concept of test users in your infrastructure
• Isolates interactions (a regular user won't ever see a test user)
• Doesn't confuse business metrics
• Pass an HTTP header during tests to inform services you only want test users
• Very hard to introduce when you have a microservice jungle
• Start with read APIs until you're absolutely comfortable with your heuristics
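The isolation rule (test requests only see test users, regular requests never do) can be expressed as a small filter. This is a sketch under assumptions: the header name `X-Test-Traffic`, its `"true"` value and the `User` shape are hypothetical, since the slides only say that an HTTP header marks test traffic.

```scala
object TestUserIsolation {
  final case class User(id: Long, isTestUser: Boolean)

  // Hypothetical header name; the slides do not name the actual header.
  val TestHeader = "X-Test-Traffic"

  // True when the request carries the test marker (header names are case-insensitive).
  def isTestRequest(headers: Map[String, String]): Boolean =
    headers.exists { case (name, value) => name.equalsIgnoreCase(TestHeader) && value == "true" }

  // Test requests only ever see test users; regular requests never see them.
  def visibleUsers(headers: Map[String, String], candidates: Seq[User]): Seq[User] = {
    val wantTestUsers = isTestRequest(headers)
    candidates.filter(_.isTestUser == wantTestUsers)
  }
}
```

Every service on the request path has to forward the marker and apply the same rule, which is why this gets hard in a microservice jungle.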
Executing the tests
Staging versus production
Staging environments may not be practical
• Twitter, Google, Netflix, etc.: infra too big to have a full end-to-end staging → test in production with fake users
• Stress testing in staging won't detect production bottlenecks
Generating test load
Spin up load generators
• Deploy outside the DC/zone/area you are testing
• Load the datasets
• Collect metrics & results
Replay the traffic logs you generated
• Low thousands of RPS: a single service loading from a CSV → prototype
• Low millions of RPS: multiple instances, load events from a queue → V1
• Millions to billions of RPS: multiple instances, multiple zones, multiple queues → V2
[Diagram: Execution Coordinator, Test Loads, Queue, Load Generators (in the load generation DCs/zones), Target APIs and Target Systems (in the DC/zone under test), Service Metrics, Stats collector]
Generating test load
https://github.com/twitter/iago (Scala)
https://github.com/tsenart/vegeta (Golang)
https://jmeter.apache.org/ (Java)
// Make 1000 HTTP requests at a roughly constant rate of 10/sec
val client = ClientBuilder().codec(http()).hosts("twitter.com:80").build()
val transport = new FinagleTransport(FinagleService(client))
val consumer = new RequestConsumer(() => new PoissonProcess(10))
// add 1000 requests to the queue
for (i <- 1 to 1000) {
  consumer.offer(new ParrotRequest(uri = Uri("/jack/status/20", Nil)))
}
// start sending
transport.start()
consumer.start()
// wait for the consumer to exhaust the queue
while (consumer.size > 0) {
  Thread.sleep(100)
}
// shutdown
consumer.shutdown()
transport.close()
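The "roughly constant rate" in the snippet above comes from pacing requests as a Poisson process: the gaps between sends are exponentially distributed around the target rate. The sketch below shows that pacing with the standard library only; `PoissonPacer` and `gapsMillis` are hypothetical names and this is not Iago's implementation.

```scala
import scala.util.Random

object PoissonPacer {
  // Inter-arrival gaps (in milliseconds) for `count` requests at an average of
  // `ratePerSec` requests per second. The gaps of a Poisson process are
  // exponentially distributed, sampled here by inverse transform: -ln(1 - U) / rate.
  def gapsMillis(ratePerSec: Double, count: Int, rng: Random): Vector[Double] = {
    require(ratePerSec > 0, "rate must be positive")
    Vector.fill(count)(-math.log(1.0 - rng.nextDouble()) / ratePerSec * 1000.0)
  }
}
```

At 10 requests per second the gaps average about 100 ms, but individual gaps vary, which is closer to real user traffic than a fixed interval.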
Monitoring
COLLECT ALL THE METRICS
Real-time collection is needed to circuit break and to hit goals
Long-term storage of metrics is needed for postmortems & planning
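One way real-time metrics can drive the circuit break is a rolling error-rate check over the latest samples. The sketch below is illustrative only: `AbortGuard`, `Sample`, the window size and the threshold are assumptions, not the metrics stack from the talk.

```scala
object AbortGuard {
  // Outcome counts reported by the load generators for one collection interval.
  final case class Sample(requests: Long, errors: Long)

  // Error rate over the most recent `window` samples; 0.0 when there is no traffic yet.
  def errorRate(samples: Seq[Sample], window: Int): Double = {
    val recent = samples.takeRight(window)
    val total = recent.map(_.requests).sum
    if (total == 0L) 0.0 else recent.map(_.errors).sum.toDouble / total
  }

  // Decide whether the test should trip the circuit break.
  def shouldAbort(samples: Seq[Sample], window: Int, maxErrorRate: Double): Boolean =
    errorRate(samples, window) > maxErrorRate
}
```

The same samples, stored long term, are what a postmortem or capacity plan would later be built on.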
Metrics collection & management
PARENTAL ADVISORY: Circuit Break
Have a stop button at hand
(not necessarily a physical button)
Cancelling traffic must be fast
😔 V1: de-scheduling Mesos jobs → ~1-5 minutes → downtime
😲 V2: DNS-level toggle (ZooKeeper) → sub-second emergency brake
https://github.com/herval/groundcontrol
https://zookeeper.apache.org/
import com.twitter.zookeeper.ZooKeeperClient
import org.apache.zookeeper.CreateMode
val zk = new ZooKeeperClient("zookeeper.local:2181")
// create the control node for this test run
zk.create("/test_deadbeef_circuit_break", "ok".getBytes, CreateMode.PERSISTENT)
// abort the test as soon as the node's value changes to "abort"
zk.watchNode("/test_deadbeef_circuit_break", { (data: Option[Array[Byte]]) =>
  data match {
    case Some(d) => if (new String(d) == "abort") testService.abort()
    case None => println("Test is done")
  }
})
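On the load generator side, a watch like the one above only helps if the send loop checks the flag often. A minimal sketch of that loop, assuming a shared in-process flag (`StopButton`, `Brake` and `run` are hypothetical names, and the `send` function stands in for the real request call):

```scala
import java.util.concurrent.atomic.AtomicBoolean

object StopButton {
  // Shared flag; in the setup sketched above it would be flipped by the ZooKeeper watch.
  final class Brake {
    private val aborted = new AtomicBoolean(false)
    def abort(): Unit = aborted.set(true)
    def isAborted: Boolean = aborted.get()
  }

  // Send requests one by one, checking the brake before each send.
  // Returns how many requests were actually sent.
  def run(requests: Seq[String], brake: Brake)(send: String => Unit): Int = {
    val it = requests.iterator
    var sent = 0
    while (it.hasNext && !brake.isAborted) {
      send(it.next())
      sent += 1
    }
    sent
  }
}
```

Checking before every send keeps the reaction time bounded by a single request, which is what makes a sub-second brake plausible.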
Planning & Communication
The test plan
Proper Stress Testing is disruptive by nature
Communication is key 🔑
• Avoid confusion among teams
• Prevent people from firefighting during the test
Timing is fundamental ⏰
• Testing too close to large events = surprises
• Testing too far = surprises
• Smaller, regular tests + planning schedule = win
The test plan
Build trust before building the software
• Get input and signoff from service owners on goals
• Build a unified schedule
• Communicate tests as they happen (and their results)
• Follow up and fix bottlenecks
yeah!
What did we learn?
Build degradable systems to absorb spikes (automatic feature degradation, feature toggles, circuit breaks)
Build self-healing systems to prevent slow decay (auto-scaling, self-monitoring)
But did it work?
Japanese New Year (2015): no incidents
Japanese New Year (2016): no incidents
Japanese New Year (2017): no incidents
World Cup (2018): 🤞🇧🇷
Acknowledgments
Special thanks to the Reliability Team (past & present members)
- Ali Alzabarah
- Esteban Kuber
- Kyle Laplante
- Mazdak Hashemi
- Niranjan Baiju
- Pascal Borghino
- Ramya Krishnan
- Steve Salevan
And all the SRE heroes that save the day, every day.
Thank you!
@herval

 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Stress Testing at Twitter: a tale of New Year Eves

  • 1. Stress Testing as a Service A tale of New Year’s Eves at Twitter @herval
  • 2. Twitter? 📈 Traffic is always growing 💥 Traffic is spiky "The microphone for the masses"
  • 3. Castle in the Sky, 2011 https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/ 8k Tweets/second The Spell of Destruction is cast with the word "balse" to bring down the magic city.
  • 4. © Alex Norris http://webcomicname.com/
  • 5. Japanese New Year, 2012 https://www.wired.com/2014/09/how-twitter-handles-traffic-from-the-japanese-who-tweet-like-no-one-else/ 16k Tweets/second Tweeting "happy new year!" exactly at midnight is a lucky charm.
  • 7. Brazil vs Germany, 2014 https://www.theguardian.com/technology/2014/jul/15/twitter-world-cup-tweets-germany-brazil 35m Tweets during the game 30k Tweets/second on each of the 7 goals
  • 10. About me @herval Engineering Manager - Insights & Reliability • 2 years at Twitter • Startup founder (3x), SoundCloud, Pivotal Labs • Have a dog, a wife and way too many mechanical keyboards Managing tooling to reduce the number of "oh no" moments.
  • 11. "Embracing Failure 24/7" (2016) https://www.wired.com/brandlab/2017/10/juniper-mazdak-hashemi/ If you don't break your site, someone will break it for you • How do you know the weak links? • How do you convince people to let you break things? • How do you automate that?
  • 12. Good news! Some spikes are foreseeable • World Cup • Super Bowl • Oscars • New Year's Eve Some aren't, but you can prepare • Social network spikes: mostly breaking news or memes Know your patterns!
  • 13. "Breaking News" style traffic Sudden, huge increases in traffic • Hard to react even with auto-scaling (booting an AWS instance takes longer than the entire spike!) • Failures are sudden • Allocating infrastructure all the time for these events is wasteful
  • 14. The "meme" traffic Smaller in amplitude, longer • Auto-scaling systems handle this well • Leads to slow implosions (service A slowly running out of memory, service B accumulating logs, etc.)
  • 16. S.R.E. https://landing.google.com/sre/book/chapters/part3.html Adopt sane practices before you stress test anything • Centralized Logging • Monitoring • Incident Response • Postmortems
  • 18. The pyramid of reliability testing (diagram; labels: Redline Testing, Stress Testing, Omni Testing, E2E, Full service)
  • 19. The Bikeshed of Stress Tests 👍 Stress testing → Hit a small number of APIs, predetermined load goals and duration 🧐 Redline testing → Test an API/service until it breaks (and measure why it broke) 🧐 Omni testing → Simulating an entire site load, for a predetermined duration
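
The redline idea above (raise load until the service breaks, then record where) can be sketched as a simple ramp. This is a minimal illustration, not the tooling described in the talk: `findRedline`, its parameters, and the `probe` callback (which is assumed to drive load at a given RPS and return the observed error rate) are all hypothetical names.

```scala
object Redline {
  /**
   * Ramps load from startRps in increments of stepRps (assumed > 0), up to maxRps.
   * `probe` is a hypothetical callback: it applies the given RPS and returns
   * the observed error rate. Returns the last RPS that stayed within
   * maxErrorRate, or None if even the first step failed.
   */
  def findRedline(startRps: Int, stepRps: Int, maxRps: Int, maxErrorRate: Double)(
      probe: Int => Double): Option[Int] = {
    val healthy = Iterator
      .iterate(startRps)(_ + stepRps)
      .takeWhile(_ <= maxRps)
      .takeWhile(rps => probe(rps) <= maxErrorRate) // stop at the first unhealthy step
      .toSeq
    healthy.lastOption
  }
}
```

In a real test the interesting output is less the number itself than the metrics captured at the step where the service broke.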
  • 20. Building the first test suite Initial dataset covering big historical scenarios • "Tweets with images" • Retweet storms • "Happy new year!!!!!!!!!!!" Test a specific scenario that is bound to happen again cat (traffic logs) → grep (the APIs you want to test) → sed (user ids) → 💰 (not exactly that simple with 1PB of log files)
  • 21. Building the first test suite Generate a histogram of API calls, pick a slice (top 10 APIs) • Do this often, traffic patterns change as product changes • Avoids over-testing things that aren't popular • Sanitize the parameters (replace actual user ids with test users, fake products, etc) • Automated combinations may not make sense - some heuristics required here Build a set of test users that realistically mirror site usage • Model the entire population given a set of parameters • By far the trickiest part on a social network (most people just hardcode these)
  • 22. Building the first test suite Test Suite automation • Jobs constantly evaluating the Top 10 APIs into a (Hadoop) dataset of test targets • Jobs generating a few million "fake users" and their relationships (billions of relationships)
  • 23. Be extra extra extra extra careful not to affect actual users Introduce the concept of test users in your infrastructure • Isolates interactions (a regular user won't ever see a test user) • Doesn't confuse business metrics • Pass an HTTP header during tests to inform services you only want test users • Very hard to introduce when you have a microservice jungle • Start with read APIs until you're absolutely comfortable with your heuristics About those "fake users"...
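
The steps in slides 20 to 22 (histogram of API calls, pick the top slice, swap real user ids for test users) can be sketched on a small in-memory sample. This is only an illustration under stated assumptions: the log line format `<method> <path> user_id=<digits>` and all names below are hypothetical, and the real jobs ran over Hadoop datasets rather than a `Seq`.

```scala
object TestSuiteBuilder {
  // Hypothetical log format: "<method> <path> user_id=<digits>"
  final case class LogEntry(method: String, path: String, userId: Long)

  private val LinePattern = """(\S+) (\S+) user_id=(\d+)""".r

  /** Parses one log line; lines that do not match the assumed format are dropped. */
  def parse(line: String): Option[LogEntry] = line match {
    case LinePattern(method, path, id) => Some(LogEntry(method, path, id.toLong))
    case _                             => None
  }

  /** Counts calls per endpoint and keeps the n most frequent ones. */
  def topEndpoints(entries: Seq[LogEntry], n: Int): Seq[(String, Int)] =
    entries
      .groupBy(e => s"${e.method} ${e.path}")
      .map { case (endpoint, calls) => endpoint -> calls.size }
      .toSeq
      .sortBy { case (endpoint, count) => (-count, endpoint) } // most frequent first
      .take(n)

  /** Replaces the real user id with one from the test-user pool (stable mapping). */
  def sanitize(entry: LogEntry, testUsers: IndexedSeq[Long]): LogEntry =
    entry.copy(userId = testUsers((entry.userId % testUsers.size).toInt))
}
```

The modulo mapping keeps the same real user pointing at the same test user across runs, which preserves some of the original traffic skew; it does not model relationships between users, which the slides call out as the hard part.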
  • 24. Be extra extra extra extra careful not to affect actual users About those "fake users"...
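
One way to picture the test-user isolation described above is a filter that every service applies based on a request header. This is a minimal sketch, assuming a hypothetical header name `X-Test-Traffic` and hypothetical `User` and `Request` types; the slides only say that an HTTP header is passed during tests.

```scala
object TestUserIsolation {
  // Hypothetical header name; the deck does not name the actual header.
  val TestHeader = "X-Test-Traffic"

  final case class User(id: Long, isTestUser: Boolean)
  final case class Request(headers: Map[String, String])

  /** A request counts as test traffic only when the header is explicitly "true". */
  def isTestRequest(req: Request): Boolean =
    req.headers.get(TestHeader).contains("true")

  /** Test requests only see test users; regular requests never see them. */
  def visibleUsers(req: Request, candidates: Seq[User]): Seq[User] = {
    val wantTestUsers = isTestRequest(req)
    candidates.filter(_.isTestUser == wantTestUsers)
  }
}
```

Defaulting to "not a test request" when the header is missing or malformed keeps the failure mode on the safe side: a broken test sees no data instead of touching real users.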
  • 26. Staging versus production Staging environments may not be practical • Twitter, Google, Netflix, etc. - infra too big to have a full end-to-end staging → Test in production with fake users • Stress testing in staging won't detect production bottlenecks
  • 27. Generating test load Spin up load generators • Deploy outside the DC/zone/area you are testing • Load the datasets • Collect metrics & results Replay the traffic logs you generated • Low thousands of RPS: a single service loading from a CSV → prototype • Low millions of RPS: multiple instances, load events from a queue → V1 • Millions to billions of RPS: multiple instances, multiple zones, multiple queues → V2
  • 29. Generating test load https://github.com/twitter/iago (Scala) https://github.com/tsenart/vegeta (Golang) https://jmeter.apache.org/ (Java)
        // Make 1000 HTTP requests at a roughly constant rate of 10/sec
        val client = ClientBuilder().codec(http()).hosts("twitter.com:80").build()
        val transport = new FinagleTransport(FinagleService(client))
        val consumer = new RequestConsumer(() => new PoissonProcess(10))
        // add 1000 requests to the queue
        for (i <- (1 to 1000)) {
          consumer.offer(new ParrotRequest(uri = Uri("/jack/status/20", Nil)))
        }
        // start sending
        transport.start()
        consumer.start()
        // wait for the consumer to exhaust the queue
        while (consumer.size > 0) {
          Thread.sleep(100)
        }
        // shutdown
        consumer.shutdown()
        transport.close()
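
The "roughly constant rate" in the Iago snippet comes from a Poisson arrival process: gaps between requests are random, but they average out to the target rate. The sketch below shows that idea with the standard library only; `PoissonSchedule` and its methods are hypothetical names and are not part of Iago.

```scala
import scala.util.Random

object PoissonSchedule {
  /** Exponentially distributed gaps (seconds) between requests for the given mean rate. */
  def interArrivalSeconds(ratePerSecond: Double, count: Int, rng: Random): Seq[Double] =
    // nextDouble() is in [0, 1), so 1 - u is in (0, 1] and the log is always finite
    Seq.fill(count)(-math.log(1.0 - rng.nextDouble()) / ratePerSecond)

  /** Absolute send times (seconds from start) for `count` requests. */
  def sendTimes(ratePerSecond: Double, count: Int, rng: Random): Seq[Double] =
    interArrivalSeconds(ratePerSecond, count, rng).scanLeft(0.0)(_ + _).tail
}
```

For example, `PoissonSchedule.sendTimes(10.0, 1000, new Random(42))` yields 1000 increasing timestamps that span about 100 seconds on average, which is closer to organic traffic than a fixed 100 ms tick.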
  • 31. COLLECT ALL THE METRICS Real-time collection is needed for circuit breaking and for tracking load goals. Long-term storage of metrics is needed for postmortems & planning.
  • 32. Metrics collection & management
  • 33. Metrics collection & management
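
A rough picture of why real-time metrics matter during a test: each window of samples has to answer "abort, keep going, or goal reached?". The sketch below is an assumption-laden illustration; `Sample`, `Decision`, the thresholds, and the window shape are all hypothetical and do not describe Twitter's metrics stack.

```scala
object TestMonitor {
  /** Requests and errors observed in one collection interval. */
  final case class Sample(requests: Long, errors: Long)

  sealed trait Decision
  case object Continue extends Decision
  case object Abort extends Decision
  case object GoalReached extends Decision

  /** Evaluates one window of samples against the error budget and the load goal. */
  def decide(window: Seq[Sample], windowSeconds: Int, goalRps: Double, maxErrorRate: Double): Decision = {
    val requests = window.map(_.requests).sum
    val errors = window.map(_.errors).sum
    val errorRate = if (requests == 0) 0.0 else errors.toDouble / requests
    val rps = requests.toDouble / windowSeconds
    if (errorRate > maxErrorRate) Abort // safety check always wins over the goal
    else if (rps >= goalRps) GoalReached
    else Continue
  }
}
```

Checking the error budget before the load goal is deliberate: a test that hits its target while breaking the site should still be stopped.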
  • 34. PARENTAL ADVISORY: Circuit Break Have a stop button at hand (not necessarily a physical button) Cancelling traffic must be fast 😔 V1: de-scheduling Mesos jobs → ~1-5 minutes → downtime 😲 V2: DNS-level toggle (Zookeeper) → sub-second emergency brake https://github.com/herval/groundcontrol
  • 35. PARENTAL ADVISORY: Circuit Break https://zookeeper.apache.org/
  • 36. PARENTAL ADVISORY: Circuit Break https://zookeeper.apache.org/
        import com.twitter.zookeeper.ZooKeeperClient
        import org.apache.zookeeper.CreateMode

        val zk = new ZooKeeperClient("zookeeper.local:2181")
        // create the node that acts as the stop button for this test
        zk.create("/test_deadbeef_circuit_break", "ok".getBytes, CreateMode.PERSISTENT)
        // abort the test as soon as the node's data changes to "abort"
        zk.watchNode("/test_deadbeef_circuit_break", { (data : Option[Array[Byte]]) =>
          data match {
            case Some(d) => if (new String(d) == "abort") { testService.abort() }
            case None => println("Test is done")
          }
        })
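
On the load generator side, the emergency brake amounts to checking a shared flag before every send. The sketch below replaces the ZooKeeper node with an in-memory `AtomicReference` so it runs standalone; `Brake` and `run` are hypothetical names, and a real setup would update the flag from the ZooKeeper watch callback.

```scala
import java.util.concurrent.atomic.AtomicReference

object EmergencyBrake {
  /** In-memory stand-in for the ZooKeeper node; "ok" means keep going. */
  final class Brake(initial: String = "ok") {
    private val state = new AtomicReference[String](initial)
    def set(value: String): Unit = state.set(value)
    def aborted: Boolean = state.get() == "abort"
  }

  /** Sends requests until the queue is empty or the brake is pulled; returns how many were sent. */
  def run(requests: Seq[String], brake: Brake)(send: String => Unit): Int = {
    var sent = 0
    val it = requests.iterator
    while (it.hasNext && !brake.aborted) { // the flag is re-checked before every request
      send(it.next())
      sent += 1
    }
    sent
  }
}
```

Because the check sits inside the send loop, traffic stops within one request of the flag flipping, without waiting for a scheduler to tear down jobs.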
  • 38. The test plan Proper Stress Testing is disruptive by nature Communication is key 🔑 • Avoid confusion among teams • Prevent people from firefighting during the test Timing is fundamental ⏰ • Testing too close to large events = surprises • Testing too far = surprises • Smaller, regular tests + planning schedule = win
  • 39. The test plan • Get input and signoff from service owners on goals • Build a unified schedule Build trust before building the software
  • 40. The test plan • Communicate tests as they happen (and their results) • Follow-up and fix bottlenecks Build trust before building the software
  • 41. yeah!
  • 42. What did we learn? Build degradable systems to absorb spikes (automatic feature degradation, feature toggles, circuit breaks) Build self-healing systems to prevent slow decay (auto-scaling, self-monitoring)
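
Automatic feature degradation can be as simple as giving each feature a load threshold above which it is switched off. The sketch below is a toy model under that assumption; `Feature`, `shedAboveLoad`, and the load scale (0.0 to 1.0) are hypothetical and not taken from the talk.

```scala
object Degradation {
  // Hypothetical model: a feature is shed once load exceeds its threshold.
  final case class Feature(name: String, shedAboveLoad: Double)

  /** Names of the features that stay enabled at the given load level. */
  def enabledFeatures(features: Seq[Feature], currentLoad: Double): Seq[String] =
    features.filter(f => currentLoad <= f.shedAboveLoad).map(_.name)
}
```

With thresholds of 1.0 for the timeline, 0.8 for trends and 0.6 for recommendations, a load of 0.7 keeps the timeline and trends, and a load of 0.95 keeps only the timeline.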
  • 43. But did it work? Japanese New Year (2015): no incidents Japanese New Year (2016): no incidents Japanese New Year (2017): no incidents World Cup (2018): 🤞🇧🇷
  • 44. Acknowledgments Special thanks to the Reliability Team (past & present members) - Ali Alzabarah - Esteban Kuber - Kyle Laplante - Mazdak Hashemi - Niranjan Baiju - Pascal Borghino - Ramya Krishnan - Steve Salevan And all the SRE heroes that save the day, every day.

Editor's notes

  1. In the cartoon, the goal was to bring down the castle; in real life, it brought down Twitter.
  2. Staging does not reflect production (unless you duplicate the entire infrastructure).
  3. Preferably deploy outside the DC/zone/area you are testing. Watch for I/O bottlenecks (you may need more than one instance for big tests). Replay the synthetic traffic logs you generated. Low thousands of RPS: a single service loading from a CSV. Low millions of RPS: multiple instances, load events from a queue. Millions to billions of RPS: multiple instances, multiple zones, multiple queues.
  4. 2 thousand workers, half a second.
  5. None of this helps if you are not good at planning and communicating.