5. 5
The challenge
10 Million+ viewers
Design goal of 50,000 requests/s and 10,000 buzzes/s
Equivalent to 130 billion requests/month
But just on Saturday night
And four weeks to build
Thursday, 26 May 2011 5
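The 130 billion figure follows from the design goal if the peak rate were sustained around the clock. A quick check, assuming a 30-day month (the function name and the 30-day default are illustrative, not from the slides):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_requests(requests_per_second: int, days: int = 30) -> int:
    """Requests served in `days` days at a constant rate (assumed 30-day month)."""
    return requests_per_second * SECONDS_PER_DAY * days


total = monthly_requests(50_000)
print(f"{total:,} requests/month")  # 129,600,000,000, i.e. roughly 130 billion
```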
6. 6
The challenge
Where do 130 billion requests fit?
Source: http://www.google.com/adplanner/static/top1000/#
7. 7
Where we started...
app.livetalkback.com: ELB -> Webservers (Django on Ubuntu) -> MySQL
cdn.livetalkback.com: CloudFront -> S3
Control plane: Zabbix
8. 8
Step 1: Testing
Started with a platform with a previous peak of 100 requests/s
No idea where it would break
Tsung! http://tsung.erlang-projects.org/
9. 9
Step 2: ELB
Amazon Elastic Load Balancer
“Infinite capacity”
BUT very long impulse response and NO controls :(
HAProxy to the rescue
5K requests/s per node
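As a rough sizing check, the per-node figure above can be combined with the 50,000 requests/s design goal to estimate how many HAProxy nodes are needed. This is a back-of-envelope sketch with no headroom for failures or uneven load; the function name is hypothetical:

```python
import math


def nodes_needed(target_rps: int, per_node_rps: int) -> int:
    """Smallest node count whose combined capacity covers the target rate."""
    return math.ceil(target_rps / per_node_rps)


print(nodes_needed(50_000, 5_000))  # 10
```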
10. 10
Step 3: Avoid the DB
MySQL was never going to be able to handle 10,000 writes/s, nor 50,000 reads/s
“Hey, Django does memcached. Problem solved”
Help, our memcached server I/O is maxed out :(
Two-layer cache: https://gist.github.com/953524
Write-behind data
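One way a two-layer cache can take load off a shared memcached server is to keep a short-lived in-process copy of hot keys on each webserver, so repeated reads never leave the machine. The sketch below illustrates that idea only; it is not the code in the linked gist. `DictBackend` is a hypothetical stand-in for a memcached client, and `local_ttl` and the injectable `clock` are assumptions made for the example:

```python
import time


class DictBackend:
    """Hypothetical stand-in for a shared cache client (e.g. memcached)."""

    def __init__(self):
        self.data = {}
        self.gets = 0  # counts round trips to the shared layer

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class TwoLayerCache:
    """Short-TTL in-process cache in front of a shared remote cache."""

    def __init__(self, remote, local_ttl=1.0, clock=time.monotonic):
        self.remote = remote
        self.local_ttl = local_ttl
        self.clock = clock
        self._local = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # served locally, no network I/O
        value = self.remote.get(key)
        if value is not None:
            self._local[key] = (now + self.local_ttl, value)
        else:
            self._local.pop(key, None)
        return value

    def set(self, key, value):
        self.remote.set(key, value)
        self._local[key] = (self.clock() + self.local_ttl, value)
```

The trade-off is staleness: a value changed on the shared layer can be served from a local copy for up to `local_ttl` seconds, so the TTL should be no longer than the application can tolerate.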
11. 11
But we want analytics!
Now 10K things to write to disk every second
Logging? Database?
This is starting to look like BIG DATA
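With around 10K records arriving every second, one common way to keep storage off the request path is the write-behind pattern mentioned on the previous slide: accept each record in memory and persist records in batches. The sketch below is a simplified, single-threaded illustration under that assumption; `WriteBehindBuffer`, `flush_fn` and `max_batch` are hypothetical names, and the slides do not say how the real system buffered its writes:

```python
class WriteBehindBuffer:
    """Collects records in memory and hands them to storage in batches."""

    def __init__(self, flush_fn, max_batch=1000):
        self.flush_fn = flush_fn    # callable that persists a list of records
        self.max_batch = max_batch  # flush once this many records are pending
        self._pending = []

    def write(self, record):
        # Returns immediately unless this record fills the batch.
        self._pending.append(record)
        if len(self._pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.flush_fn(batch)


stored_batches = []
buffer = WriteBehindBuffer(stored_batches.append, max_batch=3)
for i in range(7):
    buffer.write({"buzz": i})
buffer.flush()  # persist whatever is left, e.g. on a timer or at shutdown
print([len(batch) for batch in stored_batches])  # [3, 3, 1]
```

Records still in the buffer are lost if the process dies before a flush, so a real deployment would also need a periodic flush and a view on how much loss is acceptable.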
13. 13
Step 5: Cassandra
Deployed Cassandra cluster on EC2 to handle buzz records
Tested to > 10K writes/s
All good!
“So how many users did we have last night?”
14. 14
Where we ended...
app.livetalkback.com: HAProxy (10 nodes) -> Webservers (100+ nodes, Django on Ubuntu) -> Memcached, Cassandra, RDS Master
cdn.livetalkback.com: CloudFront -> S3
Control plane: Chef, Zabbix
15. 15
Scaling up - and down
Configuring 100+ servers by hand each week would have been a pain
Used Chef to automate
Also builds the test swarm
http://wiki.opscode.com/display/chef/Home
16. 16
Now what?
Still challenges with analytics & ad-hoc queries
Looking at Brisk and Hadoop
We’re sucking the Twitter firehose for Tellybug
MySQL is coping so far, but only just
17. 17
Questions?
boxm@livetalkback.com
@malcolmbox