17. Ok, so let's try this!
[Architecture diagram: the hitbox UI connects through Nginx to Node.js frontend servers, which talk to Node.js backend servers; a Redis cluster serves as data storage, and a PHP REST API connects the chat to the rest of hitbox. Both server tiers auto-scale. Open points marked on the slide: WebSocket load balancing, and a permission and security model (admin, mods, ...). Average roundtrip per message: < 300 ms.]
18. Frontend Server
• Small, cheap machines
• Handle the connections, no logic
• When one breaks, it breaks only for a few users
• Automatic failover to another chat frontend server
• Socket.io for handling WebSockets
• Carrier for sending messages between front & back
• Up- and downscaling possible as needed
19. Backend Server
• Small, cheap machines
• Handle all the logic
• Stateless, can be restarted/upgraded at any time
• Easily expandable with new features
• Up- and downscaling possible as needed
• Load balancing via round robin
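The round-robin balancing from the last bullet can be sketched in a few lines of Node.js (the names here are illustrative, not hitbox's actual code):

```javascript
// Minimal round-robin dispatcher: each call returns the next backend
// in turn, wrapping around at the end of the list.
function makeRoundRobin(backends) {
  let i = 0;
  return function next() {
    const backend = backends[i % backends.length];
    i += 1;
    return backend;
  };
}

// Usage: a frontend would pick a backend like this for each message batch.
const nextBackend = makeRoundRobin(['backend-1', 'backend-2', 'backend-3']);
```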
20. Redis
• Fast
• I mean, REALLY fast!
• You can cluster it
• Easy to back up
25. Ok, let's fix WebSockets
[Same architecture diagram as before (UI → Nginx → Node.js frontend servers → Node.js backend servers, Redis cluster as data storage, PHP REST API, auto-scaling on both tiers), now extended with a Node.js long-polling fallback server. The open points remain WebSocket load balancing and the permission and security model (admin, mods, ...).]
34. Load Balancing
• Frontend servers report their CPU load every 10 seconds
• The X least-loaded frontend servers are sent to the UI
• The UI selects a frontend server randomly from this list
• If the UI gets disconnected, it removes that server from the list
• The UI tries another frontend server
• If no servers are left, the UI gets X new frontend servers from the API
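The client-side part of these steps can be sketched like this (class and function names are made up for illustration; `getServersFromApi` stands in for the API call that returns the X least-loaded frontends):

```javascript
// Client-side picker for the load-balancing scheme described above.
class ChatServerPicker {
  constructor(getServersFromApi) {
    this.getServersFromApi = getServersFromApi;
    this.servers = [];
  }
  pick() {
    if (this.servers.length === 0) {
      // No candidates left: ask the API for X fresh frontend servers.
      this.servers = this.getServersFromApi();
    }
    // Random choice, so a mass reconnect (everyone pressing F5 at once)
    // spreads evenly instead of stampeding a single server.
    return this.servers[Math.floor(Math.random() * this.servers.length)];
  }
  markDead(server) {
    // On disconnect, drop the server and let pick() try another one.
    this.servers = this.servers.filter(s => s !== server);
  }
}
```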
That's me, 1980/81, with my first computer. Anyone know which computer it is? I studied arts, lived in New York and Berlin, founded startups and crashed startups.
What is hitbox? This is the front page.
This is a streamer: he plays games and streams them. Most streamers are also entertainers, making money with advertising and subscriptions.
6 million uniques/month, number 2 in the world.
Sounds easy, right? It has existed for 30 years.
So, how hard can it be?
Lots of things to do! And that's just the beginning!
Most important is realtime: you write something, and everyone else should see it as fast as possible.
For example, he dances (and lost 20 kg this way) and people cheer him on in the chat.
So back to the chat. IRC is a protocol that has been in use for 30 years; we wanted to make something new, something modern, something without netsplits, etc.
We started with this because our backend is already in PHP. Let's see if this works out!
Easy setup:
And MySQL as the database.
Well, these two sentences already tell you all of the problems...
Imagine a "long-running PHP process to serve multiple WebSocket connections".
It worked for up to 2,000 connections. Not very scalable!
So back to the drawing board. We wanted something modern, so let's use modern software!
We went with Node.js and Redis. Anyone here have experience with Node.js servers?
We use a two-way setup:
frontend servers, backend servers, and Redis as data storage. If we lose the Redis data, we just lose who is in which chatroom; press F5 and you are back in.
We use AWS
Single-core machines.
Same machines as for the frontends.
I can only recommend it; I never saw a Redis instance fail (except for getting slow).
So, looks like a perfect system. Let's code it!
We did and...
it worked!
So we could party!
Not so fast!
There we had our first problem, with something everyone should support.
It's a fucking standard!
But there are firewalls that block it, there are mobile devices that block it, or even worse, tell you that a WebSocket connection is working when it isn't. They just lie to you!
0.5-1% have this problem, but they were emailing us like hell...
0.5-1%
So we had to use fallback servers for long polling. Long polling means a lot of overhead from the HTTP protocol, so these servers can handle only 1/10 of what a normal frontend server can. But it works!
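The core of a long-polling fallback looks roughly like this (a self-contained sketch, not hitbox's code): the server parks each request until a message arrives or a timeout fires, and that one-request-per-delivery cycle is exactly where the HTTP overhead comes from.

```javascript
// One queue object per parked client. If messages are already buffered,
// respond immediately; otherwise hold the request open until push() or
// the timeout fires.
function longPoll(queue, timeoutMs, respond) {
  if (queue.messages.length > 0) {
    respond(queue.messages.splice(0)); // flush everything buffered
    return;
  }
  const timer = setTimeout(() => {
    queue.waiter = null;
    respond([]); // empty response; the client reconnects right away
  }, timeoutMs);
  queue.waiter = msgs => { clearTimeout(timer); respond(msgs); };
}

function push(queue, msg) {
  if (queue.waiter) {
    const waiter = queue.waiter;
    queue.waiter = null;
    waiter([msg]); // deliver to the parked request
  } else {
    queue.messages.push(msg); // buffer until the next poll
  }
}
```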
So we thought we could party again.
Well, the hitbox audience is young, so they try a lot... You wouldn't imagine how often we get DDoSed or how often people try to abuse the API...
And last year, someone managed it:
It was during the biggest event ever at that time, 60k people on one stream, and suddenly all of them saw this.
And we did this!
Well, they didn't manage to break our system or steal any user data. The only thing they did was insert some JavaScript into the "nameColor" field, and we didn't validate it. We validated everything else, but not this one, because it is only a number...
So:
Really, everything. Really, really everything!
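The fix is the boring one: whitelist-validate every field, even the "it's only a number" ones. A sketch of what a strict nameColor check could look like (illustrative, not the actual hitbox code):

```javascript
// A hex color like "9B2D0F" passes; anything else, including an injected
// "<script>...</script>" payload, is rejected before it ever reaches HTML.
function isValidNameColor(value) {
  return typeof value === 'string' && /^[0-9a-fA-F]{6}$/.test(value);
}
```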
Again, we thought we could party!
But... then others came and did this:
A WebSocket DDoS! Sending massive amounts of join commands to the chat.
So we had to think about how we could distribute this load better, or make it harder for them to reach all frontend servers. Remember, those servers are scaling up and down automatically.
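One standard defence against such a join flood (a general technique, not something the talk spells out) is a per-connection token bucket: each connection gets a small burst allowance and is then throttled to a steady rate.

```javascript
// Token bucket: each connection may send `capacity` joins in a burst,
// then is throttled to `refillPerSec` joins per second.
// The injectable clock (`now`) is just for testability.
function makeBucket(capacity, refillPerSec, now = Date.now) {
  let tokens = capacity;
  let last = now();
  return function allow() {
    const t = now();
    tokens = Math.min(capacity, tokens + ((t - last) / 1000) * refillPerSec);
    last = t;
    if (tokens >= 1) { tokens -= 1; return true; }
    return false; // drop or delay the join command
  };
}
```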
So this is how we do load balancing on the frontend servers; it works really well.
If they DDoS a few servers, those servers will not get new connections, and from the upscaling we get new servers that are not DDoSed.
Why the random factor in the UI? F5. More on this later.
So once again we party hard!
Until he came:
Rezigiusz, a Polish YouTuber and streamer with a lot of fans that love to type.
Think of it as the One Direction of Poland.
When he is streaming he has around 1-15k viewers, and they type 2,000 messages a second into the chat!
1,995 of those get blocked, but the backend servers still have to check them all...
So the event loop of Node.js exploded...
But using async.js, which is a great tool for queuing work, we could clean up the event loop, delaying some messages a few milliseconds while keeping the main tasks working fine.
So, for example, we made queues for the most important functions: login, logout, chatmsg, etc.
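The idea behind those async.js queues, stripped to its core (a plain-Node sketch, not the real code): push each incoming message into a queue per message type and drain a bounded number of tasks at a time, yielding back to the event loop between tasks.

```javascript
// Tiny work queue in the spirit of async.queue: at most `concurrency`
// tasks run at once; setImmediate yields control back to the event loop
// between tasks, so I/O and timers keep flowing even under a flood.
function makeQueue(worker, concurrency) {
  const tasks = [];
  let running = 0;
  function drain() {
    while (running < concurrency && tasks.length > 0) {
      running += 1;
      const task = tasks.shift();
      setImmediate(() => {
        worker(task, () => { running -= 1; drain(); });
      });
    }
  }
  return {
    push(task) { tasks.push(task); drain(); },
    length() { return tasks.length; },
  };
}
```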
So, we can party again!
But don't forget one of the biggest problems you can run into...
I know this sounds stupid, but I will give you two examples:
Imagine you have a stream with 100k viewers. Every time a new viewer comes to this stream, he or she gets the info about how to get the stream from our server.
Now imagine the streamer has a problem: let's say his computer crashes and the stream drops, meaning it goes black or gets stuck.
What do 100k people do?
This.
And let's hope that your API can handle this!
And they won't stop until they have a stream again!
We learned a lot about caching; otherwise you cannot handle this. Memcache and Redis are your friends here.
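Cache-aside is what saves you here. A minimal sketch (a Map stands in for memcache/Redis, and the loader stands in for the expensive DB/API call; names are illustrative):

```javascript
// On a mass F5, only the first request per key within the TTL hits the
// loader; everyone else gets the cached copy.
function makeCache(loader, ttlMs, now = Date.now) {
  const store = new Map();
  return function get(key) {
    const hit = store.get(key);
    if (hit && now() - hit.at < ttlMs) return hit.value; // cache hit
    const value = loader(key); // the expensive part
    store.set(key, { value, at: now() });
    return value;
  };
}
```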
The second example is stupid software design:
It is quite common that streamers announce when they will start to stream, and then people are already waiting on the page for them to go online.
Well, we have the chat connected anyway, so why not send a special message over the chat to trigger the start of the stream...
Sounds easy, and for our system it is.
But then again you will DDoS yourself. Imagine this with 100k people waiting...
So sometimes realtime is really bad, because it is realtime... and it can destroy you.
So we went back to the good old polling interval, because then you distribute the 100k connections over 30 seconds, giving you much more time to handle the load.
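A common way to get that spread (an illustrative sketch, assuming a ~30 s window): add random jitter to each client's polling delay, so 100k clients never line up on the same second.

```javascript
// Each client waits baseMs plus a random slice of jitterMs before the
// next poll, spreading the requests across the whole window.
function nextPollDelay(baseMs, jitterMs, rand = Math.random) {
  return baseMs + Math.floor(rand() * jitterMs);
}

// e.g. setTimeout(poll, nextPollDelay(25000, 10000)) polls every 25-35 s.
```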
So, we can party again!
The same guy as at the beginning; he has his own website with animated GIFs.
Well, at the end, something that is very important to me: monitor everything!
Our Swiss Army knife is statsd from Etsy, a great piece of software written in Node that collects metrics via UDP and works great.
We use it in combination with Graphite and monitor really everything.
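The statsd wire format is just plaintext over UDP, "name:value|type", which is why it is so cheap to emit from anywhere. A sketch of formatting the three common metric types (the metric names and hostname are made up):

```javascript
// type: 'c' = counter, 'ms' = timing, 'g' = gauge
function statsdLine(name, value, type) {
  return name + ':' + value + '|' + type;
}

// Sending is a one-liner with Node's dgram module, e.g.:
// require('dgram').createSocket('udp4')
//   .send(statsdLine('chat.messages', 1, 'c'), 8125, 'statsd.internal');
```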
See the down-spike on active chat connections? That is when Node is not able to keep the 10-second timing for reporting the stats. You get used to it.
Well, and at the end: is the chat system working? Does it scale?
Well, I don't have a screenshot of our latest record, which was close to 200k, but this one shows you a channel with 100k people.
All 154k connections were handled by 16 frontend servers and 8 backend servers, costing us around $20 for the evening.
And dont forget the network traffic!
Around 160-200 Mbit per machine, outgoing, text only! These cheap machines are limited to around 200 Mbit.
That's it. Thank you!