Would you ever play an online game if you were not able to communicate with your teammates? Isn’t it fun if you can make new friends, arrange pre-made games and celebrate your victories with people you like to play with?
Riot Games’ League of Legends handles millions of online players at any given time. Each chat server is responsible for routing over 1 billion real-time events a day. To support this enormous user base, prepare for future growth, and pave the way for upcoming features, the chat infrastructure had to be designed and built with the utmost care, so that it would never fail the players.
In this talk I would like to present how we achieved linear scalability, improved the overall fault tolerance, created a framework for real-time code upgrades and got ready for the new features we want to ship. I will also discuss in detail why we chose Erlang as the foundation for the system, and why we migrated our data from MySQL to Riak.
3. WHAT IS LEAGUE OF LEGENDS?
2009 LAUNCH
TEAM ORIENTED
100+ CHAMPS
MODERN FANTASY
4. CHAT: WHAT IS IT?
MESSAGING SERVICE: Private player chat and group chats.
PRESENCE SERVICE: Friend lists, availability and status.
SOCIAL GRAPH SERVICE: Internal service for store, match history, leagues.
10. CHAT AT 10K FEET
STABLE, SCALABLE CHAT SERVICE
PROTOCOL, SERVER, DATA STORE
11. SERVER: EJABBERD
‣ Open source Jabber/XMPP server
‣ Relatively nice scalability and performance with default configuration
‣ Wide adoption and active, helpful community
‣ Very good as a starting point for our own server solution
▾ We were aware that one day we would need to start customizing it
‣ Written in Erlang programming language
12. TECHNOLOGY: ERLANG/OTP
Erlang is... -> Which gives us...
‣ A functional language -> A declarative style of programming
‣ Built with concurrency and distribution in mind -> An easier way to build our distributed applications
‣ Able to scale extremely well -> More time to focus on coding
‣ Capable of reloading code on the fly -> Less downtime
13. SERVER: EJABBERD - PHILOSOPHY
ARCHITECTURE: Share-nothing approach; enables massive, near-linear horizontal scalability.
FAULT TOLERANCE: Implementation of self-healing properties, which bring the system to a well-known, stable state.
LET IT CRASH: When something is massively broken, do not fix it!
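The "let it crash" idea can be illustrated with a minimal sketch. It is written in Python rather than Erlang, and every name in it (`Worker`, `supervise`, `max_restarts`) is hypothetical; it is not ejabberd or OTP code. The worker makes no attempt to repair itself, and a supervising loop replaces it with a fresh instance in a known-good state.

```python
# Hypothetical sketch of "let it crash"; not ejabberd or Erlang/OTP code.

class Worker:
    """A worker that always starts from a well-known initial state."""

    def __init__(self):
        self.handled = 0  # known-good initial state

    def handle(self, message):
        if message == "corrupt":
            # No defensive repair here: the worker simply crashes.
            raise ValueError("cannot handle message")
        self.handled += 1
        return f"ok:{message}"


def supervise(messages, max_restarts=3):
    """Feed messages to a worker, replacing it with a fresh one on a crash."""
    worker = Worker()
    restarts = 0
    results = []
    for message in messages:
        try:
            results.append(worker.handle(message))
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # too many crashes: escalate instead of looping forever
            worker = Worker()  # restart in the known-good state
            results.append("restarted")
    return results, restarts


results, restarts = supervise(["hi", "corrupt", "bye"])
print(results, restarts)
```

The restart limit is an assumption of this sketch; it stands in for the escalation a real supervisor would perform when restarts do not help.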
15. SERVER: EJABBERD - IMPLEMENTATION
PHASE 1 - MAKE IT WORK
‣ Mostly rewritten over time
‣ Removed unwanted and unneeded parts
‣ Optimized certain flow paths
‣ Made it compatible with industry standards
‣ Wrote over 600 tests to cover it
[Diagram: Invite and Accept exchange between Alice and Bob]
17. SERVER: EJABBERD - IMPLEMENTATION
PHASE 2: MAKE IT RIGHT
‣ Removed clear bottlenecks
‣ Avoided shared, mutable state
‣ “Make it work, make it right, make it fast”
[Diagram: user sessions, a MUC router, and MUC rooms]
18. SERVER: EJABBERD - IMPLEMENTATION
[Diagram: user sessions and MUC rooms, without the MUC router]
19. SERVER: EJABBERD - IMPLEMENTATION
Session Table:
JID -> Session Handler
[Diagram: session table with entries for Alice, Bob and Charlie]
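A session table of the "JID -> Session Handler" kind can be sketched as a plain lookup structure. This is an illustrative Python sketch only: the class, its methods and the example JIDs are hypothetical, and it says nothing about how the real table is stored or distributed.

```python
# Hypothetical sketch of a session table: JID -> session handler.

class SessionTable:
    def __init__(self):
        self._sessions = {}  # JID string -> callable that accepts a message

    def register(self, jid, handler):
        self._sessions[jid] = handler

    def unregister(self, jid):
        self._sessions.pop(jid, None)

    def route(self, to_jid, message):
        """Deliver message to the handler for to_jid; False if no session."""
        handler = self._sessions.get(to_jid)
        if handler is None:
            return False
        handler(message)
        return True


# Each "session handler" here is just the append method of an inbox list.
inbox = {"alice@example.com": [], "bob@example.com": []}
table = SessionTable()
for jid, box in inbox.items():
    table.register(jid, box.append)

delivered = table.route("bob@example.com", "hello from Alice")
missed = table.route("charlie@example.com", "are you there?")
```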
20. SERVER: EJABBERD - IMPLEMENTATION
PHASE 3 - MAKE IT FAST
‣ Patched VM and stdlibs
‣ Sacrificed the generic nature of the Erlang/OTP framework in favor of better scalability and fault tolerance
‣ Better traceability and profiling functions
‣ More visibility into the system
‣ Improved logging for code reloading and real-time system upgrades
21. CHAT AT 10K FEET
STABLE, SCALABLE CHAT SERVICE
PROTOCOL, SERVER, DATA STORE
22. NOSQL DATA STORE: RIAK
SCALE: Linearly scalable -> No growth headaches
FAULT TOLERANCE: No SPoF -> Higher uptime
SCHEMA-LESS: Faster feature iterations -> More shipped features
‣ Distributed, fault-tolerant, key-value store
‣ Masterless, fully peer-to-peer architecture
‣ AP in CAP theorem, with eventual consistency
‣ Low, predictable latency
‣ Extreme scalability
‣ Multi data center replication
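Riak spreads keys over its nodes with consistent hashing, which is what lets a masterless cluster agree on where a key lives. The sketch below is a generic, simplified hash ring in Python, not Riak's actual ring implementation; the node names, the `vnodes` count and the `roster:alice` key are assumptions made up for illustration.

```python
import bisect
import hashlib

# Simplified consistent hash ring; an illustration, not Riak's implementation.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the ring for better balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def owners(self, key, replicas=3):
        """Return the distinct nodes responsible for key, walking clockwise."""
        start = bisect.bisect(self._points, self._hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.owners("roster:alice")
print(owners)
```

Any node can compute the same owner list from the key alone, so no master is needed to route a request, and replicating to several owners removes the single point of failure.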
23. LESSONS LEARNED: UNDERSTAND YOUR SYSTEM
‣ Over 500 real-time counters, rates, histograms collected each minute
‣ Make sure to know counter values for “correct” and “abnormal” conditions
‣ Alerts and logs for long running operations
‣ Integration with Graphite, Zabbix and Nagios
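Knowing the "correct" range for each counter is what turns a metric into an alert. A minimal Python sketch of that idea follows; the `Metrics` class, the counter names and the ranges are all hypothetical, and a real deployment would ship the snapshot to a monitoring system rather than return it.

```python
from collections import Counter

# Hypothetical sketch: per-interval counters checked against known-normal ranges.

class Metrics:
    def __init__(self, normal_ranges):
        self.counts = Counter()
        self.normal_ranges = normal_ranges  # name -> (low, high) per interval

    def incr(self, name, amount=1):
        self.counts[name] += amount

    def flush(self):
        """Return (snapshot, alerts) for the past interval and reset counters."""
        snapshot = dict(self.counts)
        alerts = []
        for name, (low, high) in self.normal_ranges.items():
            value = snapshot.get(name, 0)
            if not low <= value <= high:
                alerts.append(f"{name}={value} outside [{low}, {high}]")
        self.counts.clear()
        return snapshot, alerts


metrics = Metrics({"logins": (1, 100), "db_timeouts": (0, 2)})
metrics.incr("logins", 40)
metrics.incr("db_timeouts", 5)
snapshot, alerts = metrics.flush()
print(alerts)
```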
24. LESSONS LEARNED: IMPLEMENT FEATURE TOGGLES
‣ Safety valve for things that might cause problems
‣ Partial deployments allowing features to be enabled only for certain groups of people
[Diagram: "group reordering" feature with whitelist: Bob; of Alice, Bob and Charlie, only Bob has it]
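A toggle with a whitelist, as in the group reordering example, can be sketched in a few lines. This Python sketch is illustrative only: `FeatureToggles`, `configure` and `is_enabled` are hypothetical names, not the interface of the actual system.

```python
# Hypothetical sketch of feature toggles with an optional user whitelist.

class FeatureToggles:
    def __init__(self):
        self._features = {}  # name -> {"enabled": bool, "whitelist": set or None}

    def configure(self, name, enabled, whitelist=None):
        self._features[name] = {
            "enabled": enabled,
            "whitelist": set(whitelist) if whitelist is not None else None,
        }

    def is_enabled(self, name, user):
        feature = self._features.get(name)
        if feature is None or not feature["enabled"]:
            return False  # unknown or switched off: the safety valve
        whitelist = feature["whitelist"]
        # No whitelist means everyone; otherwise only listed users.
        return whitelist is None or user in whitelist


toggles = FeatureToggles()
toggles.configure("group_reordering", enabled=True, whitelist=["bob"])
visible_to = [
    user for user in ["alice", "bob", "charlie"]
    if toggles.is_enabled("group_reordering", user)
]
print(visible_to)
```

Setting `enabled=False` switches the feature off for everyone, including whitelisted users, which is the behavior a safety valve needs.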
25. LESSONS LEARNED: SUPPORT CODE RELOADING
‣ Patching bugs on the fly
‣ Changing server configuration
‣ Collecting data for future analysis
‣ No downtime deploys
[Diagram: replacing buggy code with fixed code with a server restart, and without one]
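Erlang reloads code at the VM level. The Python sketch below only imitates the effect with a lookup table: because the handler is resolved on every call, swapping the entry takes effect without restarting anything. The function names and the off-by-one bug are invented for illustration.

```python
# Hypothetical sketch: swapping a handler at runtime without a restart.
# This imitates the effect of hot code reloading; it is not how Erlang does it.

handlers = {}


def dispatch(name, *args):
    # Look the handler up on every call so a swap takes effect immediately.
    return handlers[name](*args)


def buggy_unread_count(messages):
    return len(messages) - 1  # off-by-one bug


def fixed_unread_count(messages):
    return len(messages)


handlers["unread_count"] = buggy_unread_count
before = dispatch("unread_count", ["a", "b"])

handlers["unread_count"] = fixed_unread_count  # patch on the fly
after = dispatch("unread_count", ["a", "b"])
print(before, after)
```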
26. LESSONS LEARNED: GET YOUR LOGGING RIGHT
‣ Proper logging and tracing facilities
‣ Debug modes for selected users
‣ Tools for analysis of the collected data
[Diagram: log outputs for Alice: ejabberd.log, slow_db.log, trace_alice.log, roster_audit.log, muc_audit.log, Honu]
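A debug mode for selected users can be approximated with a log filter that lets debug records through only for traced users. The sketch uses Python's standard `logging` module; the `UserTraceFilter` class, the `user` record attribute and the logger name are assumptions of this example, not part of the system described.

```python
import io
import logging

# Hypothetical sketch: DEBUG output only for users that are being traced.

class UserTraceFilter(logging.Filter):
    """Pass INFO and above for everyone, DEBUG only for traced users."""

    def __init__(self, traced_users):
        super().__init__()
        self.traced_users = set(traced_users)

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        return getattr(record, "user", None) in self.traced_users


stream = io.StringIO()  # stands in for a per-user trace file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(user)s %(message)s"))
handler.addFilter(UserTraceFilter({"alice"}))

log = logging.getLogger("chat_sketch")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(handler)

log.debug("roster fetched", extra={"user": "alice"})
log.debug("roster fetched", extra={"user": "bob"})
log.info("login", extra={"user": "bob"})
output = stream.getvalue().splitlines()
print(output)
```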
27. LESSONS LEARNED: ALWAYS LOAD TEST YOUR CODE
‣ Automatic verification of the latest builds
‣ Collecting historical results for comparison
‣ Measuring the impact of new features and changes to the code
‣ Simulating various failures
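Comparing a new build against historical results can be as simple as checking its latency against a baseline. In this Python sketch the latency figures, the 10% tolerance and the `regressed` helper are all made up for illustration; they are not measurements or tooling from the talk.

```python
import statistics

# Hypothetical sketch: flag a build whose latency regressed against history.

def regressed(history, latest, tolerance=0.10):
    """True if latest exceeds the historical median by more than tolerance."""
    baseline = statistics.median(history)
    return latest > baseline * (1 + tolerance)


# Invented p95 latencies (ms) from earlier builds, for illustration only.
history_p95_ms = [41.0, 39.5, 40.2]

ok_build = regressed(history_p95_ms, 43.0)
slow_build = regressed(history_p95_ms, 52.0)
print(ok_build, slow_build)
```

Using the median of past runs keeps one unusually slow or fast historical run from shifting the baseline.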
28. LESSONS LEARNED: THINGS WILL FAIL
‣ Prepare for the worst
‣ It’s just a matter of time before a crash happens
‣ It’s not only our code that fails
‣ At this scale, unlikely events happen every second
29. CURRENT SITUATION
CHAT IS DOING GREAT!
Quality uptime is over 99% each month and increasing, with hundreds of servers deployed all over the world.
SCALE AND PERFORMANCE
Each server offers reliable, low latency to the players, routing over 1B events a day with low resource utilization.
CHAT IS EVOLVING
Rolling out Riak worldwide, making LoL Chat available outside of the client, exploring possibilities around using social graph data, and more...