More than two years ago Spil Games started building a global cross-datacenter storage solution with MySQL as one of its key components. This session will focus on our road to get there and share some of the best practices we learned along the way.

To give our users the best experience possible we needed to expand our datacenters globally and bring sharded data closer to our end users through a storage abstraction layer called SSP (Spilgames Storage Platform). This required a new way of thinking and a complete overhaul of our architecture, which meant we also had to change or replace many of our infrastructure components. When we started building this new architecture we took into account that failure is bound to happen at some point. The new architecture has been built with failure in mind: we can cope with split-brain scenarios and, to a certain degree, even with a full datacenter outage. We achieved this by building Master datacenters containing the persistent storage and Satellite datacenters containing memory-based storage. This approach required new techniques such as write-through for inter-datacenter writes and enabled us to implement asynchronous (fire-and-forget) writes to our persistent layer.

At the same time we had to cope with non-shardable data that needed to be available in all our datacenters at all times and, of course, in a similar fashion as our sharded data. We solved this by creating a second storage layer, ROAR (Read Often, Alter Rarely), that can cope with highly concurrent reads.

The architecture is built with MySQL, Erlang, Memcache, OpenStack and a hybrid cloud as its building blocks.
Spil Games is a (social) gaming company that grew in a short time from an internet startup to a global online gaming leader with more than 180 million users.
4. Facts
• Game publisher
• Company founded in 2001
• 350+ employees worldwide
• 180M+ unique visitors per month
• Over 60M registered users
• 45 portals in 19 languages
• Casual games
• Social games
• Real-time multiplayer games
• Mobile games
• 40+ MySQL clusters
  • 65k queries per second
• 8 Sphinx servers
  • 8k queries per second
8. Getting closer to the user means
• Static content
  • CDN strategy
• Dynamic content
  • Multi-datacenter setup
  • Store data in a distributed fashion
  • Sharding on user level is inevitable
  • Replace old functionally sharded applications
9. Dynamic content
• Portal content
  • Hardly changes at all
  • Can be cached heavily
• User data
  • Written frequently
  • Read only on demand
  • Difficult to cache
• Recommendations
  • Written on demand by the DWH (data warehouse)
  • Written and read frequently
10. Solving other problems
• Transparent API layer
  • No direct access to the DB
• Transparent storage system
  • Use MySQL where possible
• Migrate data between systems
  • Bulk migrations from/to shards
  • Bulk migrations from/to datacenters
• Connection pooling
11. Problem isolation
• Frequently written data
  • SSP (Spilgames Storage Platform)
• Infrequently written data
  • ROAR (Read Often, Alter Rarely)
14. What was our wishlist?
• Depend on a single storage platform
  • No more platform-specific query language
• Differentiate writes
  • Optimistic (asynchronous)
  • Pessimistic (synchronous)
• Shard data better
  • Partition on user and function
  • Cluster information by user, not by function
• Global expansion
  • Partition on geographic location
• Solve uneven usage of data storage
  • Move data from shard to shard
• Use a (small) connection pool
  • Prevent bursts of writes from creating a connection stampede
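The two write modes on the wishlist can be sketched as follows. This is a minimal Python illustration (the real implementation is in Erlang): a queue-backed background worker stands in for the persistent layer, the function names are hypothetical.

```python
import queue
import threading

store = {}                    # stands in for the persistent storage layer
write_queue = queue.Queue()   # buffer for optimistic writes

def writer_loop():
    # Background worker that drains the optimistic-write queue.
    while True:
        gid, attrs = write_queue.get()
        store.setdefault(gid, {}).update(attrs)
        write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

def write_optimistic(gid, attrs):
    """Asynchronous, fire and forget: enqueue and return immediately."""
    write_queue.put((gid, attrs))

def write_pessimistic(gid, attrs):
    """Synchronous: only return once the write has been applied."""
    store.setdefault(gid, {}).update(attrs)

write_optimistic(1234, {"name": "Roy"})
write_pessimistic(1235, {"name": "Moss"})
write_queue.join()            # for the demo, wait until the async write lands
```

The trade-off is the usual one: the optimistic path gives the caller low latency but no delivery guarantee at call time, while the pessimistic path blocks until the data is durable.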
18. Our building blocks
• Everything written in Erlang
• SSP utilizes local caching (Memcache)
• Flexible (persistent) storage layer
  • MySQL
  • Couchbase
  • Could be any other storage product
20. Why Erlang?
• Functional language
• High availability: designed for telecom solutions
• Excels at concurrency, distribution and fault tolerance
• Do more with less!
• Used by many other companies
21. How do we shard?
• We use the bucket model:
  • Each record has one unique owner attribute, the GID
  • A GID (Global IDentifier) identifies different entity types
  • One or more buckets per piece of functionality
  • A bucket holds structured data
  • Attributes contain the data of a record
  • Attributes do not have to correspond to the storage schema
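As a rough sketch of the bucket model in Python: the bit layout of the GID and the attribute names below are illustrative assumptions, not the production encoding.

```python
# Hypothetical GID layout: the top byte encodes the entity type,
# the remaining bits hold a unique id. The real encoding may differ.
TYPE_USER, TYPE_GAME = 1, 2

def make_gid(entity_type, local_id):
    return (entity_type << 56) | local_id

def gid_type(gid):
    return gid >> 56

# One bucket per piece of functionality; attributes are free-form
# dicts and do not have to match a fixed storage schema.
profile_bucket = {}   # GID -> attributes

roy = make_gid(TYPE_USER, 1234)
profile_bucket[roy] = {"name": "Roy", "friends": [make_gid(TYPE_USER, 1235)]}
```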
28. Global distribution
• Route users to the nearest datacenter (DC)
• Satellite DC
  • Processing and caching
  • Does not own/store data
• Storage DC
  • Processing, caching and persistent storage
• Store all data of the same user in the same DC
  • Partition on user globally
  • Global IDentifier (GID) per user
29. The lookup server
• Contains GIDs and their master DC
• A GID's master DC is predefined
• Migrated GIDs get updated
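A minimal sketch of the lookup server's bookkeeping; the datacenter names and the single default rule are assumptions for illustration:

```python
DEFAULT_MASTER_DC = "dc-eu"   # assumed predefined master DC for new GIDs
migrated = {}                 # overrides recorded when a GID migrates

def master_dc(gid):
    """Return the master DC for a GID: predefined unless migrated."""
    return migrated.get(gid, DEFAULT_MASTER_DC)

def record_migration(gid, new_dc):
    """Update the lookup entry after a GID has moved to another DC."""
    migrated[gid] = new_dc

record_migration(1234, "dc-us")
```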
30. How does this work?
• Globally sharded on GID
• (Local) GID lookup
[Diagram: a GID lookup routes each request to Shard 1 or Shard 2, both backed by persistent storage]
32. Tools
• Add a new shard
• Add a new bucket (and/or version)
• Mark a shard inactive (read only)
• Migrate a user (bucket + version)
• Migrate a shard (bulk)
• Migrate a bucket version (bulk)
33. Why do we need data migration?
• Spread data evenly across shards
  • Migration of buckets between shards
• GID migration between DCs
  • Creating a new storage DC requires data migration
  • Users are automatically migrated after visiting another DC many times
34. Seamless schema upgrades
• Versioning on bucket definitions
• GIDs are assigned to a bucket version
• Data in old bucket versions remains (read only)
• New data only gets written to the new bucket version
• Updates migrate data to the new bucket version
• Migrations can also be triggered explicitly
35. Seamless schema upgrades

Demobucket v2 adds a gender attribute. New records (e.g. Patricia, GID 1241) are written to v2 only; existing data stays in the read-only v1:

Demobucket v1                   Demobucket v2
GID   name                      GID   name      gender
1234  Roy                       1241  Patricia  f
1235  Moss
1236  Jen
1237  Douglas
1238  Denholm
1239  Richmond

When Moss (1235) is updated, the record migrates to v2:

Demobucket v1                   Demobucket v2
GID   name                      GID   name      gender
1234  Roy                       1241  Patricia  f
1236  Jen                       1235  Moss      m
1237  Douglas
1238  Denholm
1239  Richmond

After Jen (1236) is updated as well:

Demobucket v1                   Demobucket v2
GID   name                      GID   name      gender
1234  Roy                       1241  Patricia  f
1237  Douglas                   1235  Moss      m
1238  Denholm                   1236  Jen       f
1239  Richmond
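The migration-on-update behaviour in the Demobucket example can be sketched in Python. This is a simplification under assumed semantics: reads fall through to older versions, updates land in the latest version and remove the older copy.

```python
# Versioned buckets for "Demobucket": v1 is read only, v2 adds gender.
buckets = {
    1: {1234: {"name": "Roy"}, 1235: {"name": "Moss"}, 1236: {"name": "Jen"}},
    2: {1241: {"name": "Patricia", "gender": "f"}},
}
LATEST = 2

def read(gid):
    # Newest version wins; old versions still serve unmigrated records.
    for version in sorted(buckets, reverse=True):
        if gid in buckets[version]:
            return buckets[version][gid]
    return None

def update(gid, attrs):
    # An update migrates the record into the latest bucket version.
    record = dict(read(gid) or {})
    record.update(attrs)
    for version in buckets:
        buckets[version].pop(gid, None)
    buckets[LATEST][gid] = record

update(1235, {"gender": "m"})   # Moss moves from v1 to v2
```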
36. Initial design: Multi-Master writes
• Every cluster (two masters) contains two shards
• Data written interleaved
• HA for both shards
• No warmup needed
  • Both masters active and “warmed up”
[Diagram: the SSP writing to Shard 1 and Shard 2]
37. SSP Master-Master setup
[Diagram: Server-1 through Server-n read and write, via MMM, to two active MySQL masters, db-ssp001 (192.168.2.1) and db-ssp002 (192.168.2.2), kept in sync with asynchronous replication]
38. SSP Master-Master setup (failure)
[Diagram: the master on db-ssp001 (192.168.2.1) is broken; MMM redirects all reads and writes from Server-1 through Server-n to the remaining active master on db-ssp002 (192.168.2.2)]
39. New design: SSP Galera setup
[Diagram: Server-1 through Server-n read/write to any node through a load balancer in front of a Galera cluster of MySQL nodes using synchronous replication]
40. Why does Galera make sense?
• GID write isolation
  • One single process is responsible for a GID
  • The same rows can’t be written at the same time
• Synchronous replication ensures consistency
  • Replication can’t fall behind
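The write-isolation property can be sketched as follows. In production this is one Erlang process owning each GID; here a Python worker thread per GID stands in, and the single-threaded dispatch is an assumption of the sketch.

```python
import queue
import threading

data = {}        # stands in for the Galera-backed storage
workers = {}     # one writer queue per GID

def writer_loop(q):
    # Single consumer: all writes for one GID are applied in order,
    # so the same rows are never written concurrently.
    while True:
        gid, attrs = q.get()
        data.setdefault(gid, {}).update(attrs)
        q.task_done()

def write(gid, attrs):
    # Assumes a single dispatching thread, like the owning Erlang process.
    if gid not in workers:
        q = queue.Queue()
        threading.Thread(target=writer_loop, args=(q,), daemon=True).start()
        workers[gid] = q
    workers[gid].put((gid, attrs))

write(1234, {"name": "Roy"})
write(1234, {"score": 10})
workers[1234].join()   # wait for the GID's queue to drain
```

Because writes for a GID never race, Galera's certification-based replication sees no conflicting transactions on the same rows.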
41. SSP in practice
• Not finished yet
  • Difficult to add development resources
  • Teething problems delayed the migration off the old architecture
• Master-Master was a (very) bad idea
• 2 Galera clusters already in production
• Tools had to be built from scratch
• Initial design lacked parallel writes (bulk imports)
• Erlang uses a connection pool to MySQL
  • Initially set too small
  • Bursts of writes larger than the queue
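The pool-sizing pitfall above can be illustrated with a toy pool; the sizes and the fail-fast behaviour are illustrative only, the real pool lives in the Erlang layer.

```python
import queue

class ConnPool:
    """Fixed-size pool that fails fast when a write burst exhausts it."""

    def __init__(self, size):
        self.conns = queue.Queue()
        for i in range(size):
            self.conns.put(f"conn-{i}")   # stand-in for a MySQL connection

    def acquire(self):
        try:
            return self.conns.get_nowait()
        except queue.Empty:
            raise RuntimeError("pool exhausted: burst larger than the pool")

    def release(self, conn):
        self.conns.put(conn)

pool = ConnPool(size=2)
held = [pool.acquire(), pool.acquire()]
try:
    pool.acquire()              # a third in-flight write overflows the pool
    overflowed = False
except RuntimeError:
    overflowed = True
pool.release(held.pop())
```

A real pool would typically queue waiters with a timeout rather than fail immediately; either way, the pool and its wait queue must be sized for the largest expected write burst.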
43. ROAR
• Stores infrequently written data (e.g. the game catalogue)
• Uses GIDs for games
• Data can be cached aggressively
• API is the same as the SSP’s
  • Data access patterns are different
• Data is not sharded but replicated
• Components used:
  • Memcache: data caching
  • MySQL: denormalized PK data storage
  • Sphinx: string-to-GID translation
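ROAR's read path can be sketched as a read-through cache; the dicts below stand in for Memcache and the denormalized MySQL tables, and the function name is hypothetical.

```python
cache = {}                                     # stands in for Memcache
database = {101: {"title": "Bubble Shooter"}}  # toy denormalized catalogue

def get_game(gid):
    """Read often: serve from cache; fall back to the database on a miss."""
    if gid in cache:
        return cache[gid]
    record = database.get(gid)                 # stands in for a MySQL PK lookup
    if record is not None:
        cache[gid] = record                    # alter rarely: safe to cache hard
    return record

first = get_game(101)    # cache miss, hits the database
second = get_game(101)   # served from cache
```

Because the data is altered rarely, entries can be cached with long lifetimes and only need invalidation on the infrequent writes.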
44. SSP Architecture overview
[Diagram: a portal, JsLib (JavaScript, NginX) and (3rd party) games or apps talk through a load balancer to the SPAPI (NginX, Piqi, Mochiweb), native Erlang services (NginX, Piqi, Webmachine) and ROAR (NginX, PiqiRPC, MySQL), all backed by MySQL]
47. ROAR MySQL
[Diagram: Server-1 through Server-n connect through a load balancer (port 3306 for writes, 3307 for reads) to a Galera cluster of MySQL nodes; two additional read-only MySQL replicas, each paired with a Sphinx instance, are fed by asynchronous replication]
50. ROAR in practice
• Sphinx can handle 4k queries per second
  • Small increases in response times are critical
• Sphinx indexing is CPU intensive
  • Used taskset to pin indexing to a single CPU
  • Real-time indexing
• Denormalized tables designed for a limited set of data
  • Large increases in data are bad
• Not globally distributed yet
52. Thank you!
• Presentation can be found at:
  http://spil.com/plsc2014storage
• If you wish to contact me:
  Email: art@spilgames.com
  Twitter: @banpei
• Engineering @ Spil Games
  Blog: http://engineering.spilgames.com
  Twitter: @spilengineering
53. Photo sources
• Getting closer to the user:
  https://www.flickr.com/photos/wallyonwater54/10360023775/
  (Creative Commons license)
• ROAR:
  https://www.flickr.com/photos/nickharris1/5939895418/in/photostream/
  (Creative Commons license)