Hear how Turtle Rock launched Evolve, their fast-paced mercenary-vs-monster first-person shooter (FPS), to millions of players using AWS regions around the globe. Turtle Rock provides an in-depth view into Evolve's architecture on AWS, including both their Amazon EC2 and Elastic Load Balancing web API stack and their Crytek-based UDP game servers. Hear how they used Amazon VPC subnets, along with an RDS MySQL-based server registration service, to balance players across Availability Zones and regions. Learn about Turtle Rock's innovative game server scaling logic, which maintains a pool of game server capacity while keeping costs in check. Finally, see Evolve’s Graphite and Grafana monitoring setup, which provides player count and server health status across their worldwide fleet.
2. What to expect from the session
1. What is Evolve?
2. Real-time game simulation in AWS
3. REST-based server reservation service
4. Automation, key for success
5. Going global
6. Auto Scaling
7. Monitoring and metrics
5. What is Evolve?
• 5-player action game
• 4-player co-op hunter team
• 1 player-controlled monster
• Hunters chase the monster
• Monster is hunting wildlife
6. Heavy resource requirements
• Built on Crytek’s CRYENGINE®
• 5 human players but also 30+ AI wildlife
• CPU and memory requirements on par with 40+ player dedicated servers
7. What we’re working with
• Build process: Executables, assets, packaging
• Clients: Windows PC, Xbox One, PlayStation 4
• Server: Linux, stripped assets
9. Real-time game servers in the cloud?
• Traditionally, game servers are colocated
• Needed to purchase physical hardware
• Manually install and maintain that hardware
10. Real-time game servers in the cloud?
• Using a cloud service simplifies those issues
• Testing proved without a doubt it’s possible
• Resource allocation is strict
• CPU, memory, I/O are all predictable
• We shipped it
11. Hardware requirements
• Must be memory bound not CPU bound
• Automated testing: All bot rounds
• C3 instances gave us the best CPU to memory ratio
• RAID-0 stripe ephemeral disks
• Space for executables, assets, logs and core dumps
• We have swap space for emergencies
12. Network requirements
• Enhanced networking a benefit of C3
• Latency to all instance types was good
• The Internet is the real latency variable
• Low bandwidth UDP protocol
• Optimized for P2P
16. Server reservation request
• Servers checking in
• Poll interval decreases with reservation
• Database is only for the transactional data requirement
• No persistence needed
[Diagram: Game Instances, External ELB, App Servers]
17. Server is reserved
• IP/port/server to host
• Host forwards details
• P2P host migration
[Diagram: External ELB, Region, CH, C, C, C]
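The check-in and reservation flow on slides 16 and 17 can be sketched roughly as follows. This is a minimal illustration, not Turtle Rock's service: the in-memory dictionary stands in for the RDS MySQL tables, the `Registry` class, the poll intervals, and the staleness timeout are all hypothetical, and it assumes that "poll interval decreases with reservation" means a reserved server is told to check in more often.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical values; the slides do not state the real intervals.
IDLE_POLL_SECONDS = 10.0
RESERVED_POLL_SECONDS = 1.0


@dataclass
class GameServer:
    ip: str
    port: int
    reserved_by: Optional[str] = None
    last_seen: float = 0.0


class Registry:
    """In-memory stand-in for the transactional server table."""

    def __init__(self) -> None:
        self.servers: Dict[str, GameServer] = {}

    def check_in(self, server_id: str, ip: str, port: int, now: float) -> float:
        """Record a heartbeat and return the delay before the next poll."""
        server = self.servers.setdefault(server_id, GameServer(ip, port))
        server.last_seen = now
        # Assumption: a reserved server polls more frequently.
        return RESERVED_POLL_SECONDS if server.reserved_by else IDLE_POLL_SECONDS

    def reserve(self, host_id: str, now: float,
                stale_after: float = 30.0) -> Optional[Tuple[str, int]]:
        """Reserve the first free, recently seen server for a match host."""
        for server in self.servers.values():
            if server.reserved_by is None and now - server.last_seen <= stale_after:
                server.reserved_by = host_id
                return server.ip, server.port
        return None  # no capacity available


registry = Registry()
registry.check_in("server-1", "10.0.0.5", 7777, now=0.0)
print(registry.reserve("host-a", now=1.0))
```

The host that receives the IP and port would then forward those details to the other clients in its lobby, as slide 17 describes. A real service would perform the reservation inside a database transaction so that two hosts cannot claim the same server.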
19. People are error prone
• Automate everything
• Anything done manually is a liability
20. Server automation system
• Unified application for operations and monitoring
• This gives us blanket authentication, authorization, accounting
• Using AWS SDK
• Very few people have access to AWS
• Every action is predefined from starting a server to provisioning an entire region
21. Server automation system
[Diagram: HTTPS ELB, Operations, API, Web App, Proxy, Grafana, Graphite, Logstash, User Database, Task Database, Processor, AWS SDK, Auto Scaling]
22. Build distribution
• Every build uploaded into Amazon S3
• The 20 GB per-file limit in Amazon CloudFront is real
• Use baked AMIs instead
• Baked before entering production
24. We have dependencies
• MySQL servers
• ELB instances
• Salt configuration servers
• Many other services we haven’t covered
• An instance needs to discover dependencies easily
25. Instance metadata (169.254.169.254)
• Gives us everything we need for auto discovery
• No SDK, no AWS Identity and Access Management (IAM) rule required, no complexity
• Use subnet as our container
• Build a DNS address out of it in Amazon Route 53
• Region-AZ-subnet-service.Domain.net
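The discovery idea on slide 25 can be sketched as below: read the Availability Zone and subnet from instance metadata, then assemble a Route 53 name in the Region-AZ-subnet-service.Domain.net pattern. The metadata paths are the standard EC2 ones; the exact name layout (using only the AZ letter), the `service_hostname` helper, and the `example.net` domain are assumptions for illustration. Instances configured to require IMDSv2 would need to fetch a session token first, which this sketch omits.

```python
import urllib.request
from typing import Callable

METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(path: str) -> str:
    """Plain HTTP GET against the instance metadata endpoint (no SDK, no credentials)."""
    with urllib.request.urlopen(METADATA_BASE + path, timeout=2) as response:
        return response.read().decode().strip()


def service_hostname(service: str, domain: str,
                     fetch: Callable[[str], str] = fetch_metadata) -> str:
    """Build a per-subnet DNS name for a dependency such as MySQL or Salt."""
    az = fetch("placement/availability-zone")       # for example "us-west-1a"
    region, az_letter = az[:-1], az[-1]              # assumes a standard AZ name
    mac = fetch("mac")
    subnet = fetch(f"network/interfaces/macs/{mac}/subnet-id")
    # Assumed layout: region, AZ letter, subnet ID, service name.
    return f"{region}-{az_letter}-{subnet}-{service}.{domain}".lower()


# Offline demonstration with canned metadata values instead of a real instance.
canned = {
    "placement/availability-zone": "us-west-1a",
    "mac": "0a:1b:2c:3d:4e:5f",
    "network/interfaces/macs/0a:1b:2c:3d:4e:5f/subnet-id": "subnet-1234abcd",
}
print(service_hostname("mysql", "example.net", fetch=canned.__getitem__))
```

Passing the fetch function in as a parameter keeps the naming logic testable away from EC2; on a real instance the default `fetch_metadata` would be used.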
28. Instance configuration
• We use SaltStack for server configuration
• Download configuration at startup
• Including the baked AMIs
• Using user data for startup scripting
• Allows quick changes to config without a rebake
• Tagging instances with our own data
30. Make life easier, VPN
• We VPN to all our VPCs from our office and OPS VPCs
• OPS VPCs are for operations services
• OPS in us-west-1, eu-west-1
• Use different IP subnets per VPC
• Direct private SSH to any instance is great
• Simplifies security group management
31. Region discovery
• UDP ping service for every region
• Measures QoS: Latency and packet loss
• Also verifies build availability at the same time
• Written in C, extremely efficient, doubles as a relay
33. Region failover
• Every client discovers and ranks every region locally
• Second best is always known
• Seamless failover to other regions at any time
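Client-side region ranking, as described on slides 31 and 33, might look something like the sketch below. The slides say only that the ping service measures latency and packet loss and verifies build availability; the scoring formula, the `loss_penalty_ms` weight, and the `PingResult` type are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PingResult:
    region: str
    rtts_ms: List[Optional[float]]  # one entry per probe, None for a lost packet
    has_build: bool                 # whether the region serves the client's build


def score(result: PingResult, loss_penalty_ms: float = 1000.0) -> float:
    """Lower is better: mean round-trip time plus a penalty for packet loss."""
    replies = [rtt for rtt in result.rtts_ms if rtt is not None]
    if not replies:
        return float("inf")  # region unreachable
    loss = 1 - len(replies) / len(result.rtts_ms)
    return sum(replies) / len(replies) + loss * loss_penalty_ms


def rank_regions(results: List[PingResult]) -> List[str]:
    """Return reachable regions that have the build, best first."""
    usable = [r for r in results if r.has_build and score(r) != float("inf")]
    return [r.region for r in sorted(usable, key=score)]


results = [
    PingResult("eu-west-1", [120.0, 130.0, None, 125.0], True),
    PingResult("us-west-1", [30.0, 32.0, 31.0, 29.0], True),
    PingResult("us-east-1", [80.0, 82.0, 81.0, 79.0], True),
    PingResult("ap-southeast-2", [20.0, 20.0, 20.0, 20.0], False),
]
ranked = rank_regions(results)
print(ranked)
```

Because every client keeps the full ranking, the second entry is always available as a failover target if the best region stops responding.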
36. Difficult problem
• We can’t use AWS Auto Scaling
• Scaling metric is the ratio of active to available servers
• Scaling is done per individual subnet
• Metrics sourced directly from RDS databases
• We don’t need to worry about fragmentation
• Ready for some complex math?
37. Easy solution
• If we’re over 80% utilization scale up 10%
• If we’re under 60% utilization scale down 10%
• Sample often, act at set interval
• Interval must be longer than scale up time
• Track scale downs
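The rule on slide 37 can be sketched as a small per-subnet scaler. This is an illustration under assumptions: the class and function names are hypothetical, the scaler acts on the most recent sample (the slides do not say how samples are combined), and integer arithmetic is used so the 10% step is exact.

```python
from typing import List, Optional, Tuple

HIGH_PERCENT = 80   # scale up above this utilization
LOW_PERCENT = 60    # scale down below this utilization
STEP_PERCENT = 10   # size of each adjustment


def scale_delta(active: int, total: int) -> int:
    """Return how many servers to add (positive) or remove (negative)."""
    if total <= 0:
        return 0
    step = max(1, total * STEP_PERCENT // 100)
    if active * 100 > HIGH_PERCENT * total:
        return step
    if active * 100 < LOW_PERCENT * total:
        return -step
    return 0


class SubnetScaler:
    """Samples often, but acts at most once per interval."""

    def __init__(self, act_interval_s: float) -> None:
        # The interval should exceed the time a new server takes to come up,
        # otherwise the scaler reacts to capacity that is still booting.
        self.act_interval_s = act_interval_s
        self.next_action_at = 0.0
        self.latest: Optional[Tuple[int, int]] = None
        self.scale_downs: List[Tuple[float, int]] = []

    def sample(self, active: int, total: int) -> None:
        self.latest = (active, total)

    def act(self, now: float) -> int:
        if self.latest is None or now < self.next_action_at:
            return 0
        self.next_action_at = now + self.act_interval_s
        delta = scale_delta(*self.latest)
        if delta < 0:
            self.scale_downs.append((now, -delta))  # track scale downs
        return delta


scaler = SubnetScaler(act_interval_s=300)
scaler.sample(active=90, total=100)
print(scaler.act(now=0.0))
```

In the talk's setup, the active and total counts would come from the RDS registration database for one subnet rather than from in-process samples.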
41. Easy fix
• Track highest peak over last week
• Take the larger of 10% of peak or 10% of current
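A sketch of the peak-based step size from slide 41 follows. It assumes that "peak" refers to the highest server count seen in the trailing week, presumably so that the step does not shrink too far when the fleet is small; the `PeakTracker` class and `step_size` function are hypothetical names.

```python
from collections import deque
from typing import Deque, Tuple

WEEK_SECONDS = 7 * 24 * 3600


class PeakTracker:
    """Remembers the highest server count seen in a trailing window."""

    def __init__(self, window_s: int = WEEK_SECONDS) -> None:
        self.window_s = window_s
        self.samples: Deque[Tuple[float, int]] = deque()  # (timestamp, count)

    def record(self, now: float, count: int) -> None:
        self.samples.append((now, count))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()

    def peak(self) -> int:
        return max((count for _, count in self.samples), default=0)


def step_size(current: int, weekly_peak: int, percent: int = 10) -> int:
    """Larger of 10% of the weekly peak or 10% of current, and at least 1."""
    return max(current * percent // 100, weekly_peak * percent // 100, 1)


tracker = PeakTracker()
tracker.record(now=0, count=1000)     # evening peak
tracker.record(now=3600, count=50)    # quiet period afterwards
print(step_size(current=50, weekly_peak=tracker.peak()))
```

With a quiet fleet of 50 servers and a weekly peak of 1000, the step is 100 servers rather than 5, so capacity can ramp back toward the known peak in a few intervals.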
43. Track everything
• We use Graphite for our data collection
• Grafana for our visualization
• Tracking everything of interest from our applications
• StatsD on all instances to aggregate early
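Emitting a metric to a local StatsD daemon, which then aggregates before forwarding to Graphite, can be sketched as below. The `name:value|type` line format and UDP port 8125 are StatsD's standard conventions; the metric names and helper functions are hypothetical.

```python
import socket


def format_metric(name: str, value: int, metric_type: str = "c") -> str:
    """Build a StatsD line: 'c' is a counter, 'g' a gauge, 'ms' a timer."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to the StatsD daemon on the same instance."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names for a game server process.
send_metric(format_metric("evolve.gameserver.players", 5, "g"))
send_metric(format_metric("evolve.gameserver.rounds_started", 1, "c"))
```

Because the send is a single UDP datagram to localhost, instrumentation adds very little overhead to the game server, and the per-instance daemon reduces the volume of data that reaches Graphite.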
44. Scalable Graphite
• Aggregation periods are important
• I/O requirements are pretty big
• We’re using I2 instances with their large ephemerals
• Striped and using LVM for snapshotting
• Graphite is single threaded
• Need to scale using multiple processes
47. Scaling Logstash
• Fairly straightforward to scale
• Be sure you’re filtering unnecessary data early
• Used for system and application logs for all servers
48. Crash collection
• We upload raw core dumps into Amazon S3
• Process them early on the machine with GDB
• Send processed data into aggregation app
• Web front end to view crash aggregation data
• Links to raw dumps in Amazon S3
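One common way to aggregate processed crashes is to bucket them by a signature derived from the top stack frames. The slides do not say how Turtle Rock's aggregation app groups crashes, so the sketch below is purely an assumed approach; the frame names, S3 keys, and function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def crash_signature(frames: List[str], depth: int = 3) -> str:
    """Hash the top few frames so similar crashes share a bucket."""
    top = "|".join(frames[:depth])
    return hashlib.sha1(top.encode("utf-8")).hexdigest()[:12]


def aggregate(crashes: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Map each signature to the raw dump locations that produced it."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for dump_key, frames in crashes:
        buckets[crash_signature(frames)].append(dump_key)
    return dict(buckets)


# Hypothetical backtraces, as might be extracted from core dumps with GDB.
crashes = [
    ("dumps/a.core", ["AI::Update", "Wildlife::Tick", "Game::Frame", "main"]),
    ("dumps/b.core", ["AI::Update", "Wildlife::Tick", "Game::Frame", "run"]),
    ("dumps/c.core", ["Net::Recv", "Game::Frame", "main"]),
]
for signature, dumps in aggregate(crashes).items():
    print(signature, len(dumps), dumps)
```

Keeping the dump locations alongside each bucket mirrors the slide's point that the web front end links back to the raw dumps in Amazon S3.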
50. Summary
• Running game servers in AWS works
• Automate everything
• Subnets, metadata, Route 53 for auto configuration
• Region failover through ping service
• Auto scale 10% over/under 80/60 works
• Global scale of monitoring and operations
51. We’re hiring
• This is what we were working on last year
• What we’re doing now is bigger and better
• TurtleRockStudios.com