Operating online games is fun and challenging. Games are some of the spikiest workloads around, and real-time really means *real-time*. Randy shares many of the DevOps techniques his team has put into practice at KIXEYE in three areas: cloud infrastructure, service teams, and DevOps culture. He covers elastic workloads, microservices, configuration automation, and a common service "chassis". He also discusses the organizational and technical disciplines of team autonomy, internal vendor-customer relationships, and, of course, "you build it, you run it"!
DevOpsDays Silicon Valley 2014 - The Game of Operations
1. The Game of Operations and The Operation of Games
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
DevOpsDays Silicon Valley, June 28 2014
2. Background
CTO at KIXEYE
• Real-time strategy games for web and mobile
Director of Engineering for Google App Engine
• World’s largest Platform-as-a-Service
Chief Engineer at eBay
• Multiple generations of eBay’s real-time search infrastructure
3. 1973: Xerox PARC and SuperPaint
en.wikipedia.org/wiki/SuperPaint
www.computerhistory.org/collections/catalog/X1001.89B
5. Real-Time Strategy Games are …
• Real-time
• Spiky
• Computationally intensive
• Constantly evolving
• Constantly pushing boundaries
Technically and operationally demanding
6. Operating Games: Goals
Player Fun
• If players aren’t playing, we don’t have a business
• If players aren’t having fun, we don’t have a business for long
• Fun includes game mechanics, feature set, uptime, performance
Developer Productivity and Satisfaction
• We are a vendor; the studios are our customers
• Must be *strictly better* than the alternatives of build, buy, borrow
Cost Efficiency
• More output for less
7. The Game of Operations
Cloud
• All studios and services moving to AWS
• Strong focus on automation
Services
• Small, focused teams
• Clean, well-defined interface to customers
DevOps Culture
• One team across development and ops
8. The Game of Operations
Cloud
Services
DevOps Culture
9. Why Cloud? (The Obvious)
Provisioning Speed
• Minutes, not weeks
• Autoscaling in response to load
Near-Infinite Capacity
• No need to predict and plan for growth
• No need to defensively overprovision
Pay For What You Use
• No “utilization risk” from owning / renting
• If it’s not in use, spin it down
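The "utilization risk" point can be made concrete with a little arithmetic. The sketch below compares provisioning for peak around the clock with paying only for the instances needed each hour. The demand curve and the hourly rate are made-up illustration values, and using the same rate for both cases is a simplifying assumption, not a claim about real pricing.

```python
from typing import Sequence


def owned_cost(peak_instances: int, hours: int, hourly_rate: float) -> float:
    """Cost of keeping peak capacity running every hour, used or not."""
    return peak_instances * hours * hourly_rate


def on_demand_cost(hourly_demand: Sequence[int], hourly_rate: float) -> float:
    """Cost of paying only for the instances actually needed each hour."""
    return sum(hourly_demand) * hourly_rate


# Hypothetical day: 20 quiet hours at 10 instances, 4 peak hours at 100.
demand = [10] * 20 + [100] * 4
rate = 0.50  # assumed price per instance-hour, for illustration only

fixed = owned_cost(max(demand), len(demand), rate)
elastic = on_demand_cost(demand, rate)
# Fraction of the peak-sized fleet that is doing useful work on average.
utilization = sum(demand) / (max(demand) * len(demand))
```

The spikier the workload, the lower the utilization of a peak-sized fleet, and the larger the gap between the two costs.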
10. Why Cloud? (The Less Obvious)
Instance Shaping
• Instance shapes to fit most parts of the solution space (compute-intensive, IO-intensive, etc.)
• If one shape does not fit, try another
Service Quality
• Amazon and Google know how to run data centers
• Battle-tested and highly automated
• World-class networking, both cluster fabric and external peering
11. Why Cloud? (Fundamental Forces)
Economics
• Nearly impossible to beat Google / Amazon buying power or operating efficiencies
• 2010s in computing are like 1910s in electric power
Developer Adoption
• It Just Works ™
• Makes it easy to fall in love with infrastructure
12. “Soon it will be just as common to run your own data center as it is to run your own electric power generation”
-- me
13. Autoscaling
Games are very spiky
• Very unpredictable
• Huge variability between peak and trough
Hits are self-reinforcing
14. Automation Work at KIXEYE
Resilient Clients
• Clients back off in response to latency
• Clients continue gameplay despite network disruption
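One way a client might back off in response to latency is sketched below. It is not KIXEYE's client code: the class name, thresholds, and the doubling/halving policy are all assumptions chosen for illustration. The delay between requests grows when a response is slow or missing and shrinks again once the server looks healthy, and a random wait up to that delay avoids many clients retrying in lockstep.

```python
import random


class LatencyBackoff:
    """Grow the delay between requests when the server looks slow (hypothetical helper)."""

    def __init__(self, base=0.1, cap=10.0, slow_threshold=0.5):
        self.base = base                      # seconds between requests when healthy
        self.cap = cap                        # never wait longer than this
        self.slow_threshold = slow_threshold  # latency treated as "server under load"
        self.delay = base

    def record(self, latency):
        """Update the delay from one observed latency; None means the request failed."""
        if latency is None or latency > self.slow_threshold:
            self.delay = min(self.cap, self.delay * 2)   # back off
        else:
            self.delay = max(self.base, self.delay / 2)  # recover gradually
        return self.delay

    def next_wait(self, rng=random):
        """Pick a random wait between zero and the current delay (jitter)."""
        return rng.uniform(0, self.delay)


backoff = LatencyBackoff()
for observed in (0.05, 0.9, 1.2, None, 0.1):
    backoff.record(observed)
```

The game loop would keep simulating locally while waiting, which is how gameplay can continue through a brief network disruption.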
Elastic Services
• Services grow / shrink based on load
• Service Cluster == AWS Auto Scaling group
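"Grow / shrink based on load" can be sketched as a simple proportional rule: size the cluster so that the load per instance returns to a target, then clamp to the group's minimum and maximum. This is a simplified illustration, not the exact algorithm AWS uses; the function name and the numbers are assumptions.

```python
import math


def desired_capacity(current, load_per_instance, target_per_instance,
                     min_size, max_size):
    """Return the instance count that brings per-instance load back to the target."""
    raw = math.ceil(current * load_per_instance / target_per_instance)
    # Clamp so the cluster never shrinks below min_size or grows past max_size.
    return max(min_size, min(max_size, raw))


scale_out = desired_capacity(4, 90, 60, min_size=2, max_size=20)    # load above target
scale_in = desired_capacity(10, 12, 60, min_size=2, max_size=20)    # load well below target
capped = desired_capacity(10, 300, 60, min_size=2, max_size=20)     # spike beyond the cap
```

Rounding up favors having slightly too much capacity over too little, which suits workloads where a hit game can spike without warning.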
15. Automation Work at KIXEYE
Build / Deploy Pipeline
• One button
• Puppet -> Packer -> AMI -> Asgard
• Zero-downtime red-black deployment
• Futures: canarying, auto-rollback
Manageability
• Puppet for configuration management
• Flume -> Elasticsearch / Kibana for logging
• Shinken -> PagerDuty for monitoring and alerting
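The idea behind zero-downtime red-black deployment can be sketched as a small state machine: bring the new cluster up beside the live one, move traffic only once the new cluster is healthy, and keep the old cluster around for a fast rollback. The classes below are a toy model written for illustration; they are not Asgard's API, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    version: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    live: Cluster
    standby: list = field(default_factory=list)

    def red_black_deploy(self, new: Cluster) -> bool:
        """Switch traffic to `new` only if it is healthy; otherwise keep serving from `live`."""
        if not new.healthy:
            return False                  # the old cluster never stopped serving
        self.standby.append(self.live)    # keep the old cluster for fast rollback
        self.live = new
        return True

    def rollback(self) -> None:
        """Point traffic back at the most recent previous cluster."""
        self.live = self.standby.pop()


lb = LoadBalancer(Cluster("v1"))
failed = lb.red_black_deploy(Cluster("v2", healthy=False))
version_after_failed_deploy = lb.live.version
succeeded = lb.red_black_deploy(Cluster("v2"))
version_after_deploy = lb.live.version
lb.rollback()
version_after_rollback = lb.live.version
```

Auto-rollback, listed under futures, would amount to calling `rollback()` automatically when health metrics degrade after the switch.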
16. The Game of Operations
Cloud
Services
DevOps Culture
17. Service Teams
• Give teams autonomy
• Freedom to choose technology, methodology, working environment
• Responsibility for the results of those choices
• Hold them accountable for *results*
• Give a team a goal, not a solution
• Let team own the best way to achieve the goal
18. KIXEYE Service Chassis
• Goal: “chassis” for building scalable game services
• Minimal resources, minimal direction
• 3 people x 1 month
• Consider building on NetflixOSS
Team exceeded expectations
• Co-developed chassis, transport layer, service template, build pipeline, red-black deployment, etc.
• Operability and manageability from the beginning
• 15 minutes from no code to running service in AWS (!)
• Open-sourced at github.com/kixeye
20. Transition to Service Relationships
Vendor – Customer Relationship
• Friendly and cooperative, but structured
• Clear ownership and division of responsibility
• Customer can choose to use service or not (!)
Service-Level Agreement (SLA)
• Promise of service levels by the provider
• Customer needs to be able to rely on the service, like a utility
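An availability promise in an SLA translates directly into a downtime budget. The sketch below does that arithmetic for a 30-day month; the availability figures are common examples, not KIXEYE's actual service levels.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in `days` days at the given availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


three_nines = round(downtime_budget_minutes(99.9), 2)   # about 43 minutes per month
four_nines = round(downtime_budget_minutes(99.99), 2)   # about 4 minutes per month
```

Each extra "nine" cuts the budget by a factor of ten, which is worth keeping in mind before a provider promises it.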
21. Transition to Service Relationships
Charging and Cost Allocation
• Charge customers for *usage* of the service
• Aligns economic incentives of customer and provider
• Motivates both sides to optimize
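Charging for usage can be as simple as splitting a service's cost in proportion to what each customer consumed. The sketch below assumes a single usage metric (for example, requests served); the studio names and figures are hypothetical.

```python
def allocate_cost(total_cost, usage_by_customer):
    """Split `total_cost` across customers in proportion to their usage."""
    total_usage = sum(usage_by_customer.values())
    if total_usage == 0:
        # Nobody used the service, so nobody is charged.
        return {name: 0.0 for name in usage_by_customer}
    return {
        name: total_cost * used / total_usage
        for name, used in usage_by_customer.items()
    }


# Hypothetical month: a shared service cost 1000 and served three studios.
charges = allocate_cost(1000, {"studio_a": 600, "studio_b": 300, "studio_c": 100})
```

A studio that halves its request volume sees its share fall, and a provider that lowers the total cost lowers every customer's charge, which is the incentive alignment the slide describes.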
22. The Game of Operations
Cloud
Services
DevOps Culture
23. One Team (!)
• Act as one team across development, product, operations, etc.
• Solve problems instead of blaming and pointing fingers
• Political games are not as fun as real-time strategy games
24. Everyone Is Responsible for Prod
Everyone’s incentives are aligned
Everyone is strongly motivated to have solid instrumentation and monitoring
26. Blame-Free Post-Mortems
Learn from mistakes and improve
• What did you do -> What did you learn
• Take emotion and personalization out of it
Post-mortem After Every Incident
• Document exactly what happened
• What went right
• What went wrong
27. Blame-Free Post-Mortems
Open and Honest Discussion
• What contributed to the incident?
• What could we have done better?
Engineers compete to take responsibility (!)
28. “Failure is not falling down but refusing to get back up”
– Theodore Roosevelt
29. Transition to DevOps Organization
• Studios make user-visible games
• Services provide common endpoints
Training / Retraining
• Common bootcamp
• Train devs as Ops, Ops as devs
Transition On-call
• Use primary / secondary on-call as apprenticeship
32. Come Join Us!
DevOps Whiskey Tasting, July 22
333 Bush St., San Francisco
kixeyeloveswhiskey.eventbrite.com
Hiring in SF, Seattle, Victoria, Brisbane, Amsterdam
www.kixeye.com/jobs