You’re building the next killer mobile app. How do you ensure that your app is both stable and capable of near-instant data updates? Build a backend! But there’s more to building a backend than standing up a SQL server in your datacenter and calling it a day. Since different types of apps demand different backend services, how do you know what sort of backend you need? And, more importantly, how can you ensure that your backend will scale so you can survive an explosion of users from events like being featured in the app store? Siva Katir and Melissa Benua will discuss the common scenarios facing mobile app developers who are looking to expand beyond just the device, and will share best practices learned while building PlayFab’s backend and those of other companies. Join Siva and Melissa to learn how you can ensure that your app can scale safely and affordably into the millions of concurrent users (CCU) and across multiple platforms.
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
1. CAN YOUR MOBILE INFRASTRUCTURE SURVIVE 1 MILLION CONCURRENT USERS?
Melissa Benua
Siva Katir
PlayFab, Inc
Mobile Dev + Test 2016
2. Don’t be your own worst enemy!
The Simpsons: Tapped Out launched by EA in 2012
Backend was so unprepared for massive loads of traffic that it was pulled for FIVE months for a total redesign
Went on to become a huge and long-lasting hit in the market for many years afterwards
Can your company afford to add an extra 5 months to the development cycle? Including lost marketing and promotional spend? Including lost mindshare? Including bad press?
3. Be your own guardian angel!
Loadout launched on Steam by Edge of Reality
500x increase in players overnight after being featured in the Steam store
EC2 auto-scaling instantly brought in atomic, replaceable servers to handle the load
No downtime, no panic, no fires
5. What can my backend do for me?
Push updates without going through the full certification process
• New artwork? No problem!
• Message of the day!
• In-app purchase promotions!
Improve customer service
• Have an authoritative source for what a client ‘has’
• Direct access to grant entitlements to remediate issues
6. What can my backend do for me?
Support a single user across multiple devices
• Recover a user’s session even if they lose or replace their device
• Continue the same session across multiple devices (phone to tablet)
Perform ‘trusted’ transactions (especially around receipt verification)
• Clients are untrustworthy!
• A client-to-provider transaction can only say whether a receipt is valid, NOT whether it is valid for your app
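Because the store can only confirm that a receipt is genuine, the server has to add the app-specific checks itself. The following is a minimal Python sketch, assuming the provider’s verification result has already been parsed into a flat dict with hypothetical fields status, bundle_id, product_id and transaction_id, and that a status of 0 means the provider accepted the receipt. Real Apple and Google responses are shaped differently, and the bundle ID and product catalog below are placeholders.

```python
from dataclasses import dataclass, field

# Placeholder values; substitute your real bundle ID and product catalog.
EXPECTED_BUNDLE_ID = "com.example.mygame"
KNOWN_PRODUCTS = {"in_app_1", "in_app_2"}


@dataclass
class ReceiptValidator:
    """Server-side checks layered on top of the store's own verification.

    provider_response is assumed to be the store's verification result,
    already parsed into a flat dict (a simplified, hypothetical shape).
    """
    # Kept in memory here; a real service needs a shared, durable store.
    seen_transactions: set = field(default_factory=set)

    def validate(self, provider_response: dict) -> tuple:
        # 1. The provider can only say whether the receipt is genuine.
        if provider_response.get("status") != 0:
            return False, "provider rejected receipt"
        # 2. A genuine receipt may have been issued for a different app.
        if provider_response.get("bundle_id") != EXPECTED_BUNDLE_ID:
            return False, "receipt is for another app"
        # 3. It must be for something this app actually sells.
        if provider_response.get("product_id") not in KNOWN_PRODUCTS:
            return False, "unknown product"
        # 4. Replaying the same valid receipt must not grant items twice.
        txn = provider_response.get("transaction_id")
        if txn is None or txn in self.seen_transactions:
            return False, "missing or duplicate transaction"
        self.seen_transactions.add(txn)
        return True, "ok"
```

In production the set of seen transaction IDs would live in the authoritative data store rather than in process memory, so every API server sees the same history.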
7. Know Your Project
What is your budget?
• What does it cost to host?
• What does it cost to run?
Who are your engineers?
• Do you have the in-house expertise to manage all services?
• DevOps? Backend? Full-Stack? Front-End?
• Are they willing to be on call 24x7?
What do you need to put in the cloud? Why?
8. Know Your Data
What data are you storing?
• User data
• Group data
• Application data
How does each piece of data need to be queried?
• Can all data be looked up by a key?
• Do you need to do arbitrary field queries?
Is the data read-heavy, write-heavy, or both?
How much data do you expect to store per user?
10. Pick a Cloud Provider
Is your language well supported by your provider?
How much self-management is required for each service?
How well is scalability built in?
Do you have regional requirements?
• European data protection laws
• Russia and China have special data laws
11. Large Needs or Small Needs?
Database + basic CRUD APIs?
• AWS Lambda!
Complex data + user management?
• AWS Mobile or Azure Mobile Services!
Highly custom requirements?
• Roll your own on a public cloud (PROCEED WITH CAUTION!)
12. Storing and Retrieving Data
Know your databases’ strengths
• MySQL – Very easy to get started with and widely supported
• MS-SQL – Powerful query engine and incredibly performant
• MongoDB – Can query against arbitrary fields
• DynamoDB – Very easy scaling and fast random access
Know their weaknesses too
• MySQL – Very hard to scale
• MS-SQL – Still pretty hard to scale
• MongoDB – Very hard to scale correctly and maintain data integrity
• DynamoDB – Can only query cost-effectively against predefined indexes
13. Storing and Retrieving Data
Novel solutions to database shortcomings
• Use multiple databases to take advantage of their individual strengths
• Example: Store “index” data in SQL, while using DynamoDB for the actual data storage that clients use
• Allows you to store all data without needing to scale a hard-to-scale database
Keys:
• Have a way to reliably update the SQL database outside the user’s flow
• Don’t treat the SQL store as authoritative
• Some tools can make this entirely seamless, such as using DynamoDB Streams and Lambda to push updates through to SQL
SQL row:
{
  "playerId": "00001",
  "purchaseId": 1002092,
  "purchaseValue": 0.99,
  "purchaseDate": "2016-03-01 09:01:05"
}
DynamoDB item:
{
  "playerId": "00001",
  "purchaseId": 1002092,
  "purchasedItems": [
    {
      "itemName": "in_app_1",
      "purchasePrice": 0.99
    }
  ]
}
SELECT purchaseId, purchaseValue FROM sqlPurchaseTable WHERE purchaseDate > '2016-03-01';
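The DynamoDB Streams + Lambda approach can be sketched in Python as follows. This is a minimal sketch, not PlayFab’s implementation. It assumes the stream is configured to include new item images, that each DynamoDB item also carries a purchaseDate attribute (the example item above does not show one), and it uses an in-memory SQLite database as a stand-in for the SQL index. A real Lambda entry point has the signature handler(event, context) and would call apply_stream_records with its own database connection. apply_stream_records and from_dynamo are hypothetical names.

```python
import sqlite3
from decimal import Decimal


def from_dynamo(attr: dict):
    """Convert one DynamoDB-typed attribute, e.g. {"N": "0.99"}, to Python.

    Only the S, N, L and M types used in this example are handled.
    """
    type_tag, raw = next(iter(attr.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        return Decimal(raw)  # numbers arrive as strings; avoid float rounding
    if type_tag == "L":
        return [from_dynamo(v) for v in raw]
    if type_tag == "M":
        return {k: from_dynamo(v) for k, v in raw.items()}
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def apply_stream_records(event: dict, conn: sqlite3.Connection) -> int:
    """Mirror inserted/modified purchases into the SQL index table.

    Returns the number of rows written. REMOVE events are skipped here;
    a full version would delete the matching row.
    """
    written = 0
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        item = {name: from_dynamo(attr) for name, attr in image.items()}
        total = sum((i["purchasePrice"] for i in item["purchasedItems"]), Decimal(0))
        # Upsert keyed on purchaseId: a redelivered record overwrites the same
        # row, so replays are harmless. (INSERT OR REPLACE is SQLite syntax.)
        conn.execute(
            "INSERT OR REPLACE INTO sqlPurchaseTable "
            "(purchaseId, playerId, purchaseValue, purchaseDate) "
            "VALUES (?, ?, ?, ?)",
            (int(item["purchaseId"]), item["playerId"], float(total),
             item["purchaseDate"]),
        )
        written += 1
    conn.commit()
    return written
```

Because the write is an idempotent upsert, the SQL side can stay non-authoritative and be rebuilt from DynamoDB if it falls behind. On MySQL the equivalent statement would be INSERT ... ON DUPLICATE KEY UPDATE.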
14. Plan For Failure
Design for the worst, hope for the best
• Any machine can go down at any time
• No machine should be ‘special’
If any machine can go down, then any machine can also be brought up
Architect failure behavior into every layer, both up and down the stack
• DB times out?
• Web server disk fails?
• Third-party provider goes down?
http://gunshowcomic.com/648
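One common way to architect in failure behavior for a timed-out database or an unavailable third party is to retry with capped exponential backoff and random jitter, then fall back to a degraded answer. A minimal Python sketch; TransientError and call_with_retries are hypothetical names, and which exceptions count as transient depends on your driver or SDK.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a timeout or connection error from a DB or third party."""


def call_with_retries(op, attempts=4, base_delay=0.1, max_delay=2.0,
                      fallback=None, sleep=time.sleep):
    """Run op(), retrying transient failures with capped exponential backoff.

    "Full jitter" (a random delay up to the current cap) keeps many clients
    from retrying in lockstep after a shared outage. If every attempt fails
    and a fallback is given, its result is returned instead of raising.
    """
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                break  # no sleep after the final attempt
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    if fallback is not None:
        return fallback()
    raise TransientError(f"gave up after {attempts} attempts")
```

Injecting sleep keeps the function testable, and the fallback (for example, serving cached data) is one way to keep a dependency outage from turning into a full outage of your own service.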
16. Saving Data
Remote != Local
Do:
• Save only changed data
• Save data in batches
• Prepare for connection failures
• Prepare for client failures
• Prepare for server failures
Don’t:
• Save on a timer (unless it’s retrying)
• Save duplicated data
• Expect it to work
• Make assumptions about whether it worked
http://cloudtweaks.com/
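The do’s and don’ts above can be combined in a small client-side save buffer: track only changed fields, send them as one batch on a meaningful event rather than on a timer, and treat anything short of a confirmed success as a failure. A minimal Python sketch; SaveQueue and its send(batch_id, changes) transport are hypothetical, and the server is assumed to ignore a batch ID it has already applied.

```python
import uuid


class SaveQueue:
    """Client-side save buffer: only changed fields, sent as one batch.

    send(batch_id, changes) is a hypothetical transport that returns True
    only when the server confirms the write; anything else is a failure.
    """

    def __init__(self, send):
        self._send = send
        self._saved = {}       # last values the server confirmed
        self._pending = {}     # changed values not yet confirmed
        self._inflight = None  # (batch_id, changes) of the last attempt

    def set(self, key, value):
        if key in self._saved and self._saved[key] == value:
            self._pending.pop(key, None)  # back to the confirmed value
        else:
            self._pending[key] = value

    def flush(self):
        """Send pending changes; call on meaningful events, not a timer."""
        if not self._pending:
            return True
        changes = dict(self._pending)
        # Retrying an identical batch reuses its ID so the server can
        # recognize and drop a duplicate if an earlier attempt did land.
        if self._inflight is None or self._inflight[1] != changes:
            self._inflight = (str(uuid.uuid4()), changes)
        try:
            ok = self._send(self._inflight[0], changes) is True
        except OSError:  # connection errors and timeouts
            ok = False   # never assume it worked
        if ok:
            self._saved.update(changes)
            self._pending.clear()
            self._inflight = None
        return ok
```

A failed flush leaves the changes pending, so the next flush (for example, a retry with backoff) sends them again under the same batch ID.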
17. Loading Data
Easy Wins
• Client:
  • Pre-load data during idle times
  • Cache locally
  • Assume data can fail to load
  • Assume data can arrive corrupted or out of order
  • Assume it will load slowly
  • If security matters, connect via SSL
  • Don’t connect directly to the data store
• Server:
  • Cache data that is OK to serve stale
  • Design data schemas so that each request performs as few queries as possible
  • Design authorization to avoid extra queries, or at least to limit them
Easy Fails
• Trying to implement a custom SSL service
• Trying to be clever with caching
• Assuming anything will work on the first try
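For server-side data that is OK to serve stale (a leaderboard, say), a deliberately simple time-to-live cache, even one that holds entries for only about a second, is often enough, and staying simple also avoids the "clever caching" trap. A minimal Python sketch; TTLCache and the loader function are hypothetical. It never evicts expired keys and does nothing to stop several requests from reloading the same key at once.

```python
import time


class TTLCache:
    """Tiny server-side cache for data that is OK to serve slightly stale.

    Entries expire ttl seconds after they were loaded; clock is injectable
    so expiry can be tested without sleeping.
    """

    def __init__(self, loader, ttl=1.0, clock=time.monotonic):
        self._loader = loader  # function key -> value (e.g. a DB query)
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]       # still fresh enough: no query
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a one-second TTL, a hot key costs at most about one database query per second per server, no matter how many clients ask for it.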
18. Scalability
Don’t optimize early
• Actually know what your bottlenecks are; most likely it is NOT string handling!
• Run a realistic load test with a profiler to get actually useful data
Don’t run blind
• Know your KPIs before launch
• Track your KPIs in real time via counters with Datadog or CloudWatch
• Set up alerting to your DRI (directly responsible individual)
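To make "track your KPIs in real time and alert your DRI" concrete, here is the logic behind a simple error-rate alarm over a sliding window. In practice you would emit counters to Datadog or CloudWatch and configure the alarm there. This Python sketch only shows what such an alarm computes. ErrorRateMonitor is a hypothetical name, and the window, threshold, and minimum sample size are illustrative values, not recommendations.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window API error rate, the kind of KPI an alarm watches."""

    def __init__(self, window=60.0, threshold=0.05, min_calls=100):
        self.window = window        # seconds of history to consider
        self.threshold = threshold  # alert above this error fraction
        self.min_calls = min_calls  # don't page on tiny samples
        self._events = deque()      # (timestamp, is_error), oldest first

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()

    def record(self, now, is_error):
        self._events.append((now, is_error))
        self._evict(now)

    def should_alert(self, now):
        self._evict(now)
        calls = len(self._events)
        if calls < self.min_calls:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / calls > self.threshold
```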
19. Scalability
Know what infrastructure to scale and when
• Data
• API servers
• Load balancers
Design to scale horizontally, not vertically
• All services should be stateless unless they don’t need to scale with the number of users
• Don’t assume a server will exist from minute to minute
Keep a safe capacity margin in your infrastructure
• 50% is reasonable
• Know how long it will take to increase capacity
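The 50% margin translates directly into a provisioning calculation. A minimal Python sketch, assuming "margin" means the share of each server’s load-tested throughput held in reserve at peak. servers_needed and the example numbers are hypothetical.

```python
import math


def servers_needed(peak_rps, rps_per_server, headroom=0.5):
    """Servers required so that peak load uses at most (1 - headroom) of capacity.

    With a 50% margin, each server is planned to run at no more than half
    of the throughput a realistic load test showed it can sustain.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_server = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable_per_server)
```

For example, a peak of 12,000 requests/second on servers that each sustain 800 requests/second needs 15 servers with no margin but 30 with a 50% margin. The margin also has to absorb whatever growth can arrive during the time it takes to add capacity.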
20. Managing Connections
Use connection pooling
Don’t try to outsmart your language’s connection management
Making a connection has a cost!
Don’t re-invent a protocol if an existing one will do
• HTTP is way easier to debug than WebSockets
• WebSockets stream data way more efficiently than HTTP
• Both are safer than using raw TCP
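To show what a pool does, not to replace the one your driver or framework already provides, here is a minimal Python sketch. A fixed number of connections are opened lazily and reused across requests, and callers wait rather than opening connections past the limit. ConnectionPool is a hypothetical name, sqlite3 is only a stand-in for a networked database, and broken connections are not detected or replaced.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Illustration of pooling: a bounded set of connections reused across
    requests instead of paying the connection cost on every request.

    Keep max_size well below the database server's own connection limit.
    """

    def __init__(self, connect, max_size=10, timeout=5.0):
        self._connect = connect
        self._timeout = timeout
        # LIFO hands back the most recently used (warm) connection first.
        self._idle = queue.LifoQueue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(None)  # empty slot: connect lazily on first use

    @contextmanager
    def connection(self):
        # Blocks up to timeout when every connection is busy, then raises
        # queue.Empty instead of opening connections beyond the limit.
        conn = self._idle.get(timeout=self._timeout)
        try:
            if conn is None:
                conn = self._connect()  # pay the connection cost once
            yield conn
        finally:
            self._idle.put(conn)  # return the connection (or empty slot)
```

For example, `with pool.connection() as conn: conn.execute("SELECT 1")` reuses the same underlying connection on every call as long as requests do not overlap.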
Siva
- SURVEY: Who knows the story of EA’s Simpsons game?
- Caused by a combination of bad client and server design
- Client “phoned home” on a timer
- First round of fixes was to just slow the client timer
- Servers never tested to capacity
Melissa – talk about launching Loadout
Melissa
Some apps don’t really need backends
Examples – single-use apps, single-platform apps, apps monetized directly rather than through freemium
Melissa
How long does cert take?
SURVEY: how many people have passed an app through cert already?
Melissa
People on average have 3 connected devices now
Talk about receipt validation, and Apple just changing what they accept (w.r.t. pre-iOS 7 receipts)
Talk about AdCap receipt fraud
Siva
- Entity Name Feature Creep Story
- Even when you know what you should be making, it isn’t always clear if it’s what you need to make.
- UI? Used by non-engineers, predict costs?
- Engineers have specialties; do they have the experience?
- Paged at 2am on New Year’s?
- Cloud expertise on staff?
Siva
- User Data: Stats, inventory, purchase history, game saves
- Guild ranks, group messages, guild inventory
- Game configuration information, side loadable assets
- Querying requirements may dictate your database options
- Are some queries nice-to-have or mandatory? Are they needed for API requests or for reporting?
- How often do you need to sync and/or read a given piece of data
- How much data to store?
- Actionable data
Intro Siva
- You really want a backend…
- Now design it
- What features?
- How each feature?
Melissa
SURVEY: who is using what right now
SURVEY: who is published in Russia? China?
Melissa
- SURVEY: who is using what right now
Siva
Intro: Databases are hard.
- POSGuys product table story
SURVEY: Who’s hit the SQL connection count max? Most databases and drivers have connection count limits (MySQL defaults to 151; the .NET SQL Server client pool defaults to 100, and SQL Server itself allows up to 32,767, though that doesn’t make it a good idea)
BP had a self-serve work order system – it crashed when the SQL Server connection pool ran out
SURVEY: Who knows that MongoDB isn’t ACID compliant (at the time of this talk)?
Never insurmountable
- but fix before fire
Siva
- SQL Leaderboards
- Updated via events out of band
- Stale Data OK
- Data doesn’t match
Melissa
Sometimes datacenters catch on fire or lose power
Melissa
Melissa
Story about the timer guys (1 MB every 10 seconds)
Think about users’ data plans!
Siva
- use SSL
- Only the trusted server writes to the data store
- 1-second cache
- Over 1,500 popular iOS apps were found not to validate SSL authenticity with the OS
Siva
- How long does a given query take?
- How often are you hitting remote services?
- Netflix KPI: Support Calls + Streams Started
- PlayFab API calls and API errors
- If you set up on-call, make sure they can be notified of pertinent events
Siva
- Stateless is hard but is a must
- Why state is bad
- Why stateless is hard
Melissa
Trust your OS and webserver HTTP host engineers
Every machine has a socket limit (Windows is ~65k, and fewer are available for use) – forking service example
SURVEY: who has used WebSockets?