Sony Interactive Entertainment engineers presented on their journey moving mission-critical applications from a single AWS region to an active-active multi-region architecture. They modeled their application dependencies as a graph using Neo4j to identify services ready for multi-region and plan the migration order. Key lessons included validating data replication technologies through testing, redesigning some services to be multi-region native, and implementing centralized configuration to isolate applications within a region.
3. What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Multi-Region AWS Services
• Case Study: Sony’s Multi-Region Active/Active Journey
• Design approach
• Lessons learned
• Migrating without downtime
11. Single region high-availability approach
• Leverage multiple Availability Zones (AZs)
(Diagram: three Availability Zones, A, B, and C, within us-east-1)
12. Reminder: Region-wide AWS services
• Amazon Simple Storage Service (Amazon S3)
• Amazon Elastic File System (Amazon EFS)
• Amazon Relational Database Services (RDS)
• Amazon DynamoDB
• And many more…
14. Good Reasons for Multi-Region
• Lower latency to a subset of customers
• Legal and regulatory compliance (e.g., data sovereignty)
• Satisfy disaster recovery requirements
16. Multi-Region services
• Amazon Route 53 (Managed DNS)
• S3 with cross-region replication
• RDS multi-region database replication
• EBS snapshots
• AMIs
• And many more…
17. Amazon Route 53
• Health checks
• Send traffic to healthy infrastructure
• Latency-based routing
• Geo DNS
• Weighted Round Robin
• Global footprint via 60+ POPs
• Supports AWS and non-AWS resources
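To make these routing policies concrete, here is a minimal boto3 sketch that registers a health check and a latency-based A record for one region; the hosted zone ID, domain name, and endpoint IP are placeholders.

import boto3

route53 = boto3.client("route53")

# Health check against one region's public endpoint (placeholder IP and path)
health = route53.create_health_check(
    CallerReference="us-east-1-api-check",
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Latency-based record: Route 53 answers with the lowest-latency healthy region
route53.change_resource_record_sets(
    HostedZoneId="Z3EXAMPLE",  # placeholder hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "SetIdentifier": "us-east-1",
            "Region": "us-east-1",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.10"}],
            "HealthCheckId": health["HealthCheck"]["Id"],
        },
    }]},
)

A second record set with a different SetIdentifier and Region would be added for each additional region.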
19. S3 – cross-region replication
Automated, fast, and reliable asynchronous replication of data across AWS regions
• Only replicates new PUTs; once S3 is configured, all new uploads into a source bucket are replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions / storage classes
• Transition S3 ownership from the primary account to a sub-account
Use cases:
• Compliance—store data hundreds of miles apart
• Lower latency—distribute data to regional customers
• Security—create remote replicas managed by separate AWS accounts
(Diagram: replication from a source bucket in Virginia to a destination bucket in Oregon)
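A minimal boto3 sketch of enabling cross-region replication on a source bucket; the bucket names and IAM role ARN are placeholders, and versioning must already be enabled on both buckets.

import boto3

s3 = boto3.client("s3")

# Replicate every new PUT from the source bucket to the destination bucket
s3.put_bucket_replication(
    Bucket="store-assets-virginia",  # placeholder source bucket (us-east-1)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder role
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",          # empty prefix = entire bucket
            "Status": "Enabled",
            "Destination": {
                "Bucket": "arn:aws:s3:::store-assets-oregon",  # placeholder destination (us-west-2)
                "StorageClass": "STANDARD",
            },
        }],
    },
)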
20. RDS cross-region replication
• Move data closer to customers
• Satisfy disaster recovery requirements
• Relieve pressure on database master
• Promote read-replica to master
• AWS managed service
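As a sketch of the managed workflow, creating a cross-region read replica and later promoting it looks roughly like this with boto3; the instance identifiers and source ARN are placeholders.

import boto3

# Run in the destination region; cross-region sources are referenced by ARN
rds = boto3.client("rds", region_name="us-west-2")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="store-db-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:store-db",
    DBInstanceClass="db.r5.large",
)

# Later, during failover or cutover, promote the replica to a standalone master
rds.promote_read_replica(DBInstanceIdentifier="store-db-replica")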
24. What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Enabling AWS services
• Case Study: Sony Multi-Region Active/Active
• Design approach
• Lessons learned
• Migrating without downtime
26. Who is talking?
Alexander Filipchik (PSN: LaserToy), Principal Software Engineer at Sony Interactive Entertainment
Dustin Pham, Principal Software Engineer at Sony Interactive Entertainment
28. Small team, large responsibility
• Service team ran like a startup
• Fewer than 10 core people working on the new PS3 store services
• PSN’s user base was already several hundred million users
• Relied on quick iterations of architecture on AWS
34. Delivered new store
• Great job, now onto the PS4
• PS4 launch – 1 million users at once on Day 1, Hour 1
• Designing for many different use cases at scale
36. Next step: make highly available
• Highly available for us: multiregion active/active
• Raising key questions:
• How does one move a large set of critical apps with hundreds of terabytes of live data?
• How do we architect every aspect to allow for multi-region, active-active?
• How do we turn on active-active without user impact?
• User impact includes hardware (PS3/PS4/etc.) and game partners!
• Where do we even begin?
38. Applications
• First question to answer: What does it mean to be multiregional?
• Different people had different answers:
• Active/stand-by vs. active/active
• Full data replication vs. partial
• Automatic failover vs. manual
• Etc.
40. Agreement
• “You should be able to lose 1 of anything” approach.
• Which means we should be able to survive, without any visible impact, the loss of:
• 1 server
• 1 Availability Zone
• 1 region
41. Starting with uncertainty
• Multiple macro and micro services
• Stateless and stateful services
• They depend on multiple technologies
• Some are multiregional and some are not
• Documentation was, as always, out of date
44. Stages of grief
• Denial – can’t be true, let’s check again
• Anger – we told everyone to be active/active ready!!!
• Bargaining – active/stand-by?
• Depression – we can’t do it
• Acceptance – let’s work to fix it, we have 6 months…
45. What it tells us
• We can’t just put things in two regions and expect them to work
• We will need to do some work to:
• Migrate services to technology that is multiregional by design
• Somehow make the underlying technology multiregional
46. Scheduling/optimization problem
• There is work to be done on both the application and infrastructure sides
• We need to schedule it so we can get results faster and minimize waits
• And we wanted a machine to help us
47. Neo4j: the world’s leading graph database
That can store a graph of 30B nodes
Here to help us deal with our problem
48. Why Neo4J
• Graph engine and we are dealing with a graph
• Query language that is very powerful
• Can be populated programmatically
• Can show us something we didn’t expect
49. How to use it?
• Model
• Identify nodes and relations
• Tracing
• Code analyzer
• Talking to people
• Generate the graph
• Run queries
50. Model example
• Nodes
• Users
• Technology (e.g., Cassandra, Redis)
• multiregional: true/false
• Service (applications)
• stateless: true/false
• Edges
• Usage patterns (read, write)
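A minimal sketch of how such a graph could be populated and queried with the Neo4j Python driver and Cypher. The node properties mirror the model above, but the service and technology names, connection details, and the “ready for multi-region” query are illustrative assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes: a service (stateless or not) and a technology (multiregional or not)
    session.run("MERGE (:Service {name: $n, stateless: $s})", n="store-api", s=True)
    session.run("MERGE (:Technology {name: $n, multiregional: $m})", n="Cassandra", m=True)

    # Edge: the usage pattern between them
    session.run(
        "MATCH (s:Service {name: $svc}), (t:Technology {name: $tech}) "
        "MERGE (s)-[:USES {pattern: $p}]->(t)",
        svc="store-api", tech="Cassandra", p="read-write")

    # Example query: services that depend only on multiregional technologies
    ready = session.run(
        "MATCH (s:Service) "
        "WHERE NOT (s)-[:USES]->(:Technology {multiregional: false}) "
        "RETURN s.name AS name")
    for record in ready:
        print(record["name"])

driver.close()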
56. What to do next
• Validate multiregional technologies do actually work
• Figure out what to do with non-multiregional technologies
• Move services in the order suggested by the dependency graph
57. Validating our main DB (Cassandra)
A lot of unknowns:
• Will it work?
• Will performance degrade?
• How eventual is multiregional eventual consistency?
• Will we hit any roadblocks?
• Well, how many roadblocks will we hit?
58. What did we know?
Netflix is doing it on AWS, and they actually tested it
They wrote 1M records in one region of a multi-region cluster
500 ms later, a read was initiated in the other regions
All records were successfully read
59. Well…
Some questions to answer:
• Should we just trust Netflix’s results, replicate our data, and see what happens?
• Is their experiment applicable to our situation?
• Can we do better?
(Meme: “How to get an engineer’s attention”: break something, free coffee, or say “there’s gotta be a better way to do this”)
60. Cassandra validation strategy
• Use production load/data
• Simulate disruptions
• Track replication latencies
• Track lost mutations
• Cassandra modifications were required
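A rough sketch of the kind of probe that tracks replication latency and lost mutations across regions with the DataStax Python driver; the seed hostnames, keyspace, and table are hypothetical, and the actual validation ran against production load with modified Cassandra.

import time, uuid
from cassandra.cluster import Cluster

# Hypothetical seeds and keyspace; assumes a 'probes(id uuid PRIMARY KEY, written_at double)' table
east = Cluster(["cass-seed.us-east-1.internal"]).connect("canary")
west = Cluster(["cass-seed.us-west-2.internal"]).connect("canary")

def measure(samples=100, timeout=10.0):
    latencies, lost = [], 0
    for _ in range(samples):
        key, t0 = uuid.uuid4(), time.time()
        east.execute("INSERT INTO probes (id, written_at) VALUES (%s, %s)", (key, t0))
        # Poll the other region until the write shows up, or count it as lost
        while time.time() - t0 < timeout:
            if west.execute("SELECT id FROM probes WHERE id = %s", (key,)).one():
                latencies.append(time.time() - t0)
                break
            time.sleep(0.05)
        else:
            lost += 1
    return latencies, lost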
65. Things that are not multiregional by design
We gave teams 2 options:
• Redesign if it is critical to the user experience
• If it is not in the critical path (e.g., batch jobs):
• active/passive
• master/slave
• Use Kafka as a replication backbone (recommended; see the sketch below)
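A minimal sketch of the Kafka-backbone option for a non-multiregional store: region 1 publishes change events and region 2 replays them into its local copy. Topic names, broker addresses, and the apply function are hypothetical, and cross-region topic mirroring (e.g., MirrorMaker) is assumed to be in place.

import json
from kafka import KafkaProducer, KafkaConsumer

# Region 1: publish each local mutation as an event (hypothetical topic/brokers)
producer = KafkaProducer(
    bootstrap_servers=["kafka.region1.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("profile-changes", {"user_id": 42, "field": "avatar", "value": "new.png"})
producer.flush()

def apply_to_local_store(event):
    # Hypothetical stand-in for writing the mutation into region 2's local database
    print("applying", event)

# Region 2: consume the mirrored topic and replay mutations into the local store
consumer = KafkaConsumer(
    "profile-changes",
    bootstrap_servers=["kafka.region2.internal:9092"],
    group_id="region2-applier",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    apply_to_local_store(message.value)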
73. Phase 1: Infrastructure key points
• Building infrastructure in a new region must be fully automated (Infrastructure as Code)
• Regional communication decisions:
• VPNs?
• Over the Internet?
• Do the infrastructures have to match exactly?
• The 1st region evolved organically
• The 2nd region should be the blueprint for all new region DCs
75. Phase 2: Data option 1 replication over VPN
(Diagram: ELBs, inbound tier, outbound tier, and data tier in the public subnet, with the data tier replicating to Region 2 over a VPN)
76. Phase 2: Data option 1 replication over VPN
• Pros
• Setting up a VPN with the current network architecture would be easier on the data tier
• Secure
• Managing data node intercommunication is straightforward and has lower operational overhead
• Cons
• Limit on throughput
• The data set is large and can quickly saturate the VPN
• Scaling more applications in the future will be complicated!
77. Phase 2: Data option 2 replication over ENIs with public IPs
(Diagram: application tiers across private and public subnets, with the data tier in the public subnet replicating to Region 2 over SSL)
78. Phase 2: Data option 2 replication over ENIs with public IPs
• Pros
• Not network constrained
• Able to add more applications and data without needing to build new infrastructure
• Cons
• Operationally, more orchestration (Cassandra, for example, needs to know the other nodes’ Elastic IPs)
• Internode data transfer security is a must (see the cassandra.yaml sketch below)
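For the public-IP option, the orchestration largely boils down to rendering per-node Cassandra settings: listen on the private IP, broadcast the public Elastic IP, and force encrypted internode traffic. A small sketch, with hypothetical addresses and keystore paths:

import yaml

# Hypothetical per-node values rendered into cassandra.yaml overrides
node_settings = {
    "listen_address": "10.0.1.16",         # private IP inside the VPC
    "broadcast_address": "203.0.113.25",   # public Elastic IP seen by the other region
    "server_encryption_options": {
        "internode_encryption": "all",     # encrypt all node-to-node traffic
        "keystore": "/etc/cassandra/conf/keystore.jks",
        "keystore_password": "changeit",
        "truststore": "/etc/cassandra/conf/truststore.jks",
        "truststore_password": "changeit",
    },
}

with open("cassandra-overrides.yaml", "w") as f:
    yaml.safe_dump(node_settings, f, default_flow_style=False)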
80. Phase 3: App tier + cache strategy
• Applications communicate within a region only
• Applications do not call another region’s databases, caches, or applications
• Isolation creates predictable failure cases and clearly defines failure domains
• Monitoring and alerting are greatly simplified in this model
82. Phase 4: Client routing
• Predictable “sticky” routing via geo-routing to avoid user bounce
• Data replication manages cross-region state
• Allows for routing to stateless services
• Ability to do percentage-based routing to manage different failure scenarios (see the weighted-routing sketch below)
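Percentage-based routing can be sketched with Route 53 weighted records; the hosted zone, record name, and endpoint IPs below are placeholders.

import boto3

route53 = boto3.client("route53")

def set_weights(weight_region1, weight_region2):
    # Two weighted records for the same name; Route 53 splits traffic by weight
    changes = []
    for set_id, weight, ip in (("region1", weight_region1, "203.0.113.10"),
                               ("region2", weight_region2, "198.51.100.20")):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "store.example.com",
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z3EXAMPLE", ChangeBatch={"Changes": changes})

set_weights(90, 10)   # e.g., keep 90% of traffic in region 1 while validating region 2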
84. Software design for multiregion deployments
• Typical software architecture
(Diagram: application layers for APIs, Business Logic, and Data Access, with cross-cutting concerns and Config alongside)
85. Software design for multiregion deployments
(Diagram: the same application stack deployed in Region 1 and Region 2)
Remember when we mentioned that application-tier call patterns should be isolated within a region? How do we achieve this simply?
86. Software configuration approaches
• An application config to connect to a database could look like:
cassandra.seeds=10.0.1.16,10.0.1.17
• A naïve approach would be to give the application multiple configs per deployable, depending on its region:
cassandra.seeds.region1=10.0.1.16,10.0.1.17
cassandra.seeds.region2=10.0.2.16,10.0.2.17
• This, of course, results in an app config management nightmare, especially now with 2 regions
87. Software configuration approaches
• What if we implemented a basic “central” way of configuration?
(Diagram: an application in region X asks a region-local store “Where are my C* seeds?”, gets back cassandra.seeds=cass-seed1,cass-seed2, and cass-seed1 resolves to a region-local IP)
88. Simplified software configuration (context)
• Context is made available to the application and contains:
• Data center/region
• Endpoint short-name resolution
• Environment (Dev, QA, Prod, A/B)
• Database connection details
• Context is the responsibility of the infrastructure itself and is provided through build automation, AWS tagging, etc.
• The app is responsible for behaving correctly based on its context (see the sketch below)
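A minimal sketch of a context-driven app: the infrastructure drops a context file on each host and the application resolves short names through it, so the same deployable runs unchanged in every region. The file path and field names are assumptions.

import json

def load_context(path="/etc/app/context.json"):
    # Written by build automation / derived from AWS tags; hypothetical format:
    # {"region": "us-west-2", "environment": "prod",
    #  "endpoints": {"cass-seed1": "10.0.2.16", "cass-seed2": "10.0.2.17"}}
    with open(path) as f:
        return json.load(f)

def cassandra_seeds(context):
    # The app only knows short names; the region-local context resolves them
    return [context["endpoints"][name] for name in ("cass-seed1", "cass-seed2")]

context = load_context()
print("region:", context["region"])
print("cassandra.seeds=" + ",".join(cassandra_seeds(context)))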
89. Infrastructure as code
• New regions must be built through automation
• Services are specified and translated into Terraform
• An internal tool and DSL were built to manage domain-specific needs
• Example (a toy version is sketched below):
• Specify that an app requires Cassandra and SNS
• Generate Terraform to create security groups for ports 9160 and 7199-7999, build the SNS topic, build the ELB for the app, etc.
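A toy sketch of the idea behind such a generator: take a small app spec and emit Terraform for the resources it implies. The spec format, resource names, and CIDR are invented for illustration; the real tool and DSL are internal.

APP_SPEC = {"name": "store-api", "needs": ["cassandra", "sns"]}

CASSANDRA_PORTS = [(9160, 9160), (7199, 7999)]  # port ranges from the slide

def generate_terraform(spec):
    blocks = []
    if "cassandra" in spec["needs"]:
        rule_tmpl = ('  ingress {{\n'
                     '    from_port   = {0}\n'
                     '    to_port     = {1}\n'
                     '    protocol    = "tcp"\n'
                     '    cidr_blocks = ["10.0.0.0/8"]\n'
                     '  }}')
        rules = "\n".join(rule_tmpl.format(lo, hi) for lo, hi in CASSANDRA_PORTS)
        blocks.append(
            'resource "aws_security_group" "{0}_cassandra" {{\n{1}\n}}'.format(spec["name"], rules))
    if "sns" in spec["needs"]:
        blocks.append(
            'resource "aws_sns_topic" "{0}_events" {{\n  name = "{0}-events"\n}}'.format(spec["name"]))
    return "\n\n".join(blocks)

print(generate_terraform(APP_SPEC))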
90. Database automation
• Ansible runs assist in building Cassandra in the public subnet and associating an EIP with every new node
• Manages network rules (whitelisting)
• Manages certificates and SSL
(Diagram: private and public subnets with ELBs and the outbound tier, and SSL-encrypted replication to Region 2)
92. Monitoring through proper tagging
• Part of the “Context” applications are aware of is the region
• The region is added to all app logs
• Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice (see the logging sketch below)
• Cross-region monitoring of key metrics and alerting:
• Data replication (hints in Cassandra, seconds behind master in MySQL, etc.)
• Data in/out
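A small sketch of stamping the region from the context onto every log line, so region-tagged logs (and the metrics derived from them) can be filtered per region; the region value and logger name are placeholders.

import logging

REGION = "us-west-2"   # in practice, read from the context / AWS tags

logging.basicConfig(format="%(asctime)s region=%(region)s %(levelname)s %(message)s")
log = logging.LoggerAdapter(logging.getLogger("store-api"), {"region": REGION})

# Example: alert-worthy replication lag surfaced with the region attached
log.warning("cassandra hints above threshold: %d", 1200)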
93. Putting it all together
(Diagram: Region 1 and Region 2 side by side, with the cutover steps: create infrastructure, replicate data, shift DNS)
95. Lessons learned
• Data synchronization is super critical, so build the dependency map based on the data technologies first.
• Always run your own benchmarking.
• Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new.
• Applications must be context-driven.
• Depending on your data load, cross-region VPNs may not make sense.