Sony Interactive Entertainment engineers presented on their journey moving mission-critical applications from a single AWS region to an active-active multi-region architecture. They modeled their application dependencies as a graph using Neo4j to identify services ready for multi-region and plan the migration order. Key lessons included validating data replication technologies through testing, redesigning some services to be multi-region native, and implementing centralized configuration to isolate applications within a region.
3. What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Multi-Region AWS Services
• Case Study: Sony’s Multi-Region Active/Active Journey
• Design approach
• Lessons learned
• Migrating without downtime
11. Single region high-availability approach
• Leverage multiple Availability Zones (AZs)
(Diagram: three Availability Zones, A, B, and C, within us-east-1)
12. Reminder: Region-wide AWS services
• Amazon Simple Storage Service (Amazon S3)
• Amazon Elastic File System (Amazon EFS)
• Amazon Relational Database Services (RDS)
• Amazon DynamoDB
• And many more…
14. Good Reasons for Multi-Region
• Lower latency to a subset of customers
• Legal and regulatory compliance (e.g., data sovereignty)
• Satisfy disaster recovery requirements
16. Multi-Region services
• Amazon Route 53 (Managed DNS)
• S3 with cross-region replication
• RDS multi-region database replication
• EBS snapshots
• AMIs
• And many more…
17. Amazon Route 53
• Health checks
• Send traffic to healthy infrastructure
• Latency-based routing
• Geo DNS
• Weighted Round Robin
• Global footprint via 60+ POPs
• Supports AWS and non-AWS resources
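To make these routing policies concrete, here is a minimal boto3 sketch that registers a health check and a latency-based A record for one region; the hosted zone ID, domain name, and endpoint IP are placeholders.

import boto3

route53 = boto3.client("route53")

# Health check against one region's public endpoint (placeholder IP and path)
health = route53.create_health_check(
    CallerReference="us-east-1-api-check",
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Latency-based record: Route 53 answers with the lowest-latency healthy region
route53.change_resource_record_sets(
    HostedZoneId="Z3EXAMPLE",  # placeholder hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "SetIdentifier": "us-east-1",
            "Region": "us-east-1",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.10"}],
            "HealthCheckId": health["HealthCheck"]["Id"],
        },
    }]},
)

A second record set with a different SetIdentifier and Region would be added for each additional region.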
19. S3 – cross-region replication
Automated, fast, and reliable asynchronous replication of data across AWS regions
• Only replicates new PUTs; once S3 is configured, all new uploads into a source bucket are replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions / storage classes
• Transition S3 ownership from the primary account to a sub-account
Use cases:
• Compliance—store data hundreds of miles apart
• Lower latency—distribute data to regional customers
• Security—create remote replicas managed by separate AWS accounts
(Diagram: replication from a source bucket in Virginia to a destination bucket in Oregon)
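A minimal boto3 sketch of enabling cross-region replication on a source bucket; the bucket names and IAM role ARN are placeholders, and versioning must already be enabled on both buckets.

import boto3

s3 = boto3.client("s3")

# Replicate every new PUT from the source bucket to the destination bucket
s3.put_bucket_replication(
    Bucket="store-assets-virginia",  # placeholder source bucket (us-east-1)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder role
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",          # empty prefix = entire bucket
            "Status": "Enabled",
            "Destination": {
                "Bucket": "arn:aws:s3:::store-assets-oregon",  # placeholder destination (us-west-2)
                "StorageClass": "STANDARD",
            },
        }],
    },
)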
20. RDS cross-region replication
• Move data closer to customers
• Satisfy disaster recovery requirements
• Relieve pressure on database master
• Promote read-replica to master
• AWS managed service
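As a sketch of the managed workflow, creating a cross-region read replica and later promoting it looks roughly like this with boto3; the instance identifiers and source ARN are placeholders.

import boto3

# Run in the destination region; cross-region sources are referenced by ARN
rds = boto3.client("rds", region_name="us-west-2")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="store-db-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:store-db",
    DBInstanceClass="db.r5.large",
)

# Later, during failover or cutover, promote the replica to a standalone master
rds.promote_read_replica(DBInstanceIdentifier="store-db-replica")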
24. What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Enabling AWS services
• Case Study: Sony Multi-Region Active/Active
• Design approach
• Lessons learned
• Migrating without downtime
26. Who is talking?
Alexander Filipchik (PSN: LaserToy), Principal Software Engineer at Sony Interactive Entertainment
Dustin Pham, Principal Software Engineer at Sony Interactive Entertainment
28. Small team, large responsibility
• Service team ran like a startup
• Fewer than 10 core people working on the new PS3 store services
• PSN’s user base was already several hundred million users
• Relied on quick iterations of architecture on AWS
34. Delivered new store
• Great job, now onto the PS4
• PS4 launch – 1 million users at once on Day 1, Hour 1
• Designing for many different use cases at scale
36. Next step: make highly available
• Highly available for us: multiregion active/active
• Raising key questions:
• How does one move a large set of critical apps with hundreds of terabytes of live data?
• How do we architect every aspect to allow for multi-region, active-active?
• How do we turn on active-active without user impact?
• User impact includes hardware (PS3/PS4/etc.) and game partners!
• Where do we even begin?
38. Applications
• First question to answer: What does it mean to be multiregional?
• Different people had different answers:
• Active/stand-by vs. active/active
• Full data replication vs. partial
• Automatic failover vs. manual
• Etc.
40. Agreement
• “You should be able to lose 1 of anything” approach.
• Which means we should be able to survive, without any visible impact, the loss of:
• 1 server
• 1 Availability Zone
• 1 region
41. Starting with uncertainty
• Multiple macro and micro services
• Stateless and stateful services
• They depend on multiple technologies
• Some are multiregional and some are not
• Documentation was, as always, out of date
44. Stages of grief
• Denial – can’t be true, let’s check again
• Anger – we told everyone to be active/active ready!!!
• Bargaining – active/stand-by?
• Depression – we can’t do it
• Acceptance – let’s work to fix it, we have 6 months…
45. What it tells us
• We can’t just put things in two regions and expect them to work
• We will need to do some work to:
• Migrate services to technology that is multiregional by design
• Somehow make the underlying technology multiregional
46. Scheduling/optimization problem
• There is work to be done on both the application and infrastructure sides
• We need to schedule it so we can get results faster and minimize waits
• And we wanted a machine to help us
47. Neo4j: the world’s leading graph database
That can store a graph of 30B nodes
Here to help us deal with our problem
48. Why Neo4J
• Graph engine and we are dealing with a graph
• Query language that is very powerful
• Can be populated programmatically
• Can show us something we didn’t expect
49. How to use it?
• Model
• Identify nodes and relations
• Tracing
• Code analyzer
• Talking to people
• Generate the graph
• Run queries
50. Model example
• Nodes
• Users
• Technology (e.g., Cassandra, Redis)
• multiregional: true/false
• Service (applications)
• stateless: true/false
• Edges
• Usage patterns (read, write)
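A minimal sketch of how such a graph could be populated and queried with the Neo4j Python driver and Cypher. The node properties mirror the model above, but the service and technology names, connection details, and the “ready for multi-region” query are illustrative assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes: a service (stateless or not) and a technology (multiregional or not)
    session.run("MERGE (:Service {name: $n, stateless: $s})", n="store-api", s=True)
    session.run("MERGE (:Technology {name: $n, multiregional: $m})", n="Cassandra", m=True)

    # Edge: the usage pattern between them
    session.run(
        "MATCH (s:Service {name: $svc}), (t:Technology {name: $tech}) "
        "MERGE (s)-[:USES {pattern: $p}]->(t)",
        svc="store-api", tech="Cassandra", p="read-write")

    # Example query: services that depend only on multiregional technologies
    ready = session.run(
        "MATCH (s:Service) "
        "WHERE NOT (s)-[:USES]->(:Technology {multiregional: false}) "
        "RETURN s.name AS name")
    for record in ready:
        print(record["name"])

driver.close()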
56. What to do next
• Validate multiregional technologies do actually work
• Figure out what to do with non-multiregional technologies
• Move services in the order suggested by the dependency graph
57. Validating our main DB (Cassandra)
A lot of unknowns:
• Will it work?
• Will performance degrade?
• How eventual is multiregional eventual consistency?
• Will we hit any roadblocks?
• Well, how many roadblocks will we hit?
58. What did we know?
Netflix is doing it on AWS, and they actually tested it
They wrote 1M records in one region of a multi-region cluster
500 ms later, a read was initiated in the other regions
All records were successfully read
59. Well…
Some questions to answer:
• Should we just trust Netflix’s results, replicate our data, and see what happens?
• Is their experiment applicable to our situation?
• Can we do better?
(Meme: “How to get an engineer’s attention”: break something, free coffee, or say “there’s gotta be a better way to do this”)
60. Cassandra validation strategy
• Use production load/data
• Simulate disruptions
• Track replication latencies
• Track lost mutations
• Cassandra modifications were required
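A rough sketch of the kind of probe that tracks replication latency and lost mutations across regions with the DataStax Python driver; the seed hostnames, keyspace, and table are hypothetical, and the actual validation ran against production load with modified Cassandra.

import time, uuid
from cassandra.cluster import Cluster

# Hypothetical seeds and keyspace; assumes a 'probes(id uuid PRIMARY KEY, written_at double)' table
east = Cluster(["cass-seed.us-east-1.internal"]).connect("canary")
west = Cluster(["cass-seed.us-west-2.internal"]).connect("canary")

def measure(samples=100, timeout=10.0):
    latencies, lost = [], 0
    for _ in range(samples):
        key, t0 = uuid.uuid4(), time.time()
        east.execute("INSERT INTO probes (id, written_at) VALUES (%s, %s)", (key, t0))
        # Poll the other region until the write shows up, or count it as lost
        while time.time() - t0 < timeout:
            if west.execute("SELECT id FROM probes WHERE id = %s", (key,)).one():
                latencies.append(time.time() - t0)
                break
            time.sleep(0.05)
        else:
            lost += 1
    return latencies, lost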
65. Things that are not multiregional by design
We gave teams 2 options:
• Redesign if it is critical to the user experience
• If it is not in the critical path (e.g., batch jobs):
• active/passive
• master/slave
• Use Kafka as a replication backbone (recommended; see the sketch below)
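A minimal sketch of the Kafka-backbone option for a non-multiregional store: region 1 publishes change events and region 2 replays them into its local copy. Topic names, broker addresses, and the apply function are hypothetical, and cross-region topic mirroring (e.g., MirrorMaker) is assumed to be in place.

import json
from kafka import KafkaProducer, KafkaConsumer

# Region 1: publish each local mutation as an event (hypothetical topic/brokers)
producer = KafkaProducer(
    bootstrap_servers=["kafka.region1.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("profile-changes", {"user_id": 42, "field": "avatar", "value": "new.png"})
producer.flush()

def apply_to_local_store(event):
    # Hypothetical stand-in for writing the mutation into region 2's local database
    print("applying", event)

# Region 2: consume the mirrored topic and replay mutations into the local store
consumer = KafkaConsumer(
    "profile-changes",
    bootstrap_servers=["kafka.region2.internal:9092"],
    group_id="region2-applier",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    apply_to_local_store(message.value)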
73. Phase 1: Infrastructure key points
• Building infrastructure in a new region must be fully automated (Infrastructure as Code)
• Regional communication decisions:
• VPNs?
• Over the Internet?
• Do the infrastructures have to match exactly?
• The 1st region evolved organically
• The 2nd region should be the blueprint for all new region DCs
75. Phase 2: Data option 1 replication over VPN
(Diagram: ELBs, inbound tier, outbound tier, and data tier in the public subnet, with the data tier replicating to Region 2 over a VPN)
76. Phase 2: Data option 1 replication over VPN
• Pros
• Setting up a VPN with the current network architecture would be easier on the data tier
• Secure
• Managing data node intercommunication is straightforward and has lower operational overhead
• Cons
• Limit on throughput
• The data set is large and can quickly saturate the VPN
• Scaling more applications in the future will be complicated!
77. Phase 2: Data option 2 replication over ENIs with public IPs
(Diagram: application tiers across private and public subnets, with the data tier in the public subnet replicating to Region 2 over SSL)
78. Phase 2: Data option 2 replication over ENIs with public IPs
• Pros
• Not network constrained
• Able to add more applications and data without needing to build new infrastructure
• Cons
• Operationally, more orchestration (Cassandra, for example, needs to know the other nodes’ Elastic IPs)
• Internode data transfer security is a must (see the cassandra.yaml sketch below)
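For the public-IP option, the orchestration largely boils down to rendering per-node Cassandra settings: listen on the private IP, broadcast the public Elastic IP, and force encrypted internode traffic. A small sketch, with hypothetical addresses and keystore paths:

import yaml

# Hypothetical per-node values rendered into cassandra.yaml overrides
node_settings = {
    "listen_address": "10.0.1.16",         # private IP inside the VPC
    "broadcast_address": "203.0.113.25",   # public Elastic IP seen by the other region
    "server_encryption_options": {
        "internode_encryption": "all",     # encrypt all node-to-node traffic
        "keystore": "/etc/cassandra/conf/keystore.jks",
        "keystore_password": "changeit",
        "truststore": "/etc/cassandra/conf/truststore.jks",
        "truststore_password": "changeit",
    },
}

with open("cassandra-overrides.yaml", "w") as f:
    yaml.safe_dump(node_settings, f, default_flow_style=False)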
80. Phase 3: App tier + cache strategy
• Applications communicate within a region only
• Applications do not call another region’s databases, caches, or applications
• Isolation creates predictable failure cases and clearly defines failure domains
• Monitoring and alerting are greatly simplified in this model
82. Phase 4: Client routing
• Predictable “sticky” routing via geo-routing to avoid user bounce
• Data replication manages cross-region state
• Allows for routing to stateless services
• Ability to do percentage-based routing to manage different failure scenarios (see the weighted-routing sketch below)
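Percentage-based routing can be sketched with Route 53 weighted records; the hosted zone, record name, and endpoint IPs below are placeholders.

import boto3

route53 = boto3.client("route53")

def set_weights(weight_region1, weight_region2):
    # Two weighted records for the same name; Route 53 splits traffic by weight
    changes = []
    for set_id, weight, ip in (("region1", weight_region1, "203.0.113.10"),
                               ("region2", weight_region2, "198.51.100.20")):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "store.example.com",
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z3EXAMPLE", ChangeBatch={"Changes": changes})

set_weights(90, 10)   # e.g., keep 90% of traffic in region 1 while validating region 2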
84. Software design for multiregion deployments
• Typical software architecture
(Diagram: application layers for APIs, Business Logic, and Data Access, with cross-cutting concerns and Config alongside)
85. Software design for multiregion deployments
(Diagram: the same application stack deployed in Region 1 and Region 2)
Remember when we mentioned that application-tier call patterns should be isolated within a region? How do we achieve this simply?
86. Software configuration approaches
• An application config to connect to a database could look like:
cassandra.seeds=10.0.1.16,10.0.1.17
• A naïve approach would be to give the application multiple configs per deployable, depending on its region:
cassandra.seeds.region1=10.0.1.16,10.0.1.17
cassandra.seeds.region2=10.0.2.16,10.0.2.17
• This, of course, results in an app config management nightmare, especially now with 2 regions
87. Software configuration approaches
• What if we implemented a basic “central” way of configuration?
(Diagram: an application in region X asks a region-local store “Where are my C* seeds?”, gets back cassandra.seeds=cass-seed1,cass-seed2, and cass-seed1 resolves to a region-local IP)
88. Simplified software configuration (context)
• Context is made available to the application and contains:
• Data center/region
• Endpoint short-name resolution
• Environment (Dev, QA, Prod, A/B)
• Database connection details
• Context is the responsibility of the infrastructure itself and is provided through build automation, AWS tagging, etc.
• The app is responsible for behaving correctly based on its context (see the sketch below)
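A minimal sketch of a context-driven app: the infrastructure drops a context file on each host and the application resolves short names through it, so the same deployable runs unchanged in every region. The file path and field names are assumptions.

import json

def load_context(path="/etc/app/context.json"):
    # Written by build automation / derived from AWS tags; hypothetical format:
    # {"region": "us-west-2", "environment": "prod",
    #  "endpoints": {"cass-seed1": "10.0.2.16", "cass-seed2": "10.0.2.17"}}
    with open(path) as f:
        return json.load(f)

def cassandra_seeds(context):
    # The app only knows short names; the region-local context resolves them
    return [context["endpoints"][name] for name in ("cass-seed1", "cass-seed2")]

context = load_context()
print("region:", context["region"])
print("cassandra.seeds=" + ",".join(cassandra_seeds(context)))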
89. Infrastructure as code
• New regions must be built through automation
• Services are specified and translated into Terraform
• An internal tool and DSL were built to manage domain-specific needs
• Example (a toy version is sketched below):
• Specify that an app requires Cassandra and SNS
• Generate Terraform to create security groups for ports 9160 and 7199-7999, build the SNS topic, build the ELB for the app, etc.
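A toy sketch of the idea behind such a generator: take a small app spec and emit Terraform for the resources it implies. The spec format, resource names, and CIDR are invented for illustration; the real tool and DSL are internal.

APP_SPEC = {"name": "store-api", "needs": ["cassandra", "sns"]}

CASSANDRA_PORTS = [(9160, 9160), (7199, 7999)]  # port ranges from the slide

def generate_terraform(spec):
    blocks = []
    if "cassandra" in spec["needs"]:
        rule_tmpl = ('  ingress {{\n'
                     '    from_port   = {0}\n'
                     '    to_port     = {1}\n'
                     '    protocol    = "tcp"\n'
                     '    cidr_blocks = ["10.0.0.0/8"]\n'
                     '  }}')
        rules = "\n".join(rule_tmpl.format(lo, hi) for lo, hi in CASSANDRA_PORTS)
        blocks.append(
            'resource "aws_security_group" "{0}_cassandra" {{\n{1}\n}}'.format(spec["name"], rules))
    if "sns" in spec["needs"]:
        blocks.append(
            'resource "aws_sns_topic" "{0}_events" {{\n  name = "{0}-events"\n}}'.format(spec["name"]))
    return "\n\n".join(blocks)

print(generate_terraform(APP_SPEC))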
90. Database automation
• Ansible runs assist in building Cassandra in the public subnet and associating an EIP with every new node
• Manages network rules (whitelisting)
• Manages certificates and SSL
(Diagram: private and public subnets with ELBs and the outbound tier, and SSL-encrypted replication to Region 2)
92. Monitoring through proper tagging
• Part of the “Context” applications are aware of is the region
• The region is added to all app logs
• Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice (see the logging sketch below)
• Cross-region monitoring of key metrics and alerting:
• Data replication (hints in Cassandra, seconds behind master in MySQL, etc.)
• Data in/out
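A small sketch of stamping the region from the context onto every log line, so region-tagged logs (and the metrics derived from them) can be filtered per region; the region value and logger name are placeholders.

import logging

REGION = "us-west-2"   # in practice, read from the context / AWS tags

logging.basicConfig(format="%(asctime)s region=%(region)s %(levelname)s %(message)s")
log = logging.LoggerAdapter(logging.getLogger("store-api"), {"region": REGION})

# Example: alert-worthy replication lag surfaced with the region attached
log.warning("cassandra hints above threshold: %d", 1200)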
93. Putting it all together
(Diagram: Region 1 and Region 2 side by side, with the cutover steps: create infrastructure, replicate data, shift DNS)
95. Lessons learned
• Data synchronization is super critical, so build the dependency map based on the data technologies first.
• Always run your own benchmarking.
• Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new.
• Applications must be context-driven.
• Depending on your data load, cross-region VPNs may not make sense.