Operating a highly available cloud service is not just about technology and architecture; it has a lot to do with people and processes. Everything fails all the time. So how do you ensure you have the right people and the right processes in the right places to run a highly available web service? This talk covers the people, processes, technology and tools required to run a highly available web service.
1. Operating a Highly Available Cloud Service
November 14, 2013
Depankar Neogi
Chief Architect
QuickBase, Intuit Inc.
Presented at Boston Cloud Services Meetup
http://www.meetup.com/Boston-cloud-services/events/141118632/
2. Agenda
• Intuit and QuickBase
• Building and Running Highly Available Cloud Services
–People & Process
–Technology
The single most important thing to keep in mind when
designing for High Availability is to anticipate failure.
4. What is QuickBase?
• An Enterprise platform to empower your team to build applications
• Easily customized to meet unique business needs
• Excel to QuickBase in less than 5 minutes
• Brand NEW modern UI enables Ease of Use
• Requirements, processes and teams evolving constantly
• More than 4,500 companies use QuickBase
• 500,000+ current users
One platform solves jobs across the enterprise.
Project Management, IT helpdesk, CRM, Field service, Human resources, etc.
5. QuickBase – Customized applications matching your unique requirements
• Roles Based UI
• Dashboards & Reports
• Data Storage & Backup
• Secure Access Control
• Relational Data Tables
• Business logic & workflow
• Open extensible APIs
• Common Infrastructure Services
6. Modern, Easy, Productive, Dynamic, Fast
30 million requests per day
80 K unique visitors per day
100,000 active apps at any time
25 milliseconds median processing time
Supports Dynamic DML, DDL, CRUD
Cloud based Database with a beautiful UX
7. New QuickBase DIY Data Access
[Diagram with three numbered parts on top of Intuit-class infrastructure (security, billing, HADR, hosting)]
1. QuickBase UI: extended with new DIY data sharing
2. New Data Sharing Service: Liberators, Data Mapping, WSQL Transforms, Virtual tables, Liberator, Cache, Library, Warehouse, Scheduler, Repository, exposed through "ANY API"
3. Connections to Popular Industry Data
13. People & Process: Monitoring Business Metrics
• It’s critical to detect a problem before your customers have to tell you or you have to ask them.
• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.
• Five evolutionary questions that monitoring should answer:
1. Is there a problem?
2. Where is the problem?
3. What is the problem?
4. Why is there a problem?
5. Will there be a problem?
• External versus Internal Monitoring
http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
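Comparing a real-time business metric with its historical curve could be sketched roughly as follows. This is only an illustration, not the tooling described in the talk: the function name `deviates_from_history`, the sample sign-in counts, and the three-standard-deviation tolerance are all assumptions.

```python
from statistics import mean, stdev

def deviates_from_history(actual, history, tolerance=3.0):
    """Return True when `actual` falls outside the band that past
    observations for the same time slot suggest is normal.

    `history` holds the metric's value for this slot on previous days
    (for example, sign-ins between 10:00 and 10:05 on earlier Mondays).
    `tolerance` is the band width in standard deviations (an assumption).
    """
    if len(history) < 2:
        return False  # not enough data to define a historical curve
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return actual != baseline
    return abs(actual - baseline) > tolerance * spread


# Hypothetical sign-in counts for the same five-minute slot on past days.
past_signins = [1180, 1225, 1210, 1195, 1240]

normal = deviates_from_history(1205, past_signins)  # close to the curve
outage = deviates_from_history(310, past_signins)   # far below the curve
```

A single check like this only answers the first question, "Is there a problem?". The remaining questions still need the more detailed system monitoring.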
14. People & Process: Invest in Good Tools
A good tool will help you find the needle in a haystack - fast
• 95 K requests in a 12-hour window
• Peak request rate: 4.3 req/sec (1,286 requests per 5-minute window)
• Processing time: 61 milliseconds per request
15. People & Process: Incident Management Process
• Incident Management Team (IMT)
• Incident Management Response Plan
• Activating the IMT, notifications
• Having the right break-out rooms
• Classification of the incident
• Communication of the incident
• Time keeper
• Management versus Technical Process
• Tracking:
– SLA
– RPO (recovery point objective)
– RTO (recovery time objective)
• Incident closure, recovery
• Evaluation process
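Tracking RPO and RTO for an incident comes down to two time differences, which the sketch below makes explicit. It is illustrative only: the `Incident` fields, the timestamps, and the objective values are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    last_good_replica: datetime  # newest data safely copied before the failure
    outage_start: datetime       # when service was lost
    service_restored: datetime   # when service was back

    def data_loss_window(self) -> timedelta:
        """Data written in this window was lost (compare with the RPO)."""
        return self.outage_start - self.last_good_replica

    def downtime(self) -> timedelta:
        """Time customers were without service (compare with the RTO)."""
        return self.service_restored - self.outage_start

def met_objectives(incident, rpo, rto):
    """Return (rpo_met, rto_met) for one incident."""
    return (incident.data_loss_window() <= rpo, incident.downtime() <= rto)

# Hypothetical incident and objectives.
incident = Incident(
    last_good_replica=datetime(2013, 11, 14, 9, 58),
    outage_start=datetime(2013, 11, 14, 10, 0),
    service_restored=datetime(2013, 11, 14, 11, 30),
)
rpo_met, rto_met = met_objectives(
    incident, rpo=timedelta(minutes=5), rto=timedelta(hours=1)
)
```

In this made-up case the two-minute data loss is within the five-minute RPO, but the ninety-minute outage misses the one-hour RTO, which is the kind of result the evaluation process would then examine.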
16. People & Process: Runbook and messaging
• Runbook
– Detailed process for managing the incident
– Contact Information
– Managing data center cutover, recovery steps, testing, managing replication
• Messaging book
– Who is responsible for communication
– Who creates and approves the message
– How you communicate
– At what cadence
– What you tell your customers
• Social Media Strategy
– If you are not transparent, your customers will let you know
– Social Media coordinator – own the channels
17. People & Process: Service Page
Give customers the ability to check the health of the system and be notified of any service-related issues
18. People & Process: Service Page
Transparency is Key. If you let the customers know what you know,
they will respect you and may remain loyal to your business.
19. People & Process: Business Fault Isolation
• What if your data center went down
• And the production server is down because the data center is down
• And your email server was in the same data center
• And your marketing server was in the same data center
• And your service page was on a server in the same data center
• How do you communicate with all your customers?
Business Fault Isolation protects your business from a SPOF (single point of failure).
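One way to audit for this kind of shared fate is to list which systems sit at the same site as production. The sketch below is a hypothetical illustration: the function `shared_fate`, the site names, and the placement map are all made up for the example.

```python
def shared_fate(placement, critical="production"):
    """Return the systems hosted at the same site as `critical`.

    `placement` maps a system name to the site that hosts it. Any system
    returned here goes down together with the critical one.
    """
    site = placement[critical]
    return sorted(
        name for name, where in placement.items()
        if where == site and name != critical
    )

# Hypothetical placement that mirrors the scenario on this slide.
placement = {
    "production": "dc-east",
    "email": "dc-east",
    "marketing": "dc-east",
    "service page": "dc-east",
}
at_risk = shared_fate(placement)

# The same systems after moving the communication channels elsewhere.
isolated = {**placement, "email": "hosted-mail", "service page": "dc-west"}
still_at_risk = shared_fate(isolated)
```

With the first placement every communication channel fails together with production. After the move, email and the service page survive a loss of the production data center.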
20. People & Process: Review Process
• SaaS or Operations Review Process should have a fixed
cadence and be led by a company leader
• Review Team should include leaders from:
– Finance
– Compliance & Risk
– CTO
– Operations
– Product
• Dashboard with KPIs
• Review Fire drills
• Change Control Process
– Preferably change one thing at a time
22. The Three Pillars of High Availability
The goal of High Availability and Disaster Recovery (HA/DR) is
to provide Business Continuance through:
Lack of Service Outage = Happy Customers = Greater Business Value
HA/DR directly enhances a customer’s experience through
greater offering availability
23. High Availability Architecture Principles
• Design for Failure
– Avoid Single Points of Failure
– Graceful Degradation and Soft Dependencies
– Asynchronous Design
– Keep State Confined to Where it is Needed
• Design for Operability
– Design to be Monitored
– Design for Hot Deployment and Rollback
– Automate Where Possible
• Keep Everything “In Production”
• Scale Out (Not Up)
• Keep it Fresh…and Mature
24. Architecture Patterns for High Availability
• Swimlanes
• Active/Active
• Single Write Master
• Active/Passive
• Store and Forward
25. Active / Passive
[Diagram: the Primary Data Center (Active) feeds the Secondary Data Center (Passive) through Near Real-time Replication. Other labels: Data, Back Up.]
26. Swimlane Principle
A “Swimlane” is:
A set of predefined systems and software infrastructure tuned
to support a predefined workload
• Only a portion of an offering’s total users are hosted on any
given swimlane
Within a Swimlane:
– Each Swimlane is independent and self-sufficient and
shares no compute/storage resources with other swimlanes
– Offering transactions occur within a Swimlane
– Only access to Shared Services goes outside the Swimlane
– Standard Fault Detection and Fault Recovery methods
are used
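The idea that each user lives on exactly one swimlane can be sketched with a stable hash. This is a minimal illustration under assumptions: the lane names, the customer identifiers, and the use of hashing for placement are all hypothetical, since the talk does not say how users are assigned.

```python
import hashlib

SWIMLANES = ["swimlane-1", "swimlane-2", "swimlane-3", "swimlane-4"]

def swimlane_for(customer_id, lanes=SWIMLANES):
    """Map a customer to exactly one swimlane, the same one every time.

    A stable hash (not Python's per-process randomized hash()) keeps the
    assignment identical across processes and restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return lanes[int.from_bytes(digest[:8], "big") % len(lanes)]

def customers_affected(failed_lane, customer_ids):
    """Only customers hosted on the failed swimlane are affected."""
    return [c for c in customer_ids if swimlane_for(c) == failed_lane]

# Hypothetical customer population.
customers = [f"customer-{n}" for n in range(1000)]
affected = customers_affected("swimlane-2", customers)
```

Modulo hashing reshuffles customers whenever the number of lanes changes, so a production system would more likely keep an explicit customer-to-swimlane table. The point here is only that a lane failure touches a subset of users.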
27. High Availability with Swimlanes
[Diagram: Application Partitioning via Swimlanes. Components shown: Internet, DNS, F5 GTM, F5 LTM, DC 1, DC 2, Fault Domain 1, Fault Domain 2, and Swimlane 1 through Swimlane 4 with counterparts Swimlane 1’ through Swimlane 4’. Each swimlane contains WS, AS and Storage.]
WS: web server; AS: app server
28. Swimlanes Support Application Needs
• Scalability
– Replicated swimlanes add capacity with linear scalability
• Fault Isolation
– Complete failure only impacts a subset of users due to application partitioning and data sharding
• High Availability
– Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart
• Disaster Recovery
– Disaster recovery is achieved through swimlane failover, either in the same or a remote data center
• Automation
– The identical nature of a swimlane allows for a high degree of operational automation
29. Active / Active – Swim Lanes
[Diagram: a Global Load Balancer routes to Data Center 1 and Data Center 2. Four swim lanes each serve 25% of customers. Each database (DB1 through DB4) has an active copy and a passive copy, linked by Replication.]
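The failover behaviour this layout implies can be sketched as below. The placement of active and passive copies across the two data centers is an assumption made for the example, and the names `LAYOUT`, `serving_site` and `load_share` are hypothetical.

```python
# Assumed layout: each database is active in one data center and has a
# passive replica in the other. The real placement is not stated here.
LAYOUT = {
    "DB1": {"active": "DC1", "passive": "DC2"},
    "DB2": {"active": "DC1", "passive": "DC2"},
    "DB3": {"active": "DC2", "passive": "DC1"},
    "DB4": {"active": "DC2", "passive": "DC1"},
}

def serving_site(db, down=frozenset()):
    """Return the data center that should serve `db`, or None.

    The active copy is preferred; the passive replica is promoted only
    when the active copy's data center is down.
    """
    for role in ("active", "passive"):
        site = LAYOUT[db][role]
        if site not in down:
            return site
    return None

def load_share(down=frozenset()):
    """Count how many of the four customer groups each site serves."""
    share = {}
    for db in LAYOUT:
        site = serving_site(db, down)
        if site is not None:
            share[site] = share.get(site, 0) + 1
    return share

normal = load_share()                # both sites carry half the customers
dc1_lost = load_share(down={"DC1"})  # DC2 takes over every group
```

Under this assumed layout, the surviving data center ends up serving all four customer groups, so each site needs enough headroom to carry the full load.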
30. Active / Active – Single Write Master
[Diagram: DC1, DC2, DC3 and DC4 each hold a Read Cache. Other labels: Writes, Updates, Cache Updates.]
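A toy model of the single write master pattern might look like this. It is a sketch under assumptions: the class `SingleWriteMaster` is hypothetical, DC1 is arbitrarily chosen as the master, and cache updates are applied synchronously for simplicity.

```python
class SingleWriteMaster:
    """Toy model: one data center accepts writes, every data center
    serves reads from its own cache, and each write is pushed to all
    caches. Propagation is synchronous here for simplicity; a real
    system would likely update caches asynchronously."""

    def __init__(self, sites, master):
        if master not in sites:
            raise ValueError("master must be one of the sites")
        self.master = master
        self.store = {}                             # authoritative data
        self.caches = {site: {} for site in sites}  # one read cache per site

    def write(self, site, key, value):
        """Apply a write at the master regardless of where it arrived.
        Returns the site that processed the write."""
        self.store[key] = value
        for cache in self.caches.values():  # cache updates fan out
            cache[key] = value
        return self.master

    def read(self, site, key):
        """Serve a read from the local cache of the requesting site."""
        return self.caches[site].get(key)

# Which data center is the master is an assumption for this example.
cluster = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"], master="DC1")
handled_by = cluster.write("DC3", "plan", "enterprise")
seen_in_dc4 = cluster.read("DC4", "plan")
```

Reads stay local and fast in every data center, while having one place for writes avoids conflicting updates. The cost is that the write master itself must be protected against failure.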
31. Design for Failure: Resiliency Patterns
Throttling versus Circuit Breaker
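Throttling limits how many calls are let through, whereas a circuit breaker stops calling a dependency that is already failing. The talk does not name a throttling algorithm; the sketch below uses a token bucket as one common choice. The class name, the rate, and the capacity are assumptions.

```python
import time

class TokenBucket:
    """Throttle: allow at most `rate` calls per second on average,
    with bursts of up to `capacity` calls."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the call may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock makes the behaviour easy to follow.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]       # three pass, the fourth is throttled
fake_now[0] += 1.0                               # one second later: two new tokens
after_wait = [bucket.allow() for _ in range(3)]  # two pass, the third is throttled
```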
32. Circuit Breaker Pattern
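Circuit Breaker State Diagram
[Diagram: the Caller reaches the Dependency through a circuit breaker with three states]
• Closed
– On call / pass through
– Call succeeds / reset count
– Call fails / count failure
– Threshold reached / trip breaker
• Open
– On call / fail
– On timeout / attempt reset
• Half Open
– On call / pass through
– On succeed / reset
– On fail / trip breaker
Transitions: Closed to Open (trip breaker), Open to Half Open (attempt reset), Half Open to Closed (reset), Half Open to Open (trip breaker)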
http://techblog.netflix.com/2012_02_01_archive.html
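The state diagram can be turned into a small class, as sketched here. This is a minimal single-threaded illustration, not the Netflix implementation: the names `CircuitBreaker` and `CircuitOpenError`, the threshold, and the reset timeout are assumptions.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold          # failures before tripping
        self.reset_timeout = reset_timeout  # seconds before attempting reset
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")  # on call / fail
            self.state = "half-open"        # on timeout / attempt reset
        try:
            result = func(*args, **kwargs)  # on call / pass through
        except Exception:
            if self.state == "half-open":
                self._trip()                # on fail / trip breaker
            else:
                self.failures += 1          # call fails / count failure
                if self.failures >= self.threshold:
                    self._trip()            # threshold reached / trip breaker
            raise
        self.state = "closed"               # on succeed / reset
        self.failures = 0                   # call succeeds / reset count
        return result

# Walk through the states with a fake clock and a failing dependency.
fake_now = [0.0]
breaker = CircuitBreaker(threshold=2, reset_timeout=10.0,
                         clock=lambda: fake_now[0])

def flaky():
    raise RuntimeError("dependency down")

states = []
for _ in range(2):
    try:
        breaker.call(flaky)
    except RuntimeError:
        pass
    states.append(breaker.state)

try:
    breaker.call(flaky)
    rejected = False
except CircuitOpenError:
    rejected = True  # open: fails fast without calling the dependency

fake_now[0] += 10.0
recovered = breaker.call(lambda: "ok")  # half-open trial call succeeds
```

While the breaker is open, callers get an immediate error instead of waiting on a dependency that is known to be failing, which leaves room for graceful degradation.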