This document discusses patterns for scaling systems incrementally. It introduces the ACD/C approach: making systems asynchronous, caching results, distributing work, and compromising on consistency as needed. Specific architectures such as MapReduce and distributed queues are presented, along with the challenges of partial failures, upgrades, and changing topologies. Testing is emphasized as critical for managing scaled systems.
2. Who Am I?
• Kendall Miller
• One of the Founders of Gibraltar Software
• Small Independent Software Vendor Founded in 2008
• Developers of VistaDB and Loupe
• Engineers, not Sales People
• Enterprise Systems Architect & Developer since 1995
• BSE in Computer Engineering, University of Illinois Urbana-Champaign (UIUC)
3. What Do We Do?
• Loupe: Advanced logging and analysis of errors, performance, and usage patterns for .NET web apps, desktop apps, and services
• VistaDB: The easy-to-deploy, SQL Server-compatible, pure .NET embedded database
13. ACD/C
• Async – Do the work whenever
• Caching – Don’t do any work you don’t have to
• Distribution – Get as many people to do the work as you can
• Consistency – We all agree on these key things
14. Async
• Decouple operations so you do the minimum amount of work in performance-critical paths
• Queue work that can be completed later to smooth out load
• Speculative Execution
• Scheduled Requests (Nightly processes)
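A minimal sketch of the "queue work that can be completed later" idea, using only the Python standard library. The names `handle_signup`, `send_welcome_email`, and `run_demo` are hypothetical and chosen for illustration; a real system would more likely use a durable, out-of-process queue.

```python
import queue
import threading

# In-process work queue (an assumption for this sketch; real systems
# usually use a durable queue that survives restarts).
work_queue = queue.Queue()
sent = []

def send_welcome_email(user):
    # Hypothetical stand-in for slow work; here it only records the call.
    sent.append(user)

def handle_signup(user):
    # Performance-critical path: do the minimum, queue the rest for later.
    work_queue.put(user)
    return f"welcome {user}"

def worker():
    # Background worker drains the queue at its own pace.
    while True:
        user = work_queue.get()
        if user is None:  # sentinel value: stop the worker
            work_queue.task_done()
            break
        send_welcome_email(user)
        work_queue.task_done()

def run_demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    replies = [handle_signup(u) for u in ("ann", "bob")]
    work_queue.put(None)   # ask the worker to stop once the queue is drained
    work_queue.join()      # wait until every queued item is processed
    thread.join()
    return replies
```

The request handler returns as soon as the item is queued, so a burst of sign-ups lengthens the queue instead of slowing each response.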
15. Caching
• Save results of earlier work nearby where they are handy to use again later
• Apply in front of anything that’s time-consuming
• Easiest to apply from the left to the right
• Simple strategies can be really effective (EF Dump all on update)
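The "dump all on update" strategy can be sketched as a read-through cache that discards every entry whenever anything is written. This is an illustrative sketch, not the API of any particular framework; `DumpAllCache` and its `load`/`store` callbacks are hypothetical names.

```python
from typing import Callable, Dict

class DumpAllCache:
    """Read-through cache that discards every entry on any update."""

    def __init__(self, load: Callable[[str], str],
                 store: Callable[[str, str], None]) -> None:
        self._load = load      # reads from the authoritative source
        self._store = store    # writes to the authoritative source
        self._entries: Dict[str, str] = {}
        self.misses = 0        # counts trips to the authoritative source

    def get(self, key: str) -> str:
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._load(key)
        return self._entries[key]

    def update(self, key: str, value: str) -> None:
        self._store(key, value)
        # Simplest possible invalidation: drop everything, so no stale
        # entry can survive a write.
        self._entries.clear()

# Usage with a plain dict standing in for the slow, authoritative store.
backing = {"a": "1", "b": "2"}
cache = DumpAllCache(backing.__getitem__, backing.__setitem__)
```

Clearing the whole cache wastes some work after each write, but it avoids tracking which cached answers depend on which data, and that trade is often acceptable when reads far outnumber writes.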
16. Why Caching?
• Loading the world is impractical
• Apps ask a lot of repeating questions.
• Stateless applications even more so
• Answers don’t change often
• Authoritative information is expensive
17. Distribution
• Distribute requests across multiple systems
• Classic web “Scale Out” approach
• The less state held, the easier to distribute work.
• Distributed database = hard
• Distributed static content server = easy
• Request routing for distribution can serve other availability purposes
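As a rough illustration of distributing requests across stateless servers, here is a round-robin picker. The class name and server names are hypothetical, and the sketch assumes any server can handle any request.

```python
import itertools
from typing import List

class RoundRobinBalancer:
    """Hands out servers in rotation; assumes servers hold no request state."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("need at least one server")
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        # Each call returns the next server, wrapping around at the end.
        return next(self._cycle)

balancer = RoundRobinBalancer(["web1", "web2", "web3"])
picks = [balancer.pick() for _ in range(4)]
```

Rotation this simple only works when no server holds per-user state, which is why the slide ties ease of distribution to how little state is held.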
18. Consistency
• The degree to which all parties observe the same state of the system at the same time
• Scaling inevitably requires compromise
• Absolute consistency forces a single source of truth and requires extensive locking to ensure all parties agree
• The real world doesn’t require the consistency we tend to demand of our systems
19. Consistency Challenges
• Singleton Data Structures (Order numbers...)
• State held between the endpoints of a process
• Consistent results of queries across partitioned datasets
27. Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
29. Fresh Problems: Partial Failures
• Break system into individual failure zones
• Monitor each instance of each zone for problems
• Route around bad instances
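One way to picture "monitor each instance and route around bad ones" is a router that counts consecutive failures per instance. The `ZoneRouter` class and the failure threshold are assumptions made for this sketch, not part of the original talk.

```python
from typing import Dict, List

class ZoneRouter:
    """Tracks consecutive failures per instance and skips unhealthy ones."""

    def __init__(self, instances: List[str], max_failures: int = 3) -> None:
        self._failures: Dict[str, int] = {name: 0 for name in instances}
        self._max_failures = max_failures  # assumed threshold
        self._next = 0

    def report(self, instance: str, ok: bool) -> None:
        # A success resets the count; a failure increments it.
        self._failures[instance] = 0 if ok else self._failures[instance] + 1

    def healthy(self) -> List[str]:
        return [name for name, count in self._failures.items()
                if count < self._max_failures]

    def route(self) -> str:
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy instances")
        # Rotate across whichever instances are currently healthy.
        choice = candidates[self._next % len(candidates)]
        self._next += 1
        return choice
```

A successful health report brings an instance back into rotation, so recovery needs no manual step in this sketch.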
32. Fresh Problems: Upgrades
• Break system into individual upgrade zones
• Upgrade each zone – Drain & Stop, Upgrade, Verify.
• Cut traffic over to updated zones
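The drain, upgrade, verify, and cut-over sequence could be sketched as a loop over zones. The `Zone` type, the `verify` callback, and the rollback-and-stop behavior on a failed verification are assumptions added for illustration; the slide itself does not say what happens when verification fails.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Zone:
    name: str
    version: str
    serving: bool = True

def rolling_upgrade(zones: List[Zone], new_version: str,
                    verify: Callable[[Zone], bool]) -> List[str]:
    """Upgrade one zone at a time; returns a log of the steps taken."""
    log: List[str] = []
    for zone in zones:
        zone.serving = False              # drain and stop
        log.append(f"drain {zone.name}")
        previous = zone.version
        zone.version = new_version        # upgrade
        if not verify(zone):              # verify
            # Assumed policy: restore the old version, resume serving,
            # and stop before touching any further zones.
            zone.version = previous
            zone.serving = True
            log.append(f"rollback {zone.name}")
            break
        zone.serving = True               # cut traffic over
        log.append(f"cutover {zone.name}")
    return log
```

Because only one zone is out of service at a time, the remaining zones keep handling traffic throughout the upgrade.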
What level of scaling are we talking about?
Scaling is the ability to cope and perform under an increasing workload.
Being available really means completing a request within a given period of time.
So: what is that period of time, and what is your load limit?
This is VISITORS per DAY
Microsoft.com: 60M
Twitter.com: 35M
Amazon.com: 15M
Target.com: 2M
DevExpress.com & Telerik.com: 25K
Hanselman.com: 12K
Gibraltar Software: 1K
ASYNC
CACHING
DISTRIBUTION
CONSISTENCY
THIS IS NOT ABOUT ASYNC FOR FASTER PERCEIVED PERFORMANCE
Improve response under load
Do only the work you have to
Up to 95% of the work on the typical site can be pulled from cache
Add reverse proxy (Load Balancer)
Add additional middle tier servers
Session state and identity need to be factored out
Partition (“Sticky session”) first, then true load balancing with no state in center
Break down traffic by an easy-to-determine characteristic: customer, product category, etc.
Add storage regions that are self-consistent
Can vary exact mix of what data is in each container and how you partition
Typically some parts may be shared like Identity
Cross-zone aggregation is slow
Cross-zone coherency strategy
Middle tier routes storage requests based on an easy-to-determine characteristic
Consistency strategy adds complexity (reports may reflect delayed data; different parties may not see the same view of the world)
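Routing storage requests by an easy-to-determine characteristic might look like the sketch below, which maps a customer ID to a storage region with a stable hash. The function name, region names, and the choice of customer ID as the partition key are hypothetical.

```python
import hashlib
from typing import List

def pick_region(customer_id: str, regions: List[str]) -> str:
    """Map a customer ID to one storage region, the same way every time."""
    if not regions:
        raise ValueError("need at least one region")
    # hashlib gives the same digest in every process, unlike the built-in
    # hash() for strings, which is randomized per process by default.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]

regions = ["storage-0", "storage-1", "storage-2"]
first = pick_region("customer-42", regions)
second = pick_region("customer-42", regions)
```

Every middle-tier server computes the same answer without consulting shared state. One caveat of plain modulo hashing is that changing the number of regions remaps most keys, so resizing needs a migration plan.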
Separate long running, dangerous, or serialized tasks from general work
Workflow consistency strategy required
Complications with deployment and versioning
Deferred failure scenarios.