2. Some definitions
• Scalability is how big you can get.
• Reliability is how consistent you are (in the short term).
• Availability is being reliable and scalable (in the long
run).
• Scalability and reliability are not related (neither causes nor
affects the other).
• You can’t have availability without both scalability and reliability.
3. Without further ado
• The requirement:
– Emergency responder system that needs emergency workers to
notify it of their availability.
• Results in:
– System that is available 24x7 to respond to notifications.
• Constrained by budget.
• Currently at ~1,200 users/sec
4. Without further ado
• The requirement:
– Call routing system that must respond to every single request
as fast as possible.
• Results in:
– System that is available 24x7 to respond to calls (marketing
and others).
• Constrained by time.
6. What we tried first
• What the blogs say
– Be redundant in every part of the system (this gets very
expensive!)
• What the teachers say (by the book)
– Formal engineering (this is very expensive too!)
• What our gut told us
– Test test test!
7. What we learned
• Scalability is a process, not a destination.
• Reliability is not a matter of QA.
• The tools matter – but not in the traditional sense.
– SQL Server (from 2005 to 2008 R2)
– Windows (from 2003 to 2008 R2)
8. Some statistics
System availability 2010
• 14 minutes of failures, as seen from at least 2 of 3 monitoring locations.
[Chart: breakdown by cause. Categories: IIS 7, Network*, Framework Bugs, App Bugs, SQL Server - 100% CPU, SQL Server - Mirroring, SQL Server - No reason. Percentage labels as extracted, not matched to categories: 1%, 8%, 32%, 52%, 37%, 16%, 4%, 2%.]
* Only outages at the core router are displayed here as network problems.
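To put the 14 minutes in context, the figure can be converted to an availability percentage. This is a minimal sketch that assumes the 14 minutes is the total for the whole of calendar year 2010 (365 days); the function name is illustrative.

```python
# Assumption: the 14 minutes of failures cover the full year 2010 (not a leap year).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_percent(downtime_minutes: float,
                         period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return availability over the period as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # 14 minutes of downtime out of 525,600 minutes in the year.
    print(f"{availability_percent(14):.4f}%")  # prints 99.9973%
```

Under that assumption the system was a little better than "four nines" (99.99%) for the year.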
9. Specific Lessons
• Design code for failure (not for 100% reliability or 0
bugs).
– Redundancy in code is critical.
• Fail fast and fail often.
– Don’t wait until the system fails completely.
• Monitor and validate.
– Monitor as frequently as you can afford.
10. Design for failure
• Why do it once if you can do it twice?
• Use all available tools
– Bidirectional replication for regional duplication of traffic.
– Cheap load balancers.
– Cheap RAID 10 SATA.
– Don’t trust your database.
11. Fail fast and fail often
• Specific configuration settings for IIS.
– Yes! App pool recycling increases availability.
• Specific configuration for queuing.
– Use a messaging system that always responds and stores
safely in case the database is not available.
• Make a lot of noise.
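The queuing idea above (always respond, store safely while the database is down) can be sketched as a local store-and-forward spool. This is an illustration only, not the system described in the talk: the class name, the JSON-lines spool file, and the `write_to_db` callback are all assumptions.

```python
import json
import os
from typing import Callable, Dict


class SpoolingQueue:
    """Accept every message immediately; deliver to the database later."""

    def __init__(self, spool_path: str,
                 write_to_db: Callable[[Dict], None]) -> None:
        self.spool_path = spool_path
        self.write_to_db = write_to_db  # hypothetical DB writer; may raise

    def accept(self, message: Dict) -> bool:
        # Append to local disk and force it out before acknowledging the caller.
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            spool.write(json.dumps(message) + "\n")
            spool.flush()
            os.fsync(spool.fileno())
        return True

    def drain(self) -> int:
        """Try to deliver spooled messages; keep whatever could not be written."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, "r", encoding="utf-8") as spool:
            pending = [json.loads(line) for line in spool if line.strip()]
        delivered = 0
        for message in pending:
            try:
                self.write_to_db(message)
            except Exception:
                break  # database unavailable: stop here and retry later
            delivered += 1
        # Simplification: rewrite the spool with the undelivered remainder.
        with open(self.spool_path, "w", encoding="utf-8") as spool:
            for message in pending[delivered:]:
                spool.write(json.dumps(message) + "\n")
        return delivered
```

A caller would invoke `accept` on every request and run `drain` on a timer. Rewriting the spool file is not atomic in this sketch, so a production version would need a safer hand-off.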
12. Monitor and validate
• Monitis
– Cheap, but support is not there.
• Other tools
– Gomez.com – expensive, but if you can afford it, great.
• Inside tools
– Open source and MS tools.
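The statistics slide counts a failure only when at least 2 of 3 monitoring locations see it. A minimal sketch of that validation rule follows; the location names and the function name are hypothetical, and this does not represent any particular monitoring product's API.

```python
from typing import Mapping


def is_outage(results: Mapping[str, bool], quorum: int = 2) -> bool:
    """Return True when at least `quorum` locations report a failed check.

    `results` maps a monitoring location name to True if its check succeeded.
    """
    failures = sum(1 for ok in results.values() if not ok)
    return failures >= quorum


if __name__ == "__main__":
    # Hypothetical locations; one failing probe alone is treated as noise.
    print(is_outage({"us-east": True, "us-west": False, "europe": True}))
    print(is_outage({"us-east": False, "us-west": False, "europe": True}))
```

Requiring agreement between locations filters out problems local to a single probe, at the cost of missing outages that affect only one region.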
13. The bottom line
Database approach
• Design for failure
– Expect the system to operate without a DB for brief periods of time.
– Do mirroring locally.
– Do replication remotely.
• Traditional route
– Database clustering.
Hardware approach
• Design for failure
– Configure for redundancy at the telecom level.
– Configure regional redundancy (invest in another server with another host, make sure the network is different enough).
• Traditional route
– Redundancy everywhere.
Code approach
• Design for failure
– Design multiple systems that do the same thing in simple ways.
– There is nothing wrong with multiple code paths, even processes.
– Reliability is not having the same bug in the same place.
• Traditional route
– Design for 0 bugs (the formal method). Increase QA.
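The code approach (several simple systems that do the same thing) can be sketched as a helper that tries independent code paths in order and logs each failure loudly. This is an illustrative sketch, not the authors' implementation; `first_success` and the logger name are assumptions.

```python
import logging
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("redundancy")  # hypothetical logger name


def first_success(paths: Sequence[Callable[[], T]]) -> T:
    """Run independent implementations in order; return the first result.

    Each path should be a separate, simple implementation so that the same
    bug is unlikely to sit in the same place in all of them.
    """
    last_error: Optional[BaseException] = None
    for path in paths:
        try:
            return path()
        except Exception as exc:
            # Make noise: a failed path is worth knowing about even if
            # a later path recovers.
            log.warning("code path %s failed: %r",
                        getattr(path, "__name__", path), exc)
            last_error = exc
    raise RuntimeError("all redundant code paths failed") from last_error
```

For example, a call-routing lookup might pass a database query as the first path and a lookup against a locally cached table as the second.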
Editor's Notes
Mention how the most scalable systems are not always the most available (banks). Mention how the most reliable systems are not the most scalable (phones, specific purpose stuff). Analogy with orange chicken.
Explain the process a bit more. Tell the story from the perspective of those who use it and those who receive a benefit. Also discuss the AR example and Call routing and call tracking.
For reliability, store your session data in the database. For reliability, everything must be redundant. For availability, the system must be designed from the ground up.
Explain how scalability is a moving target, maybe add the graphics from IAR. Explain how a system that works now becomes obsolete faster than you think. Mention the debate that is going on with “MySpace failed because of Microsoft”. Lessons from the MS bugs in the ASP.NET session handling code, enhancements in Windows performance over the different versions (from IIS 6 to IIS 7.5).
Explain how availability is defined in terms of users and not in terms of the system.