7. Background
▪ First started using a single CDN in 2008
▪ Exponential Growth
▪ Began investigating running multiple CDNs at the start of 2012
@lozzd • @ickymettle
8. Why use a CDN?
▪ Goal: Consistently fast user experience globally
▪ Improve last mile performance by caching content
close to the user
▪ Offload content delivery from origin infrastructure
to the CDN provider
12. Why use more than one CDN?
▪ Resilience
  - Eliminate single point of failure
▪ Flexibility
  - Balance traffic based on business requirements
▪ Cost
  - Manage provider costs
14. The Plan
1. Establish evaluation criteria
2. Initial configuration and testing
3. Test with production traffic
4. Operationalising
19. Performance
▪ Baseline Response Times
  - Should be within ±5% of our existing CDN provider’s response times
▪ Hit Ratios and Origin Offload
  - Provider should achieve equivalent or better origin offload performance and hit ratios
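These criteria can be written down as a small check. This is a minimal sketch, assuming hit ratio is hits over total requests and origin offload is the share of bytes served without going back to origin. The function names, dictionary keys, and any figures passed in are hypothetical and not from the talk.

```python
def within_tolerance(candidate_ms, baseline_ms, tolerance=0.05):
    """True if the candidate response time is within +/- tolerance of the baseline."""
    return abs(candidate_ms - baseline_ms) <= tolerance * baseline_ms


def hit_ratio(hits, misses):
    """Fraction of requests answered from cache (assumed definition)."""
    total = hits + misses
    return hits / total if total else 0.0


def origin_offload(edge_bytes, origin_bytes):
    """Fraction of delivered bytes that did not come from origin (assumed definition)."""
    return 1.0 - origin_bytes / edge_bytes if edge_bytes else 0.0


def meets_criteria(candidate, baseline):
    """Apply the criteria literally: response time within 5%, and
    hit ratio and origin offload equivalent or better than the baseline."""
    return (
        within_tolerance(candidate["response_ms"], baseline["response_ms"])
        and hit_ratio(candidate["hits"], candidate["misses"])
        >= hit_ratio(baseline["hits"], baseline["misses"])
        and origin_offload(candidate["edge_bytes"], candidate["origin_bytes"])
        >= origin_offload(baseline["edge_bytes"], baseline["origin_bytes"])
    )
```

Comparing each new provider against the existing one, rather than against absolute targets, keeps the evaluation tied to what users already experience.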
23. Configuration
▪ Complexity
  - How complex is the provider’s configuration system?
▪ Self service
  - Can you make changes directly, or do they require professional services or other intervention?
▪ Latency for changes
  - How long do changes take to propagate?
38. curl -i -H 'Host: img0.etsystatic.com'
global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
39. Mean Time To Curl = Done
https://www.etsy.com/listing/99871278
40. Mean Time To Curl
▪ No need to touch existing infrastructure
▪ Smoke test of functionality
▪ 10 minutes from configuration to curl
▪ New providers should be plug and play
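The smoke test can also be scripted. A minimal sketch, assuming the response head from `curl -i` is available as text; the function names are hypothetical, and the sample simply repeats the headers from the curl slide.

```python
SAMPLE_RESPONSE = """\
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
"""


def parse_response_head(raw):
    """Split a raw HTTP response head into (status_code, headers)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    status = int(lines[0].split()[1])  # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        if ":" not in line:
            continue  # skip anything that is not a "Name: value" header line
        name, value = line.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return status, headers


def smoke_test(raw):
    """Report whether the request succeeded and was served from cache."""
    status, headers = parse_response_head(raw)
    return {
        "ok": status == 200,
        "cache_hit": "HIT" in headers.get("x-cache", "").upper(),
        "served_by": headers.get("x-served-by"),
    }
```

Running `smoke_test(SAMPLE_RESPONSE)` reports a successful request served from cache by `cache-lo82-LHR`.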
42. Testing with Production Traffic
▪ Images only at first
▪ Good test of caching performance
▪ Easy to test by swapping hostnames
▪ Made even easier with our A/B testing framework
43. A/B Test Framework
▪ Fine grained control
▪ Enable test for specific users or groups
▪ Percentage of users
▪ All controlled via configuration in code
▪ Rapid and complete rollback
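A percentage-based test controlled from configuration might look like the following. This is a minimal sketch and not the framework from the talk: the config keys, hostnames, and function names are all hypothetical. Hashing the user ID keeps each user in the same bucket between requests.

```python
import hashlib

# Hypothetical configuration, kept in code so a rollback is a one-line change.
CDN_TEST_CONFIG = {
    "enabled": True,
    "percentage": 10,                      # share of users sent to the test host
    "allow_users": {"staff_user_1"},       # users always included in the test
    "test_host": "img-test.example.com",   # hypothetical CNAME for the new CDN
    "default_host": "img.example.com",     # hypothetical existing hostname
}


def bucket(user_id, test_name):
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def image_host(user_id, config=CDN_TEST_CONFIG, test_name="cdn_images"):
    """Choose which hostname to use for image URLs for this user."""
    if not config["enabled"]:
        return config["default_host"]  # rollback: everyone on the default host
    if user_id in config["allow_users"]:
        return config["test_host"]
    if bucket(user_id, test_name) < config["percentage"]:
        return config["test_host"]
    return config["default_host"]
```

Setting `enabled` to `False` returns every user to the default host, which is what makes rollback rapid and complete in this sketch.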
51. Metrics and Monitoring
▪ Get more detail by pulling metrics in house
▪ Write script to pull data from API
▪ Create dashboards with data
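Once a script has fetched stats from a provider API, the values need consistent metric names before they can be graphed. A minimal sketch, assuming the API returns nested JSON already decoded into a dict; the payload shape and the metric prefix are hypothetical, and the fetch itself is left out because it is vendor specific.

```python
def flatten_stats(stats, prefix):
    """Flatten nested provider stats into dotted metric names.

    Only numeric values are kept; strings and other metadata are dropped.
    """
    metrics = {}
    for key, value in stats.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            metrics.update(flatten_stats(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[name] = value
    return metrics
```

Using the same prefix scheme for every provider is one way to get comparable dashboards across CDNs.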
55. Testing Plan
1. for c in $cdns; do rampup $c; done;
2. Deliberately slow and steady
3. Watch traffic increase
4. Watch origin offload increase
5. Watch performance
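The `rampup` step in the loop above is not shown in the slides. A minimal sketch of what a slow and steady ramp could look like, assuming two hypothetical callbacks: `apply_weight` to change the traffic share, and `is_healthy` to stand in for watching traffic, origin offload, and performance between steps.

```python
def ramp_schedule(target_percent, step_percent=5):
    """Yield increasing traffic percentages until the target is reached."""
    current = 0
    while current < target_percent:
        current = min(current + step_percent, target_percent)
        yield current


def ramp_up(cdn, target_percent, apply_weight, is_healthy, step_percent=5):
    """Increase traffic to a CDN step by step, backing out on the first bad check.

    Returns the percentage left in place (0 if the ramp was rolled back).
    """
    applied = 0
    for percent in ramp_schedule(target_percent, step_percent):
        apply_weight(cdn, percent)
        applied = percent
        # A real check would wait and inspect graphs before answering.
        if not is_healthy(cdn):
            apply_weight(cdn, 0)  # roll back to no traffic
            return 0
    return applied
```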
56. Downsides of this approach
▪ A/B testing can’t be used for the main site
▪ Exposing your test CNAMEs
▪ Especially if hotlinking is a concern
58. How do you know it’s broke?
▪ Check the graphs!
▪ Check with your community
▪ Keep support in the loop
66. Balancing Traffic Using DNS
▪ Traffic Manager
▪ Extends DNS to dynamically return records based
on rules
▪ Weighted round robin
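The effect of weighted records can be illustrated with a weighted random choice per query. This is a sketch of the idea only, under the assumption that each DNS answer is picked in proportion to its weight; it is not the traffic manager's actual algorithm, and the hostnames used with it would be hypothetical.

```python
import random


def pick_cdn(weights, rng=random):
    """Pick one CNAME target, with probability proportional to its weight.

    `weights` maps a target hostname to a non-negative weight.
    A weight of 0 takes that provider out of rotation.
    """
    names = [name for name, weight in weights.items() if weight > 0]
    if not names:
        raise ValueError("no provider has a positive weight")
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With weights such as `{"cdn-a.example.net": 3, "cdn-b.example.net": 1}`, roughly three in four answers would point at the first provider over many queries.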
69. Balancing Traffic Using DNS
▪ Rule updates typically made via web UI
▪ Can be slow and error prone
▪ Changes need to be applied to all three domains
▪ API available to make changes programmatically
88. DNS balancing downsides
▪ Low TTLs for fast convergence
▪ Mo QPS == Mo Money
▪ More DNS lookups for users
▪ Not 100% instant or deterministic
96. Failure Beacons
1. 1x1 tracking pixel embedded in page
2. Request creates an access log line
3. Scrape them out minutely using logster
self.reg = re.compile(
    r'^\S+(s:)? (?P<remote_addr>[0-9.]+),? [0-9.,\- ]+ \[[^\]]+\] '
    r'"GET /status/images/beacon.gif\?(beacon_)?source=(?P<source>\S+) '
    r'HTTP/1.\d" \d+ [\d-]+ "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" .*$')
101. Failure Beacons
1. 1x1 tracking pixel embedded in page
2. Request creates an access log line
3. Scrape them out minutely using logster
4. Logster posts event counts to Graphite
5. Alert on Graphite graph in Nagios
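Steps 2 to 4 can be sketched end to end. This is a minimal illustration rather than the logster parser from the talk: it uses a simplified pattern that only extracts the beacon source, and the metric prefix and any log lines fed to it are hypothetical. The output lines follow Graphite's plaintext format of metric path, value, and timestamp.

```python
import re
import time
from collections import Counter

# Simplified pattern (an assumption): matches the beacon request and its source only.
BEACON_RE = re.compile(
    r'"GET /status/images/beacon\.gif\?(?:beacon_)?source=(?P<source>[^\s&]+) HTTP/1\.\d"'
)


def count_beacons(log_lines):
    """Count beacon requests per source in a batch of access log lines."""
    counts = Counter()
    for line in log_lines:
        match = BEACON_RE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts


def to_graphite(counts, prefix="beacons", timestamp=None):
    """Format counts as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{source} {count} {ts}" for source, count in sorted(counts.items())]
```

Run once a minute over the new log lines, this yields one data point per source per minute, which is the series an alert threshold would then be set against.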
111. Backend Monitoring
▪ Vendor APIs to bring data in house
▪ Data in-house benefits include
  - Integration with our anomaly detection systems
  - Consistent and unified view of all CDN metrics
  - We control data retention period
112. Awareness
▪ Over 100 engineers
▪ Deploying 60 times a day
▪ Correlating external and internal services
120. Frontend Monitoring
▪ Performance is important to us
▪ Monitoring overall site performance
▪ Monitoring performance by CDN provider
▪ SOASTA mPulse on key pages to track real user
page performance
124. Debugging: What broke?
▪ MTTD/MTTR can be extremely low with this
system
▪ But not always
127. Debugging: What broke?
▪ Non technical member base
▪ Confusing and time consuming
▪ Amazing support team
▪ Log as much information as possible
129. Great success
▪ 12 months in, the benefits have far outweighed the few downsides
▪ We’re continuing to evolve the system
▪ We’ll be sure to share our experience with the
community along the way