Pragmatic Evolution of Super 6
and Sky Bet for Resiliency
Michael Maibaum
Sky Betting & Gaming
@mmaibaum
Watch the video with slide synchronization on InfoQ.com:
https://www.infoq.com/presentations/sky-bet-resiliency
Presented at QCon London
www.qconlondon.com
Pragmatic and Achievable
• Focus is on pragmatic, achievable improvements in availability
• Using common patterns
• Concrete examples that apply to many people in many companies
A Brief History of Sky Betting & Gaming
• Acquired by Sky in 2000 as part of the Sports Internet Group & rebranded in 2002
• Soccer Saturday Super 6 launched in 2008 - first in-house development
• SB&G sold to CVC in 2015
Starting out - Infrastructure & Ops Only
• Tech Team
• No in-house development
• Hosting and operating third-party vendors’ applications
• Waterfall project management and delivery
[Chart: Monthly Transactions (millions), 2010 to 2016; axis labels 50M and 350M]
One challenge to cope with is traffic that looks like this
What happens when Jeff says Super 6 on Sky Sports…
What happens when Jeff says Super 6 and something
interesting happens in the football…
A Diverse Technology Stack
Overall Service Performance
• Lots of definitions of ‘reliability’ or ‘availability’ or ‘resilience’
• Today’s focus is on reducing the impact of failure and maintaining overall
service availability
• Measured as the percentage of time when no major product is unavailable
• 14/15 - 97.9%
• 15/16 - 99.79%
• 16/17 - 99.9%
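To put the percentages above in terms of time, the following sketch converts an availability figure into the hours of unavailability it implies. It assumes the figures are annual and uses a 365-day year; the function name is illustrative and not from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of unavailability implied by an availability percentage."""
    return (1 - availability_pct / 100) * period_hours


# Figures taken from the slide; the yearly interpretation is an assumption.
for year, pct in [("14/15", 97.9), ("15/16", 99.79), ("16/17", 99.9)]:
    print(f"{year}: {pct}% -> about {downtime_hours(pct):.1f} hours per year")
```

On these assumptions, moving from 97.9% to 99.9% takes the implied unavailability from roughly 184 hours a year to under 9.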
Super 6 League Calculations
League Calculations
• Customers predict scores
• They get points for correctly predicting the result and more points for the
correct score
• Round, Month and Season leagues
League Calculations
• Update scores and leagues as goals go in
• Hit a scaling wall as product grew
• Standard LAMP application architecture; league calculations were falling
behind
• One MySQL table, with lots of updates to each entry and lots of sorting/reads
during the updates
• Need to redesign and improve reliability
• Lots of solutions proposed
• Many complex, some fit to scale to 100s of millions
• Quite a few included ‘next gen’ distributed computational or database
services
League Calculations
Original Approach: Web Tier, API Service, MySQL, Score Service
DB updates, sorts for every change
New Approach: Web Tier, League Service, MySQL, Score Service
Score updates and leagues held in memory
League Calculations
New Approach: Web Tier, League Service, MySQL, Score Service
But what happens when the league service crashes?
Run more instances, do the work multiple times
Web Tier, three League Service instances, MySQL, Score Service
Take Aways
• Isolate the problem, you probably don’t need to rewrite everything
• Don’t overcomplicate things
• Take advantage of any reduction in accuracy requirements
• Only the final result is truly crucial so any rare edge cases in
synchronisation can be tolerated
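One way to picture the new approach is a small service that keeps predictions and live scores in memory and recomputes the league on demand, with several identical instances each doing the same work. This is only a sketch: the class and method names, the 5 and 2 point values, and the tie-break by customer id are assumptions for illustration, not details from the talk.

```python
def outcome(score: tuple[int, int]) -> int:
    """1 for a home win, 0 for a draw, -1 for an away win."""
    home, away = score
    return (home > away) - (home < away)


def points(prediction: tuple[int, int], actual: tuple[int, int]) -> int:
    """More points for the correct score than for the correct result (values assumed)."""
    if prediction == actual:
        return 5
    return 2 if outcome(prediction) == outcome(actual) else 0


class LeagueService:
    """Holds predictions and current scores in memory; no DB write per goal."""

    def __init__(self, predictions: dict[str, dict[str, tuple[int, int]]]):
        self.predictions = predictions  # {customer_id: {match_id: (home, away)}}
        self.scores: dict[str, tuple[int, int]] = {}

    def on_score_update(self, match_id: str, score: tuple[int, int]) -> None:
        self.scores[match_id] = score

    def league(self) -> list[tuple[str, int]]:
        totals = {
            customer: sum(
                points(pred, self.scores[match])
                for match, pred in preds.items()
                if match in self.scores
            )
            for customer, preds in self.predictions.items()
        }
        # Highest points first; customer id breaks ties so every instance agrees.
        return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


preds = {"alice": {"m1": (2, 1)}, "bob": {"m1": (1, 0)}, "carol": {"m1": (0, 3)}}
instances = [LeagueService(preds) for _ in range(3)]  # do the work multiple times
for service in instances:
    service.on_score_update("m1", (2, 1))
print(instances[0].league())
```

Because each instance derives the league from the same predictions and score feed, any surviving instance can serve the web tier if another crashes, and brief disagreement between instances mid-match is the kind of edge case the take-aways say can be tolerated.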
Decoupling Resources
Core Account Overview
Components: Login UI, Sidebar UI, Payment Router, OpenBet Payments App, SSO & Identity API, Account API, OXI XML, OpenBet Stored Procedures & DB
Connected to: SSO Consumers, Other Products, Payment Services
Scale: >4.5 THz CPU, >3 TB RAM, >300 VMs
One type of slow request can consume all the resources in a
critical tier of the application
Reducing the Blast Radius
Can one kind of slow request consume all the resources in a critical tier of the application?
• We experienced problems with the PSP (payment service provider) integration,
causing OXi processes to stack up waiting for responses
• Eventually not enough OXi processes were available to service the
non-payments workload
Separating Services and Limiting Resources
• We kept experiencing problems with the PSP
• By separating the OXi endpoints we could limit the
impact on other services
• Limited number of payment ‘procs’; if they
saturate, further payment requests fail quickly
• Generally better visibility of behaviour of the
different requests once separated out. Easier to
manage and scale
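The idea of a capped pool that fails fast when saturated can be sketched with a semaphore. This is an illustration of the general pattern (often called a bulkhead), not the OXi configuration described in the talk; the class and exception names are hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when every slot for this kind of work is already in use."""


class Bulkhead:
    """Caps concurrent calls for one kind of work and rejects the excess immediately."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: a saturated pool fails fast instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slots, failing fast")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Separate limits per workload, so slow payment calls cannot starve logins.
payments = Bulkhead(max_concurrent=10)
logins = Bulkhead(max_concurrent=50)
print(payments.run(lambda: "payment processed"))
```

With one limit per workload, a slow third party can only exhaust its own slots; the sizes 10 and 50 here are arbitrary.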
Reduce dependency on the third party
• Upper limit to reliability when you
have one third party you rely on…
• So get more!
Take Aways
• You might have a fairly monolithic service, or a single big DB but you can
often still implement resource limits at some level of the application
• In many closely coupled systems limiting resources to a particular
use case/journey is a key step in limiting the blast radius of a failure
• Your service can’t be more reliable than an external third party service it
relies on, consider using multiple suppliers - often commercially
advantageous
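The reasoning behind the third-party point can be checked with simple probability. The sketch below assumes supplier failures are independent and that failover between suppliers is instantaneous, which real systems only approximate; the availability values are made up for illustration.

```python
def serial_availability(*components: float) -> float:
    """Every component must be up for the service to be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(*suppliers: float) -> float:
    """At least one supplier must be up (assumes independent failures)."""
    all_down = 1.0
    for availability in suppliers:
        all_down *= 1 - availability
    return 1 - all_down


own_service = 0.9995  # illustrative values, not from the talk
supplier = 0.995

one = serial_availability(own_service, supplier)
two = serial_availability(own_service, redundant_availability(supplier, supplier))
print(f"one supplier:  {one:.5f}")
print(f"two suppliers: {two:.5f}")
```

With a single supplier the combined figure can never exceed the supplier's own availability; with two, the supplier term stops being the limiting factor.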
Proactive Bannering
After the Grand National
• Grand National is a very busy day…
17,000 bets / minute
93 payments / second
• But taking the bets is the easier bit
After the Grand National
• Everyone comes back after the race
to find out if they’ve won anything
25,000 logins/minute
• Querying account history is
relatively slow
• We probably haven’t actually
settled bets yet anyway…
Example from a big race
(Cheltenham Gold Cup)
After the Grand National
• We’ve crashed and burned under the
load before
• DB maxes out, load balancers burning,
web servers and redis session stores
all under massive pressure
Banners
• Banner deliberately, rather than crashing under the load
• Various banner types
• Full banner for a minute or two for those not
already on site
• Account history banner until at least the most
popular selections settled
• Gradually re-enable access
• Ramp percentage of users
Simple Smart Banners
• Implemented in Layer 7 Load Balancer
rules
• Allow a configurable percentage of
users in
• Once allowed in, allow users to
continue using service until access
code changed
Pseudocode:
threshold = 25
access_code = 'a3fd3d2df4'
banner_cookie = get_cookie('smart_banners')
if ( banner_cookie IS NULL ) {
    // first visit: assign the customer a random bucket from 1 to 100
    banner_cookie = set_cookie( 'smart_banners', bucket = random_number(1,100) )
}
customer_bucket = banner_cookie.get_value('bucket')
customer_access_code = banner_cookie.get_value('access_code')
if ( access_code == customer_access_code ) {
    // already admitted under the current access code
    route_request('service')
}
else if ( customer_bucket <= threshold ) {
    // bucket is inside the allowed percentage: admit and remember it
    set_cookie( 'smart_banners', access_code = access_code )
    route_request('service')
}
else {
    route_request('banner')
}
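The pseudocode above can be expressed as a small runnable function for experimenting with thresholds. This is a sketch of the same decision logic in Python, not the load balancer rules used in production; the cookie is modelled as a plain dict and the function name is hypothetical.

```python
import random


def route(cookie: dict, threshold: int, access_code: str, rng=random) -> tuple[str, dict]:
    """Return (destination, updated_cookie) for one request."""
    cookie = dict(cookie)  # work on a copy; the caller decides what to send back
    if "bucket" not in cookie:
        # First visit: assign a stable random bucket from 1 to 100.
        cookie["bucket"] = rng.randint(1, 100)
    if cookie.get("access_code") == access_code:
        # Already admitted under the current access code.
        return "service", cookie
    if cookie["bucket"] <= threshold:
        # Bucket falls inside the allowed percentage: admit and remember it.
        cookie["access_code"] = access_code
        return "service", cookie
    return "banner", cookie


destination, cookie = route({}, threshold=25, access_code="a3fd3d2df4")
print(destination, cookie)
```

Raising the threshold admits more buckets over time, while changing the access code revokes everyone previously admitted, which is how access can be ramped up or reset.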
Take Aways
• Graceful degradation is less impactful than a major failure, and ‘recovery’ is
quicker
• You can choose to invoke a degraded, less demanding operational mode
• We could make Account History work for post-Grand National load
• It’s just not been important enough to invest in (yet)
Circuit Breakers
My Bets
Components: Web Pages, Bet API, Couchbase, Core API
~60,000 req/min
Circuit Breaker with global state
Circuit Breaker with local state
Circuit breakers are used to protect higher-level services from underlying failures
Take Aways
• Circuit breakers are a powerful pattern, crucial for maintaining customer
experience
• Tuning sensitivity is important; an over-sensitive breaker can amplify small
failures into big ones
• Unless you need that twitchy, coordinated response, consider local circuit
breaker state over global
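A circuit breaker with local state can be sketched in a few lines: each process counts its own failures and trips independently of its peers. This is a generic illustration of the pattern with hypothetical names and default values, not the implementation used for My Bets.

```python
import time


class CircuitOpen(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    """Per-process breaker: state lives in this object, not in a shared store."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # sensitivity: lower trips sooner
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast while the circuit is open")
            # Reset period elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_after=10.0)
print(breaker.call(lambda: "bets loaded"))
```

With global state, one instance's failures would open the circuit for every instance at once; keeping the counters local means a fault seen by one process does not immediately cut off the others.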
What about People?
Systems & Software Architecture isn’t enough
• Organisational Architecture & Culture are crucial
Does the whole business care about failure?
Do teams own and support their services?
How do we identify most pressing problems?
Reactive vs proactive?
How do you persuade people to care?
Is technology seen as a ‘contractor’?
Reactive or Proactive?
• Is the business:
– Reactive - you’ve had one big failure and then they care (briefly?)
– or proactive - they set targets and provide time and budget to achieve them?
• Perception of impact vs actual impact
• We’ve been reactive at times
– big failures leading to a massive focus on reliability
– generally good performance leading to a lack of maintenance
• Too much of either isn’t particularly healthy
If you’ve got 100 things you could make better…
How do you prioritise?
• Error budgets
• Classes of service
Error Budgets
• A way of setting a risk appetite
• Reduce pressure to react when you don’t need to
• Help identify the components and problems that are causing the biggest impact
Per-product table (product names not captured):
Total Revenue Loss    Error Budget Used    Monthly Budget
£40k                  75%                  £50k
£500                  5%                   £10k
£35k                  87.5%                £40k
£1.5k                 5%                   £30k
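The table's middle column appears to be revenue loss divided by the monthly budget. The sketch below assumes that definition, which the slide does not state; it reproduces three of the four rows exactly, while the first row computes to 80% rather than the 75% shown, so the slide figures may be rounded or illustrative.

```python
def budget_used_pct(revenue_loss: float, monthly_budget: float) -> float:
    """Share of the monthly error budget consumed, as a percentage (assumed definition)."""
    return 100 * revenue_loss / monthly_budget


# (revenue loss, monthly budget) in pounds, taken from the table above.
rows = [(40_000, 50_000), (500, 10_000), (35_000, 40_000), (1_500, 30_000)]
for loss, budget in rows:
    print(f"£{loss:,} lost of £{budget:,} budget -> {budget_used_pct(loss, budget):.1f}% used")
```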
Error Budgets
• Trends
• Should you ‘spend’ your budget?
Classes of Service
• Ensure ongoing capacity for different types of improvement
– 50% strategic product improvements
– 30% technical improvements and maintenance
– 10% product small improvements & experiments
– 10% unplanned work
Measure. Make work visible.
Change the balance depending on the situation!
Technical Ownership
• Pride
• Knowledge
• Feeling the Pain
Fire Drills
• Small and large scenarios, Component failure and DR drills
• Things will always fail; even if your system is degrading gracefully, it is still
degraded
• As you grow the team, ways of managing incidents evolve, coordination becomes
more important
– Incident Command, Roles & Responsibilities
Take Aways
• Common patterns, achievable in your team/architecture, can make a big
difference through the accumulation of small improvements
• Culture and ways of working are more important than any technical magic
wand, no matter how shiny
H1 17/18 - 99.99%