Scaling for Success:
Lessons from handling peak loads on Azure
Dibran Mulder
CTO / Azure Solutions Architect
Caesar Groep / Cloud Republic
@dibranmulder
Particular Recognized Professional
Co-Host
www.devtalks.nl
@devtalkspodcast
Every January
• More than 70% of all primary schools in the Netherlands
take tests on our platform.
• Pre-Covid, paper testing was dominant.
• The new student tracking platform was in use for the first time.
Tuesday 17th of January
8:15 – 8:30 School opening
8:30 – 9:00 Opening by teacher
9:00 – 9:05 Entire country starts taking tests
9:05 – 9:10 Wait for Azure to Scale
9:10 – 10:00 Continue with the test
10:00 – 11:00 Break & Play outside
11:00 – 12:00 Take second test
Web Traffic in the Morning
CPU percentage correlates with HTTPS traffic
Scaling can’t keep up
What happens when you make the newspaper?
Hi Mr. Manager!
• Your alerts and monitoring will go off.
• Service Care is getting flooded with calls.
• Your manager is going to sit next to you
• Trust has been violated.
• Your work is being monitored all the time.
Would you?
Scale up using the Azure portal
despite your Infra as Code Policy.
Scale up using Infra as Code
and do a deployment.
Fix the problem yourself as a
team lead.
Let a team member fix the
problem.
Take ownership and act, we are
going to scale up, now!
Organize a meeting and
discuss the best approach.
App Service Plan Scaling
• Take a baseline for the night e.g. 2 instances
• Take a baseline for working hours, e.g. 10
instances
• Aggressive autoscaling: above 60% CPU, increase to
max
• Decrease over time.
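The scaling rules above can be sketched as a small decision function. This is a minimal Python illustration, not Azure's autoscale API: the 2 and 10 instance baselines and the 60% CPU threshold come from the slide, while `max_instances`, the working-hours window, and the one-instance-per-evaluation scale-in step are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    night_baseline: int = 2       # from the slide: e.g. 2 instances at night
    day_baseline: int = 10        # from the slide: e.g. 10 instances in working hours
    cpu_threshold: float = 60.0   # from the slide: scale out above 60% CPU
    max_instances: int = 30       # assumption: the plan's maximum instance count
    day_start_hour: int = 7       # assumption: start of working hours
    day_end_hour: int = 18        # assumption: end of working hours


def target_instances(policy: ScalePolicy, hour: int, cpu_percent: float, current: int) -> int:
    """Return the instance count to aim for at this evaluation."""
    in_working_hours = policy.day_start_hour <= hour < policy.day_end_hour
    baseline = policy.day_baseline if in_working_hours else policy.night_baseline
    if cpu_percent > policy.cpu_threshold:
        # Aggressive scale-out: jump straight to the maximum.
        return policy.max_instances
    # Decrease over time: one instance per evaluation, never below the baseline.
    return max(baseline, current - 1)
```

Jumping to the maximum in one step trades some cost for headroom; the slow scale-in then hands the extra instances back gradually.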
App Service Plan Scaling
• Scaling rules with a 5-minute evaluation time are too slow in certain use cases.
• It’s better to scale aggressively and decrease over time; it won’t hurt costs that badly.
• Pre-provisioning might be helpful in some cases.
• It’s hard to be cost-effective and confident at the same time.
• Be prepared to get shit from your nephews and nieces.
• Haven’t we tested right?
• We have load tested the system with a ramp up test up to 5k concurrent users.
• We tested against non-functional requirements based on pre-Covid usage.
• We haven’t tested with 150k real users hitting the system in a 5-minute window.
• We didn’t expect a paradigm shift in the adoption of digital testing.
Lessons learned
What is the next weakest link?
Authentication architecture (diagram)
• Applications and APIs: OpenID Connect (ID Token & Access Token, Refresh Token)
• Authentication endpoint: authenticatie.x.nl
• IdentityServer: Client configuration, Persisted Grant Storage (Refresh Token)
• Identity providers (SAML / OpenID Connect):
• Social Logins (Google, Facebook, Twitter)
• Industry standards (Basispoort, Entree)
• Azure Active Directory (Internal employees)
• Custom Identity Providers (Startwoord, Portal)
• Federated AAD (Partners)
• Azure B2C
Identity Server Persisted Grants
• Refresh tokens are ≈ 515 bytes
• 900 sec lifetime
• 15 days lineage
• 150k students / 2 hours per day ≈ 10 refresh tokens per student
• 10k teachers / 8 hours per day ≈ 40 refresh tokens per teacher
• Students ≈ 1.5 GB per day
• Teachers ≈ 600 MB per day
• Database growth of almost 2.5 GB per day
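The estimate above is plain multiplication, and a short Python sketch makes the inputs explicit. The 515-byte token size, the 900-second lifetime, and the user counts come from the slide; the function names and the `multiplier` parameter are hypothetical. The raw payload computed this way comes out lower than the slide's per-day figures. The gap is presumably row and index overhead plus extra tokens requested per user, but that is an assumption and not something the slide states.

```python
def tokens_per_session(session_hours: int, token_lifetime_seconds: int = 900) -> int:
    """Refresh tokens issued if one token is used up per lifetime window."""
    return session_hours * 3600 // token_lifetime_seconds


def daily_token_bytes(users: int, tokens_per_user: int,
                      token_bytes: int = 515, multiplier: int = 1) -> int:
    """Raw refresh-token payload stored per day, ignoring row and index overhead.

    `multiplier` is a hypothetical knob for anything that multiplies token
    requests per user, such as several front-end components refreshing separately.
    """
    return users * tokens_per_user * token_bytes * multiplier


students = daily_token_bytes(150_000, 10)   # slide: 150k students, about 10 tokens each
teachers = daily_token_bytes(10_000, 40)    # slide: 10k teachers, about 40 tokens each
```

A 2-hour session with a 900-second lifetime yields 8 tokens and an 8-hour session yields 32, which the slide rounds to 10 and 40.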
DTU Load on our Identity Server database
Identity Server Persisted Grants
• Users made extensive use of the online testing environment for students and the student tracking system for
teachers.
• Our composable front-end architecture tripled the number of refresh tokens generated.
• Refresh tokens are kept in the Persisted Grant Storage to make sure the lineage of tokens is correct and that they
are not reused.
• Database grew to 100 GB in roughly 2 weeks.
• Scaling a database from S2 to Sx takes up to 1 min per GB.
• Scaling up a database under stress takes significantly longer…
• IdentityServer doesn’t clean up by default but has a TokenCleanup feature.
services.AddIdentityServer()
    .AddOperationalStore(options =>
    {
        // this enables automatic token cleanup. this is optional.
        options.EnableTokenCleanup = true;
        options.TokenCleanupInterval = 3600; // interval in seconds (default is 3600)
    });
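ALTER PROCEDURE [dbo].[PersistedGrantCleanup]
AS
BEGIN
    SET NOCOUNT ON;
    -- IdentityServer stores Expiration in UTC, so compare against UTC.
    DECLARE @CurrentDateTime AS datetime2 = SYSUTCDATETIME();
    -- Track the batch size explicitly; @@ROWCOUNT is reset by other statements.
    DECLARE @Deleted AS int = 1;
    EXEC sp_autostats 'dbo.PersistedGrants', 'OFF';
    -- Delete in small batches and pause in between to throttle DTU usage.
    WHILE (@Deleted > 0)
    BEGIN
        DELETE TOP (3000)
        FROM dbo.PersistedGrants
        WHERE Expiration < @CurrentDateTime;
        SET @Deleted = @@ROWCOUNT;
        WAITFOR DELAY '00:00:05';
    END
    EXEC sp_autostats 'dbo.PersistedGrants', 'ON';
END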
Manual Cleanup
• DTU issues started exactly 15 days (1,296,000 seconds) after our initial burst of users.
• Don’t let IdentityServer clean up tokens, because it uses Entity
Framework.
• Competing cleanups with staging and production slots.
• Create a stored procedure and trigger it with a simple Logic App or Azure Function.
• Make sure not to stress the database: rate limit / throttle the stored
procedure.
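The throttled cleanup described above boils down to "delete a bounded batch, pause, repeat". The following Python sketch shows that loop against an in-memory SQLite table so it can be run anywhere. It is an illustration only: the table layout, the integer `Expiration` column, and the function name are assumptions, and the batch size and pause mirror the 3000 rows and 5 seconds of the stored procedure.

```python
import sqlite3
import time


def cleanup_expired(conn: sqlite3.Connection, now: int,
                    batch_size: int = 3000, pause_seconds: float = 5.0) -> int:
    """Delete expired grants in bounded batches, pausing between batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM PersistedGrants WHERE rowid IN ("
            "SELECT rowid FROM PersistedGrants WHERE Expiration < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()            # keep each batch in its own short transaction
        deleted = cur.rowcount
        total += deleted
        if deleted < batch_size:
            return total         # last (partial or empty) batch: nothing left
        time.sleep(pause_seconds)  # throttle so the database is not saturated
```

Short transactions per batch keep locks brief, and the pause gives regular traffic room between batches.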
Persisted Grants Store
• Composable UI architecture can increase the load on your IAM.
• Refresh token lineage is being stored for security reasons.
• IdentityServer is a good product but lacks database maintenance options.
• Scaling up a database can take a significant amount of time.
• Manually altering infrastructure, such as scaling a database, leads to source code issues:
• Should I update this in dev or main or a release branch? Or create a hotfix?
• If I deploy a hotfix will this overwrite my scaling settings?
• Do you have a separate pipeline for Infra as Code?
Lessons learned
Ok, so everyone was able to take a test?
We’re good right?
Microservices in Azure
• Hosting (.NET, Java, JavaScript, Python; IaaS, PaaS, FaaS): Azure Functions
• Communication (REST, gRPC, Messaging): Azure Service Bus + NServiceBus
• Data (SQL, Cosmos DB, Redis): Azure SQL
Synchronous: HTTPS
Asynchronous: Messaging
Point-to-Point: single receiver
Pub / Sub: multiple receivers
• REST is the de facto standard for communication.
• It is suitable for one-to-one communication.
• Lots of libraries and programming languages
support it. Truly technology agnostic.
• It doesn’t support guaranteed delivery.
• Messaging is the better alternative.
• Asynchronous in nature, enables recoverability and
resilience.
• Point-to-point communication for one-to-one
communication.
• Enables one-to-many communication with pub/sub
patterns.
NServiceBus
NServiceBus: Transactional consistency
• In an event-driven architecture always incorporate
transactional consistency.
• The transaction scopes of several steps are linked to
each other:
1. Handling the incoming message
(StudentChanged)
2. Updating the database
(Database Update)
3. Sending an event
(UserChanged)
• If any of these steps fails, all transactions are rolled back.
• NServiceBus has APIs to help with this.
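The all-or-nothing behaviour of the three steps above can be illustrated without NServiceBus. The Python sketch below is not NServiceBus's API. It uses a single SQLite transaction and a hypothetical `outbox` table so that the database update and the outgoing `UserChanged` event are committed together or not at all. Table names, function names, and the `fail_before_commit` switch are assumptions for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one user and an empty outbox (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE outbox (event_type TEXT, user_id INTEGER)")
    conn.execute("INSERT INTO users VALUES (1, 'old name')")
    conn.commit()
    return conn


def handle_student_changed(conn: sqlite3.Connection, student_id: int, new_name: str,
                           fail_before_commit: bool = False) -> bool:
    """Handle StudentChanged: update the row and record UserChanged in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, student_id))
            conn.execute("INSERT INTO outbox (event_type, user_id) VALUES ('UserChanged', ?)",
                         (student_id,))
            if fail_before_commit:
                raise RuntimeError("simulated failure while handling the message")
    except RuntimeError:
        return False  # nothing was persisted; the incoming message can be retried
    return True
```

Because the event is only recorded when the update commits, a failure leaves neither a changed row without an event nor an event without a changed row.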
Diagram: Administration (ParnaSys / Dotcom), Authentication, Service Bus, GET: Teachers/Students
NServiceBus: Sagas
• A workflow consisting of several messages being
handled
• Is started by specific messages
• Handles certain messages
• Somewhat comparable to Azure Durable Functions /
Azure Durable Entities
• State is stored in persistence of choice
• Orchestration is handled via Service bus messages.
• NServiceBus Saga persistence
• SQL Server, MySQL, PostgreSQL, Azure Table Storage,
MongoDB, RavenDB and more.
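The saga idea above (started by specific messages, handling others, with persisted state) can be shown in a few lines. This Python sketch is a conceptual illustration and not the NServiceBus saga API: the class names are hypothetical, the message names are borrowed from the test events mentioned later in the deck, and a plain dictionary stands in for the saga persistence.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExamSagaState:
    test_id: str
    started: bool = False
    finished: bool = False


class ExamSaga:
    def __init__(self, store: Dict[str, ExamSagaState]):
        self.store = store  # stands in for the saga persistence of choice

    def handle(self, message_type: str, test_id: str) -> Optional[ExamSagaState]:
        state = self.store.get(test_id)
        if message_type == "TestStarted":
            # Only this message type may start a new saga instance.
            state = state or ExamSagaState(test_id)
            state.started = True
        elif state is None:
            return None  # no saga was started for this test; ignore the message
        elif message_type == "TestFinished":
            state.finished = True
        self.store[test_id] = state
        if state.finished:
            del self.store[test_id]  # workflow complete: remove the persisted state
        return state
```

State is looked up by a correlation value (here `test_id`) on every message, which is what lets the workflow span many separate messages.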
Some (hard) lessons learned on
Event driven Architecture
Diagram: Applications, Service Bus, Testing environment, Post processing, Reporting
Test processing
• Thousands of events come in from the online testing environment.
• Test started, paused, finished, …
• Microservices act on events
• Notify teachers on test status.
• Close tests when started/finished
• Analyze answers after test, such as:
• d/dt, au/ou, ei/ij-analysis
• Categorical mistakes, fractions, multiplications, etc.
• Historical analysis
• Did the student, class, group, school improve over time?
• Sync data with 3rd Party systems such as LAS
• …
Test processing – LOB systems
Diagram: Service Bus, Post processing, Line of Business System (Test products), REST: Get fault patterns, Fix tests
• People work in LOB (line of business) systems during
business hours.
• Expect data to be locked or incomplete.
• Always validate data on your side of the system.
• Use caching with LOB systems. 99% of the time they are
not built for scale.
• Retry policy of 10 times; the message will be dead-lettered
after 10 retries.
• Retrying exposes LOB systems to even more load.
• Back off on functional errors: if the test data isn’t
there, retrying makes no sense.
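The retry guidance above can be sketched as a small loop that treats the two kinds of failure differently. This is an illustrative Python sketch, not NServiceBus recoverability configuration. The exception classes and function name are hypothetical. The limit of 10 retries comes from the slide, and sending functional errors straight to the dead-letter list is one of the options the deck mentions.

```python
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Temporary failure, e.g. a timeout or a locked record in the LOB system."""


class FunctionalError(Exception):
    """Failure that a retry cannot fix, e.g. the test data does not exist."""


def process_with_retries(handler: Callable[[dict], None], message: dict,
                         dead_letters: List[dict], max_retries: int = 10) -> Tuple[str, int]:
    """Run the handler; retry only transient errors, dead-letter everything else."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(message)
            return "ok", attempts
        except FunctionalError:
            dead_letters.append(message)      # retrying makes no sense
            return "dead-lettered", attempts
        except TransientError:
            if attempts > max_retries:        # first attempt plus max_retries retries
                dead_letters.append(message)
                return "dead-lettered", attempts
```

A production version would also wait between retries, ideally with increasing delays, so that a struggling LOB system is not hit with the full retry volume at once.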
PDF Report generation
• In the case of the Doorstroomtoets, PDFs needed to be
generated for students and their parents/guardians.
• The External Service only had a REST API
• We used an Azure Function with Service Bus trigger.
• The external service hosted Puppeteer inside an App
Service.
• 100k reports in one afternoon didn’t work well.
• ServicePulse saved us: we retried in batches.
Diagram: Report Generator, Service Bus, External System (POST: Generate Reports, Open Chromium Page, Save as PDF), ServicePulse
• Messaging only works well if you design systems well.
• Commands vs Events
• Point-to-point vs Pub Sub
• Service bus topology
• Distinguish between functional and transient exceptions. Don’t retry on functional
exceptions, or back off for a longer period.
• Out-of-order event processing is inevitable at large scale.
• Idempotent, replaying
• Azure Service Bus might refuse connections.
• Azure Service Bus Exception : Cannot allocate more handles. The maximum
number of handles is 4999
• Audit logging enlarges the problem.
• Prefer batching over streaming data processing in SQL Server.
• Build for resilience and you will most likely not lose data.
• You can’t live without a Service Bus monitoring solution with thousands of messages
and dozens of services.
• Transactional consistency helps to avoid Zombie records and Ghost messages
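The points on idempotency and out-of-order events can be made concrete with a small handler. This Python sketch is an illustration under assumptions: it presumes every event carries a unique message id and a per-test sequence number, neither of which the slides specify, and the class and method names are hypothetical.

```python
from typing import Dict, Set, Tuple


class ExamStatusProjection:
    """Keeps the latest status per test, tolerating duplicates and reordering."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.status: Dict[str, Tuple[int, str]] = {}  # test_id -> (sequence, status)

    def apply(self, message_id: str, test_id: str, sequence: int, status: str) -> bool:
        """Apply an event; return False when it is a duplicate or older than current state."""
        if message_id in self.seen_ids:
            return False                      # replayed message: already handled
        self.seen_ids.add(message_id)
        current = self.status.get(test_id)
        if current is not None and current[0] >= sequence:
            return False                      # arrived late: newer state already applied
        self.status[test_id] = (sequence, status)
        return True
```

With this shape, replaying a batch of failed messages is safe: duplicates are ignored and a late "started" event cannot overwrite an already recorded "finished".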
Lessons learned
Regaining trust with Postmortems
Some templates available at:
https://github.com/dastergon/postmortem-templates/blob/master/templates/postmortem-template-azure.md
Title (incident)
Date
Summary of impact
Customer impact
Root cause and mitigation
Next steps
Regaining trust with Postmortems
In the moment:
• Take ownership of the situation. As a DevOps team you must solve the situation.
• Don’t act on emotion; reason with your team.
• As a team lead, you should:
• Shield your team from stakeholders.
• Not fix it on your own. Involve team members.
• Send it yourselves to the corresponding stakeholders.
After the moment:
• Trust has been violated; you must regain it.
• Discuss in the team what went wrong.
• Write a postmortem, be very specific. What was the problem? How did you deal with it? How are you going to prevent
this?
Questions?
cloudrepublic.nl
d.mulder@cloudrepublic.nl
Dibran Mulder
DibranMulder

[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Scaling for Success: Lessons from handling peak loads on Azure with NServiceBus

  • 1. Scaling for Success: Lessons from handling peak loads on Azure Dibran Mulder
  • 2. Dibran Mulder CTO / Azure Solutions Architect Caesar Groep / Cloud Republic @dibranmulder Particular Recognized Professional Co-Host www.devtalks.nl @devtalkspodcast
  • 3. Every January
      • More than 70% of all primary schools in the Netherlands take tests on our platform.
      • Before Covid, paper testing was dominant.
      • The new student tracking platform was in use for the first time.
  • 4. Tuesday 17th of January
      8:15 – 8:30 School opening
      8:30 – 9:00 Opening by teacher
      9:00 – 9:05 Entire country starts taking tests
      9:05 – 9:10 Wait for Azure to scale
      9:10 – 10:00 Continue with the test
      10:00 – 11:00 Break & play outside
      11:00 – 12:00 Take second test
  • 5.
  • 6. Web Traffic in the Morning
  • 7. CPU percentage correlates with HTTPS traffic
  • 9. What happens when you make the newspaper?
  • 10. Hi Mr. Manager!
      • Your alerts and monitoring will go off.
      • Service Care is getting flooded with calls.
      • Your manager is going to sit next to you.
      • Trust has been violated.
      • Your work is being monitored all the time.
  • 11. Would you?
      • Scale up using the Azure portal despite your Infra as Code policy.
      • Scale up using Infra as Code and do a deployment.
      • Fix the problem yourself as a team lead.
      • Let a team member fix the problem.
      • Take ownership and act: we are going to scale up, now!
      • Organize a meeting and discuss the best approach.
  • 12. App Service Plan Scaling
  • 13. App Service Plan Scaling
      • Take a baseline for the night, e.g. 2 instances.
      • Take a baseline for working hours, e.g. 10 instances.
      • Autoscale aggressively: above 60% CPU, increase to the maximum instance count.
      • Decrease over time.
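The scaling rules on this slide can be sketched as a small decision function. This is only an illustration of the idea, not the team's actual autoscale configuration: the class and method names, the working-hours window (07:00 to 18:00), the maximum of 30 instances and the one-instance step down are all assumptions; only the baselines of 2 and 10 and the 60% CPU threshold come from the slide.

```csharp
using System;

// Hypothetical sketch of "baseline per time of day, jump to max on high CPU,
// then decrease over time". Not an Azure API.
public static class ScalePlan
{
    public const int NightBaseline = 2;      // from the slide
    public const int WorkdayBaseline = 10;   // from the slide
    public const double CpuThreshold = 60.0; // from the slide
    public const int MaxInstances = 30;      // assumption: the deck does not state the maximum

    // Assumption: working hours run from 07:00 to 18:00.
    public static int Baseline(TimeSpan timeOfDay) =>
        timeOfDay >= TimeSpan.FromHours(7) && timeOfDay < TimeSpan.FromHours(18)
            ? WorkdayBaseline
            : NightBaseline;

    // Called once per evaluation interval.
    public static int NextInstanceCount(int current, double cpuPercent, TimeSpan timeOfDay)
    {
        // Aggressive scale-out: go straight to the maximum instead of adding one instance at a time.
        if (cpuPercent > CpuThreshold) return MaxInstances;

        // Scale in slowly, one instance per evaluation, never below the current baseline.
        return Math.Max(Baseline(timeOfDay), current - 1);
    }
}
```

For example, `ScalePlan.NextInstanceCount(10, 75.0, TimeSpan.FromHours(9))` jumps to the maximum, while subsequent calls with low CPU step back down towards the working-hours baseline.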
  • 14. App Service Plan Scaling
  • 15. Lessons learned
      • Scaling rules with a 5-minute evaluation time are too slow in certain use cases.
      • It's better to scale aggressively and decrease over time; it won't hurt costs that much.
      • Pre-provisioning might be helpful in some cases.
      • It's hard to be cost-effective and confident at the same time.
      • Be prepared to get shit from your nephews and nieces.
      • Haven't we tested right?
          • We load tested the system with a ramp-up test up to 5k concurrent users.
          • We tested against non-functional requirements based on pre-Covid usage.
          • We hadn't tested with 150k real users hitting the system in a 5-minute window.
          • We didn't expect a paradigm shift in the adoption of digital testing.
  • 16. What is the next weakest link?
  • 17. [Diagram: authentication architecture]
      • Applications (three shown) and authenticatie.x.nl, connected via OpenId Connect; labels: ID Token & Access Token, Refresh Token, APIs.
      • IdentityServer: Client configuration, Persisted Grant Storage, Refresh Token.
      • Identity providers, connected via SAML / OpenId Connect:
          • Social logins (Google, Facebook, Twitter)
          • Industry standards (Basispoort, Entree)
          • Azure Active Directory (internal employees)
          • Custom identity providers (Startwoord, Portal)
          • Federated AAD (partners)
          • Azure B2C
  • 18.
  • 19. Identity Server Persisted Grants
      • Refresh tokens are ≈ 515 bytes.
      • 900 sec lifetime.
      • 15 days lineage.
      • 150k students / 2 hours per day ≈ 10 refresh tokens per student.
      • 10k teachers / 8 hours per day ≈ 40 refresh tokens per teacher.
      • Students ≈ 1.5 GB per day.
      • Teachers ≈ 600 MB per day.
      • Database growth of almost 2.5 GB per day.
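The per-user token counts above follow from the 900-second lifetime: a 2-hour session needs 7200 / 900 = 8 refreshes (rounded to 10 on the slide) and an 8-hour day needs 28800 / 900 = 32 (rounded to 40). The sketch below redoes that arithmetic; the class and method names are hypothetical. It counts raw token bytes only, so it comes out lower than the slide's per-day figures, which presumably include storage overhead that the deck does not break down.

```csharp
using System;

// Hypothetical helper for back-of-the-envelope sizing of persisted grants.
public static class GrantGrowth
{
    public const int TokenBytes = 515;           // from the slide
    public const int TokenLifetimeSeconds = 900; // from the slide

    // Number of refreshes needed to keep one session alive for the given number of hours.
    public static int RefreshesPerSession(double hours) =>
        (int)Math.Ceiling(hours * 3600 / TokenLifetimeSeconds);

    // Raw token payload per day for a group of users, in bytes (no row or index overhead).
    public static long RawBytesPerDay(long users, int tokensPerUser) =>
        users * tokensPerUser * TokenBytes;
}
```

With the slide's rounded counts, `GrantGrowth.RawBytesPerDay(150_000, 10)` gives 772,500,000 bytes for students and `GrantGrowth.RawBytesPerDay(10_000, 40)` gives 206,000,000 bytes for teachers.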
  • 20. DTU Load on our Identity Server database
  • 21. Identity Server Persisted Grants
      • Users made extensive use of the online testing environment for students and the student tracking system for teachers.
      • Our composable front-end architecture tripled the number of refresh tokens generated.
      • Refresh tokens are kept in the Persisted Grant Storage to make sure the lineage of tokens is correct and that they are not reused.
      • The database grew to 100 GB in roughly 2 weeks.
      • Scaling a database from S2 to Sx takes up to 1 minute per GB.
      • Scaling up a database under stress takes significantly longer…
      • IdentityServer doesn't clean up by default but has a TokenCleanup feature:

        services.AddIdentityServer()
            .AddOperationalStore(options =>
            {
                // This enables automatic token cleanup. This is optional.
                options.EnableTokenCleanup = true;
                options.TokenCleanupInterval = 3600; // interval in seconds (default is 3600)
            });
  • 22.
  • 23. Manual Cleanup
      • Exactly 15 days (1,296,000 seconds) after our initial burst of users, DTU issues started.
      • Don't let IdentityServer clean up tokens, because it uses Entity Framework.
      • Cleanups compete between staging and production slots.
      • Create a stored procedure and trigger it from a simple Logic App or Azure Function.
      • Make sure not to stress the database: rate limit / throttle the stored procedure.

        ALTER PROCEDURE [dbo].[PersistedGrantCleanup]
        AS
        BEGIN
            SET NOCOUNT ON;
            DECLARE @CurrentDateTime datetime2 = GETDATE();
            -- Track deleted rows explicitly; @@ROWCOUNT is reset by every statement.
            DECLARE @Rows int = 1;
            EXEC sp_autostats 'dbo.PersistedGrants', 'OFF';
            WHILE (@Rows > 0)
            BEGIN
                -- Delete in small batches to keep the DTU load low.
                DELETE TOP(3000) FROM PersistedGrants WHERE Expiration < @CurrentDateTime;
                SET @Rows = @@ROWCOUNT;
                -- Throttle: pause between batches.
                WAITFOR DELAY '00:00:05';
            END
            EXEC sp_autostats 'dbo.PersistedGrants', 'ON';
        END
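The same "small batch, pause, repeat" pattern can also be driven from application code, for example from the Azure Function that triggers the cleanup. The sketch below shows only the loop. `deleteBatch` is a hypothetical callback that deletes up to the given number of expired rows and returns how many it removed; wiring it to a real database is left out, and none of these names come from the deck.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical throttled cleanup loop: delete in batches and pause in between
// so the database is never hit with one large delete.
public static class ThrottledCleanup
{
    public static async Task<int> RunAsync(
        Func<int, Task<int>> deleteBatch, int batchSize, TimeSpan pause)
    {
        int total = 0;
        while (true)
        {
            int deleted = await deleteBatch(batchSize);
            total += deleted;
            if (deleted == 0) break;   // nothing left to clean up
            await Task.Delay(pause);   // throttle before the next batch
        }
        return total;
    }
}
```

A call such as `await ThrottledCleanup.RunAsync(deleteBatch, 3000, TimeSpan.FromSeconds(5))` mirrors the batch size and delay used in the stored procedure.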
  • 25. Lessons learned
      • A composable UI architecture can increase the load on your IAM.
      • Refresh token lineage is stored for security reasons.
      • IdentityServer is a good product but lacks database maintenance options.
      • Scaling up a database can take a significant amount of time.
      • Manually altering infra, such as scaling a database, causes source code issues:
          • Should I update this in dev, main or a release branch? Or create a hotfix?
          • If I deploy a hotfix, will this overwrite my scaling settings?
          • Do you have a separate pipeline for Infra as Code?
  • 26. Ok, so everyone was able to take a test? We’re good right?
  • 27.
  • 28. Microservices in Azure
      • Languages: .NET, Java, JavaScript, Python
      • Hosting: IaaS, PaaS, FaaS (Azure Functions)
      • Communication: REST, gRPC, Messaging (Azure Service Bus + NServiceBus)
      • Data: SQL, CosmosDb, Redis (Azure SQL)
  • 29. Messaging
      [Diagram: HTTPS (synchronous, single receiver) compared with asynchronous messaging: point-to-point (single receiver) and pub/sub (multiple receivers)]
      • REST is the de facto standard for communication.
          • It is suitable for one-to-one communication.
          • Lots of libraries and programming languages support it. Truly technology agnostic.
          • It doesn't support guaranteed delivery.
      • Messaging is the better alternative.
          • Asynchronous in nature; enables recoverability and resilience.
          • Point-to-point communication for one-to-one communication.
          • Enables one-to-many communication with pub/sub patterns.
  • 31. NServiceBus: Transactional consistency
      • In an event-driven architecture, always incorporate transactional consistency.
      • The transaction scopes of several steps are linked to each other:
          1. Handling the incoming message (StudentChanged)
          2. Updating the database (Database Update)
          3. Sending an event (UserChanged)
      • If any of these steps fails, all transactions are rolled back.
      • NServiceBus has APIs to help with this.
      [Diagram labels: Administration ParnaSys / Dotcom, Authentication, Service Bus, GET: Teachers/Students]
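One way to picture the all-or-nothing behaviour is to stage the database write and the outgoing event, and apply both only once the handler has finished without errors. The sketch below is a simplified in-memory illustration of that idea. It is not NServiceBus code and does not show how NServiceBus implements it; `UnitOfWork`, `MessageProcessor` and the string-based database and bus are all hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical unit of work: collects database writes and outgoing events
// and applies them together on Commit.
public sealed class UnitOfWork
{
    private readonly Dictionary<string, string> _database;
    private readonly List<string> _bus;
    private readonly Dictionary<string, string> _pendingWrites = new Dictionary<string, string>();
    private readonly List<string> _pendingEvents = new List<string>();

    public UnitOfWork(Dictionary<string, string> database, List<string> bus)
    {
        _database = database;
        _bus = bus;
    }

    public void Update(string key, string value) => _pendingWrites[key] = value;

    public void Publish(string eventName) => _pendingEvents.Add(eventName);

    public void Commit()
    {
        foreach (var write in _pendingWrites) _database[write.Key] = write.Value;
        _bus.AddRange(_pendingEvents);
    }
}

public static class MessageProcessor
{
    // Returns true when the message was handled. On failure nothing is written
    // or sent, so the incoming message can safely be retried.
    public static bool Process(
        Action<UnitOfWork> handler, Dictionary<string, string> database, List<string> bus)
    {
        var unitOfWork = new UnitOfWork(database, bus);
        try
        {
            handler(unitOfWork);
            unitOfWork.Commit();
            return true;
        }
        catch (Exception)
        {
            return false; // staged writes and events are discarded
        }
    }
}
```

A handler for StudentChanged would call `Update` for the database change and `Publish("UserChanged")` for the event; if it throws halfway, neither becomes visible.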
  • 32. NServiceBus: Sagas
      • A saga is a workflow consisting of several messages being handled.
          • It is started by specific messages.
          • It handles certain messages.
      • Somewhat comparable to Azure Durable Functions / Azure Durable Entities.
          • State is stored in the persistence of choice.
          • Orchestration is handled via Service Bus messages.
      • NServiceBus saga persistence:
          • SQL Server, MySql, PostgreSql, Azure Table Storage, MongoDb, RavenDb and more.
  • 33. Some (hard) lessons learned on Event driven Architecture
  • 34. Test processing
      [Diagram labels: Application, Application, Service Bus, Testing environment, Post processing, Reporting]
      • Thousands of events come in from the online testing environment.
          • Test started, paused, finished, …
      • Microservices act on events:
          • Notify teachers on test status.
          • Close tests when started/finished.
          • Analyze answers after the test, such as:
              • d/dt, au/ou, ei/ij-analysis
              • Categorical mistakes, fractions, multiplications, etc.
          • Historical analysis:
              • Did the student, class, group, school improve over time?
          • Sync data with 3rd party systems such as LAS.
          • …
  • 35. Test processing – LOB systems
      [Diagram labels: Service Bus, Post processing, Line of Business System: Test products, REST: Get fault patterns, Fix tests]
      • People work in LOB (line of business) systems during business hours.
          • Expect data to be locked or incomplete.
          • Always validate data on your side of the system.
      • Use caching with LOB systems. 99% of the time they are not built for scale.
      • Retry policy of 10 times: a message is dead-lettered after 10 retries.
          • Retrying exposes LOB systems to even more load.
          • Back off on functional errors: if the test data isn't there, retrying makes no sense.
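The retry advice above can be sketched as a policy that looks at the kind of failure before deciding when to try again. This is a minimal illustration, not the team's NServiceBus configuration. The `FailureKind` enum, the exponential delays capped at 60 seconds and the 30-minute back-off for functional errors are assumptions; only the limit of 10 attempts comes from the slide.

```csharp
using System;

// Hypothetical classification of failures.
public enum FailureKind
{
    Transient,  // e.g. a timeout or throttling response from the LOB system
    Functional  // e.g. the test data is not there (yet)
}

public static class RetryPolicy
{
    public const int MaxAttempts = 10; // from the slide: dead-lettered after 10 retries

    // attempt is the 1-based number of attempts that have failed so far.
    // Returns the delay before the next attempt, or null when the message
    // should be dead-lettered instead of retried.
    public static TimeSpan? NextDelay(FailureKind kind, int attempt)
    {
        if (attempt >= MaxAttempts) return null;

        // Retrying a functional error right away only adds load to the LOB system,
        // so wait much longer (assumed value).
        if (kind == FailureKind.Functional) return TimeSpan.FromMinutes(30);

        // Transient errors: exponential back-off 1s, 2s, 4s, ... capped at 60s.
        double seconds = Math.Min(60, Math.Pow(2, attempt - 1));
        return TimeSpan.FromSeconds(seconds);
    }
}
```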
  • 36. PDF Report generation
      [Diagram labels: Report Generator, External System, Service Bus, POST: Generate Reports, Open Chromium Page, Save as PDF]
      • For the Doorstroomtoets, PDFs needed to be generated for students and their parents/guardians.
      • The external service only had a REST API.
      • We used an Azure Function with a Service Bus trigger.
      • The external service hosted Puppeteer inside an App Service.
      • 100k reports in 1 afternoon didn't work well.
      • ServicePulse saved us: we retried in batches.
  • 38. Lessons learned
      • Messaging only works well if you design systems well.
          • Commands vs Events
          • Point-to-point vs Pub/Sub
          • Service bus topology
      • Distinguish between functional and transient exceptions. Don't retry on functional exceptions, or back off for a longer period.
      • Out-of-order event processing is inevitable at large scale.
          • Idempotent, replaying
      • Azure Service Bus might refuse connections.
          • Azure Service Bus Exception: Cannot allocate more handles. The maximum number of handles is 4999
          • Audit logging enlarges the problem.
      • Prefer batching over streaming data processing in SQL Server.
      • Build for resilience and you will most likely not lose data.
      • You can't live without a Service Bus monitoring solution when you have thousands of messages and dozens of services.
      • Transactional consistency helps to avoid zombie records and ghost messages.
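The "idempotent, replaying" point can be illustrated with a handler that ignores duplicate deliveries and stale events. This is a sketch under assumptions the deck does not confirm: that every message has a unique ID and that events for a test carry an increasing version number. The class name and the in-memory storage are hypothetical; a real service would persist this state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical projection of test status that tolerates duplicates,
// replays and out-of-order delivery.
public sealed class TestStatusProjection
{
    private readonly HashSet<Guid> _seenMessageIds = new HashSet<Guid>();
    private readonly Dictionary<string, (long Version, string Status)> _state =
        new Dictionary<string, (long Version, string Status)>();

    // Returns true when the event changed the stored status.
    public bool Apply(Guid messageId, string testId, long version, string status)
    {
        // Duplicate delivery or replay: this message was already processed.
        if (!_seenMessageIds.Add(messageId)) return false;

        // Out of order: a newer event for this test has already been applied.
        if (_state.TryGetValue(testId, out var current) && current.Version >= version)
            return false;

        _state[testId] = (version, status);
        return true;
    }

    public string StatusOf(string testId) =>
        _state.TryGetValue(testId, out var current) ? current.Status : null;
}
```

If "finished" (version 2) arrives before "started" (version 1), the late "started" event is ignored and the test stays finished; replaying either message has no further effect.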
  • 39. Regaining trust with Postmortems
      Some templates are available at: https://github.com/dastergon/postmortem-templates/blob/master/templates/postmortem-template-azure.md
      Template sections:
      • Title (incident)
      • Date
      • Summary of impact
      • Customer impact
      • Root cause and mitigation
      • Next steps
  • 40. Regaining trust with Postmortems
      In the moment:
      • Take ownership of the situation. As a DevOps team you must solve the situation.
      • Don't act on emotion; reason with your team.
      • As a team lead you should:
          • Shield your team from stakeholders.
          • Not fix it on your own. Involve team members.
      • Send it yourselves to the corresponding stakeholders.
      After the moment:
      • Trust has been violated; you must regain it.
      • Discuss in the team what went wrong.
      • Write a postmortem and be very specific. What was the problem? How did you deal with it? How are you going to prevent this?