What happens when 200k users unexpectedly decide to use your platform simultaneously? We’re using autoscale on Azure PaaS so surely we can handle that, right? Wrong! Ask me how I found out… After going through a bit of trouble, I want to help you avoid the same mistakes I made.
2. Dibran Mulder
CTO / Azure Solutions Architect
Caesar Groep / Cloud Republic
@dibranmulder
Particular Recognized Professional
Co-Host
www.devtalks.nl
@devtalkspodcast
3. Every January
• > 70% of all primary schools in the Netherlands
take tests on our platform.
• Pre-Covid, paper testing was dominant.
• The new student tracking platform was in use for the first time.
4. Tuesday 17th of January
8:15 – 8:30 School opening
8:30 – 9:00 Opening by teacher
9:00 – 9:05 Entire country starts taking tests
9:05 – 9:10 Wait for Azure to Scale
9:10 – 10:00 Continue with the test
10:00 – 11:00 Break & Play outside
11:00 – 12:00 Take second test
10. Hi Mr. Manager!
• Your alerts and monitoring will go off.
• Service Care is getting flooded with calls.
• Your manager is going to sit next to you
• Trust has been violated.
• Your work is being monitored all the time.
11. Would you?
Scale up using the Azure portal
despite your Infra as Code Policy.
Scale up using Infra as Code
and do a deployment.
Fix the problem yourself as a
team lead.
Let a team member fix the
problem.
Take ownership and act, we are
going to scale up, now!
Organize a meeting and
discuss the best approach.
13. App Service Plan Scaling
• Take a baseline for the night e.g. 2 instances
• Take a baseline for working hours, e.g. 10
instances
• Aggressive autoscaling: above 60% CPU, increase to max
• Decrease over time.
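The scaling rules on this slide can be sketched as a small decision function. This is a minimal illustration, not Azure's autoscale engine: the maximum instance count, the working-hours window, and all names are assumptions introduced here; only the two baselines and the 60% threshold come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    night_baseline: int = 2       # from the slide
    day_baseline: int = 10        # from the slide
    cpu_threshold: float = 60.0   # from the slide (percent CPU)
    max_instances: int = 30       # assumption: not stated in the source
    work_start: int = 7           # assumption: working hours 07:00-18:00
    work_end: int = 18


def baseline(policy: ScalePolicy, hour: int) -> int:
    """Return the minimum instance count for the given hour of the day."""
    in_working_hours = policy.work_start <= hour < policy.work_end
    return policy.day_baseline if in_working_hours else policy.night_baseline


def next_instance_count(policy: ScalePolicy, current: int,
                        cpu_percent: float, hour: int) -> int:
    """Jump straight to max above the threshold, otherwise step down slowly."""
    if cpu_percent > policy.cpu_threshold:
        return policy.max_instances
    # Below the threshold: release one instance per evaluation,
    # but never drop under the baseline for this time of day.
    return max(baseline(policy, hour), current - 1)
```

Scaling out in one jump and scaling in one instance at a time trades a little cost for a much shorter time to capacity.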
15. • Scaling rules with a 5-minute evaluation time are too slow in certain use cases.
• It's better to scale aggressively and decrease over time; it won't hurt costs that badly.
• Pre-provisioning might be helpful in some cases.
• It's hard to be cost-effective and confident at the same time.
• Be prepared to get shit from your nephews and nieces.
• Didn't we test this properly?
• We load tested the system with a ramp-up test of up to 5k concurrent users.
• We tested against non-functional requirements based on pre-Covid usage.
• We hadn't tested with 150k real users hitting the system in a 5-minute window.
• We didn't expect a paradigm shift in the adoption of digital testing.
Lessons learned
19. Identity Server Persisted Grants
• Refresh tokens are ≈ 515 bytes
• 900 sec lifetime
• 15 days lineage
• 150k students / 2 hours per day ≈ 10 refresh tokens per student
• 10k teachers / 8 hours per day ≈ 40 refresh tokens per teacher
• Students ≈ 1.5 GB per day
• Teachers ≈ 600 MB per day
• Database growth of almost 2.5 GB per day
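The refresh counts above follow from dividing session length by token lifetime. The sketch below reproduces that step and the raw token payload only. The function names are introduced here for illustration. The raw payload comes out lower than the per-day figures quoted on the slide, and the sketch does not try to model whatever accounts for the difference.

```python
import math

TOKEN_BYTES = 515          # from the slide
TOKEN_LIFETIME_S = 900     # from the slide


def refreshes_per_session(session_seconds: int,
                          lifetime_seconds: int = TOKEN_LIFETIME_S) -> int:
    """Number of refresh tokens issued while a session stays active."""
    return math.ceil(session_seconds / lifetime_seconds)


def raw_payload_bytes(users: int, tokens_per_user: int,
                      token_bytes: int = TOKEN_BYTES) -> int:
    """Bytes of token data alone, ignoring any storage overhead."""
    return users * tokens_per_user * token_bytes


student_refreshes = refreshes_per_session(2 * 3600)   # 8, rounded to 10 on the slide
teacher_refreshes = refreshes_per_session(8 * 3600)   # 32, rounded to 40 on the slide

student_bytes = raw_payload_bytes(150_000, 10)
teacher_bytes = raw_payload_bytes(10_000, 40)
```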
21. Identity Server Persisted Grants
• Users made extensive use of the online testing environment for students and the student tracking system for
teachers.
• Our composable front-end architecture tripled the number of refresh tokens generated.
• Refresh tokens are kept in the Persisted Grant Storage so that the token lineage can be verified and
tokens are not reused.
• The database grew to 100 GB in roughly 2 weeks.
• Scaling a database from S2 to Sx takes up to 1 min per GB.
• Scaling up a database under stress takes significantly longer…
• IdentityServer doesn't clean up by default but has a TokenCleanup feature.
services.AddIdentityServer()
    .AddOperationalStore(options =>
    {
        // This enables automatic token cleanup. This is optional.
        options.EnableTokenCleanup = true;
        options.TokenCleanupInterval = 3600; // interval in seconds (default is 3600)
    });
22.
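23.
ALTER PROCEDURE [dbo].[PersistedGrantCleanup]
AS
BEGIN
    SET NOCOUNT ON;
    -- IdentityServer writes Expiration in UTC, so compare against UTC.
    DECLARE @CurrentDateTime datetime2 = SYSUTCDATETIME();
    -- Track the batch size explicitly instead of relying on @@ROWCOUNT in the loop condition.
    DECLARE @Deleted int = 1;

    -- Suspend automatic statistics updates while deleting in bulk.
    EXEC sp_autostats 'dbo.PersistedGrants', 'OFF';

    WHILE (@Deleted > 0)
    BEGIN
        -- Delete in small batches to keep locks and log usage low.
        DELETE TOP (3000)
        FROM dbo.PersistedGrants
        WHERE Expiration < @CurrentDateTime;
        SET @Deleted = @@ROWCOUNT;

        -- Pause between batches so the cleanup does not exhaust the DTU budget.
        WAITFOR DELAY '00:00:05';
    END

    EXEC sp_autostats 'dbo.PersistedGrants', 'ON';
END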
Manual Cleanup
• Exactly 15 days (1,296,000 seconds) after our initial burst of users,
DTU issues started.
• Don't let IdentityServer clean up tokens, because it uses Entity
Framework.
• Staging and production slots run competing cleanups.
• Create a stored procedure and call it from a simple Logic App or Azure Function.
• Make sure not to stress the database: rate limit / throttle the stored
procedure.
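The batch-and-pause idea behind the stored procedure can be tried locally. The following is a minimal sketch that uses SQLite in place of Azure SQL; the table layout (an integer `Expiration` column) and the function name are assumptions made for the example, not the real IdentityServer schema.

```python
import sqlite3
import time


def cleanup_expired(conn: sqlite3.Connection, now: int,
                    batch_size: int = 3000, pause_seconds: float = 5.0) -> int:
    """Delete expired grants in small batches, pausing between batches."""
    total = 0
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM PersistedGrants WHERE rowid IN ("
                " SELECT rowid FROM PersistedGrants"
                " WHERE Expiration < ? LIMIT ?)",
                (now, batch_size),
            )
        deleted = cur.rowcount
        total += deleted
        if deleted < batch_size:
            # A partial batch means nothing expired is left.
            return total
        # Throttle so the cleanup never monopolises the database.
        time.sleep(pause_seconds)
```

Small batches keep each transaction short, and the pause leaves headroom for regular traffic while the backlog drains.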
25. • Composable UI architecture can increase the load on your IAM.
• Refresh token lineage is being stored for security reasons.
• IdentityServer is a good product but lacks database maintenance options.
• Scaling up a database can take a significant amount of time.
• Manually altering infrastructure, such as scaling a database, raises source control questions:
• Should I update this in dev or main or a release branch? Or create a hotfix?
• If I deploy a hotfix will this overwrite my scaling settings?
• Do you have a separate pipeline for Infra as Code?
Lessons learned
28. Microservices in Azure
• Languages: .NET, Java, JavaScript, Python
• Hosting: IaaS, PaaS, FaaS (here: Azure Functions)
• Communication: REST, gRPC, Messaging (here: Azure Service Bus + NServiceBus)
• Data: SQL, CosmosDb, Redis (here: Azure SQL)
29. HTTPS vs. Messaging
• HTTPS: synchronous
• Messaging: asynchronous
• Point-to-Point: single receiver
• Pub/Sub: multiple receivers
• REST is the de-facto standard for communication.
• Is suitable for one-to-one communication.
• Lots of libraries and programming languages
support it. Truly technology agnostic.
• Doesn’t support guaranteed delivery.
• Messaging is the better alternative
• Asynchronous in nature, enables recoverability,
resilience.
• Point-to-Point communication for one-to-one
communication.
• Enables one-to-many communication with pub/sub
patterns.
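The difference between point-to-point and pub/sub delivery can be shown with a toy in-memory bus. This is an illustration of the two delivery semantics only; the class and its methods are hypothetical and are not the Azure Service Bus or NServiceBus API.

```python
from collections import defaultdict, deque


class InMemoryBus:
    """Toy bus: queues deliver to one receiver, topics to every subscriber."""

    def __init__(self):
        self._queues = defaultdict(deque)      # queue name -> pending messages
        self._subscribers = defaultdict(list)  # topic name -> handler callables

    def send(self, queue: str, message: dict) -> None:
        """Point-to-point: the message waits for exactly one receiver."""
        self._queues[queue].append(message)

    def receive(self, queue: str):
        """Take the next message off the queue, or None when it is empty."""
        pending = self._queues[queue]
        return pending.popleft() if pending else None

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Pub/sub: every subscriber gets its own copy of the event."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(dict(message))
        return len(handlers)
```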
31. NServiceBus: Transactional consistency
• In an event-driven architecture always incorporate
transactional consistency.
• The transaction scopes of several steps are linked to
each other:
1. Handling the incoming message
(StudentChanged)
2. Updating the database
(Database Update)
3. Sending an event
(UserChanged)
• If any of these steps fails, all of them are rolled back.
• NServiceBus has APIs to help with this.
[Diagram: Administration (ParnaSys / Dotcom), Service Bus, Authentication; GET: Teachers/Students]
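One common way to link the database update and the outgoing event is to write both in the same local transaction (an outbox-style approach). The sketch below shows that idea with SQLite. It is not NServiceBus code; the table names, the function names, and the `fail_after_update` switch are assumptions made for the example, and acknowledging the incoming message is left out.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, user_id TEXT)"
    )
    conn.commit()


def handle_student_changed(conn: sqlite3.Connection, student_id: str, name: str,
                           fail_after_update: bool = False) -> None:
    """Apply StudentChanged and queue UserChanged atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (student_id, name),
        )
        if fail_after_update:
            raise RuntimeError("simulated failure between update and send")
        # The outgoing event is stored in the same transaction as the update,
        # so a dispatcher can never see one without the other.
        conn.execute(
            "INSERT INTO outbox (event_type, user_id) VALUES ('UserChanged', ?)",
            (student_id,),
        )
```

When the handler fails halfway, neither the row nor the event survives, so the message can simply be retried.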
32. NServiceBus: Sagas
• A workflow consisting of several messages being
handled
• Is started by specific messages
• Handles certain messages
• Somewhat comparable to Azure Durable Functions /
Azure Durable Entities
• State is stored in persistence of choice
• Orchestration is handled via Service bus messages.
• NServiceBus Saga persistence
• SQL Server, MySQL, PostgreSQL, Azure Table Storage,
MongoDB, RavenDB and more.
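The shape of a saga (started by one message, advanced by others, state kept per correlation ID) can be sketched in a few lines. This is a conceptual illustration only, not the NServiceBus saga API; the message types `TestStarted`, `StudentFinished` and `TestCompleted` and all field names are hypothetical, and the dictionary stands in for a real persistence store.

```python
from dataclasses import dataclass, field


@dataclass
class ExamSaga:
    """State of one workflow instance, keyed by test_id."""
    test_id: str
    expected_students: int
    finished: set = field(default_factory=set)
    completed: bool = False


class SagaStore:
    def __init__(self):
        self._sagas = {}  # correlation id -> saga state

    def handle(self, message: dict):
        """Route a message to its saga; return an outgoing event or None."""
        kind, test_id = message["type"], message["test_id"]
        if kind == "TestStarted":
            # Only this message type is allowed to start a saga.
            self._sagas.setdefault(
                test_id, ExamSaga(test_id, message["expected_students"])
            )
            return None
        saga = self._sagas.get(test_id)
        if saga is None or saga.completed:
            return None  # no running saga for this message
        if kind == "StudentFinished":
            saga.finished.add(message["student_id"])
            if len(saga.finished) >= saga.expected_students:
                saga.completed = True
                return {"type": "TestCompleted", "test_id": test_id}
        return None
```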
34. Test processing
[Diagram: Application, Application, Testing environment, Service Bus, Post processing, Reporting]
• Thousands of events come in from the online testing environment.
• Test started, paused, finished, …
• Microservices act on events
• Notify teachers on test status.
• Close tests when started/finished
• Analyze answers after test, such as:
• d/dt, au/ou, ei/ij-analysis
• Categorical mistakes, fractions, multiplications, etc.
• Historical analysis
• Did the student, class, group, school improve over time?
• Sync data with 3rd Party systems such as LAS
• …
35. Test processing – LOB systems
[Diagram: Service Bus, Post processing, Line of Business System (test products); REST: Get fault patterns; Fix tests]
• People work in LOB (line of business) systems during
business hours.
• Expect data to be locked or incomplete.
• Always validate data on your side of the system.
• Use caching with LOB systems. 99% of the time they are
not built for scale.
• Retry policy of 10 attempts; the message is dead-lettered
after 10 retries.
• Retrying exposes LOB systems to even more load.
• Back off on functional errors: if the test data isn't
there, retrying makes no sense.
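A retry loop that treats transient and functional errors differently might look like the sketch below. The exception classes, the function name, and the backoff values are assumptions made for the example; only the limit of 10 retries comes from the slide. Here functional errors are sent to the error queue straight away, which is one of the two options the talk mentions.

```python
import time


class TransientError(Exception):
    """Temporary failure, e.g. a timeout from the LOB system."""


class FunctionalError(Exception):
    """The data itself is wrong or missing; retrying will not help."""


def process_with_retries(handler, message, max_retries: int = 10,
                         base_delay: float = 0.5, sleep=time.sleep):
    """Return (outcome, attempts) where outcome is 'completed' or 'dead-lettered'."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(message)
            return ("completed", attempts)
        except FunctionalError:
            # Do not hammer the LOB system for something that cannot succeed.
            return ("dead-lettered", attempts)
        except TransientError:
            if attempts > max_retries:
                return ("dead-lettered", attempts)
            # Exponential backoff, capped at one minute.
            sleep(min(base_delay * 2 ** (attempts - 1), 60.0))
```

Backing off between attempts matters here because every retry adds load to a system that is already struggling.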
36. PDF Report generation
• In the case of the Doorstroomtoets, PDFs needed to be
generated for students and their parents/guardians.
• The External Service only had a REST API
• We used an Azure Function with Service Bus trigger.
• The external service hosted Puppeteer inside an App
Service.
• 100k reports in 1 afternoon didn’t work well.
• ServicePulse saved us: we retried in batches.
[Diagram: Service Bus, Report Generator, External System; POST: Generate Reports; Open Chromium Page; Save as PDF]
38. • Messaging only works well if you design systems well.
• Commands vs Events
• Point-to-point vs Pub Sub
• Service bus topology
• Distinguish between functional and transient exceptions. Don't retry on functional
exceptions, or back off for a longer period.
• Out-of-order event processing is inevitable at large scale.
• Idempotency, replaying
• Azure Service Bus might refuse connections.
• Azure Service Bus Exception : Cannot allocate more handles. The maximum
number of handles is 4999
• Audit logging makes the problem worse.
• Prefer batching over streaming data processing in SQL Server.
• Build for resilience and you will most likely not lose data.
• You can’t live without a Service Bus monitoring solution with thousands of messages
and dozens of services.
• Transactional consistency helps to avoid Zombie records and Ghost messages
Lessons learned
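Idempotent handling of duplicate and out-of-order events can be sketched as follows. This is an illustration under assumptions: it presumes each event carries a unique `id` and a per-test `sequence` number, neither of which is described in the talk, and it keeps state in memory where a real service would use its database.

```python
class StatusProjection:
    """Keeps the latest known status per test, tolerant of replays and reordering."""

    def __init__(self):
        self.seen_ids = set()   # ids of events that were already applied
        self.status = {}        # test_id -> (sequence, status)

    def apply(self, event: dict) -> bool:
        """Apply an event; return True only when it changed the state."""
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery or replay: safe to ignore
        self.seen_ids.add(event["id"])
        current = self.status.get(event["test_id"])
        if current is not None and current[0] >= event["sequence"]:
            return False  # older than what we already know: arrived late
        self.status[event["test_id"]] = (event["sequence"], event["status"])
        return True
```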
39. Regaining trust with Postmortems
Some templates available at:
https://github.com/dastergon/postmortem-templates/blob/master/templates/postmortem-template-azure.md
Title (incident)
Date
Summary of impact
Customer impact
Root cause and mitigation
Next steps
40. Regaining trust with Postmortems
In the moment:
• Take ownership of the situation. As a DevOps team you must solve the situation.
• Don't act on emotion; reason with your team.
• As a Team Lead one should:
• Shield your team from stakeholders.
• Don't fix it on your own. Involve team members.
• Send it yourselves to the corresponding stakeholders.
After the moment:
• Trust has been violated; you must regain it.
• Discuss in the team what went wrong.
• Write a postmortem, be very specific. What was the problem? How did you deal with it? How are you going to prevent
this?