Applications built over the years carry historical design assumptions, such as: it is acceptable to take a system out for upgrade maintenance for a few hours every 6 months.
In today’s world, embracing continuous delivery practices means more frequent releases, which means more downtime. Besides, finding a good maintenance window becomes a struggle with worldwide users, and a burden for the operators managing the upgrade outside business hours.
In this talk, I want to demonstrate that by mapping out complex deployment processes, it becomes possible to prioritise work and progressively reduce the deployment impact. I will also give practical advice on how to tackle blockers to zero-downtime deployments, such as:
Migrating database schemas while keeping an application running
Ensuring backward compatibility of messages and APIs
Dealing with long-running background jobs
Mitigating user session loss
Deploying without the comfort of a maintenance window also means that stability during the upgrade is a critical concern. I will go through how it can be achieved through systematic pipeline automation and good system visibility to help operators during the upgrade.
This talk comes directly from my personal experience: our core product used to need a 3-hour blackout for upgrades, every month, with somebody staying up at night to do it. Today, we can deploy during working hours without users noticing and are finally able to break away from long release cycles. This was achieved thanks to a strong collaboration between developers, SREs and infrastructure engineers, applying the techniques from this talk.
DevOpsDays Portugal 2019 - Our journey to zero-downtime deployments
1. Changing tyres on a moving car
Our journey to zero-downtime deployments
DevOpsDays Portugal 2019
June 4th, 2019 – Lisbon
@PierreVincent pvincent.io
3. “There has been a massive earthquake in New Zealand and I need to use Poppulo for regular updates. Please can you advise when it will be back online.”
– Poppulo customer
5. Core Monolith (est. 2007)
2009: Deploying every 3 to 6 months, 4 hours downtime, on Sunday at 5PM
2015: Deploying every 4 weeks, 2 hours downtime, on Sunday at 8PM
Microservices (est. 2015)
Deploying 10+ times/day, zero downtime, deploy on-demand, anytime
10. Deployment steps
Run database migrations
Enable maintenance mode
Shut down services
Upgrade services
Start services
Disable maintenance mode
Wait for queued jobs to complete
Wait for services startup
Durations: 15-60 mins, 5-30 mins, 15 mins
User impact: Limited functionality, Downtime
12. Online database migration
Decouple schema version from application version: Application [N] must work with schema [N+1]
Ensure backward compatibility with non-breaking changes only: No destructive operations to tables/columns in use; use expand/contract to split breaking changes
Limit impact to live traffic: Detect changes likely to cause locking problems
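As an illustration of detecting destructive or lock-prone changes before they reach live traffic, here is a minimal sketch of a migration check. It is not the tooling used in the talk: the rule set, the rule names and the `review_migration` function are all hypothetical, and which statements actually lock depends on the database engine and version (the index rule assumes PostgreSQL-style `CREATE INDEX CONCURRENTLY`).

```python
import re

# Hypothetical rules for illustration only; tune them to your database engine.
RISKY_PATTERNS = {
    "destructive: drops a column": re.compile(
        r"\bALTER\s+TABLE\b.*\bDROP\s+COLUMN\b", re.IGNORECASE | re.DOTALL
    ),
    "destructive: drops a table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "breaking: renames a column": re.compile(r"\bRENAME\s+COLUMN\b", re.IGNORECASE),
    # Assumes PostgreSQL, where a plain CREATE INDEX blocks writes to the table.
    "possible lock: index built without CONCURRENTLY": re.compile(
        r"\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY)", re.IGNORECASE
    ),
}


def review_migration(sql):
    """Return (reason, statement) pairs for statements that match a risky rule."""
    findings = []
    # Naive split on ';' is enough for a sketch; it ignores semicolons in strings.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    for statement in statements:
        for reason, pattern in RISKY_PATTERNS.items():
            if pattern.search(statement):
                findings.append((reason, statement))
    return findings


if __name__ == "__main__":
    migration = """
    ALTER TABLE users ADD COLUMN full_name TEXT;
    ALTER TABLE users DROP COLUMN name;
    CREATE INDEX idx_users_email ON users (email);
    """
    for reason, statement in review_migration(migration):
        print(f"{reason}: {statement}")
```

A check like this could run in the pipeline on every migration file, so that a flagged statement triggers a review rather than a surprise during the upgrade.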
13. Expand/Contract example: renaming a column
Create new column
Write to both columns
Migrate historical records
Read from new column
Remove old column
Releases: N+1, N+2, N+3
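The five steps of the column rename can be sketched end to end against an in-memory SQLite database. This is only an illustration under stated assumptions: the `users` table, the `name` and `full_name` columns and the helper functions are hypothetical, and in practice each step would ship in its own release rather than in one script.

```python
import sqlite3


def column_names(conn):
    """List the columns of the users table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


def insert_user(conn, value):
    """Step 2: the application writes to both the old and the new column."""
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (value, value)
    )


def read_names(conn):
    """Step 4: the application reads from the new column only."""
    return [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")  # written before the rename

# Step 1 (expand): create the new column; code that only knows `name` still works.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Step 2: new writes go to both columns.
insert_user(conn, "Grace")

# Step 3: migrate historical records that predate the dual write.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Step 5 (contract): remove the old column once nothing reads or writes it.
# Older SQLite versions lack DROP COLUMN, so this sketch rebuilds the table;
# on most database servers this would be a single ALTER TABLE ... DROP COLUMN.
conn.executescript(
    """
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
    """
)

print(column_names(conn), read_names(conn))
```

Every intermediate schema here is usable by both the previous and the next application version, which is what lets the application keep running between steps.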
14. More on schema migrations
Baron Schwartz - DevOps for the database
Chapter: Loosening the Application/Database coupling
www.vividcortex.com/resources/devops-for-the-database-ebook
Michiel Rook - Database Schema Migrations with Zero Downtime
speakerdeck.com/mrook/database-schema-migrations-with-zero-downtime-continuous-lifecycle-london-2019
16. Full upgrade vs. rolling upgrade
Full upgrade: instances 1 and 2 go through Drain, Stop, Upgrade, Start (Up [N] to Up [N+1]) at the same time. Feature downtime.
Rolling upgrade: instances 1 and 2 go through Drain, Stop, Upgrade, Start (Up [N] to Up [N+1]) one after the other. Feature continuously available.
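The difference between the two strategies can be shown with a small simulation. This is a sketch, not the deployment tooling from the talk: the `Instance` class and the function names are hypothetical, and the Drain, Stop, Upgrade and Start steps are reduced to flag changes. Each function returns the lowest number of instances that were still serving during the upgrade.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    serving: bool = True


def _upgrade_batch(batch, fleet, new_version):
    """Upgrade a batch together; return how many fleet instances kept serving."""
    for instance in batch:
        instance.serving = False  # Drain + Stop
    still_serving = sum(1 for instance in fleet if instance.serving)
    for instance in batch:
        instance.version = new_version  # Upgrade
        instance.serving = True  # Start
    return still_serving


def full_upgrade(fleet, new_version):
    """All instances go down at once, so the feature is unavailable meanwhile."""
    return _upgrade_batch(fleet, fleet, new_version)


def rolling_upgrade(fleet, new_version):
    """Instances are upgraded one at a time, so the others keep serving."""
    return min(_upgrade_batch([instance], fleet, new_version) for instance in fleet)


if __name__ == "__main__":
    print(full_upgrade([Instance("1", "N"), Instance("2", "N")], "N+1"))
    print(rolling_upgrade([Instance("1", "N"), Instance("2", "N")], "N+1"))
```

During the rolling upgrade, versions N and N+1 serve traffic side by side, which is why the schema and message compatibility rules from the earlier slides matter.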
18. Entire deployment pipeline in source control
✓ Consistent and repeatable deployments
✓ No more manual operations
✓ Any change is code-reviewed
19. Observable deployments
✓ Rolling-upgrade progress
✓ Core healthchecks
✓ Synthetic journey monitoring
✓ Error rates & queues saturation
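One way to put these signals to work is to check them between the steps of a rolling upgrade and pause when something looks wrong. The sketch below is an assumption about how such a gate could look, not a description of the speaker's setup: the `DeploymentSignals` fields, the `reasons_to_pause` function and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSignals:
    healthchecks_passing: bool
    synthetic_journey_passing: bool
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    queue_saturation: float  # fraction of queue capacity in use, 0.0 to 1.0


def reasons_to_pause(signals, max_error_rate=0.01, max_queue_saturation=0.8):
    """Return the reasons to pause the upgrade; an empty list means carry on."""
    reasons = []
    if not signals.healthchecks_passing:
        reasons.append("core healthchecks failing")
    if not signals.synthetic_journey_passing:
        reasons.append("synthetic journey failing")
    if signals.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if signals.queue_saturation > max_queue_saturation:
        reasons.append("queues saturated")
    return reasons


if __name__ == "__main__":
    current = DeploymentSignals(True, True, error_rate=0.05, queue_saturation=0.2)
    print(reasons_to_pause(current))
```

Returning the reasons rather than a bare boolean gives the operator something to act on when the upgrade pauses.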
20. 2009: Deploying every 3 to 6 months, 4 hours downtime, on Sunday at 8PM
2015: Deploying every 4 weeks, 2 hours downtime, on Sunday at 8PM
2019: Deploying anytime, zero downtime, during working hours
21. Zero-downtime deployments don’t mean everything stays up or that everything is immediately running the latest version.
They simply mean users don’t notice a thing while all this is happening.
Thank you!
@PierreVincent
pvincent.io