7. Benefits
✘ Reduce the risk of introducing a new software version
✘ Safe rollback strategy if issues are found
✘ The ability to do capacity testing of the new version in a production environment. Does my
new version require enhanced hardware capabilities?
✘ Side effect: you get zero-downtime deployment!
8. WHY?
✘ We cannot easily reproduce production traffic patterns
✘ We have no proper testing environment due to
○ Complex integration with third parties
○ Elevated hardware requirements
✘ We don’t have enough validation before going live
✘ We want more confidence when a new release is deployed
10. Similar to blue/green deployment, we start by deploying the new
version to a subset of our infrastructure. No requests are routed
there yet
1. deploy new version
11. We start sending some requests to the new version
2. start routing users
12. If no issues are found related to new version, we gradually
increase the percentage of users routed to the new version
3. Monitor
13. Once all users have been routed to the new version, we remove
the old one
4. remove old version
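The four steps above can be sketched as a small control loop. This is only a sketch under stated assumptions, not a real deployment tool: `set_canary_weight` and `is_healthy` are hypothetical callbacks standing in for whatever your router and your monitoring expose, and the step percentages are arbitrary.

```python
from typing import Callable, Iterable


def run_canary_rollout(
    set_canary_weight: Callable[[int], None],  # hypothetical: sets the % of traffic sent to the new version
    is_healthy: Callable[[], bool],            # hypothetical: asks monitoring whether the canary looks fine
    steps: Iterable[int] = (5, 25, 50, 100),   # arbitrary example percentages
) -> bool:
    """Gradually shift traffic to the new version; roll back on the first bad signal."""
    set_canary_weight(0)            # 1. new version is deployed but receives no traffic yet
    for percent in steps:
        set_canary_weight(percent)  # 2. start (or increase) routing users to the new version
        if not is_healthy():        # 3. monitor
            set_canary_weight(0)    # rollback: every user goes back to the old version
            return False
    return True                     # 4. all users are on the new version; the old one can be removed
```

A `False` return means the rollout was aborted and all traffic is back on the old version.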
15. Application monitoring
✘ A list of awesome APM (Application Performance Monitoring) tools & products
(commercial and OSS)
https://github.com/antonarhipov/awesome-apm
17. “Canary deployment also gives you a
rapid way to rollback - if anything goes
wrong you may route all users to the old
version”
18. Key points to take into account
✘ We are running two versions in parallel
✘ Be sure your software supports it
https://es.slideshare.net/sergio_pino/despliega-como-los-grandes-zero-downtime-deployment
19. Key points to take into account
✘ Deliver a consistent experience to users
✘ How are we going to monitor?
✘ Do we always have a rollback path?
✘ Canaries don’t replace good DevOps practices
20. How do we route users?
✘ Random pattern
✘ Geographic pattern
✘ IP range pattern
✘ Based on user type (freemium vs premium). Feature toggles
✘ Based on some application logic. Feature toggles
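Three of the patterns above can be sketched as small routing functions. This is an illustration only: the `Request` type, the function names and the `"canary"` / `"stable"` labels are assumptions made for the example, and in practice this decision usually lives in a load balancer, gateway or service mesh rather than in application code.

```python
import ipaddress
import random
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request model holding only what the routing rules need."""
    client_ip: str
    user_type: str  # e.g. "freemium" or "premium"


def route_random(request: Request, canary_percent: float, rng: random.Random) -> str:
    """Random pattern: each request independently has a canary_percent % chance."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def route_by_ip_range(request: Request, canary_network: str) -> str:
    """IP range pattern: clients inside the given CIDR block go to the canary."""
    address = ipaddress.ip_address(request.client_ip)
    return "canary" if address in ipaddress.ip_network(canary_network) else "stable"


def route_by_user_type(request: Request, canary_types: frozenset) -> str:
    """User type pattern: only the listed user types see the canary."""
    return "canary" if request.user_type in canary_types else "stable"
```

Note that `route_random` decides per request, so the same user may alternate between versions from one request to the next.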
22. What about release gates?
Gates allow automatic collection of health signals
from external services, and then promote the
release when all the signals are successful at the
same time or stop the deployment on timeout
https://docs.microsoft.com/en-us/azure/devops/pipelines/release/approvals/gates?view=azure-devops
23. ✘ 10% canary users
✘ 25% canary users
✘ 100% canary users
Define several stages, and a gate to transition between them
The transition will only happen if
there are no alerts related to
failed requests, response time
and availability
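The staged transition described above might be modelled like this. It is a sketch, not the Azure DevOps gates API: the `HealthSignals` type and the `next_stage` function are hypothetical, and a failed gate is modelled here simply as staying at the current stage.

```python
from dataclasses import dataclass

STAGES = (10, 25, 100)  # canary percentages taken from the slide


@dataclass
class HealthSignals:
    """Hypothetical alert counts collected while the current stage was live."""
    failed_request_alerts: int
    response_time_alerts: int
    availability_alerts: int

    def all_clear(self) -> bool:
        return (
            self.failed_request_alerts == 0
            and self.response_time_alerts == 0
            and self.availability_alerts == 0
        )


def next_stage(current_percent: int, signals: HealthSignals) -> int:
    """Return the next canary percentage, or the current one if the gate fails."""
    if not signals.all_clear():
        return current_percent  # gate closed: do not promote
    later = [p for p in STAGES if p > current_percent]
    return later[0] if later else current_percent
```

For example, `next_stage(10, HealthSignals(0, 0, 0))` promotes the release to the 25% stage, while a single alert of any kind keeps it at 10%.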
27. Deployment rings were first discussed in Jez Humble's
book. They support the production-first DevOps
mindset and limit impact on end users, while gradually
deploying and validating changes in production.
Impact (also called blast radius), is evaluated through
observation, testing, analysis of telemetry, and user
feedback.
31. The Facebook case
During the two weeks prior to launch we began what we call a
"dark launch" of all the functionality on the backend. Essentially a
subset of user queries are routed to help us test, by making
"silent" queries to the code that, on launch night, will have to
absorb the traffic. This exposes pain points and areas of our
infrastructure that needs attention prior to the actual launch.
Increasing the demand on one subsystem may generate more
logs than anticipated and overwhelm analysis processes, or
unexpected network bottlenecks may appear.
https://www.facebook.com/note.php?note_id=96390263919
32. What does the code look like?
Execute the new code, but show the user
the result of the old code
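A minimal sketch of that idea, assuming both code paths can be called side by side. The `dark_launch` helper is hypothetical; the point is only that the new path runs and is observed while the user still receives the old result.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("dark_launch")


def dark_launch(old_code: Callable[[], T], new_code: Callable[[], T]) -> T:
    """Run both implementations, but always return the old one's result."""
    old_result = old_code()
    try:
        new_result = new_code()
        if new_result != old_result:
            log.warning("dark launch mismatch: old=%r new=%r", old_result, new_result)
    except Exception:
        # A failure in the new code must never reach the user.
        log.exception("dark launch: new code raised")
    return old_result
```

In this sketch the new code runs synchronously, so it adds latency to the request; a real setup might run it asynchronously or only for a sample of requests.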
Mention that this should perhaps be called “lesser-known scenarios”
Also known as canary deployment
Sometimes it is referred to as a phased rollout or an incremental rollout
Canaries were once regularly used in coal mining as an early warning system. Toxic gases such as carbon monoxide, methane or carbon dioxide in the mine would kill the bird before affecting the miners. Signs of distress from the bird indicated to the miners that conditions were unsafe. The use of miners' canaries in British mines was phased out in 1987.
- Wikipedia
We show the application
We deploy a new canary version… this version has problems: the request for a superhero's details fails, and the listing is very slow.
I show it both by browsing and with curl
There are many options, both open source and paid. That repo is a good compilation.
Datadog, Stackify…
Building on the errors generated in the previous demo (the request for a superhero's details fails, and the listing is very slow)
Show how the errors appear in the requests
Show how the response time increases
We roll back: we stop routing to the canary and within a few seconds everything is back to normal.
Careful: we are running two versions of the software in parallel.
If we have migrated the database schema we may have a problem, although we could always fall back on a backup.
Here we jump to the Zero downtime deployment talk and the schema change…
Consistent experience: take care not to switch the user to a different screen every time they reload (random pattern). Use session affinity at the very least
Monitoring: Insights? Azure Monitor? Others?
There may not be a direct, easy rollback
Let's not make the mistake of replacing good practices just because we have canaries
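One way to keep a user on the same version across reloads, in place of a purely random choice per request, is to hash a stable identifier into a bucket. This is a sketch under assumptions: `user_id` may be any stable key such as a session id, and both function names are hypothetical.

```python
import hashlib


def sticky_bucket(user_id: str) -> int:
    """Map a user (or session) id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route_sticky(user_id: str, canary_percent: int) -> str:
    """The same id always gets the same answer for a given percentage."""
    return "canary" if sticky_bucket(user_id) < canary_percent else "stable"
```

Because the bucket is fixed per user, raising `canary_percent` only moves users from stable to canary and never the other way round.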
If we go deep into user types, we get close to ring deployment.
It seems reasonable not to experiment on paying users, and to do so on those who use the product for free (premium vs freemium)
But what if we are talking about new features? It may be the other way round. The grey areas matter here
Keep in mind that we will not always be able to use every routing type.
In App Service, for example, we are limited to the random pattern
In Istio we could use a header…
With Azure Application Gateway, geographic patterns
We show a pipeline that, besides deploying the canary, has a stage that deploys the new version (the same one as the canary) to production and then routes all users there
It requires a manual approval
Is it that easy? It depends on what you set out to do… adopting ring deployment can be very complex or very simple, depending on how far we want to go
For example:
Canaries: users in my company, the consultants. I control this by the outbound IP of my offices (VPN if you are travelling). They go to a copy of the infrastructure where all feature flags are always on
Partners: companies that are more than customers. I enable things for them earlier through feature flags
Everyone else
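The three rings in this example could be assigned along these lines. It is a sketch only: the office network and the partner tenant names are made-up placeholder values, and `assign_ring` is a hypothetical helper.

```python
import ipaddress

# Hypothetical values, for illustration only.
OFFICE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # outbound IPs of the offices / VPN
PARTNER_TENANTS = {"partner-a", "partner-b"}                # tenants that get features earlier


def assign_ring(client_ip: str, tenant: str) -> str:
    """Return the deployment ring for a request: canary, partners or everyone."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in OFFICE_NETWORKS):
        return "canary"    # company users: infrastructure copy with all feature flags on
    if tenant in PARTNER_TENANTS:
        return "partners"  # features enabled early through feature flags
    return "everyone"
```

The office check comes first, so a company user is in the canary ring whatever tenant they belong to.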
Testing in prod: careful not to cause a mess. Let's not stop using good practices. We could “break things”
The easiest way to break things is to forget that I have two versions running in parallel… which would “break” the non-canary users. Exactly what we don't want
Monitor, get fast feedback, and act accordingly