Stanislav Komanec, VP of Engineering at Kiwi.com, discussed how the company prepared for and responded to a datacenter disaster at OVHcloud. Kiwi.com uses a distributed architecture across multiple datacenters to ensure high availability. When the OVHcloud Strasbourg datacenter caught fire in March 2021, Kiwi.com was able to quickly reroute traffic away from affected servers to avoid disruption. The presentation reviewed Kiwi.com's architecture, best practices for data resiliency, and lessons learned for improving incident response plans.
8. ➔ Our technology unlocks our key features
➔ Best inventory in the world
➔ Great search
➔ Features like multi-city, Nomad, good deals...
Kiwi.com Key Features
9. ➔ Cloud native
◆ Infrastructure as code
➔ Microservices-oriented architecture
➔ 600+ microservices, aligned to specific domains
Kiwi.com under the Hood – Architecture
10. ➔ Main database – Scylla (client connection sketched below)
➔ 400K+ reads/s, 200K+ writes/s; we rewrite the whole DB once every 10 days
➔ Infrastructure
◆ OVH as the main bare-metal provider
◆ Megaport
◆ GCP as the main cloud provider – web services
Kiwi.com under the Hood – Infrastructure
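As a concrete illustration of how a client talks to a multi-datacenter Scylla cluster, here is a minimal sketch using the Python cassandra-driver (Scylla-compatible). The contact points, datacenter name, and keyspace are hypothetical placeholders, not Kiwi.com's real configuration; the point is that a DC-aware policy keeps requests local, and the driver stops routing to nodes that drop out of the cluster, the behavior visible in the monitoring slide later on.

```python
# Minimal sketch: connecting to a multi-DC Scylla cluster with the Python
# cassandra-driver. Hosts, DC name, and keyspace are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    # Prefer replicas in the local DC; token awareness routes each query
    # straight to a node that owns the data.
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc-strasbourg")
    ),
    # With one replica per DC, LOCAL_ONE keeps the request path inside
    # the local datacenter under normal operation.
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)

cluster = Cluster(
    contact_points=["10.0.1.1", "10.0.2.1", "10.0.3.1"],  # one seed per DC
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("bookings")  # hypothetical keyspace
```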
14. ➔ Strasbourg, France
➔ Wednesday, 10 March 2021
➔ Fire breaks out at 00:47 CET
OVHcloud Fire
OVHcloud’s Strasbourg SBG2 Datacenter engulfed in flames. (Image: SDIS du Bas-Rhin)
15. OVHcloud’s Strasbourg SBG2 Datacenter the next morning. (Image via Twitter)
➔ Strasbourg datacenter impact
■ SBG2: totally consumed
■ SBG1: 4 of 12 rooms gutted
■ SBG3 & SBG4: proactively taken offline
➔ Internet impact (as per Netcraft)
■ 3.6 million websites
■ 464,000 domains
■ 1 in 50 of all sites in the .fr TLD
Damage Assessment
16. “Websites that went offline during the fire included online
banks, webmail services, news sites, online shops selling PPE
to protect against coronavirus, and several countries’
government websites.”
— Netcraft
20. Monitoring the Problem
➔ 10 of 30 servers are suddenly unavailable
➔ Latencies briefly rise until the unavailable servers are taken out of the cluster
➔ Requests per second per server: some drop towards zero, then blip out of existence
21. Timeline of the Fire
00:47 CET Fire breaks out in OVHcloud Strasbourg SBG2
01:12 CET Kiwi.com nodes in Strasbourg start falling off the cluster
01:15 CET All 10 Strasbourg nodes offline; traffic diverted to the 2 other Kiwi.com datacenters (20 servers remaining)
02:23 CET Production operational; some services around the main database still need manual tweaking
08:54 CET Tweaks deployed; we are fully operational
22. ➔ Degraded performance on some services
◆ Trying to rebalance load
➔ Moving some affected services to a different location
We were up & running
Kiwi.com Impact
25. What if… Kiwi.com Customer Impact
➔ Customer perspective
■ They could not use the service
■ They could not change bookings
■ We could not process changes in itineraries
● Customers might be at the airports waiting for flights
26. What if… Kiwi.com Technical Impact
➔ Microservices – domino effect
➔ Other teams
■ Issues would cascade; to mitigate them, we would need to stop services in a specific order
➔ Inconsistencies
■ We might end up with a lot of inconsistencies, even for current customers
➔ Customer support overloaded
27. What if… How to Handle the Situation
➔ Stop services, in the right order (see the ordering sketch after this list)
➔ Spin up a new cluster
➔ Let it sync
➔ Run data refreshers
➔ Slowly restart web services for customers
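"Stop services in the right order" is essentially a topological sort over the service dependency graph: consumers go down before the things they depend on. A minimal sketch of that idea, with a hypothetical dependency graph rather than Kiwi.com's real topology:

```python
# Minimal sketch: derive a safe shutdown order from service dependencies.
# The graph below is hypothetical, not Kiwi.com's actual topology.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each service to the services it depends on.
depends_on = {
    "web-frontend": {"booking-api", "search-api"},
    "booking-api": {"main-db"},
    "search-api": {"main-db"},
    "main-db": set(),
}

# static_order() yields dependencies first, i.e. a safe startup order.
startup_order = list(TopologicalSorter(depends_on).static_order())

# Shutdown is the reverse: stop consumers before their dependencies,
# so nothing keeps writing to a database that is already gone.
shutdown_order = list(reversed(startup_order))
print(shutdown_order)  # one valid order: web-frontend first, main-db last
```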
28. What if… Estimation
➔ Revenue loss
➔ Reputation loss
■ Customers would buy elsewhere
➔ Inconsistencies
■ A lot of manual work
30. ➔ Choice of technology
■ High availability architecture
■ Data replication for resiliency
➔ Choice of cloud vendor
■ Geographic distribution of datacenters
■ Capability to manage SLAs
➔ Having procedures in place
➔ Right environment
Long Before the Fire Broke Out
31. ➔ Requirements
■ High resiliency – to provide the best value to customers
■ Low latency – to enable products like Nomad, multi-city search...
➔ History
■ PostgreSQL databases – consistency issues
■ Cassandra – performance issues
What experience did we have?
32. ➔ Peer-to-peer leaderless architecture
■ No single point of failure
➔ User-controllable replication factor (RF)
■ RF=1 in each of our 3 datacentres
➔ Per-operation tunable consistency levels
■ One, Quorum, All, etc. (sketched in CQL below)
➔ Automatic multi-datacenter replication
■ Keeps different sites in sync
➔ Rack-aware and datacenter-aware
■ Ensures replicas are physically and geographically distributed
Scylla’s High Availability Architecture
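In CQL terms, this setup maps to NetworkTopologyStrategy with a replica count per datacenter, plus a consistency level chosen per statement. A minimal sketch, reusing the `session` from the earlier connection example; the keyspace, table, and DC names are hypothetical, not Kiwi.com's actual schema.

```python
# Minimal sketch: multi-DC replication and per-operation consistency.
# Keyspace/table/DC names are hypothetical; `session` comes from the
# earlier connection sketch.
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS bookings
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc-strasbourg': 1,
        'dc-prague': 1,
        'dc-frankfurt': 1
    }
""")

booking_id = "B123"  # hypothetical key

# Consistency is tunable per operation: ONE for cheap, fast reads;
# QUORUM (2 of the 3 replicas here) when stronger guarantees matter.
fast_read = SimpleStatement(
    "SELECT * FROM bookings.itineraries WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
safe_write = SimpleStatement(
    "INSERT INTO bookings.itineraries (id, status) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(safe_write, (booking_id, "confirmed"))
session.execute(fast_read, (booking_id,))
```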
34. Beginnings of a Plan
Goals: 3 datacenters in 3 geographically separated cities
35. ➔ You need to unlock technology advantages through a great team
➔ The best way is to set up culture & procedures
■ This creates the right environment
Team & Process Plan
36. ➔ Proper monitoring in place
➔ Proper alerting (a minimal node-count check is sketched below)
➔ Incident management system
➔ Postmortems
Incident Management
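As one concrete flavor of proper monitoring and alerting, the Python driver exposes live cluster topology, so even a simple watchdog can page when a datacenter starts losing nodes. This is a hedged illustration, not Kiwi.com's actual monitoring stack (that would more likely be Scylla Monitoring with Prometheus and Grafana); `cluster` is the driver object from the earlier connection sketch, and the DC list, threshold, and paging hook are made up.

```python
# Minimal sketch: page the on-call when a DC drops below a node threshold.
from collections import Counter

KNOWN_DCS = ["dc-strasbourg", "dc-prague", "dc-frankfurt"]  # hypothetical
MIN_NODES_PER_DC = 8  # hypothetical threshold for ~10 nodes per DC

def check_datacenters(cluster):
    # Count nodes the driver currently sees as up, per datacenter.
    up_per_dc = Counter(
        host.datacenter for host in cluster.metadata.all_hosts() if host.is_up
    )
    # Iterate over the known DC list: a DC with zero live nodes would
    # otherwise vanish from the counter entirely.
    for dc in KNOWN_DCS:
        if up_per_dc[dc] < MIN_NODES_PER_DC:
            page_oncall(f"{dc} is down to {up_per_dc[dc]} live nodes")

def page_oncall(message):
    # Stand-in for a real alerting integration (PagerDuty, Slack, ...).
    print("ALERT:", message)
```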
37. ➔ Learning from each incident
➔ Making our systems more robust
➔ Building the culture
■ Present your mistakes
■ Wheel of Misfortune exercises
Blameless Culture
42. ➔ Invest where it matters
■ From time to time it's about overscaling for a whole datacenter, not just an instance or two (see the headroom sketch below)
■ Critical path
Where to Start – Overscale?
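The arithmetic behind overscaling for a whole datacenter: if any one of N datacenters can disappear, the survivors must absorb its traffic, so each datacenter can safely run at no more than (N-1)/N of its capacity. A toy calculation with made-up numbers (though the 10-of-30 figure echoes the fire timeline):

```python
# Toy headroom math for surviving the loss of one whole datacenter.
# Numbers are illustrative, not Kiwi.com's real capacity figures.

def max_safe_utilization(num_dcs: int) -> float:
    """Highest per-DC utilization that still survives losing one DC."""
    return (num_dcs - 1) / num_dcs

# With 3 DCs, each may run at most ~66%: if one burns down, the other
# two absorb its third of the traffic and land at ~100%.
print(max_safe_utilization(3))  # 0.666...

# Equivalently: when 10 of 30 servers vanish, the remaining 20 carry
# the full load, so per-server load rises by 30/20 = 1.5x.
total, lost = 30, 10
print(total / (total - lost))  # 1.5
```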
48. ➔ It’s important to build the right environment
➔ Thank you to all members of the team
Get the Greatest Team
49. ➔ Find partners who consider your problems their own
■ OVH
■ GCP
■ Scylla
● Initial setup, great support over the years
■ Megaport
■ Cloudflare…
Get the Great Partners
51. A Good Year (2006)
Uncle Henry:
“It's inevitable to lose now and
again. The trick is not to make a
habit of it.”
Takeaways
52. ➔ Outages are inevitable. It's just up to us to be prepared
➔ Plan for the worst, hope for the best
➔ Get the right balance between proactivity and reactivity
➔ Get a great team & cultivate a blameless culture
■ This drives innovation most effectively
Lessons Learned