"Postmortems" by Yu. Litvinenko

Postmortems
Yuriy Litvinenko

What is a postmortem?
• Post-mortem photography (warning: creepy!)
• An ad hoc retrospective
• An inventory of the rakes we stepped on (mistakes we keep repeating)

Why?
To learn lessons
• Avoid a repeat
• Mitigate the consequences
• Detect the problem faster next time
• Ideally, detect it automatically

When
• After any "incident"
• Our rule: for every Critical or Blocker
• Ideally, right after the fix, while the trail is still hot

Use case
• The application first became very slow and then went down completely.

How to run it: online (live meeting)
• Get together, talk it through, write it down
• Convenient when all participants are in the same office

How to run it: offline (asynchronously)
• Someone gathers the initial information into a single document
• Sends it to everyone involved
• Comments, clarifications, additions
• ???
• PROFIT!

Stages
• Damage assessment, aka impact analysis
• Reconstruction of events, aka reconstruction
• Analysis of causes, aka root cause analysis
• Preventive measures, aka mitigation, aka remediation
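The four stages map naturally onto the sections of a postmortem document. The sketch below is one possible template, not a prescribed format: the class names, field names, and the `missing_sections` helper are all hypothetical, and the draft title is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEvent:
    time: str          # "HH:MM", as written in the reconstruction
    description: str


@dataclass
class Postmortem:
    """Hypothetical template with one field per postmortem stage."""
    title: str
    impact: List[str] = field(default_factory=list)         # impact analysis
    timeline: List[TimelineEvent] = field(default_factory=list)  # reconstruction
    root_causes: List[str] = field(default_factory=list)    # root cause analysis
    action_items: List[str] = field(default_factory=list)   # remediation

    def missing_sections(self) -> List[str]:
        """Return the names of the stages that are still empty."""
        sections = {
            "impact": self.impact,
            "timeline": self.timeline,
            "root_causes": self.root_causes,
            "action_items": self.action_items,
        }
        return [name for name, content in sections.items() if not content]


# A draft that only has the first two stages filled in so far.
draft = Postmortem(
    title="Application slowdown and downtime",  # illustrative title
    impact=["Falling conversion"],
    timeline=[TimelineEvent("14:11", "Support engineer reports a 504 error")],
)
print(draft.missing_sections())  # ['root_causes', 'action_items']
```

A check like `missing_sections` is useful in the offline format, where the document is circulated and filled in gradually.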

Impact analysis
• Expressed in money or in KPIs
• Direct, indirect, or potential
• A rough estimate (at least the order of magnitude)
• Needed to decide how much effort prevention is worth

Examples
• Conversion dropped by 10%
• 50% of users could not log in to their accounts
• The company's reputation suffered
• The development team could not use the repository for a whole day

Use case
• Timeframe:
14:11 – 15:54: slow application
15:54 – 15:59: downtime
• Users who tried to use our site during this window were affected, especially during the downtime
• Reputational damage
• Falling conversion
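The timeframe above can be turned into durations, which is usually the first number an impact estimate needs. A minimal sketch, assuming both windows fall on the same day; the helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day "HH:MM" timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


slow_minutes = minutes_between("14:11", "15:54")      # degraded performance
downtime_minutes = minutes_between("15:54", "15:59")  # full outage
total_minutes = slow_minutes + downtime_minutes

print(slow_minutes, downtime_minutes, total_minutes)  # 103 5 108
```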

Reconstruction
• A list of key events with timestamps
• Shows how quickly we reacted and resolved the issue
• Who was involved
• What was done to fix it

Use Case
14:11 A support engineer reports a Cloudflare 504 error in the engineering channel on Slack.
14:20 A ticket with priority Blocker is created.
14:21 The Infrastructure department is notified about the incident.
14:25 The support engineer demonstrates the issue to the Infrastructure team.
14:28 Infrastructure tries to reproduce the issue but gets a "File not found" error. The engineering channel reports that a deploy was triggered.
14:37 The deploy is finished.

Use Case (contd.)
14:39 After confirming that the deployment is done, the Infrastructure team continues to investigate because the application is still very slow. The slowness is confirmed by the support engineer.
14:45 Infrastructure verifies that the load on the machines is low despite the application being slow, ruling out machine overload as the source of the problem.
14:50 Infrastructure finds hundreds of spurious requests coming from Chinese sites. A firewall rule is added that requires a challenge for traffic from China. The number of spurious requests drops slightly but quickly picks up again from other IP addresses.
15:00 Infrastructure starts adding challenge requests for Chinese IPs on all domains. The engineering team is told not to perform any deploys while the environment is unstable.
15:17 Many slow requests are detected between 2 internal systems.
15:40 The issue is traced to the MongoDB instance, which is responding very slowly.

Use Case (contd.)
15:42 Someone suggests dropping large unused collections, if any, from the MongoDB instance to try to fix the problem.
15:51 2 journaling collections totalling ~50 GB are dropped.
15:57 MongoDB performance is still poor. The application is virtually inaccessible, as reported by Zabbix. The MongoDB instance is restarted.
15:58 The MongoDB instance is back online. Application performance is back to normal.
16:12 The support engineer confirms that the application is working as expected. Issue solved.
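A reconstruction like this makes it easy to measure how quickly the team reacted and resolved the issue. The sketch below picks a few milestones from the timeline and computes the minutes elapsed since the first report. The choice of milestones and their labels is an editorial assumption, and all events are assumed to fall on the same day.

```python
def to_minutes(hhmm: str) -> int:
    """Convert a same-day "HH:MM" timestamp to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Milestones taken from the timeline; the labels are paraphrased.
MILESTONES = [
    ("14:11", "Issue first reported"),
    ("14:20", "Blocker ticket created"),
    ("15:40", "Problem traced to MongoDB"),
    ("15:58", "Performance back to normal"),
    ("16:12", "Fix confirmed by support"),
]

start = to_minutes(MILESTONES[0][0])
elapsed = {label: to_minutes(time) - start for time, label in MILESTONES}

for label, minutes in elapsed.items():
    print(f"{minutes:>4} min  {label}")
```

Here the ticket was opened 9 minutes after the first report, the cause was located after 89 minutes, and service was restored after 107 minutes.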

Root cause analysis
• "Creeping determinism"
• The 5 Whys
• No finger-pointing
• No blame

Root cause analysis
Use case
• The MongoDB instance for Zend was overloaded with I/O operations.
• The instance held around 200 GB of data, but the data files were using around 900 GB.
• MongoDB seems to have intermittent problems, most likely due to fragmentation.
• The I/O wait stayed below the 80% threshold, so the monitoring system did not detect it.
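The gap between data size and file size can be expressed as a ratio, which is easier to track over time than two raw numbers. A small sketch using the approximate figures above; treating the whole gap as overhead is a simplifying assumption, since the slide does not say how much of it is fragmentation.

```python
# Approximate figures from the root cause analysis.
data_gb = 200    # logical data held by the instance
files_gb = 900   # space used by the data files on disk

ratio = files_gb / data_gb             # files are this many times the data size
overhead_gb = files_gb - data_gb       # space not holding live data
overhead_share = overhead_gb / files_gb

print(f"ratio: {ratio:.1f}x")                  # ratio: 4.5x
print(f"overhead: {overhead_gb} GB")           # overhead: 700 GB
print(f"share of files: {overhead_share:.0%}") # share of files: 78%
```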

Preventive measures
• What to do to avoid the problem, mitigate it, or react sooner
• The effort should be proportionate to the damage

Examples
• Add an alert for the specific type of error
• Add automated/unit/integration/etc. tests
• Change something in the process

Use case
• Block suspicious IPs from accessing the application
• Add another node to the MongoDB setup to fix the data files problem on the existing host
• then recycle the data files on the existing host
• Engineering must suspend deploys while production issues are being investigated in a live environment.
• Lower the Zabbix I/O monitoring threshold for MongoDB servers
• Add a monitoring template to Zabbix that checks all important performance indicators of MongoDB instances
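The effect of lowering the I/O threshold can be illustrated with a plain threshold check. This is not Zabbix configuration, only a sketch of the logic. The sample values, the 60% alternative threshold, and the "three consecutive samples" rule are all hypothetical; only the 80% threshold comes from the incident.

```python
from typing import Sequence


def breaches(samples: Sequence[float], threshold: float,
             min_consecutive: int = 3) -> bool:
    """True if at least `min_consecutive` samples in a row are >= threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical I/O wait percentages: high, but never reaching 80%.
iowait = [35.0, 62.0, 71.0, 74.0, 69.0, 40.0]

print(breaches(iowait, threshold=80.0))  # False: the old threshold stays silent
print(breaches(iowait, threshold=60.0))  # True: a lower threshold would alert
```

Requiring several consecutive samples is one way to lower a threshold without alerting on every short spike.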

Where's the catch?
• Hindsight bias is very hard to avoid
• The document gets filled in just to tick a box
• It is perceived as punishment
• It has to be introduced carefully

Links
• https://en.wikipedia.org/wiki/Hindsight_bias
• http://www.startuplessonslearned.com/2008/11/five-whys.html
• How Etsy does it: https://vimeo.com/77206751
• Etsy's tool: https://github.com/etsy/morgue
