Do not panic! (dealing with major incidents)

Index:
- Examples of incidents
- How to be prepared
- How to react

Published in: Internet

  1. Do not panic! Dealing with major incidents. Sergio Arcos Sebastian, 2017-07-06
  2. Challenge failed
  3. Everything started here...
       > SELECT a.id, c.id FROM accounts a JOIN credentials c…
       550 rows
       > SELECT string_agg(c.id::text, ',') FROM accounts a JOIN credentials c…
       1,2,3...
       $ a = Account.where(:id => [1,2,3, ...])
       $ a.count
       350 rows
       $ a.destroy_all
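Slide 3 ends with `destroy_all` running on a set whose size (350) does not match the earlier row count (550). The deck does not explain the mismatch, and the queries are truncated, so the following is only a sketch of one possible safeguard: refuse to delete unless the matched count equals the count the operator expects. `guarded_destroy` and `CountMismatch` are hypothetical names, not Rails APIs, and plain hashes stand in for ActiveRecord objects.

```ruby
# Hypothetical helper, not part of Rails or of the original deck.
# Runs the destructive block only when the number of matched records
# equals the number the operator expects.
class CountMismatch < StandardError; end

def guarded_destroy(records, expected:)
  actual = records.size
  unless actual == expected
    raise CountMismatch, "expected #{expected} records, matched #{actual}; nothing deleted"
  end
  records.each { |record| yield record }
  actual
end

# Stand-in data: 350 matched accounts, while the earlier check reported 550 rows.
accounts = (1..350).map { |id| { id: id } }
deleted = []

begin
  guarded_destroy(accounts, expected: 550) { |account| deleted << account[:id] }
rescue CountMismatch => e
  puts e.message # the mismatch is reported and `deleted` stays empty
end
```

In a Rails console the same idea might wrap `destroy_all` in a transaction and raise before committing; that variant is not shown here because it depends on the application's models.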
  4. (no text on slide)
  5. (no text on slide)
  6. (no text on slide)
  7. (no text on slide)
  8. Challenge considered
  9. Define a Major Incident: "we don't care about toilet paper, as long as there's at least one roll left"
       - Urgency
       - Impact
       - Severity
       - Priority
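Slide 9 names urgency, impact, severity and priority but gives no rule for combining them. One common convention, assumed here and not taken from the deck, is to derive priority from an urgency-by-impact matrix. The level names, the scoring, and the `incident_priority` method below are all illustrative assumptions.

```ruby
# Assumed 3x3 matrix; the deck gives neither thresholds nor level names.
LEVELS = { low: 1, medium: 2, high: 3 }.freeze

# Returns a priority symbol for the given urgency and impact.
# Unknown levels raise KeyError through Hash#fetch.
def incident_priority(urgency:, impact:)
  score = LEVELS.fetch(urgency) * LEVELS.fetch(impact)
  case score
  when 9    then :major  # high urgency and high impact
  when 4..6 then :high
  when 2..3 then :medium
  else           :low
  end
end

puts incident_priority(urgency: :high, impact: :high) # prints "major"
puts incident_priority(urgency: :low, impact: :medium) # prints "medium"
```

Writing the rule down, whatever thresholds a team picks, means nobody has to debate during an outage whether an incident counts as major.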
  10. Alert & monitoring system
  11. Incident notification platform (phone, SMS, push, ...)
  12. Incident repository / Status page (GitHub, New Relic)
  13. Landing page
  14. Minimum contingency plan: "The backup plan costs more than fixing the incident"

       Model          Affected Guests   Business Repercussion   Team Members   ...
       Doorkeeper     All               Critical                1
       AdminPanel     Internal          Low                     1
       Permitted      Partners          High                    1
       Uploads        Paying            High                    2
       Notifications  Free              Low                     1
  15. Follow best code practices
       - Version your endpoints
       - Split your endpoints (add/remove) (micro-service)
       - Apply small changes at once
       - Roll out frequency
       - Idempotency
       - Flag as deleted
       - Be paranoid
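Two of the practices on slide 15, "Flag as deleted" and "Idempotency", can be illustrated together. The sketch below is an in-memory stand-in, not the author's code: `SoftDeletableAccount` is a hypothetical class, and in a Rails application the flag would more likely be a nullable `deleted_at` column.

```ruby
# Hypothetical in-memory model illustrating a soft-delete flag.
class SoftDeletableAccount
  attr_reader :id, :deleted_at

  def initialize(id)
    @id = id
    @deleted_at = nil
  end

  def deleted?
    !@deleted_at.nil?
  end

  # Idempotent: a second call keeps the first timestamp unchanged.
  def soft_delete(now = Time.now)
    @deleted_at ||= now
    self
  end

  # Undo is possible because the record was flagged, not removed.
  def restore
    @deleted_at = nil
    self
  end
end

account = SoftDeletableAccount.new(1)
account.soft_delete.soft_delete # repeating the call changes nothing
puts account.deleted?           # prints "true"
account.restore
puts account.deleted?           # prints "false"
```

With a flag instead of a hard delete, a mistaken bulk operation becomes a matter of clearing a column rather than restoring from backup.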
  16. Follow best infrastructure practices
       - Defense in depth (also known as the Castle Approach)
       - Use canaries (blue/green deployment) & rollback
       - Automatic fallbacks (reboot if it is down)
       - Use API gateways
       - Backups, replication, redundancy, …
       - Dead letter queues
       - Logs (when, where, who, what)
       - Infrastructure as code (even ENV variables!)
       - Disaster-recovery testing (e.g. Chaos Monkey)
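Slide 16 lists dead letter queues without detail. The general idea is that a message which keeps failing is set aside for inspection instead of being lost or blocking the queue. The sketch below shows that idea with plain arrays; `process_with_dead_letters` and `MAX_ATTEMPTS` are hypothetical names, and a real system would normally rely on the broker's own dead-letter feature.

```ruby
# Assumed retry budget; the deck does not specify one.
MAX_ATTEMPTS = 3

# Yields each message to the handler block. A message whose handler
# keeps raising is retried up to max_attempts times, then recorded in
# the returned dead-letter list together with the last error.
def process_with_dead_letters(messages, max_attempts: MAX_ATTEMPTS)
  dead_letters = []
  messages.each do |message|
    attempts = 0
    begin
      attempts += 1
      yield message
    rescue StandardError => e
      retry if attempts < max_attempts
      dead_letters << { message: message, error: e.message, attempts: attempts }
    end
  end
  dead_letters
end

dead = process_with_dead_letters(["ok", "bad"]) do |message|
  raise ArgumentError, "cannot parse #{message}" if message == "bad"
end
puts dead.size # prints "1": only "bad" was set aside, after 3 attempts
```

Keeping the error and the attempt count next to the message supports the "Logs (when, where, who, what)" item on the same slide.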
  17. Challenge accepted
  18. Workflow (template)
       1. Stop!
       2. Delay worse consequences
       3. Communicate to your team
       4. Pair
       5. Write next steps
       6. Log everything
       7. Fix it
       8. Add asserts
  19. Easiest mistakes
       - Do not keep it hidden
       - Do not bypass your CI
       - Do not fix it at any cost
       - Interrupt your boss's meeting if needed
       - Experience makes you feel more comfortable
       - Knowledge makes you fix the issue
       - Your stakeholders should be informed
       - Do not point fingers
  20. Iterate your custom process
       - Do a retrospective with your team
       - Survey your stakeholders
       - Review your statistics to ensure you don't underestimate it
       - Do a post-mortem
       - Create or update your documentation
       - Increase your number of assertions
       - Automate
  21. martes13.net, hjdl.space
