Learning from failures

Published in: Internet

  1. Learning from failures
     Yoshinobu ‘maz’ Matsuzaki <maz@iij.ad.jp>, bdNOG12
  2. Reliability is becoming more important
     • More use of the Internet
     • COVID-19 has been pushing digitalization
     • Bandwidth is key
     • When congestion occurs, the experience gets worse
     • But bandwidth alone is not enough
     • Even if a network is set up wrong, you can still use it somehow
     • Reasonability, stability, and resiliency are the other key
  3. Risk prediction training
     1. Understanding the situation
        • Discuss conceivable hazard scenarios in the given situation
     2. Determining risks
        • Identify the hazards that need to be addressed
     3. Establishing countermeasures
        • Discuss possible measures to address the hazards
     4. Setting goals
        • Select which of the possible measures to implement
  4. Example 1: Routing
     • An ISP assigns a /24 to a customer
     • The ISP sets up a static route for the link
     • The customer sets up a default route to the uplink
     • The customer uses a /28 out of the /24
     [Figure labels: "static route", "static route default"]
  5. Example 1: Risks
     • If a packet is destined for an address in the /24 but outside the /28, the packet will loop
     • If the customer's LAN-side interface is down, all packets destined for the /24 will loop
     • Routing loop!
     [Figure labels: "static route", "static route default", "A packet to:"]
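To get a sense of how much of the assignment is exposed to the loop described in slide 5, the arithmetic can be sketched with Python's standard `ipaddress` module. The slides give only the prefix lengths, so the concrete prefixes below (taken from the 192.0.2.0/24 documentation range) are assumptions for illustration.

```python
import ipaddress

# Hypothetical prefixes: the slides state only the lengths (/24 and /28).
assigned = ipaddress.ip_network("192.0.2.0/24")  # routed by the ISP to the customer
in_use = ipaddress.ip_network("192.0.2.0/28")    # actually configured on the customer LAN

# address_exclude() yields the networks that cover `assigned` minus `in_use`.
unused = sorted(assigned.address_exclude(in_use))
unused_count = sum(n.num_addresses for n in unused)

for n in unused:
    print(n)
print(f"{unused_count} of {assigned.num_addresses} addresses have no specific route")
```

Under these assumptions, every destination in the listed networks matches only the customer's default route and is sent back to the ISP.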
  6. Example 1: Measures
     • Implementing dynamic routing between the ISP and the customer
     • Configuring a static route on the customer's router that directs the same /24 to null
     [Figure labels: "static route", "static route default"]
  7. Example 1: Adopting
     • Configuring a static route on the customer's side router that directs the same /24 to null
     [Figure labels: "static route", "static route default", "static null route"]
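The loop in Example 1 and the effect of the null route can be illustrated with a small forwarding simulation. This is a minimal sketch, not a model of any particular router: the prefixes, the node names (`isp`, `customer`), the next-hop labels (`lan`, `null`, `upstream`), and the hop limit are all assumptions made for the example. Each router does a longest-prefix match, and a packet that exhausts the hop limit is reported as looping, similar to TTL expiry.

```python
import ipaddress

def net(prefix):
    return ipaddress.ip_network(prefix)

def lookup(table, dst):
    """Longest-prefix match: next hop of the most specific matching route."""
    addr = ipaddress.ip_address(dst)
    matches = [(n, hop) for n, hop in table if addr in n]
    if not matches:
        return "unreachable"
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def forward(tables, start, dst, max_hops=8):
    """Follow next hops from `start`; return (outcome, path of routers visited)."""
    node, path = start, [start]
    for _ in range(max_hops):
        hop = lookup(tables[node], dst)
        if hop not in tables:  # terminal: connected LAN, null, or upstream
            return hop, path
        node = hop
        path.append(node)
    return "loop", path  # hop limit exhausted

# Hypothetical routing tables as (prefix, next hop) pairs.
ISP = [(net("192.0.2.0/24"), "customer"), (net("0.0.0.0/0"), "upstream")]
CUSTOMER = [(net("192.0.2.0/28"), "lan"), (net("0.0.0.0/0"), "isp")]
NULL_ROUTE = (net("192.0.2.0/24"), "null")

before = {"isp": ISP, "customer": CUSTOMER}
after = {"isp": ISP, "customer": CUSTOMER + [NULL_ROUTE]}

# Risk 1: destination inside the /24 but outside the /28.
outside_before, _ = forward(before, "isp", "192.0.2.200")
outside_after, _ = forward(after, "isp", "192.0.2.200")

# Risk 2: LAN interface down, so the connected /28 route disappears.
lan_down = [r for r in CUSTOMER if r[1] != "lan"]
down_before, _ = forward({"isp": ISP, "customer": lan_down}, "isp", "192.0.2.5")
down_after, _ = forward({"isp": ISP, "customer": lan_down + [NULL_ROUTE]}, "isp", "192.0.2.5")

print(outside_before, outside_after, down_before, down_after)
```

In this sketch the null route wins over the default route because /24 is more specific than /0, while addresses inside the /28 still reach the LAN because /28 is more specific than the null /24.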
  8. Example 2: Port assignments
     • Removing a cable from port X
     • Just to be safe, make sure the LED is off before pulling it out
     • But can you spot the right port for sure?
  9. [Figure: six port/LED layout examples, numbered 1 to 6, with the captions below]
     • Straightforward
     • Starting from port 0
     • The left LED is for LC status
     • More efficient but confusable
     • A little clearer
     • Port 21 is the SFP now
  10. And more...
     • We may see a different implementation in the future
     • Assumptions are the source of accidents!
     • Different products have different port/LED assignments
     • These caused confusion
  11. The more you know, the more you can see
     • A variety of experience helps us better consider the hazards and identify risks
     • Technical education and proper training are necessary to improve operational skills
     • bdNOG workshops and tutorials are helpful
     • There is always a need for appropriate educational materials
  12. Mistakes!
     • Mistakes can be a very good teaching tool
     • There is a lot to learn from mistakes in the case studies
     • Some cases are special, but many failures are common, and there are lessons to be learned by comparing them to your own situation
     • But as a business, we need to stop repeating failures in our service facilities
     • Repeated failures damage reliability
  13. Build a database of mistakes
     • It can be a great teaching tool for engineers, helping them avoid repeating similar mistakes
     • You may find common and frequent mistakes
     • If you can find the root cause of the failure, you can come up with a more effective solution
  14. Mistake trend analysis
     • Identify the high-impact mistakes
     • Minimize the bad effects
     • Reduce mistakes
     [Figure: chart with axes "effects of mistakes" and "frequency of mistakes"; region labels: "should not happen", "problems", "matters"]
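One way a mistake database could feed the trend analysis in slide 14 is to count how often each kind of mistake occurs and track its worst effect, then review the high-impact kinds first. The sketch below is only an illustration: the `Mistake` record, the 1 to 5 impact scale, and the sample entries are all made up for the example and do not come from the talk.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Mistake:
    category: str  # short label for the kind of mistake
    impact: int    # hypothetical scale: 1 (minor) to 5 (service outage)

# Made-up sample records, for illustration only.
records = [
    Mistake("wrong port unplugged", 4),
    Mistake("routing loop", 3),
    Mistake("wrong port unplugged", 5),
    Mistake("typo in config", 1),
    Mistake("typo in config", 1),
    Mistake("typo in config", 2),
]

def summarize(records):
    """Return (category, frequency, worst impact) rows, highest impact first."""
    freq = Counter(r.category for r in records)
    worst = {}
    for r in records:
        worst[r.category] = max(worst.get(r.category, 0), r.impact)
    rows = [(c, freq[c], worst[c]) for c in freq]
    # Sort by worst impact, then by frequency, both descending.
    return sorted(rows, key=lambda row: (-row[2], -row[1]))

summary = summarize(records)
for category, count, impact in summary:
    print(f"{category}: seen {count} time(s), worst impact {impact}")
```

Sorting by impact first reflects the slide's first bullet (identify the high-impact mistakes); sorting by frequency first would instead surface the most common mistakes.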
  15. Accident investigation committee
     • In some industries, accident investigation committees conduct detailed investigations and compile reports in order to prevent serious accidents from recurring
     • Maybe bdNOG can do this as a community activity
     • For the healthy development of the Internet in Bangladesh
     • Regular reports of accident cases during bdNOG meetings
  16. Summary
     • To have a reliable network, we need to continuously improve our operations
     • The use of failure cases allows for more effective risk analysis and countermeasures
     • For the bdNOG community, I believe the following are worth considering:
     • Collection of failure and mistake cases
     • Trials of accident analysis