The Top Outages of 2022: Analysis and Takeaways
- 2. 2
© 1992–2023 Cisco Systems, Inc. All rights reserved.
Featured Speaker
Mike Hicks
Principal Solutions Analyst
- 3. 3
Before We Begin...
• If you have any questions, please type them in the Questions window.
• If you have any audio problems, please chat us for help.
• A recording of this presentation will be sent to you in a few days.
- 4. 4
Agenda
• About ThousandEyes
• Noteworthy Outages of 2022
• Primer: Digital Service Building Blocks
• Top Ten Outage Countdown
• Lessons & Takeaways
• Q&A
- 5. 5
Actionable Insight for Internet, Cloud, and SaaS
• Correlated Insights: Quickly isolate issues to app, network, or service
• Network Visibility: Overlay, hop-by-hop underlay, ISP performance, and BGP routing
• App Experience: SaaS, API, and internal app performance and user experience
- 6. 6
2022 Noteworthy Outages
Categories: Major, Significant, Shadow
• British Airways (2/25)
• Twitter prefixes hijacked (3/28)
• Atlassian services unavailable (4/5)
• Rogers routing failure (7/8)
• AWS AZ failure (7/28)
• Google (8/9)
• Zoom outage (9/15)
• Zscaler Internet Access failure (10/25)
• WhatsApp outage (10/25)
• AWS packet loss (12/5)
- 7. 7
The Building Blocks of Today’s Digital Services
[Diagram: DNS, BGP, CDN, Cloud, SaaS]
- 9. 9
Many Options, Complex Dependencies
[Diagram: Users, ISP, DNS, BGP, CDN, Security, Your App, Cloud APIs, Cloud IaaS, Data Center]
- 10. 10
Step 1: DNS – Where Are We Going?
[Diagram: Users, ISP, BGP, CDN, Your App; DNS resolution via Root Server, TLD Server, Authoritative Server]
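The delegation chain in this step (root server, then TLD server, then authoritative server) can be sketched as a toy model. This is a minimal offline illustration, not a real resolver: the server names, the zone data, and the address 192.0.2.10 are all made up for the example.

```python
# Toy delegation data. Every name and address here is hypothetical.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}, "answers": {}},
    "tld-com": {"referrals": {"example.com.": "auth-example"}, "answers": {}},
    "auth-example": {
        "referrals": {},
        "answers": {"app.example.com.": "192.0.2.10"},
    },
}


def resolve(name, servers=SERVERS, start="root", max_hops=8):
    """Follow referrals from the root until a server answers.

    Returns (address, path), where address is None if resolution fails
    and path lists the servers that were asked, in order.
    """
    path = []
    server = start
    for _ in range(max_hops):
        path.append(server)
        zone = servers[server]
        if name in zone["answers"]:
            return zone["answers"][name], path
        # Choose the most specific referral whose zone is a suffix of the name.
        matches = [
            z for z in zone["referrals"] if name == z or name.endswith("." + z)
        ]
        if not matches:
            return None, path
        server = zone["referrals"][max(matches, key=len)]
    return None, path


address, path = resolve("app.example.com.")
```

If any link in the chain has no answer or referral for the name, `resolve` returns `None` along with the path walked so far, which mirrors how a failure at one DNS tier makes the application unreachable even when the application itself is healthy.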
- 11. 11
Step 2: How Do We Get There?
[Diagram: Users, BGP, ISP, DNS, CDN, Your App]
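Once the destination address is known, routers forward traffic toward the most specific matching prefix. The sketch below shows only that longest-prefix-match step, using Python's standard `ipaddress` module; real BGP best-path selection also weighs attributes such as local preference and AS path length. The prefixes come from documentation ranges and the next-hop labels are hypothetical.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop label).
ROUTES = [
    ("0.0.0.0/0", "default"),
    ("198.51.100.0/22", "isp-a"),
    ("198.51.100.0/24", "isp-b"),
]


def best_route(destination, routes):
    """Return (prefix, next_hop) for the longest prefix containing destination.

    Returns None when no route covers the destination, which is what remote
    networks see after a provider withdraws its prefixes.
    """
    addr = ipaddress.ip_address(destination)
    candidates = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            candidates.append((network, next_hop))
    if not candidates:
        return None
    network, next_hop = max(candidates, key=lambda c: c[0].prefixlen)
    return str(network), next_hop


specific = best_route("198.51.100.7", ROUTES)
withdrawn = best_route("198.51.100.7", [])
```

Because the most specific prefix wins, an unexpected announcement of a more specific prefix can pull traffic away from its intended path.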
- 12. 12
Step 3: CDNs - Do We Have to Travel So Far?
[Diagram: Users, ISP, DNS, BGP, CDN, Your App]
- 13. 13
Step 4: Rinse and Repeat for Services & API Calls
[Diagram: Your App, SaaS Apps, Cloud APIs, Data Center, Backend Services]
- 15. 15
#10 to #6
Atlassian, Apr 5, 2022
• ~2.5 days
• Service unavailable/data loss
• Due to a maintenance script error, Atlassian services experienced a days-long outage.
Lesson: One cannot rely on status pages alone to communicate about outages. Customers can be left worrying with no answer as to how serious an outage is and when it will be fixed.
Zscaler Internet Access, Oct 25, 2022
• ~30 minutes
• Network traffic loss
• Customers using Zscaler Internet Access (ZIA) experienced connectivity failures or high latency in reaching Zscaler proxies.
Lesson: Having network-agnostic data for complex scenarios like this can enable quicker attribution and remediation.
WhatsApp, Oct 25, 2022
• ~2 hours
• Failure to send/receive messages
• The two-hour outage left WhatsApp users unable to send or receive messages.
Lesson: A thriving SaaS business relies on continuous improvement, which is why an immediate feedback loop, whereby mistakes can be rectified quickly, is necessary.
AWS, Dec 5, 2022
• ~1 hour
• Network traffic/packet loss
• Significant packet loss between 2 global locations and AWS' us-east-2 region.
Lesson: It’s important to monitor not just the applications, but also the cloud infrastructure components and any dependent cloud software services.
Rogers, Jul 8, 2022
• ~24 hours
• App + routing issues
• Rogers withdrew its prefixes due to an internal routing issue, rendering it unreachable across the Internet for nearly 24 hours.
Lesson: No provider is immune to outages. Plan for a backup network provider that can alleviate the length and scope of an outage.
- 16. 16
Zoom, September 15th, 2022
#5
• Service unavailable ~20 minutes
• Users were unable to log in or join meetings
• Most of the HTTP errors seen were 502 Bad Gateway responses, indicative of potential CDN issues
• The service would appear to be available if just testing via IP, but looking at HTTP results/service status tells a different story
Lesson: The app itself, rather than the network, may be causing the issue. Visibility into which layer is at fault can prevent confusion and finger-pointing during root cause analysis.
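The distinction drawn on this slide, reachable at the network layer but failing at the HTTP layer, can be expressed as a small classification rule. This is a minimal sketch under assumed inputs: `network_reachable` and `http_status` are hypothetical fields that a monitoring probe would supply, and the layer labels are illustrative rather than any product's terminology.

```python
def classify(network_reachable, http_status):
    """Attribute a check result to a layer.

    network_reachable: True if the target answered at the IP level.
    http_status: integer status code, or None if no HTTP response arrived.
    """
    if not network_reachable:
        return "network"
    if http_status is None:
        return "application (no HTTP response)"
    if 500 <= http_status <= 599:
        return "application (server error)"
    return "healthy"


# Hypothetical probe results: (network_reachable, http_status).
probes = [(True, 502), (True, 200), (False, None), (True, None)]
results = [classify(reachable, status) for reachable, status in probes]
```

A check that only tests IP reachability would report the first probe as healthy, while the HTTP status shows the service was not usable.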
- 17. 17
British Airways, February 25, 2022
#4
• Service unavailable ~20 minutes
• Outage caused hundreds of flight cancellations and disruptions in the airline's operations
• Network paths to the airline’s online services (and servers) were reachable, but server and site responses were timing out
Lesson: Architecting backends that avoid single points of failure can reduce the likelihood of one fault cascading into a chain of failures.
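The cost of a single point of failure can be made concrete with standard availability arithmetic. The sketch below assumes component failures are independent, which real systems often violate, and the availability figures are invented for illustration rather than taken from this outage.

```python
from math import prod


def series(*availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel(availability, replicas):
    """Availability of N redundant replicas where any one being up suffices."""
    return 1 - (1 - availability) ** replicas


# Hypothetical stack: front end, API tier, and one unreplicated backend.
single_backend = series(0.999, 0.999, 0.99)
# Same stack with the backend duplicated.
redundant_backend = series(0.999, 0.999, parallel(0.99, 2))
```

In a series chain the weakest component dominates, so duplicating just the 0.99 backend raises the overall figure from roughly 0.988 to roughly 0.998 under these assumptions.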
- 18. 18
Google, August 9, 2022
#3
• Service unavailable for ~60 minutes
• Outage affected Google Search and Maps
• During this time, Google web servers responded with HTTP 500 Internal Server Error messages, 502 Bad Gateway errors, and timeouts
Lesson: It is important to monitor not just your application front ends but also the performance-critical dependencies that power your app.
- 19. 19
AWS AZ Failure, July 28th, 2022
#2
• Service unavailable ~20 minutes, ~3 hours for customers to recover
• Caused by an Availability Zone power failure
• Impacted applications such as Webex, Okta, and Splunk
• Affected EC2 instances and EBS volumes as well as traffic routing
Lesson: Architect for redundancy across Availability Zones. Multi-AZ deployments are typically active/active, which removes the need to execute a separate backup plan when one zone fails.
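The active/active idea in this lesson can be illustrated with a toy request router that spreads load over whichever zones are currently healthy. This is a sketch only: the zone names and the health map are hypothetical, and a real deployment would rely on a load balancer's health checks rather than application code like this.

```python
def route_requests(request_ids, zones, healthy):
    """Assign requests round-robin across zones marked healthy.

    zones: ordered list of zone names.
    healthy: dict mapping zone name to True/False; missing zones count as down.
    """
    live = [zone for zone in zones if healthy.get(zone, False)]
    if not live:
        raise RuntimeError("no healthy zone available")
    return {rid: live[i % len(live)] for i, rid in enumerate(request_ids)}


ZONES = ["az-a", "az-b", "az-c"]  # hypothetical zone names

normal = route_requests([1, 2, 3, 4], ZONES, {"az-a": True, "az-b": True, "az-c": True})
az_a_down = route_requests([1, 2, 3, 4], ZONES, {"az-a": False, "az-b": True, "az-c": True})
```

When `az-a` is marked down, requests continue to flow to the remaining zones with no separate failover step, which is the property the lesson describes.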
- 20. 20
Twitter, March 28th, 2022
#1
• Service unavailable ~45 minutes
• Twitter was rendered unreachable for some users when JSC RTComm.RU (AS 8342) announced one of Twitter’s prefixes and subsequently blackholed traffic
• Since Twitter’s service is not located within RTComm’s network, any Twitter traffic routed to RTComm would have failed
Lesson: Though your company might have RPKI implemented to fend off BGP threats, your telco may not. This is something to consider when selecting ISPs.
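RPKI route origin validation compares each announcement against published Route Origin Authorizations (ROAs). The sketch below follows the valid / invalid / not-found outcomes defined in RFC 6811, using Python's standard `ipaddress` module. The prefix and the authorized ASN 64500 are documentation values standing in for the real ones, which this deck does not give; AS 8342 is the announcing network named on the slide.

```python
import ipaddress

# Hypothetical ROA set: (prefix, max_length, authorized_origin_asn).
ROAS = [("203.0.113.0/24", 24, 64500)]


def validate_origin(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # Skip ROAs of the other IP version or that do not cover the route.
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        if asn == origin_asn and announced.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


legitimate = validate_origin("203.0.113.0/24", 64500, ROAS)
hijack = validate_origin("203.0.113.0/24", 8342, ROAS)
```

The result only protects users whose upstream networks actually drop announcements classified as invalid, which is why the lesson points to ISP selection.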
- 21. 21
Lessons and Takeaways
• BGP powers the Internet, but it can also be misused and abused. Visibility and planning are needed to protect your network.
• Public cloud is ubiquitous and reliable, but ensure that you are monitoring all cloud dependencies.
• Avoid single points of failure. Your apps are only as resilient as your architecture.
• Security is essential, but it can add great complexity that requires continuous end-to-end visibility.
• Whenever the infrastructure is touched, failures can occur. Visibility is critical before and after each network change to avoid impacts.
- 22. 22
Next Steps
• Subscribe! https://blog.thousandeyes.com
• Get a real-time view of the health of the Internet: https://thousandeyes.com/outages
• Sign up for a Free Trial: https://www.thousandeyes.com/signup
• Request a demo: https://www.thousandeyes.com/request-demo
Copyright ©2023 ThousandEyes