How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it most? LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from all four data centers at the same time has moved the company from a disaster-recovery model to a disaster-avoidance model, in which an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact on users.
As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services handling extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn't sufficient to provide confidence in a data center's capacity. To solve this problem, LinkedIn stresses its services site-wide with live traffic, shifting traffic between data centers to simulate a disaster every business day!
4. Michael Kehoe
/USR/BIN/WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland
5. Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Disaster Recovery Planning and Automation
• Incident Response and Automation
• Visibility Engineering
• Reliability Principles
6. LinkedIn
EVOLUTION OF THE INFRASTRUCTURE
Timeline (2003, 2010, 2011, 2013, 2014, 2015): Active & Passive → Active & Active → Multi-colo 3-way Active & Active → Multi-colo n-way Active & Active
9. What is Resilience Engineering?
• Projects that directly demand increased resilience from our applications and infrastructure
• Application Failure Injection (see the sketch after this list)
• Infrastructure Failure Injection
• Full Disaster-Recovery Tests
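Application-level failure injection can be as simple as wrapping downstream calls so that a configurable fraction of them fail or slow down. Below is a minimal, illustrative Python sketch of that idea; it is not LinkedIn's actual tooling, and the rates, the decorator, and fetch_profile are all hypothetical stand-ins.

# Minimal failure-injection sketch (illustrative only, not LinkedIn tooling).
# FAILURE_RATE, LATENCY_RATE, MAX_DELAY_S, and fetch_profile are hypothetical.
import random
import time

FAILURE_RATE = 0.05  # fail 5% of calls outright
LATENCY_RATE = 0.05  # slow another 5% of calls
MAX_DELAY_S = 2.0    # cap on injected latency, in seconds

class InjectedFailure(Exception):
    """Raised instead of a real downstream error during an injection test."""

def inject_failures(call):
    """Decorator: make a fraction of calls fail or respond slowly."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < FAILURE_RATE:
            raise InjectedFailure(f"injected failure calling {call.__name__}")
        if roll < FAILURE_RATE + LATENCY_RATE:
            time.sleep(random.uniform(0.0, MAX_DELAY_S))  # injected latency
        return call(*args, **kwargs)
    return wrapper

@inject_failures
def fetch_profile(member_id):
    # Stand-in for a real downstream service call.
    return {"member": member_id}

Running normal traffic through a wrapper like this shows whether callers degrade gracefully (timeouts, retries, fallbacks) long before a real outage does.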
11. How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it most?
12. Problem Statement
• How do we ensure that we always have disaster recovery ability without incident?
• How do we consistently test for disaster recovery ability without disrupting the company?
14. Project Overview
• Build a process (with automation) to facilitate disaster recovery
• Operate the process on a regular cadence (sketched below)
• Provide reporting on test outcomes to engineering executives
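As a rough illustration of the "regular cadence plus reporting" idea, the sketch below runs a drill on business days and emits a report. This is an assumed design, not LinkedIn's tool; run_drill and report_results are hypothetical placeholders.

# Rough sketch of a recurring DR drill runner (assumed design).
import datetime
import time

def run_drill() -> dict:
    """Placeholder for executing one automated traffic-shift drill."""
    return {"started": datetime.datetime.now().isoformat(), "status": "ok"}

def report_results(outcome: dict) -> None:
    """Placeholder for the report shared with engineering executives."""
    print(f"DR drill report: {outcome}")

def main() -> None:
    while True:
        now = datetime.datetime.now()
        # Run only on business days (Mon-Fri), once per day at 10:00 local.
        if now.weekday() < 5 and now.hour == 10:
            report_results(run_drill())
            time.sleep(3600)  # don't re-trigger within the same hour
        time.sleep(60)

if __name__ == "__main__":
    main()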
25. Benefits of Load-testing
CAPACITY PLANNING
• Through this process, we continuously validate our infrastructure capacity
• This is the best signal we can possibly get, since we're simulating a real disaster
26. Benefits of Load-testing
IDENTIFY BUGS
• Some bugs are only found at high load (under duress)
• Helps find inefficiencies that otherwise may not be found until it's too late
• Gives us clues on how to make our code more resilient to potential failure
27. Benefits of Load-testing
CONFIDENCE
• Through load-testing, we've built confidence in our disaster recovery strategy
• We understand exactly:
  • What process to follow
  • How long it takes to avert disaster
  • What risks are associated with a disaster incident
29. Key Takeaways
• Resilience Engineering is a must for LinkedIn
• Design infrastructure to facilitate disaster recovery
• Disaster-test regularly to avoid surprises
• Automate your testing/process to reduce engagement time
Anil
TrafficShift is a two-part application: a web application provides an easy way for engineers to create planned and emergency offline plans.
We leverage Couchbase as our key/value persistence store.
Python backend worker processes talk to the Salt Master via the Salt API and instruct the sticky-routing service to turn buckets online and offline.
We leverage this toolset to run load tests, or stress tests, of our data centers.
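To make that flow concrete, here is a heavily hedged sketch of one worker action. The plan schema, the Salt API URL and credentials, the drain state, and the sticky-routing endpoint are all hypothetical stand-ins; only the overall shape (read a stored plan, call the Salt API, flip buckets offline) comes from the description above.

# Hedged sketch of the worker flow described above; every identifier below
# is an assumption, not LinkedIn's real interface.
import requests

SALT_API = "https://salt-master.example.com:8000"      # hypothetical salt-api
STICKY_ROUTING = "https://stickyrouting.example.com"   # hypothetical service

def load_plan(plan_id: str) -> dict:
    # In the real tool the plan is read from Couchbase; a literal stands in.
    return {"target_hosts": "edge-us-west-*", "buckets": ["b-001", "b-002"]}

def execute_offline_plan(plan_id: str) -> None:
    plan = load_plan(plan_id)
    # salt-api's /run endpoint accepts a "lowstate" payload like this one.
    resp = requests.post(
        f"{SALT_API}/run",
        json=[{
            "client": "local",
            "tgt": plan["target_hosts"],
            "fun": "state.apply",
            "arg": ["trafficshift.drain"],   # assumed drain state
            "username": "trafficshift",      # assumed eauth credentials
            "password": "********",
            "eauth": "pam",
        }],
    )
    resp.raise_for_status()
    # Tell the sticky-routing service the buckets are now offline.
    for bucket in plan["buckets"]:
        requests.post(f"{STICKY_ROUTING}/buckets/{bucket}/offline").raise_for_status()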
Whew, that was a lot of talk about mitigating issues by doing a traffic shift. But if you observe closely, we are already migrating live traffic across data centers, so why not leverage the same mechanism to stress test a data center? How awesome is that? Not stress testing a single service, but stressing the whole system. I am going to talk about load testing next.
Anil
As you can see, by turning a precise number of buckets offline in US-West and US-East, we can reroute that extra traffic to the target data center.
We do this in a controlled manner, in steps, until the threshold level of 50% is reached. If for any reason an alert fires during the stress test, our TrafficShift tool acknowledges it, automatically rebalances the site traffic, and sends out the stress-test report to SREs.
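That stepped ramp with automatic rollback fits in a few lines. The sketch below captures the behaviour described here (shift traffic in steps, stop at 50%, abort and rebalance on any alert); the step size, soak time, starting share, and helper functions are assumptions, not LinkedIn's actual implementation.

# Sketch of a stepped load test with automatic abort (assumed design).
# STEP, SOAK_SECONDS, and the helpers are hypothetical; the 50% target and
# the abort-on-alert behaviour come from the talk.
import time

TARGET_SHARE = 0.50   # stop once the target DC serves 50% of site traffic
STEP = 0.05           # fraction of site traffic shifted per step
SOAK_SECONDS = 600    # observation window after each step

def shift_buckets(to_dc: str, fraction: float) -> None:
    """Placeholder: take enough buckets offline elsewhere to move traffic."""

def alerts_firing(dc: str) -> bool:
    """Placeholder: query the alerting system for the target DC."""
    return False

def rebalance_site() -> None:
    """Placeholder: restore the normal traffic distribution."""

def run_stress_test(target_dc: str, starting_share: float = 0.33) -> str:
    share = starting_share
    while share < TARGET_SHARE:
        shift_buckets(target_dc, STEP)
        share += STEP
        time.sleep(SOAK_SECONDS)          # let the new traffic level soak
        if alerts_firing(target_dc):
            rebalance_site()              # automatic abort and rollback
            return "aborted: alert fired, traffic rebalanced, report sent"
    return "completed: target DC held 50% of site traffic"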