2. What is Site Reliability Engineering (SRE)
Site reliability engineering (SRE) is a set of principles and practices that
applies aspects of software engineering to
IT infrastructure and operations. The main objective is to create highly
reliable and scalable software systems. Site reliability engineering has been
described as a specific implementation of DevOps.
As a job role, site reliability engineering may be performed by solo practitioners or
by teams, which are usually responsible for a combination of the following
within a broader engineering organization:
system availability, latency, performance, efficiency, change
management, monitoring, emergency response, and capacity planning. Site
reliability engineers often have backgrounds in software engineering, systems
engineering, or system administration. Areas of focus in site reliability engineering
include automation, system design, and improvements to system resilience.
3. What is Site Reliability Engineering (SRE)
Site reliability engineering (SRE) is a software engineering approach to IT
operations. SRE teams use software as a tool to manage systems, solve
problems, and automate operations tasks.
SRE takes the tasks that have historically been done by operations teams, often
manually, and instead gives them to engineers or operations teams who use
software and automation to solve problems and manage production systems.
SRE is a valuable practice when creating scalable and highly reliable software
systems. It helps manage large systems through code, which is more scalable
and sustainable for system administrators (sysadmins) managing thousands or
hundreds of thousands of machines.
SRE helps teams find a balance between releasing new features and ensuring
reliability for users.
SRE supports teams that are moving their IT operations from a traditional
approach to a cloud-native approach.
4. DevOps vs. SRE
DevOps is an approach to culture, automation, and platform design
intended to deliver increased business value and responsiveness
through rapid, high-quality service delivery. SRE can be considered an
implementation of DevOps.
However, SRE differs from DevOps because it relies on site reliability
engineers within the development team who also have an operations
background to remove communication and workflow problems.
The site reliability engineer role itself combines the skills of
development teams and operations teams by requiring an overlap in
responsibilities.
SRE can help DevOps teams whose developers are overwhelmed by
operations tasks and need someone with more specialized operations
skills.
When coding and building new features, DevOps focuses on moving
5.
6. Site Reliability Engineering (SRE) Tools
The SRE approach is not set in stone. The right tools depend on
the organization and the technology it works with, so adopt
the tools that fit. Commonly used tools include:
•Jenkins
•CircleCI
•JIRA
•Git
•Terraform
•Ansible
•Grafana
•Kibana
7. Benefits of SRE
SRE is almost obsessively focused on reliability; it’s in the name.
This focus on reliability across the implementation means that
operational expenses are minimized, points of failure are reduced and
mitigated, and repetitive tasks that waste time and resources are
automated. Together, these result in significant cost savings.
•Higher levels of application reliability and resiliency
•Increased efficiency through automation
•Improved customer satisfaction and retention
•Driving a culture of continuous improvement
•Well-managed on-call and emergency support
•Software with good logging and diagnostics
8. Drawbacks of SRE
There are some drawbacks to the SRE approach, however. Perhaps
the largest one is that it’s still a relatively unproven concept. DevOps,
by contrast, is a well-tested, battle-hardened option that is as common
as it is understood. SRE, on the other hand, is still relatively recent
and has a lower adoption rate. As such, it’s not as proven, and fixes to
the multiple potential cracks may not be obvious.
SRE also has a weakness in its requirement for strong and directive
management. Because SRE walks a very thin line in terms of business
logic and implementation, it’s very easy for an SRE team to “fall off the
track,” so to speak. The only fix for this is a stronger management body,
which can result in micromanagement and loss of efficiency.
9. SRE Best Practices
•Ensuring reliability - getting systems back to steady-state as quickly
as possible
•Eliminating toil - automating wherever possible
•Blameless postmortems - driving better cross-team collaboration
•Observing what matters - gaining full visibility into system health
•Being proactive - living and breathing SLOs to identify and
remediate issues before SLAs are violated
•Architecting for resiliency - informing architectural design decisions to
build more reliable systems
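The proactive practice above, acting on SLOs before SLAs are violated, can be sketched as a simple check. This is a minimal illustration rather than a prescribed implementation: the function name `reliability_status`, the 99.9% SLO, and the 99.5% SLA are hypothetical values chosen for the example.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA.

    All three values are availability fractions (e.g. 0.999 for 99.9%).
    The SLO is assumed to be at least as strict as the SLA, so breaching
    the SLO is an early warning that leaves room to act before the SLA
    is at risk.
    """
    if not sla <= slo <= 1.0:
        raise ValueError("expected sla <= slo <= 1.0")
    if sli < sla:
        return "sla_violated"
    if sli < slo:
        return "slo_breached"
    return "ok"


# Hypothetical targets: 99.9% internal SLO, 99.5% contractual SLA.
SLO, SLA = 0.999, 0.995

print(reliability_status(0.9995, SLO, SLA))  # ok
print(reliability_status(0.9970, SLO, SLA))  # slo_breached
print(reliability_status(0.9900, SLO, SLA))  # sla_violated
```

The middle case is the one that matters for proactive work: the SLO has been missed, but there is still headroom before the SLA is violated.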
10. What does a site reliability engineer do?
Site reliability engineer is a unique role. It suits a sysadmin
with software development skills, a software developer with additional
operations experience, or someone in an IT operations role who also
has software development skills.
SRE teams are responsible for how code is deployed, configured, and
monitored, as well as the availability, latency, change management,
emergency response, and capacity management of services in
production.
SRE teams decide when new features can launch by using
service-level agreements (SLAs), which define the required reliability of the
system through service-level indicators (SLIs) and service-level
objectives (SLOs).
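One common way to connect SLOs to launch decisions is an error budget: the amount of unreliability the SLO permits over a measurement window. The slides do not name this mechanism, so the sketch below is an assumption about how such a gate might look. The names `error_budget_remaining` and `can_launch`, and the request counts, are hypothetical.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    The SLI is taken to be the ratio of successful requests; the SLO is the
    target for that ratio, so the budget is the number of failures it allows.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_launch(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Allow a feature launch only while some error budget remains."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0.0


# Hypothetical window: 1,000,000 requests against a 99.9% SLO,
# which allows roughly 1,000 failed requests.
print(can_launch(0.999, 1_000_000, 400))    # budget left, launch allowed
print(can_launch(0.999, 1_000_000, 1_200))  # budget overspent, launch held
```

The design choice here is that reliability and feature velocity share one number: while budget remains, launches proceed, and once it is spent, the team shifts to reliability work.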
11. Key Site Reliability Engineering Skills
The skills required differ from organization to organization, depending
largely on the type of application an organization
runs and on how and where it is deployed and monitored. Beyond
that, SREs need a strong focus on application
monitoring and diagnostics.
•Knowledge of version control.
•Knowledge of Linux (preferred).
•A preference for automation over manual work.
•Knowledge of CI/CD.
•Troubleshooting ability.
12. Summary
Site reliability engineers split their time between operations tasks and
project work. According to SRE best practices from Google, site
reliability engineers should spend at most 50% of their time
on operations, and that time should be monitored to ensure they don’t go
over.
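The 50% cap lends itself to a simple periodic check. The sketch below assumes time is logged in hours for operations work and project work; the names `operations_share` and `over_cap` and the sample figures are hypothetical.

```python
OPERATIONS_CAP = 0.5  # maximum share of time on operations work (the 50% guideline)


def operations_share(operations_hours: float, project_hours: float) -> float:
    """Return the fraction of logged time that went to operations work."""
    total = operations_hours + project_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return operations_hours / total


def over_cap(operations_hours: float, project_hours: float,
             cap: float = OPERATIONS_CAP) -> bool:
    """Return True when operations work exceeds the cap."""
    return operations_share(operations_hours, project_hours) > cap


# Hypothetical week: 26 hours of operations work, 14 hours of project work.
print(over_cap(26, 14))  # True: 65% of the week went to operations
print(over_cap(18, 22))  # False: 45% of the week went to operations
```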
The SRE team is responsible for availability, latency, performance, efficiency,
change management, monitoring, emergency response, and capacity
planning.
SRE is fundamentally doing work that has historically been done by
an operations team, but using engineers with software expertise and
banking on the fact that these engineers are inherently both
predisposed to, and have the ability to, substitute automation for