Site Reliability Engineering (SRE) is a practice that combines software development skills and IT operations into a single job function. It uses automation and continuous integration and delivery (CI/CD) to improve the reliability of highly dynamic systems. The concept originated at Google in the early 2000s and was documented in a book of the same name, Site Reliability Engineering (a must-read). SRE shares many governing concepts with DevOps: both domains rely on a culture of sharing, metrics, and automation. SRE can be thought of as an extreme implementation of DevOps. The SRE role is common in cloud-first enterprises and is gaining momentum in traditional IT teams. Part systems administrator, part second-tier support, and part developer, SREs need to be inquisitive by nature: always acquiring new skills, asking questions, and solving problems by embracing new tools and automation.
An SRE contributes to a business by automating tasks to eliminate unnecessary work and roles, and by helping to reduce overall cost through optimizing resources and shortening mean time to repair (MTTR). They live by the following two mottos:
- Let us stop doing the machines’ work for them.
- Let us stop feeding the machines with human blood.
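As a rough illustration of the MTTR metric mentioned above, the sketch below averages the time between detection and resolution over a set of incidents. The function name `mean_time_to_repair` and the incident timestamps are hypothetical, chosen only for this example; a real team would pull these values from its incident tracker.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected, resolved) timestamp pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (time detected, time resolved).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45)),      # 45 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 16, 0)),   # 120 minutes
    (datetime(2024, 3, 2, 22, 30), datetime(2024, 3, 2, 22, 45)),   # 15 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Tracking this number over time is one simple way to see whether automation is actually shortening repairs.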
From the Google Site Reliability Engineering book, the key areas of SRE focus are:
- Reliability—Maintaining a high level of network and application availability
- Monitoring—Implementing performance metrics and establishing benchmarks in order to monitor the systems.
- Alerting—Readily identifying any issues and ensuring that there is a closed-loop support process in place to solve them.
- Infrastructure—Understanding cloud and physical infrastructure scalability and limitations.
- Application Engineering—Understanding all application requirements including testing and readiness needs.
- Debugging—Understanding the systems, log files, code, use cases, and troubleshooting techniques well enough to debug as needed.
- Security—Understanding common security issues, as well as tracking and addressing vulnerabilities, to ensure the systems are properly secured.
- Best Practices Documentation—Prescribing solutions, production support playbooks, etc.
- Best Practice Training—Promoting and evangelizing SRE best practices through production readiness reviews, blameless postmortems, technical talks, and tooling.
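To make the reliability, monitoring, and alerting areas above a little more concrete, here is a minimal sketch of comparing measured metrics against benchmarks and raising alerts on breaches. Everything in it is an assumption made for illustration: the `Benchmark` class, the metric names, the thresholds, and the sample measurements do not come from the Google book or from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    """A target value for one metric (hypothetical structure)."""
    name: str
    threshold: float
    higher_is_better: bool


def availability(successes, total):
    """Fraction of requests (or health checks) that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def breached(benchmark, value):
    """True when the measured value is on the wrong side of the benchmark."""
    if benchmark.higher_is_better:
        return value < benchmark.threshold
    return value > benchmark.threshold


def evaluate(benchmarks, measurements):
    """Return one alert message per breached benchmark."""
    alerts = []
    for b in benchmarks:
        value = measurements[b.name]
        if breached(b, value):
            alerts.append(
                f"ALERT {b.name}: measured {value:.4f}, benchmark {b.threshold:.4f}"
            )
    return alerts


# Assumed benchmarks and sample measurements, for illustration only.
benchmarks = [
    Benchmark("availability", 0.999, higher_is_better=True),
    Benchmark("p95_latency_seconds", 0.5, higher_is_better=False),
]
measurements = {
    "availability": availability(99_850, 100_000),
    "p95_latency_seconds": 0.32,
}

alerts = evaluate(benchmarks, measurements)
for alert in alerts:
    print(alert)  # ALERT availability: measured 0.9985, benchmark 0.9990
```

In practice each alert would feed the closed-loop support process, for example by opening a ticket that is only closed once the metric is back within its benchmark.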
Other practices overlap with the SRE's role, such as DevOps, IT Service Management (ITSM), the Agile Software Development Life Cycle (SDLC), and other organizational frameworks. SRE and DevOps teams are complementary: monitoring solutions that address the needs of both help information flow across teams, so that collaborative troubleshooting leads quickly to problem resolution.
Author Bio: Joe Goldberg is the Senior Cloud Program Manager at CCSI. Over the past 15+ years, Joe has helped companies design, build out, and optimize their network and data center infrastructure. His work has delivered major gains in ROI through virtualization, WAN implementation, core network redesigns, and the adoption of cloud services. Joe is also ITIL certified.