Running operations at scale has always been a fruit salad of uphill battles, especially in our expanding world of distributed teams across many geographic locations. We as an IT community are routinely reminded that if we continue to treat IT operations as a cost center instead of the backbone of innovation, we will not only be slow to improve but also miss out on transformative opportunities in the marketplace. Along come site reliability engineering (SRE) and DevOps.
Site Reliability Engineering 101
Pioneered by Google, site reliability engineering (SRE) is defined as the following by its founders:
“SRE is what happens when you ask a software engineer to design an operations team.”
SRE doubles as both a job role and an ongoing practice that fights operational issues with well-thought-out software paradigms and cutting-edge software tools.
7 SRE Methods You Can Implement
- Identify Operations as a Software Problem
“Doing operations well is a software problem,” according to Google’s The Site Reliability Workbook. For businesses to keep pace with the market, they must apply software approaches and paradigms to achieve operational efficiency.
- Manage by Service Level Objectives
No service is 100% available. Chasing that target is impossible in practice, stalls innovation and puts great strain on everyone involved. Instead, business stakeholders should agree on acceptable service level objectives (SLOs) based on user happiness and business objectives.
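To make the SLO idea concrete, here is a minimal sketch of the arithmetic behind an availability target. It uses the "error budget" concept from the SRE literature (the downtime an SLO allows), which this article does not otherwise cover. The function names, the 30-day window and the 99.9% target are illustrative assumptions, not recommendations.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO allows.

    slo_target is a fraction such as 0.999 (99.9%). A target of 1.0 is
    rejected because it leaves no room for any failure at all.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Return the unused error budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Hypothetical service: 99.9% target over 30 days, 30 minutes of downtime so far.
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes allowed")
    print(f"{budget_remaining(0.999, 30, 30.0):.1f} minutes left")
```

A team might use a number like this to decide whether it can afford another risky release in the current window.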
- Systematically Minimize Toil
“We believe that if a machine can perform the desired operation, then a machine often should,” again, according to The Site Reliability Workbook. Engineers are worth their weight in gold and should spend their time on the sophisticated, high-value tasks that push the business forward.
- Automate This Year's Job Away
At Google, the pioneer of SRE, and on other pure SRE teams, there is a cap on how much time an engineer can spend on toil. Instead of doing the mundane over and over, which adds little value to the business and limits scale, site reliability engineers (SREs) look to automate away all routine tasks within a service so they can move on to bigger and better things for the business they serve.
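One way to keep a toil cap honest is to measure it. The sketch below assumes a team logs hours by work category. The category names and the default cap of 0.5 are hypothetical placeholders; pick whatever cap your team agrees on.

```python
def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Return the share of logged hours spent on categories counted as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


def over_toil_cap(hours_by_category: dict[str, float],
                  toil_categories: set[str],
                  cap: float = 0.5) -> bool:
    """Return True when toil exceeds the agreed cap (0.5 here is an assumed value)."""
    return toil_fraction(hours_by_category, toil_categories) > cap


if __name__ == "__main__":
    # Hypothetical week for one engineer.
    week = {"tickets": 12.0, "manual_deploys": 10.0, "automation": 18.0}
    toil = {"tickets", "manual_deploys"}
    print(f"toil share: {toil_fraction(week, toil):.2f}")
    print("over cap" if over_toil_cap(week, toil) else "within cap")
```

When the check comes back over the cap, that is a signal to prioritize automating the biggest toil category next.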
- Move Fast, But Contain the Blast Radius
SRE also looks to improve product output. It does so by reducing the mean time to recovery (MTTR), an industry metric for how quickly we bounce back from service disruptions. Automated discovery and remediation of issues are the lifeblood of an organization’s ability to move faster.
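Mean time to recovery is simple to compute once incidents are recorded with start and recovery times. This is a minimal sketch; the function name and the sample incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average the (started, recovered) durations across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents lasting 10, 20 and 30 minutes.
    incidents = [
        (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 10)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 20)),
        (datetime(2024, 1, 21, 2, 0), datetime(2024, 1, 21, 2, 30)),
    ]
    print(mean_time_to_recovery(incidents))
```

Tracking this figure over time shows whether automated detection and remediation are actually shortening recoveries.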
- Share the Burden with Developers
SREs help take the burden off the development team by focusing primarily on availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning for services. The SRE team should have visibility into all components in order to collaborate and, more importantly, share responsibility with the product development team.
- Share Tools
Unify the codebase, and the team will follow. Shared tooling is some of the lowest-hanging fruit for collaboration. A mixed bag of tools leads to operational divergence and makes improving that tooling costly and difficult, because every extra tool requires more specialists.
Although SRE combines a multifaceted role with a set of practices, it also usually results in a cultural shift similar to that of its complementary companion, DevOps.
DevOps 101
In a nutshell, DevOps is the merger of methods, cultural doctrine and tools that increases an organization’s capability to deliver applications and services at both velocity and scale. This cultural evolution lets organizations improve products at a breakneck pace compared to those relying on traditional software development and IT management processes. That speed permits organizations to surpass customer expectations and compete more effectively in the marketplace.
5 DevOps Methods You Can Implement
- Do Away with Silos
Operations and developers used to be a house divided, similar to what you see in Game of Thrones. This casserole of knowledge isolation, one-sided optimizations and overall lack of organizational collaboration leads to toxic environments and limits business innovation.
- Normalize Accidents
Whether it is an end user intentionally using a service in an unexpected way or an accidental fat-fingered command, things will inevitably break. Our job is not to guard against failure so tightly that we can’t push the business forward. Our job is to make sure recovery is so quick that it goes unnoticed or is, at worst, a minor inconvenience.
- Piecemeal Changes
Operations should look to make small but frequent changes. It’s better to ship consistent, low-risk changes than gigantic updates that are complicated, multifaceted and expensive to roll back.
- Intertwine Tooling and Culture
Our entire change management process lives and breathes based on our sophisticated tooling stacks. However, DevOps is a cultural shift and not just the tools we use day to day. As Google’s The Site Reliability Workbook puts it, “A good culture can work around broken tooling, but the opposite rarely holds true.”
- Measure
If we don’t measure our progress, then we won’t be able to improve. Reality requires data. Objective management guided by data eliminates ego and pushes the organization forward.
Operation Super Fruits: SRE vs. DevOps
If we think of SRE and DevOps as super fruits, there are similarities and differences. Like the saying goes: You can’t compare apples and oranges. While SRE and DevOps are inherently different, they are both good for us.
SRE and DevOps agree on the following:
- Change is unavoidable for improvement.
- Cross-organizational collaboration is key to marketplace success.
- Change management should leverage automation for testing and deployment with a guideline of deploying small batches of value continuously.
- Measuring your success at all layers is key.
- Postmortems should be blameless to encourage genuine team effort and fix root issues.
- Both methodologies are holistic in nature. Anything less than 100% adoption won’t work.
SRE and DevOps differ on the following:
- DevOps has broad, impactful principles that apply to the entire business but don’t dictate specifically how you run operations of services. SRE, on the other hand, is very regimented and user-oriented, which makes it very opinionated about how we run operations and services in favor of the end customer.
- SRE is meant to be an actual role, whereas DevOps is a cultural phenomenon and philosophy adopted by the broader team or entire organization.
Better Consumed Together: Where Do You Go from Here?
My hope is that this not only reinvigorated your love for DevOps but highlighted the sheer greatness of site reliability engineering. If you’re interested in diving deeper into DevOps, I recommend The DevOps Handbook by Gene Kim. If you want to learn more about site reliability engineering, I recommend Google’s The Site Reliability Workbook – which is the main source of this article.
Joshua “TechDev” Walker is a certified Cloud Engineering Subject Matter Expert, Author of “Venti Fried Chicken” and VP of Black Orlando Tech.