Measuring What Matters: The Benefits of DevOps Metrics
How we define DevOps significantly impacts the value it can bring to an organization. Is it just the combination of development and operations? Does it include other areas like security, architecture, etc.? Is it a methodology? So what is DevOps … a cultural shift … the adoption of new principles and practices to overcome friction in an organization to improve value delivery to the business.
DevOps is everything you do to overcome the friction between silos. All the rest is plain engineering.
— Patrick Debois, a.k.a. “Godfather of DevOps”
A traditional view of the DevOps process is illustrated as an infinity loop showing each accepted phase. Each phase represents an area of opportunity to improve the software development lifecycle and execution speed of an organization.
DevOps complements agile methodologies and builds upon foundational industry work like the Theory of Constraints that Dr. Eliyahu Goldratt formalized. The Theory of Constraints (TOC) is broken down into three major areas:
- The Five Focusing Steps: a methodology for identifying and eliminating constraints
- The Thinking Processes: tools for analyzing and resolving problems
- Throughput Accounting: a method for measuring performance and guiding management decisions
For an entertaining approach to learning more about TOC, see Dr. Goldratt’s novel “The Goal” which tells the story of a plant manager working to improve operational efficiency.
Metrics are defined as quantitative measurements used to evaluate, compare, and track the performance, progress, or quality of a specific aspect, process, or entity. They are crucial for setting benchmarks, identifying areas for improvement, and making informed decisions.
So how should we define DevOps metrics given this general definition? A DevOps metric is any measurement related to the operation and efficiency of a DevOps team’s process. That leaves the question of what to measure: which metrics are vital to understanding the health of a team’s DevOps workflow or lifecycle?
The metrics need to go beyond just measuring a characteristic of the work delivered by a team. The data needs to inform the team of their current health while allowing them to assess potential changes in their process, behavior, cross-team interactions, etc., that could be translated into performance improvements. One challenge with collecting metrics about a process is the need to understand the context and a method to evaluate the value. Otherwise, the value is just a number or data point, not actual knowledge about the team or process being analyzed.
While many different metrics can be used to evaluate the performance of a software development team, it can be challenging and downright daunting to figure out where to start. Fortunately, the DevOps Research and Assessment (DORA) team has identified a set of metrics as key indicators of a software team’s performance.
DORA has identified four metrics that have been proven to predict the performance of a software team:
- Deployment Frequency—How often an organization successfully releases to production
- Lead Time for Changes—The amount of time it takes a commit to get into production
- Change Failure Rate—The percentage of deployments causing a failure in production
- Time to Restore Service—How long it takes an organization to recover from a failure in production
The DORA team surveys industry professionals yearly as part of an ongoing research project started in 2014. For most years, the research team has classified the responses into three clusters (High, Medium, and Low). While some survey years have revealed a fourth cluster called Elite, the latest results from 2022 returned to the standard three clusters.
The differences between Elite and High are not something that teams just getting started with DevOps and improving their delivery methods need to focus on. Instead, such a team should focus on understanding its current performance and its opportunities for improvement. As the team matures, it can revisit the goal of reaching Elite status and decide whether the effort is worth the organization’s investment.
So now what? Start with the four key metrics and a working definition of each, then build a model of the current operating process so that metrics can be collected about it. This requires collecting data on the current process, adding instrumentation points where they don’t already exist.
DORA metrics explained
Deployment Frequency
The most straightforward metric to start with is Deployment Frequency, which requires collecting only a single data point: the deployment time of each successful deployment into production. The metric is the average number of deployments over a given time frame. Depending on the current deployment model, the time horizon could be days, months, quarters, or even years.
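As a rough illustration, the calculation can be sketched in Python. The function name `deployment_frequency`, the one-week reporting period, and the sample timestamps are all assumptions made for this example. In practice the deployment times would come from whatever deployment tooling is already in place.

```python
from datetime import datetime, timedelta


def deployment_frequency(deploy_times, window_start, window_end,
                         period=timedelta(days=7)):
    """Average number of successful production deployments per `period`.

    Only deployments inside [window_start, window_end) are counted.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    in_window = [t for t in deploy_times if window_start <= t < window_end]
    # Dividing two timedeltas gives the number of periods as a float.
    periods = (window_end - window_start) / period
    return len(in_window) / periods


# Hypothetical sample data: six successful deployments in a four-week window.
deploys = [
    datetime(2023, 5, 1, 10, 0),
    datetime(2023, 5, 3, 15, 30),
    datetime(2023, 5, 9, 9, 0),
    datetime(2023, 5, 12, 17, 45),
    datetime(2023, 5, 20, 11, 0),
    datetime(2023, 5, 26, 14, 0),
]
per_week = deployment_frequency(
    deploys, datetime(2023, 5, 1), datetime(2023, 5, 29)
)
print(f"{per_week:.2f} deployments per week")  # 1.50 deployments per week
```

A team that deploys quarterly could pass a longer `period`, such as `timedelta(days=90)`, so the result stays readable.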
The goal of increasing deployment frequency can feel counterintuitive to most people initially as it would seem to increase the risk of failure. This would be true if we continued to release large software deployments with many changes. The higher deployment frequency is enabled by reducing the batch size (see TOC) or the number of actual changes in a software release. This allows the teams involved to understand the scope of each change better and reduce the risk of failure.
Lead Time for Changes
Now we add the collection of one new data point, commit time, to start calculating the lead time for a change. We must capture the actual start time of each change deployed to production to determine how long the change took to implement, and then calculate the average lead time over a given time frame. Depending on the team, Lead Time for Changes may be measured in days, weeks, or months.
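A minimal sketch of the averaging step follows. It assumes each change is available as a pair of commit time and production deployment time. The function name `average_lead_time` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def average_lead_time(changes):
    """Mean time from commit to production deployment.

    `changes` is an iterable of (commit_time, deploy_time) pairs.
    """
    lead_seconds = []
    for commit_time, deploy_time in changes:
        if deploy_time < commit_time:
            raise ValueError("deploy_time must not precede commit_time")
        lead_seconds.append((deploy_time - commit_time).total_seconds())
    if not lead_seconds:
        raise ValueError("at least one change is required")
    return timedelta(seconds=mean(lead_seconds))


# Hypothetical sample data: lead times of 24, 48, and 12 hours.
changes = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 2, 9, 0)),
    (datetime(2023, 5, 2, 12, 0), datetime(2023, 5, 4, 12, 0)),
    (datetime(2023, 5, 5, 8, 0), datetime(2023, 5, 5, 20, 0)),
]
avg = average_lead_time(changes)
print(avg)  # 1 day, 4:00:00
```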
Improving lead time for changes creates a more efficient, responsive, and effective software development process.
Reducing lead time typically requires improved processes and practices related to quality assurance and testing, resulting in higher software quality.
Time to Restore Service
The Time to Restore Service (TTRS) metric requires two new data points: the start time of a failure and the corresponding restoration time. Use the resulting time-to-restore periods to calculate the mean or median time to restore service (or both) for a given time frame. The time frame must be long enough to provide a meaningful metric for the team.
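The sketch below computes both statistics from pairs of failure start time and restoration time. The function name `time_to_restore` and the incident data are made up for illustration. Reporting in minutes is also an assumption, and any unit would work.

```python
from datetime import datetime
from statistics import mean, median


def time_to_restore(incidents):
    """Return (mean, median) time to restore service, in minutes.

    `incidents` is an iterable of (failure_start, restored_at) pairs.
    """
    durations = [
        (restored_at - failure_start).total_seconds() / 60
        for failure_start, restored_at in incidents
    ]
    if not durations:
        raise ValueError("at least one incident is required")
    return mean(durations), median(durations)


# Hypothetical sample data: outages of 30, 45, and 240 minutes.
incidents = [
    (datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 2, 10, 30)),
    (datetime(2023, 5, 11, 14, 0), datetime(2023, 5, 11, 14, 45)),
    (datetime(2023, 5, 23, 1, 0), datetime(2023, 5, 23, 5, 0)),
]
mean_minutes, median_minutes = time_to_restore(incidents)
print(mean_minutes, median_minutes)  # 105.0 45.0
```

In this sample one long outage pulls the mean well above the median, which is one reason to report both.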
Improving the Time to Restore Service (TTRS) aims to minimize the time it takes to recover from a service outage, failure, or degradation and restore the system to its normal functional status. TTRS is an important metric because it measures the organization’s ability to quickly detect, diagnose, and resolve issues, ensuring service availability and reliability.
Change Failure Rate
Now that we have captured data for the number of deployments using Deployment Frequency and the number of resolved failures based on Time to Restore Service data, we can calculate the Change Failure Rate from the ratio of failures to the number of deployments for a given time frame. This metric reflects the effectiveness of the software development, testing, and deployment processes.
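The ratio can be sketched as a small function. The name `change_failure_rate` and the sample counts are assumptions. The sketch also assumes that every failure counted was caused by a deployment in the same time frame.

```python
def change_failure_rate(deployments, failures):
    """Percentage of deployments that caused a failure in production."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    if failures < 0:
        raise ValueError("failures must not be negative")
    # Multiply first so integer inputs give an exact result where possible.
    return 100 * failures / deployments


# Hypothetical sample data: 40 deployments and 6 failures in one quarter.
rate = change_failure_rate(deployments=40, failures=6)
print(f"Change Failure Rate: {rate:.1f}%")  # Change Failure Rate: 15.0%
```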
Benefits that may be observed:
- Faster deployment and release cycles: Organizations with a low Change Failure Rate can deploy changes more confidently and frequently.
- Efficient resource utilization: Reducing the Change Failure Rate allows teams to spend less time troubleshooting and fixing issues.
- Improved software quality: A lower Change Failure Rate indicates that the organization is effectively identifying and addressing defects before deployment.
Required Cultural Changes
Fundamental cultural changes are required to embrace DevOps and to optimize the various metrics:
- Embrace a DevOps Mindset: Cultivate a DevOps culture where everyone is responsible for the entire software development lifecycle, from planning to deployment and monitoring.
- Prioritize Agile and Lean Principles: Adopt agile and lean principles, such as iterative development, frequent feedback, and focusing on customer value.
- Value Automation: Encourage automation in all aspects of the development lifecycle, from testing to deployment.
- Encourage Continuous Improvement: Foster a culture that values continuous learning and improvement.
- Emphasize Quality and Security: Ensure that quality and security are priorities at every stage of the software development process.
- Collaboration and Communication: Encourage open communication and collaboration between development, operations, and other teams. Break down silos and foster a shared understanding of goals, responsibilities, and expectations.
The collecting, measuring, and reporting of metrics is not the goal of DevOps. Instead, the goal is the journey to understand the software delivery process better and improve the ability to deliver value to the organization. Focusing on metrics is one way the DevOps team continues to learn and grow.