In the book "Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations," authors Nicole Forsgren, Jez Humble, and Gene Kim use data collected from thousands of professionals worldwide in the State of DevOps Report to demonstrate that software delivery performance produces real competitive advantages for companies.
The authors also identified four key metrics for measuring software delivery performance:
- Lead Time for Change: how long it takes a team to go from code committed to code successfully running in production.
- Deployment Frequency: how often increments of code are deployed to production.
- Mean Time to Restore: the average time needed to recover from a problem caused by a software deployment.
- Change Failure Rate: the percentage of deployments that cause a failure in production.
The first two metrics relate to the speed of change. And since innovation comes from experimentation at high speed, we must accept that we will inevitably make mistakes. DevOps proposes to "fail fast", and for this it is important to monitor the last two metrics, which measure how many failures we introduce and how quickly we can remedy them. Improving business agility therefore involves reducing the Lead Time for Change and increasing the Deployment Frequency, while also reducing the Mean Time to Restore and the Change Failure Rate.
How can we improve these four metrics with Red Hat OpenShift?
A developer has just pushed the last commit of an application MVP's source code, and the Lead Time for Change clock starts running! We are talking about cloud-native applications, so the first thing we need is an enterprise Kubernetes cluster where we can deploy the application. If we don't have one available in our organization yet, we can consider both on-premises and cloud options. In particular, OpenShift offers several managed options on the main public clouds: OpenShift Dedicated, Azure Red Hat OpenShift, Red Hat OpenShift Service on AWS, and Red Hat OpenShift on IBM Cloud. With any of these options we can have a production cluster available in a couple of hours and start deploying applications.
We now have a production enterprise Kubernetes cluster. How do we proceed?
In traditional organizations, when developers want to test their code in a lower environment (such as Development), they must first ask the Infrastructure team for resources on a virtual machine where they can run it. This usually means creating a request in some internal workflow system that, with luck, is resolved in an hour, although we know it can take more than a week. This waiting time unnecessarily delays the Time-To-Market of a new feature or MVP.
OpenShift proposes to empower developers by giving them self-service access to create environments and deploy workloads, while giving the Infrastructure team full control to limit the computing resources each developer can use. With a single click or command, a new project (namespace) is created within OpenShift.
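As a minimal sketch of this workflow, the project creation and the resource cap might look like this from the command line (the project and quota names below are hypothetical):

```shell
# Developer: create a new project (namespace) in self-service mode
oc new-project dev-payments --display-name="Payments - Development"

# Infrastructure team: cap the compute resources the project can consume
oc create quota dev-payments-quota \
  --hard=requests.cpu=2,requests.memory=4Gi,limits.cpu=4,limits.memory=8Gi \
  -n dev-payments
```

The quota values are illustrative; each organization tunes them per environment.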
OpenShift 4 clusters use OpenShift SDN in NetworkPolicy mode as the default networking solution. Its default behavior is a wide-open network across projects/namespaces (no policies applied). It is also possible to install the cluster in multitenant mode as an install-time configuration option; in that scenario, all projects/namespaces are network-isolated from each other by default. This avoids any interference between workloads from different environments inside the same OpenShift cluster.
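When the cluster stays in the default NetworkPolicy mode, a similar isolation can be achieved per namespace by applying policies explicitly. A minimal sketch (the namespace name is hypothetical):

```shell
# Once a policy selects the pods in a namespace, any traffic it does not
# allow is denied; this one admits ingress only from the same namespace
oc apply -n dev-payments -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
EOF
```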
After the new namespace is created, there is a very simple way to build the container image from the source code repository using Source-to-Image (S2I) and deploy it to any environment. This simple starting strategy can help quickly build an MVP with a satisfying initial Lead Time for Change.
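For example, an S2I build and deployment can be triggered with a single command (the builder image, repository URL, and application name below are hypothetical):

```shell
# Build the image from source with the Node.js S2I builder and deploy it
oc new-app nodejs~https://github.com/example/mvp-app.git --name=mvp-app

# Expose the application outside the cluster with a Route
oc expose service/mvp-app
```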
Now we must improve all four metrics!
The next important thing is to incorporate CI/CD pipelines, which we can implement using OpenShift Pipelines (Tekton) and OpenShift GitOps (ArgoCD).
CI/CD pipelines clearly improve the Lead Time for Change and Deployment Frequency metrics: by automating, we can deploy to any environment in a couple of minutes, which allows us to deploy more often. They also improve the Change Failure Rate, not only because automation avoids the errors of manual deployment processes, but also because the pipelines contain a large number of automated checks (unit tests, static code analysis, vulnerability analysis, and integration tests) that drastically reduce the possibility of introducing errors in deployments. These automated checks are key to the "shift left" concept: many controls are carried out early in each pipeline execution, avoiding surprises when deploying to Production and further improving the Lead Time for Change.
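As an illustrative sketch, an OpenShift Pipelines (Tekton) pipeline chaining these stages could start like this (the `git-clone` task ships with OpenShift Pipelines as a ClusterTask; the pipeline name, parameter values, and remaining stages are a hypothetical outline):

```shell
oc apply -f - <<'EOF'
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-test-deploy
spec:
  params:
    - name: git-url
      type: string
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
        kind: ClusterTask
      params:
        - name: url
          value: $(params.git-url)
      workspaces:
        - name: output
          workspace: shared-workspace
    # Subsequent tasks (hypothetical) would run unit tests, static code
    # analysis, vulnerability scanning, the image build, and the deployment
EOF
```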
A correctly implemented CI/CD strategy also allows for rollbacks: if a problem is found after deployment, a previous version can be redeployed in a couple of minutes, thus improving the Mean Time to Restore. Here we should mention the value of using GitOps in our strategy, which lets us keep the container image version synchronized with the rest of the configuration the application needs to work (deployment descriptors, ConfigMaps, Routes, etc.).
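A rollback can then be as simple as the following (the deployment name is hypothetical):

```shell
# Redeploy the previous version of the workload
oc rollout undo deployment/mvp-app

# With GitOps, the equivalent is reverting the commit that changed the
# desired state; Argo CD then synchronizes the cluster back to it
git revert HEAD && git push
```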
OpenShift also allows us to easily perform route-based deployments, such as A/B deployments, where we specify the percentage of traffic that will use a new version of an application, or canary deployments, where a small set of users is selected to test a new version. These deployment types let us experiment in production while drastically limiting the impact of errors, without waiting for long QA processes, and with the agility of being able to immediately direct all traffic back to the stable version of the service. This quick-experimentation advantage is reflected in the Mean Time to Restore metric.
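With `oc set route-backends`, traffic can be split between two versions of a service; for example, a canary that sends 10% of requests to the new version (the route and service names are hypothetical):

```shell
# Send 90% of the traffic to the stable version and 10% to the canary
oc set route-backends mvp-app mvp-app-v1=90 mvp-app-v2=10

# If the new version misbehaves, return all traffic to the stable one
oc set route-backends mvp-app mvp-app-v1=100 mvp-app-v2=0
```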
How can we monitor these metrics?
Pelorus is a Grafana-based dashboard that allows these four metrics to be monitored at the organizational level. This tool is key to implementing a transformational, metrics-based continuous improvement process such as the one proposed by Trevor Quinn.
In the following table we can see a synthesis of how the different OpenShift capabilities mentioned above improve the four metrics: