Introduction

We are used to thinking of monitoring as the process that answers the question: Is a given service up or down?

At Google, where the Site Reliability Engineering (SRE) movement was born, monitoring helps answer the question: What percentage of requests are being successfully served?


This change of perspective, from a binary (up/down) approach to a more quantitative approach to monitoring, better captures the quality of the service from the service consumer point of view. It also modernizes the concept of availability, for which, in most cases, the expectation is that a globally deployed service will be always up, but might have localized outages.

In this article, we will examine how to implement monitoring like an SRE. Although this technique can be applied to a variety of contexts, we are going to assume we have a service deployed in the OpenShift ServiceMesh (Istio). This allows us to rely on automatically collected and uniform metrics for all services in the mesh, and that, in turn, enables automation.

Service Level Indicator and Service Level Objective

To be able to compute how many requests are being served successfully, we need to agree on a set of measurable characteristics (metrics) that determine the success and/or availability of our service. This special set of metrics is called a Service Level Indicator (SLI).

For an HTTP-based service (a REST service, for example), the HTTP response code is generally considered a simple but effective SLI.

Prometheus is the de facto standard metrics collector for container-native workloads and is used by both the platform monitoring stack and the ServiceMesh observability stack. To view, aggregate, and filter the metrics it collects, Prometheus offers a query language called PromQL. Explaining the PromQL syntax is beyond the scope of this article; please refer to the official documentation.

The aforementioned SLI can be translated to the following PromQL expressions:

Successful requests in a given time interval:

sum(increase(istio_requests_total{response_code!~"5.*"}[$time_interval]))

Total requests:

sum(increase(istio_requests_total{}[$time_interval]))

A Service Level Objective (SLO) is a threshold on the chosen SLI, typically a minimum ratio of successful requests, that the team managing a service has agreed upon.

So, given the previous SLI, a reasonable SLO might be: 99.99 percent of requests will be successful.

In order to meet the SLO in a given time interval, the ratio between successful requests and total requests must be higher than the given threshold. This can be calculated using the following PromQL expression:

(sum(increase(istio_requests_total{response_code!~"5.*"}[$time_interval])) / sum(increase(istio_requests_total{}[$time_interval])) ) > 0.9999

Notice how this definition of availability changes the perspective. Traditionally, one would calculate the ratio between uptime and total service time. With SLI and SLO, we instead calculate the ratio between successful requests and total requests, which better captures consumer satisfaction.

This approach works well with continuous, zero-downtime deployments.

In the case of an HTTP service, it is common to use two SLIs: response code and latency. An SLO based on those two SLIs can be represented as: 99.99 percent of requests will be successful and served within 1 second.

The formula for successful requests in a given time interval now becomes:

sum(increase(istio_request_duration_seconds_bucket{response_code!~"5.*",le="1"}[$time_interval]))
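
The SLO check from the previous section can be updated accordingly. Here is a sketch that reuses the same $time_interval placeholder and takes the histogram's matching _count series as the total number of requests:

(sum(increase(istio_request_duration_seconds_bucket{response_code!~"5.*",le="1"}[$time_interval])) / sum(increase(istio_request_duration_seconds_count{}[$time_interval]))) > 0.9999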

We are going to use this SLO in our examples for the remainder of this article.

The line of business (LOB), the developer, and the SRE teams should all agree on the definition of SLI and SLO.

Error Budget

Given an SLO, a certain observation window (typically one month), and a certain amount of incoming requests, we can calculate the number of requests that may fail before the SLO is violated. This value is the error budget.

With that definition of error budget, it is an SRE best practice to establish the following agreement (a social contract between LOBs, developers, and SREs): if a team is within its error budget, it can release new features. If not, all the team members must work on backlog items that stabilize the service. This generally means Non-Functional Requirement (NFR) items such as more tests, more automation, and better alerts.

This approach brilliantly dissolves the age-old tension between developers and operations on the velocity of changes. The tiebreaker is data-driven and based on agreed-upon metrics (SLI) and criteria (SLO). This rule rewards teams that produce more stable code and can, as a consequence, release more new features.

In PromQL, we can track error budget consumption as follows (this expression computes the ratio of successful requests to total requests over 30 days, to be compared against the SLO):

sum(increase(istio_request_duration_seconds_bucket{response_code!~"5.*",le="$latency"}[30d]))/sum(increase(istio_request_duration_seconds_count{}[30d]))

Given the aforementioned PromQL query, we can easily create a Grafana dashboard to display the error budget:

[Image: Grafana dashboard showing the error budget for a service]

This simple dashboard depicts a service with an SLO of 95 percent and an average availability in the observation window of 82 percent, so the error budget is depleted.

Here is another possible view of the same service:

[Image: alternative error budget view showing failed requests against the allowed budget]

In this view, we see that, given the SLO and the number of requests received in the observation window, the average error budget would be 100 failed requests, but we actually have an average of 500 failed requests.
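
One possible way to back such a panel with PromQL, reusing the metrics above, is sketched below; the 95 percent SLO of the example dashboard and the 30-day window are hard-coded here, and the expressions return totals rather than averages.

Allowed failed requests (the error budget expressed in requests):

(1 - 0.95) * sum(increase(istio_request_duration_seconds_count{}[30d]))

Actual failed requests:

sum(increase(istio_request_duration_seconds_count{}[30d])) - sum(increase(istio_request_duration_seconds_bucket{response_code!~"5.*",le="1"}[30d]))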

Alerting

Alerting is probably the most important capability provided by a good monitoring stack. Alerting is what notifies us when a problem arises, or is about to arise, so that action can be taken.

But creating timely and meaningful (good signal-to-noise ratio) alerts is not easy. Failing to do so induces alert fatigue in the individuals on call, which may lead them to start ignoring alerts.

Well-crafted alerts have the following characteristics:

Precision: the proportion of alerts that are generated because of a significant event. 100 percent precision means that every alert corresponds to a significant event.

Recall: the proportion of significant events that are detected and turned into alerts. 100 percent recall means that every significant event generates an alert.

Detection Time: how long it takes for a condition to be detected and turned into alerts.

Reset Time: how long it takes for alerts to stop firing after the root condition has been resolved.
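
As a quick numeric example: if ten significant events occur and the monitoring stack fires eight alerts, six of which correspond to real events, precision is 6/8 = 75 percent and recall is 6/10 = 60 percent.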

So, how do we alert on our SLIs in order to minimize the chance of passing the SLO threshold?

At Google, the SRE team has developed a sophisticated alerting approach based on SLI and SLO. I recommend reading this chapter to understand how this technique has evolved. Here I will just describe the final result.

Here are the SRE best practices on alerting:

  1. Alert on error budget burn rate, not on error rate.
  2. Alert on different burn rates (slow vs fast) and different observation window lengths.
  3. Alert with different priorities based on burn rate (page vs ticket). If the burn rate is fast, send an urgent alert, for example a page. If the burn rate is slow, open a non-urgent ticket. Alert delivery can be controlled in Alertmanager (a minimal routing sketch follows this list).
  4. Use a shorter observation window (1/12th of the main one) as a control for the main window, to improve the reset time.
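
As a minimal illustration of point 3, severity-based routing in Alertmanager might look like the following sketch; the receiver names are placeholders, and the actual notification integrations depend on your environment:

route:
  receiver: default
  routes:
  - match:
      severity: page        # fast burn rate: urgent notification
    receiver: on-call-pager
  - match:
      severity: ticket      # slow burn rate: non-urgent ticket
    receiver: ticket-queue
receivers:
- name: default
- name: on-call-pager
- name: ticket-queue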

Here are the magic numbers that most of the SRE community (not just Google) agrees upon:

Error budget consumed   Long window   Short window   Burn rate factor   Severity
2 percent               1 hour        5 minutes      14.4               page
5 percent               6 hours       30 minutes     6                  page
10 percent              3 days        6 hours        1                  ticket

Here is how you read the first line of the above table: we want to be alerted if 2 percent of the error budget is consumed in an hour (the long window column). This implies a burn rate factor of 14.4 (which can be deduced mathematically, as shown below). The short control window is 1/12th of an hour, which is five minutes.
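
For those who want to verify the burn rate factor, assuming the 30-day (720-hour) error budget window used throughout this article:

burn rate = (fraction of error budget consumed × budget window) / alert window
          = (0.02 × 720 h) / 1 h
          = 14.4

The other rows of the table can be derived in the same way.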

Some versions of this table also include a for duration column, which maps to the for clause of the Prometheus alert rule; not everyone in the SRE community agrees on whether it should be used.

Using the first line of the above table again, we can analyze the alert dynamics with a simple example:

[Image: error rate over time, with the short (5m) and long (60m) alert windows firing and resetting]
In this graph above, we can observe the following:
  • At minute 10, an error situation occurs, which brings the error rate to 15 percent. Almost immediately, the short observation window (5m) starts firing. No alert is raised yet, because both windows need to be firing.
  • At minute 15, the long observation window (60m) also crosses the threshold and starts firing. At this point, an alert is sent. The detection time was 5 minutes.
  • At minute 20, the SRE on call resolves the issue and errors drop to zero.
  • At around minute 23, the short observation window goes below the threshold, inhibiting the alert. The reset time was 3 minutes.
  • The long observation window goes under the threshold only at around minute 80, about 60 minutes after the error condition has been fixed. Without the shorter observation window, this would have been the reset time.

An aside regarding this diagram: I took it on trust from the SRE book. If you are a statistician, you should be able to recreate the graph and customize it as needed by changing the magic numbers in the table above (I did not invest time in building that model).

Here is how you write the alert for the first row in PromQL, assuming an SLO of 99.99 percent:

expr: (
       job:slo_errors_per_request:ratio_rate1h{...} > (14.4*0.0001)
     and
       job:slo_errors_per_request:ratio_rate5m{...} > (14.4*0.0001)
     )
severity: page
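
The expression above relies on recording rules that precompute the error ratio over the two windows. What follows is a minimal sketch of how those rules could be defined on top of the Istio metrics used earlier; the rule names follow the convention of the alert above, while the aggregation by job and the 1-second latency threshold are assumptions to adapt to your own SLI:

groups:
- name: slo-recording-rules
  rules:
  # Error ratio over the long (1h) window: fraction of requests that were
  # either unsuccessful (5xx) or slower than 1 second.
  - record: job:slo_errors_per_request:ratio_rate1h
    expr: |
      1 - (
        sum by (job) (rate(istio_request_duration_seconds_bucket{response_code!~"5.*",le="1"}[1h]))
        /
        sum by (job) (rate(istio_request_duration_seconds_count{}[1h]))
      )
  # Same error ratio over the short (5m) control window.
  - record: job:slo_errors_per_request:ratio_rate5m
    expr: |
      1 - (
        sum by (job) (rate(istio_request_duration_seconds_bucket{response_code!~"5.*",le="1"}[5m]))
        /
        sum by (job) (rate(istio_request_duration_seconds_count{}[5m]))
      )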

The creation of these alert rules can be automated. Within this repository, you can find an example of how this can be accomplished. As an alternative, you can also check out this online Prometheus rule generator for error budget alerts.

Installation

An automation of the configuration needed to set up an error budget dashboard and SLO-based alerts is provided in this repository.

It was necessary to deploy a parallel Prometheus/Alertmanager/Grafana stack to work around some of the limitations of the current ServiceMesh observability stack. This may improve in future releases.

Limitations

SLO, error budget, and the related alerts are all statistical in nature, and they implicitly assume that errors are spread roughly evenly across requests. When errors are concentrated on a subset of consumers or servers, important error events might go unnoticed. For example, let us assume you have 1M requests per month and an SLO of 99 percent, so the error budget is 10,000 requests. Here are some examples of unevenly distributed errors that would not trigger alerts:

  1. If you have 100 customers, each making 10,000 requests, and all of the requests of one of them fail (10,000 failed requests), you are still within the error budget; yet it might be helpful to be notified so that you can investigate this particular situation.
  2. If you have 100 servers, each serving 10,000 requests, and one server always fails … you see where this is going (see the sketch after this list).
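
A possible way to surface this kind of skew, sketched below, is to break the error ratio down by a label that identifies the failing dimension, for example the destination_workload label that Istio attaches to its standard metrics (a per-customer breakdown would require a label identifying the consumer, which the mesh does not provide out of the box):

sum by (destination_workload) (rate(istio_requests_total{response_code=~"5.*"}[1h])) / sum by (destination_workload) (rate(istio_requests_total[1h]))

Graphing or alerting on the worst of these per-workload ratios complements, rather than replaces, the SLO-based alerts described above.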

Also, because of the statistical nature of this approach, it works well for services that receive a relatively high number of requests. For services with sparse traffic, other techniques apply.

Video

Check out this hour-long video from our Twitch channel, which explains many of the concepts covered here.

 

Conclusion

In this article, we saw how to calculate an error budget and configure the related alerts for a service running in the ServiceMesh. It is worth noting that we were able to automate this process thanks to the standardized metrics that the ServiceMesh offers. However, the concepts of SLI, SLO, and error budget are applicable to any service.

Implementing a monitoring and alerting approach is one of the fundamental tasks of an SRE (as showcased in the SRE pyramid of needs). However, it is just the beginning of the journey into the SRE world. I warmly recommend reading the SRE books to find out more.


About the author

Raffaele is a full-stack enterprise architect with 20+ years of experience. Raffaele started his career in Italy as a Java Architect then gradually moved to Integration Architect and then Enterprise Architect. Later he moved to the United States to eventually become an OpenShift Architect for Red Hat consulting services, acquiring, in the process, knowledge of the infrastructure side of IT.
