Application Introduction

Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time-series database built using an HTTP pull model, with flexible queries and real-time alerting. (Source: Wikipedia)

The main strengths of this software are that it stores metrics very efficiently and that it is easy to run and maintain, even in large deployments. With the PromQL query language, it is moreover just as powerful as InfluxDB.

Prometheus is not meant for long-term storage, though. Other projects can collect Prometheus data via the so-called remote write functionality and retain it for longer.
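As a minimal sketch of how that is wired up, remote write is configured in prometheus.yml; the endpoint URL below is a hypothetical placeholder, not part of this test setup:

```yaml
# prometheus.yml -- minimal sketch of a remote write configuration.
# The endpoint URL is a hypothetical placeholder.
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"
```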

Prometheus Deployment Options and Trade-offs

By default, OpenShift deploys two Prometheus instances, which work independently of each other and are based on EmptyDir volumes. Because both instances scrape the same targets, data is theoretically not lost when one Prometheus instance is down: it can still be queried from the other instance.

The EmptyDir volumes store the Pod's data on the local disk of the node for as long as the Pod exists. If the Pod is deleted or the node is lost, the collected metrics are lost with it, and Prometheus starts over with an empty data set. It is possible to change the backend volume to a memory-based EmptyDir, which theoretically improves performance, but this costs considerably more, and the data is lost on every OpenShift node reboot. What we saw in this test is that the performance of the memory-based EmptyDir is the same as that of the regular EmptyDir. The remaining alternative is to place the Prometheus Time Series Database (TSDB) volume on OCS-backed storage, which offers performance characteristics similar to the regular EmptyDir while improving resilience.
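The three backends differ only in how the Prometheus data volume is declared. The following is an illustrative sketch; the volume and claim names are assumptions, not taken from the test manifests:

```yaml
# 1) Regular EmptyDir -- node-local disk; data is lost with the Pod or node
volumes:
  - name: prometheus-data
    emptyDir: {}

# 2) Memory-based EmptyDir -- tmpfs; data is lost on every node reboot
volumes:
  - name: prometheus-data
    emptyDir:
      medium: Memory

# 3) OCS-backed PersistentVolumeClaim -- survives Pod restarts and node loss
volumes:
  - name: prometheus-data
    persistentVolumeClaim:
      claimName: prometheus-data   # hypothetical claim bound to an OCS storage class
```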


| Storage backend | Simple Query | Query with one PromQL Function | Multiple PromQL Functions | Summary |
| --- | --- | --- | --- | --- |
| Test configuration | 1000 queries, 100 in parallel | 100 queries, 10 in parallel | 600 queries, 100 in parallel | |
| OpenShift Container Storage | 65.22 req/s, 15.33 ms mean per request | 1.65 req/s, 604.56 ms mean per request | 0.77 req/s, 1,294.13 ms mean per request | Performance: 👍 · Resilience: 👍👍👍 · Cost: 👍 |
| EmptyDir | 71.41 req/s, 14.00 ms mean per request | 1.76 req/s, 569.56 ms mean per request | 0.86 req/s, 1,165.30 ms mean per request | Performance: 👍👍 · Resilience: 👍 · Cost: 👍👍 |
| EmptyDir based on ramdisk | 70.68 req/s, 14.15 ms mean per request | 1.69 req/s, 590.02 ms mean per request | 0.83 req/s, 1,209.35 ms mean per request | Performance: 👍👍 · Resilience: 👎 · Cost: 👎 |

 

Key Measures of Performance and Resilience for Prometheus

We captured the following key measures of performance and resilience to inform this brief:

  • Query performance with a simple query of node_load1
  • Query performance with one PromQL function
  • Query performance with a complex interaction of multiple PromQL functions

Workload Benchmarking Results Summary

Key observations of Prometheus performance are summarized in the table at the beginning of this brief.

 

Appendix

Benchmark Overview

For the automatic provisioning of Prometheus, we used three different deployments, each deploying a single Prometheus instance with one of the three storage backends.

To measure the performance of the Prometheus TSDB, we used the ApacheBench software. ApacheBench was developed to measure the performance of web servers and has many features that are useful for this purpose. Since Prometheus has an HTTP API, we pointed ApacheBench at prepared URLs that trigger a TSDB lookup; ApacheBench then reports how long each lookup took. To minimize networking effects, we ran ApacheBench in Pods in the same OpenShift cluster and connected to Prometheus via the OpenShift service address.
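A minimal sketch of such a run is shown below. The service address, timestamps, and query string are assumptions for illustration; the request count and concurrency match the "Simple Query" test configuration:

```sh
# -n: total number of requests, -c: number of requests run in parallel.
# The URL triggers a range query against the Prometheus HTTP API
# (hypothetical service address; start/end span a 9-day window, step=10m).
ab -n 1000 -c 100 \
  'http://prometheus.monitoring.svc:9090/api/v1/query_range?query=node_load1&start=1572566400&end=1573344000&step=10m'
```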

For every database, some queries are cheap to run and some are expensive. For our test run we prepared three different queries and asked Prometheus to return all matching metrics in a 9-day time window. We had to increase the query step (resolution) to 10 minutes to stay below the limit on the number of data points a single Prometheus query may return. To compensate, we increased the total number of queries and the number of parallel queries to put enough stress on Prometheus and the underlying storage.
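The actual queries are part of the raw data linked below. Purely as an illustration of the three complexity classes, they could look like the following; only node_load1 is named in this brief, the other two queries are assumptions:

```promql
# Simple query: a raw gauge
node_load1

# One PromQL function: a per-second rate over a counter
rate(node_cpu_seconds_total[5m])

# Multiple PromQL functions and operators combined
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```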

Benchmark Environment Summary

| Component | Details |
| --- | --- |
| OCP Version | v4.2 |
| OCP Infra | VMware |
| Master Nodes | 3 x |
| Compute Nodes | 3 x 16 vCPU & 64 GB RAM |
| OCS Storage Nodes | 3 x 16 vCPU & 64 GB RAM |
| OCS Storage Devices | 3 x 1 TB vSAN-based PVCs on NVMe devices |
| OCS Version | v4.2 |

Table 1: OCP and OCS Infra Details

 

| Software | Version |
| --- | --- |
| Prometheus | 2.14.0 (container image prom/prometheus:latest) |
| ApacheBench | 2.3 (container image jordi/ab) |

Table 2: Deployed versions details

Measurements:

The raw data is available here: https://gist.github.com/mulbc/33d25cfd3b31fff307c7ce23352f1efd

Additional Resources

Feedback

To find out more about OpenShift Container Storage or to take a test drive, visit https://www.openshift.com/products/container-storage/.

If you would like to learn more about what the OpenShift Container Storage team is up to or provide feedback on any of the new 4.3 features, take this brief 3-minute survey.