Application Introduction
Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time series database built using an HTTP pull model, with flexible queries and real-time alerting. (Source: Wikipedia)
The main strengths of this software are that it stores metrics very efficiently and that it is easy to run and maintain, even in large deployments. With its PromQL query language, it is also as powerful as InfluxDB.
Prometheus is not meant for long-term storage, though; other projects can collect Prometheus data via the so-called remote write functionality.
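For example, forwarding samples to a long-term store is configured in `prometheus.yml` via a `remote_write` section. The endpoint URL and queue settings below are placeholders, not values from this test:

```yaml
# prometheus.yml (fragment) -- ship samples to a long-term storage backend.
# The URL is a placeholder; point it at a remote-write-capable store
# such as Thanos Receive, Cortex/Mimir, or InfluxDB.
remote_write:
  - url: "https://long-term-store.example.com/api/v1/write"
    queue_config:
      max_samples_per_send: 500   # batch size per outgoing request
```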
Prometheus Deployment Options and Trade-offs
By default, OpenShift deploys two Prometheus instances, which work independently of each other and are backed by EmptyDir volumes. In theory, data is therefore not lost when one Prometheus instance is down, since it can be queried from the other instance.
EmptyDir volumes store a Pod's data on the local disk of the node for as long as the Pod exists. If the Pod is deleted or the node is lost, the collected metrics are lost with it, and Prometheus starts over with an empty data set. It is possible to switch the backend volume to a memory-based EmptyDir, which should in theory improve performance, but memory is far more expensive than disk and the data is lost on every OpenShift node reboot. What we saw in this test, however, is that the memory-based EmptyDir performs the same as the regular EmptyDir. The other alternative is to place the Prometheus Time Series Database (TSDB) volume on OCS-backed storage, which has performance characteristics similar to the regular EmptyDir while improving resilience.
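The three backends correspond to three kinds of volume definitions in the Prometheus Pod spec. The following fragment is an illustrative sketch (volume and claim names are examples, not the ones used in the test deployments):

```yaml
# Illustrative volume definitions for the three storage backends:
volumes:
  - name: prometheus-data-emptydir
    emptyDir: {}                     # node-local disk; lost when the Pod goes away
  - name: prometheus-data-ramdisk
    emptyDir:
      medium: Memory                 # tmpfs; lost on node reboot, counts against RAM
  - name: prometheus-data-ocs
    persistentVolumeClaim:
      claimName: prometheus-ocs-pvc  # OCS-backed PVC; survives Pod and node loss
```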
| | Simple Query | Query with one PromQL Function | Multiple PromQL Functions | Summary |
|---|---|---|---|---|
| Test configuration | 1,000 queries, 100 in parallel | 100 queries, 10 in parallel | 600 queries, 100 in parallel | |
| OpenShift Container Storage | Requests/sec: 65.22; mean time per request: 15.33 ms | Requests/sec: 1.65; mean time per request: 604.56 ms | Requests/sec: 0.77; mean time per request: 1,294.13 ms | Performance: ★ Resilience: ★★★ Cost: ★ |
| EmptyDir | Requests/sec: 71.41; mean time per request: 14 ms | Requests/sec: 1.76; mean time per request: 569.56 ms | Requests/sec: 0.86; mean time per request: 1,165.30 ms | Performance: ★★ Resilience: ★ Cost: ★★ |
| EmptyDir based on ramdisk | Requests/sec: 70.68; mean time per request: 14.15 ms | Requests/sec: 1.69; mean time per request: 590.02 ms | Requests/sec: 0.83; mean time per request: 1,209.35 ms | Performance: ★★ Resilience: ★ Cost: ★ |
Key Measures of Performance and Resilience for Prometheus
We captured the following key measures of performance and resilience to inform this brief:
- Query performance with a simple query of node_load1
- Query performance with one PromQL function
- Query performance with a complex interaction of multiple PromQL functions
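The exact queries are in the raw-material gist linked in the appendix; only the simple one (`node_load1`) is named in this brief. Illustrative PromQL for each complexity tier could look like:

```
# Tier 1 -- simple selector (used in the test):
node_load1

# Tier 2 -- one PromQL function (illustrative, not the test query):
rate(node_cpu_seconds_total[5m])

# Tier 3 -- multiple nested functions (illustrative, not the test query):
sum by (instance) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))
```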
Workload Benchmarking Results Summary
Key observations of Prometheus performance are summarized in the results table above.
Appendix
Benchmark Overview
For the automatic provisioning of Prometheus, we used three different deployments, each deploying a single Prometheus instance with one of the three storage backends.
To measure the performance of the Prometheus TSDB, we used the ApacheBench software. ApacheBench was developed to measure the performance of websites and has many features that are useful here. Since Prometheus exposes an HTTP API, we point ApacheBench at prepared URLs that trigger a TSDB lookup, and ApacheBench reports how long each lookup took. To minimize networking effects, we ran ApacheBench in Pods on the same OpenShift cluster and connected to Prometheus via the OpenShift service address.
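The prepared URLs are calls to the Prometheus HTTP API's `/api/v1/query_range` endpoint. A minimal sketch of how such a URL is built and timed is below; the service hostname, time window, and step are illustrative placeholders, not the exact values from the test run:

```python
import time
import urllib.parse
import urllib.request

def build_query_url(base, query, start, end, step):
    """Build a Prometheus /api/v1/query_range URL like those fed to ApacheBench."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base}/api/v1/query_range?{params}"

url = build_query_url(
    "http://prometheus.example.svc:9090",  # cluster-internal service address (placeholder)
    "node_load1",                          # the simple query from the test
    start=1_600_000_000,
    end=1_600_000_000 + 9 * 24 * 3600,     # 9-day window, as in the test
    step="10m",                            # 10-minute resolution
)

# Timing a single lookup (requires a reachable Prometheus, so commented out here):
# t0 = time.time()
# urllib.request.urlopen(url).read()
# print(f"lookup took {time.time() - t0:.3f}s")
```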
For every database, some queries are simpler to run and some are harder. For our test run, we prepared three different queries and asked Prometheus to return all metrics in a 9-day time window. We had to increase the query step to 10 minutes to stay below the limit on data points a single Prometheus query may return. To compensate, we increased the total number of queries and the number of parallel queries enough to stress Prometheus and the underlying storage.
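The arithmetic behind the 10-minute step: Prometheus rejects range queries that would return more than 11,000 resolution points per series, and a 9-day window at a coarser-than-default 10-minute step stays well under that limit.

```python
# Why the query step was raised to 10 minutes for the 9-day window:
window_seconds = 9 * 24 * 3600   # 9-day query window
step_seconds = 10 * 60           # 10-minute resolution
points = window_seconds // step_seconds + 1  # data points per series (endpoints inclusive)
print(points)                    # well under Prometheus' 11,000-point limit
```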
Benchmark Environment Summary
Software
| Component | Details |
|---|---|
| OCP Version | v4.2 |
| OCP Infra | VMware |
| Master Nodes | 3 x |
| Compute Nodes | 3 x 16 vCPU & 64 GB RAM |
| OCS Storage Nodes | 3 x 16 vCPU & 64 GB RAM |
| OCS Storage Devices | 3 x 1 TB vSAN-based PVCs on NVMes |
| OCS Version | v4.2 |
Table 1: OCP and OCS Infra Details
| Component | Version |
|---|---|
| Prometheus | 2.14.0 (container image prom/prometheus:latest) |
| ApacheBench | 2.3 (container image jordi/ab) |
Table 2: Deployed versions details
Measurements:
Raw material available here: https://gist.github.com/mulbc/33d25cfd3b31fff307c7ce23352f1efd
Additional Resources
- OpenShift Container Storage: openshift.com/storage
- OpenShift | Storage YouTube Playlist
- OpenShift Commons "All Things Data" YouTube Playlist
Feedback
To find out more about OpenShift Container Storage or to take a test drive, visit https://www.openshift.com/products/container-storage/.
If you would like to learn more about what the OpenShift Container Storage team is up to or provide feedback on any of the new 4.3 features, take this brief 3-minute survey.
Categories
Storage, How-tos, Operators, logging, OCS 4.3, Prometheus, Monitoring