With the observability service enabled, you can use Red Hat Advanced Cluster Management for Kubernetes (RHACM) to gain insight into your managed clusters and optimize them. In this blog, I introduce some new advanced configuration options for the observability service in RHACM 2.3.

Overview

Before or after you install the observability service, you can customize it by updating the configuration in the MultiClusterObservability resource. RHACM 2.3 reorganizes some existing parameters and adds more configuration parameters to MultiClusterObservability. The new advanced configuration parameters help you tailor the observability service to your production environment.

The new advanced configuration options for the observability service cover the following areas:

  • Resource requests and limits for observability components
  • Replica size for observability components
  • Retention configuration for metrics data
  • Storage size for observability components

Best practices for setting advanced configurations for the observability service

Continue reading each component section for details on how to set the advanced configurations.

Resource requests and limits for observability components

For each observability component, you can set the requested and the maximum allowed compute resources (CPU and memory). The observability service sets default requests for each component that meet the component's minimum requirements. There are no default limits, which means that each component can consume compute resources on its node without restriction. In some scenarios, an observability component might use so many resources that it affects other components on the same node, or even crashes the node. To avoid that, set resource limits for the following components:

  • Receiver (receive) is one of the core observability components. It receives the metrics data from all managed clusters that have observability enabled, stores the metrics temporarily, and sends them to the backend object storage. By default, it keeps recent metrics data in memory, which can lead to high memory consumption. In a test environment with 1,000 imported single-node clusters, the receive component used around 80Gi of memory. Be careful when you set the memory limit for receive, because the pods of the receive component become stuck in the Crash status if there is not enough memory.

  • Querier (query) is another core observability component; it handles query requests from the client side. Its resource consumption depends on the usage scenario. A simple query should not use many resources. A heavy query, for example one that fetches thousands of time series over a large time range, might use up the available resources and crash the node where the query pod runs. It is best practice to set memory limits for query to avoid a system crash caused by heavy queries or too many queries.

  • Besides the components in the hub cluster, you can also set resource requests and limits for the metrics collector (metrics-collector), which runs on the managed clusters. The metrics collector scrapes metrics data from the local Prometheus instance, processes the data in memory (for example, by removing or adding labels), and then pushes the data to the hub cluster. If the managed cluster is large, with many nodes and pods, the metrics collector scrapes more metrics. As a result, the metrics collector on that managed cluster requires more memory to store and process the metrics.
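
As a minimal sketch of the limits discussed above, the following fragment sets memory limits for the receive and query components on the hub cluster, and requests and limits for the metrics collector. All values are illustrative assumptions, not recommendations; size them from the memory usage you observe in your own environment. Placing the metrics collector resources under observabilityAddonSpec.resources is my reading of the v1beta2 API, so verify it against the Observability API reference before you rely on it.

```yaml
# Fragment of a MultiClusterObservability resource, not a complete manifest.
spec:
  advanced:
    receive:
      resources:
        limits:
          memory: 100Gi   # assumed value; leave headroom so receive pods do not crash
    query:
      resources:
        limits:
          memory: 5Gi     # assumed value; caps the impact of heavy queries
  observabilityAddonSpec:
    # Assumed location for metrics-collector resources on managed clusters.
    resources:
      requests:
        memory: 200Mi     # assumed value
      limits:
        memory: 1Gi       # assumed value; raise for large managed clusters
```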

Replica size for observability components

You can set the replica size for most observability components. By default, an observability component that runs as a Kubernetes Deployment has two replicas, and one that runs as a Kubernetes StatefulSet has three. As your environment scales, you can customize the replica size of specific observability components so that they can handle different workload volumes. Keep the following details in mind when you change the replica size for these components:

  • Receiver components always store multiple copies of the received metrics data; the number of copies is called the receiver.replication-factor. When the receiver has 3 or fewer replicas, the replication-factor equals the value of the replicas field. When the receiver has more than 3 replicas, the replication-factor is 3, which means that each receiver replica stores only part of the received metrics data rather than all of it. In a large-scale environment with a large amount of metrics data, you can increase the replicas value for the receiver so that each replica stores only a subset of the metrics data and therefore requires less memory.

  • Compactor is the component that compacts and downsamples the metrics data in the backend object storage. It is generally not semantically concurrency safe, and must be deployed as a singleton against a bucket in the object storage. For that reason, there is no option to set the replica size for the compactor.

  • Store gateway is the component that implements the Store API on top of historical data in an object storage bucket; it acts primarily as an API gateway. For most components, the replica size that you specify becomes the replicas value of a Deployment or StatefulSet. Store gateway is the only exception. When you increase its replica size, the replicas value in the existing Store StatefulSet does not increase; instead, a new Store StatefulSet is created. Each Store StatefulSet has only one replica, and handles a different shard or data bucket.
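
The replica behavior described above can be sketched with the following fragment. The replica counts are assumptions chosen only to illustrate the rules, and the store key under spec.advanced is my understanding of the v1beta2 API rather than something shown elsewhere in this post, so check it against the Observability API reference.

```yaml
# Fragment of a MultiClusterObservability resource, not a complete manifest.
spec:
  advanced:
    receive:
      replicas: 5   # more than 3, so the replication-factor stays at 3
                    # and each replica stores only a subset of the data
    store:
      replicas: 3   # expected result: three Store StatefulSets with one replica each
    # No replicas value is set for the compactor: it must run as a singleton.
```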

Retention configuration for metrics data

In previous versions, the observability service provided some retention-related configuration options. In RHACM 2.3, those parameters are reorganized and moved to the spec.advanced.retentionConfig section. The following list describes the most important retention-related parameters:

  • retentionResolutionRaw, retentionResolution5m, and retentionResolution1h: These parameters represent the amount of time to retain raw metrics data, 5-minute resolution data, and 1-hour resolution data. These parameters existed in previous RHACM versions, but they have new default values in RHACM 2.3: retentionResolutionRaw is 30d (30 days), retentionResolution5m is 180d (180 days), and retentionResolution1h is 0d (the 0d value means that the 1-hour resolution data is never removed).

  • retentionInLocal: This parameter represents the time to retain raw metrics data locally in some observability components, such as the receiver. The default value is 24h (24 hours). This local retention helps avoid data loss. For example, if a temporary issue prevents the receiver from pushing data to the object storage, the data is kept in the receiver and is pushed once the issue is fixed, provided that the fix happens within 24 hours.

  • deleteDelay: This parameter represents the time before a block that is marked for deletion is deleted from your bucket. The default value is 48h (48 hours). If the value is non-zero, blocks are first marked for deletion, and the compactor component later deletes the marked blocks from the bucket. If the value is zero, blocks are deleted immediately. Note that deleting blocks immediately can cause query failures if the store gateway still has the block loaded, or if the compactor is ignoring the deletion because it is compacting the block at the same time.
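
To make the parameters above concrete, here is a small fragment that writes out the RHACM 2.3 default values listed in this section under spec.advanced.retentionConfig. It is a sketch for orientation only; setting the defaults explicitly is not required, and you would normally include only the parameters you want to change.

```yaml
# Fragment of a MultiClusterObservability resource, not a complete manifest.
spec:
  advanced:
    retentionConfig:
      retentionResolutionRaw: 30d   # raw metrics data kept for 30 days
      retentionResolution5m: 180d   # 5-minute resolution data kept for 180 days
      retentionResolution1h: 0d     # 0d: 1-hour resolution data is never removed
      retentionInLocal: 24h         # local retention in components such as the receiver
      deleteDelay: 48h              # delay before marked blocks are deleted from the bucket
```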

Storage size for observability components

There are new configuration options for the storage used by the observability components in RHACM 2.3. For each observability component that is a StatefulSet, such as the receiver, you can specify the size of the persistent volume that is mounted to the StatefulSet. The receiver and compactor components use the most storage, because they need large amounts of it to store the metrics data, so focus mainly on those two components when you set the storage size. Their default storage size is 100Gi. You might need to increase that value when handling a large amount of metrics.
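
As a small sketch, the fragment below spells out the 100Gi default for the two components named above. The field names receiveStorageSize and compactStorageSize are the ones used in the sample later in this post; placing them directly under spec.storageConfig is my understanding of the v1beta2 API, so treat that nesting as an assumption and confirm it in the Observability API reference.

```yaml
# Fragment of a MultiClusterObservability resource, not a complete manifest.
spec:
  storageConfig:
    receiveStorageSize: 100Gi   # default; increase for a large volume of metrics
    compactStorageSize: 100Gi   # default; increase for a large volume of metrics
```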

Deploy or update observability service with advanced configurations

To deploy the observability service in RHACM 2.3, you need to create the MultiClusterObservability custom resource as mentioned earlier. You can also set more advanced configuration options in this custom resource now. The following YAML sample file deploys observability with some advanced configurations:

apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  advanced:
    receive:
      replicas: 5
      resources:
        limits:
          cpu: 1000m
          memory: 100Gi
        requests:
          cpu: 300m
          memory: 1Gi
    query:
      replicas: 3
      resources:
        limits:
          cpu: 1000m
          memory: 5Gi
        requests:
          cpu: 300m
          memory: 1Gi
    retentionConfig:
      retentionResolutionRaw: 30d
      retentionResolution1h: 365d
      retentionInLocal: 48h
  storageConfig:
    receiveStorageSize: 250Gi
    compactStorageSize: 150Gi
    metricObjectStorage:
      name: thanos-object-storage
      key: thanos.yaml
  observabilityAddonSpec: {}

The previous YAML sample defines the replica size and resources for the Receiver and Querier, defines the storage size for the Receiver and Compactor, and updates the values of some retention parameters. It does not contain all possible advanced options; refer to the Observability API for details about all supported configurations. You can also update the observability service after the deployment. The updated configurations are applied to the running observability components.

Conclusion

The default configurations for the observability service fit most environments. In some cases, such as large-scale environments where a massive amount of metrics data is collected and stored, advanced configurations for the observability service are probably required. Before you update the values of those advanced configurations, make sure that you understand how the related observability components behave, and then set the values based on your requirements.