With the observability service enabled, you can use Red Hat Advanced Cluster Management for Kubernetes (RHACM) to gain insight into your managed clusters and optimize them. In this blog, I introduce some new advanced configuration options for the observability service in RHACM 2.3.
Before or after you install the observability service, you can customize it by updating the configuration in the `MultiClusterObservability` resource. In RHACM 2.3, some existing parameters were reorganized and more configuration parameters were added to `MultiClusterObservability`. The new advanced configuration parameters help you fit the observability service to your production environment better than before.
The new advanced configuration options for the observability service include the following fields:
- Resource requests and limits for observability components
- Replica size for observability components
- Retention configuration for metrics data
- Storage size for observability components
Best practices to set advanced configurations for observability service
Continue reading each component section for details on how to set the advanced configurations.
Resource requests and limits for observability components
For each observability component, you can set the requested and the maximum allowed compute resources (CPU and memory). The observability service sets default `requests` values for each component to meet that component's minimum requirements. There is no default value for `limits`, which means that each component can consume compute resources on its node without restriction. In some scenarios, an observability component can use so many resources that it affects other components on the same node, or even crashes the node. To avoid that, consider setting resource limits for the following components:
Receiver (`receive`) is one of the core observability components. It receives metrics data from all managed clusters that have observability enabled, stores the metrics temporarily, and sends them to the backend object storage. By default, it keeps recent metrics data in memory, which increases memory consumption. In one test environment with 1,000 imported single-node clusters, the `receive` component used around 80Gi of memory. Be careful when you set the memory limit for `receive`, because the pods of the `receive` component become stuck in the `Crash` status if there is not enough memory.
Querier (`query`) is also one of the core observability components. It handles query requests from the client side, and its resource consumption depends on the usage scenario. A simple query should not use many resources. A heavy query, for example one that fetches thousands of time series over a large time range, can exhaust the available resources and crash the node that the pod runs on. It is best practice to set a memory limit for `query` so that too many heavy queries cannot crash the system.
Besides the components on the hub cluster, you can also set resource requests and limits for the metrics collector (`metrics-collector`), which runs on the managed clusters. The metrics collector scrapes metrics data from the local Prometheus instance, processes the data in memory (for example, by adding labels to the metrics), and finally pushes the data to the hub cluster. A large managed cluster with many nodes and pods produces more metrics to scrape. As a result, the metrics collector on that managed cluster requires more memory to store and process the metrics.
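As a rough sketch of what this can look like, the fragment below sets requests and limits for the receiver, the querier, and the metrics collector. It assumes the `v1beta2` field layout (`spec.advanced.receive`, `spec.advanced.query`, and `spec.observabilityAddonSpec.resources`); check those names against the Observability API for your version. All values are illustrative, not recommendations.

```yaml
# Fragment of a MultiClusterObservability resource (assumed v1beta2 layout).
# All values are illustrative only; size them from your own measurements.
spec:
  advanced:
    receive:
      resources:
        requests:
          memory: 8Gi
        limits:
          memory: 16Gi   # a limit that is too low leaves the receive pods crashing
    query:
      resources:
        limits:
          memory: 4Gi    # bounds the memory that heavy queries can consume
  observabilityAddonSpec:
    resources:           # applied to metrics-collector on the managed clusters
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        memory: 512Mi
```

Because the receiver keeps recent metrics in memory, it is safer to observe its actual usage for a while before choosing a limit.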
Replica size for observability components
You can set the replica size for most observability components. By default, a component that is deployed as a Kubernetes `Deployment` has two replicas, and a component that is deployed as a Kubernetes `StatefulSet` has three replicas. As your environment scales, you can customize the replica size of specific observability components so that they can handle a different volume of workload. The following descriptions cover the components whose replica behavior needs special attention:
Receiver components always store multiple copies of the received metrics data. The number of copies is called the `receiver.replication-factor`. When the receiver `replicas` value is no more than `3`, the `replication-factor` is the same as the value of the `replicas` field. When the receiver `replicas` value is more than `3`, the `replication-factor` is `3`. In that case, each receiver replica does not store all of the received metrics data, only part of it. In a large-scale environment with a large amount of metrics data, you can increase the `replicas` value for the receiver so that each receiver replica requires less memory, because it stores only a subset of the metrics data.
Compactor is the component that is designed to compact the metrics data in the backend object storage, and downsample the data. It is generally not semantically concurrency safe, and must be deployed as a singleton against a bucket in the object storage. There is no option for you to set the replica size for the compactor.
Store gateway is the component that implements the `StoreAPI` on top of historical data in an object storage bucket; it acts primarily as an API gateway. For most components, the replica size that you specify becomes the `replicas` value of the corresponding `Deployment` or `StatefulSet`. Store gateway is the only exception. When you increase its replica size, the `replicas` value in the existing `StatefulSet` does not increase. Instead, a new `StatefulSet` is created that has only one replica and handles a different shard of the data in the bucket.
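The replica behavior described in this section can be sketched with the following fragment. It assumes the `v1beta2` field names `spec.advanced.receive.replicas`, `spec.advanced.query.replicas`, and `spec.advanced.store.replicas`; the numbers are illustrative only.

```yaml
# Fragment of a MultiClusterObservability resource (assumed v1beta2 layout).
# Replica counts are illustrative only.
spec:
  advanced:
    receive:
      replicas: 6   # more than 3, so each replica stores only part of the data
    query:
      replicas: 3
    store:
      replicas: 4   # store gateway: one single-replica StatefulSet per shard
```

There is deliberately no entry for the compactor, because it must run as a singleton.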
Retention configuration for metrics data
In previous versions, the observability service provided some retention-related configuration options. In RHACM 2.3, those parameters are reorganized and moved to the `spec.advanced.retentionConfig` section, and new ones are added. The following descriptions cover the most important retention-related parameters:
`retentionResolutionRaw`, `retentionResolution5m`, and `retentionResolution1h`: These parameters represent the amount of time to retain raw metrics data, 5-minute resolution data, and 1-hour resolution data, respectively. These parameters also exist in previous RHACM versions, but they have new default values in RHACM 2.3: `retentionResolutionRaw: 30d` (30 days), `retentionResolution5m: 180d` (180 days), and `retentionResolution1h: 0d` (the `0d` value means that the 1-hour resolution data is never removed).
`retentionInLocal`: This parameter represents the amount of time to retain raw metrics data locally in observability components such as the receiver. The default value is `24h` (24 hours). Local retention helps avoid data loss. For example, if a temporary issue blocks the receiver from pushing data to the object storage, the data is kept in the receiver and is pushed once the issue is fixed, provided that the fix happens within 24 hours.
`deleteDelay`: This parameter represents the amount of time between a block being marked for deletion and the block being deleted from your bucket. The default value is `48h` (48 hours). If the value is non-zero, blocks are first marked for deletion, and the compactor component deletes the marked blocks from the bucket after the delay. If the value is zero, blocks are deleted immediately. Note that deleting blocks immediately can cause query failures in the following scenarios: if the store gateway still has the block loaded, or if the compactor is ignoring the deletion because it is compacting the block at the same time.
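Put together, the retention parameters described above sit under `spec.advanced.retentionConfig`. The fragment below is a minimal sketch that simply spells out the default values quoted in this section, so applying it should not change behavior; adjust individual values as needed.

```yaml
# Fragment of a MultiClusterObservability resource.
# The values shown are the RHACM 2.3 defaults described in this section.
spec:
  advanced:
    retentionConfig:
      retentionResolutionRaw: 30d   # raw metrics data
      retentionResolution5m: 180d   # 5-minute resolution data
      retentionResolution1h: 0d     # 0d means the data is never removed
      retentionInLocal: 24h         # local retention in components such as the receiver
      deleteDelay: 48h              # wait before marked blocks are deleted
```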
Storage size for observability components
There are new configuration options for the storage used by the observability components in RHACM 2.3. For each observability component that is a `StatefulSet`, such as the receiver, you can specify the size of the persistent volume that is mounted to the component's pods. The receiver and compactor components use the most storage, so focus mainly on those components when you set the storage size. Those two components need large amounts of storage to hold the metrics data. Their default storage size is `100Gi`. You might need to increase that value when you handle a large amount of metrics.
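As a small sketch, the storage sizes for the receiver and compactor could be raised as follows. The field names `receiveStorageSize` and `compactStorageSize` under `spec.storageConfig` are assumed from the `v1beta2` API, and the sizes are illustrative only.

```yaml
# Fragment of a MultiClusterObservability resource (assumed v1beta2 layout).
# Sizes are illustrative; the default for both components is 100Gi.
spec:
  storageConfig:
    receiveStorageSize: 200Gi
    compactStorageSize: 200Gi
```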
Deploy or update observability service with advanced configurations
To deploy the observability service in RHACM 2.3, you need to create the `MultiClusterObservability` custom resource as mentioned earlier. You can also set more advanced configuration options in this custom resource now. The following YAML sample file deploys observability with some advanced configurations:
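A minimal sketch of such a resource might look like the following. It assumes the `v1beta2` API version and field names, and an object storage secret named `thanos-object-storage` with the key `thanos.yaml`; that secret name and all sizes, replica counts, and retention values are illustrative assumptions, not recommendations.

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec: {}
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage   # assumed secret name for the object storage config
      key: thanos.yaml              # assumed key inside that secret
    receiveStorageSize: 200Gi       # storage size for the receiver
    compactStorageSize: 200Gi       # storage size for the compactor
  advanced:
    retentionConfig:
      retentionResolutionRaw: 60d   # illustrative change from the 30d default
      retentionInLocal: 48h         # illustrative change from the 24h default
    receive:
      replicas: 6
      resources:
        limits:
          memory: 32Gi
    query:
      replicas: 3
      resources:
        limits:
          memory: 4Gi
```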
In the previous YAML sample, the replica size and resources are defined for the receiver and querier, the storage size is defined for the receiver and compactor, and the values of some retention parameters are updated. The sample does not contain all possible advanced options; refer to the Observability API for details about all supported configurations. You can also update the observability service after deployment. The updated configurations are applied to the running observability components.
The default configurations for the observability service fit most environments. In some cases, such as large-scale environments where large volumes of metrics data are collected and stored, advanced configuration of the observability service is probably required. Before you update the values of those advanced configurations, investigate how the relevant observability components behave in your environment, and then set values based on your requirements.