Topology Spread Constraints

OpenShift Monitoring is the monitoring and observability stack of OpenShift, built on top of the Kubernetes container orchestration platform. It provides monitoring and alerting capabilities that let you track the health and performance of your applications running on OpenShift.

Since OpenShift 4.10, the monitoring component replicas are deployed with hard anti-affinity. This avoids the risk of a single node outage disrupting the cluster's monitoring functionality.
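To make "hard anti-affinity" concrete, here is a minimal sketch of how such a rule is expressed in a Kubernetes pod spec. It is not the actual manifest shipped with the monitoring stack; the label used in the selector is an assumption chosen for illustration.

```yaml
# Illustrative pod spec fragment (not the monitoring stack's real manifest).
# "required..." makes the rule hard: the scheduler refuses to place a pod
# on a node that already runs a pod matching the selector.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus   # assumed label
      topologyKey: kubernetes.io/hostname      # one matching pod per node
```

Because the topology key is the node hostname, two replicas can never share a node, so a single node outage takes down at most one replica.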

In OpenShift Monitoring 4.12, you can specify topology spread constraints for Prometheus, Alertmanager, and Thanos Ruler in addition to the existing hard anti-affinity settings. Topology spread constraints let you express more complex rules for the placement of these components on your cluster. For example, you might want to distribute Prometheus instances across different failure domains in your cluster to further reduce the risk of a single point of failure. You can specify topology spread constraints using the openshift_monitoring_prometheus_topology_spread_constraints, openshift_monitoring_alertmanager_topology_spread_constraints, and openshift_monitoring_thanos_ruler_topology_spread_constraints variables in your OpenShift Monitoring installation configuration.

Overall, the ability to specify topology spread constraints can help improve the resiliency and availability of your monitoring and alerting infrastructure.

By using topology spread constraints, you can control the placement of pods across your cluster in order to achieve various goals. For example, you can use topology spread constraints to distribute pods evenly across different failure domains (such as zones or regions) in order to reduce the risk of a single point of failure. This can improve the resiliency of your applications and infrastructure.

Pod placement also affects network latency. Topology spread constraints distribute pods across domains; if your goal is instead to keep pods that communicate frequently in the same zone or region in order to minimize network latency, pod affinity is the mechanism designed for that kind of co-location.

Overall, topology spread constraints provide you with a powerful tool for controlling the placement of pods within your cluster, which can help you optimize the performance and reliability of your applications.

Affinity

In OpenShift Observability, you can use affinity and topology constraints to control the placement of pods within your cluster. This can help you optimize the performance and reliability of your applications.

The central element of a topology spread constraint definition is the topology key. The topology key is a node label that associates a node with a particular facet of a cluster's topology. We recommend using well-known label names such as kubernetes.io/hostname, topology.kubernetes.io/zone, and topology.kubernetes.io/region, but any label will work. All nodes that have the same value for a particular topology key are considered to be in the same domain.
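As a small illustration of how topology keys define domains, consider the following sketch of three nodes. The node names and label values are hypothetical.

```yaml
# Hypothetical nodes. With topologyKey topology.kubernetes.io/zone,
# node-a and node-b fall into the same domain (zone-1) and node-c into
# another (zone-2). With topologyKey kubernetes.io/hostname, each node
# is its own domain.
apiVersion: v1
kind: Node
metadata:
  name: node-a
  labels:
    kubernetes.io/hostname: node-a
    topology.kubernetes.io/zone: zone-1
---
apiVersion: v1
kind: Node
metadata:
  name: node-b
  labels:
    kubernetes.io/hostname: node-b
    topology.kubernetes.io/zone: zone-1
---
apiVersion: v1
kind: Node
metadata:
  name: node-c
  labels:
    kubernetes.io/hostname: node-c
    topology.kubernetes.io/zone: zone-2
```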

The label selector field specifies which existing pods are counted when a new pod is scheduled. Beyond that, only two more details must be specified: what the scheduler should do if it cannot satisfy the constraint (whenUnsatisfiable) and how much imbalance between domains the scheduler should tolerate (maxSkew).

# Ensure that Alertmanager instances are evenly distributed across two failure domains (e.g., two different zones)
openshift_monitoring_alertmanager_topology_spread_constraints:
- topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  maxSkew: 1
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - alertmanager

# Ensure that Thanos Ruler instances are evenly distributed across three failure domains (e.g., three different regions)
openshift_monitoring_thanos_ruler_topology_spread_constraints:
- topologyKey: topology.kubernetes.io/region
  whenUnsatisfiable: DoNotSchedule
  maxSkew: 1
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - thanos-ruler

In these examples, the topologyKey field specifies the infrastructure level at which the topology spread constraint is applied (e.g., hostname, zone, region). The whenUnsatisfiable field specifies what should happen when the constraint cannot be satisfied: DoNotSchedule means that the pod is left unscheduled, while ScheduleAnyway tells the scheduler to place the pod anyway and only prefer placements that reduce the imbalance. The maxSkew field specifies the maximum allowed difference in the number of matching pods between any two topology domains. Finally, the labelSelector field selects the pods that are counted when the skew is calculated.
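The examples above use installation variables. On clusters where monitoring is configured through the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace, the same four fields might be set as in the following sketch. Treat it as an illustration under assumptions rather than a definitive configuration: the pod label and the replica count mentioned in the comments are assumed, so verify them against your cluster before use.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      topologySpreadConstraints:
      # Assuming 2 Prometheus replicas and 2 zones:
      #   one pod per zone       -> skew 1 - 1 = 0 (allowed)
      #   both pods in one zone  -> skew 2 - 0 = 2 (exceeds maxSkew: 1)
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus   # assumed pod label
```

With DoNotSchedule, a replica stays pending if no zone can accept it without exceeding the skew. Switching to ScheduleAnyway trades strict balance for the guarantee that the pod is still placed.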

Other updates for OpenShift Monitoring 4.12

In OpenShift Monitoring 4.12, admins can create new alerting rules based on platform metrics. This feature is available as a Technology Preview, which means that it is still under development and may change in future releases.

Having the ability to create alerting rules based on platform metrics can be very useful for improving the management of alert rules. It allows admins to set up alerts that are triggered by specific metric values, which can help them detect and troubleshoot issues more quickly. This can be especially useful for monitoring the health and performance of applications running on OpenShift.
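As a rough sketch of what such a rule could look like, the fragment below assumes the Technology Preview exposes an AlertingRule custom resource in the monitoring.openshift.io/v1alpha1 API group, created in the openshift-monitoring namespace. The resource name, group name, alert name, expression, and thresholds are all hypothetical; check the release notes for the exact API before relying on it.

```yaml
# Assumed API group/version and kind for the Technology Preview feature.
apiVersion: monitoring.openshift.io/v1alpha1
kind: AlertingRule
metadata:
  name: example-platform-alerts        # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: example-node-rules           # hypothetical group name
    rules:
    - alert: ExampleNodeNotReady       # hypothetical alert name
      # Fires when a node has reported Ready=false for 10 minutes.
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Node {{ $labels.node }} has not been Ready for 10 minutes.
```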

For more information, check out the OpenShift Container Platform 4.12 release notes.


About the authors

Roger Florén, a dynamic and forward-thinking leader, currently serves as the Principal Product Manager at Red Hat, specializing in Observability. His journey in the tech industry is marked by high performance and ambition, transitioning from a senior developer role to a principal product manager. With a strong foundation in technical skills, Roger is constantly driven by curiosity and innovation. At Red Hat, Roger leads the Observability platform team, working closely with in-cluster monitoring teams and contributing to the development of products like Prometheus, AlertManager, Thanos and Observatorium. His expertise extends to coaching, product strategy, interpersonal skills, technical design, IT strategy and agile project management.
