As edge devices proliferate and the needs of AppSRE and admin teams grow to accommodate them, new challenges arise in gaining visibility into the health of these environments. As the scale grows, so does the complexity of administering the fleet and viewing it holistically, from the data center to the edge.

First, let's start with a bit of table-setting

With Red Hat Advanced Cluster Management for Kubernetes (RHACM) version 2.4 and later, Red Hat provides centralized observability of the fleet. It focuses primarily on cluster health metrics that describe control plane health, cluster optimization, and cluster utilization. For example, admins can see API latency across the fleet and compare clusters for CPU and memory underutilization.

In addition, alerts are configured for centralized management, so responders are engaged directly in the tools they already use, such as Slack and PagerDuty. Specific alert rules can be put in place to ensure that only critical alerts fire into the appropriate channels.

These capabilities provide the starting point for your Observability experience in RHACM. While that starting point is robust and feature-rich, many customers have asked for the ability to build their own dashboards (we provided), customize the allowList for cluster metrics (we provided), and expose Service Level Objectives (we provided). We would be remiss not to mention that these capabilities also extend to OpenShift Container Platform (OCP) 3.11 and 4.x, along with Amazon EKS, Google Cloud GKE, Microsoft Azure AKS, and IBM Cloud IKS.

We didn't want to stop there

Customers are moving workloads closer to their users and taking advantage of edge computing to enrich their customers' experiences and drive higher satisfaction. Our customers also started to home in on a specific use case: monitoring a single node OpenShift (SNO) cluster that runs a highly specialized container workload, and doing all of that with a single monitoring instance. See Meet single node OpenShift for more information.

Edge devices are generally resource-constrained and do not have the same access to elastic compute and memory resources that might be available, for example, in a public or private cloud. With that in mind, customers do not want to give up any additional compute and memory to the infrastructure operators running within the cluster. We now offer the ability to surface both platform and user workload metrics with one monitoring stack.

Taking note of this requirement, RHACM and OpenShift Monitoring worked together to ensure our customers can monitor their workloads on single node OpenShift by leveraging the on-cluster Prometheus and a ServiceMonitor that pushes the necessary workload metrics through the platform. Admins and developers can take advantage of the custom allow list to centrally collect these metrics at the hub, use PromQL to explore them, and even make use of a sandbox Grafana environment to build custom visualizations.
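To make the ServiceMonitor step a little more concrete, here is a minimal Python sketch that builds a ServiceMonitor manifest as a plain dictionary and composes a PromQL expression for exploring the collected metric at the hub. The apiVersion and kind are the standard Prometheus Operator values. Everything else is a hypothetical placeholder chosen for illustration: the workload name edge-app, the namespace edge-workload, the port name metrics, the metric edge_app_requests_total, and the assumption that series collected at the hub carry a cluster label. Treat it as a sketch under those assumptions, not as the configuration RHACM requires.

```python
import json


def service_monitor(name, namespace, app_label, port_name, interval="30s"):
    """Build a ServiceMonitor manifest as a plain dict.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select the Service that fronts the workload's metrics endpoint.
            "selector": {"matchLabels": {"app": app_label}},
            # Scrape the named Service port at the given interval.
            "endpoints": [{"port": port_name, "interval": interval}],
        },
    }


def per_cluster_rate(metric, window="5m"):
    """Compose a PromQL expression: per-cluster rate of a counter metric.

    Assumes (not confirmed by the article) that hub-side series carry a
    'cluster' label identifying the managed cluster.
    """
    return f"sum by (cluster) (rate({metric}[{window}]))"


manifest = service_monitor(
    name="edge-app",            # hypothetical workload name
    namespace="edge-workload",  # hypothetical namespace
    app_label="edge-app",
    port_name="metrics",        # hypothetical Service port name
)
query = per_cluster_rate("edge_app_requests_total")  # hypothetical metric

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print(query)
```

The printed JSON can be applied to a cluster like any other manifest, and the metric name would also need to be added to the custom allow list before it shows up at the hub.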

But wait, there's more!

Starting with the release of RHACM 2.5, the following features were introduced to further enrich the edge observability experience:

  • Dynamic metrics for SNO clusters: Dynamic metrics collection supports automatic metric collection based on certain conditions. By default, a SNO cluster does not collect pod and container resource metrics. Once a SNO cluster reaches a specific level of resource consumption, the defined granular metrics are collected dynamically. When the cluster resource consumption is consistently less than the threshold for a period of time, granular metric collection stops.

  • Export metrics to external endpoints: Export monitoring metrics into existing corporate monitoring tools. By exporting Kubernetes cluster metrics from RHACM into streaming platforms like Kafka, and combining them with other system metrics such as event management logs, audit logs, and SNMP logs, customers get a complete picture for security and troubleshooting applications.

  • Arm (ARM, aarch64, Advanced RISC Machines) support: Did we mention you can do it all for Arm64 managed clusters, too? Even the hub is capable of running on Arm, further extending the management reach into lower-power, lower-cost consumption models.
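The start-and-stop behavior described for dynamic metrics collection can be pictured as a small state machine. The Python sketch below is a toy model only, not RHACM's implementation: the threshold value, the use of a count of consecutive samples to stand in for "a period of time", and the names DynamicCollector, cooldown_samples, and run are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DynamicCollector:
    """Toy model of threshold-triggered metric collection (illustrative only)."""

    threshold: float       # hypothetical usage level that turns granular collection on
    cooldown_samples: int  # hypothetical count of consecutive low samples before turning it off
    collecting: bool = False
    low_streak: int = 0

    def observe(self, usage: float) -> bool:
        """Feed one resource-usage sample; return whether granular metrics are collected."""
        if usage >= self.threshold:
            # Consumption reached the threshold: start (or keep) collecting.
            self.collecting = True
            self.low_streak = 0
        elif self.collecting:
            # Below the threshold while collecting: count towards the cooldown.
            self.low_streak += 1
            if self.low_streak >= self.cooldown_samples:
                # Consistently low for the whole cooldown: stop collecting.
                self.collecting = False
                self.low_streak = 0
        return self.collecting


def run(samples, threshold=0.7, cooldown_samples=3):
    """Replay a list of usage samples and return the collection state after each."""
    collector = DynamicCollector(threshold, cooldown_samples)
    return [collector.observe(s) for s in samples]


if __name__ == "__main__":
    print(run([0.2, 0.8, 0.5, 0.5, 0.9, 0.4, 0.4, 0.4, 0.3]))
```

Note that a brief dip below the threshold does not stop collection. Only a full run of low samples does, which is the "consistently less than the threshold for a period of time" condition in miniature.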

What's next?

As we continue to iterate on these features and fine-tune their capacity and scale, expect us to deliver enriched capabilities that help admins and developers establish "right-sized" requests and limits for their applications, further ensuring that edge workloads use only what they need in these ever more constrained environments. Monitoring for Hosted Control Planes (HyperShift) and additional deliverables in User Workload Monitoring will help you and your teams successfully observe and manage the platforms and applications in your growing estate of clusters.