The Five Pillars of Red Hat OpenShift Observability
We are pleased to announce additional Observability features arriving as part of the OpenShift Monitoring 4.14, Logging 5.8, and Distributed Tracing 2.9 releases. Red Hat OpenShift Observability’s plan continues to move forward as our teams tackle key data collection, storage, delivery, visualization, and analytics features with the goal of turning your data into answers.
What are the problems you can now solve with Red Hat OpenShift Observability?
The Distributed Tracing platform 2.9 takes the OpenTelemetry collector operator Technology Preview to a whole new level. For the first time, it enables the collection of OpenTelemetry Protocol (OTLP) metrics, and it lets users collect traces and metrics from remote clusters over OTLP/HTTP(s). Additionally, the operator’s abilities have been expanded to support upgrades, monitoring, and alerting of the OpenTelemetry collector instances themselves, as it has been promoted to capability level 4 (Deep Insights). Managed and unmanaged states are now supported as well.
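As a rough sketch of what OTLP metrics collection can look like, the following OpenTelemetryCollector resource enables the OTLP receiver over gRPC and HTTP. The resource name and the logging exporter are illustrative placeholders; verify the exact fields against your operator version:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel               # illustrative name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:            # OTLP/gRPC, default port 4317
          http:            # OTLP/HTTP(s), default port 4318
    exporters:
      logging: {}          # placeholder exporter for this sketch
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [logging]
```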
Customizable Alerting Rules for Admins
Soon, admins will be able to create new alerting rules based on any platform metrics, that is, metrics exposed in namespaces with an openshift- or kube- prefix and in the default namespace. Additionally, admins can now clone existing rules, simplifying rule creation, and modify any existing alerting rule. All of this has been developed to address a clear need: enabling administrators to enrich the OpenShift Container Platform with rules tailored to their unique environments.
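To illustrate, here is a minimal sketch of a platform alerting rule using the AlertingRule resource in the openshift-monitoring namespace. The rule name, expression, and threshold are hypothetical; check the 4.14 documentation for the exact schema:

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: example-platform-alerts   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: node-rules
    rules:
    - alert: HighNodeCount        # hypothetical alert
      expr: count(kube_node_info) > 50
      for: 10m
      labels:
        severity: warning
      annotations:
        description: The cluster has grown beyond 50 nodes.
```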
Advanced Customizations for Node-Exporter Collectors (Phase 2)
Our Node Exporter customization efforts are taking the next step forward. Users will get on/off switches for several collectors, including systemd, hwmon, mountstats, and ksmd. Alongside these options, we are introducing general node-exporter settings, one of which is maxProcs.
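As a sketch, enabling optional collectors and setting maxProcs could look like the following cluster-monitoring-config ConfigMap. Treat the exact keys as an assumption to verify against the OpenShift 4.14 monitoring configuration reference:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        systemd:
          enabled: true
        mountstats:
          enabled: true
        ksmd:
          enabled: true
      maxProcs: 1   # limit node-exporter to a single CPU
```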
Scrape Profiles in CMO (Technology Preview)
To offer greater flexibility and optimization, we are introducing optional scrape profiles for service monitors in CMO (the Cluster Monitoring Operator). This lets admins influence the volume of metrics collected by the in-cluster stack, improving CMO's scaling behavior in both small and large environments. The central idea behind this change is the ability to discard non-essential metrics, giving admins deeper control over the gathered data.
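A minimal sketch of opting into the lighter profile, assuming the collectionProfile key described for the Technology Preview:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: minimal   # "full" (default) or "minimal"
```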
Specifying Resource Limits for All Components
In our next release, users will have expanded capabilities to specify resource requirements. While they can currently set limits for components like Prometheus, Alertmanager, Thanos Querier, and Thanos Ruler, we're extending this capability to other vital components such as Node exporter, Kube state metrics, OpenShift state metrics, prometheus-adapter, prometheus-operator, admission webhook, and telemeter-client.
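For example, setting requests and limits for node-exporter could look like the following; the component key follows the existing cluster-monitoring-config pattern, and the values are purely illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      resources:
        requests:
          cpu: 20m      # illustrative values; size for your cluster
          memory: 50Mi
        limits:
          cpu: 50m
          memory: 100Mi
```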
Extend User-Customizable TopologySpreadConstraints to All Relevant Pods
A significant update to our platform allows users to configure TopologySpreadConstraints for all pods deployed by CMO. This includes prometheus-adapter, openshift-state-metrics, telemeter-client, thanos-querier, UWM Alertmanager, UWM Prometheus, UWM Thanos Ruler, prometheus-operator, kube-state-metrics, and the config reloader.
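As a sketch, spreading Prometheus pods across availability zones could look like this; the label selector is illustrative and the constraint format follows the standard Kubernetes TopologySpreadConstraint schema:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus   # illustrative selector
```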
Distributed Tracing 2.9 comes with several enhancements for Tempo, our new distributed tracing storage that will soon reach General Availability (GA). Tempo is a scalable, distributed tracing storage solution that can store and query traces from large-scale microservices architectures. For now, in Technology Preview, Tempo has been expanded to ingest and store distributed traces in the following protocols when using the Distributor service: Jaeger Thrift binary, Jaeger Thrift compact, Jaeger gRPC, and Zipkin. As with the OpenTelemetry operator, we have worked to bring the Tempo operator to capability level 4 (Deep Insights).
In the same way as we did for the OpenTelemetry collector, the TempoStack custom resource now supports both managed and unmanaged states. We have also been working on the Tempo Gateway, a separate component deployed through the operator that fronts both data ingestion (supporting OTLP gRPC) and the Query Frontend service, providing authentication and authorization capabilities. In addition, we have expanded the multitenancy experience so it can be used without the Gateway.
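A sketch of a TempoStack with a managed state and the Gateway enabled might look as follows. The managementState and template.gateway fields, the resource name, and the storage secret are assumptions to verify against the TempoStack API reference:

```yaml
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: sample-tempo          # illustrative name
spec:
  managementState: Managed    # or Unmanaged to pause reconciliation
  storageSize: 10Gi
  storage:
    secret:
      name: object-storage    # hypothetical secret holding S3 credentials
      type: s3
  template:
    gateway:
      enabled: true           # deploy the Tempo Gateway for authN/authZ
```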
Logging 5.8 brings quite a few improvements to Loki log storage. Customers run the OpenShift Logging stack on clusters that span multiple availability zones, and since our new Loki-based stack has built-in support for zone-aware data replication, we are making this available in Logging 5.8. With this new feature, data ingestion for all tenants spans the availability zones, and if an availability zone fails, some query capabilities remain available.
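As a sketch of zone-aware replication on a LokiStack, the replication stanza below spreads replicas across zones using the standard zone topology key. The replication field names, secret, and storage class are assumptions to verify against the Logging 5.8 LokiStack API reference:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small
  storage:
    secret:
      name: logging-loki-s3   # hypothetical object-storage secret
      type: s3
  storageClassName: gp3-csi   # illustrative storage class
  replication:
    factor: 2                 # replicate data across zones
    zones:
    - topologyKey: topology.kubernetes.io/zone
      maxSkew: 1
```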
Also as of Logging 5.8, we are introducing cluster-restart hardening and reliability hardening for Loki. These features increase Loki's availability and reliability: when clusters restart, Loki keeps operating and then recovers without the need for manual intervention. Loki is also more aware of node placement, so no critical components share the same node, and customers can tune their affinity/anti-affinity rulesets for Loki.
In case you were wondering where all those collected OTLP metrics are delivered, users can now, in Technology Preview, choose to forward them via OTLP/HTTP(s) or OTLP/gRPC, or store them in user-workload monitoring via the Prometheus exporter. The new version of the OpenShift distribution of the OpenTelemetry Collector included in Distributed Tracing 2.9 also ships the resourcedetection and k8sattributes processors, which detect resource information from the host and append it to, or override values in, your telemetry data. This gives users the power to enrich data on demand with a small configuration change: the collector queries the OpenShift and Kubernetes APIs to retrieve resource attributes such as cloud.provider, cloud.platform, cloud.region, and k8s.cluster.name, and adds them to your OpenTelemetry signals.
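Putting those pieces together, a collector sketch that enriches metrics with both processors and fans them out to an OTLP/HTTP backend and a Prometheus endpoint might look like this. The backend endpoint and resource name are hypothetical:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics           # illustrative name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      resourcedetection:
        detectors: [openshift]   # query the OpenShift API for resource attributes
      k8sattributes: {}
    exporters:
      otlphttp:
        endpoint: https://otel-backend.example.com:4318   # hypothetical backend
      prometheus:
        endpoint: 0.0.0.0:8889   # scrapeable by user-workload monitoring
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [resourcedetection, k8sattributes]
          exporters: [otlphttp, prometheus]
```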
In Logging 5.8, one of our most exciting new features is the ability to create multiple log forwarders. This feature allows a ClusterLogForwarder to be created in any namespace, and also allows multiple, isolated ClusterLogForwarder instances so that independent groups can forward their choice of logs to their choice of destinations. With this feature, users can control their own log forwarding configurations separately, without interfering with other users’ log forwarding configurations.
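A sketch of a per-team forwarder in its own namespace follows. The names, namespace, service account, and Loki URL are all hypothetical; the serviceAccountName field is assumed from the multi-forwarder design and should be checked against the Logging 5.8 documentation:

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: team-a-forwarder        # hypothetical name
  namespace: team-a             # any namespace, not just openshift-logging
spec:
  serviceAccountName: team-a-logs   # service account with log-collection permissions
  inputs:
  - name: team-a-apps
    application:
      namespaces: [team-a]
  outputs:
  - name: central-loki
    type: loki
    url: https://loki.example.com   # hypothetical destination
  pipelines:
  - name: forward-team-a
    inputRefs: [team-a-apps]
    outputRefs: [central-loki]
```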
As highlighted in our previous release blogs, we continue to enhance the user navigation and overall functionalities of the OpenShift Web Console - our central Observability visualization tool. Our goal is not only to simplify navigation, but also to empower you, as a user, to reduce the time spent troubleshooting individual clusters and navigate the increasing amount of data/signals. In an effort to refine the monitoring console experience, the console team has transitioned the Monitoring features into an optional plugin for the console. We'll be ensuring these resources are deployed through CMO, making monitoring console pages visible whenever CMO is present. Also from a Monitoring perspective, with OpenShift 4.14 you can benefit from a brand new Silences tab in the Developer perspective of the OpenShift Web Console. Thanks to this new feature, as a developer, you will be able to directly manage alert silences and now also expire them in bulk - automatically minimizing silence noise. Both functionalities are introduced in the video below:
With Logging 5.8, we have a series of great features. Firstly, you will be able to benefit from log-based alerts in the Developer perspective of the OpenShift Web Console. With Logging 5.7, those were in fact made available in the Admin perspective. In addition to this, developers can benefit from searching patterns across multiple namespaces. This new functionality will allow users to reduce time spent troubleshooting and track problems down within different services. Take a look at the feature described in the video below:
Within Logging 5.8, we are introducing Loki dashboards so that users have visual insight into the performance and health of their log storage. Finally, from the Developer perspective of the Web Console, users will be able to search logs, and thus patterns, across all namespaces, making it easier to debug applications.
With Logging 5.8, we are glad to announce that you will be able to benefit from a first correlation experience directly in the OpenShift Web Console, as a Dev Preview feature. As first introduced at KubeCon Europe 2023, the Red Hat Observability team has been working on korrel8r, an open source project that aims to make correlation across observability signals accessible to everyone. How can correlation benefit you? By quickly jumping from one observability signal to another, you will reduce the time spent troubleshooting individual clusters and identifying issues. The good news is that we have integrated korrel8r into our OpenShift Observability experience, meaning you can quickly switch from an alert to its equivalent log, or from a log to its equivalent metric. From now on, identifying problems will be only a few clicks away!
What are we planning next for Observability?
Our Observability stack is expanding. We are aware of the importance of shedding light on a variety of different metrics, including sustainability. That is why we are glad to announce that Power Monitoring for Red Hat OpenShift (based on Kepler) will soon be available as a Dev Preview, and we can’t wait to hear your feedback.
We are also working hard to enhance our OpenTelemetry support for different use cases. The OpenTelemetry Operator will reach GA very soon, helping our users avoid vendor lock-in and modernize their observability stack.
Distributed Tracing will not only deliver a GA version of Tempo very soon, but also improve core tracing functionality with RED metrics, auto-instrumentation capabilities, ARM support, and more.