In today's increasingly distributed system landscape, ensuring comprehensive application observability has become a difficult task, especially when machines and clusters are geographically dispersed. In this blog post, I'll address the complexity of this challenge and outline possible solutions using Red Hat OpenShift Distributed Tracing Data Collection (Documentation) 2.8. Cost comparison between self-hosted and SaaS solutions is not discussed here because it depends on the team size, expected data, and other factors.
What is Red Hat OpenShift Distributed Tracing Data Collection?
Red Hat OpenShift distributed tracing includes a new product for trace and observability data collection. This product is currently available as a Technology Preview, which means users can already install and try it while it is being developed and tested. We are adding more functionality every day. In the current version, 2.8, we support collecting distributed tracing data; we are also adding the ability to collect OTLP metrics in the upcoming 2.9 release.
The product itself is based on the OpenTelemetry project and consists of a specification, provided SDKs, and various collector components, making it ideal for observability data collection in distributed systems.
Why based on OpenTelemetry?
OpenTelemetry, an incubating CNCF project since 2019, is a powerful open source observability framework that standardizes the collection, aggregation, and export of telemetry data from various sources, enabling comprehensive visibility into the health, performance, and behavior of distributed applications and systems. Due to the specification and modular structure of the OpenTelemetry collector and the adaptation of various vendors, OpenTelemetry is perfectly suited for the correlation of logs, metrics and traces, as well as collection of vendor-neutral data.
Why cross-cluster observability?
Creation of dedicated observability clusters to store and analyze telemetry data from multiple clusters offers several advantages and is often essential when operating many edge (especially far edge) clusters. However, opting for and running such a solution is affected by various factors.
When you store telemetry data in a single cluster, you get a comprehensive view of the health, performance, and behavior of your infrastructure. Especially when clusters are communicating with each other, this unified view enables the identification of patterns, correlations, and anomalies across clusters, allowing for effective troubleshooting and optimisation. Transferring this data to a central location using a vendor-neutral data collection suite not only makes the lives of administrators and developers easier, it also simplifies future migrations.
Storing telemetry data in a single cluster might have some disadvantages: Such a setup introduces a single point of failure. Data privacy and compliance requirements, multiplied by the volume of incoming data, might demand more complex traffic and access management. As the infrastructure grows, scaling a single observability cluster to handle the increased data load and resource demands might become challenging.
Since OpenTelemetry allows subsequent migration between its own observability cluster and various third-party providers, I'll focus on the OpenShift solution here.
If you prefer video content to learn more about OpenTelemetry and see examples of setting up an observability environment, check out the presentation Observability at the edge: Monitoring edge environments at scale from the Red Hat Summit 2023 conference.
How to configure a telemetry toolchain
In the following sections, I'll describe how to use Red Hat OpenShift Distributed Tracing Data Collection to send telemetry data from a remote cluster and receive it securely in the desired observability cluster. For simplicity, I'm focusing here on the transmission and storing of trace data.
The following graphic shows two clusters having a separate observability namespace. In the observability cluster, an OpenTelemetry collector is operated there as a receiver and made publicly available by OpenTelemetry Operator. In remote Cluster 1, an OpenTelemetry collector is operated in the gateway mode.
Data is sent between the two clusters via a connection established by the Open Telemetry collector on the remote Cluster 1 side. To check whether the remote collector is authorized to transmit data, you can choose between different options:
- A bearer token extension, which can be used in combination with a serviceAccount with the RHOSDT 2.9 release
- A Basic HTTP authentication extension that validates a username/password combination
- Built-in mTLS that verifies an incoming connection based on provided certificates
All the examples shown below are based on OpenShift using Basic HTTP authentication and TLS. If you are searching for a guide on how to expose the OpenTelemetry Collector on Kubernetes, read this.
- Access to two OpenShift v4.10+ clusters
- A domain name pointing to an observability cluster
- Red Hat OpenShift Distributed Tracing Data Collection v2.8+ installed on both clusters
- Cert Manager Operator for OpenShift installed on the observability cluster
TIP: To be able to authenticate later, create a Secret with the username and password in both clusters:
# username is ingest base64 encoded.
# password is “c2VjZXJ0Cg==” (secret base64 encoded) base64 encoded
If you are using Red Hat OpenShift Distributed Tracing Data Collection in version 2.9 or higher, you can use the bearer token extension for authentication. The configuration is similar to the HTTP Basic Authentication extension.
It works in a simple way: This extension compares the transmitted token with the one on the collecting site. In combination with a serviceAccount token, this allows you to manage the access on the Observability cluster site.
Note: OpenShift 4.11 serviceAccount tokens have an expiry date of 1h by default, unlike the exchanged basic authentication credentials. To learn more about Long-Lived and Time-Bound tokens, check Understanding service accounts and tokens in Kubernetes.
Collecting and sending telemetry data
Start by creating the OpenTelemetry collector on remote Cluster 1. The aim is to generate trace data with the help of telemetrygen and transmit it to the OpenTelemetry Collector, which serves as a gateway.
The gateway will then enrich the trace data with additional information, such as the original cluster name (in this case Cluster 1), and securely transmit the received data to the observability cluster.
The resourcedetection processor identifies the cluster in which the collector is running by connecting to the OpenShift infrastructure API. To allow this, a service account must first be created for the collector. The collector is then assigned a ClusterRole with the required permissions:
- apiGroups: ["config.openshift.io"]
verbs: ["get", "watch", "list"]
- kind: ServiceAccount
Next, create a gateway collector with the help of the OpenTelemetryCollector CR. Assign the previously created serviceAccount to this gateway collector to retrieve infrastructure information from the OpenShift API. Since this collector serves as a gateway for enriching data and handling authentication, use the daemonset mode here. Transmitted telemetry data can be exported directly from each node. The credentials are taken from the Secret that you created in the Prerequisites section. The main part of this CR consists of the collector configuration, which is stored as a string on the config entry. The available configurations can be found in the official OpenTelemetry Collector documentation.
- name: BASIC_AUTH_USERNAME
- name: BASIC_AUTH_PASSWORD
receivers: [opencensus, otlp, zipkin]
processors: [memory_limiter, k8sattributes, resourcedetection, batch]
To simulate generation of telemetry data, create a pod that generates data and transmits it to your gateway into the "traces" pipeline.
- name: telemetrygen-traces
Receiving telemetry data
The cert-manager CRDs, which include the ClusterIssuer resource, are not available by default in OpenShift, so make sure that cert-manager is installed. Cert-manager allows you to request certificates from a Certificate Authority (CA) easily. To use those requested certificates, you only need to add the acme.cert-manager.io/http01-ingress-class: openshift-default annotation to your OpenTelemetryCollector CR.
Next, create an OpenTelemetry collector on the receiver side. The Basic HTTP authentication credentials are again taken from the Secret that was created at the beginning. With the help of the ingress entry, you can make the configured OTLP/HTTP receiver available outside of the observability cluster.
# Value "edge" indicates that encryption should be terminated at the edge router.
# Alternative options are: "insecure", "passthrough" and "reencrypt".
otlp: # NOTE: enabled for testing in the final step
Looking at the pipeline section of the OpenTelemetryCollector CR, note that only the logging exporter is used for termination. This makes it easier to check that traces are actually transmitted.
Distributing telemetry data
Finally, you can redirect transmitted telemetry data to a desired backend storage. Since we are focusing on transmission of traces in this example, an all-in-one Jaeger instance is created, which receives and visualizes all trace information.
- name: "16686"
- name: "4317"
- name: COLLECTOR_OTLP_ENABLED
- name: LOG_LEVEL
- containerPort: 16686
- containerPort: 4317
Then, replace the logging exporter in the service section of the otlp-receiver CR with the already specified otlp exporter that points to jaeger.observability.svc.cluster.local:4317.
Afterwards, make the Jaeger-UI locally accessible by running oc port-foward and visiting 127.0.0.1:16686.
oc port-forward -n observability svc/jaeger 16686:16686
The result is a view similar to this:
In this blog post, I explored using the Red Hat OpenShift Distributed Tracing Data Collection beyond its own cluster boundary. With cross-cluster observability, you gain a comprehensive understanding of your distributed systems helping you to understand the big picture of your infrastructure. I provided an example on how to enrich the data on a remote cluster, transmit it in a secure way, and prepare it for analysis.
Furthermore, we are planning to team up with Red Hat Advanced Cluster Management for Kubernetes (ACM) to help you configure, maintain consistency and visualize distributed tracing data across a fleet of clusters from a single place. Stay tuned!
Note that the Jaeger all-in-one instance is a representative for any other trace backend that can receive data via OTLP/gRPC like Tempo.