In my previous blog, Set up an Istio Multicluster Service Mesh with Submariner in Red Hat Advanced Cluster Management for Kubernetes, I discussed Istio multicluster service mesh with a central control plane, how it could be set up with Submariner in Red Hat Advanced Cluster Management for Kubernetes (RHACM), and how to build a central management entrance for the whole mesh.

With an Istio multicluster service mesh in RHACM, Istio can be extended further. When user workloads are injected with Istio sidecars, those sidecars act as network proxies that intercept the traffic between the services within the mesh and generate detailed telemetry data, or golden signals, about service behavior, e.g. service latency, traffic, and errors. If this data can be gathered, you can create a global view of the whole mesh. The problem is that the data lives in different managed clusters, so how can we centralize it?

That's exactly what the metrics-collector does. The metrics-collector is an important component of the Observability service in RHACM. It is deployed into each managed cluster that has the observability add-on enabled, and it collects metrics from the local Prometheus instance at a configured interval and pushes them back to Thanos on the hub cluster.

By leveraging the observability stack in RHACM, you can create an end-to-end view of traffic flow and monitoring for all services in the whole service mesh across multiple managed clusters, which empowers operators to troubleshoot, maintain, and optimize their applications. Even better, you can get almost all of this instrumentation without requiring application changes.

Before moving forward with the installation steps in this blog, let's first take a look at the architecture of observability for the multicluster service mesh:

acm-istio-multicluster-obs-integration

Prerequisites

Make sure you read the previous blog and set up a multicluster service mesh by following the instructions in that blog.

You also need to follow these instructions to enable the RHACM Observability service before you begin the installation.

Installation

After you enable the observability service in RHACM, the Thanos stack and Grafana are deployed on the hub cluster, and the metrics-collector is deployed in each managed cluster in the service mesh. Verify that the observability service is enabled by using the following commands:

$ oc --context=${CTX_HUB_CLUSTER} -n open-cluster-management-observability get sts,deployment
NAME READY AGE
observability-alertmanager 3/3 20m
observability-grafana 1/1 20m
observability-thanos-compact 1/1 20m
observability-thanos-query-frontend-memcached 3/3 20m
observability-thanos-receive-default 3/3 20m
observability-thanos-rule 3/3 20m
observability-thanos-store-memcached 3/3 20m
observability-thanos-store-shard-0 1/1 20m
observability-thanos-store-shard-1 1/1 20m
observability-thanos-store-shard-2 1/1 20m
$ oc --context=${CTX_MC1_CLUSTER} -n open-cluster-management-addon-observability get pod -l component=metrics-collector
NAME READY STATUS RESTARTS AGE
metrics-collector-deployment-9496686fc-q9v87 1/1 Running 0 20m
$ oc --context=${CTX_MC2_CLUSTER} -n open-cluster-management-addon-observability get pod -l component=metrics-collector
NAME READY STATUS RESTARTS AGE
metrics-collector-deployment-765f486b47-dnfs8 1/1 Running 0 20m

Now, let's begin the installation for this blog.

Install Istio Add-on

In this step, install the Istio add-ons onto the hub cluster so that you can get a central view of different aspects of the multicluster service mesh. Given that the observability service in RHACM already installs Grafana on the hub cluster, it can be reused to visualize service mesh metrics, so you only need to install Jaeger and Kiali. Jaeger is a distributed tracing system to monitor and troubleshoot application transactions across clusters, while Kiali is a console for the Istio service mesh that manages, visualizes, validates, and troubleshoots the mesh by monitoring traffic flow to infer the topology and report errors. Complete the following steps:

  1. Deploy Jaeger and Kiali on the hub cluster and create OpenShift Route resources for their services so that they can be accessed externally. You also need to export the Jaeger service with a ServiceExport so that it can be accessed from other managed clusters:

    oc --context=${CTX_HUB_CLUSTER} -n istio-system apply \
    -f https://raw.githubusercontent.com/istio/istio/release-1.11/samples/addons/jaeger.yaml
    oc --context=${CTX_HUB_CLUSTER} -n istio-system apply \
    -f https://raw.githubusercontent.com/istio/istio/release-1.11/samples/addons/kiali.yaml
    oc --context=${CTX_HUB_CLUSTER} -n istio-system expose svc/tracing --port http-query
    oc --context=${CTX_HUB_CLUSTER} -n istio-system expose svc/kiali --port http
    cat << EOF | oc --context=${CTX_HUB_CLUSTER} apply -n istio-system -f -
    apiVersion: multicluster.x-k8s.io/v1alpha1
    kind: ServiceExport
    metadata:
      name: zipkin
      namespace: istio-system
    EOF
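
    Note: Optionally, confirm that the exported service is reachable from a managed cluster over Submariner. The following check is only a sketch; it assumes Submariner service discovery is working and uses a temporary pod with the curlimages/curl image (any HTTP response code indicates that the cluster-set DNS name resolves and the connection succeeds):

    oc --context=${CTX_MC1_CLUSTER} -n default run zipkin-check -i --rm --restart=Never \
      --image=curlimages/curl --command -- \
      curl -s -o /dev/null -w "%{http_code}\n" http://zipkin.istio-system.svc.clusterset.local:9411/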
  2. Verify that the pod and service for Jaeger and Kiali are up and running:

    $ oc --context=${CTX_HUB_CLUSTER} -n istio-system get pod,svc
    NAME READY STATUS RESTARTS AGE
    pod/istio-ingressgateway-86464c97f5-tp7mr 1/1 Running 0 15m
    pod/istiod-98d586c48-zgt46 1/1 Running 0 15m
    pod/jaeger-5d44bc5c5d-wrx2j 1/1 Running 0 1m
    pod/kiali-fd9f88575-xn8zz 1/1 Running 0 1m

    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    service/istio-ingressgateway LoadBalancer 172.30.251.10 <pending> 15021:32440/TCP,80:31976/TCP,443:32119/TCP 15m
    service/istiod ClusterIP 172.30.93.244 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP 15m
    service/jaeger-collector ClusterIP 172.30.205.229 <none> 14268/TCP,14250/TCP,9411/TCP 1m
    service/kiali ClusterIP 172.30.140.130 <none> 20001/TCP,9090/TCP 1m
    service/tracing ClusterIP 172.30.23.103 <none> 80/TCP,16685/TCP 1m
    service/zipkin ClusterIP 172.30.75.236 <none> 9411/TCP 1m

Enable OpenShift Monitoring for Istio Traffic

By default, Istio sidecars generate metrics about the application traffic, but the metrics data is not scraped. In order for OpenShift monitoring to scrape the metrics, you need to create extra RBAC and PodMonitor resources with the following steps:

  1. Create an RHACM policy on the hub cluster. The policy creates the extra RBAC and PodMonitor resources needed to scrape the metrics in each managed cluster.
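
    The exact policy content depends on your environment. The following is only a minimal sketch: it assumes the Bookinfo workloads run in the istio-apps namespace, that OpenShift user workload monitoring is enabled on the managed clusters so that the PodMonitor is scraped, and that a PlacementRule named istio-clusters (hypothetical) already selects the managed clusters in the mesh. Adjust the names, selectors, placement, and any additional RBAC to match your setup:

    cat << EOF | oc --context=${CTX_HUB_CLUSTER} apply -n default -f -
    apiVersion: policy.open-cluster-management.io/v1
    kind: Policy
    metadata:
      name: istio-pod-monitor
    spec:
      disabled: false
      remediationAction: enforce
      policy-templates:
        - objectDefinition:
            apiVersion: policy.open-cluster-management.io/v1
            kind: ConfigurationPolicy
            metadata:
              name: istio-pod-monitor
            spec:
              remediationAction: enforce
              severity: low
              object-templates:
                - complianceType: musthave
                  objectDefinition:
                    apiVersion: monitoring.coreos.com/v1
                    kind: PodMonitor
                    metadata:
                      name: istio-proxies-monitor
                      namespace: istio-apps
                    spec:
                      # select pods that have an injected Istio sidecar
                      selector:
                        matchExpressions:
                          - key: security.istio.io/tlsMode
                            operator: Exists
                      podMetricsEndpoints:
                        - port: http-monitoring
                          path: /stats/prometheus
                          interval: 30s
    ---
    apiVersion: policy.open-cluster-management.io/v1
    kind: PlacementBinding
    metadata:
      name: istio-pod-monitor
    placementRef:
      name: istio-clusters
      kind: PlacementRule
      apiGroup: apps.open-cluster-management.io
    subjects:
      - name: istio-pod-monitor
        kind: Policy
        apiGroup: policy.open-cluster-management.io
    EOF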

  2. By default, the Istio sidecar container does not expose the http-monitoring port, which is used to serve Istio metrics. You need to edit the istio-sidecar-injector ConfigMap in the istio-system namespace of the hub cluster to add the http-monitoring port for Istio sidecars. Add the following containerPort entries to the istio-proxy container template in the istio-sidecar-injector ConfigMap:

    oc --context=${CTX_HUB_CLUSTER} -n istio-system edit cm istio-sidecar-injector
    ...
    containers:
    - name: istio-proxy
      ...
      ports:
      # add the following section in ports
      - containerPort: 15020
        protocol: TCP
        name: http-monitoring
      - containerPort: 15090
        protocol: TCP
        name: http-envoy-prom

    Note: There are two places in the istio-sidecar-injector ConfigMap that contain the istio-proxy container template. Make sure the container ports are added in both places.

  3. Restart the application pods in each managed cluster with the following commands, so that the new istio-proxy container configuration can be injected:

    oc --context=${CTX_MC1_CLUSTER} -n istio-apps delete pod --all
    oc --context=${CTX_MC2_CLUSTER} -n istio-apps delete pod --all
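
Optionally, spot-check one of the restarted pods to confirm that the istio-proxy container now exposes the http-monitoring port. This check assumes the Bookinfo productpage workload carries the standard app=productpage label:

oc --context=${CTX_MC1_CLUSTER} -n istio-apps get pod -l app=productpage -o yaml \
  | grep -B1 -A2 'name: http-monitoring'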

Enable Metrics Collector to Collect Istio Metrics

In order for the metrics-collector to collect the Istio metrics and push them back to Thanos on the hub cluster, you need to create a ConfigMap that contains a custom metrics allowlist in the hub cluster by using the following command:

cat << EOF | oc --context=${CTX_HUB_CLUSTER} apply -n open-cluster-management-observability -f -
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
data:
  metrics_list.yaml: |
    names:
      - istio_request_bytes_bucket
      - istio_request_bytes_count
      - istio_request_bytes_sum
      - istio_request_duration_milliseconds_bucket
      - istio_request_duration_milliseconds_count
      - istio_request_duration_milliseconds_sum
      - istio_requests_total
      - istio_response_bytes_bucket
      - istio_response_bytes_count
      - istio_response_bytes_sum
      - istio_tcp_connections_closed_total
      - istio_tcp_connections_opened_total
      - istio_tcp_received_bytes_total
      - istio_tcp_sent_bytes_total
EOF
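
After the next collection interval, the Istio metrics should be available on the hub cluster. As a quick check, you can query the Thanos query frontend directly; the following is only a sketch that assumes a local port-forward to the service:

oc --context=${CTX_HUB_CLUSTER} -n open-cluster-management-observability \
  port-forward svc/observability-thanos-query-frontend 9090:9090 &
sleep 5   # give the port-forward a moment to establish
# the result should contain istio_requests_total series from both managed clusters
curl -s 'http://localhost:9090/api/v1/query?query=istio_requests_total' | head -c 300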

Create Grafana Dashboards to Visualize Istio Metrics

Given that the focus here is on the metrics of Istio applications, create the following three dashboards:

  • Istio Mesh Dashboard - This dashboard gives the global view of the mesh along with services and workloads in the mesh.
  • Istio Service Dashboard - This dashboard gives details about metrics for the service.
  • Istio Workload Dashboard - This dashboard gives details about metrics for each workload.

Complete the following steps to create Grafana dashboards:

  1. Create the three Grafana dashboards on the hub cluster to visualize different aspects of the service mesh.
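
    In RHACM, a custom Grafana dashboard is loaded from a ConfigMap in the open-cluster-management-observability namespace that carries the grafana-custom-dashboard: "true" label and contains the dashboard JSON. The following is a minimal sketch for one dashboard; repeat it for each of the three dashboards and replace the placeholder JSON with the full dashboard definition:

    cat << EOF | oc --context=${CTX_HUB_CLUSTER} apply -n open-cluster-management-observability -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: istio-mesh-dashboard
      labels:
        grafana-custom-dashboard: "true"
    data:
      istio-mesh-dashboard.json: |
        {
          "title": "Istio Mesh Dashboard",
          "uid": "istio-mesh-dashboard",
          "panels": []
        }
    EOF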

  2. Retrieve the Grafana console address with the following command:

    MULTICLOUD_CONSOLE=$(oc --context=${CTX_HUB_CLUSTER} -n open-cluster-management get route multicloud-console -o jsonpath="{.spec.host}")
    echo "https://${MULTICLOUD_CONSOLE}/grafana/dashboards"
  3. Verify that the Grafana dashboards are created successfully by accessing the address returned in the previous step. The page should look similar to the following screen capture:

     

    istio-grafana-dashboards

Update Kiali Configuration

To make Kiali render the mesh graph with metrics data from Thanos, update the Kiali configuration by using the following command:

cat << EOF | oc --context=${CTX_HUB_CLUSTER} apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: kiali
    app.kubernetes.io/instance: kiali
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kiali
    app.kubernetes.io/part-of: kiali
    app.kubernetes.io/version: v1.38.0
    helm.sh/chart: kiali-server-1.38.0
    version: v1.38.0
  name: kiali
  namespace: istio-system
data:
  config.yaml: |
    auth:
      openid: {}
      openshift:
        client_id_prefix: kiali
      strategy: anonymous
    deployment:
      accessible_namespaces:
      - '**'
      additional_service_yaml: {}
      affinity:
        node: {}
        pod: {}
        pod_anti: {}
      hpa:
        api_version: autoscaling/v2beta2
        spec: {}
      image_name: quay.io/kiali/kiali
      image_pull_policy: Always
      image_pull_secrets: []
      image_version: v1.38
      ingress_enabled: false
      instance_name: kiali
      logger:
        log_format: text
        log_level: info
        sampler_rate: "1"
        time_field_format: 2006-01-02T15:04:05Z07:00
      namespace: istio-system
      node_selector: {}
      override_ingress_yaml:
        metadata: {}
      pod_annotations:
        sidecar.istio.io/inject: "false"
      pod_labels: {}
      priority_class_name: ""
      replicas: 1
      resources: {}
      secret_name: kiali
      service_annotations: {}
      service_type: ""
      tolerations: []
      version_label: v1.38.0
      view_only_mode: false
    external_services:
      prometheus:
        thanos_proxy:
          enabled: true
        url: "http://observability-thanos-query-frontend.open-cluster-management-observability.svc.cluster.local:9090"
      grafana:
        in_cluster_url: "http://grafana.open-cluster-management-observability.svc.cluster.local:3001"
        url: "http://grafana.open-cluster-management-observability.svc.cluster.local:3001"
      tracing:
        in_cluster_url: "http://tracing.istio-system:16685/jaeger"
      custom_dashboards:
        enabled: true
    identity:
      cert_file: ""
      private_key_file: ""
    istio_namespace: istio-system
    login_token:
      signing_key: CHANGEME
    server:
      metrics_enabled: true
      metrics_port: 9090
      port: 20001
      web_root: /kiali
EOF

Then restart the Kiali pod for the new configuration to take effect:

oc --context=${CTX_HUB_CLUSTER} -n istio-system delete pod -l app=kiali
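
You can verify that the Kiali pod came back up cleanly with the new configuration, for example:

oc --context=${CTX_HUB_CLUSTER} -n istio-system get pod -l app=kiali
oc --context=${CTX_HUB_CLUSTER} -n istio-system logs -l app=kiali --tail=20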

Visualizing Service Mesh Metrics with Grafana

  1. Send traffic to the Bookinfo application so that metrics are available in Grafana. Refresh the Bookinfo application page a few times or run the following command to generate a small amount of traffic:

    $ export GATEWAY_URL=$(oc --context=${CTX_HUB_CLUSTER} \
    -n istio-system get route istio-ingressgateway \
    -o jsonpath="{.spec.host}")
    $ for i in $(seq 1 100); do curl -s -o /dev/null "http://${GATEWAY_URL}/productpage"; done
  2. Refresh the Grafana dashboards page again at https://${MULTICLOUD_CONSOLE}/grafana/dashboards. Click Istio Mesh Dashboard to open the global view of the service mesh along with the services and workloads in the mesh. View the following screen capture:

    istio-mesh-dashboard

  3. To view details about services and workloads, navigate to their specific dashboards. For example, to view the metric details for a service, its client workloads (workloads that call the service), and its service workloads (workloads that provide the service), navigate to the Istio Service Dashboard. View the following screen capture:

    istio-service-dashboard

  4. To view the metric details for a workload, its inbound workloads (workloads that send requests to this workload), and its outbound workloads (workloads to which this workload sends requests), navigate to the Istio Workload Dashboard. It should appear similar to the following screen capture:

    istio-workload-dashboard (1)

Verifying Service Mesh Traces with Jaeger

  1. Send traffic to the Bookinfo application to verify traces in the Jaeger console. To view a trace, you need to send a number of requests that depends on the Istio sampling rate. The default sampling rate is 1%, which means you have to send at least 100 requests before the first trace is visible. To send 100 requests to the Bookinfo application, run the following command:

    for i in $(seq 1 100); do curl -s -o /dev/null "http://${GATEWAY_URL}/productpage"; done
  2. Access the Jaeger console from your browser by using the following address:

    JAEGER_HOST=$(oc --context=${CTX_HUB_CLUSTER} -n istio-system get route tracing -o jsonpath="{.spec.host}")
    echo "http://${JAEGER_HOST}"
  3. From the navigation panel of the page, select productpage.istio-apps from the Service drop-down list and click Find Traces. View the following screen capture:

     

    jeager-find-tracing
  4. Click one of the found traces to view the details corresponding to the request to the Bookinfo application:

     

    jeager-tracing-details
  5. The whole trace is composed of a set of spans, where each span corresponds to a Bookinfo service that is invoked during the request.

Visualizing Service Mesh Graph with Kiali

In this section, use Kiali to view the service graph of the entire mesh across clusters. The graph represents traffic flowing through the service mesh for a period of time. It is generated using Istio traffic metrics. Complete the following steps:

  1. Send traffic to the Bookinfo application to create enough traffic metrics for the Kiali graph. Refresh the Bookinfo application page a few times or run the following command to generate a small amount of traffic:

    for i in $(seq 1 100); do curl -s -o /dev/null "http://${GATEWAY_URL}/productpage"; done
  2. Access the Kiali console with the following address from your browser:

    KIALI_HOST=$(oc --context=${CTX_HUB_CLUSTER} -n istio-system get route kiali -o jsonpath="{.spec.host}")
    echo "http://${KIALI_HOST}"
  3. To view a namespace graph, select the Graph option in the navigation menu, and then select istio-apps from the Namespace drop-down menu. If an empty graph is shown, click the Display Idle Nodes button and make sure the correct time period is selected; the Bookinfo graph should then appear. The page should look similar to the following screen capture:

    kiali-graph

  4. To view the service mesh using different graph types, select a graph type from the Graph Type drop-down menu. To view which cluster each service is deployed in, select Cluster Boxes from the Display drop-down menu. The page should look similar to the following image:

    kiali-graph-cluster-boxes

As you can see from the previous image, the graph represents the traffic flowing through the Bookinfo application during the selected time period, and the cluster boxes show which cluster each Bookinfo service is running in.

Summary

By leveraging Istio add-ons (Jaeger and Kiali) and the observability service in RHACM, you can get global views of different aspects of the whole multicluster service mesh, which empowers operators to troubleshoot, maintain, and optimize their applications.