In the realm of managing large networks of Single Node OpenShift (SNO) clusters at the Far Edge, the challenge of upgrading systems efficiently is a task that carries substantial weight. The intricacies of this process often involve navigating tight maintenance windows, grappling with limited bandwidth (particularly in OAM networks), and coping with high round trip packet latencies. Within this context, the importance of reducing the time consumed to upgrade clusters becomes evident.

In this blog post, we will explore how the Topology Aware Lifecycle Manager's (TALM) workload image pre-caching feature steps in as a crucial ally. By addressing the aforementioned challenges head-on, pre-caching holds the potential to significantly cut down the time required for cluster upgrades, offering a practical solution that aligns with the needs of system administrators and operators. We will also delve into the benefits that pre-caching offers and explain the process of configuring pre-caching for workload images using the newly introduced PreCachingConfig custom resource. Additionally, we will guide you through the steps to deploy the pre-caching job with TALM and demonstrate how to verify the successful pre-caching of images. Before wrapping up, we will explore some valuable troubleshooting strategies to overcome any potential obstacles along the way.

It's important to note that TALM version 4.14 brings a substantial enhancement to cluster upgrades: the incorporation of pre-caching application-specific (workload) images, which complements its existing capability to pre-cache OpenShift platform-related images. This development represents a substantial stride in considerably reducing cluster upgrade durations.

How Does Pre-Caching Facilitate Cluster Upgrades?

By proactively fetching and storing essential images in advance, workload image pre-caching reduces downtime and streamlines the cluster upgrade process. Consequently, it plays a pivotal role in accelerating cluster upgrades within a limited maintenance window by addressing several key challenges and enhancing the overall efficiency of the upgrade process, as outlined below:

  1. Smaller Maintenance Windows: With faster and more efficient upgrades, the maintenance window required for upgrades can be shorter. This minimizes operational downtime, gets critical systems back online quickly, and gives operators greater flexibility when planning upgrades, taking into account network and Far Edge site topology.
  2. Maximize Cluster Upgrades: To maintain adequate service coverage, clusters are usually not simultaneously taken offline. However, with faster cluster upgrades, it becomes possible to accommodate multiple cluster groups within a single maintenance window.
  3. Bandwidth Optimization: Pre-caching allows you to download and store necessary workload images ahead of time. This reduces the need to download large image files during the maintenance window, which can be particularly slow and resource-intensive in a constrained bandwidth environment. By eliminating the need for real-time downloads, the upgrade process becomes more bandwidth-efficient.
  4. Faster Deployment: With workload images pre-cached locally, you can deploy updates more quickly. This is because the images are readily available and don't need to be fetched from remote repositories, which might be slower due to limited bandwidth or high latency.
  5. Reduced Network Latency Impact: High round trip packet latencies can cause delays during upgrades, leading to longer upgrade times. Pre-caching eliminates much of this latency impact by ensuring that the images are already present on the managed cluster.
  6. Predictable Performance: Pre-caching creates a more predictable and controlled upgrade environment. It reduces the variability introduced by external network conditions, making it easier to estimate how long the upgrade will take and plan accordingly.
  7. Enhanced Reliability: Pre-caching helps mitigate the risk of network-related failures during upgrades. Since the images are stored locally, there's less dependency on external resources that might become unavailable or experience disruptions.

Configuring Pre-Caching via the PreCachingConfig CRD

In TALM version 4.14, a new Custom Resource Definition (CRD) named `PreCachingConfig` has been introduced to facilitate the configuration of pre-caching settings for cluster upgrades. This CRD empowers users with the ability to precisely define various configurations, including the specification of additional workload images for pre-caching. Another valuable feature provided by the PreCachingConfig CRD is the option to set a safeguard parameter, enabling users to specify the minimum available disk space required on the cluster to accommodate pre-cached images.

Furthermore, the PreCachingConfig CRD offers the capability to override and/or exclude OpenShift platform-related images. While TALM automatically identifies OpenShift platform-related upgrade images, such as OpenShift platform and operator images, this feature proves beneficial when users wish to upgrade to specific OpenShift versions or alter an operator package/channel. Additionally, there might be extraneous operator images that are irrelevant to your cluster(s). To expedite the pre-caching process and prevent resource wastage, the PreCachingConfig CRD allows you to specify which platform-related images (from the automatically derived upgrade content list) should be excluded from pre-caching.

A pre-caching configuration can be defined by creating a PreCachingConfig custom resource (CR) manifest YAML file, as illustrated in the template below. The respective fields numbered as <1> to <4> can be populated. If any of these fields are not relevant to your configuration, they may simply be omitted.

apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: exampleconfig
  namespace: exampleconfig-ns
spec:
  overrides: <1>
    platformImage: <1.1>
    operatorsIndexes: <1.2>
    operatorsPackagesAndChannels: <1.3>
  additionalImages: <2>
  excludePrecachePatterns: <3>
  spaceRequired: <4>

The fields labeled <1> to <4> are described below.

  1. The subkeys located within the `overrides` field are associated with OpenShift platform-related images. TALM automatically derives these values based on the policies of the managed clusters. Nevertheless, you have the flexibility to define custom values for these fields. Specifically, you can override any of the following platform-related content: platformImage, operatorsIndexes, and operatorsPackagesAndChannels.
  2. Specifies the list of additional workload images to be pre-cached.
  3. This field designates the images to be excluded from pre-caching. The images can be defined using basic regular expressions (BRE).
  4. This setting defines the minimum amount of disk space that must be available in the CRI-O image cache on the cluster. The spaceRequired value should take into account the disk space requirements of both the platform-related images and the supplementary workload images. If left unspecified, TALM will assign a default value covering only the OpenShift platform-related images. The disk space value should consist of an integer followed by a storage unit, for example: 500 MB, 50 GiB, or 1 TiB.

The spaceRequired field serves as a critical safeguard for the effectiveness of the pre-caching process. It is used to monitor disk space usage both before and after pre-caching, serving as a preventative measure against issues such as exceeding kubelet's `imageGCHighThresholdPercent` parameter. Crossing that threshold could trigger the unintended deletion of pre-cached images by kubelet, rendering the entire pre-caching job ineffective.
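
For reference, below is a hypothetical PreCachingConfig CR with all four fields populated. The registry paths, package names, digests, and disk size are illustrative placeholders only, not values derived from a real cluster:

apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: exampleconfig
  namespace: exampleconfig-ns
spec:
  overrides:
    platformImage: quay.io/openshift-release-dev/ocp-release@sha256:<digest>
    operatorsIndexes:
      - registry.example.com:5000/custom-redhat-operators:1.0.0
    operatorsPackagesAndChannels:
      - local-storage-operator: stable
      - ptp-operator: stable
      - sriov-network-operator: stable
  additionalImages:
    - quay.io/example/workload@sha256:<digest>
  excludePrecachePatterns:
    - aws
    - vsphere
  spaceRequired: 45 GiB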

An example PreCachingConfig CR for pre-caching the latest nginx image is shown below. Note that, for brevity, the image is specified in the example by its tag; in practice, it is recommended to reference images by digest instead.

apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: nginx-precache-config
  namespace: config-ns
spec:
  additionalImages:
    - quay.io/nginx/nginx-ingress:latest
  spaceRequired: 1 GiB

Deploying the Pre-Caching Job in Your OpenShift Environment

In this section, we will guide you through the steps required to deploy the pre-caching job within your OpenShift environment. The pre-caching job is designed as a 'one-shot' task, managed by TALM. Its purpose is to make sure that container images necessary for an upgrade are readily available on each managed cluster before the actual upgrade takes place.

Before we delve into the deployment process, it's essential to note that these steps are to be executed on the hub cluster where the TALM operator is running. TALM will then invoke the pre-caching job on one or more of the managed clusters as defined in the configuration.
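
As a quick sanity check before starting, you can confirm that TALM is installed and healthy on the hub cluster. This is a minimal sketch, assuming TALM was installed via OperatorHub; the exact CSV name and namespace can vary by release and installation method:

$ oc get csv -A | grep -i topology-aware-lifecycle-manager

The CSV phase should report `Succeeded` before you proceed.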

Step 1: Create and Apply (`oc apply -f <filename>`) a PreCachingConfig CR, as outlined in the preceding section.

Step 2: Create and Apply a ClusterGroupUpgrade (CGU) CR. When creating the CGU CR, be sure that it:

  • Specifies the clusters slated for an upgrade (`clusters`).
  • References policies that contain the required release and operator versions for this upgrade group (`managedPolicies`).
  • Enables the image pre-caching feature (`preCaching`).
  • References the PreCachingConfig resource generated in Step 1 (`preCachingConfigRef`).

Below is an example of a ClusterGroupUpgrade CR that references the previously defined PreCachingConfig CR.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: upgrade-cgu
  namespace: default
spec:
  clusters:
    - site1a1
    - site1a2
    - site1a7
  enable: false
  managedPolicies:
    - du-upgrade-platform-upgrade-prep
    - du-upgrade-platform-upgrade
    - common-config-policy
    - common-subscriptions-policy
  preCaching: true
  preCachingConfigRef:
    name: nginx-precache-config
    namespace: config-ns
  remediationStrategy:
    timeout: 360

It's worth emphasizing that the PreCachingConfig CR must be created before, or at the same time as, the CGU CR. Additionally, be sure that the PreCachingConfig CR resides in a namespace accessible to the TALM operator.
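
A quick way to confirm this, assuming the CR from the earlier example and the standard CRD plural name, is to query for the resource directly:

$ oc get precachingconfigs.ran.openshift.io nginx-precache-config -n config-ns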

TALM will attempt to reconcile the ClusterGroupUpgrade CR after it has been successfully created. The operational flow pertaining to the workload image pre-caching job in TALM follows these stages:

1. Cluster Validation: TALM initiates the reconciliation of the ClusterGroupUpgrade by verifying the designated list of clusters. If any of the specified clusters are absent, TALM will halt further progress until all required clusters are available.

2. Policy Validation: During this stage, TALM inspects the policies defined in the ClusterGroupUpgrade. It checks for missing or invalid managed policies and automatically derives the content needed for the upgrade. This encompasses both the OpenShift platform components and any day-2 operator Subscriptions set for upgrading.

3. Pre-Caching Configuration Validation: If a `preCachingConfigRef` is provided in the ClusterGroupUpgrade, TALM proceeds to retrieve and validate it. In the event that TALM cannot access the specified PreCachingConfig CR, it will report an error in the ClusterGroupUpgrade status and terminate the pre-caching task.

4. Job Execution: Once all prior validations have been successfully completed, TALM proceeds to schedule a pre-caching job on each of the previously selected managed clusters. It then awaits the completion of these jobs for the duration specified in the ClusterGroupUpgrade `remediationStrategy.timeout` field. Should the elapsed time exceed the timeout value, the CGU status is updated to reflect a pre-caching failure attributed to a timeout occurrence. Conversely, if the jobs conclude successfully within the allocated time frame, the CGU status is updated with the `PrecachingSuceeded` condition (that spelling is the literal condition type string in the API) and the corresponding `Precaching is completed for all clusters` message, as shown below:

Type:    "PrecachingSuceeded",
Status:  True,
Reason:  "PrecachingCompleted",
Message: "Precaching is completed for all clusters"

Verifying Workload Image Pre-Caching

This section offers guidance on confirming the successful pre-caching of workload images on designated clusters.

Verification on the Hub Cluster

To verify the successful completion of the pre-caching job on the hub cluster, use the following command:

$ oc get cgu upgrade-cgu -n default -o jsonpath='{.status}' | jq
{
  "conditions": [
    {
      "message": "All selected clusters are valid",
      "reason": "ClusterSelectionCompleted",
      "status": "True",
      "type": "ClustersSelected"
    },
    {
      "message": "Completed validation",
      "reason": "ValidationCompleted",
      "status": "True",
      "type": "Validated"
    },
    {
      "message": "Precaching spec is valid and consistent",
      "reason": "PrecacheSpecIsWellFormed",
      "status": "True",
      "type": "PrecacheSpecValid"
    },
    {
      "message": "Precaching is completed for all clusters",
      "reason": "PrecachingCompleted",
      "status": "True",
      "type": "PrecachingSuceeded"
    },
    {
      "message": "Not enabled",
      "reason": "NotEnabled",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "precaching": {
    "spec": {
      "platformImage": "quay.io/openshift-release-dev/ocp-release@sha256:...",
      "operatorsIndexes": [
        "registry.redhat.io/redhat/redhat-operator-index:v4.14"
      ],
      "operatorsPackagesAndChannels": [
        "sriov-network-operator:stable",
        "ptp-operator:stable",
        "cluster-logging:stable",
        "local-storage-operator:stable"
      ],
      "excludePrecachePatterns": null,
      "additionalImages": [
        "quay.io/nginx/nginx-ingress:latest"
      ],
      "spaceRequired": "35"
    },
    "status": {
      "site1a1": "Succeeded",
      "site1a2": "Succeeded",
      "site1a7": "Succeeded"
    }
  }
}

Monitor the `conditions` object from the command output to gauge the overall pre-caching job progress associated with the ClusterGroupUpgrade. For a more granular, cluster-level view, inspect the `precaching.status` object to track the pre-caching progress for each managed cluster.
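
If you are only interested in the per-cluster map, you can narrow the jsonpath of the same command. The output below simply mirrors the `precaching.status` object shown above:

$ oc get cgu upgrade-cgu -n default -o jsonpath='{.status.precaching.status}' | jq
{
  "site1a1": "Succeeded",
  "site1a2": "Succeeded",
  "site1a7": "Succeeded"
}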

It's important to note that upon successful completion of the pre-caching job, TALM automatically cleans up all associated pre-caching resources created on the managed cluster. This includes resources such as ConfigMaps, pods, jobs, and namespaces.
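
If you want to confirm the cleanup yourself, check for the pre-caching namespace from a session on the managed cluster; once cleanup has completed, the query should return a NotFound error:

$ oc get namespace openshift-talo-pre-cache
Error from server (NotFound): namespaces "openshift-talo-pre-cache" not found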

Optional Verification Steps

While these verification steps are optional, they provide valuable assurance following a `PrecachingCompleted` status reported by TALM. To further validate the success of the pre-caching job, follow these steps:

  1. Utilize the cluster's container engine: Access the cluster's container engine, and use it to inspect the locally cached images. For example, if the container engine running on the managed cluster is podman, execute the command `podman images`.
  2. Inspect cached images: Examine the list of images displayed by the above command. Look specifically for the images specified in the PreCachingConfig CR.

If you can find the specified image(s), congratulations! This confirms that the pre-caching was successful, and your workload images are pre-cached and ready for use. However, if the image(s) are not found in the list, it suggests that the pre-caching job may have encountered issues. In such cases, further examination and troubleshooting are necessary to identify and resolve the underlying problems. The subsequent section offers effective strategies for troubleshooting the pre-caching job.

As an illustrative example, here is a set of commands and the expected output for verifying the nginx image from the `nginx-precache-config` PreCachingConfig CR defined above:

$ oc debug node/site1a1
$ chroot /host/
$ sudo podman images | grep nginx-ingress
quay.io/nginx/nginx-ingress           latest      c97648faa8a0  3 weeks ago   306 MB

Troubleshooting Pre-Caching Issues

When dealing with pre-caching issues, it's essential to adopt a systematic approach that begins with the hub cluster and then extends to the designated managed clusters that have reported failures. In this section, we outline the steps for troubleshooting pre-caching problems at both the hub and managed cluster levels.

On the Hub Cluster

To initiate the investigation, start by examining the ClusterGroupUpgrade's `status.conditions` and `status.precaching` fields as shown below:

$ oc get cgu upgrade-cgu -n default -o jsonpath='{.status.conditions}' | jq
...

$ oc get cgu upgrade-cgu -n default -o jsonpath='{.status.precaching}' | jq
...

These fields may contain crucial information explaining why pre-caching failed. Look out for the following potential issues:

  • `ClusterNotFound` error: This occurs when any of the clusters listed in the ClusterGroupUpgrade cannot be found on the hub.
  • Validation errors in the ClusterGroupUpgrade: These errors can result from missing or invalid managed policies (`NotAllManagedPoliciesExist`) or issues related to the platform image (`InvalidPlatformImage`).
  • Validation errors in the pre-caching configuration (`PrecacheSpecIncomplete`): These errors can stem from an invalid specification of the overrides field or a failure to retrieve the referenced PreCachingConfig resource.
  • Failures due to timeouts (`DeadlineExceeded`): This occurs when pre-caching jobs do not complete within the specified time frame. Consider re-attempting the pre-caching jobs.
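
One way to quickly surface problem conditions from the list above is to filter for entries with a `False` status. Keep in mind that `Progressing: False` with reason `NotEnabled` is expected while the CGU still has `enable: false` set, so not every `False` entry indicates an error:

$ oc get cgu upgrade-cgu -n default -o jsonpath='{.status.conditions}' | jq '.[] | select(.status == "False")'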

If none of the above errors are encountered, proceed to inspect the `PrecachingSuceeded` condition in the ClusterGroupUpgrade status, which provides a high-level summary of the pre-caching job status. This condition is set only if TALM successfully scheduled a pre-caching job on each selected managed cluster. Refer to the table below for the possible states with their corresponding reasons and messages. Clusters that failed pre-caching can be identified from the per-cluster entries under the ClusterGroupUpgrade `status.precaching.status` object.

Type                 Status   Reason               Message
PrecachingSuceeded   True     PrecachingCompleted  Precaching is completed for all clusters
PrecachingSuceeded   True     PartiallyDone        Precaching failed for x clusters
PrecachingSuceeded   False    InProgress           Precaching in progress for x clusters
PrecachingSuceeded   False    Failed               Precaching failed for all clusters

On the Managed Cluster(s)

Investigate the failed pre-caching job within the `openshift-talo-pre-cache` namespace on the managed cluster(s). Errors related to volume mounting or worker pod scheduling can lead to job failures. Examine the job's status using the following command:

$ oc describe job pre-cache -n openshift-talo-pre-cache
Name:                     pre-cache
Namespace:                openshift-talo-pre-cache
Selector:                 controller-uid=d802215d-34e9-47eb-936e-55491f98215c
Labels:                   controller-uid=d802215d-34e9-47eb-936e-55491f98215c
                        job-name=pre-cache
Annotations:              batch.kubernetes.io/job-tracking:
                        target.workload.openshift.io/management: {"effect":"PreferredDuringScheduling"}
Parallelism:              1
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Wed, 13 Sep 2023 03:52:01 +0000
Active Deadline Seconds:  28800s
Pods Statuses:            0 Active (0 Ready) / 0 Succeeded / 1 Failed
...

Check the `Pods Statuses` row to determine whether the pre-caching worker pod completed successfully. A non-zero `Failed` count typically indicates a problem.
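
To see the actual container exit code rather than just the aggregate pod counts, you can query the pod status directly. This jsonpath sketch assumes the worker pod's container has terminated:

$ oc get pods -n openshift-talo-pre-cache -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.terminated.exitCode}{"\n"}{end}'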

For more detailed information regarding the failure, explore the logs of the pre-caching worker pod on the managed cluster(s). Look for errors related to:

  • Insufficient disk space
  • Missing pull spec image files
  • Failure to pull image(s)

Identify the relevant pod and examine its logs using the following commands:

$ oc get pods -n openshift-talo-pre-cache
NAME              READY   STATUS   RESTARTS   AGE
pre-cache-xw6h8   0/1     Error    0          10h
$ oc logs -f pre-cache-xw6h8 -n openshift-talo-pre-cache
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  3120  100  3120    0     0   138k      0 --:--:-- --:--:-- --:--:--  138k
upgrades.pre-cache 2023-09-13T16:52:53+00:00 [DEBUG]: highThresholdPercent: 85 diskSize:125293548 used:58083516
upgrades.pre-cache 2023-09-13T16:52:53+00:00 [DEBUG]: spaceRequired: 35 GiB
upgrades.pre-cache 2023-09-13T16:52:54+00:00 [DEBUG]: Release index image processing done
6d9c47bfd033912e93491f81cb84f08f9a649ba0f8a65d57ff4329d74b4b0acb
upgrades.pre-cache 2023-09-13T16:52:54+00:00 [DEBUG]: Operators index is not specified. Operators won't be pre-cached
upgrades.pre-cache 2023-09-13T16:52:54+00:00 [INFO]: Image pre-caching starting for platform-images
upgrades.pre-cache 2023-09-13T16:52:54+00:00 [DEBUG]: Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... [1/183]
...

This detailed investigation will help identify the root cause of pre-caching failures and guide further troubleshooting steps.

Wrapping it up

In the ever-evolving landscape of managing large-scale Single Node OpenShift clusters, TALM's workload image pre-caching emerges as an indispensable tool for upgrading clusters while simultaneously boosting efficiency and reliability. By proactively fetching required upgrade content and optimizing bandwidth utilization, pre-caching lays the foundation for swifter, more predictable cluster upgrades.

Within this blog post, we've explored the invaluable advantages of pre-caching, delving deep into the configuration process via the PreCachingConfig CRD. Additionally, we've furnished you with a comprehensive, step-by-step guide on deploying pre-caching jobs within your OpenShift environment. Ensuring the successful implementation of pre-caching is paramount, and we've elucidated a range of methods to validate the readiness of your workload images. Nevertheless, it's prudent to acknowledge that even with meticulous planning and configuration, occasional challenges may surface. To aid you in effectively troubleshooting pre-caching issues, we've presented a systematic approach for investigating and rectifying deployment concerns.

