Imagine you’re the operator of a National or International Telecommunications Provider, charged with managing tens of thousands of OpenShift clusters at Far Edge sites across one or more countries, with active customer traffic flowing through all of them. Now imagine you need to upgrade them all. The Topology Aware Lifecycle Manager (TALM) is the tool that will help you manage the lifecycle of your fleet in an orderly and predictable way across your vast provider network.

TALM’s primary responsibility is to group and sequence fleet-wide cluster upgrades. It utilizes Red Hat Advanced Cluster Management’s (RHACM) Policy Engine and is driven by Red Hat GitOps. Your upgrade plan is captured as a declarative state and committed to git. You then schedule and trigger the upgrade process by supplying TALM with parameters for the upgrade which conform to your operational requirements. From this point TALM manages a progressive rollout of the upgrade across the fleet, based on the parameters you’ve set. The features provided by TALM address a number of critical operational scenarios:

  • You want one or more “canary” clusters to act as a final sanity check before rolling the upgrade out more widely. TALM allows you to identify canary clusters which will be updated first; any error in the canaries aborts the rollout.
  • Your operational rules require that no more than 30 clusters be affected concurrently. TALM will ensure your limits are respected by running the update as a series of batches (see the sketch after this list).
  • You only want to update a subset of your clusters for now, deferring others to a later time. TALM gives you the control to select which clusters are affected and which are not.
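
All three of these controls are expressed in the ClusterGroupUpgrade resource described in detail later in this article. The minimal sketch below shows how canaries and a concurrency limit fit together; the cluster and policy names are purely illustrative.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fleet-upgrade              # illustrative name
  namespace: default
spec:
  enable: true
  managedPolicies:
  - du-upgrade-platform-upgrade    # illustrative policy name
  clusters:                        # only the clusters listed here are affected
  - lab-canary-1
  - site-0001
  - site-0002
  remediationStrategy:
    canaries:                      # updated first; a failure here aborts the rollout
    - lab-canary-1
    maxConcurrency: 30             # at most 30 clusters updated per batch
    timeout: 240                   # minutes allowed for the whole rollout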

TALM builds on the foundation of RHACM’s Policy Engine by adding a level of control over when Policies are allowed to make changes to cluster configuration. Upgrades, or any change you want to roll out, are defined through Policies, which serve two critical purposes. First, the Policy captures the desired end state – an upgrade in this case – as a set of changes to the cluster’s configuration. Because the Policy is defined once and bound to the fleet of clusters, it is a succinct and scalable way to manage the state of your fleet. The second critical benefit of using Policy is the visibility it gives you into the state of your fleet. Using the RHACM interface you can use the compliance state to quickly see which clusters have been updated and which are still pending.
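
To make this concrete, the following is a minimal sketch of the kind of RHACM Policy TALM manages: a ConfigurationPolicy that declares the desired ClusterVersion as a musthave object. The names, namespace, and target release are illustrative; in practice, policies like this are typically generated from your GitOps repository.

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: du-upgrade-platform-upgrade   # illustrative name
  namespace: ztp-group                # illustrative namespace
spec:
  disabled: false
  remediationAction: inform           # TALM drives remediation batch by batch
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: platform-upgrade
      spec:
        remediationAction: inform
        severity: low
        object-templates:
        - complianceType: musthave
          objectDefinition:
            apiVersion: config.openshift.io/v1
            kind: ClusterVersion
            metadata:
              name: version
            spec:
              channel: stable-4.12    # illustrative channel
              desiredUpdate:
                version: 4.12.0       # illustrative target release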

You can follow the progress of your upgrade, at multiple levels of detail, through both RHACM and TALM. Through the RHACM interface you can review the compliance state of the upgrade Policy. This gives you visibility into the status of the upgrade across the fleet in aggregated views, or lets you drill down into the status and details of a single cluster. The status available directly through TALM shows how clusters are grouped into batches, progress through those batches, and the completion status of individual clusters.

As TALM rolls the upgrade out across the fleet, the progress and status of each cluster can be seen through Policy compliance in the UI. Note in this screenshot that one cluster has completed the upgrade process, as indicated by the common and upgrade policies being in compliance (green check), while the second cluster is just beginning the upgrade process. Clicking through the policies or violations gives expanded detail for individual clusters.

Screenshot of the governance GUI

The parameters for a TALM-managed rollout are configured through ClusterGroupUpgrade (CGU) CRs applied to your RHACM hub cluster. These define the progressive rollout of a set of Policies to a group of clusters. The CGU shown below includes just two clusters in the upgrade and instructs TALM to upgrade them one at a time (maxConcurrency: 1). Regardless of how many clusters in your fleet are bound to the listed set of policies, only these two clusters will be upgraded, and only one at a time. Note that these clusters are identified by name; for managing larger fleets you can also select clusters by label, as shown after the example. The CGU also contains a status section which can be used to monitor the rollout of the upgrade. You can see in the example that the upgrade is in progress and is currently being applied to cluster cnfdf19.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: update-cgu
  namespace: default
spec:
  backup: false
  clusters:
  - cnfdf19
  - cnfdf29
  enable: true
  managedPolicies:
  - du-upgrade-platform-upgrade-prep
  - du-upgrade-platform-upgrade
  - common-config-policy
  - common-subscriptions-policy
  preCaching: true
  remediationStrategy:
    maxConcurrency: 1
    timeout: 360
status:
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2023-01-24T23:01:51Z"
    message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
    reason: UpgradeNotCompleted
    status: "False"
    type: Ready
  status:
    currentBatch: 1
    currentBatchRemediationProgress:
      cnfdf19:
        policyIndex: 1
        state: InProgress
    currentBatchStartedAt: "2023-01-24T23:11:55Z"
    startedAt: "2023-01-24T23:11:54Z"
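
For larger fleets, enumerating every cluster by name quickly becomes unwieldy. The fragment below is a sketch of label-based selection; it assumes your managed clusters carry an illustrative label such as group-du-sno: "true" and that your TALM version supports the clusterLabelSelectors field.

spec:
  clusterLabelSelectors:             # select clusters by label instead of by name
  - matchLabels:
      group-du-sno: "true"           # illustrative label on the managed clusters
  managedPolicies:
  - du-upgrade-platform-upgrade
  remediationStrategy:
    maxConcurrency: 30
    timeout: 360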

TALM has a few noteworthy sub-features that further enhance the operational management of OpenShift clusters at the scale required by medium to large Telecommunications Providers.

TALM has the ability to pre-cache upgrade content on Single Node OpenShift clusters and then initiate the upgrade only after the content is locally cached. This feature provides two significant benefits when managing clusters at the edge of the network. The first, and perhaps most obvious, is that the effects of limited bandwidth to the cluster are mitigated. Pre-caching can be enabled well ahead of the actual upgrade, allowing the upgrade content to be pulled to the cluster slowly. This ensures that low-bandwidth connections do not cause the upgrade process to stretch beyond the maintenance window.

The other benefit of pre-caching is risk mitigation. With the upgrade content cached locally, the cluster is protected from transient network outages or registry availability issues that could otherwise interrupt or prolong the upgrade process.

An example of configuring TALM for pre-caching is shown below. With preCaching enabled, TALM inspects the set of policies being applied during the rollout and identifies the upgrade content. This includes both the OpenShift platform content, reflected in the platformImage status, and any day-2 Operator Subscriptions being upgraded, as shown in the operatorsPackagesAndChannels status. This ensures that the full set of content needed for the platform upgrade is local to the cluster prior to initiating the upgrade.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: update-cgu
  namespace: default
spec:
  backup: false
  clusters:
  - cnfdf19
  - cnfdf29
  enable: false
  managedPolicies:
  - du-upgrade-platform-upgrade-prep
  - du-upgrade-platform-upgrade
  - common-config-policy
  - common-subscriptions-policy
  preCaching: true
  remediationStrategy:
    maxConcurrency: 1
    timeout: 360
status:
  conditions:
  - lastTransitionTime: "2023-01-24T22:41:44Z"
    message: Precaching is completed
    reason: PrecachingCompleted
    status: "True"
    type: PrecachingDone
  precaching:
    clusters:
    - cnfdf19
    - cnfdf29
    spec:
      operatorsIndexes:
      - registry.redhat.io/redhat/redhat-operator-index:v4.11
      operatorsPackagesAndChannels:
      - sriov-network-operator:stable
      - ptp-operator:stable
      - cluster-logging:stable
      - local-storage-operator:stable
      platformImage: quay.io/openshift-release-dev/ocp-release@sha256:...
    status:
      cnfdf19: Succeeded
      cnfdf29: Succeeded
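
Note that enable is false in this example, so TALM stops once pre-caching is complete. When your maintenance window opens, you start the rollout by updating the CGU, through your GitOps flow or by patching it directly, so that the spec contains:

spec:
  enable: true    # start remediation now that the content is cached locally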

Although we don’t expect problems upgrading the platform, failures have happened frequently enough with other platforms that customers require a failed-upgrade recovery process as a failsafe in case an upgrade doesn’t succeed. TALM delivers this feature, which is activated with a single line of YAML instructing TALM to save critical state prior to upgrading a Single Node OpenShift cluster. If there is a failure, a restore script can bring the cluster back to its pre-upgrade state.

As shown in the status below, with this feature enabled, a snapshot was completed on each cluster prior to the upgrade being started.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: update-cgu
  namespace: default
spec:
  backup: true
  clusters:
  - cnfdf19
  - cnfdf29
  enable: false
  managedPolicies:
  - du-upgrade-platform-upgrade-prep
  - du-upgrade-platform-upgrade
  - common-config-policy
  - common-subscriptions-policy
  preCaching: true
  remediationStrategy:
    maxConcurrency: 1
    timeout: 360
status:
  backup:
    clusters:
    - cnfdf19
    - cnfdf29
    status:
      cnfdf19: Succeeded
      cnfdf29: Succeeded
  conditions:
  - lastTransitionTime: "2023-01-25T17:03:05Z"
    message: Backup is completed
    reason: BackupCompleted
    status: "True"
    type: BackupDone

While the features of TALM are particularly useful when managing medium to large fleets of clusters, even small-scale deployments can benefit from its additional layer of “operationalizing” features. Progressively rolling out changes, pre-caching upgrade content, and taking a pre-upgrade snapshot together help cluster administrators mitigate and manage risk regardless of the size of the fleet.

The Topology Aware Lifecycle Manager (TALM), available in OpenShift 4.12, solves many of the problems that Telecommunications Providers encounter when operating a fleet of OpenShift clusters!