Introduction

Red Hat Advanced Cluster Management for Kubernetes (RHACM) defines two main types of clusters: hub clusters and managed clusters. The hub cluster is the main cluster with RHACM installed on it. You can create, manage, and monitor other Kubernetes clusters with the hub cluster.

The managed clusters are Kubernetes clusters that are managed by the hub cluster. You can create some clusters by using the RHACM hub cluster, and you can also import existing clusters to be managed by the hub cluster.

A question that often comes up is: What happens when the hub cluster is unavailable?

In that situation the managed clusters keep running, but features that depend on the hub cluster, such as alerting or policy-driven cluster updates, no longer work properly. This is unacceptable for long periods of time. When the hub cluster becomes unavailable, you need a recovery plan to decide whether it can be recovered, or whether a new hub cluster should be deployed and the data restored on it.

This blog covers the situation where the hub cluster becomes unavailable and how it can be recovered. It presents the backup and restore component that is available with RHACM, which implements the solution for recovering hub clusters. Disaster recovery for applications running on managed clusters, and scenarios where the managed clusters themselves become unavailable, are outside the scope of this component.

The blog covers the steps to configure an active-passive hub cluster configuration, where the initial hub cluster backs up data and one or more passive hub clusters are on stand-by to take control of the managed clusters when the active cluster becomes unavailable.

It also shows how the backup and restore component uses a policy to alert the administrator when the main hub cluster is unavailable and a restore operation may be required. The same policy alerts the administrator if the backup solution is not functioning as expected, even if the main hub cluster is active and managing the clusters. It reports when backup data is not being produced, and any other issue that could leave you without usable backup data when the hub cluster becomes unavailable.

Prerequisites

To follow along in this blog, make sure that the following prerequisites are met:

For both active and passive hub clusters:

  • Advanced Cluster Management for Kubernetes Operator version 2.5.x must be installed on your hub cluster. View the following screen capture:

operators

  • MultiClusterHub resource is created and displays the status of Running; the MultiClusterHub resource is automatically created when you install the RHACM Operator version 2.5.x.
  • The cluster backup and restore operator chart is not installed automatically. To enable the cluster-backup operator on the hub cluster, edit the MultiClusterHub resource and set cluster-backup to true. This installs the OADP Operator in the same namespace as the backup chart. View the following screen capture:

mch
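The MultiClusterHub edit described above can be sketched roughly as follows. The spec.overrides.components layout is an assumption based on a default RHACM 2.5 installation, and the helper name mch_backup_fragment is hypothetical; inspect your own MultiClusterHub resource before changing it.

```shell
# Print the MultiClusterHub fragment that enables the cluster-backup component.
# Assumption: RHACM 2.5 exposes the toggle under spec.overrides.components.
mch_backup_fragment() {
  cat <<'EOF'
spec:
  overrides:
    components:
    - name: cluster-backup
      enabled: true
EOF
}

mch_backup_fragment
```

Merge this fragment into the existing resource with oc edit rather than replacing the components list, so that the settings of the other components are preserved.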

For passive hub clusters:

  • Before you run the restore operation on the passive hub cluster, you need to manually configure the hub cluster and install all operators as on the primary hub cluster, using the same namespace as the primary hub cluster operators.
  • You must install the RHACM operator in the same namespace as the initial hub cluster, then create the DataProtectionApplication resource, and connect to the same storage location where the initial hub cluster has backed up data.
  • If the initial hub cluster has any other operators installed, such as Ansible Automation Platform, Red Hat OpenShift GitOps, or cert-manager, you have to install them before running the restore operation. This ensures that the new hub cluster is configured in the same way as the initial hub cluster.
  • The passive hub cluster must use the same namespace names as the old hub cluster when you install the RHACM operator, and any other operators configured on the previous hub cluster.

Note: The OADP Operator 1.0 does not produce multi-arch builds; the official release provides x86_64 builds only. If you are using an architecture other than x86_64, the OADP Operator installed by the chart must be replaced with the correct version. In this case, uninstall the OADP Operator, find the operator matching your architecture, and then install it.

Product value

The Backup and restore component provides the following value:

  • A disaster recovery solution for recovering the hub cluster when it becomes unavailable.

  • Support for backing up all data required to restore the hub cluster on a new cluster, including third-party resources that extend the RHACM solution.

  • A hub-backup-pod.yaml policy, installed with the component, that automatically reports when the backup solution is not functioning as expected, even if the main hub cluster is active and managing the clusters. This avoids the situation where backup data from the hub cluster turns out to be unavailable when a disaster hits.

How it works

The cluster backup and restore operator runs on the hub cluster and depends on the OADP Operator to create a connection to a backup storage location on the hub cluster. The OADP Operator also installs Velero, which is the component used to back up and restore user-created hub resources.

The cluster backup and restore operator is installed using the cluster-backup-chart file. The cluster backup and restore operator chart is not installed automatically. Starting with RHACM version 2.5, the cluster backup and restore operator chart is installed by setting the cluster-backup option to true on the MultiClusterHub resource.

The cluster backup and restore operator chart automatically installs the OADP Operator in the same namespace as the backup chart. If you previously installed and used the OADP Operator on your hub cluster, uninstall that version, because the backup chart now works with the operator that is installed in the chart namespace. This should not affect your old backups and previous work. Use the same storage location for the DataProtectionApplication resource, which is owned by the OADP Operator installed with the backup chart, and you should have access to the same backup data as with the previous operator. The only difference is that Velero backup resources are now loaded in the new OADP Operator namespace on your hub cluster.

Setting up the storage location

Before you can use the cluster backup and restore operator, the OADP Operator must be configured to set the connection to the storage location, where you want backups to be saved. Check out the following steps to create credential secrets for where your backups are going to be saved.

Then use the secret that you created when you create the DataProtectionApplication resource to set up the connection to the storage location. View the following screen capture:

data-protection
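As a minimal sketch of the resource shown in the screen capture, the helper below prints a DataProtectionApplication manifest. The aws provider, the dpa-hub name, the cloud-credentials secret name with its cloud key, and the bucket, prefix, and region arguments are illustrative assumptions, not values taken from this blog; the namespace matches the one used in the restore sample later in this blog.

```shell
# Print a DataProtectionApplication manifest for an S3-compatible bucket.
# Assumptions: aws provider, secret "cloud-credentials" with key "cloud".
dpa_manifest() {
  local bucket="$1" prefix="$2" region="$3"
  cat <<EOF
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-hub
  namespace: open-cluster-management-backup
spec:
  configuration:
    velero:
      defaultPlugins:
      - openshift
      - aws
  backupLocations:
  - velero:
      provider: aws
      default: true
      objectStorage:
        bucket: ${bucket}
        prefix: ${prefix}
      config:
        region: ${region}
      credential:
        name: cloud-credentials
        key: cloud
EOF
}

# Placeholder values; replace them with your own storage details.
dpa_manifest my-hub-backups acm-hub us-east-1
```

On a real hub cluster, the output can be piped to oc apply -f - after the credentials secret exists. Passive hub clusters should be given the same bucket and prefix as the active hub so that they see the same backups.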

The DataProtectionApplication resource creates a Velero BackupStorageLocation resource, which is used to define the Velero storage location. Make sure the BackupStorageLocation resource has a Status - Phase of Available. View the following screen capture:

backup-storage-location
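The same check can be scripted. The sketch below assumes that the OADP Operator runs in the open-cluster-management-backup namespace; bsl_available is a hypothetical helper name.

```shell
# Namespace where the OADP Operator is installed (assumption; override with NS=...).
NS="${NS:-open-cluster-management-backup}"

# Succeeds only when at least one BackupStorageLocation exists and
# every one of them reports the Available phase.
bsl_available() {
  local phases phase
  phases="$(oc get backupstoragelocations.velero.io -n "$NS" \
    -o jsonpath='{.items[*].status.phase}')" || return 1
  [ -n "$phases" ] || return 1   # no storage location defined yet
  for phase in $phases; do
    [ "$phase" = "Available" ] || return 1
  done
}

# Usage on a hub cluster:
#   bsl_available && echo "storage location is ready"
```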

Now the hub cluster is prepared to back up resources to, or restore resources from, the specified storage location in the following scenarios:

  • Back up resources if the hub cluster is an active hub cluster.
  • Restore resources if the hub cluster is a passive hub cluster on stand-by to become the primary cluster, in case of a disaster when the active hub becomes unavailable.

Cluster backup and restore flow

The operator defines the BackupSchedule.cluster.open-cluster-management.io resource that is used to set up RHACM backup schedules, and the Restore.cluster.open-cluster-management.io resource that is used to process and restore these backups. The operator sets the options required to back up remote cluster configurations and any other hub cluster resources that need to be restored. View the following screen capture of the backup and restore architecture:

cluster-backup-controller-dataflow

Building an active-passive backup configuration

So far, we explained how to set up the backup and restore component and the connection to the storage location. As a reminder, one hub cluster is the active cluster that manages the clusters and backs up data to the storage location, while other passive hub clusters are on stand-by to restore the backed up data and prepare to become the active cluster. The administrator is notified of backup issues or disaster scenarios using a backup policy.

Next, let's follow the active-passive scenario to learn how to set the active and passive hub cluster, and how to make sure the backup is enabled and functional.

Active-passive configuration

In an active-passive configuration, the following components are used:

  • One hub cluster, called active or primary hub cluster, which manages the clusters and is backing up resources at defined time intervals with the BackupSchedule.cluster.open-cluster-management.io resource.

  • One or more passive hub clusters that continuously retrieve the latest backups and restore the passive data. The passive hub clusters use the Restore.cluster.open-cluster-management.io resource to continuously restore passive data posted by the primary hub, when new backup data is available. These hubs are on stand-by to become a primary hub cluster when the primary hub becomes unavailable. They are connected to the same storage location where the primary hub backs up data so they can access the primary hub backups. For more details on how to set up this automatic restore configuration, see the Restore backups on the passive hubs section.

In the following image, the active hub cluster manages the remote clusters and backs up hub data at regular intervals. The passive hubs restore this data, except for the managed clusters activation data, which moves the managed clusters to the passive hub. The passive hub clusters can restore the passive data continuously, or as a one-time operation. For example, see the cluster_v1beta1_restore_passive_sync.yaml sample, which restores passive data continuously, and the cluster_v1beta1_restore_passive.yaml sample for a one-time operation.

active-passive-configuration-dataflow

Disaster recovery

When the primary hub cluster is unavailable, the administrator chooses one of the passive hub clusters to take over the managed clusters. In the following image, the administrator decides to use Hub N as the new primary hub cluster. View the following steps to make Hub N become a primary hub cluster:

  1. Hub N restores the Managed Cluster activation data. At this point, the managed clusters connect with Hub N. See the Activate a passive hub section for more details on how to restore active data.

  2. The administrator starts a backup on the new primary, Hub N, by creating a BackupSchedule.cluster.open-cluster-management.io resource and storing the backups at the same storage location as the initial primary hub cluster. Seamlessly, all other passive hub clusters now restore passive data using the backup data created by the new primary hub. Hub N is now the primary hub cluster, managing clusters and backing up data.

disaster-recovery

Notes:

  • Step 1 is not automated because the administrator decides whether the unavailable primary hub cluster needs to be replaced, or whether the problem is only a network communication error between the hub and the managed clusters. The administrator also decides which passive hub cluster should become the primary cluster. If desired, you can automate this step by using the policy integration with Ansible jobs. The administrator can set up an Ansible job to be run when the backup policy reports errors.
  • Although Step 2 is a manual step, the Backups are actively running as a cron job policy template notifies the administrator if they forget (or omit) to start creating backups from the new primary hub.
  • More details about restore options are available with the backup project.

Passive data

Passive data is backup data such as secrets, ConfigMaps, applications, policies, and all the managed cluster custom resources that do not activate the connection between the managed clusters and the hub cluster on which they are restored. These resources are stored by the credentials backup file and the resources backup files.

Managed cluster activation data

Managed cluster activation data, or activation data, is backup data that, when restored on a new hub cluster, results in the managed clusters being actively managed by that hub cluster. Activation data resources are stored by the managed cluster backups, and by the resources-generic backup using the cluster.open-cluster-management.io/backup: cluster-activations label. More details about the activation resources are available with the backup project.

Enable a backup schedule on the active hub

To enable a backup schedule on the active hub cluster, you have to create a BackupSchedule.cluster.open-cluster-management.io resource in the same namespace where the OADP Operator is installed. View the following image:

create-backupschedule
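A schedule like the one in the image, with a backup every two hours and backups expiring after 10 days, might be written as the sketch below. The veleroSchedule and veleroTtl field names follow the operator's published samples, and the schedule-acm name and the backup_schedule_manifest helper are hypothetical.

```shell
# Print a BackupSchedule manifest.
#   $1: cron expression used for the Velero schedules
#   $2: time-to-live after which expired backups are deleted
backup_schedule_manifest() {
  local cron="$1" ttl="$2"
  cat <<EOF
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "${cron}"
  veleroTtl: ${ttl}
EOF
}

# Every two hours, keep backups for 10 days (240 hours).
backup_schedule_manifest "0 */2 * * *" 240h
```

On the active hub cluster, the output can be piped to oc apply -f -.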

The backup schedule status should be in the Enabled state. In the previous sample, there is a defined backup schedule that creates backups every two hours, while the expired backups are deleted after 10 days. See the following image:

restore-enabled

Once the backup schedule is enabled, you should see backup schedules created and owned by the BackupSchedule.cluster.open-cluster-management.io resource. See the following image:

schedules

These schedules create backups every two hours. After the Velero backups are complete, the resources are stored at the storage location set by the DataProtectionApplication resource. These backups can be accessed by any hub cluster connecting to the same storage location. The passive hub clusters should use a DataProtectionApplication resource that points to the same storage location as the active hub. From this perspective, you can access and restore these backups. View the following image:

backups

Restore backups on the passive hubs

To restore a backup on the passive hub cluster, you have to create a Restore.cluster.open-cluster-management.io resource in the same namespace where the OADP Operator is installed, and set it to restore the latest resources while skipping the managed cluster data. As a result, the managed clusters continue to be managed by the primary hub cluster. For a passive hub cluster, you should restore only passive resources. The managed cluster data is restored at the very end, when the active hub is unavailable and the administrator decides to make this passive hub the active hub. Continue reading to learn how managed cluster data is restored.

In the following image, a restore passive resource with sync is created. You can choose to keep looking for new backups and restore them as they are found, using the syncRestoreWithNewBackups option. If this option is set to true, then the Restore.cluster.open-cluster-management.io resource is Enabled and keeps running after a restore is run. Use the restoreSyncInterval parameter to set the interval for checking for new backups. If the restoreSyncInterval property is not set, the default is 30 minutes:

create-restore
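A passive restore with sync might look like the sketch below. The spec fields are the ones that appear in the oc describe sample in the View restore details section; the restore-acm-passive-sync name and the restore_passive_sync_manifest helper are hypothetical.

```shell
# Print a Restore manifest that restores passive data only and keeps
# checking for new backups.
#   $1: sync interval (defaults to the documented 30 minutes)
restore_passive_sync_manifest() {
  local interval="${1:-30m}"
  cat <<EOF
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-sync
  namespace: open-cluster-management-backup
spec:
  syncRestoreWithNewBackups: true
  restoreSyncInterval: ${interval}
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: skip
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
EOF
}

restore_passive_sync_manifest 20m
```

Setting veleroManagedClustersBackupName to skip is what keeps the managed clusters attached to the primary hub cluster.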

After the Restore.cluster.open-cluster-management.io resource is created, the resource is in an Enabled state if syncRestoreWithNewBackups is set to true. Otherwise, the Restore resource is set to Finished and the restore is not run again. Notice that only passive data is being restored. The syncRestoreWithNewBackups option is valid only when passive data is restored. As soon as you restore the managed clusters data, the hub cluster becomes the active hub cluster, so no restores are run after that. View the following image:

restore-enabled

After the previous steps, the clusters still show as active on the primary hub and they are managed by the primary hub cluster. The following image shows the managed clusters on the primary hub cluster in a Ready state:

managed-cls-active-hub

The following image shows the managed clusters on the passive hub cluster. Note that only clusters that are created using the Hive API appear at this step. Managed clusters imported on the primary hub using the Import cluster operation appear only when the activation data is restored on the passive hub cluster:

managed-cls-passive-1

The managed clusters console and login information is available on the passive hub cluster:

managed-cls-detached

View restore details

Use the oc describe restore -n <oadp-n> <restore-name> command to get information about restore events. View the following sample:

oc describe restore -n open-cluster-management-backup example
Name:         example
Namespace:    open-cluster-management-backup
Labels:       <none>
Annotations:  <none>
API Version:  cluster.open-cluster-management.io/v1beta1
Kind:         Restore
Metadata:
  Creation Timestamp:  2022-03-26T19:44:49Z
  Generation:          1
  Managed Fields:
    API Version:  cluster.open-cluster-management.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanupBeforeRestore:
        f:restoreSyncInterval:
        f:syncRestoreWithNewBackups:
        f:veleroCredentialsBackupName:
        f:veleroManagedClustersBackupName:
        f:veleroResourcesBackupName:
    Manager:      Mozilla
    Operation:    Update
    Time:         2022-03-26T19:44:49Z
    API Version:  cluster.open-cluster-management.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:lastMessage:
        f:phase:
        f:veleroCredentialsRestoreName:
        f:veleroResourcesRestoreName:
    Manager:         __debug_bin
    Operation:       Update
    Subresource:     status
    Time:            2022-03-26T19:45:21Z
  Resource Version:  12151491
  UID:               761c725e-731d-42f7-b0ef-547af0789a3b
Spec:
  Cleanup Before Restore:               CleanupRestored
  Restore Sync Interval:                20m
  Sync Restore With New Backups:        true
  Velero Credentials Backup Name:       latest
  Velero Managed Clusters Backup Name:  skip
  Velero Resources Backup Name:         latest
Status:
  Last Message:                     Velero restores have run to completion, restore will continue to sync with new backups
  Phase:                            Enabled
  Velero Credentials Restore Name:  example-acm-credentials-schedule-20220406171919
  Velero Resources Restore Name:    example-acm-resources-schedule-20220406171920
Events:
  Type    Reason                   Age  From                Message
  ----    ------                   ---- ----                -------
  Normal  Prepare to restore:      76m  Restore controller  Cleaning up resources for backup acm-credentials-hive-schedule-20220406155817
  Normal  Prepare to restore:      76m  Restore controller  Cleaning up resources for backup acm-credentials-cluster-schedule-20220406155817
  Normal  Prepare to restore:      76m  Restore controller  Cleaning up resources for backup acm-credentials-schedule-20220406155817
  Normal  Prepare to restore:      76m  Restore controller  Cleaning up resources for backup acm-resources-generic-schedule-20220406155817
  Normal  Prepare to restore:      76m  Restore controller  Cleaning up resources for backup acm-resources-schedule-20220406155817
  Normal  Velero restore created:  74m  Restore controller  example-acm-credentials-schedule-20220406155817
  Normal  Velero restore created:  74m  Restore controller  example-acm-resources-generic-schedule-20220406155817
  Normal  Velero restore created:  74m  Restore controller  example-acm-resources-schedule-20220406155817
  Normal  Velero restore created:  74m  Restore controller  example-acm-credentials-cluster-schedule-20220406155817
  Normal  Velero restore created:  74m  Restore controller  example-acm-credentials-hive-schedule-20220406155817
  Normal  Prepare to restore:      64m  Restore controller  Cleaning up resources for backup acm-resources-schedule-20220406165328
  Normal  Prepare to restore:      62m  Restore controller  Cleaning up resources for backup acm-credentials-hive-schedule-20220406165328
  Normal  Prepare to restore:      62m  Restore controller  Cleaning up resources for backup acm-credentials-cluster-schedule-20220406165328
  Normal  Prepare to restore:      62m  Restore controller  Cleaning up resources for backup acm-credentials-schedule-20220406165328
  Normal  Prepare to restore:      62m  Restore controller  Cleaning up resources for backup acm-resources-generic-schedule-20220406165328
  Normal  Velero restore created:  61m  Restore controller  example-acm-credentials-cluster-schedule-20220406165328
  Normal  Velero restore created:  61m  Restore controller  example-acm-credentials-schedule-20220406165328
  Normal  Velero restore created:  61m  Restore controller  example-acm-resources-generic-schedule-20220406165328
  Normal  Velero restore created:  61m  Restore controller  example-acm-resources-schedule-20220406165328
  Normal  Velero restore created:  61m  Restore controller  example-acm-credentials-hive-schedule-20220406165328
  Normal  Prepare to restore:      38m  Restore controller  Cleaning up resources for backup acm-resources-generic-schedule-20220406171920
  Normal  Prepare to restore:      38m  Restore controller  Cleaning up resources for backup acm-resources-schedule-20220406171920
  Normal  Prepare to restore:      36m  Restore controller  Cleaning up resources for backup acm-credentials-hive-schedule-20220406171919
  Normal  Prepare to restore:      36m  Restore controller  Cleaning up resources for backup acm-credentials-cluster-schedule-20220406171919
  Normal  Prepare to restore:      36m  Restore controller  Cleaning up resources for backup acm-credentials-schedule-20220406171919
  Normal  Velero restore created:  36m  Restore controller  example-acm-credentials-cluster-schedule-20220406171919
  Normal  Velero restore created:  36m  Restore controller  example-acm-credentials-schedule-20220406171919
  Normal  Velero restore created:  36m  Restore controller  example-acm-resources-generic-schedule-20220406171920
  Normal  Velero restore created:  36m  Restore controller  example-acm-resources-schedule-20220406171920
  Normal  Velero restore created:  36m  Restore controller  example-acm-credentials-hive-schedule-20220406171919

Activate a passive hub

In the case of a disaster, when the active hub becomes unavailable, the backup policy notifies the administrator that the hub cluster is no longer active and producing backups. The administrator then decides which passive hub becomes the active one. On this hub, the administrator creates a Restore.cluster.open-cluster-management.io resource and sets veleroManagedClustersBackupName to latest. If a Restore resource already exists in an Enabled state, update veleroManagedClustersBackupName on that resource and set it to latest instead. This moves the managed clusters to the new hub cluster.
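For the case where a Restore resource is already in an Enabled state, the update could be sketched as a merge patch. The activate_passive_hub helper and the default namespace are assumptions; pass the name of your own Restore resource.

```shell
# Namespace where the OADP Operator is installed (assumption; override with NS=...).
NS="${NS:-open-cluster-management-backup}"

# Switch an existing Restore resource from skipping the managed clusters
# backup to restoring the latest one, which activates this hub.
#   $1: name of the Restore resource on the passive hub
activate_passive_hub() {
  local restore_name="$1"
  oc patch restore.cluster.open-cluster-management.io "$restore_name" -n "$NS" \
    --type merge \
    -p '{"spec":{"veleroManagedClustersBackupName":"latest"}}'
}

# Usage on the passive hub chosen to take over:
#   activate_passive_hub example
```

The fully qualified resource name avoids ambiguity with the Velero Restore resources that live in the same namespace.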

Note: Only managed clusters that are created using the Hive API are automatically connected with the new hub cluster when the activation data is restored on the passive hub cluster. All other managed clusters show up as Pending Import and must be imported back on the new hub cluster. The Hive managed clusters can be connected with the new hub cluster because Hive provides the kubeconfig to connect to the managed cluster, and it is being backed up and restored on the new hub. The import controller then updates the bootstrap kubeconfig on the managed cluster using the restored configuration. This information is only available for managed clusters created using the Hive API.

The administrator then must create a BackupSchedule.cluster.open-cluster-management.io resource on the hub cluster to start backing up data.

This concludes the steps needed to turn a passive hub cluster into the active hub cluster.

Backup validation using a policy

The cluster backup and restore operator chart installs the backup-restore-enabled policy, which is used to report issues with the backup and restore component.
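The overall state of this policy can also be read from the command line. The sketch below assumes that the policy is created in the same namespace as the backup chart and that its status exposes a compliant field, as RHACM governance policies do; backup_policy_compliant is a hypothetical helper name.

```shell
# Namespace where the backup chart is installed (assumption; override with NS=...).
NS="${NS:-open-cluster-management-backup}"

# Succeeds only when the backup-restore-enabled policy reports Compliant.
backup_policy_compliant() {
  local state
  state="$(oc get policies.policy.open-cluster-management.io backup-restore-enabled \
    -n "$NS" -o jsonpath='{.status.compliant}')" || return 1
  [ "$state" = "Compliant" ]
}

# Usage on a hub cluster:
#   backup_policy_compliant || echo "backup policy reports a violation"
```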

The policy has a set of templates that check for the following constraints and inform when any of them are violated:

policy

Pod validation

The following templates check the pod status for the backup component and dependencies:

  • acm-backup-pod-running template checks if the backup and restore operator pod is running
  • oadp-pod-running template checks if the OADP Operator pod is running
  • velero-pod-running template checks if the Velero pod is running

Data Protection Application validation

data-protection-application-available template checks if a DataProtectionApplication.oadp.openshift.io resource is created. This OADP resource sets up Velero configurations.

Backup storage validation

backup-storage-location-available template checks if a BackupStorageLocation.velero.io resource is created and the status is Available. This implies that the connection to the backup storage is valid.

Backups are actively running as a cron job

This validation is done with the backup-schedule-cron-enabled template. It checks that a BackupSchedule.cluster.open-cluster-management.io resource is actively running and creating new backups at the storage location. The template verifies that there is a Backup.velero.io resource with a velero.io/schedule-name: acm-validation-policy-schedule label at the storage location. The acm-validation-policy-schedule backups are set to expire after the time set for the backups cron schedule. If no cron job is running to create backups, the old acm-validation-policy-schedule backup is deleted because it expired, and a new one is not created. So if no acm-validation-policy-schedule backup exists at a given moment, there are no active cron jobs generating RHACM backups.
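The same condition can be checked by hand with a label selector. This is only a sketch of the idea, not the policy template itself; cron_backups_active is a hypothetical helper and the namespace default is an assumption.

```shell
# Namespace where the OADP Operator is installed (assumption; override with NS=...).
NS="${NS:-open-cluster-management-backup}"

# Succeeds when at least one validation backup exists, which indicates
# that a backup schedule is still producing backups.
cron_backups_active() {
  local found
  found="$(oc get backups.velero.io -n "$NS" \
    -l velero.io/schedule-name=acm-validation-policy-schedule -o name)" || return 1
  [ -n "$found" ]
}

# Usage on a hub cluster:
#   cron_backups_active || echo "no active backup schedule detected"
```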

BackupSchedule collision validation

acm-backup-clusters-collision-report template checks if a BackupSchedule.cluster.open-cluster-management.io resource exists on the current hub cluster and its state is not BackupCollision. This verifies that the current hub cluster is not in collision with any other hub when writing backup data to the storage location. For a definition of the BackupCollision state read the Backup Collisions section.

BackupSchedule and Restore status validation

acm-backup-phase-validation template checks if a BackupSchedule.cluster.open-cluster-management.io resource exists on the current cluster and its status is not Failed or empty. This ensures that if this cluster is the primary hub and is generating backups, the BackupSchedule.cluster.open-cluster-management.io status is healthy. It also checks if a Restore.cluster.open-cluster-management.io resource exists on the current cluster and its status is not Failed or empty. This ensures that if this cluster is the secondary hub and restores backups, the Restore.cluster.open-cluster-management.io status is healthy.

Backups exist validation

acm-managed-clusters-schedule-backups-available template checks if Backup.velero.io resources are available at the location specified by the BackupStorageLocation.velero.io resource, and if the backups were created by a BackupSchedule.cluster.open-cluster-management.io resource. This validates that the backups have been run at least once, using the backup and restore operator.

Backups are running to completion

acm-backup-in-progress-report template checks if Backup.velero.io resources are stuck in the InProgress state. This validation is added because, with a large number of resources, the Velero pod can restart while the backup runs, leaving the backup in progress without ever completing. During a normal backup, the backup resources are in progress for some time after the backup starts, but they do not get stuck in this phase and run to completion. It is normal to see the acm-backup-in-progress-report template report a warning while the schedule is running and backups are in progress.
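To see which backups are currently in that state, a listing along the following lines may help. It is a sketch rather than the logic of the policy template; in_progress_backups is a hypothetical helper and the namespace default is an assumption.

```shell
# Namespace where the OADP Operator is installed (assumption; override with NS=...).
NS="${NS:-open-cluster-management-backup}"

# Print the names of Velero backups whose phase is InProgress.
in_progress_backups() {
  oc get backups.velero.io -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
    | awk '$2 == "InProgress" { print $1 }'
}

# Usage on a hub cluster:
#   in_progress_backups
```

A backup that still appears in this list long after its schedule ran is a candidate for the stuck case described above.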

This policy is intended to notify the administrator of any backup issues while the hub cluster is active and expected to produce or restore backups. You can set up an automatic response to any of the policy violations by setting up an Ansible job with the following policy:

ansible-policy

Conclusion

This blog describes how to use the cluster backup and restore operator available with Red Hat Advanced Cluster Management 2.5.x to set up a disaster recovery configuration for your hub cluster. It shows how to set up an active-passive configuration consisting of one primary hub cluster, managing clusters, and one or more passive hubs ready to take over when the primary hub is unavailable.

The cluster backup and restore operator provides a way for you to set up an automatic schedule of backups, as well as an automatic restore of passive resources, in preparation for the passive hub cluster to take over. The backup solution is extensible, so third-party components that are installed on the same hub cluster can be included in the Red Hat Advanced Cluster Management backup.

The backup solution provides a policy that reports any issues with the backup configuration, and gives the administrator a way to manage the disaster recovery solution and further automate the configuration using Ansible tasks.

