Red Hat Advanced Cluster Management for Kubernetes (RHACM) defines two main types of clusters: hub clusters and managed clusters. The hub cluster is the main cluster with RHACM installed on it. You can create, manage, and monitor other Kubernetes clusters with the hub cluster.
The managed clusters are Kubernetes clusters that are managed by the hub cluster. You can create some clusters by using the RHACM hub cluster, and you can also import existing clusters to be managed by the hub cluster.
A common question is: What happens when the hub cluster is unavailable?
When the hub cluster is unavailable, the managed clusters continue to work, but features that depend on the hub cluster, such as alerting or cluster updates based on policy configurations, no longer work properly. This is unacceptable for long periods of time. You need a recovery plan to decide whether the hub cluster can be recovered, or whether a new hub cluster should be deployed and the data recovered on it.
This blog covers the situation when the hub cluster becomes unavailable and how it can be recovered. It presents the backup and restore component that is available with RHACM, which implements the solution for recovering hub clusters. Disaster recovery scenarios for applications running on managed clusters, and scenarios where the managed clusters become unavailable, are outside the scope of this component.
The blog covers the steps to configure an active-passive hub cluster configuration, where the initial hub cluster backs up data and one or more passive hub clusters are on stand-by to control the managed clusters when the active cluster becomes unavailable.
It also shows how the backup and restore component sends alerts using a policy that is configured to let the administrator know when the main hub cluster is unavailable, and a restore operation may be required. The same policy alerts the administrator if the backup solution is not functioning as expected, even if the main hub cluster is active and managing the clusters. It reports any issues with the backup data not being produced, or any other issues that can result in backup data errors or an unavailable hub cluster.
View the following prerequisites to follow along in this blog:
For both active and passive hub clusters:
- Advanced Cluster Management for Kubernetes Operator version 2.5.x must be installed on your hub cluster. View the following screen capture:
- The MultiClusterHub resource is created and displays the status of
The MultiClusterHub resource is automatically created when you install the RHACM Operator version 2.5.x.
- The cluster backup and restore operator chart is not installed automatically. Enable the cluster-backup operator on the hub cluster: edit the MultiClusterHub resource and set the cluster-backup option to true. This installs the OADP Operator in the same namespace as the backup chart. View the following screen capture:
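As a text alternative to the screen capture, the following is a minimal sketch of a MultiClusterHub resource with the cluster-backup option enabled. The resource name, the namespace, and the spec.overrides.components path are assumptions about how the RHACM 2.5 operator exposes optional components, so compare them with the MultiClusterHub resource on your own hub cluster before you apply anything.

```yaml
# Sketch only: name, namespace, and the overrides path are assumptions.
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub                # assumed default resource name
  namespace: open-cluster-management   # assumed RHACM install namespace
spec:
  overrides:
    components:
      - name: cluster-backup           # the option named in this blog
        enabled: true                  # installs the backup chart and the OADP Operator
```

Editing the existing resource, rather than creating a new one, keeps the rest of the hub configuration untouched.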
For passive hub clusters:
- Before you run the restore operation on the passive hub cluster, you need to manually configure the hub cluster and install all operators as on the primary hub cluster, using the same namespace as the primary hub cluster operators.
- You must install the RHACM operator in the same namespace as the initial hub cluster, then create the DataProtectionApplication resource, and connect to the same storage location where the initial hub cluster has backed up data.
- If the initial hub cluster has any other operators installed, such as Ansible Automation Platform, Red Hat OpenShift GitOps, or cert-manager, you have to install them before running the restore operation. This ensures that the new hub cluster is configured in the same way as the initial hub cluster.
- The passive hub cluster must use the same namespace names as the old hub cluster when you install the RHACM operator, and any other operators configured on the previous hub cluster.
Note: The OADP Operator 1.0 does not produce multi-arch builds; the official release provides only x86_64 builds. This means that if you are using an architecture other than x86_64, the OADP Operator installed by the chart must be replaced with the correct version. In this case, uninstall the OADP Operator, find the operator matching your architecture, and then install it.
The Backup and restore component provides the following value:
- A disaster recovery solution for recovering the hub cluster, in case the hub cluster becomes unavailable.
- The hub-backup-pod.yaml policy, which automatically reports when the backup solution is not functioning as expected, even if the main hub cluster is active and managing the clusters. This avoids the situation where backup data from the hub cluster is unavailable when disaster hits.
How it works
The cluster backup and restore operator runs on the hub cluster and depends on the OADP Operator to create a connection to a backup storage location on the hub cluster. The OADP Operator also installs Velero, which is the component used to back up and restore user-created hub resources.
The cluster backup and restore operator is installed using the cluster-backup-chart file. The cluster backup and restore operator chart is not installed automatically. Starting with RHACM version 2.5, the cluster backup and restore operator chart is installed by setting the cluster-backup option to true on the MultiClusterHub resource.
The cluster backup and restore operator chart automatically installs the OADP Operator in the same namespace as the backup chart. If you have previously installed and used the OADP Operator on your hub cluster, you should uninstall that version, because the backup chart now works with the operator that is installed in the chart namespace. This should not affect your old backups and previous work. Use the same storage location for the DataProtectionApplication resource, which is owned by the OADP Operator and installed with the backup chart, and you should have access to the same backup data as the previous operator. The only difference is that Velero backup resources are now loaded in the new OADP Operator namespace on your hub cluster.
Setting up the storage location
Before you can use the cluster backup and restore operator, the OADP Operator must be configured to set the connection to the storage location, where you want backups to be saved. Check out the following steps to create credential secrets for where your backups are going to be saved.
Then use your created secret when you create the DataProtectionApplication resource to set up the connection to the storage location. View the following screen capture:
The DataProtectionApplication resource creates a Velero BackupStorageLocation resource, which is used to define the Velero storage location. Make sure the BackupStorageLocation resource has a Status - Phase of Available. View the following screen capture:
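For readers without the screen captures, the following sketch shows what a credentials secret and a DataProtectionApplication resource might look like. It assumes S3-compatible storage with the Velero aws plugin, which the blog does not specify. The resource names, bucket, prefix, and region are placeholders. The namespace is taken from the oc describe sample later in this blog.

```yaml
# Sketch only: provider, names, bucket, prefix, and region are assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials                     # placeholder secret name
  namespace: open-cluster-management-backup   # namespace of the backup chart and OADP Operator
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=REPLACE_ME
    aws_secret_access_key=REPLACE_ME
---
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-hub                               # placeholder name
  namespace: open-cluster-management-backup
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift
        - aws                                 # assumes S3-compatible storage
    restic:
      enable: false
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: my-hub-backups              # placeholder bucket
          prefix: hub-a                       # placeholder prefix
        config:
          region: us-east-1                   # placeholder region
          profile: default
        credential:
          name: cloud-credentials             # the secret defined above
          key: cloud
```

Active and passive hub clusters would point at the same bucket and prefix, so that the passive hubs can read the backups written by the active hub.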
Now the hub cluster is prepared to back up resources to, or restore resources from, the specified storage location in the following scenarios:
- Back up resources if the hub cluster is an active hub cluster.
- Restore resources if the hub cluster is a passive hub cluster on stand-by to become the primary cluster, in case of a disaster when the active hub becomes unavailable.
Cluster backup and restore flow
The operator defines the BackupSchedule.cluster.open-cluster-management.io resource that is used to set up RHACM backup schedules, and the Restore.cluster.open-cluster-management.io resource that is used to process and restore these backups. The operator sets the options required to back up remote cluster configurations and any other hub cluster resources that need to be restored. View the following screen capture of the backup and restore architecture:
Building an active-passive backup configuration
So far, we explained how to set up the backup and restore component, and how to set up the connection to the storage location. As a reminder, a hub cluster can be an active cluster that manages the clusters and backs up data to the storage location, while other passive hub clusters are on stand-by to restore backed up data and prepare to be the active cluster. The administrator is notified of backup issues or disaster scenarios using a backup policy.
Next, let's follow the active-passive scenario to learn how to set up the active and passive hub clusters, and how to make sure the backup is enabled and functional.
In an active-passive configuration, the following components are used:
- One hub cluster, called the active or primary hub cluster, which manages the clusters and backs up resources at defined time intervals with the BackupSchedule.cluster.open-cluster-management.io resource.
- One or more passive hub clusters that continuously retrieve the latest backups and restore the passive data. The passive hub clusters use the Restore.cluster.open-cluster-management.io resource to continuously restore passive data posted by the primary hub, when new backup data is available. These hubs are on stand-by to become a primary hub cluster when the primary hub becomes unavailable. They are connected to the same storage location where the primary hub backs up data so they can access the primary hub backups. For more details on how to set up this automatic restore configuration, see the Restore backups on the passive hubs section.
In the following image, the active hub cluster manages the remote clusters and backs up hub data at regular intervals. The passive hubs restore this data, except for the managed clusters activation data, which moves the managed clusters to the passive hub. The passive hub clusters can restore the passive data continuously, or as a one-time operation. For example, see the cluster_v1beta1_restore_passive_sync.yaml sample, which restores passive data continuously, along with cluster_v1beta1_restore_passive.yaml for a one-time operation.
When the primary hub cluster is unavailable, one of the passive hub clusters is chosen by the administrator to take over the managed clusters. In the following image, the administrator decides to use Hub N as the new primary hub cluster. View the following steps to make Hub N become a primary hub cluster:
1. Hub N restores the managed cluster activation data, which moves the managed clusters to Hub N.
2. The administrator starts a backup on the new primary, Hub N, by creating a BackupSchedule.cluster.open-cluster-management.io resource and storing the backups at the same storage location as the initial primary hub cluster. Seamlessly, all other passive hub clusters now restore passive data using the backup data created by the new primary hub. Hub N is now the primary hub cluster, managing clusters and backing up data.
- Step 1 is not automated, because the administrator decides whether the unavailable primary hub cluster needs to be replaced, or whether there is only a network communication error between the hub and managed clusters. The administrator also decides which passive hub cluster should become the primary cluster. If desired, you can automate this step by using the policy integration with Ansible jobs. The administrator can set up an Ansible job to be run when the backup policy reports errors.
- Although Step 2 is a manual step, the Backups are actively running as a cron job policy template notifies the administrator if they forget (or omit) to start creating backups from the new primary hub.
- More details about restore options are available with the backup project.
Passive data is backup data such as secrets, ConfigMaps, apps, policies, and all the managed cluster custom resources that do not activate the connection between the managed clusters and the hub cluster on which these resources are restored. These resources are stored by the credentials backup file and the resources backup files.
Managed cluster activation data
Managed cluster activation data, or activation data, is backup data that, when restored on a new hub cluster, results in the managed clusters being actively managed by that hub cluster. Activation data resources are stored by the managed cluster backups, and by the resources-generic backup using the cluster.open-cluster-management.io/backup: cluster-activations label. More details about the activation resources are available with the backup project.
Enable a backup schedule on the active hub
To enable a backup schedule on the active hub cluster, you have to create a BackupSchedule.cluster.open-cluster-management.io resource in the same namespace where the OADP Operator is installed. View the following image:
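In text form, a schedule like the one discussed in this section (a backup every two hours, with backups expiring after 10 days) might be sketched as follows. The resource name is a placeholder, and the veleroSchedule and veleroTtl field names are taken from the backup operator's v1beta1 API as I understand it, so verify them against the CRD installed on your hub.

```yaml
# Sketch only: the resource name is a placeholder; verify field names against your CRD.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm                          # placeholder name
  namespace: open-cluster-management-backup   # namespace where the OADP Operator is installed
spec:
  veleroSchedule: "0 */2 * * *"               # cron expression: every two hours
  veleroTtl: 240h                             # backups expire after 10 days
```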
The backup schedule status should be in the Enabled state. In the previous sample, there is a defined backup schedule that creates backups every two hours, while the expired backups are deleted after 10 days. See the following image:
Once the backup schedule is enabled, you should see Velero schedules created and owned by the BackupSchedule.cluster.open-cluster-management.io resource. See the following image:
These schedules create backups every two hours. After the Velero backups are complete, the resources are stored at the storage location set by the DataProtectionApplication resource. These backups can be accessed by any hub cluster connecting to the same storage location. The passive hub clusters should use a DataProtectionApplication resource that points to the same storage location as the active hub, so that they can access and restore these backups. View the following image:
Restore backups on the passive hubs
To restore a backup on the passive hub cluster, you have to create a Restore.cluster.open-cluster-management.io resource in the same namespace where the OADP Operator is installed, and set it to restore the latest resources while skipping the managed cluster data. As a result, the managed clusters continue to be managed by the primary hub cluster. For a passive hub cluster, you should restore only passive resources. The managed cluster data is restored at the very end, when the active hub is unavailable and the administrator decides to make this passive hub the active hub. Continue reading to learn how managed cluster data is restored.
In the following image, a Restore resource that restores passive data with sync is created. You can choose to keep looking for new backups and restore them as they are found, using the syncRestoreWithNewBackups option. If this option is set to true, then the Restore resource remains Enabled and keeps running after a restore is run. Use the restoreSyncInterval parameter to set the interval for checking for new backups. If the restoreSyncInterval property is not set, the default is 30 minutes:
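A passive restore with sync might be sketched as follows. The field names are the camel-case forms of the fields shown in the oc describe output later in this blog, and the values mirror that output. The resource name is a placeholder.

```yaml
# Sketch only: the resource name is a placeholder.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-sync              # placeholder name
  namespace: open-cluster-management-backup   # namespace where the OADP Operator is installed
spec:
  syncRestoreWithNewBackups: true             # keep checking for new backups
  restoreSyncInterval: 20m                    # check interval; the default is 30 minutes
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: skip       # do not restore activation data on a passive hub
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```

Keeping veleroManagedClustersBackupName set to skip is what leaves the managed clusters under the control of the primary hub.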
After the Restore.cluster.open-cluster-management.io resource is created, the resource is in an Enabled state if syncRestoreWithNewBackups=true is used. Otherwise, the Restore resource is set to Finished and the restore is not run again. Notice that only passive data is being restored. The syncRestoreWithNewBackups option is valid only when passive data is restored. As soon as you restore the managed clusters data, the hub cluster becomes the active hub cluster, so no restores are run after that. View the following image:
After the previous steps, the clusters still show as active on the primary hub and they are managed by the primary hub cluster. The following image shows the managed clusters on the primary hub cluster in a
The following image shows the managed clusters on the passive hub cluster. Note that only clusters that are created using the Hive API appear at this step. Managed clusters imported on the primary hub using the Import cluster operation appear only when the activation data is restored on the passive hub cluster:
The console and login information for the managed clusters is available on the passive hub cluster:
View restore details
Use the oc describe restore -n <oadp-n> <restore-name> command to get information about restore events. View the following sample:
oc describe restore -n open-cluster-management-backup example
API Version: cluster.open-cluster-management.io/v1beta1
Creation Timestamp: 2022-03-26T19:44:49Z
API Version: cluster.open-cluster-management.io/v1beta1
Fields Type: FieldsV1
API Version: cluster.open-cluster-management.io/v1beta1
Fields Type: FieldsV1
Resource Version: 12151491
Cleanup Before Restore: CleanupRestored
Restore Sync Interval: 20m
Sync Restore With New Backups: true
Velero Credentials Backup Name: latest
Velero Managed Clusters Backup Name: skip
Velero Resources Backup Name: latest
Last Message: Velero restores have run to completion, restore will continue to sync with new backups
Velero Credentials Restore Name: example-acm-credentials-schedule-20220406171919
Velero Resources Restore Name: example-acm-resources-schedule-20220406171920
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-resources-generic-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-resources-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-cluster-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-hive-schedule-20220406155817
Normal Prepare to restore: 64m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-cluster-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-resources-generic-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-resources-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-hive-schedule-20220406165328
Normal Prepare to restore: 38m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406171920
Normal Prepare to restore: 38m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406171920
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406171919
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406171919
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-credentials-cluster-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-credentials-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-resources-generic-schedule-20220406171920
Normal Velero restore created: 36m Restore controller example-acm-resources-schedule-20220406171920
Normal Velero restore created: 36m Restore controller example-acm-credentials-hive-schedule-20220406171919
Activate a passive hub
In the case of a disaster, when the active hub becomes unavailable, the backup policy notifies the administrator that the hub cluster is no longer active and producing backups. In this case, the administrator decides which passive hub becomes the active one. On this hub, the administrator creates a Restore.cluster.open-cluster-management.io resource and sets the veleroManagedClustersBackupName option to latest. If you have an active Restore resource in an Enabled state, you should update the veleroManagedClustersBackupName on that resource and set it to latest. This moves the managed clusters to the new hub cluster.
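As a sketch, an activation restore differs from the passive restore only in the veleroManagedClustersBackupName value. The resource name below is a placeholder, and the other field names follow the oc describe output shown earlier.

```yaml
# Sketch only: the resource name is a placeholder.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-activate                  # placeholder name
  namespace: open-cluster-management-backup   # namespace where the OADP Operator is installed
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest     # restores the activation data
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```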
Note: Only managed clusters that are created using the Hive API are automatically connected with the new hub cluster when the activation data is restored on the passive hub cluster. All other managed clusters show up as Pending Import and must be imported back on the new hub cluster. The Hive managed clusters can be connected with the new hub cluster because Hive provides the kubeconfig to connect to the managed cluster, and it is backed up and restored on the new hub. The import controller then updates the bootstrap kubeconfig on the managed cluster using the restored configuration. This information is only available for managed clusters created using the Hive API.
The administrator must then create a BackupSchedule.cluster.open-cluster-management.io resource on the hub cluster to start backing up data.
This concludes the steps needed to change a passive hub cluster into the active hub cluster.
Backup validation using a policy
The policy has a set of templates that check for the following constraints and inform you when any of them are violated:
The following templates check the pod status for the backup component and dependencies:
- acm-backup-pod-running template checks if the backup and restore operator pod is running
- oadp-pod-running template checks if the OADP operator pod is running
- velero-pod-running template checks if the Velero pod is running
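To illustrate how such a pod check can be expressed, the following is a sketch of a configuration policy template that reports a violation unless a matching pod is in the Running phase. It is not copied from the policy shipped with the backup component. The pod label used to select the Velero pod is an assumption, and in a real policy this object is nested under the policy-templates list of a Policy resource.

```yaml
# Sketch only: the pod label is an assumption; check the labels on your Velero pod.
apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
  name: velero-pod-running
spec:
  remediationAction: inform          # report only, do not change anything
  severity: high
  object-templates:
    - complianceType: musthave
      objectDefinition:
        apiVersion: v1
        kind: Pod
        metadata:
          namespace: open-cluster-management-backup
          labels:
            app.kubernetes.io/name: velero   # assumed label
        status:
          phase: Running
```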
Data Protection Application validation
data-protection-application-available template checks if a DataProtectionApplication.oadp.openshift.io resource is created. This OADP resource sets up Velero configurations.
Backup storage validation
backup-storage-location-available template checks if a BackupStorageLocation.velero.io resource is created and the status is Available. This implies that the connection to the backup storage is valid.
Backups are actively running as a cron job
This validation is done with the backup-schedule-cron-enabled template. It checks that a BackupSchedule.cluster.open-cluster-management.io resource is actively running and creating new backups at the storage location. The template verifies that there is a Backup.velero.io resource with a velero.io/schedule-name: acm-validation-policy-schedule label at the storage location. The acm-validation-policy-schedule backups are set to expire after the time set for the backups cron schedule. If no cron job is running to create backups, the old acm-validation-policy-schedule backup is deleted because it expired and a new one is not created. So if no acm-validation-policy-schedule backup exists at any moment in time, it means that there are no active cron jobs generating backups.
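The label check described above can be sketched as a configuration policy template that requires at least one Backup.velero.io resource carrying the validation label. This is an illustration of the idea rather than the shipped template, and the namespace is taken from the oc describe sample in this blog.

```yaml
# Sketch only: illustrates the label-based check, not the shipped template.
apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
  name: backup-schedule-cron-enabled
spec:
  remediationAction: inform
  severity: high
  object-templates:
    - complianceType: musthave
      objectDefinition:
        apiVersion: velero.io/v1
        kind: Backup
        metadata:
          namespace: open-cluster-management-backup
          labels:
            velero.io/schedule-name: acm-validation-policy-schedule
```

Because the validation backup expires once the cron interval has passed, the template is only satisfied while a schedule keeps producing fresh validation backups.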
BackupSchedule collision validation
acm-backup-clusters-collision-report template checks if a BackupSchedule.cluster.open-cluster-management.io resource exists on the current hub cluster and its state is not BackupCollision. This verifies that the current hub cluster is not in collision with any other hub when writing backup data to the storage location. For a definition of the BackupCollision state, read the Backup Collisions section.
BackupSchedule and Restore status validation
acm-backup-phase-validation template checks if a BackupSchedule.cluster.open-cluster-management.io resource exists on the current cluster and the status is not in a Failed or empty state. This ensures that if this cluster is the primary hub and is generating backups, the BackupSchedule.cluster.open-cluster-management.io status is healthy. It also checks if a Restore.cluster.open-cluster-management.io resource exists on the current cluster and the status is not in a Failed or empty state. This ensures that if this cluster is the secondary hub and restores backups, the Restore.cluster.open-cluster-management.io status is healthy.
Backups exist validation
acm-managed-clusters-schedule-backups-available template checks if Backup.velero.io resources are available at the location specified by the BackupStorageLocation.velero.io resource, and if the backups were created by a BackupSchedule.cluster.open-cluster-management.io resource. This validates that the backups have been run at least once, using the backup and restore operator.
Backups are running to completion
acm-backup-in-progress-report template checks if Backup.velero.io resources are stuck in the InProgress state. This validation is added because, with a large number of resources, the Velero pod can restart while the backup runs, and the backup then stays in progress without completing. During a normal backup, the backup resources are in progress for some time after the backup starts, but they do not get stuck in this phase and they run to completion. It is normal to see the acm-backup-in-progress-report template report a warning while the schedule is running and backups are in progress.
This policy is intended to notify the administrator of any backup issues while the hub cluster is active and expected to produce or restore backups. You can set up an automatic response to any of the policy violations by setting up an Ansible job with the following policy:
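One way to wire an Ansible job to a policy is a PolicyAutomation resource, sketched below. Every name in it is hypothetical: the policy name, the Ansible job template name, the credential secret, and the extra variable are placeholders, and the resource is assumed to live in the same namespace as the policy it references.

```yaml
# Sketch only: all names are hypothetical placeholders.
apiVersion: policy.open-cluster-management.io/v1beta1
kind: PolicyAutomation
metadata:
  name: backup-policy-automation              # hypothetical name
  namespace: open-cluster-management-backup   # assumed to match the policy namespace
spec:
  policyRef: backup-restore-enabled           # assumed name of the backup policy
  mode: once                                  # run once on the next violation, then disable
  automationDef:
    type: AnsibleJob
    name: hub-backup-response                 # hypothetical Ansible job template
    secret: ansible-tower-credentials         # hypothetical credential secret
    extra_vars:
      reason: backup-policy-violation         # hypothetical variable passed to the job
```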
This blog describes how to use the cluster backup and restore operator available with Red Hat Advanced Cluster Management 2.5.x to set up a disaster recovery configuration for your hub cluster. It shows how to set up an active-passive configuration consisting of one primary hub cluster, managing clusters, and one or more passive hubs ready to take over when the primary hub is unavailable.
The cluster backup and restore operator provides a way for you to set up an automatic schedule of backups, as well as an automatic restore of passive resources, in preparation for the passive hub cluster to take over. The backup solution is extensible, so that third-party components that are installed on the same hub cluster can be included in the Red Hat Advanced Cluster Management backup.
The backup solution provides a policy that informs you of any issues with the backup configuration, and provides a way for the administrator to manage the disaster recovery solution and further automate the configuration using Ansible tasks.