Introduction

Red Hat Advanced Cluster Management for Kubernetes (RHACM) supplies the ability to manage fleets of Kubernetes and OpenShift clusters. The RHACM model consists of a central control plane that runs in an OpenShift cluster (known as the hub cluster), and several managed clusters where the workloads run. The RHACM model is inspired by the ubiquitous two-layer model, which includes the following components:

  • Kubernetes control plane and compute nodes
  • SDN control and data planes

DevOps teams already work with this two-level architecture, which improves understanding and, ultimately, predictability. At the same time, the simple two-layer approach increases robustness.

Unfortunately, the single pane of glass is inevitably linked to the "single point of failure" problem. RHACM users often report the need to back up the control plane configuration for a quick recovery in case of an outage.

While most of the configuration can be recreated from scratch using the GitOps approach, at the end of a restore procedure the managed cluster fleet is not correctly registered in the new hub cluster. The goal of this article is to show how RHACM managed cluster configurations can be restored. In this blog, I use common Unix commands (bash), the OpenShift client (oc), and the Velero CLI.

Disaster Recovery foundations

The ability to back up, restore, and re-register managed clusters to RHACM lays the foundation for an active-passive Disaster Recovery (DR) solution. Ideally, an organization can back up the hub cluster configuration at some frequency, restoring the configuration elsewhere in case of an outage.

Obviously, a full DR solution is well beyond the scope of this article, and a more robust solution is needed before seriously thinking about DR. Too many parameters can impact business continuity, and every organization has to carefully consider what, how, and when the configuration should be backed up and restored to minimize its Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Velero

Velero is an open-source tool to back up and restore a Kubernetes cluster. Velero follows the operator pattern, reacting to the creation of custom resources (CRs). The CRs used in this article are backups.velero.io and restores.velero.io, but Velero exposes more than 10 CRs to fully automate more complex workflows.

Velero needs to be installed to back up and restore the cluster. Since I restore the backup in another cluster, keep in mind that Velero has to be installed on both hub clusters.

Velero saves the resources as manifest files in a tarball, and it stores the tarball in S3 object storage. Velero supports all major S3 providers, but in this article I use AWS S3.

I demonstrate how to deploy Velero through its CLI, but it can be installed in other ways, for example through Helm, as a regular deployment, or through OperatorHub. Whatever the installation method, Velero needs the S3 bucket name, the backup storage and volume snapshot location regions, and the S3 credentials.

The configuration

In this blog, the fleet is composed of only one managed cluster, named managed-one. When you have several managed clusters, wrap the bash commands in a loop and, where needed, wait until each command finishes successfully.
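For a larger fleet, the per-cluster commands used throughout this article can be wrapped in a plain bash loop. A minimal sketch follows; the cluster names here are hypothetical, and in practice you would feed the loop the output of oc get managedclusters:

```shell
# Hypothetical fleet; in practice, collect the names with:
#   oc get managedclusters -o name
for managedclustername in managed-one managed-two managed-three; do
  echo "processing ${managedclustername}"
  # ...run the per-cluster backup/restore commands from this article here...
done
```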

Using the OpenShift client (oc) on the dr-hub1 hub cluster, I can verify the managed clusters that are available on the cluster from the command line interface (CLI):

$ oc get managedclusters
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true True True 5d1h
managed-one true True True 4d23h

View the following console image of the accepted clusters on the dr-hub1 hub cluster:

cluster-management

As mentioned, I install Velero through its CLI. Be sure to supply the S3 credentials. For this article, the AWS credentials file has the following content:

$ cat credentials-velero
[default]
aws_access_key_id = <MY AWS ACCESS KEY ID>
aws_secret_access_key = <MY AWS SECRET ACCESS KEY>

Now, run the following command to install Velero:

$ velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.2.0 \
--bucket acm-dr-blog \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file credentials-velero
[...] # output not reported
CustomResourceDefinition/backups.velero.io: created
[...] # output not reported
CustomResourceDefinition/restores.velero.io: created
[...] # output not reported
Namespace/velero: created
[...] # output not reported
ClusterRoleBinding/velero: created
[...] # output not reported
ServiceAccount/velero: created
[...] # output not reported
Secret/cloud-credentials: created
[...] # output not reported
Deployment/velero: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.

The install command creates the velero namespace and installs everything Velero needs, including the necessary RBAC resources. I've removed most of the CRD creations from the output, leaving only restores.velero.io and backups.velero.io. You can check which resources are running with the following command:

$ oc get all -n velero
NAME READY STATUS RESTARTS AGE
pod/velero-658979bddd-6qgjk 1/1 Running 0 7m49s

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/velero 1/1 1 1 7m50s

NAME DESIRED CURRENT READY AGE
replicaset.apps/velero-658979bddd 1 1 1 7m50s

A simple way to check whether Velero can connect to the S3 storage is to grep for available in the logs:

$ oc logs deployment/velero -n velero | grep available
...
time="2021-08-01T20:24:56Z" level=info msg="Backup storage location valid, marking as available" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:121"
...

After Velero is up and running, the backup command can be run.

The backup process

The backup process consists of backing up all of the main RHACM namespaces and all the managed cluster namespaces (in this case, only managed-one). I avoid backing up cluster-scoped resources to minimize the amount of data to back up; the trade-off is that the restore is impacted, because clusterroles and clusterrolebindings need to be recreated. A managed cluster in RHACM is represented by a namespace, which is backed up, and by a cluster-scoped instance of the managedclusters.cluster.open-cluster-management.io CR, which is excluded from the backup.

Use the velero CLI to run the backup command:

$ velero backup create acm-backup-blog \
--wait \
--include-cluster-resources=false \
--exclude-resources nodes,events,certificatesigningrequests \
--include-namespaces managed-one,hive,openshift-operator-lifecycle-manager,open-cluster-management-agent,open-cluster-management-agent-addon
[...]
Backup request "acm-backup-blog" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
......................................................
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe acm-backup-blog` and `velero backup logs acm-backup-blog`.

At the end of the backup process, list the backups (currently only one) and verify that the status is Completed, with no errors or warnings. Run the following command:

$ velero backup get -n velero
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
acm-backup-blog Completed 0 0 2021-08-02 12:12:08 +0200 CEST 29d default <none>

View the following image of the Amazon S3 console. Notice that acm-backup-blog is currently present in the S3 console:

amazon-S3-backups

For the sake of this article, the backup process is finished. As already mentioned, there are definitely more configurations needed for a production environment. Generally, the following aspects should be considered:

  • Error handling
  • Backup frequency
  • S3 storage space handling
  • Encrypting data at rest

This list can (and should) continue to grow, depending on the specific use cases.
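Backup frequency, for example, can be handled directly by Velero through its Schedule CR. The following sketch is illustrative (the schedule name, cron expression, and TTL are my own choices, not from the original setup): it creates a recurring backup every six hours with the same scope as the one-shot backup above, expiring after 30 days:

```shell
# Recurring backup every 6 hours; old backups expire after 720h (30 days).
velero schedule create acm-backup-schedule \
  --schedule="0 */6 * * *" \
  --ttl 720h \
  --include-cluster-resources=false \
  --exclude-resources nodes,events,certificatesigningrequests \
  --include-namespaces managed-one,hive,openshift-operator-lifecycle-manager,open-cluster-management-agent,open-cluster-management-agent-addon
```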

The restore process

At this point, let's assume that the current hub cluster, dr-hub1, is severely impacted by a disaster. In this case, another hub cluster (dr-hub2) is used to restore the data and to re-register the managed cluster (managed-one). To keep this article simple, let's assume RHACM is already installed on dr-hub2.

To proceed with the restore, Velero needs to be installed on dr-hub2, configuring the S3 storage in the same way:

$ velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.2.0 \
--bucket acm-dr-blog \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file credentials-velero

When Velero is available, the backup CRs should also be visible on the dr-hub2 hub cluster. Run the following command to verify:

$ velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
acm-backup-blog Completed 0 0 2021-08-02 12:12:08 +0200 CEST 29d default <none>

You can also use the OpenShift client to check for the CRs in the velero namespace:

$ oc get backups -n velero
NAME AGE
acm-backup-blog 58m

Now, restore the managed cluster in dr-hub2 using the Velero CLI. Run the following command:

velero restore create --from-backup acm-backup-blog
Restore request "acm-backup-blog-20210802182417" submitted successfully.
Run `velero restore describe acm-backup-blog-20210802182417` or `velero restore logs acm-backup-blog-20210802182417` for more details.
$ velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
acm-backup-blog-20210802182417 acm-backup-blog Completed 2021-08-02 18:24:17 +0200 CEST 2021-08-02 18:24:57 +0200 CEST 0 56 2021-08-02 18:24:17 +0200 CEST <none>

At the end of the restore process, verify that the managed cluster namespace was created, and that no managed clusters (other than local-cluster) are registered to the hub cluster:

$ oc get ns managed-one --show-labels=true
NAME STATUS AGE LABELS
managed-one Active 5m37s cluster.open-cluster-management.io/managedCluster=managed-one
$ oc get managedclusters
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true True True 5h11m

Since this is not a real DR scenario, let's push our analysis a little further:

  • Taking a look at the dr-hub1 console, notice that the managed-one cluster is still registered to the hub cluster:

    cluster-management-2

  • Taking a look at the dr-hub2 console, the managed cluster is not registered but appears in the console:

    cluster-management-fail

    This error occurs because the console detects the presence of the managed-one namespace, but it cannot fetch any information from the managedcluster CR, so it reports a failure.

Registering the managed cluster

The current solution to the previously mentioned error is to have each managed cluster's registration operator register to the new hub cluster. Each registration operator watches the bootstrap-hub-kubeconfig secret in the open-cluster-management-agent namespace. The general idea for the solution consists of the following actions:

  1. Generate the new hub cluster Kubernetes configuration (dr-hub2 kubeconfig).
  2. Fetch the admin-kubeconfig secret from the managed cluster namespace.
  3. Supply the authorization to the registration operator to register the managed cluster.
  4. Use the admin-kubeconfig to replace the bootstrap-hub-kubeconfig with the dr-hub2 kubeconfig.

Let's start by generating the dr-hub2 kubeconfig using a heredoc. Assuming oc is pointing to dr-hub2, the following series of commands generates the kubeconfig:

managedclustername=managed-one
server=$(oc config view -o jsonpath='{.clusters[0].cluster.server}')
secretname=$(oc get secret -o name -n $managedclustername | grep ${managedclustername}-bootstrap-sa-token)
ca=$(oc get ${secretname} -n ${managedclustername} -o jsonpath='{.data.ca\.crt}')
token=$(oc get ${secretname} -n ${managedclustername} -o jsonpath='{.data.token}' | base64 --decode)

cat << EOF > newbootstraphub.kubeconfig
apiVersion: v1
kind: Config
clusters:
- name: default-cluster
  cluster:
    certificate-authority-data: ${ca}
    server: ${server}
contexts:
- name: default-context
  context:
    cluster: default-cluster
    namespace: default
    user: default-auth
current-context: default-context
users:
- name: default-auth
  user:
    token: ${token}
EOF
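Before using the generated file, a quick sanity check helps catch an empty variable (for example, a mismatched secret name would leave ca or token blank):

```shell
# Verify that the templated fields were actually filled in:
grep -q "certificate-authority-data: ." newbootstraphub.kubeconfig
grep -q "server: http" newbootstraphub.kubeconfig
grep -q "token: ." newbootstraphub.kubeconfig && echo "newbootstraphub.kubeconfig looks complete"
```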

Now that newbootstraphub.kubeconfig is created, let's replace bootstrap-hub-kubeconfig. Before the replacement, fetch the managed-one kubeconfig secret that was restored using Velero. With oc pointing to dr-hub2, the admin-kubeconfig secret must be fetched to access the managed cluster.

Run the following command to get the admin-kubeconfig secret:

$ oc get secret -o name -n $managedclustername  | grep admin-kubeconfig
secret/managed-one-0-qg8wm-admin-kubeconfig

To automate this process a little bit, simply fetch the kubeconfig with the following commands:

$ managed_kubeconfig_secret=$(basename $(oc get secret -o name -n $managedclustername | grep admin-kubeconfig))
$ oc get secret $managed_kubeconfig_secret -n ${managedclustername} -o jsonpath={.data.kubeconfig} | base64 -d > managedcluster.kubeconfig

You can test the managedcluster.kubeconfig file with any command, but let's get the bootstrap-hub-kubeconfig secret since it needs to be replaced:

$  oc --kubeconfig=managedcluster.kubeconfig get secret bootstrap-hub-kubeconfig -n open-cluster-management-agent
NAME TYPE DATA AGE
bootstrap-hub-kubeconfig Opaque 1 6d16h

The managed cluster registration operator needs access permission to register the managed clusters. Remember, cluster-scoped resources were not backed up. Let's use heredocs to create the ClusterRole and ClusterRoleBinding resources that grant access to the managed cluster resources:

cat << EOF | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:open-cluster-management:managedcluster:bootstrap:${managedclustername}
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - create
  - get
  - list
  - watch
- apiGroups:
  - cluster.open-cluster-management.io
  resources:
  - managedclusters
  verbs:
  - get
  - create
EOF

cat << EOF | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:open-cluster-management:managedcluster:bootstrap:${managedclustername}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:open-cluster-management:managedcluster:bootstrap:${managedclustername}
subjects:
- kind: ServiceAccount
  name: ${managedclustername}-bootstrap-sa
  namespace: ${managedclustername}
EOF

cat << EOF | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: open-cluster-management:managedcluster:${managedclustername}
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - create
  - get
  - list
  - watch
- apiGroups:
  - register.open-cluster-management.io
  resources:
  - managedclusters/clientcertificates
  verbs:
  - renew
- apiGroups:
  - cluster.open-cluster-management.io
  resourceNames:
  - ${managedclustername}
  resources:
  - managedclusters
  verbs:
  - get
  - list
  - update
  - watch
- apiGroups:
  - cluster.open-cluster-management.io
  resourceNames:
  - ${managedclustername}
  resources:
  - managedclusters/status
  verbs:
  - patch
  - update
EOF

cat << EOF | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: open-cluster-management:managedcluster:${managedclustername}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: open-cluster-management:managedcluster:${managedclustername}
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:open-cluster-management:${managedclustername}
EOF

Now replace bootstrap-hub-kubeconfig with the following commands:

oc --kubeconfig=managedcluster.kubeconfig delete secret bootstrap-hub-kubeconfig -n open-cluster-management-agent
oc --kubeconfig=managedcluster.kubeconfig create secret generic bootstrap-hub-kubeconfig --from-file=kubeconfig=newbootstraphub.kubeconfig -n open-cluster-management-agent

After you replace the bootstrap-hub-kubeconfig, the pod/klusterlet-registration-agent and pod/klusterlet-work-agent are refreshed. When these pods restart, the managed cluster appears in the list of managed clusters:

$ oc get managedclusters
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true True True 23h
managed-one false True 7m1s

Let's have a look at the dr-hub2 console, where managed-one cluster is not yet accepted:

cluster-management-not-accepted

Since dr-hub1 is still up and running (luckily, with no disaster, at least today), take a look at dr-hub1. In the following image, managed-one is labeled Offline.

cluster-management-offline

Accepting managed-one in dr-hub2

Manually accept the managed-one cluster in the dr-hub2 hub cluster with the following command:

 oc patch managedcluster ${managedclustername} -p='{"spec":{"hubAcceptsClient":true}}' --type=merge

Verify that the managed cluster has joined by running the following command:

oc get managedclusters
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true True True 24h
managed-one true True True 104m

The managed-one cluster is accepted in the dr-hub2 hub cluster. View the following image:

cluster-management-accepted

Conclusion

I have demonstrated how managed cluster configurations can be backed up and restored in a different hub cluster without affecting other pods, except for the RHACM pods in the managed cluster. As a reminder, those pods restart due to the replacement of bootstrap-hub-kubeconfig.

I want to give a special thanks to Zachary Kayyali, David Schmidt, Christine Rizzo, Christian Stark, and Mikela Jackson for reviewing and contributing to this blog. Another special thanks go to Chris Doan for the reviews, insights, and for being the best teammate one would like to have.

