This is the second part of our blog series on the OpenShift sandboxed containers Operator; the first part is available at https://cloud.redhat.com/blog/openshift-sandboxed-containers-operator-from-zero-to-hero-the-hard-way. In this blog, we want to show what you can do when things go wrong. An OpenShift cluster is a complex system in which many pieces need to work together. Sometimes things get out of order, and a little help is needed to get everything back into the right state. We have chosen the most common problems and describe what they mean and how they can be solved.

Kata Runtime Installation and Uninstallation on OpenShift Nodes

Here is a quick recap of the Kata Containers runtime deployment workflow, which will be useful for troubleshooting.

The installation workflow is triggered by the creation of a KataConfig CR. Conversely, the uninstallation workflow is triggered by the deletion of the KataConfig CR.
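
For reference, here is a minimal sketch of both triggers. The CR name example-kataconfig matches the one used later in this blog; the apiVersion is an assumption based on the operator's v1 API and should be verified in your cluster (for example, with oc explain kataconfig once the Operator is installed):

# Trigger the installation workflow by creating a minimal KataConfig CR
cat <<EOF | oc create -f -
apiVersion: kataconfiguration.openshift.io/v1
kind: KataConfig
metadata:
  name: example-kataconfig
EOF

# Trigger the uninstallation workflow by deleting the CR
oc delete kataconfig example-kataconfig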

For each worker node in the MachineConfigPool (MCP), the Machine Config Operator (MCO) performs a set of steps, shown in the following diagram, to install or remove the sandboxed-containers extension:

For a detailed understanding of the installation process, including the creation of the MachineConfigPool (MCP) and MachineConfig (MC), refer to the previous blog in the series: https://cloud.redhat.com/blog/openshift-sandboxed-containers-operator-from-zero-to-hero-the-hard-way

Troubleshooting

Logs are your friend when troubleshooting installation or uninstallation issues.

To understand the key logs that are required for troubleshooting, let’s consider the following diagram depicting the major components:

You can use the following cheat sheet, which describes how to collect the required information for each of the components:

Node

Get the cluster nodes and their current status (Ready, NotReady, and so on):

oc get nodes

OpenShift sandboxed containers Operator

Get the operator pod:

oc get pods -n openshift-sandboxed-containers-operator

You can also use the following snippet to get the pod name:

POD_NAME=$(oc get pods -l control-plane=controller-manager -n openshift-sandboxed-containers-operator -o jsonpath="{.items[0].metadata.name}")

View Logs

oc logs $POD_NAME -n openshift-sandboxed-containers-operator

KataConfig CR

Get the KataConfig CR status

oc describe kataconfigs <kataconfig-cr-name>

You can also use the following snippet to get only the status:

oc get kataconfigs <kataconfig-cr-name> -o jsonpath='{.status.*}'

Machine Config Operator

Get MachineConfigPool and current status

oc get mcp

Get MachineConfig and current status

oc get mc

View Logs

Find the machine-config-daemon pods running on each node:

oc get pods -n openshift-machine-config-operator -o wide | grep machine-config-daemon

You can also use the following snippet to get the machine-config-daemon pods and their corresponding nodes:

oc get pods -l k8s-app=machine-config-daemon -n openshift-machine-config-operator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

View the logs of a specific machine-config-daemon pod:

oc logs <pod-name> -n openshift-machine-config-operator -c machine-config-daemon

CRI-O and Kata runtime logs

Use journalctl on a specific node:

oc debug node/<node-name>

...

To use host binaries, run `chroot /host`

sh-4.4# chroot /host

sh-4.4# journalctl -u crio

Use the oc adm command:

oc adm node-logs <node_name> -u crio
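
If you want to narrow the output down to Kata-related entries, a plain grep over the CRI-O journal works; this is just generic filtering, not a dedicated Kata log command:

# Show only Kata-related lines from the CRI-O journal of a node
oc adm node-logs <node_name> -u crio | grep -i kata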

If SSH access is configured, you can SSH to the node and check the logs.

For more details, refer to: https://docs.openshift.com/container-platform/4.8/support/troubleshooting/troubleshooting-crio-issues.html

You can also use OpenShift Logging to centrally collect logs from all the nodes and create dashboards in Kibana: https://docs.openshift.com/container-platform/4.8/logging/cluster-logging-visualizer.html

Let’s go through a sample scenario and troubleshoot it using the above commands.

Scenario

You have created a KataConfig Custom Resource (CR) and are monitoring the installation progress by watching the CR description:

$ oc describe kataconfig example-kataconfig
Spec:
Kata Config Pool Selector:  <nil>
Status:
Installation Status:
  Is In Progress:  True
  Completed:
  Failed:
  Inprogress:
    Binaries Install Nodes List:
      ip-10-0-140-61.us-east-2.compute.internal
Kata Image:
Prev Mcp Generation:  2
Runtime Class:
Total Nodes Count:    3
[..snip..]

You realize that the installation status has been in progress (Is In Progress: True) for a long time and appears to be stuck for one node, ip-10-0-140-61.us-east-2.compute.internal, under “Binaries Install Nodes List”.

Viewing the operator logs is a good first step to understand what is happening as part of the Kata runtime deployment process.

Get the operator pod:

$ oc get pods -n openshift-sandboxed-containers-operator
NAME                                                               READY   STATUS    RESTARTS   AGE
sandboxed-containers-operator-controller-manager-65c57c9d5s7nkx   2/2     Running   0          16h

View the pod logs:

$ oc logs sandboxed-containers-operator-controller-manager-65c57c9d5s7nkx -n openshift-sandboxed-containers-operator

2021-07-02T10:02:50.699Z        INFO    controllers.KataConfig  Creating MachineConfig for Custom Resource
2021-07-02T10:02:50.699Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z        INFO    controllers.KataConfig  Waiting for MachineConfigPool to be fully updated
2021-07-02T10:03:05.704Z        INFO    controllers.KataConfig  Reconciling KataConfig in OpenShift Cluster
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Kata installation in progress
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  creating RHCOS extension MachineConfig
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Creating MachineConfig for Custom Resource
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z        INFO    controllers.KataConfig  Waiting for MachineConfigPool to be fully updated

As you can see from the logs, the operator has created the MachineConfig (MC) and is waiting for the MachineConfigPool (MCP) to be updated.

Viewing the MCP status confirms that the “worker” MCP is updating, as shown below:

$ oc get mcp

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-26065a51be10defb9236a9e86726bf1e   True      False      False      3              3                   3                     0                      103m
worker   rendered-worker-95ae3283c7dfd0ecdf2594414e1f083c   False     True       False      3              0                   0                     0                      103m
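
To see which rendered configuration the worker pool is converging to and why it has not finished yet, describing the pool adds more detail (conditions and node counts):

$ oc describe mcp worker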

Let’s view the node status:

$ oc get nodes

NAME                                         STATUS                        ROLES    AGE    VERSION
ip-10-0-134-242.us-east-2.compute.internal   Ready                         master   109m   v1.21.0-rc.0+1622f87
ip-10-0-140-61.us-east-2.compute.internal    NotReady,SchedulingDisabled   worker   102m   v1.21.0-rc.0+1622f87
ip-10-0-178-80.us-east-2.compute.internal    Ready                         worker   103m   v1.21.0-rc.0+1622f87
ip-10-0-184-83.us-east-2.compute.internal    Ready                         master   108m   v1.21.0-rc.0+1622f87
ip-10-0-217-214.us-east-2.compute.internal   Ready                         worker   101m   v1.21.0-rc.0+1622f87
ip-10-0-242-199.us-east-2.compute.internal   Ready                         master   109m   v1.21.0-rc.0+1622f87
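
Describing the affected node shows its conditions and recent events, which often hint at why it is stuck in “NotReady”:

$ oc describe node ip-10-0-140-61.us-east-2.compute.internal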

The node status for “ip-10-0-140-61.us-east-2.compute.internal” shows “NotReady”.

Let’s continue to observe the status for some time. If the node status doesn’t change to “Ready,” the MachineConfigPool “UPDATING” status will remain “True,” and the KataConfig CR description will continue to show the installation as in progress and appear to be stuck.
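
A convenient way to keep observing it is to poll the CR description, for example with watch (assuming watch is available on your workstation):

$ watch -n 30 "oc describe kataconfig example-kataconfig"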

The next step is to look at the machine-config-daemon logs for the specific node (ip-10-0-140-61.us-east-2.compute.internal) to check the status of the extension installation.

Get the machine-config-daemon pod for the specific node:

$ oc get pods -l k8s-app=machine-config-daemon -n openshift-machine-config-operator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

NAME                          NODE
machine-config-daemon-5zzcf   ip-10-0-134-242.us-east-2.compute.internal
machine-config-daemon-gb7s8   ip-10-0-140-61.us-east-2.compute.internal
machine-config-daemon-mmjjj   ip-10-0-178-80.us-east-2.compute.internal
machine-config-daemon-rxjb8   ip-10-0-184-83.us-east-2.compute.internal
machine-config-daemon-txvpz   ip-10-0-242-199.us-east-2.compute.internal

View the pod logs:

$ oc logs machine-config-daemon-gb7s8 -n openshift-machine-config-operator -c machine-config-daemon

[..snip..]
I0704 15:23:58.131569    4107 update.go:1099] Applying extensions : ["update" "--install" "kata-containers"]
I0704 15:23:58.131656    4107 rpm-ostree.go:258] Running captured: rpm-ostree update --install kata-containers
[..snip..]
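
If the node can still run a debug pod, you can double-check the outcome of that rpm-ostree transaction directly on the node; this is a generic rpm-ostree status query, not an operator-specific command:

$ oc debug node/ip-10-0-140-61.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# rpm-ostree status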

The logs do not indicate any issue with the package installation.

You will need to check the node itself for errors. Check the node console for any boot-time errors. If the node is pingable and direct SSH access is configured, SSH to the node and check for kubelet or CRI-O errors. Resolve any errors and recheck the node status.
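
For example, once you are on the node (via SSH or the console, since a NotReady node usually cannot run a debug pod), the standard systemd journals of both services are a good starting point:

# Check the most recent kubelet and CRI-O journal entries on the node
$ sudo journalctl -u kubelet --no-pager | tail -n 50
$ sudo journalctl -u crio --no-pager | tail -n 50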

Once the node status changes to “Ready,” all the related objects will reflect the updated status.

For example, the KataConfig CR description will show the node in the completed list:

Status:
Installation Status:
  Is In Progress:  True
  Completed:
    Completed Nodes Count:  1
    Completed Nodes List:
      ip-10-0-140-61.us-east-2.compute.internal
  Failed:
  Inprogress:
    Binaries Install Nodes List:
      ip-10-0-178-80.us-east-2.compute.internal
Kata Image:
Prev Mcp Generation:  2
Runtime Class:
Total Nodes Count:    3
[..snip..]

Likewise, the MCP READYMACHINECOUNT and UPDATEDMACHINECOUNT columns will be updated to reflect the progress:

$ oc get mcp

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-26065a51be10defb9236a9e86726bf1e   True      False      False      3              3                   3                     0                      108m
worker   rendered-worker-95ae3283c7dfd0ecdf2594414e1f083c   False     True       False      3              1                   1                     0                      108m
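
Once the installation has completed on all nodes, a quick sanity check is to confirm that the kata RuntimeClass exists and to schedule a test pod with it. The pod below is a hypothetical example; the RuntimeClass name is also reported in the Runtime Class field of the KataConfig status:

$ oc get runtimeclass

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  # "kata" is the RuntimeClass created by the operator; verify with "oc get runtimeclass"
  runtimeClassName: kata
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "3600"]
EOF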

Summary

In this last part of our blog series about the OpenShift sandboxed containers Operator, we showed how to troubleshoot the most common problems that can occur. We walked through one example scenario and used a step-by-step process to find the root cause. In the next release of the Operator, we will provide an image that can be passed to the ‘oc adm must-gather’ command to collect even more information to help solve all kinds of possible problems. We hope you have a great experience with the OpenShift sandboxed containers Operator and have no need to use what you have learned in this blog.
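
For reference, such a must-gather run would look like the following; the image pull spec is a hypothetical placeholder, since the dedicated image is not yet released at the time of writing:

$ oc adm must-gather --image=<sandboxed-containers-must-gather-image>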


About the authors

Jens Freimann is a Software Engineering Manager at Red Hat with a focus on OpenShift sandboxed containers and Confidential Containers. He has been with Red Hat for more than six years, during which he has made contributions to low-level virtualization features in QEMU, KVM and virtio(-net). Freimann is passionate about Confidential Computing and has a keen interest in helping organizations implement the technology. Freimann has over 15 years of experience in the tech industry and has held various technical roles throughout his career.
