Introduction

The Node Tuning Operator (NTO) is a Second Level Operator (SLO) that ships with OpenShift Container Platform (OCP) 4 by default. It has come a long way since its first release, and this blog post aims to give a basic overview of the Operator and the new functionality introduced in the latest OCP 4.5 release.

Why NTO?

The majority of high-performance applications require some level of OS tuning. On RHEL systems, these tunings are traditionally delivered by the Tuned daemon. With the introduction of OpenShift-managed Red Hat Enterprise Linux CoreOS (RHCOS), many traditional RHEL packages were stripped away from this immutable OS. The Tuned daemon was no exception.

A lot of effort from many teams and individuals went into the Tuned daemon itself and its workload-specific, time-tested profiles. Rather than lose that work, an OpenShift-aware wrapper around the Tuned daemon (openshift-tuned) was created, along with the NTO, which manages this operand on all cluster nodes.

The main reasons for using NTO are to:

  • Bring existing Tuned/HPC profiles from RHEL into OCP.
  • Ensure full compatibility with all these profiles on RHEL/RHCOS.
  • Abstract OS version level-dependent tuning details away.
  • Enable modularity and Tuned profile inheritance.
  • Provide sane defaults for OCP control plane and worker nodes.
  • Provide dynamic tuning with rollback without the need for node reboots.
  • Have a centralized way to customize node-level tuning for cluster administrators.

So Sysctls Only?

A look at the default Tuned CR shipped in OCP 4.5 shows that the NTO mostly configures kernel parameters at run time via the /proc/sys kernel interface. While setting these parameters is the most typical use of a Tuned profile, the NTO supports many other Tuned plugins. Most notably:

  • [bootloader] boot configuration and kernel parameters
  • [cpu] governors, EPB, locking CPU to low C states (by PM QoS)
  • [disk] I/O disk scheduler (elevator), readahead values, spindown
  • [irqbalance] HW interrupt management across CPUs (no irqbalance on isolated cores)
  • [module] kernel module parameters with optional reload
  • [mounts] enable/disable barriers (ext* filesystems only)
  • [net] HW settings for network interfaces, WoL, netfilter tuning
  • [scheduler] SMP IRQ affinity, proc/kthread sched_setaffinity(), white/blacklists
  • [selinux] AVC cache tuning
  • [sysfs] writes to kernel-exported internal implementation details
  • [systemd] CPUAffinity in /etc/systemd/system.conf
  • [vm] Virtual Memory subsystem tuning

Combined with profile inheritance, Tuned variables, conditionals, and built-in functions, the profiles provide a powerful way to modularize and abstract tuning details away from users; a short example of a custom profile follows.
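As a minimal illustration, the following custom Tuned CR inherits from the default openshift-node profile and bumps a single sysctl via the [sysctl] plugin for nodes carrying an illustrative tuned.openshift.io/ingress label. All names here are examples rather than resources shipped with the operator:

$ oc create -f- <<CUSTOM_TUNED
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom profile to widen the ephemeral port range
      include=openshift-node
      [sysctl]
      # widen the local port range on top of the inherited openshift-node settings
      net.ipv4.ip_local_port_range="1024 65535"
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress
    priority: 10
    profile: openshift-ingress
CUSTOM_TUNED

Labeling a node with tuned.openshift.io/ingress= then causes openshift-tuned on that node to switch to the openshift-ingress profile and apply the sysctl at run time, without a reboot.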

There are many predefined Tuned profiles, most of which work out of the box with the NTO. We continue to work towards supporting them all.

MachineConfigPool matching in OCP 4.5

The [bootloader] plugin is key for many Tuned profiles, such as the cpu-partitioning and realtime profiles. In OCP 4.5, the NTO added support for the Tuned [bootloader] plugin on RHCOS hosts, exposed through MachineConfigPool matching. Let us take a look at an example of how this works.

Note the order of node labeling and object creation used below; it ensures that each affected node reboots only once.

First, add a new node-role to a set of nodes in your OCP cluster:

$ oc label node <rt-nodes> node-role.kubernetes.io/worker-rt=""

Next, create a custom openshift-realtime Tuned profile that inherits from the openshift-node and realtime profiles. This profile will be applied to nodes in MachineConfigPools that select MachineConfigs carrying the "worker-rt" role label:

$ oc create -f- <<TUNED_PROFILE
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-realtime
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift realtime profile
      include=openshift-node,realtime
      [variables]
      # isolated_cores takes a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores=1
    name: openshift-realtime
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-rt"
    priority: 30
    profile: openshift-realtime
TUNED_PROFILE

Finally, create a worker-rt MachineConfigPool that selects the nodes we labeled with node-role.kubernetes.io/worker-rt. The pool's machineConfigSelector also matches the MachineConfig labels we referenced in the custom openshift-realtime Tuned profile:

$ oc create -f- <<MCP_RT
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    worker-rt: ""
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values:
      - worker
      - worker-rt
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""
MCP_RT
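
Once the MachineConfigPool exists, the NTO generates a MachineConfig with the kernel command-line parameters calculated from the openshift-realtime profile, and the Machine Config Operator rolls it out to the pool, rebooting each affected node once thanks to the ordering above. The rollout can be watched with, for example:

$ oc get mcp worker-rt -w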

Looking at any of the “worker-rt” labeled nodes should reveal kernel parameters calculated and set by the realtime Tuned profile:

$ echo 'cat /proc/cmdline' | oc debug node/<rt-node>
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-98c9b430a9a4b27263eb20ac33df3737681efe0cc866f51d2a9d7c0985a50348/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/98c9b430a9a4b27263eb20ac33df3737681efe0cc866f51d2a9d7c0985a50348/0 ignition.platform.id=aws skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup tsc=nowatchdog

Note the parameters at the end of the command line (skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup tsc=nowatchdog); these were appended by the realtime Tuned profile.

Based on isolated_cores=1 defined in our openshift-realtime Tuned custom resource, the Tuned daemon (managed on the node by openshift-tuned) calculated the CPU mask for the irqbalance daemon, updated the host's configuration file, and reloaded the daemon. Let us view that on the node:

$ echo 'grep ^IRQ /host/etc/sysconfig/irq*' | oc debug node/<rt-node>
IRQBALANCE_BANNED_CPUS=00000002

There are other settings applied by the realtime profile in the background via the Tuned daemon, such as setting kernel runtime parameters, using the [scheduler] plugin to set SMP IRQ affinity, and setting process/kthread CPU affinity mask.
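
A quick way to spot-check a couple of these on the node could look like the following; kernel.sched_rt_runtime_us is one of the runtime parameters the realtime profile is known to adjust, though the exact list and values depend on the Tuned version shipped:

$ echo 'head /proc/sys/kernel/sched_rt_runtime_us /proc/irq/default_smp_affinity' | oc debug node/<rt-node>

With isolated_cores=1, the default SMP IRQ affinity mask is expected to exclude CPU 1, consistent with the irqbalance ban shown above.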

Adding kernel parameters on RHCOS nodes can be achieved by creating MachineConfigs and MachineConfigPools manually. However, setting them via the new NTO 4.5 functionality has an advantage of building on top of existing profiles, using additional Tuned plugins including [scheduler], [irqbalance], and the ability to override kernel parameters defined by other Tuned profiles.
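
For comparison, the purely manual route means encoding the kernel arguments in a MachineConfig yourself, along these lines (a sketch only; the name and arguments below are placeholders, not values generated by the NTO):

$ oc create -f- <<MC_KARGS
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-worker-rt-kernel-args
  labels:
    # picked up by the worker-rt MachineConfigPool's machineConfigSelector
    machineconfiguration.openshift.io/role: worker-rt
spec:
  kernelArguments:
  - skew_tick=1
  - nosoftlockup
MC_KARGS

This gets the parameters onto the command line, but offers none of the profile inheritance or dynamic tuning described above.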

But Is This Enough?

The new MachineConfigPool matching functionality introduced in OCP 4.5 helps in many application scenarios and improves the compatibility with RHEL-shipped Tuned profiles. However, there are other considerations not covered by Tuned profiles that need to be addressed for applications sensitive to CPU and network latency.

Some of these considerations are:

  • NUMA-aware hugepage runtime allocation immediately after system boot.
  • Installing a real-time kernel.
  • Setting kubelet’s reservedSystemCPUs option to complement the isolcpus kernel parameter.
  • Setting kubelet’s topologyManagerPolicy option.

The Performance Addon Operator (PAO) addresses these considerations. It builds on the NTO, creating Tuned custom resources that the NTO applies to achieve the required latency on the selected nodes.
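
For illustration only, the two kubelet-related items could be set directly with a KubeletConfig resource; a minimal sketch, assuming the worker-rt MachineConfigPool from the example above and field names from the upstream KubeletConfiguration (the relevant kubelet features must be available on the cluster):

$ oc create -f- <<KUBELET_RT
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-rt-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      worker-rt: ""
  kubeletConfig:
    # keep CPU 0 for system daemons; CPU 1 stays isolated for the workload
    reservedSystemCPUs: "0"
    topologyManagerPolicy: single-numa-node
KUBELET_RT

In practice, the PAO generates resources like this, together with the Tuned and MachineConfig objects, from a single PerformanceProfile, rather than leaving the administrator to keep the pieces in sync by hand.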

Future Work

There are currently several areas of work planned for the NTO:

  • Integrating the latest Tuned 2.14 as the operand for the NTO.
  • Implementing an interface for retrieving the operator’s metrics.
  • Support for the [bootloader] plugin on RHEL 7.x hosts.
  • General compatibility improvements for all Tuned profiles.