How workload partitioning and isolation works behind the scenes
Having seen the outputs that partitioning produces, the natural question is how it all comes together behind the scenes. The isolation is achieved through a combination of configurations at the Kubelet and CRI-O levels, managed by the Performance Addon Operator.
What will you learn?
- A deeper understanding of what exactly goes into partitioning and isolation within your environment
What you need before starting:
- Red Hat account
- Red Hat OpenShift Container Platform
- A cluster with partitioning enabled
Kubelet configuration
The PerformanceProfile creates a KubeletConfig object. The key parameter here is reservedSystemCPUs.
oc get kubeletconfig performance-openshift-node-performance-profile -o yaml

# Snippet from KubeletConfig
...
spec:
  kubeletConfig:
    ...
    cpuManagerPolicy: static
    reservedSystemCPUs: 0-19
    topologyManagerPolicy: restricted
    ...

The reservedSystemCPUs: 0-19 directive instructs the Kubelet to reserve cores 0-19 for the operating system and Kubernetes system daemons. The Kubelet's CPU Manager will only consider the remaining cores (20-23) as allocatable for pods.
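If you want to confirm that the Kubelet actually applied the static policy, you can read its CPU Manager state file straight from the node. The sketch below is just one way to do it; the node name is the example node used in this article, and the path assumes the default Kubelet root directory.

# Open a debug shell on the node and read the CPU Manager state file
# (node name taken from the examples in this article; substitute your own)
oc debug node/master-01-demo -- chroot /host cat /var/lib/kubelet/cpu_manager_state

# With the profile applied, the state file should report "policyName":"static";
# the exact CPU sets listed will vary with the pods running on the node.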
CRI-O configuration
Additionally, a CRI-O configuration file is created to pin specific system-level workloads to the reserved cores. This ensures that even containers that are part of the OpenShift infrastructure are constrained to the reserved set.
# On a master node
cat /etc/crio/crio.conf.d/99-workload-pinning.conf

[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-19" }

This configuration tells CRI-O that any pod with the target.workload.openshift.io/management annotation should be placed on the 0-19 cpuset. This is how control plane pods are pinned, ensuring they do not interfere with user workloads.
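To see the annotation side of this contract, inspect one of the control plane pods running on the node. A minimal sketch, assuming an etcd pod whose name follows the usual etcd-<node-name> pattern; the exact annotations can vary between releases.

# List the etcd pods, then grep one pod's manifest for the workload annotations
oc get pods -n openshift-etcd
oc get pod etcd-master-01-demo -n openshift-etcd -o yaml | grep workload.openshift.io

# You should see the target.workload.openshift.io/management annotation that
# triggers the CRI-O rule above, plus, typically, per-container
# resources.workload.openshift.io/<name> annotations carrying the cpushares.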
Verifying core isolation at the kernel level
The final and most fundamental layer of verification is to inspect the kernel's boot parameters. These parameters, passed to the Linux kernel at startup, provide the low-level instructions that enforce CPU isolation from the very beginning of the system's operation. By examining /proc/cmdline, we can see the direct result of the PerformanceProfile configuration.
cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-c97ac5f995c95de8117ca18e99d4fd82651d24967ea8f886514abf2d37f508cd/vmlinuz-5.14.0-427.81.1.el9_4.x86_64
ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/c97ac5f995c95de8117ca18e99d4fd82651d24967ea8f886514abf2d37f508cd/0 root=UUID=910678ff-f77e-4a7d-8d53-86f2ac47a823
rw rootflags=prjquota boot=UUID=5da29aba-79d3-42eb-b6f1-df02cd30cc8a skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=20-23 tuned.non_isolcpus=000fffff
systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 intel_iommu=on iommu=pt isolcpus=managed_irq,20-23 intel_pstate=disable systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0

The output reveals several key parameters that directly enable workload partitioning:
- isolcpus=managed_irq,20-23: This is the primary parameter instructing the Linux kernel scheduler to isolate cores 20-23. The scheduler will avoid placing general-purpose processes on these cores, reserving them for workloads that are explicitly affinitized. The managed_irq flag additionally keeps kernel-managed device interrupts off these cores on a best-effort basis, directing them to the housekeeping CPUs whenever possible.
- rcu_nocbs=20-23: This parameter offloads RCU (Read-Copy-Update) callbacks from the isolated cores. RCU is a synchronization mechanism in the kernel, and moving its callbacks away from the workload cores reduces kernel "noise" and makes application performance more predictable.
- systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19: This parameter pins the systemd init process (PID 1) and, by extension, all core system services it manages, to the reserved CPU set (0-19).
- tuned.non_isolcpus=000fffff: This provides a CPU mask to the tuned daemon, which manages performance profiles. The mask 000fffff has the lowest 20 bits set, corresponding to CPUs 0-19, and explicitly marks these as the non-isolated, general-purpose cores.
Together, these kernel arguments create a robust, low-level foundation for CPU isolation, ensuring that the separation between system and workload resources is maintained right from the boot process.
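You can cross-check these boot parameters against the running kernel with a few quick reads from a shell on the node (for example, via oc debug node/... followed by chroot /host). A minimal sketch; the sysfs paths are standard, but exact output formats can vary slightly by kernel version.

# CPUs the running kernel treats as isolated
cat /sys/devices/system/cpu/isolated
# expected: 20-23

# CPU affinity of systemd (PID 1), inherited by the services it manages
taskset -cp 1
# expected: pid 1's current affinity list: 0-19

# The tuned.non_isolcpus mask is simply a bitmask of the reserved cores:
# bits 0 through 19 set -> (1 << 20) - 1 = 0xfffff
printf '%x\n' $(( (1 << 20) - 1 ))
# prints: fffff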
Node status comparison
The effect of workload partitioning is clearly visible in the oc describe node output, specifically in the Capacity and Allocatable sections.
Before workload partitioning
Without workload partitioning, the Allocatable CPU is much higher (23500m, or 23.5 cores), as only a small fraction is reserved by default for system overhead.
# oc describe node master-01-demo (Before)
...
Capacity:
cpu: 24
memory: 30797840Ki
pods: 250
Allocatable:
cpu: 23500m
memory: 29646864Ki
pods: 250
...

After workload partitioning
Notice that while the Capacity shows 24 total CPUs, the Allocatable CPU count is only 4. This reflects the 20 cores that were reserved for the system.
# oc describe node master-01-demo (After)
...
Capacity:
cpu: 24
memory: 30797848Ki
pods: 250
Allocatable:
cpu: 4
memory: 29671448Ki
pods: 250
...

This comparison starkly illustrates how workload partitioning carves out a dedicated, non-allocatable set of CPU resources for system stability. With that, we now have a full picture of the changes partitioning makes and the benefits it offers.
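If you only care about the numbers, the same capacity and allocatable figures can be pulled directly from the node object rather than scanning the full describe output. A small sketch, using the example node name from above:

# Print capacity vs. allocatable CPU for the node, one value per line
oc get node master-01-demo \
  -o jsonpath='capacity: {.status.capacity.cpu}{"\n"}allocatable: {.status.allocatable.cpu}{"\n"}'

# With the profile above applied, this should print something like:
# capacity: 24
# allocatable: 4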