How workload partitioning and isolation works behind the scenes
Having seen the outputs that partitioning produces, the natural question is how it all comes together behind the scenes. The isolation is achieved through a combination of configurations at the Kubelet and CRI-O levels, managed by the Performance Addon Operator.
What will you learn?
- A deeper understanding of what exactly goes into partitioning and isolation within your environment
What you need before starting:
- Red Hat account
- Red Hat OpenShift Container Platform
- A cluster with partitioning enabled
Kubelet configuration
The PerformanceProfile creates a KubeletConfig object. The key parameter here is reservedSystemCPUs.
oc get kubeletconfig performance-openshift-node-performance-profile -o yaml

# Snippet from KubeletConfig
...
spec:
  kubeletConfig:
    ...
    cpuManagerPolicy: static
    reservedSystemCPUs: 0-19
    topologyManagerPolicy: restricted
    ...

The reservedSystemCPUs: 0-19 directive instructs the Kubelet to reserve cores 0-19 for the operating system and Kubernetes system daemons. The Kubelet's CPU Manager will only consider the remaining cores (20-23) as allocatable for pods.
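If you want to confirm that the Kubelet actually applied the static policy, you can read its CPU Manager state file straight from the node. The sketch below is just one way to do it; the node name is the example node used in this article, and the path assumes the default Kubelet root directory.

# Open a debug shell on the node and read the CPU Manager state file
# (node name taken from the examples in this article; substitute your own)
oc debug node/master-01-demo -- chroot /host cat /var/lib/kubelet/cpu_manager_state

# With the profile applied, the state file should report "policyName":"static";
# the exact CPU sets listed will vary with the pods running on the node.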
CRI-O configuration
Additionally, a CRI-O configuration file is created to pin specific system-level workloads to the reserved cores. This ensures that even containers that are part of the OpenShift infrastructure are constrained to the reserved set.
# On a master node
cat /etc/crio/crio.conf.d/99-workload-pinning.conf

[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-19" }

This configuration tells CRI-O that any pod with the target.workload.openshift.io/management annotation should be placed on the 0-19 cpuset. This is how control plane pods are pinned, ensuring they do not interfere with user workloads.
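To see the annotation side of this contract, inspect one of the control plane pods running on the node. A minimal sketch, assuming an etcd pod whose name follows the usual etcd-<node-name> pattern; the exact annotations can vary between releases.

# List the etcd pods, then grep one pod's manifest for the workload annotations
oc get pods -n openshift-etcd
oc get pod etcd-master-01-demo -n openshift-etcd -o yaml | grep workload.openshift.io

# You should see the target.workload.openshift.io/management annotation that
# triggers the CRI-O rule above, plus, typically, per-container
# resources.workload.openshift.io/<name> annotations carrying the cpushares.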
Verifying core isolation at the kernel level
The final and most fundamental layer of verification is to inspect the kernel's boot parameters. These parameters, passed to the Linux kernel at startup, provide the low-level instructions that enforce CPU isolation from the very beginning of the system's operation. By examining /proc/cmdline, we can see the direct result of the PerformanceProfile configuration.
cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-c97ac5f995c95de8117ca18e99d4fd82651d24967ea8f886514abf2d37f508cd/vmlinuz-5.14.0-427.81.1.el9_4.x86_64
ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/c97ac5f995c95de8117ca18e99d4fd82651d24967ea8f886514abf2d37f508cd/0 root=UUID=910678ff-f77e-4a7d-8d53-86f2ac47a823
rw rootflags=prjquota boot=UUID=5da29aba-79d3-42eb-b6f1-df02cd30cc8a skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=20-23 tuned.non_isolcpus=000fffff
systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 intel_iommu=on iommu=pt isolcpus=managed_irq,20-23 intel_pstate=disable systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0

The output reveals several key parameters that directly enable workload partitioning:
- isolcpus=managed_irq,20-23: This is the primary parameter instructing the Linux kernel scheduler to isolate cores 20-23. The scheduler will avoid placing general-purpose processes on these cores, reserving them for workloads that are explicitly affinitized. The managed_irq flag additionally keeps kernel-managed device interrupts off these cores on a best-effort basis, directing them to the housekeeping CPUs whenever possible.
- rcu_nocbs=20-23: This parameter offloads RCU (Read-Copy-Update) callbacks from the isolated cores. RCU is a synchronization mechanism in the kernel, and moving its callbacks away from the workload cores reduces kernel "noise" and makes application performance more predictable.
- systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19: This parameter pins the systemd init process (PID 1) and, by extension, all core system services it manages, to the reserved CPU set (0-19).
- tuned.non_isolcpus=000fffff: This provides a CPU mask to the tuned daemon, which manages performance profiles. The mask 000fffff has the lowest 20 bits set, corresponding to CPUs 0-19, and explicitly marks these as the non-isolated, general-purpose cores.
Together, these kernel arguments create a robust, low-level foundation for CPU isolation, ensuring that the separation between system and workload resources is maintained right from the boot process.
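You can cross-check these boot parameters against the running kernel with a few quick reads from a shell on the node (for example, via oc debug node/... followed by chroot /host). A minimal sketch; the sysfs paths are standard, but exact output formats can vary slightly by kernel version.

# CPUs the running kernel treats as isolated
cat /sys/devices/system/cpu/isolated
# expected: 20-23

# CPU affinity of systemd (PID 1), inherited by the services it manages
taskset -cp 1
# expected: pid 1's current affinity list: 0-19

# The tuned.non_isolcpus mask is simply a bitmask of the reserved cores:
# bits 0 through 19 set -> (1 << 20) - 1 = 0xfffff
printf '%x\n' $(( (1 << 20) - 1 ))
# prints: fffff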
Node status comparison
The effect of workload partitioning is clearly visible in the oc describe node output, specifically in the Capacity and Allocatable sections.
Before workload partitioning
Without workload partitioning, the Allocatable CPU is much higher (23500m, or 23.5 cores), as only a small fraction is reserved by default for system overhead.
# oc describe node master-01-demo (Before)
...
Capacity:
cpu: 24
memory: 30797840Ki
pods: 250
Allocatable:
cpu: 23500m
memory: 29646864Ki
pods: 250
...

After workload partitioning
Notice that while the Capacity shows 24 total CPUs, the Allocatable CPU count is only 4. This reflects the 20 cores that were reserved for the system.
# oc describe node master-01-demo (After)
...
Capacity:
cpu: 24
memory: 30797848Ki
pods: 250
Allocatable:
cpu: 4
memory: 29671448Ki
pods: 250
...

This comparison starkly illustrates how workload partitioning carves out a dedicated, non-allocatable set of CPU resources for system stability. With that, we now have a full picture of the changes partitioning makes and the benefits it offers.
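If you only care about the numbers, the same capacity and allocatable figures can be pulled directly from the node object rather than scanning the full describe output. A small sketch, using the example node name from above:

# Print capacity vs. allocatable CPU for the node, one value per line
oc get node master-01-demo \
  -o jsonpath='capacity: {.status.capacity.cpu}{"\n"}allocatable: {.status.allocatable.cpu}{"\n"}'

# With the profile above applied, this should print something like:
# capacity: 24
# allocatable: 4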