Typically, Red Hat OpenShift Container Platform (RHOCP) administrators do not need to worry about node-level tuning because the platform ships with reasonable defaults for running general-purpose workloads. However, there are scenarios where intervention is needed to improve workload performance. Most of the time, this intervention is done by cluster administrators as a post-installation (day 2) configuration change, but it may also be necessary at cluster installation time (day 1).

The aim of this blog post is to give an overview of the default node-level tuning applied to an OpenShift cluster and of the extra options cluster administrators have for applying node-level tuning to tailor the platform's performance to their needs.

Option 0: Do nothing

The OpenShift platform, like Red Hat Enterprise Linux, comes tuned by default for general-purpose workloads. The system tuning is primarily performed by the Node Tuning Operator (NTO), which is one of the core OpenShift operators.

Many of the tunables in NTO's parent openshift profile, shown below, raise certain kernel limits. This improves how the system behaves under higher load and at larger cluster scale. On the other hand, these changes mostly come at the cost of increased kernel memory consumption.

[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}

[selinux]
avc_cache_threshold=8192 # rhbz#1548428, PR10027

[net]
nf_conntrack_hashsize=1048576 # PR413 (the default limit is too low for OpenShift)

[sysctl]
net.ipv4.ip_forward=1 # Forward packets between interfaces
kernel.pid_max=>4194304 # PR79, for large-scale workloads; systemd sets kernel.pid_max to 4M since v243
fs.aio-max-nr=>1048576 # PSAP-900
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2 # rhbz#1758552 pod communication due to ARP failures
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536 # rhbz#1384746 gc_thresh3 limits no. of nodes/routes
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144 # rhbz#1793714 ElasticSearch (logging)

[sysfs]
/sys/module/nvme_core/parameters/io_timeout=4294967295
/sys/module/nvme_core/parameters/max_retries=10

[scheduler]
# see rhbz#1979352; exclude containers from aligning to housekeeping CPUs
cgroup_ps_blacklist=/kubepods\.slice/
# workaround for rhbz#1921738
runtime=0

In the openshift profile, we mostly build on the throughput-performance profile, which is the default profile recommended for servers, and add further functional and performance tunables on top of it. The functional tunables (for example, enabling IP forwarding and adjusting ARP announcement behavior for pod communication) are needed for the platform to work correctly, while the performance tunables (such as raising the conntrack, neighbor table, PID, and asynchronous I/O limits) keep the system healthy at higher load and scale; the comments in the profile above point to the issues that motivated each of them.

In the current versions of RHOCP, the openshift-control-plane profile simply inherits from the openshift profile. As for the openshift-node profile, both fs.inotify settings are functional settings that, at this point, only mirror the settings already provided by the Machine Config Operator (MCO) so that they are applied before the kubelet starts. The net.ipv4.tcp_fastopen=3 setting reduces network latency by enabling data exchange during the initial TCP SYN on both client and server connections.
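
For reference, the sketch below shows roughly how the openshift-node profile is delivered inside the default Tuned CR shipped by NTO. It is only an illustration: the fs.inotify values mirror those set by MCO and, like the summary text, may differ between OpenShift versions.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift-node
    data: |
      [main]
      summary=Optimize systems running OpenShift nodes
      include=openshift
      [sysctl]
      # Mirror the MCO-provided inotify limits (illustrative values)
      fs.inotify.max_user_watches=65536
      fs.inotify.max_user_instances=8192
      # Enable TCP Fast Open on both client and server connections
      net.ipv4.tcp_fastopen=3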

Option 1: I need some custom tuning

In a previous blog post, we covered how to apply custom node-level configuration using NTO. Here we will give an example Tuned CR for tuning a system with a 10 Gigabit Intel(R) network interface card for throughput, as suggested by the Linux kernel documentation for the ixgb driver.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-network-tuning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Increase throughput for NICs using ixgb driver
      include=openshift-node
      [sysctl]
      ### CORE settings (mostly for socket and UDP effect)
      # Set maximum receive socket buffer size.
      net.core.rmem_max = 524287
      # Set maximum send socket buffer size.
      net.core.wmem_max = 524287
      # Set default receive socket buffer size.
      net.core.rmem_default = 524287
      # Set default send socket buffer size.
      net.core.wmem_default = 524287
      # Set maximum amount of option memory buffers.
      net.core.optmem_max = 524287
      # Set number of unprocessed input packets before kernel starts dropping them.
      net.core.netdev_max_backlog = 300000
    name: openshift-network-tuning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 20
    profile: openshift-network-tuning

Note that only the core network settings are included for brevity.

Option 2: I need low-latency/real-time tuning

Some specialized workloads require low-latency or real-time tuning, such as Telco 5G Core User Plane Function (UPF), Financial Services Industry (FSI), and some High-Performance Computing (HPC) workloads. However, such tuning requires trade-offs, be it a loss of overall throughput when using the real-time kernel, higher power consumption, or statically partitioning your system into housekeeping and workload partitions. Static partitioning counteracts the OpenShift Kubernetes platform's efficient use of computing resources and might oversubscribe the housekeeping partitions. Partitioning needs to happen on many different levels and requires the coordination of various components.

Apart from partitioning, there are other ways of reducing latency in software.

All of the above can be performed manually; however, great care must be taken to apply them coherently. This is where NTO's Performance Profile controller comes in. It acts as an orchestrator that removes the burden of manual configuration and makes sure that all the components needed for the above tasks (kernel, TuneD, Kubelet [CPU, Topology, and Memory Managers], CRI-O) are properly configured based on a given PerformanceProfile.

This is an example of a PerformanceProfile for a single-node OpenShift deployment:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "2-31,34-63"
    reserved: "0-1,32-33"
  globallyDisableIrqLoadBalancing: false
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  net:
    userLevelNetworking: false
    devices: []
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "best-effort"
  realTimeKernel:
    enabled: true

The PerformanceProfile above assigns CPUs 2-31 and 34-63 to low-latency workloads and reserves the remaining 4 CPUs for system housekeeping tasks. In some cases, the reserved CPUs are insufficient to handle device interrupts. For this reason, the example above allows interrupt processing on the isolated (tuned and ready for sensitive workloads) CPUs by setting globallyDisableIrqLoadBalancing to false. IRQ load balancing can then still be disabled on the CPUs of individual pods by using the irq-load-balancing.crio.io and cpu-quota.crio.io annotations set to "disable". Additionally, this example profile provides 16 1 GiB huge pages on NUMA node 0, does not restrict the number of NIC queues to the number of reserved CPUs, enables the Topology Manager best-effort NUMA alignment policy, and enables the real-time kernel. Other system changes are configured implicitly by the Performance Profile controller. For example:

  • Setting the CPU Manager policy to static to enable exclusive allocation of CPUs (see the kubelet configuration sketch after this list)
  • Enforcing allocation of full physical cores when the topology policy is restricted or single-numa-node
  • Setting the CPU Manager reconcile period (the shorter the reconcile period, the faster the CPU Manager prevents non-Guaranteed pods from running on isolated CPUs, at the cost of more system resources)
  • Setting the Memory Manager policy (when the topology policy is restricted or single-numa-node) to pin memory and huge pages closer to the allocated CPUs
  • Creating a high-performance handler and RuntimeClass for CRI-O (note the runtimeClassName in the pod specification below)
  • Enabling and starting stalld via TuneD
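
To make the kubelet-related items above more concrete, here is a rough sketch of the kind of KubeletConfig the controller renders from the PerformanceProfile in this example. The object name and exact values are illustrative, not a verbatim dump of what the controller generates.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: performance-performance
spec:
  kubeletConfig:
    # Exclusive CPU allocation for Guaranteed pods with integer CPU requests
    cpuManagerPolicy: static
    # Illustrative value; shorter periods react faster at the cost of more overhead
    cpuManagerReconcilePeriod: 5s
    # Mirrors spec.numa.topologyPolicy from the PerformanceProfile
    topologyManagerPolicy: best-effort
    # Mirrors spec.cpu.reserved from the PerformanceProfile
    reservedSystemCPUs: "0-1,32-33"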

At this point, the reader might believe that all the configuration needed to achieve low latency goes into the PerformanceProfile. However, this is not the case. We also need to make appropriate adjustments to the low-latency workload's pod specification itself: making sure the pod is placed into the Guaranteed QoS class (every container must have CPU and memory requests equal to its limits; specifying only limits, as below, lets the requests default to the limits), adding the user-requested CRI-O annotations, and specifying the predefined runtime class.

apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Disable CFS CPU quota accounting
    cpu-quota.crio.io: "disable"
    # Disable CPU load balancing in CRI-O
    cpu-load-balancing.crio.io: "disable"
    # Opt out of interrupt handling
    irq-load-balancing.crio.io: "disable"
spec:
  # Map to the correct performance class
  runtimeClassName: get-from-performance-profile
  ...
  containers:
  - name: container-name
    image: image-registry/image
    resources:
      limits:
        memory: "2Gi"
        cpu: "16"
    ...

It is important to note that NTO's Performance Profile controller will overwrite any custom Kubelet changes. However, it is possible to supply custom Kubelet changes through a PerformanceProfile annotation. Similarly, it is also possible to add extra TuneD configuration that overrides or builds on top of the one generated by the Performance Profile controller.
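
As a rough sketch of the latter, the Tuned CR below layers an extra, purely illustrative sysctl on top of the profile generated for the PerformanceProfile named performance (generated TuneD profiles follow the openshift-node-performance-<name> naming convention). The CR name, recommend priority, node label match, and sysctl are assumptions; adapt them to your own cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: performance-patch
    data: |
      [main]
      summary=Additional tuning on top of the generated performance profile
      # Build on the profile generated for the "performance" PerformanceProfile
      include=openshift-node-performance-performance
      [sysctl]
      # Purely illustrative override; pick settings appropriate for your workload
      net.core.busy_read=50
  recommend:
  - match:
    - label: node-role.kubernetes.io/master
    priority: 19
    profile: performance-patch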

Configuration via PerformanceProfiles adds more partitioning to the RHOCP system. This makes sense for DPDK applications that process network packets in user space and cannot afford hardware interrupts, and it applies to other latency-sensitive applications as well. However, the extra partitioning has a cost. Reserved cores can be wasted unnecessarily, or they might not be sufficient to run the OS and/or RHOCP management pods. Therefore, careful planning and testing are always necessary when partitioning RHOCP in this way.

Summary

RHOCP administrators have multiple options to tune their nodes for performance. There are a few key considerations to keep in mind. Firstly, can the node-level tuning be performed after cluster installation, or does it need to be considered at cluster installation time? The vast majority of tuning can be performed as a post-installation step; custom tuning by NTO and Tuned profiles falls into this category. Secondly, can we afford to trade off underutilization or overcommitment of some CPUs by strictly partitioning our RHOCP cluster to avoid noisy neighbors? If so, consider using NTO's PerformanceProfiles. And lastly, we need to ask ourselves how many node reboots our chosen way of tuning will require.