Typically, Red Hat OpenShift Container Platform (RHOCP) administrators do not need to worry about node-level tuning because the platform ships with reasonable defaults for running general-purpose workloads. However, there are scenarios where intervention is needed to improve workload performance. Most of the time, this intervention is done by cluster administrators as a post-installation (day 2) configuration change, but it might also be necessary at cluster installation time (day 1).
The aim of this blog post is to give an overview of the default node-level tuning done on an OpenShift cluster and the extra options cluster administrators have to apply node-level tuning to tailor the performance of the platform to their needs.
Option 0: Do nothing
The OpenShift platform, like Red Hat Enterprise Linux, comes tuned by default for general-purpose workloads. The system tuning is primarily performed by the Node Tuning Operator (NTO), which is one of the core OpenShift operators.
Many of the tunables in the parent openshift profile raise certain kernel limits. This improves how the system functions under higher system load and at cluster scale. On the other hand, these changes mostly come at the cost of increased kernel memory consumption.
[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}

[selinux]
avc_cache_threshold=8192 # rhbz#1548428, PR10027

[net]
nf_conntrack_hashsize=1048576 # PR413 (the default limit is too low for OpenShift)

[sysctl]
net.ipv4.ip_forward=1 # Forward packets between interfaces
kernel.pid_max=>4194304 # PR79, for large-scale workloads; systemd sets kernel.pid_max to 4M since v243
fs.aio-max-nr=>1048576 # PSAP-900
net.ipv4.conf.all.arp_announce=2 # rhbz#1758552 pod communication due to ARP failures
net.ipv4.neigh.default.gc_thresh3=65536 # rhbz#1384746 gc_thresh3 limits no. of nodes/routes
vm.max_map_count=262144 # rhbz#1793714 ElasticSearch (logging)

[scheduler]
# see rhbz#1979352; exclude containers from aligning to housekeeping CPUs
cgroup_ps_blacklist=/kubepods\.slice/
# workaround for rhbz#1921738
runtime=0
In the openshift profile, we mostly build on the throughput-performance profile, which is the default profile recommended for servers. Moreover, we include other functional and performance tunables. The functional ones:
- Fix OpenShift pod communication issues due to ARP failures between a node and its pods
- Adjust vm.max_map_count to enable a clean start of Elasticsearch pods
- Adjust kernel.pid_max for large-scale workloads (also helps with pod density)
The performance tunables:
- Allow large clusters and more than 1000 routes
- Adjust the size of the netfilter connection-tracking hash table and its maximum number of entries
- Improve node performance (CPU utilization) by adjusting the AVC cache threshold (1, 2)
- Enable more VMs running on RHOCP nodes
- Prevent the TuneD [scheduler] plug-in from aligning containers to housekeeping CPUs
- Disable the dynamic (runtime) behavior of the TuneD [scheduler] plug-in
In the current versions of RHOCP, the openshift-control-plane profile simply inherits from the openshift profile. As for the openshift-node profile, both fs.inotify settings are functional settings that, at this point, only mirror the settings already provided by the MCO to enable their application before kubelet starts. The net.ipv4.tcp_fastopen=3 setting reduces network latency by enabling data exchange during the sender's initial TCP SYN on both client and server connections.
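Put together, the openshift-node profile is quite short. It looks roughly like the following sketch; the exact values may differ between RHOCP versions, so treat this as an illustration rather than the authoritative profile:

```ini
[main]
summary=Optimize systems running OpenShift nodes
include=openshift

[sysctl]
# Mirror the MCO-provided inotify limits so they apply before kubelet starts
fs.inotify.max_user_watches=65536
fs.inotify.max_user_instances=8192
# Enable TCP Fast Open on both client and server connections
net.ipv4.tcp_fastopen=3
```

Because the profile includes the parent openshift profile, every tunable listed earlier also applies to worker nodes.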
Option 1: I need some custom tuning
In a previous blog post, we covered how to apply custom node-level configuration using NTO. Here we will give an example Tuned CR for tuning a system with a 10 Gigabit Intel(R) network interface card for throughput, as suggested by the Linux kernel documentation for the ixgb driver.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-ixgb
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Increase throughput for NICs using ixgb driver
      include=openshift-node

      [sysctl]
      ### CORE settings (mostly for socket and UDP effect)
      # Set maximum receive socket buffer size.
      net.core.rmem_max = 524287
      # Set maximum send socket buffer size.
      net.core.wmem_max = 524287
      # Set default receive socket buffer size.
      net.core.rmem_default = 524287
      # Set default send socket buffer size.
      net.core.wmem_default = 524287
      # Set maximum amount of option memory buffers.
      net.core.optmem_max = 524287
      # Set number of unprocessed input packets before kernel starts dropping them.
      net.core.netdev_max_backlog = 300000
    name: openshift-ixgb
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 20
    profile: openshift-ixgb
Note that, for brevity, only the core network settings are included, and the names used for the Tuned CR and profile are illustrative.
Option 2: I need low-latency/real-time tuning
Some specialized workloads require low-latency/real-time tuning, such as Telco 5G Core User Plane Function (UPF), Financial Services Industry (FSI), and some High-Performance Computing (HPC) workloads. However, such tuning requires sacrifices, be it a loss of overall throughput when using the real-time kernel, increased power consumption, or statically partitioning your system into housekeeping and workload partitions. Static partitioning counteracts the OpenShift Kubernetes platform’s efficient use of computing resources and might oversubscribe the housekeeping partitions. Partitioning needs to happen on many different levels and requires coordination of various components:
- Separating management and workload pods
- Using Guaranteed pods for workloads
- Separating system processes away from workload CPUs
- Moving kernel threads to housekeeping CPUs
- Moving network interface controller (NIC) IRQs to housekeeping CPUs
Apart from partitioning, there are other ways of reducing latency in software:
- Using the real-time kernel
- Using huge pages (per NUMA node) to avoid the cost of TLB misses
- Disabling CPU load balancing for DPDK workloads
- Disabling CPU CFS quota
- Possibly disabling hyperthreading to reduce variations in latency
- BIOS tuning
All of the above can be performed manually; however, great care needs to be taken to perform them coherently. This is where NTO’s Performance Profile controller comes in. It acts as an orchestrator that takes the burden out of manual configuration and makes sure that all the components necessary to perform the above tasks (kernel, TuneD, Kubelet [CPU, Topology, and Memory Managers], and CRI-O) are properly configured based on a given PerformanceProfile.
This is an example of a PerformanceProfile for a single-node OpenShift deployment (the resource name is illustrative):

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-sno-profile
spec:
  cpu:
    isolated: "2-31,34-63"
    reserved: "0-1,32-33"
  globallyDisableIrqLoadBalancing: false
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  net:
    userLevelNetworking: false
  numa:
    topologyPolicy: "best-effort"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
The PerformanceProfile above assigns CPUs 2-31 and 34-63 to low-latency workloads and reserves the remaining 4 CPUs for system housekeeping tasks. In some cases, the reserved CPUs are insufficient to handle device interrupts. For this reason, the example above allows interrupt processing on the isolated (tuned and ready for sensitive workloads) CPUs by setting globallyDisableIrqLoadBalancing to false. However, IRQ load balancing can still be disabled on the CPUs of individual pods via the irq-load-balancing.crio.io: "disable" and cpu-quota.crio.io: "disable" annotations. Additionally, this example profile provides 16 1GiB huge pages on NUMA node 0, does not restrict the number of NIC queues to the number of reserved CPUs, enables the Topology Manager best-effort NUMA alignment policy, and enables the real-time kernel. Other system changes performed by the PerformanceProfile controller are configured implicitly. For example:
- Setting CPU Manager policy static to enable exclusive allocation of CPUs
- Enforcing allocation of full physical cores when topology policy is restricted or single-numa-node
- Setting the CPU Manager reconcile period (the shorter the reconcile period, the faster the CPU Manager prevents non-Guaranteed pods from running on isolated CPUs, at the cost of more system resources)
- Setting Memory Manager policy (when topology policy is restricted or single-numa-node) to pin memory and huge pages closer to the allocated CPUs
- Creating a high-performance handler and RuntimeClass for CRI-O (note the runtimeClassName in the pod specification below)
- Enabling and starting stalld via TuneD
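To illustrate the RuntimeClass point above, the object generated by the controller looks roughly like the following sketch. The class name is derived from the PerformanceProfile name (performance-&lt;profile-name&gt;), so the name shown here is an assumption for a profile called example-profile:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  # NTO names the generated class performance-<profile-name>
  name: performance-example-profile
# CRI-O runtime handler configured by the Performance Profile controller
handler: high-performance
```

Pods reference this class via spec.runtimeClassName so CRI-O applies the high-performance runtime handler to them.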
At this point the reader might believe that all the configuration to achieve low latency goes into the PerformanceProfile. However, this is not the case. We also need to make appropriate adjustments to the low-latency workload pod specification itself. Specifically, we must make sure the pod is placed into the Guaranteed QoS class, add the user-requested CRI-O annotations, and specify the predefined runtime class.
apiVersion: v1
kind: Pod
metadata:
  name: example-low-latency-pod
  annotations:
    # Disable CFS cpu quota accounting
    cpu-quota.crio.io: "disable"
    # Disable CPU balance with CRIO
    cpu-load-balancing.crio.io: "disable"
    # Opt-out from interrupt handling
    irq-load-balancing.crio.io: "disable"
spec:
  # Map to the correct performance class (performance-<profile-name>)
  runtimeClassName: performance-example-profile
  containers:
  - name: container-name
    image: registry.example.com/low-latency-app:latest
    # Equal requests and limits with whole CPUs place the pod
    # into the Guaranteed QoS class
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "2"
        memory: "512Mi"
It is important to note that NTO's Performance Profile controller will overwrite any custom Kubelet changes. However, it is possible to add the custom Kubelet changes to the PerformanceProfile annotation. Similarly, it is also possible to add extra TuneD configuration to override or build on top of the Performance Profile controller generated one.
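As a sketch of the Kubelet override mechanism, such changes can be attached to the PerformanceProfile through an annotation carrying a free-form JSON KubeletConfig fragment. The annotation name below is the experimental one documented for NTO, and the field values are illustrative assumptions:

```yaml
# Metadata excerpt of a PerformanceProfile with a custom Kubelet override
metadata:
  name: example-sno-profile
  annotations:
    # Free-form JSON merged into the generated KubeletConfig
    # (illustrative value; use with care, this is experimental)
    kubeletconfig.experimental: |
      {"systemReserved": {"memory": "3Gi"}}
```

Because the controller owns the generated KubeletConfig, direct edits to it are overwritten; the annotation is the supported way to layer custom settings on top.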
Configuration by PerformanceProfiles adds more partitioning to the RHOCP system. This makes sense for DPDK applications that process network packets in user space and cannot afford hardware interrupts. Similarly, it also applies to other latency-sensitive applications. However, the extra partitioning has a cost. Reserved cores can be wasted unnecessarily, or they may not be sufficient to run the OS and/or RHOCP management pods. Therefore, careful planning and testing are always necessary when partitioning RHOCP in this way.
RHOCP administrators have multiple options to tune their nodes for performance, and there are a few key considerations to keep in mind. Firstly, can the node-level tuning be performed after cluster installation, or does it need to be thought about at cluster installation time? The vast majority of tuning can be performed as a post-installation step; custom tuning by NTO and Tuned profiles falls into this category. Secondly, can we afford to trade off underutilization or overcommitment of some CPUs by strictly partitioning our RHOCP cluster to avoid noisy neighbors? If so, consider using NTO's PerformanceProfiles. And lastly, when choosing a way of tuning, we need to ask ourselves how many node reboots it will require.