How to create and scale 6,000 virtual machines in 7 hours with Red Hat OpenShift Virtualization

In this learning path by Boaz Ben Shabat, learn how to create a large number of virtual machines for a production infrastructure in less than a weekend.

Hour 1: Network tuning for Red Hat Ceph Storage

1 hr

The objective of this Red Hat® Ceph® Storage (RHCS) cluster is to provide storage for the many network streams generated by 132 nodes, all of which access a cluster that consists of only 12 hosts.

This resource discusses the Linux® network tuning used on the RHCS hosts to accommodate this large-scale environment, based on the TuneD profile shown below.

This process can be automated using Python. The automation was built in a modular structure, so it can be modified for use in any environment.

What will you learn?

  • Setting up a TuneD profile
  • Configuring specific controls in the profile
  • An overview of what this resource covers

What you need before starting:

Steps for TuneD profile configuration:

TuneD Profile

TuneD was used to optimize the network performance for this scale. TuneD can be installed via:

dnf install tuned -y; systemctl start tuned; systemctl enable tuned

In our environment, we changed the parameters as shown in the configuration below.

[main]
summary=Optimize for RHCS performance focused on low latency network performance
[vm]
# Disable Transparent Huge Pages (Default: always)
transparent_hugepages=never
[sysctl]
# Network core: Adjust the busy read threshold to 50 (Default: 0)
net.core.busy_read=50
# Network core: Adjust the busy poll threshold to 50 (Default: 0)
net.core.busy_poll=50
# NUMA balancing: Disable NUMA balancing (Default: 1)
kernel.numa_balancing=0
# Kernel task timeout: Set hung task timeout to 600 seconds (Default: 120)
kernel.hung_task_timeout_secs=600
# NMI watchdog: Disable the NMI watchdog (Default: 1)
kernel.nmi_watchdog=0
# Virtual Memory statistics interval: Set interval to 10 seconds (Default: 1)
vm.stat_interval=10
# Kernel timer migration: Disable kernel timer migration (Default: 1)
kernel.timer_migration=0
# ARP cache tuning: Threshold 1 for garbage collector triggering (Default: 128)
net.ipv4.neigh.default.gc_thresh1 = 4096
# ARP cache tuning: Threshold 2 for garbage collector triggering (Default: 512)
net.ipv4.neigh.default.gc_thresh2 = 16384
# ARP cache tuning: Threshold 3 for garbage collector triggering (Default: 1024)
net.ipv4.neigh.default.gc_thresh3 = 32768
# ARP Flux: Enable ARP filtering (Default: 0)
net.ipv4.conf.all.arp_filter = 1
# ARP Flux: Ignore ARP requests from unknown sources (Default: 0)
net.ipv4.conf.all.arp_ignore = 1
# ARP Flux: Announce local source IP address on ARP requests (Default: 0)
net.ipv4.conf.all.arp_announce = 1
# TCP/IP Tuning: Enable TCP window scaling (Default: 1)
net.ipv4.tcp_window_scaling = 1
# TCP Fast Open: Enable TCP Fast Open (Default: 1)
net.ipv4.tcp_fastopen = 3
# Buffer Size Tuning: Maximum receive buffer size for all network interfaces (Default: 212992)
net.core.rmem_max = 11639193
# Buffer Size Tuning: Maximum send buffer size for all network interfaces (Default: 212992)
net.core.wmem_max = 11639193
# Buffer Size Tuning: Default send buffer size for all network interfaces (Default: 212992)
net.core.wmem_default = 2909798
# NIC buffers: Maximum number of packets per network device queue (Default: 300)
net.core.netdev_budget = 1000
# NIC buffers: Maximum backlog size for incoming packets (Default: 1000)
net.core.netdev_max_backlog = 5000
# Network latency optimization: Enable kernel skew_tick for reduced network latency
[bootloader]
cmdline_network_latency=skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1
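
To apply these settings, the configuration can be saved as a custom TuneD profile and activated with tuned-adm. Below is a minimal sketch, assuming the profile is named rhcs-network (the name and path are placeholders, not the ones used in this environment):

# Create the custom profile directory and save the configuration above as tuned.conf
mkdir -p /etc/tuned/rhcs-network
# Activate the profile; the sysctl and THP settings take effect immediately
tuned-adm profile rhcs-network
# The [bootloader] command-line options require a reboot to take effect
tuned-adm active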

Let’s explore the key modifications made to the parameters above.

Hugepages

transparent_hugepages=never

While transparent hugepages (THP) can be useful in certain cases for better memory management, they can also cause performance problems and unpredictable delays.

By disabling THP and setting it to never, we make sure that RHCS performance is consistent, which is exactly what we are looking for in these performance tests. This avoids potential slowdowns or hiccups caused by the automatic handling of page sizes, so we can maintain low latency and high throughput.

Note that the impact of THP can vary depending on the hardware type due to factors such as page size support, memory management unit (MMU) features, cache hierarchy, etc.
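
Whether THP is actually disabled can be confirmed at runtime by reading the kernel's sysfs setting (a quick check, not part of the profile itself):

# The active value is shown in brackets; expect [never] once the profile is applied
cat /sys/kernel/mm/transparent_hugepage/enabled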

Network Busy Read Threshold

net.core.busy_read=50

This parameter sets the approximate time, in microseconds, that a socket read will busy poll the device queue for new packets instead of sleeping until an interrupt arrives. Spending a little CPU time on polling keeps read latency low during high-demand periods.

Network Busy Poll Threshold

net.core.busy_poll=50

This setting applies the same busy-polling behavior to poll() and select(), letting the kernel spin for up to 50 microseconds waiting for new packets rather than sleeping. This keeps the network stack responsive when subjected to a surge in requests.
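
Both busy-polling values can be verified at runtime with sysctl (a quick check; 50 is the value set by the profile above):

sysctl net.core.busy_read net.core.busy_poll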

NUMA Balancing

kernel.numa_balancing=0

Disabling automatic Non-Uniform Memory Access (NUMA) balancing stops the kernel from periodically scanning and migrating memory pages between NUMA nodes, which avoids migration overhead and latency spikes for long-running storage processes.
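
To inspect the NUMA layout of a host and confirm that automatic balancing is off, the following commands can be used (numactl is assumed to be installed; it is not part of the profile):

# Show NUMA nodes, their CPUs, and memory
numactl --hardware
# 0 means automatic NUMA balancing is disabled
cat /proc/sys/kernel/numa_balancing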

Hung Task Timeout

kernel.hung_task_timeout_secs=600

This parameter controls how long a task can remain unresponsive before the kernel flags it as hung. Raising the timeout from 120 to 600 seconds avoids spurious hung-task warnings during long I/O operations under heavy load.

NMI Watchdog

kernel.nmi_watchdog=0

By deactivating the Non-Maskable Interrupt (NMI) watchdog, the system avoids the periodic hard-lockup detection interrupts, reducing interrupt noise and letting the CPUs focus on the workload.

Virtual Memory Statistics Interval

vm.stat_interval=10

The virtual memory statistics update interval is raised from 1 to 10 seconds, so the kernel refreshes its memory counters less frequently. This reduces periodic bookkeeping overhead at the cost of slightly less granular memory statistics.

Timer Migration

kernel.timer_migration=0

The disabling of timer migration ensures that timers remain on their original CPU cores, promoting predictability and minimizing delays within the system.

ARP Cache

net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768

The ARP cache keeps a list of ARP entries that are generated when an IP address is resolved to a MAC address. At this scale, the default cache cannot hold all the entries, so we need to increase the ARP cache size to avoid errors such as the following:

net_ratelimit: 1947 callbacks suppressed
neighbour: arp_cache: neighbor table overflow!

By increasing the values of these parameters, the ARP cache can accommodate a larger number of entries before triggering the garbage collector. This can improve the efficiency of ARP cache management, reduce unnecessary overhead, and enhance network performance in scenarios where there are frequent changes or large numbers of ARP entries due to the network scale and the number of nodes accessing the hosts.
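
To see how close a host is to these thresholds, the current number of neighbor entries can be compared with the configured limits (a monitoring sketch, not part of the original profile):

# Count the current IPv4 neighbor (ARP) entries
ip -4 neigh show | wc -l
# Show the active garbage-collection thresholds
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3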

ARP Flux

net.ipv4.conf.all.arp_filter=1 
net.ipv4.conf.all.arp_ignore=1 
net.ipv4.conf.all.arp_announce=1

Any Linux host that has multiple network interfaces on the same subnet might be affected by ARP flux, which occurs when a host replies to ARP requests on behalf of any of its interfaces on that subnet. This behavior is not necessarily a problem; however, in some cases, ARP flux can cause applications to misbehave due to incorrect mappings between IPv4 addresses and MAC addresses.

In addition, these settings help optimize network operations and ensure efficient communication in such an environment. By enabling ARP filtering, ignoring ARP requests from unknown sources, and announcing a consistent source IP address, we enhance network performance, reduce unnecessary network traffic, and promote better load balancing and routing in a setup where thousands of IP addresses access a limited number of RHCS hosts that provide their storage.

TCP Window Scaling

net.ipv4.tcp_window_scaling=1

The default RHEL network settings might not produce optimal throughput and latency for the large parallel workloads typical of a setup at this scale, so the Linux network stack and certain network devices need to be tuned for better parallel job performance.

For better use of high-bandwidth networks, a larger TCP window size needs to be used. Therefore, we made sure that TCP window scaling is enabled.
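
Whether window scaling is actually negotiated can be observed on live connections with ss, which reports the wscale factors agreed on by both peers (shown here as an illustration, not a step from the original setup):

# The wscale:<snd>,<rcv> field confirms that window scaling is in use
ss -ti state established | head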

TCP Handshake Tuning

net.ipv4.tcp_fastopen = 3

Setting net.ipv4.tcp_fastopen to 3 enables TCP Fast Open for both client and server applications. This reduces connection establishment time and improves performance for both outbound and inbound connections.
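
The value is a bitmask: 1 enables Fast Open for outgoing (client) connections, 2 enables it for listening (server) sockets, and 3 enables both. The setting can also be applied and checked at runtime, outside of TuneD (shown here for illustration):

# Enable TCP Fast Open for both clients and servers until the next boot
sysctl -w net.ipv4.tcp_fastopen=3
sysctl net.ipv4.tcp_fastopen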

Once the TuneD profile is initially configured, the next step is to tackle the buffer size settings for the RHCS network.


