With the growing adoption of Kubernetes, we see a trend towards using powerful nodes in clusters, and users increasingly need to raise the number of pods that can run on a single node.

In recent blog posts, we have already seen some suggestions on how to run a large number of pods on a Kubernetes-based cluster, so it's reasonable to ask: “If we can now run 500 pods per node, how many KubeVirt VMs (VMIs) can be created on such a node?” While running hundreds of VMs on a single node may not be practical for production today, it allows us to test the performance and scalability of KubeVirt using a small number of powerful nodes and putting pressure on the KubeVirt control plane. Such a density test allows us to do in-depth performance evaluation and possible optimization of the control plane.

Enabling the creation of hundreds of VMs on a KubeVirt node was not a straightforward procedure. This blog describes all the steps taken to create 400** VMIs per node on a vanilla Kubernetes cluster, showing the testing process, bug fixes and other changes needed to increase VMI density, and highlighting the current performance of the KubeVirt control plane.

** Our initial goal was to create 500 VMIs per node. However, our testbed became unstable due to CPU and memory limitations, and we determined that the safe margin in our system is up to 400 VMIs.

Goal 

The main objective of this project was to measure the performance of KubeVirt and verify that the KubeVirt control plane can handle a large number of requests. To produce such a load of requests, we define a density test that creates as many VMIs as possible on a specific node. Data plane analysis (the performance of the VM itself) was not part of this experiment.

 A bit of background on KubeVirt’s control plane 

KubeVirt is a virtual machine management add-on for Kubernetes, intended to allow users to run VMs alongside containers on their Kubernetes or OpenShift clusters.

KubeVirt extends Kubernetes by adding resource types for VMs and sets of VMs through the Kubernetes Custom Resource Definition (CRD) API. KubeVirt VMs run in regular Kubernetes pods, where they have standard pod storage and network access, and can be managed using standard Kubernetes tools such as kubectl.

The key KubeVirt components are shown in the Figure below:

 

The virt-api component provides the RESTful HTTP entry point (i.e., sub-resources for CRDs) to manage the virtual machines. Details on why KubeVirt provides an additional HTTPS CRD webhook API registered with Kubernetes are beyond the scope of this post. The virt-controller component is a Kubernetes controller that manages the lifecycle of VM and VMI objects within the Kubernetes cluster. It is responsible for creating a corresponding virt-launcher pod for each VMI object. In KubeVirt, all VMs run in a pod. The virt-launcher pod contains an instance of the libvirt daemon (i.e., libvirtd) in its container to launch the VM. Each Kubernetes worker node needs a single instance of virt-handler, which monitors updates to VMI objects. When the virt-launcher pod is scheduled to a node, the virt-handler on that node identifies the VMI and signals the creation of the corresponding libvirt domain (i.e., the virtual machine) using libvirtd inside the VMI's virt-launcher pod.

A few notes 

Our goal is to run many VMI objects in order to reach the required number of VMs per node. To create a large number of VMs, we need a small OS image that allocates as few resources as possible. KubeVirt ensures that enough memory is allocated to create a VMI, which later shows up as a memory overhead in the virt-launcher memory request. In our experiments, we found that we needed to request at least 10MB of RAM to avoid memory errors. To further minimize resource usage, we allocate only the minimum resources required to create a VM, as shown in the table below:

Operating System   CPU*   MEM (MB)   MEM Overhead   Disk (MB)
cirros             10     10         ~288           50

* 1 CPU = 1000

By allocating such a small CPU share, we prevented the VM from booting the operating system. We can safely skip the VM boot stage, since OS boot does not introduce any load on the KubeVirt control plane and we do not want to evaluate the performance of the VM data plane for now. The Kubernetes control plane performance and scalability tests take a similar approach: they focus only on the API requests involved in the pod creation workflow, and no real pod is created; the Kubemark module simulates the creation of the pod.

The key performance metric in our experiment is VMI creation time. We define this as the time from sending a request to create the object to when the VMI object is in the Running phase. This phase means that libvirt created the virtual machine domain and sent the command to start it. Recently, KubeVirt introduced a new metric that measures VMI phase transition latency:  

kubevirt_vmi_phase_transition_time_from_creation_seconds_bucket   

Therefore, we can use this metric to calculate the 95th percentile of the time a VMI takes to go from creation to the Running phase.
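To make the percentile calculation concrete, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets, similar in spirit to what Prometheus does for `_bucket` metrics. The bucket boundaries and counts below are hypothetical and are not taken from our measurements; the function name is ours.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, where the last upper bound is math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # estimate saturates at that bucket's upper bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket layout (seconds) with a largest finite bucket of 600 s.
small_run = [(60, 10), (120, 40), (300, 100), (600, 100), (math.inf, 100)]
large_run = [(60, 0), (120, 10), (300, 60), (600, 150), (math.inf, 200)]

p95_small = histogram_quantile(0.95, small_run)
p95_large = histogram_quantile(0.95, large_run)
print(f"p95 small run: {p95_small:.0f}s, p95 large run: {p95_large:.0f}s")
```

In the second data set more than 5% of the samples fall past the largest finite bucket, so the estimate cannot go above that bucket's upper bound no matter how slow the slowest VMIs actually were.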

Initial Testing 

As mentioned before, increasing the maximum number of pods per node in Kubernetes has been done before and is well documented. However, increasing the maximum number of VMIs per node in KubeVirt proved harder, for several reasons: hardcoded limits, timeouts and API queries-per-second (QPS) settings.

Initially, we noticed that it was not possible to create more than 110 VMs, even though we had increased the pod limit to 1022. Beyond the first 110, the remaining pods stayed in the Pending state. After some exploration, we found that this was because the virt-handler max-device parameter was hardcoded in KubeVirt (using Kubernetes' official default pod limit of 110). Since the number of devices does not directly affect VM performance, for simplicity we increased this value to 1000.

Even after increasing the max devices parameter, it was not possible to create more than 200 VMIs per node, because some VMI compute containers were failing. This container runs the virt-launcher daemon, and that daemon panicked with a qemu response timeout error. The virt-launcher already had a qemu-timeout flag to set how long to wait for responses from qemu. However, this setting was not exposed to the user when creating a VMI, since the virt-controller module is responsible for creating the virt-launcher. To solve this, we introduced the ability to set the virt-launcher qemu-timeout parameter via the virt-controller module. To change the timeout, it is now possible to patch the KubeVirt CR object directly, using the slightly hidden feature that patches the virt-controller.

For example, you can create the following kubevirt.yaml file to increase the qemu timeout to 15 minutes (900 seconds):

spec:
  customizeComponents:
    patches:
    - patch: '[{"op":"add","path":"/spec/template/spec/containers/0/command/-","value":"--launcher-qemu-timeout=900"}]'
      resourceName: virt-controller
      resourceType: Deployment
      type: json

and then execute the following command to patch the KubeVirt CR object:

kubectl patch --type=merge kubevirt kubevirt -n kubevirt --patch "$(cat kubevirt.yaml)" 
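Because the inner JSON patch is embedded as a string, quoting mistakes are easy to make when editing it by hand. The sketch below builds the same patch body programmatically for an arbitrary timeout. The helper name `qemu_timeout_patch` is hypothetical; the field names and the flag are the ones used in the kubevirt.yaml file above.

```python
import json


def qemu_timeout_patch(timeout_seconds):
    """Build the merge-patch body that adds --launcher-qemu-timeout."""
    json_patch = [{
        "op": "add",
        "path": "/spec/template/spec/containers/0/command/-",
        "value": f"--launcher-qemu-timeout={timeout_seconds}",
    }]
    return {
        "spec": {
            "customizeComponents": {
                "patches": [{
                    # The inner JSON patch is carried as a string value.
                    "patch": json.dumps(json_patch, separators=(",", ":")),
                    "resourceName": "virt-controller",
                    "resourceType": "Deployment",
                    "type": "json",
                }]
            }
        }
    }


if __name__ == "__main__":
    # 15 minutes expressed in seconds, as in the example above.
    print(json.dumps(qemu_timeout_patch(15 * 60), indent=2))
```

The printed document is JSON, which can be saved to a file and passed to the same kubectl patch command in place of the YAML version.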

After these settings, we were able to create 353 VMIs, but 147 failed with KillPodSandboxError: "Context timeout exceeded". After further investigation, this error appeared to be related to container runtime performance. It was very difficult to track down, as there was no information in the kubelet logs showing what was actually going on. However, we could see in the kernel messages (using the journalctl -b command) that the docker runtime was not able to create some containers. Consequently, we tried different runtimes: docker, containerd and cri-o. In our experiment, containerd performed better and fixed this issue.

400 VMIs/node Test Configuration and Infrastructure 

Creating a large number of pods and VMIs introduces considerable stress on the control plane and monitoring infrastructure. To handle this load, we use very large nodes for the control plane and monitoring infrastructure. In addition, we expected the monitoring database (Prometheus) to require a large amount of memory. To meet this requirement, we use dedicated worker nodes for Prometheus and Grafana and do not schedule any VMI on these nodes. This was done using explicit node selectors when creating the VMIs. We used IBM Cloud Bare Metal Servers as our underlying platform.

Cluster nodes 

To simplify the performance analysis, we created a cluster with identical nodes. This was also motivated by the cost, as we selected the most cost-effective node type that suits our needs. The node hardware type consists of:  

  • 48 CPUs, Intel Xeon 8260 (Cascade Lake)
  • 128GB memory
  • 10 Gbps network interface
  • 1TB disk

We disabled swap on all nodes, as required by the kubelet, and installed Kubernetes 1.21 using Kubespray. Although the objective of the experiment was to create VMIs on just one node, we created a cluster with 3 master nodes and 3 worker nodes so that we could deploy additional components and services without affecting the test.
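As a rough sanity check, the per-VMI requests from the table earlier can be combined with this node size. The sketch below is only an estimate under stated assumptions: the ~288 overhead is taken to be in MB, the pod request is taken to be guest memory plus overhead, and 128GB is treated as 128 × 1024 MB. Requests are also not the same thing as actual usage.

```python
NODE_MEMORY_MB = 128 * 1024   # assumption: 128GB treated as 128 * 1024 MB
NODE_CPU_MILLI = 48 * 1000    # 48 CPUs, with 1 CPU = 1000

VMI_MEMORY_MB = 10            # guest memory request from the table
VMI_OVERHEAD_MB = 288         # approximate overhead, assumed to be in MB
VMI_CPU_MILLI = 10            # CPU request from the table


def requested(vmis):
    """Return (memory_mb, cpu_milli) requested by the given number of VMIs."""
    memory_mb = vmis * (VMI_MEMORY_MB + VMI_OVERHEAD_MB)
    cpu_milli = vmis * VMI_CPU_MILLI
    return memory_mb, cpu_milli


for count in (400, 500):
    memory_mb, cpu_milli = requested(count)
    share = 100 * memory_mb / NODE_MEMORY_MB
    print(f"{count} VMIs: {memory_mb} MB requested ({share:.0f}% of node), "
          f"{cpu_milli / 1000:.0f} CPUs requested")
```

Under these assumptions 400 VMIs request most of the node's memory while 500 would request more than the node has, which is consistent with the memory limitation we hit when aiming for 500.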

Configuration changes 

Kubespray's default maximum number of pods per node is 110. On each worker node, we also need to run the pods required by the Kubernetes and KubeVirt control planes: about 6 pods for the Kubernetes control plane, 1 for monitoring and 3 for the KubeVirt control plane, or 10 extra pods in total. To ensure that we could create a large number of pods, we set the Kubespray kubelet_max_pods variable to 1022. This requires an additional configuration change, setting kube_network_node_prefix to 22, because each pod on a node needs a distinct IP address allocated outside the host IP range. The default prefix is 24 (i.e., a /24 subnet), which allows only 256 IPs per node, so we had to change it to 22. By default, when creating a cluster, Kubespray uses the Calico network plugin.
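The relationship between the node prefix and the pod limit is simple address arithmetic, sketched below with Python's standard ipaddress module. The base address 10.233.64.0 is only an illustrative value, and reading 1022 as "all addresses minus the first and last" is our interpretation of the number.

```python
import ipaddress


def pod_addresses(prefix_len, base="10.233.64.0"):
    """Return (total, usable) addresses in a per-node pod subnet."""
    network = ipaddress.ip_network(f"{base}/{prefix_len}")
    # hosts() skips the network and broadcast addresses.
    usable = sum(1 for _ in network.hosts())
    return network.num_addresses, usable


EXTRA_PODS = 10  # control plane and monitoring pods per worker node

for prefix in (24, 22):
    total, usable = pod_addresses(prefix)
    fits = usable >= 400 + EXTRA_PODS
    print(f"/{prefix}: {total} addresses, {usable} usable, "
          f"fits 400 VMIs + {EXTRA_PODS} pods: {fits}")
```

A /24 node subnet cannot hold 400 virt-launcher pods plus the extra pods, while a /22 leaves ample room, which is why both settings have to change together.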

During our initial experiments, we found that we could create more containers when using the containerd runtime, so we also chose to change Kubespray to configure the cluster with containerd.

Kubernetes API clients are usually shared between different controllers, including the KubeVirt controllers. These clients come with a token bucket rate limiter that supports configurable queries per second (QPS) and burst parameters. When a burst of API calls exceeds the limit, the calls are throttled so that a single controller or the kubelet itself does not congest the kube-apiserver. The challenge was to identify the lowest workable settings for kubeAPIQPS and kubeAPIBurst, whose defaults are 5 and 10, respectively. Previous experiments recommended values of 50/100 for kubeAPIQPS/kubeAPIBurst for a large scenario like ours. Following these guidelines, we increased the kubeAPIQPS and kubeAPIBurst of the KubeVirt controllers from 5/10 to 50/100 by patching the KubeVirt CR object.
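To illustrate why these two parameters matter, here is a simplified token bucket model with a simulated clock. It is a sketch of the general technique, not the actual Kubernetes client implementation, and the figure of 1000 back-to-back requests is a hypothetical workload.

```python
class TokenBucket:
    """Simplified token bucket driven by a simulated clock (seconds)."""

    def __init__(self, qps, burst):
        self.qps = float(qps)
        self.tokens = float(burst)  # the bucket starts full
        self.now = 0.0

    def acquire(self):
        """Take one token, waiting on the simulated clock if none is left."""
        if self.tokens < 1.0:
            # Wait just long enough for the missing fraction to refill.
            self.now += (1.0 - self.tokens) / self.qps
            self.tokens = 1.0
        self.tokens -= 1.0
        return self.now


def time_to_send(requests, qps, burst):
    """Simulated seconds needed to issue back-to-back requests."""
    bucket = TokenBucket(qps, burst)
    finished = 0.0
    for _ in range(requests):
        finished = bucket.acquire()
    return finished


default_time = time_to_send(1000, qps=5, burst=10)
tuned_time = time_to_send(1000, qps=50, burst=100)
print(f"5/10: {default_time:.0f}s, 50/100: {tuned_time:.0f}s")
```

In this model the burst value only covers the first few requests for free; after that the sustained rate is set by QPS, so a controller that has to issue many calls per VMI quickly becomes the bottleneck at the default settings.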

Test Results 

We use the PerfScale-Load-Generator tool to perform a burst density test by creating the necessary namespace and VMI objects.

PerfScale-Load-Generator is a simple tool that was written as part of this project to generate a specified number of VMI objects using a predefined template. The tool creates the objects and waits for all of them to enter the Running phase, then deletes all objects and waits for the deletion to complete. As the tool uses templates, the objects created do not necessarily need to be VMIs. The tool also lets us configure the object creation rate by adjusting the QPS and burst parameters of its requests to the Kubernetes API. In all tests, we set both parameters to 20.

We ran tests by creating 50, 100, 200, 300, and 400 VMIs on one node. To schedule all VMIs on the same node, we use the node selector parameter in the object model.
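As an illustration of what such an object might look like, the sketch below generates minimal VMI manifests pinned to one node. It is not the template used by PerfScale-Load-Generator: the helper name, the node name worker-1, the hostname label used as the node selector and the cirros container disk image are all assumptions made for the example, while the CPU and memory requests mirror the values from the table earlier.

```python
import json


def vmi_manifest(name, node_name,
                 image="quay.io/kubevirt/cirros-container-disk-demo"):
    """Build a minimal VirtualMachineInstance manifest pinned to one node."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachineInstance",
        "metadata": {"name": name},
        "spec": {
            # Schedule every VMI on the same node via a node selector.
            "nodeSelector": {"kubernetes.io/hostname": node_name},
            "domain": {
                # Minimal requests: 10 millicores of CPU and 10MB of RAM.
                "resources": {"requests": {"cpu": "10m", "memory": "10M"}},
                "devices": {
                    "disks": [{"name": "containerdisk",
                               "disk": {"bus": "virtio"}}],
                },
            },
            "volumes": [{"name": "containerdisk",
                         "containerDisk": {"image": image}}],
        },
    }


# Hypothetical node name; generate one manifest per VMI in the test.
manifests = [vmi_manifest(f"vmi-{i}", "worker-1") for i in range(400)]
print(json.dumps(manifests[0], indent=2))
```

Each manifest differs only in its name, which is what makes a template-driven generator a natural fit for this kind of density test.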

We collected a variety of information and data during testing, including Prometheus metrics that we visualized with Grafana dashboards to monitor control plane performance and cluster activity.

To prevent interference between test runs, we introduced a cooling interval of 1 hour between tests, giving Kubernetes and the Go garbage collector enough time to run and clean up the cluster state.

As expected, the 95th percentile of the time between sending the VMI create request and the VMI entering the Running phase increases with the number of VMs created. For example, when creating 100 VMIs, the creation time is ~3.33 minutes, while creating 200 VMIs triples the time to ~10 minutes. In the figure below, you can also see that creating more VMIs increases the time a VMI spends in the Scheduled phase. The Scheduled phase represents an internal KubeVirt state in which a VMI is waiting for the virt-launcher pod to get up and running; it is not related to the k8s scheduler. The Scheduling phase, on the other hand, is related to the k8s scheduler performance and increases only slightly when creating more VMIs, indicating a small overhead in the CRD scheduling process. Note that the 10-minute ceiling visible in the figure is due to an inherent limitation of the histogram range of the monitored metric, whose maximum range was 10 minutes.

 

An interesting observation from the VMI creation rate is that VMIs were created at about 48 per minute when creating 300 VMIs, and that rate did not change much when creating 400 VMIs. This suggests that we have reached an upper limit for this load, which also indirectly explains why we could not create 500 VMIs. The same behavior occurs with the container creation rate.

Final Considerations 

In this blog, we demonstrated how to configure Kubernetes and KubeVirt to run up to 400 VMIs on a single node. Large-scale testing typically requires a cluster with many nodes, which is difficult to create and expand. Enabling a very dense VM population on a single node allows us to achieve a similar test load using a small cluster with just a few nodes. As the test results show, this kind of test can highlight performance bottlenecks and can be used as the basis for a KubeVirt control plane scalability analysis. Anyone interested in creating more than 110 VMIs per node can follow the steps listed in this blog to properly configure a KubeVirt cluster to increase node density.

 Credits 

We would like to thank the many members of the KubeVirt and OpenShift Perf/Scale and IBM Research Haifa, Yorktown and Tokyo teams who worked with us on this, including Ashish Kamra, Fabian Deutsch, Roman Mohr, Nelson Mimura, Tatsuhiro Chiba, among others.

