Image composed of map tiles created by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL. Map tiles are, from top to bottom, Boston, MA, Raleigh, NC, and Phoenix, AZ
According to the OpenShift installation guide,
You can deploy an OpenShift Container Platform 4 cluster to both on-premise hardware and to cloud hosting services, but all of the machines in a cluster must be in the same datacenter or cloud hosting service.
Even within a single data center or cloud region, there are cases where an OpenShift cluster's nodes span multiple failure domains, whether those are power domains such as a subset of generator or UPS feeds in a data center, or a cloud provider's availability zones within a region. Kubernetes offers node selection and node affinity mechanisms that let applications span failure domains and keep running through a planned or unplanned outage.
Set up and label nodes
The test cluster for this demonstration includes six nodes, named wkr0, mcp0, wkr1, mcp1, wkr2, and mcp2. For illustration purposes only, we pretend these nodes span three separate data centers in three cities in two geographical regions:
$ oc label node wkr0 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=bos
$ oc label node mcp0 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=bos
$ oc label node wkr1 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=rdu
$ oc label node mcp1 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=rdu
$ oc label node wkr2 topology.kubernetes.io/region=us-mntn topology.kubernetes.io/zone=phx
$ oc label node mcp2 topology.kubernetes.io/region=us-mntn topology.kubernetes.io/zone=phx
Additionally, to demonstrate adding hardware hints to a subset of nodes, all wkr* nodes are labeled as having faster SSD storage available:
$ oc label nodes wkr0 wkr1 wkr2 disktype=ssd
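One quick way to verify the resulting layout is to display the labels as extra columns with the -L flag:
$ oc get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone -L disktype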
Assign virtual machines to nodes
Virtual machines in OpenShift follow similar node selection and affinity criteria to Pods, with one notable exception: Pods can select a node by name using nodeName, but this is not implemented for VirtualMachine resources. There are a number of use cases for nodeSelector and affinity rules; we will start with one of the simplest.
NodeSelector
The nodeSelector for a VirtualMachine belongs at the same level as the domain object, under the path spec.template.spec, as seen here:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: bos
[ remainder of VM omitted ]
This nodeSelector will cause the VM to require one of the nodes labeled with the zone bos. In this case, either wkr0 or mcp0 may run this VM.
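To double-check which nodes satisfy this selector, the same label can be used as a selector on oc get nodes; in this cluster that returns wkr0 and mcp0:
$ oc get nodes -l topology.kubernetes.io/zone=bos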
$ oc get vmi boston
NAME AGE PHASE IP NODENAME READY
boston 7m16s Running 10.129.2.107 wkr0 True
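Behind the scenes, the VM runs inside a virt-launcher Pod on that node. Assuming the kubevirt.io/domain label that KubeVirt applies to launcher Pods, it can be located with:
$ oc get pod -l kubevirt.io/domain=boston -o wide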
Should work need to be done on wkr0, start by cordoning and draining the node as outlined in the OpenShift "Understanding node rebooting" documentation:
$ oc adm cordon wkr0
node/wkr0 cordoned
$ oc adm drain wkr0 --ignore-daemonsets --delete-emptydir-data --force
node/wkr0 already cordoned
[ skipping updates of all the evicted pods ]
error when evicting pods/"virt-launcher-boston-pnbcd" -n "database" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
[ skipping repeats of above message ]
pod/virt-launcher-boston-pnbcd evicted
node/wkr0 drained
While the drain command runs, it outputs error messages showing that it fails to immediately evict the VM's virt-launcher Pod. Behind the scenes, the eviction request has triggered a VM migration, which we can see afterwards with:
$ oc get vmim
NAME PHASE VMI
kubevirt-evacuation-5xg8b Succeeded boston
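The eviction errors above come from a PodDisruptionBudget that OpenShift Virtualization maintains for live-migratable VMs; it can be listed in the VM's namespace (database, per the messages above) with:
$ oc get pdb -n database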
Next, check that the migrated VM landed on the other bos node, mcp0:
$ oc get vmi
NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 8m27s Running 10.131.0.34 mcp0 True
To return the node to service, use:
$ oc adm uncordon wkr0
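To confirm that no nodes remain cordoned, a field selector can be used; it returns nothing once every node is schedulable again:
$ oc get nodes --field-selector spec.unschedulable=true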
The nodeSelector field could also be used with the disktype or topology.kubernetes.io/region labels, or even with multiple labels at once, provided the logic required is AND:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: bos
        disktype: ssd
[ remainder of VM omitted ]
Only the node wkr0 satisfies both nodeSelector labels, so attempts to migrate the VM will result in a failed migration:
$ oc get vmi,vmim
NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 6m21s Running 10.129.3.182 wkr0 True
NAME PHASE VMI
virtualmachineinstancemigration.kubevirt.io/kubevirt-migrate-vm-k77r5 Failed boston
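Details about the failed attempt are recorded on the migration object and its events, which can be inspected with, for example:
$ oc describe vmim kubevirt-migrate-vm-k77r5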
Assign virtual machines to nodes using affinity rules
When more nuanced control is required, affinity rules come into play. Affinity rules fall into three categories: nodeAffinity, podAffinity, and podAntiAffinity. The first behaves much like the nodeSelector above, but with more options. All three categories further subdivide into preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution. "Ignored during execution" means these rules cannot affect the behavior of running VMs; in other words, changing a node's labels while VMs are running will not cause a migration. The difference between preferred and required determines whether the scheduler makes a best-effort attempt to schedule according to the weighted selectors (preferred), or requires all selectors to be true and fails to schedule the VM if that is impossible (required).
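For comparison, the required form of nodeAffinity takes a list of nodeSelectorTerms rather than weighted preferences. A minimal sketch that pins the VM to the bos zone, equivalent to the earlier nodeSelector, could look like this:
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - bos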
To see the preferred variant in action, we can adapt the nodeSelector from the failed migration above into preferredDuringSchedulingIgnoredDuringExecution rules, giving a weight of 75 to staying in Boston and a weight of 50 to having an SSD disk:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - bos
            weight: 75
          - preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
            weight: 50
[ remainder of VM omitted ]
As before, the VM schedules on wkr0, which is the only node that satisfies both conditions:
$ oc get vmi,vmim
NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 4m54s Running 10.129.3.188 wkr0 True
Now when the VM is migrated, it will allow itself to run on node mcp0, which, while it is in Boston, does not carry the SSD label:
$ oc get vmi,vmim
NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 11m Running 10.131.0.56 mcp0 True
NAME PHASE VMI
virtualmachineinstancemigration.kubevirt.io/kubevirt-migrate-vm-jqdl4 Succeeded boston
Pod affinity and anti-affinity
The podAffinity selector covers the case where a VM must be kept on the same node, or in the same availability zone or region, as a related service. An example might be a latency-sensitive front-end application that should run on the same node as its corresponding back-end service. The following Pod and VM definitions will always place the Pod on node mcp0 and allow the VM to migrate between wkr0 (preferred due to the disktype=ssd label there) and mcp0.
apiVersion: v1
kind: Pod
metadata:
  name: httpd
  labels:
    app: low-latency
spec:
  nodeName: mcp0
  containers:
  - name: httpd
    image: httpd
    imagePullPolicy: IfNotPresent
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: back-end
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - low-latency
            topologyKey: topology.kubernetes.io/zone
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
            weight: 50
[ remainder of VM omitted ]
An example of this running would look like the following:
$ oc get vmi
NAME AGE PHASE IP NODENAME READY
back-end 12m Running 10.129.3.15 wkr0 True
$ oc get pod httpd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd 1/1 Running 0 30m 10.131.0.60 mcp0 <none> <none>
Note that changes to the Pod do not trigger an effect in the VM. As an example, we delete the httpd Pod and recreate it on mcp2:
$ oc get po httpd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd 1/1 Running 0 2m9s 10.128.3.198 mcp2 <none> <none>
The back-end VM stays running where it was, but if we migrate it now, it must follow the Pod into the phx zone:
$ virtctl migrate back-end
VM back-end was scheduled to migrate
$ oc get vmi
NAME AGE PHASE IP NODENAME READY
back-end 18m Running 10.128.4.26 wkr2 True
Finally, consider the case where a clustered application like a database requires three cluster members and, for maximum protection, it is desired to keep them all in separate zones. Translated into an anti-affinity rule, that would look something like:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: db01
spec:
  template:
    metadata:
      labels:
        app: database
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
            topologyKey: topology.kubernetes.io/zone
A collection of VMs with the above anti-affinity rule and app: database labels will arrange themselves onto nodes in the bos, rdu, and phx zones:
$ oc get vmi
NAME AGE PHASE IP NODENAME READY
db01 58m Running 10.130.2.173 wkr1 True
db02 58m Running 10.128.4.27 wkr2 True
db03 57m Running 10.131.0.62 mcp0 True
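To confirm which zone each database VM landed in, the zone label of each node can be read back, for example with a jsonpath query (the dots inside the label key are escaped):
$ for n in wkr1 wkr2 mcp0; do echo -n "$n: "; oc get node $n -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'; done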
Caveats
As mentioned above, none of the affinity rules currently have any effect during execution. In other words, a running VM will continue running even if its affinity rules suggest it should migrate to another node. For both nodeSelector and affinity rules, it is not possible to alter the set of rules applied to a VirtualMachineInstance and then migrate it according to the new rules. Instead, a shutdown and restart of the VM is required to propagate the changes. For a single VM, this could mean some minutes of interruption; for a clustered application like a database, it still allows an admin to work around planned maintenance or unplanned emergencies without interrupting the clustered service. Work to update the KubeVirt API to allow propagating nodeSelector and affinity rule changes is scheduled for a future release and can be tracked here.
On the subject of future work, this blog was written against version 4.11 of the OpenShift Virtualization operator. In 4.12, an additional mechanism will be available to control the distribution of virtual machines across a cluster: topology spread constraints. The Kubernetes documentation explains how this works for Pods today.
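As a preview, and assuming the field carries over unchanged from the Pod spec to the VM template, a topology spread constraint that keeps the database VMs from the previous example evenly spread across zones might be sketched like this:
spec:
  template:
    metadata:
      labels:
        app: database
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: database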
Conclusion
Whether your goal is to avoid losing service during routine maintenance or to make sure certain VMs always have particular hardware available, node selection and affinity rules are the way to go.
For more documentation about virtual machines and node assignment, see the upstream documentation at the KubeVirt User Guide.