This post was written by Swati Sehgal, Alexey Perevalov, Killian Muldoon & Francesco Romani. This is Part 3; here are Part 1 and Part 2.

Topology Aware Scheduling is all about enhancing the hardware awareness of Kubernetes at the cluster control plane level. The hard details, like everything else related to the infrastructure underlying a cluster, are held by the kubelet.

Topology Aware Scheduling will change the way the kubelet exposes this information, to make hardware-aware extensions easier to develop and consistent with Kubernetes’ understanding of the infrastructure it manages. To do all this, we rely on the Pod Resources API.

The Kubernetes Pod Resources API is a kubelet API, introduced in Kubernetes 1.13, that enables monitoring applications to track the resource allocation of pods.

The service behind the API is implemented as a gRPC server listening on a Unix domain socket and returning information about the kubelet's assignment of devices to containers.

As originally introduced, the API offered a single call, List, to enumerate all the pods and learn about their resource assignments, which fits the proposed use case of monitoring applications.
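
To make this concrete, here is a minimal Go sketch of such a client: it connects to the kubelet's pod resources socket (the default path is assumed below and may vary per deployment) and calls List using the v1 bindings from k8s.io/kubelet/pkg/apis/podresources/v1.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod resources socket; adjust if your kubelet is configured differently.
const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The API is local-only, so plain (insecure) gRPC over the Unix socket is used.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	// List returns, per pod and container, the devices the kubelet has assigned.
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				fmt.Printf("%s/%s/%s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), cnt.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```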

A few months later, an effort to make scheduling more topology aware began.

For that, the scheduler needs to learn more detailed information about node resource availability and allocation.

Topology-aware scheduling aims to provide a framework for generic topology-aware resource allocation, but the first step toward solving this problem is working out the simpler case of NUMA-aware allocation. In this post we will use “topology zone” as an alias for the simpler and more familiar term “NUMA zone” (or “NUMA cell” or “NUMA node,” which we consider synonymous).

The kubelet is the source of truth with respect to the node resource state; thus, extracting the resource state information from the kubelet quickly emerged as the right approach.

To implement this resource reporting, a few approaches were discussed, including introducing a new API. But we realized it is possible to generalize the Pod Resources API to export all the information that topology-aware scheduling requires while remaining true to the spirit of the Pod Resources API itself.

To report enough data to enable topology-aware scheduling, we need to export more information than the current API does.

When listing pod resource allocations:

  • We need to attach topology information to all the allocated devices to properly account for resource availability in the topology zones of the node.
  • In case of exclusive CPU allocation, we need to expose which CPUs are exclusively allocated to containers in the pods. This is a bit of a special case, considering that all other resources that carry topology information fit into the more generic “resource” reporting (for example, devices and memory). A sketch of how a consumer could read both pieces of information follows this list.
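
As a rough sketch of what a consumer of the extended List response could do with these two pieces of information, the helper below (the function and package names are just placeholders) accounts allocated devices per NUMA zone via the per-device topology field and collects the exclusively allocated CPU ids, which are reported as a plain list per container:

```go
package example

import (
	"fmt"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// summarize walks a List response and accounts allocated devices per NUMA
// zone, plus the exclusively allocated CPUs (reported as plain CPU ids).
func summarize(resp *podresourcesv1.ListPodResourcesResponse) {
	devicesPerZone := map[int64]int{} // NUMA zone id -> allocated device count
	var exclusiveCPUs []int64

	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			// Exclusive CPUs are only populated for containers that
			// actually get pinned CPUs from the CPU manager.
			exclusiveCPUs = append(exclusiveCPUs, cnt.GetCpuIds()...)

			for _, dev := range cnt.GetDevices() {
				// Topology carries the NUMA zone(s) the device is affine to;
				// a device affine to several zones is counted in each.
				for _, node := range dev.GetTopology().GetNodes() {
					devicesPerZone[node.GetID()] += len(dev.GetDeviceIds())
				}
			}
		}
	}

	fmt.Println("allocated devices per NUMA zone:", devicesPerZone)
	fmt.Println("exclusively allocated CPUs:", exclusiveCPUs)
}
```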

To do proper placement, the scheduler needs to know the available resources per NUMA zone. The kubelet reports available resources at whole-node granularity, which is too coarse-grained.

To overcome this limitation, a new Pod Resources API endpoint was added: GetAllocatableResources.

This endpoint complements the List API and allows consumers (the scheduler) to track allocated and allocatable resources on a per-NUMA-zone basis.
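
Here is a hedged sketch of how a consumer might combine the two calls to estimate free devices per NUMA zone; the helper name is made up for illustration, and CPU accounting is deliberately omitted because mapping CPU ids to NUMA zones requires reading the machine topology separately (for example, from sysfs):

```go
package example

import (
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// freeDevicesPerZone combines GetAllocatableResources (everything the node
// can hand out) with List (what is currently assigned) to estimate how many
// devices remain free in each NUMA zone.
func freeDevicesPerZone(
	alloc *podresourcesv1.AllocatableResourcesResponse,
	used *podresourcesv1.ListPodResourcesResponse,
) map[int64]int {
	free := map[int64]int{}

	// Start from the allocatable device counts per zone...
	for _, dev := range alloc.GetDevices() {
		for _, node := range dev.GetTopology().GetNodes() {
			free[node.GetID()] += len(dev.GetDeviceIds())
		}
	}
	// ...and subtract what the kubelet has already assigned to containers.
	for _, pod := range used.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				for _, node := range dev.GetTopology().GetNodes() {
					free[node.GetID()] -= len(dev.GetDeviceIds())
				}
			}
		}
	}
	return free
}
```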

Even with these important additions, the API is still not ideal because both List and GetAllocatableResources require the monitoring application to poll the kubelet.

If the monitoring application's polling loop is too slow, the scheduler will likely work with stale information; on the other hand, if the monitoring application polls very frequently, it adds extra load on the kubelet and on the system in general.

To further improve this, another extension to Pod Resources API is being developed.

The idea is to add Watch endpoints, which will report a stream of events to the monitoring application when resource allocation changes (for example, when pods are created or deleted) or if the resource availability changes (for example, if new device plug-ins are added or deleted). This further extension is planned to be submitted during 2021.

The expected flow to consume these APIs is as follows.

Polling approach for applications that do not need to react quickly to allocation changes (a minimal sketch of this loop follows the list):

  1. Connect to the podresources endpoint
  2. Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
  3. Loop forever:
    1. Optionally, call GetAllocatableResources to fully reconcile the resource state. This is optional because for some applications, the initial GetAllocatableResources call and proper resource tracking using List may be sufficient.
    2. Call List to learn about the current resource allocation
    3. Perform the business logic comparing the available resources and the allocated resources
    4. Sleep as needed
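
A minimal Go sketch of this polling flow, assuming a client connected as shown earlier (the interval, logging, and comparison step are placeholders for the application's own business logic):

```go
package example

import (
	"context"
	"log"
	"time"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// pollLoop performs an initial assessment, then periodically refreshes the
// allocatable view and the current allocation, feeding both to the
// application's business logic.
func pollLoop(ctx context.Context, cli podresourcesv1.PodResourcesListerClient, interval time.Duration) error {
	// Initial resource assessment.
	alloc, err := cli.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
	if err != nil {
		return err
	}
	log.Printf("initial allocatable devices: %d", len(alloc.GetDevices()))

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Optionally refresh the allocatable view to fully reconcile state.
		alloc, err = cli.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
		if err != nil {
			return err
		}
		// Learn about the current resource allocation.
		used, err := cli.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
		if err != nil {
			return err
		}
		// Business logic: compare allocatable (alloc) against allocated (used),
		// e.g. with a per-zone helper like the one sketched earlier.
		log.Printf("allocatable devices: %d, pods with assignments: %d",
			len(alloc.GetDevices()), len(used.GetPodResources()))

		// Sleep as needed.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```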

Event-based approach for applications that do need to react quickly to allocation changes (a hypothetical sketch of consuming such a stream follows the list):

  1. Connect to the podresources endpoint
  2. Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
  3. Register by calling Watch (and optionally WatchAllocatable) to subscribe to allocation changes
  4. Wait forever:
    1. Both Watch endpoints will provide events reflecting changes in resource allocation or availability (for example, a new device plug-in being registered)
    2. The APIs provide enough data to intelligently and deliberately reconcile the information coming from Watch streams with what was provided from GetAllocatableResources and List.
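
Because the Watch endpoints are still a proposal, there are no generated bindings to show; the sketch below is purely hypothetical, with the event type and stream interface assumed only to illustrate how such a stream would typically be drained and reconciled:

```go
package example

import "io"

// HYPOTHETICAL: the Watch/WatchAllocatable endpoints are only proposed, so
// the event type and stream interface below are assumptions for this sketch;
// they mirror the usual shape of a gRPC server-streaming client.
type allocationEvent struct {
	Kind string // e.g. "pod-added", "pod-deleted", "device-plugin-registered"
}

type eventStream interface {
	Recv() (*allocationEvent, error)
}

// drain consumes events until the stream ends, handing each one to the
// application's reconcile logic, which merges it with the state learned
// from the initial GetAllocatableResources and List calls.
func drain(stream eventStream, reconcile func(*allocationEvent)) error {
	for {
		ev, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		reconcile(ev)
	}
}
```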

Should you want to learn more about the pod resources API and the changes proposed for topology-aware scheduling, you can start from the KEPs:

Other References: Survey of Resource Management in Kubernetes for Performance Critical Workloads