Not all traffic has the same priority, and when there is contention for bandwidth, there should be a mechanism for network appliances outside the OpenShift Container Platform (OCP) cluster to prioritize the traffic. To enable this, we use Quality of Service (QoS) Differentiated Services Code Point (DSCP), which classifies packets by setting a 6-bit field in the IP header (the upper six bits of the Type of Service byte), effectively marking the priority of a given packet relative to other packets as "Critical," "High Priority," "Best Effort," and so on.
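To get a feel for what this marking is on the wire: an individual application can set it directly by writing the ToS byte of its socket, since DSCP occupies the upper six bits of that byte (i.e., the DSCP value shifted left by 2). A minimal Python sketch using the standard socket API on Linux; the DSCP value 40 here is just an example:

```python
import socket

# DSCP occupies the upper 6 bits of the IP header's 8-bit ToS byte;
# the lower 2 bits are used for ECN. So ToS = DSCP << 2.
DSCP = 40

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP << 2)

# The kernel now stamps this socket's egress packets with ToS 0xa0 (DSCP 40).
print(hex(s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)))  # 0xa0
s.close()
```

This works per socket and per application; the point of EgressQoS is to apply such markings cluster-wide, at the CNI level, without modifying the applications themselves.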

Marking packets with DSCP as they head out allows a router to distinguish between them and determine, for example, which require higher bandwidth or higher priority and handle their requirements properly.

Starting with OCP 4.11 (where it is enabled by default for all customers), the OVN-Kubernetes Container Network Interface (CNI) introduces a new Developer Preview feature: EgressQoS, which enables a cluster administrator to mark pods' egress traffic with a valid QoS DSCP value. These markings can then be consumed and acted on by network appliances outside the OCP cluster to optimize traffic flow throughout their networks.

Configuring the router to handle DSCP markings is outside the scope of this post. Instead, we'll focus on how we can apply different markings to traffic coming from pods heading to an external destination using EgressQoS.

A simple user story example: As a cluster administrator, I pre-configured my router to handle the different DSCP values of incoming traffic (using colors for demonstration; in reality they are decimal values from 0 to 63), giving "green" traffic full priority, "yellow" traffic low priority, and "red" traffic best effort. I want egress traffic coming from different applications (pods) in a given namespace (namespace1) to be marked with different DSCP "colors" so my router can handle them properly and allow their requirements to be fulfilled.

In this post, we'll explore how such configuration is available in OCP clusters that use OVN-Kubernetes CNI as their network provider.

What is EgressQoS?

Starting with OCP 4.11, EgressQoS (Developer Preview) is a namespaced Custom Resource Definition (CRD) that enables marking pods' egress traffic with a valid QoS DSCP value. A namespace supports only one EgressQoS resource, which must be named default (other EgressQoS resources are ignored).

An EgressQoS resource allows specifying a list of QoS rules, each consisting of 3 fields:

  • dscp: DSCP value for matching egress traffic

  • dstCIDR (optional): Apply DSCP to traffic heading to this CIDR

  • podSelector (optional): Apply DSCP to traffic from pods whose labels match this selector

kind: EgressQoS
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
  namespace: default
spec:
  egress:
  - dscp: 30
    dstCIDR: 1.2.3.0/24
  - dscp: 42
    podSelector:
      matchLabels:
        app: example
  - dscp: 28

This example marks the packets originating from pods in the default namespace in the following way:

  • All traffic heading to an address that belongs to 1.2.3.0/24 is marked with DSCP 30.

  • Egress traffic from pods labeled app: example heading to any destination outside 1.2.3.0/24 is marked with DSCP 42.

  • All remaining egress traffic is marked with DSCP 28.

IMPORTANT: The priority of a rule is determined by its position in the egress array: earlier rules are evaluated before later ones, and the first match wins. In this example, if the rules were reversed, all traffic originating from pods in the default namespace would be marked with DSCP 28, regardless of its destination or the pods' labels. For that reason, specific rules should always come before general ones in the array.
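To make the first-match semantics concrete, here is a small Python sketch of the evaluation order. The mark helper and the dictionary-based rule format are illustrative stand-ins for demonstration, not part of the CRD API or the OVN-Kubernetes implementation:

```python
import ipaddress

# Illustrative model of EgressQoS evaluation: rules are checked in array
# order, and the first rule whose dstCIDR and podSelector both match
# determines the DSCP mark.
rules = [
    {"dscp": 30, "dstCIDR": "1.2.3.0/24"},
    {"dscp": 42, "podSelector": {"app": "example"}},
    {"dscp": 28},  # catch-all: no dstCIDR, no podSelector
]

def mark(dst_ip, pod_labels, rules):
    for rule in rules:
        cidr = rule.get("dstCIDR")
        if cidr and ipaddress.ip_address(dst_ip) not in ipaddress.ip_network(cidr):
            continue  # destination outside this rule's CIDR
        selector = rule.get("podSelector", {})
        if any(pod_labels.get(k) != v for k, v in selector.items()):
            continue  # pod labels don't match this rule's selector
        return rule["dscp"]
    return None  # no rule matched: traffic is left unmarked

print(mark("1.2.3.4", {}, rules))                  # 30
print(mark("8.8.8.8", {"app": "example"}, rules))  # 42
print(mark("8.8.8.8", {}, rules))                  # 28
print(mark("1.2.3.4", {}, list(reversed(rules))))  # 28: catch-all wins first
```

The last call shows why ordering matters: with the rules reversed, the catch-all rule shadows the more specific ones and everything gets DSCP 28.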

Usage Example

Following a similar example to the user story mentioned previously, we would like packets coming from the default namespace to be marked the following way:

  • All packets heading to 172.18.0.6/32 marked with DSCP 40.

  • All packets heading to 172.18.0.7/32 from pods labeled app: demo marked with DSCP 50.

To achieve that, we create the following EgressQoS resource in our OCP cluster:

apiVersion: k8s.ovn.org/v1
kind: EgressQoS
metadata:
  name: default
  namespace: default
spec:
  egress:
  - dscp: 40
    dstCIDR: 172.18.0.6/32
  - dscp: 50
    dstCIDR: 172.18.0.7/32
    podSelector:
      matchLabels:
        app: demo

Assuming these are the pods in the default namespace:
[Figure: the pods in the default namespace and their labels]

We can expect the traffic to be marked like:
[Figure: expected DSCP markings for traffic from each pod to each destination]

And, indeed, running tcpdump on each of the destinations and pinging them from the pods results in:

tcpdump on 172.18.0.6 host:

bash-5.0# tcpdump -i eth0 -v icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:40:06.238100 IP (tos 0xa0, ttl 62, id 23892, offset 0, flags [DF], proto ICMP (1), length 84)
    ovn-worker > a7acb5556708: ICMP echo request, id 7424, seq 0, length 64
10:40:08.280624 IP (tos 0xa0, ttl 62, id 42569, offset 0, flags [DF], proto ICMP (1), length 84)
    ovn-worker2 > a7acb5556708: ICMP echo request, id 6656, seq 0, length 64

tcpdump on 172.18.0.7 host:

bash-5.0# tcpdump -i eth0 -v icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:44:33.847400 IP (tos 0xc8, ttl 62, id 58984, offset 0, flags [DF], proto ICMP (1), length 84)
    ovn-worker > 90d8708e53a8: ICMP echo request, id 7680, seq 0, length 64
10:44:37.536332 IP (tos 0x0, ttl 62, id 33532, offset 0, flags [DF], proto ICMP (1), length 84)
    ovn-worker2 > 90d8708e53a8: ICMP echo request, id 6912, seq 0, length 64

DSCP is the upper six bits of the tos field. To get the DSCP value from the hexadecimal tos, convert it to decimal and shift 2 bits to the right (e.g., 0xc8 = 200; shifting 2 bits to the right gives 50). The packet showing tos 0x0 came from a pod that matched no rule and was therefore left unmarked.
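As a quick sanity check, the tos values captured above convert to exactly the DSCP marks we configured:

```python
# DSCP is the upper 6 bits of the ToS byte, so shift right by 2.
for tos in (0xa0, 0xc8, 0x0):
    print(f"tos {hex(tos)} ({tos}) -> dscp {tos >> 2}")
# tos 0xa0 (160) -> dscp 40
# tos 0xc8 (200) -> dscp 50
# tos 0x0 (0) -> dscp 0
```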

When a packet from a pod exits a node, its source IP is translated (SNAT) to the node's IP, which is why the packets here appear to come from our nodes rather than from the pods themselves.

Overall, from our tcpdump outputs we can see that we have reached the desired state.

Summary

In this post we saw how an OCP cluster running OVN-Kubernetes CNI can use QoS DSCP to mark selected pods’ egress traffic with a simple CRD. This allows routers and other network appliances that are connected to the cluster to prioritize packets from pods the same way they do for virtual machines (VMs) and bare-metal servers.