OpenShift is becoming the enterprise platform of choice for cloud-native software, implementing higher-level abstractions on top of the low-level Kubernetes primitives. As extension mechanisms like aggregated API servers, admission webhooks and custom resource definitions are adopted more widely to run custom workloads, additional stress is placed on the API server.
The API server is thus a critical component at risk. Custom controllers with unregulated traffic can cause cluster instability when high-level object access slows down critical low-level communication, leading to request failures, timeouts and API retry storms.
Hence, it is important that the API server knows how to prioritize traffic from all of its clients without starving critical control plane traffic.
Kubernetes API Priority and Fairness (APF) is a flow control mechanism that allows platform owners to define API-level policies to regulate inbound requests to the API server. It protects the API server from being overwhelmed by unexpectedly high request volumes, while shielding critical traffic from the throttling applied to best-effort workloads.
APF has been enabled in OpenShift since version 4.5. In this post, we will examine how OpenShift utilizes APF to protect the control plane. We will also go over some configuration, metrics and debugging endpoints that will help you make APF work for your OpenShift cluster.
What Is APF
Prior to APF, the API server used the --max-requests-inflight and --max-mutating-requests-inflight command-line flags to regulate the volume of inbound requests. The only distinction these flags can make is whether a request is mutating or not. They can't, for example, ensure that lower-priority traffic doesn't overwhelm critical traffic, as described in this issue.
By classifying requests into flows and priorities, APF manages and throttles all inbound requests in a prioritized and fair manner.
With APF enabled, all incoming requests are evaluated against a set of flow schemas. Every request will be matched with exactly one flow schema, which assigns the request to a priority level. When requests of a priority level are being throttled, requests of other priority levels remain unaffected.
To further enforce fairness among requests of a priority level, the matching flow schema associates requests with flows, where requests originating from the same source are assigned the same flow distinguisher.
Among the flows in a priority level, new requests are either served immediately, enqueued or rejected, depending on the priority level’s queue capacity, concurrent request limit, and total in-flight requests.
Requests are rejected if one of the following conditions is true:
- The priority level is configured to reject excessive requests
- The queues that the new requests will be assigned to are full
Requests are enqueued using shuffle sharding, a technique commonly used to isolate workloads to improve fault tolerance. When sufficient capacity becomes available, the requests will be dequeued using a fair queueing algorithm across the flows. Enqueued requests can also be rejected if the queue’s time limit expires.
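To make the shuffle sharding step more concrete, here is a minimal Go sketch of the idea, assuming a priority level with a fixed set of queues. This is not the actual API server implementation, which uses its own hashing and dealing algorithm; it only illustrates the concept:

package main

import (
    "fmt"
    "hash/fnv"
    "math/rand"
)

// pickQueue sketches shuffle sharding as used by APF: the flow distinguisher
// is hashed, the hash deterministically selects a "hand" of handSize distinct
// candidate queues, and the request lands in the least-loaded queue of that hand.
func pickQueue(flowDistinguisher string, queueLengths []int, handSize int) int {
    h := fnv.New64a()
    h.Write([]byte(flowDistinguisher))

    // Deterministically pick handSize distinct queue indices, seeded by the
    // flow distinguisher's hash, so the same flow always gets the same hand.
    rng := rand.New(rand.NewSource(int64(h.Sum64())))
    hand := rng.Perm(len(queueLengths))[:handSize]

    // Enqueue into the shortest queue of the hand.
    best := hand[0]
    for _, idx := range hand[1:] {
        if queueLengths[idx] < queueLengths[best] {
            best = idx
        }
    }
    return best
}

func main() {
    queueLengths := make([]int, 128) // e.g. queues: 128
    queueLengths[5] = 3              // pretend queue 5 already holds 3 waiting requests

    flow := "system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator"
    fmt.Println("request assigned to queue", pickQueue(flow, queueLengths, 6))
}

Because the hand is derived deterministically from the flow distinguisher, requests from the same flow always compete for the same small set of queues, while two different flows are unlikely to share every candidate queue.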
In subsequent sections, we will go over how to adjust and validate these queueing properties using the FlowSchema and PriorityLevelConfiguration resources.
Introducing OpenShift Flow Schemas
Let’s spin up an OpenShift 4.6.15 cluster with monitoring enabled using CodeReady Containers 1.22.0:
crc config set enable-cluster-monitoring true
crc start --memory=16096
Once the cluster is ready, use the oc CLI to login:
oc login -u kubeadmin -p [password] https://api.crc.testing:6443
You can retrieve your login credentials with:
crc console --credentials
The following is the list of OpenShift FlowSchema resources:
oc get flowschema | grep openshift
openshift-apiserver-sar exempt 2 ByUser 29d False
openshift-oauth-apiserver-sar exempt 2 ByUser 29d False
openshift-apiserver workload-high 1000 ByUser 29d False
openshift-controller-manager workload-high 1000 ByUser 29d False
openshift-oauth-apiserver workload-high 1000 ByUser 29d False
openshift-oauth-server workload-high 1000 ByUser 29d False
openshift-apiserver-operator openshift-control-plane-operators 2000 ByUser 29d False
openshift-authentication-operator openshift-control-plane-operators 2000 ByUser 29d False
openshift-etcd-operator openshift-control-plane-operators 2000 ByUser 29d False
openshift-kube-apiserver-operator openshift-control-plane-operators 2000 ByUser 29d False
openshift-monitoring-metrics workload-high 2000 ByUser 29d False
yq 4.3.1 is used to improve the readability of the YAML outputs of subsequent commands.
To help us better understand some important configuration settings, let's examine the spec of the openshift-apiserver-operator flow schema:
oc get flowschema openshift-apiserver-operator -oyaml | yq e .spec -
distinguisherMethod:
  type: ByUser
matchingPrecedence: 2000
priorityLevelConfiguration:
  name: openshift-control-plane-operators
rules:
- resourceRules:
  - apiGroups:
    - '*'
    clusterScope: true
    namespaces:
    - '*'
    resources:
    - '*'
    verbs:
    - '*'
  subjects:
  - kind: ServiceAccount
    serviceAccount:
      name: openshift-apiserver-operator
      namespace: openshift-apiserver-operator
The rules describe the list of criteria used to identify matching requests. The flow schema matches a request if and only if:
- at least one of its subjects matches the subject making the request and
- at least one of its resourceRules or nonResourceRules matches the verb and (non-)resource being requested
Essentially, this flow schema matches all requests issued by the openshift-apiserver-operator service account in the openshift-apiserver-operator namespace, for all namespace-scoped as well as cluster-scoped resources.
If we impersonate the openshift-apiserver-operator service account to issue a GET request listing all pods, the X-Kubernetes-Pf-Prioritylevel-Uid and X-Kubernetes-Pf-Flowschema-Uid response headers show that our request is mapped to the openshift-apiserver-operator flow schema and its priority level configuration, as expected:
SERVICE_ACCOUNT="system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator"
FLOW_SCHEMA_UID="$(oc get po -A --as "$SERVICE_ACCOUNT" -v8 2>&1 | grep -i X-Kubernetes-Pf-Flowschema-Uid | awk '{print $6}')"
PRIORITY_LEVEL_UID="$(oc get po -A --as "$SERVICE_ACCOUNT" -v8 2>&1 | grep -i X-Kubernetes-Pf-Prioritylevel-Uid | awk '{print $6}')"
CUSTOM_COLUMN="uid:{metadata.uid},name:{metadata.name}"
oc get flowschema -o custom-columns="$CUSTOM_COLUMN" | grep $FLOW_SCHEMA_UID
9a3bf863-d69f-470a-b119-df9bd3a709bd openshift-apiserver-operator
oc get prioritylevelconfiguration -o custom-columns="$CUSTOM_COLUMN" | grep $PRIORITY_LEVEL_UID
2cf49074-5360-44da-a259-2b051972daf0 openshift-control-plane-operators
Without the service account impersonation, the request issued by the same command is mapped to the global-default flow schema because it is bound to the OpenShift kubeadmin user.
This request mapping mechanism provides a granular way to assign requests from different origins to different flows, based on their flow distinguishers, so that they can’t starve each other.
The distinguisherMethod defines how the flow distinguishers are computed (a short illustrative fragment follows this list):
- ByUser, where requests originating from the same subject are grouped into the same flow so that different users can't overwhelm each other
- ByNamespace, where requests originating from the same namespace are grouped into the same flow so that workloads in one namespace can't overwhelm those in other namespaces
- An empty string, where all requests are grouped into a single flow
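For example, regulating traffic per namespace instead of per user is a matter of setting this one field. The fragment below is illustrative only and is not one of the flow schemas shipped with the cluster:

spec:
  distinguisherMethod:
    type: ByNamespace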
When matching requests, a flow schema with a lower matchingPrecedence takes precedence over one with a higher matchingPrecedence.
The priorityLevelConfiguration refers to the priority level configuration resource that specifies the flow control attributes.
Understanding Priority Level Configuration
The openshift-control-plane-operators priority level is used to regulate OpenShift operator requests to the API server. Let’s take a look at its .spec:
oc get prioritylevelconfiguration openshift-control-plane-operators -oyaml | yq e .spec -
limited:
  assuredConcurrencyShares: 10
  limitResponse:
    queuing:
      handSize: 6
      queueLengthLimit: 50
      queues: 128
    type: Queue
type: Limited
The limited.assuredConcurrencyShares (ACS) defines the concurrency shares used to calculate the assured concurrency value (ACV). The ACV of a priority level is the total number of concurrent requests that may be executing at a time. Its exact value depends on the API server's concurrency limit (SCL), which is divided among all priority levels in proportion to their ACS.
When APF is enabled, the SCL is the sum of the --max-requests-inflight and --max-mutating-requests-inflight options. In OpenShift 4.6, these options default to 3000 and 1000, respectively.
Using the formula presented in the Kubernetes documentation, we can calculate the ACV of the openshift-control-plane-operators priority level as follows:
ACV(l) = ceil(SCL * ACS(l) / (sum[priority levels k] ACS(k)))
= ceil((3000 + 1000) * 10 / (1 + 100 + 10 + 10 + 30 + 40 + 20))
= ceil(189.57)
= 190
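As a quick sanity check, here is a small Go sketch of the same arithmetic. The ACS values mirror those used in the calculation above, as observed on this 4.6 cluster; they may differ on yours:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Server concurrency limit: --max-requests-inflight + --max-mutating-requests-inflight.
    scl := 3000.0 + 1000.0

    // Assured concurrency shares per limited priority level (assumed defaults).
    acs := map[string]float64{
        "catch-all":                         1,
        "workload-low":                      100,
        "leader-election":                   10,
        "openshift-control-plane-operators": 10,
        "system":                            30,
        "workload-high":                     40,
        "global-default":                    20,
    }

    var total float64
    for _, shares := range acs {
        total += shares
    }

    // ACV(l) = ceil(SCL * ACS(l) / sum(ACS(k)))
    for level, shares := range acs {
        fmt.Printf("%-35s ACV=%v\n", level, math.Ceil(scl*shares/total))
    }
}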
We can use the apiserver_flowcontrol_request_concurrency_limit metric to confirm this value:
The Prometheus console is accessible at localhost:9090 via port-forwarding: oc -n openshift-monitoring port-forward svc/prometheus-operated 9090
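For example, a query along the following lines, filtering on the metric's priority_level label, should report 190 for our priority level:

apiserver_flowcontrol_request_concurrency_limit{priority_level="openshift-control-plane-operators"}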
Later, when a new custom priority level is added, the ACVs of all existing priority levels will decrease, because the SCL is shared across more priority levels.
The limited.limitResponse defines the strategy to handle requests that can’t be executed immediately. The type subproperty supports two values:
- Queue where excessive requests are queued
- Reject where excessive requests are dropped with an HTTP 429 error
With the Queue configuration, the queueing behavior can be further adjusted using the following subproperties:
- queueing.queues is the number of queues of a priority level
- queueing.queueLengthLimit is the number of requests allowed to be waiting in a queue
- queueing.handSize is the number of possible queues a request can be assigned to during enqueuing. The request will be added to the shortest queue in this list
The exact values to use for these properties depend on your use case.
For example, while increasing the number of queues reduces the rate of collisions between different flows (because there are more queues available), it increases memory usage. Increasing queueLengthLimit accommodates bursty traffic (as each queue can hold more requests), but at the cost of higher latency and memory usage. Since a request's hand of candidate queues is selected using its flow distinguisher, a larger handSize makes it less likely for individual flows to collide (because there are more queues to choose from), but more likely for a small number of flows to dominate the API server (as some queues become denser than others).
In the next section, we will create a custom flow schema and priority level configuration to regulate the traffic from a custom controller.
Configuring Custom Flow Schema and Priority Level
Let’s start by creating a demo namespace with 3 service accounts, namely, podlister-0, podlister-1 and podlister-2, with permissions to LIST and GET pods from the demo namespace:
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: demo
EOF
for i in {0..2}; do
cat <<EOF | oc auth reconcile -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: podlister
  namespace: demo
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: podlister
  namespace: demo
subjects:
- apiGroup: ""
  kind: ServiceAccount
  name: podlister-$i
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: podlister
EOF
done
for i in {0..2}; do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: podlister-$i
  namespace: demo
  labels:
    kubernetes.io/name: podlister-$i
EOF
done
Then we will create a custom flow schema and priority level configuration to regulate requests originating from these 3 service accounts:
cat <<EOF | oc apply -f -
apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: FlowSchema
metadata:
  name: restrict-pod-lister
spec:
  priorityLevelConfiguration:
    name: restrict-pod-lister
  distinguisherMethod:
    type: ByUser
  rules:
  - resourceRules:
    - apiGroups: [""]
      namespaces: ["demo"]
      resources: ["pods"]
      verbs: ["list", "get"]
    subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: podlister-0
        namespace: demo
    - kind: ServiceAccount
      serviceAccount:
        name: podlister-1
        namespace: demo
    - kind: ServiceAccount
      serviceAccount:
        name: podlister-2
        namespace: demo
---
apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: PriorityLevelConfiguration
metadata:
  name: restrict-pod-lister
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 5
    limitResponse:
      queuing:
        queues: 10
        queueLengthLimit: 20
        handSize: 4
      type: Queue
EOF
The restrict-pod-lister priority level has 10 queues, and each queue can hold a maximum of 20 requests. With its ACS set to 5, the total ACS across all priority levels becomes 216, so this priority level is assured about ceil(4000 * 5 / 216) = 93 concurrent requests.
The values used in the above queue configuration are provided for demonstration purposes only. You should adjust them to suit your use case.
Examining APF Metrics And Debugging Endpoints
Now we are ready to deploy our custom controller into the demo namespace, as 3 separate Deployment resources. Each Deployment uses one of the service accounts we created earlier:
for i in {0..2}; do
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podlister-$i
  namespace: demo
  labels:
    kubernetes.io/name: podlister-$i
spec:
  selector:
    matchLabels:
      kubernetes.io/name: podlister-$i
  template:
    metadata:
      labels:
        kubernetes.io/name: podlister-$i
    spec:
      serviceAccountName: podlister-$i
      containers:
      - name: podlister
        image: quay.io/isim/podlister
        imagePullPolicy: Always
        command:
        - /podlister
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SHOW_ERRORS_ONLY
          value: "true"
        - name: TARGET_NAMESPACE
          value: demo
        - name: TICK_INTERVAL
          value: 100ms
        resources:
          requests:
            cpu: 30m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 128Mi
EOF
done
The controller uses Go’s time.Tick() function to send continuous traffic to the LIST pod endpoint of the API server, to retrieve all the pods in the demo namespace. The source code can be found here.
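The exact implementation lives in the linked repository; the following is only a rough client-go sketch of what such a loop might look like, reusing some of the environment variable names from the Deployment above (everything else here is an assumption):

package main

import (
    "context"
    "log"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // In-cluster configuration, since the controller runs as a Deployment
    // using one of the podlister-N service accounts.
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    targetNamespace := os.Getenv("TARGET_NAMESPACE")
    interval, err := time.ParseDuration(os.Getenv("TICK_INTERVAL")) // e.g. 100ms
    if err != nil {
        log.Fatal(err)
    }

    // Optional client-side timeout, used later in this post to demonstrate
    // context deadline exceeded errors when requests wait too long in queue.
    timeout := 30 * time.Second
    if t := os.Getenv("CONTEXT_TIMEOUT"); t != "" {
        if parsed, err := time.ParseDuration(t); err == nil {
            timeout = parsed
        }
    }

    // Issue a LIST request to the API server on every tick.
    for range time.Tick(interval) {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        _, err := clientset.CoreV1().Pods(targetNamespace).List(ctx, metav1.ListOptions{})
        if err != nil {
            log.Printf("error while listing pods: %v", err)
        }
        cancel()
    }
}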
Switching over to the Prometheus console, let’s use the apiserver_flowcontrol_dispatched_requests_total metric to retrieve the total number of requests matched by our flow schema:
apiserver_flowcontrol_dispatched_requests_total{job="apiserver",flowSchema="restrict-pod-lister"}
As expected of a counter vector, we observe an upward trend in the summation of its rates:
sum(rate(apiserver_flowcontrol_dispatched_requests_total{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (flowSchema)
The apiserver_flowcontrol_current_inqueue_requests metric shows the number of requests waiting in the queues. The 0 value indicates that our queues are currently empty:
apiserver_flowcontrol_current_inqueue_requests{job="apiserver",flowSchema="restrict-pod-lister"}
More importantly, the number of rejected requests is also 0, as shown by the apiserver_flowcontrol_rejected_requests_total metric:
apiserver_flowcontrol_rejected_requests_total{job="apiserver",flowSchema="restrict-pod-lister"}
The apiserver_flowcontrol_request_execution_seconds metric provides insights into how long it takes to execute requests in our priority level:
histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_execution_seconds_bucket{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (le,flowSchema))
In this particular test run, the p99 of the request execution time in our queues is around 16 milliseconds.
Similarly, the apiserver_flowcontrol_request_wait_duration_seconds metric shows how long requests spend waiting inside the queues:
histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_wait_duration_seconds_bucket{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (le,flowSchema))
The p99 of the request wait duration of this test run is around 4.95 milliseconds.
We will revisit these two metrics later to see how they can affect our client-side context timeout.
Let's add more replicas to increase the traffic volume and trigger the queueing effect:
for i in {0..2}; do oc -n demo scale deploy/podlister-$i --replicas=10; done
When our queues are saturated, the number of rejected requests starts to increase. The reason label tells us why these requests are being rejected (i.e. queue-full, timeout or concurrency-limit):
sum(rate(apiserver_flowcontrol_rejected_requests_total{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (flowSchema,reason)
As the API server responds with HTTP 504 (request timed out) errors, these error messages can be seen in the controller’s logs:
oc -n demo logs deploy/podlister-0 | grep -i error
2021/02/11 04:32:39 error while listing pods: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
2021/02/11 04:33:39 error while listing pods: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
In cases where the API server responds with an HTTP 429 (too many requests) error, the controller will see this error message instead:
the server has received too many requests and has asked us to try again later
The p99 of the request in-queue wait duration now ranges between 1.5 and 3.5 seconds:
histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_wait_duration_seconds_bucket{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (le,flowSchema))
The p99 request execution time now lies between 1.5 and 6.0 seconds:
histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_execution_seconds_bucket{job="apiserver",flowSchema="restrict-pod-lister"}[15m])) by (le,flowSchema))
In addition to metrics, APF also exposes debugging endpoints that can provide further insights into the conditions of the queues.
The /debug/api_priority_and_fairness/dump_priority_levels endpoint tells us the total number of executing and waiting requests in our priority level:
oc get --raw /debug/api_priority_and_fairness/dump_priority_levels
PriorityLevelName, ActiveQueues, IsIdle, IsQuiescing, WaitingRequests, ExecutingRequests
workload-high, 0, true, false, 0, 0
exempt, <none>, <none>, <none>, <none>, <none>
openshift-control-plane-operators, 0, false, false, 0, 2
global-default, 0, false, false, 0, 1
system, 0, true, false, 0, 0
restrict-pod-lister, 8, false, false, 155, 93
leader-election, 0, true, false, 0, 0
workload-low, 0, false, false, 0, 2
catch-all, 0, true, false, 0, 0
At the time of this particular test run, there were 155 waiting requests and 93 executing requests in the restrict-pod-lister priority level.
The /debug/api_priority_and_fairness/dump_queues endpoint can provide further visibility into the condition of every queue in our flow schema:
oc get --raw /debug/api_priority_and_fairness/dump_queues | grep -i restrict-pod-lister
PriorityLevelName, Index, PendingRequests, ExecutingRequests, VirtualStart
restrict-pod-lister, 0, 19, 12, 25217.6231
restrict-pod-lister, 1, 18, 10, 25251.8502
restrict-pod-lister, 2, 20, 11, 25213.0914
restrict-pod-lister, 3, 19, 11, 25229.0108
restrict-pod-lister, 4, 18, 12, 25207.1798
restrict-pod-lister, 5, 19, 12, 25213.2181
restrict-pod-lister, 6, 0, 0, 0.0000
restrict-pod-lister, 7, 19, 11, 25205.3927
restrict-pod-lister, 8, 0, 0, 0.0000
restrict-pod-lister, 9, 19, 14, 25232.5364
...
Finally, the /debug/api_priority_and_fairness/dump_requests endpoint allows us to identify which queue the request is assigned to:
oc get --raw /debug/api_priority_and_fairness/dump_requests
PriorityLevelName, FlowSchemaName, QueueIndex, RequestIndexInQueue, FlowDistingsher, ArriveTime
restrict-pod-lister, restrict-pod-lister, 0, 0, system:serviceaccount:demo:podlister-1, 2021-02-11T05:13:59.874733557Z
restrict-pod-lister, restrict-pod-lister, 0, 1, system:serviceaccount:demo:podlister-1, 2021-02-11T05:13:59.880309335Z
restrict-pod-lister, restrict-pod-lister, 0, 2, system:serviceaccount:demo:podlister-1, 2021-02-11T05:13:59.881055726Z
…
restrict-pod-lister, restrict-pod-lister, 1, 13, system:serviceaccount:demo:podlister-0, 2021-02-11T05:14:01.645786117Z
restrict-pod-lister, restrict-pod-lister, 1, 14, system:serviceaccount:demo:podlister-0, 2021-02-11T05:14:01.825985532Z
restrict-pod-lister, restrict-pod-lister, 1, 15, system:serviceaccount:demo:podlister-0, 2021-02-11T05:14:01.899721291Z
restrict-pod-lister, restrict-pod-lister, 1, 16, system:serviceaccount:demo:podlister-1, 2021-02-11T05:14:02.167530293Z
restrict-pod-lister, restrict-pod-lister, 1, 17, system:serviceaccount:demo:podlister-1, 2021-02-11T05:14:02.183224599Z
...
restrict-pod-lister, restrict-pod-lister, 3, 0, system:serviceaccount:demo:podlister-2, 2021-02-11T05:14:01.051811112Z
restrict-pod-lister, restrict-pod-lister, 3, 1, system:serviceaccount:demo:podlister-2, 2021-02-11T05:14:01.053504144Z
restrict-pod-lister, restrict-pod-lister, 3, 2, system:serviceaccount:demo:podlister-2, 2021-02-11T05:14:01.0833556Z
...
While all of this is happening, the OpenShift cluster operators remain healthy and unaffected:
CUSTOM_COLUMNS="Name:.metadata.name,AVAILABLE:.status.conditions[?(@.type=='Available')].status,DEGRADED:.status.conditions[?(@.type=='Degraded')].status"
oc get clusteroperator -o custom-columns="$CUSTOM_COLUMNS"
Name AVAILABLE DEGRADED
authentication True False
cloud-credential True False
cluster-autoscaler True False
config-operator True False
console True False
csi-snapshot-controller True False
dns True False
etcd True False
image-registry True False
ingress True False
insights True False
kube-apiserver True False
kube-controller-manager True False
kube-scheduler True False
kube-storage-version-migrator True False
machine-api True False
machine-approver True False
machine-config True False
marketplace True False
monitoring True False
network True False
node-tuning True False
openshift-apiserver True False
openshift-controller-manager True False
openshift-samples True False
operator-lifecycle-manager True False
operator-lifecycle-manager-catalog True False
operator-lifecycle-manager-packageserver True False
service-ca True False
storage True False
If we update the controllers with a context timeout that is less than the in-queue wait duration, we will start seeing some client-side context deadline exceeded errors in the logs:
oc -n demo set env deploy CONTEXT_TIMEOUT=1s --all
oc -n demo logs deploy/podlister-0 | grep -i "context deadline"
2021/02/11 05:22:35 error while listing pods: Get "https://172.25.0.1:443/api/v1/namespaces/demo/pods": context deadline exceeded
2021/02/11 05:22:36 error while listing pods: Get "https://172.25.0.1:443/api/v1/namespaces/demo/pods": context deadline exceeded
2021/02/11 05:22:36 error while listing pods: Get "https://172.25.0.1:443/api/v1/namespaces/demo/pods": context deadline exceeded
So if you start seeing many context deadline exceeded errors in your controllers’ logs, you now know how to use the APF metrics, debugging endpoints and error logs to determine if APF is throttling your requests.
There are other APF metrics not covered in this post that you might find relevant to your use case. Check out the APF documentation for the full list.
Recovering From The Throttling Effect
If we scale the controllers down to 0 replicas, the number of rejected requests will gradually decrease, as the API server recovers from the throttling effect:
for i in {0..2}; do oc -n demo scale deploy/podlister-$i --replicas=0; done
Conclusion
In this post, we went over how to create and configure custom FlowSchema and PriorityLevelConfiguration resources to regulate inbound traffic to the API server.
We saw how APF queued and rejected the excess requests generated by our custom controller, and we used the various APF metrics and debugging endpoints to gain insight into the priority level queues. While all of this was happening, the OpenShift operators remained unaffected by the throttling.
We also looked at the scenario where our client-side context deadline timed out before the API server finished processing our requests, due to a long in-queue wait duration.
The FlowSchema and PriorityLevelConfiguration resources offer other options, such as rejecting excess traffic instead of queueing it and regulating inbound traffic by namespace instead of by user, all of which are left as an exercise for the reader.