Picture this scenario: your application has been running in Red Hat OpenShift for months when suddenly the container platform tells you that it can't spin up a new pod in the cluster due to resource constraints. Or picture another scenario where the application has come to a grinding halt due to high CPU utilization. Neither scenario is desirable, and to address these issues, we’ll need to have a conversation about rightsizing your application. This article covers how the OpenShift platform manages capacity and the author's considerations and recommendations for the administrator.
Disclaimer: The information contained in this article represents the author's views and opinions and does not necessarily represent good practices or guidance from Red Hat. Consider this article to contain unofficial tips and tricks rather than formally supported doctrine.
To start things off, I'd like to give this caveat: rightsizing your application is an art, not a science. There is no single approach to this topic, and much varies by customer needs and business requirements. In this article, I'll cover my thoughts on various topics along with hands-on practices drawn from how I've guided many of my Red Hat OpenShift customers. The approaches I've taken in this article may not work best for you, but should be considered as conversation starters.
Before we begin a conversation around rightsizing, we'll want to understand how the Kubernetes platform (and OpenShift, by extension) applies resource constraints at both the container and node level. For the purposes of this rightsizing discussion, we'll focus exclusively on CPU and memory, although there are others to consider as well. We’ll use the following diagram to illustrate some of the resource constraints on the node.
Figure 1: Node memory (or CPU) over time
Resource requests and limits can be specified for each pod and container. Requests are guaranteed resources set aside for pods, whereas limits are safeguards designed to protect the cluster infrastructure. The relationship between requests and limits on a pod is configured in Kubernetes as Quality of Service (QoS). On the node, the kubelet (an agent which monitors resources) passes this information to the container runtime which uses kernel cgroups to apply the resource constraints.
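To make this concrete, here is a minimal pod spec showing how requests and limits are declared per container. The pod name, image, and values are purely illustrative assumptions, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:              # guaranteed; used by the scheduler for placement
        cpu: 100m
        memory: 256Mi
      limits:                # ceiling enforced via kernel cgroups
        cpu: 500m
        memory: 512Mi
```

Because the requests here are lower than the limits, Kubernetes assigns this pod the Burstable QoS class; setting requests equal to limits for every container would yield Guaranteed, and omitting both entirely would yield BestEffort.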
To schedule a new pod, the Kubernetes scheduler determines valid placement on available nodes and takes existing pod resource constraints into account. OpenShift preconfigures system-reserved to set aside resources for OS and Kubernetes system components to use. The remaining amount is defined as allocatable and the scheduler treats this as the node’s capacity. The scheduler can schedule pods to the node’s capacity based on the aggregate resource requests of all pods. Note that the aggregate resource limits of all pods can be greater than node capacity, and the practice of doing so is known as overcommitting.
In managing our node capacity, there are two scenarios we are trying to avoid. In the first scenario, actual memory utilization reaches capacity and the kubelet triggers a node-pressure eviction based on eviction signals. If the node runs out of memory before the kubelet can reclaim memory, the kernel's OOM killer will respond, selecting pods to be killed based on an oom_score_adj value calculated from each pod's QoS. As a result, the applications comprising these pods are impacted.
The underlying mechanics of overcommitting CPU behave differently from memory in that available CPU time is distributed across containers. High CPU utilization results in CPU throttling, but it does not trigger node-pressure eviction or automatically cause Kubernetes to terminate pods. Note that CPU exhaustion may still cause application pods to degrade, fail their liveness probes, and restart anyway.
There is another scenario we look to avoid as well. At a node level, requests are guaranteed resources and must be less than capacity as the Kubernetes scheduler does not oversubscribe. If requests are significantly and consistently larger than actual resources used, the excess capacity essentially goes unused. While it may be desirable to reserve resources for peak processing times, the administrator should balance this with the recurring costs of running excess capacity that may not be needed. Configuring requests based on actual usage is a balancing act and risk management of the application should be taken into account.
A major focus for an OpenShift administrator is to abstract the infrastructure away from developers, who in turn can focus on developing applications. Administrators are tasked with managing and rightsizing cluster capacity, and OpenShift captures metrics on cluster utilization for administrator consumption both in web console dashboards and on the command line. OpenShift also provides administrators with the Machine API Operator to flexibly manage nodes and autoscaling capabilities on supported providers, and nodes can also be added or removed manually. For further reading on managing cluster capacity, an excellent blog series on How Full is My Cluster was written by a fellow Red Hatter.
In this blog, we will devote more attention to the interactions between administrators and developers. While administrators are able to manage cluster capacity themselves through OpenShift's built-in tools, there is a large piece of the rightsizing puzzle yet to be solved: the running applications. An application solving a particular problem can be written by different developers in different ways, resulting in different performance. Each application is unique, and there is no one-size-fits-all approach. Administrators have less control over a developer's application, and in large enterprises a single administration team may be hard pressed to reach out to numerous development teams. Thus, the administrator's focus should be to set guardrails that allow developers to rightsize their own applications.
To accomplish this, administrators can implement LimitRanges which provide developers with suggested sizing constraints for individual containers and pods. The following is an example of a LimitRange for the purposes of this discussion. Since each cluster and application has different business and risk requirements, your actual numbers will vary.
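A LimitRange along the following lines could serve as a starting point. The object name and every value below are illustrative assumptions only, chosen to match the points discussed later in this section:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits      # hypothetical name
spec:
  limits:
  - type: Pod
    min:                     # same floor and ceiling as containers (see below)
      cpu: 10m
      memory: 128Mi
    max:
      cpu: "2"
      memory: 4Gi
  - type: Container
    min:
      cpu: 10m
      memory: 128Mi
    max:
      cpu: "2"
      memory: 4Gi
    default:                 # limit inherited when a container declares none
      cpu: 500m
      memory: 1Gi
    defaultRequest:          # request inherited when a container declares none
      cpu: 100m
      memory: 512Mi
    maxLimitRequestRatio:    # allowed burst: limit may exceed request by this factor
      cpu: "10"
      memory: "1"
```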
Good practices for development in a containerized platform are to create microservice applications rather than large monolithic ones. To encourage microservice development, limits should be applied to constrain the maximum size of pods. A node’s physical capacity may dictate this maximum size as it should comfortably fit several of the largest pods. An analogy is a cup which holds rocks of varying sizes. If the largest rocks are placed first, then pebbles and grains of sand can fill in the gaps. However, depending on the size of the cup, if the pebbles and sand are placed first, the largest rock may not fit.
Let's continue to walk through the above LimitRange example. The minimum pod and container size is likely determined by the running application's requirements and is not as relevant for administrators to enforce. Developers are also encouraged to run one container per pod for simplicity (a notable exception is the use of sidecar containers, e.g. Red Hat Service Mesh, based on Istio). For this reason, the above example uses the same resource values for pods and containers.
Default requests and limits act as suggested values for developers. Workload resources (e.g. a DeploymentConfig or BuildConfig) that don't explicitly declare container sizes will inherit the default values, as will terminating pods (e.g. the deployer pod from a DeploymentConfig or the build pod from a BuildConfig). As a good practice, developers should explicitly define resource requests and limits in workload resources and not assume default values.
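For instance, a Deployment might declare container resources explicitly rather than relying on LimitRange defaults. The names and values below are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
      - name: app
        image: registry.example.com/app:1.0   # placeholder image
        resources:           # declared explicitly, not inherited from defaults
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 512Mi
```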
The maxLimitRequestRatio for CPU and memory are bursting guidelines for developers. In a development environment, a high CPU maxLimitRequestRatio works well when a prototype application is often running idle but requires reasonable on-demand resources when used. Developers may be working business hours only, coding offline in their own IDE, sporadically testing a single microservice, or testing a different phase of a CI/CD pipeline altogether. In contrast, if many end users simultaneously access the application throughout the day, you'll see a higher baseline utilization. This may be closer to your production environment and could warrant a lower maxLimitRequestRatio, perhaps even a 1:1 ratio of limits to requests. Since different utilization patterns across stages of the pipeline will result in different requests and limits, it is important to test with simulated workloads prior to production to rightsize pods.
Developers will use the maxLimitRequestRatio as a rightsizing guideline. The Kubernetes scheduler bases scheduling decisions on resource requests, so developers should configure resource requests to reflect actual usage. Then, based on their application's risk profile, developers will configure limits to adhere to the maxLimitRequestRatio. An administrator who sets maxLimitRequestRatio equal to 1 forces developers to configure requests equal to limits, which may be desirable in production to reduce risk and prioritize stability.
Earlier in this article, we compared memory with CPU and described how the two resources behave differently under load, with high memory utilization potentially resulting in pod eviction or in restarts from an Out Of Memory condition. As a result, it is better to err on the side of caution and configure a lower maxLimitRequestRatio for memory across environments to prevent application pod restarts. Additional considerations should be taken when configuring memory for OpenJDK pods. The JVM heap inside the container knows nothing about the container's requests and limits, yet resource constraints applied to the container will affect the JVM. The OpenShift documentation provides guidance and considerations for tuning OpenJDK-specific workloads.
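As one illustration, on recent container-aware OpenJDK builds the heap can be capped relative to the container's memory limit via a JVM flag, so the heap and the limit stay in proportion. The image name and values below are placeholders, not a recommendation:

```yaml
    containers:
    - name: java-app
      image: registry.example.com/java-app:1.0   # placeholder image
      env:
      - name: JAVA_TOOL_OPTIONS                  # read automatically by the JVM
        value: "-XX:MaxRAMPercentage=75.0"       # cap heap at 75% of container memory
      resources:
        requests:
          memory: 1Gi
        limits:
          memory: 1Gi      # 1:1 memory ratio to reduce the risk of OOM kills
```

With a 1Gi limit, the heap tops out around 768Mi, leaving headroom for metaspace, threads, and native memory; the right percentage depends on the workload and should be validated under load.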
Administrators can also implement ResourceQuotas which provide capacity-based constraints on namespaces to guide developers in rightsizing their application based on forecasted estimates. The following is an example of a ResourceQuota (shortened to quota in this blog for brevity) for the purposes of this discussion.
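A representative ResourceQuota might look like the following. The name and values are illustrative assumptions, sized loosely as a "medium" application:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: medium-quota         # hypothetical "t-shirt size" name
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi    # a 1 vCPU : 4 GiB ratio to match typical node shapes
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "25"
```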
During the initial creation of an application namespace, the development team should work with the administrator to forecast their application sizing and apply an appropriate quota. An administrator should forecast application size based on the number of services, replicas, and estimated size of pods. For simplicity in managing numerous namespaces, the administrator may consider a “t-shirt size” approach as a starting guideline, with small, medium, and large applications being given corresponding predetermined quotas.
Applications are promoted across various stages of a CI/CD pipeline, each in a different namespace with its own configured quota. In development and testing namespaces where performance and high availability are not concerns, applications should configure minimum sized pods and 1 pod replica per service to reduce usage. On the other hand, in a production namespace, larger pods and a minimum of 2 pod replicas per service should be used to handle higher volume and provide high availability. By stress and performance testing with simulated workloads in the CI/CD pipeline, developers can determine appropriate production pod sizes, replica counts, and quotas prior to production release.
An administrator should budget quota for future expansion and account for the application's usage pattern, peak volume, and configured pod or node autoscalers, if any. For example, additional quota may be allocated in a development namespace that is rapidly adding new microservices, in a performance testing namespace used to determine appropriate production pod sizes, or in a production namespace using pod autoscalers to adjust to peak volume. An administrator should provide sufficient quota overhead for these and other scenarios, while balancing risk to the infrastructure and protecting cluster capacity.
Both administrators and developers should expect to adjust quotas over time. Developers can reclaim quota without requiring an administrator’s assistance by reviewing each service and reducing pod requests or limits to match actual consumption. The OpenShift web console provides developers with metrics of actual CPU and memory consumption for deployments and running pods, and this blog describes how developers can use those metrics to determine pod constraints. If developers have taken these steps yet still require additional quota, then they should reach out to the administrator. Administrators should take a developer’s periodic request for quota as an opportunity to analyze the actual consumption against previously forecasted estimates, and confirm or adjust quota sizing and new forecasted estimates accordingly.
The rest of this section will describe some secondary considerations when sizing quotas. Node capacity should be considered when determining the ratio of quota for CPU and memory, so that both are utilized efficiently. As an example, an AWS EC2 instance of type m5.2xlarge has 8 vCPU and 32 GiB RAM. A cluster consisting of m5.2xlarge nodes can efficiently use both CPU and memory by assigning application quota in ratios of 1 vCPU for every 4 GiB RAM (not accounting for the node's system-reserved). If application workloads (e.g. CPU- or memory-intensive ones) do not match node sizes, a different node size may be considered.
Optionally, administrators may consider applying multi-project quotas, also known as ClusterResourceQuotas, for applications that are deployed across multiple namespaces. The following are some examples of where this may be an appropriate approach. Perhaps components of an application are logically separated and deployed across multiple namespaces. Perhaps an application team is given fixed capacity based on hardware purchased to be shared across multiple development stages of a CI/CD pipeline. Or perhaps a development team utilizing feature branches rapidly creates and destroys namespaces, and requires a pool of resources to use.
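A multi-project quota can be expressed as a ClusterResourceQuota that selects projects by label or annotation. The selector and values below are hypothetical:

```yaml
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: team-a-quota         # hypothetical name
spec:
  selector:
    annotations:
      openshift.io/requester: team-a-lead   # all projects requested by this user
  quota:
    hard:                    # shared across every matching namespace
      requests.cpu: "8"
      requests.memory: 32Gi
      pods: "50"
```

Because the hard limits are shared across all selected namespaces, this fits the feature-branch scenario above: namespaces can be created and destroyed freely while drawing from one fixed pool.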
When to and when not to apply CPU limits for quota has been debated amongst administrators, and here we'll provide considerations to take into account rather than formal guidance. One article that is a great read on this topic covers resource limits and compressibility. As we've covered previously, CPU starvation of a pod leads to throttling but not necessarily pod termination. CPU limits for quota should not be set if an administrator prefers to overcommit and make use of all available CPU on a node. Conversely, CPU limits for quota should be set to reduce overcommitting and the risk to application performance, which may in turn be a business and cost decision rather than a technical one. A development environment may tolerate higher risk and more unpredictable performance than a production environment, and thus an administrator may consider applying CPU limits for production but not development.
Finally, there are some scenarios where applying quotas would not be advised. The purpose of applying quotas is for the administrator to gain some control over capacity planning of custom-developed applications. Quotas should not be applied to OpenShift infrastructure projects, as these require preconfigured amounts of resources that are tested and supported by Red Hat. For similar reasons, quotas should also not be applied to commercial off-the-shelf (COTS) applications provided by third-party vendors.
In this blog, we have covered how the Kubernetes platform protects the infrastructure through resource constraints, and provided rightsizing considerations in applying the guardrails of limits and quotas to application namespaces. As it was mentioned at the beginning of this article, rightsizing is an art, not a science. Each application’s appetite for risk and OpenShift cluster’s capacity provide unique constraints that an administrator must navigate, and it is my hope that this blog has provided many of these considerations for thought.