With the increasing adoption of and reliance on digital technology and microservices architectures, the uptime of an application has never been more important. Downtime of even a few minutes can lead to huge revenue loss and, most importantly, lost trust. This is exactly why we proactively focus on identifying bottlenecks and improving the resilience and performance of OpenShift under chaotic conditions, instead of reacting to issues only after encountering them in production. We built, leverage and actively maintain a tool called Krkn to aid with this chaos testing.
In this blog post, we will:
- Walk through why and how chaos testing can help with hardening not only the platform but also the applications deployed on top of it.
- Share results from chaos testing Single Node OpenShift to emphasize the importance.
- Share the Chaos Testing Guide to help achieve confidence in your environment.
Why chaos testing?
Here are some of the assumptions which might not hold true in a production environment:
- The network is reliable, homogeneous and secure.
- Consistent resource usage with no spikes.
- Processes do not go rogue hogging CPU/Memory/IO.
- Processes recover quickly at high load during disruptions.
- There is zero latency. Bandwidth is infinite.
- Topology never changes.
- All shared resources are available from all places.
Any number of things can go wrong in a production environment, whether related to the platform or to the applications running on it. Assumptions like these have led to a number of production outages in the past. The affected services suffered from poor performance or were inaccessible to customers, leading to missed Service Level Agreements, broken uptime promises, revenue loss and degraded reliability.
Let’s look at a couple of issues that chaos testing would have caught before they reached production, the fixes we applied, and how these issues would have impacted end users if left unaccounted for:
Control plane has to be sized taking into account failure conditions to avoid downtime
OpenShift is designed to be highly available, but there are many potential failure modes that can break a cluster and cause downtime. Cluster uptime is tightly coupled with the health of the control plane and of components like etcd, where introducing high latency can trigger leader elections. This can be disruptive to the cluster, depending on the load.
Disrupting one master node in a three-master cluster under a certain load can potentially take down the entire cluster if the control plane is not sized correctly. Here are a couple of questions one might have:
- Was the cluster running low on resources (CPU/memory/disk/network), causing the failure?
- No, the cluster had more than 60% resources available and was in perfect stable condition before stopping one of the masters.
- Shouldn’t the cluster still be able to handle the applications and user requests, since there are two other masters in an HA configuration?
- Ideally yes, but components including the control plane, kubelet and others behave entirely differently under failures at scale.
- Will we be able to detect this failure during the Performance/Scale or other testing in place?
- Most likely not. During testing, clusters are typically always on the happy path: nothing is disrupted, be it network saturation, disk saturation, injected latency or killed components and nodes. The components behave completely differently under chaotic conditions. We might uncover the issue if one of the operations during a run happens to be indirectly disruptive, but that would be a lucky find rather than an intentional one.
Recommendation to mitigate this issue:
On a large and dense cluster with three master or control plane nodes, CPU and memory usage will spike when one of the nodes is stopped, rebooted or fails. Failures can be due to unexpected issues with power, network or the underlying infrastructure, in addition to intentional cases such as restarting a cluster after shutting it down to save costs. The remaining two control plane nodes must absorb the load in order to stay highly available, which increases their resource usage. To avoid cascading failures, keep the overall CPU and memory usage on the control plane nodes to at most 60% of available capacity so they can handle the spike. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources. This recommendation is also part of the scalability and performance guide shipped with each OpenShift release.
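The arithmetic behind the 60% guidance can be sketched as follows. This is a back-of-the-envelope model (not official sizing tooling) that assumes the failed node's load redistributes evenly across the surviving control plane nodes:

```python
# Back-of-the-envelope control plane sizing check: if one of three masters
# fails, the surviving two must absorb its share of the load.

def post_failure_usage(per_node_usage: float, nodes: int = 3, failed: int = 1) -> float:
    """Estimated per-node usage (fraction of capacity) after `failed` nodes
    go down, assuming the load redistributes evenly across survivors."""
    total = per_node_usage * nodes
    return total / (nodes - failed)

def is_safe(per_node_usage: float) -> bool:
    # Keeping steady-state usage at or below 60% keeps the post-failure
    # spike under node capacity: 0.60 * 3/2 = 0.90.
    return post_failure_usage(per_node_usage) < 1.0

print(round(post_failure_usage(0.60), 2))  # 0.9  -> survivors run at ~90%
print(round(post_failure_usage(0.70), 2))  # 1.05 -> over capacity, failure risk
```

In reality the spike is not perfectly even (leader election, connection re-balancing and cache rebuilds add overhead), so treat 60% as the upper bound the document recommends, not a precise threshold.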
Prometheus/monitoring stack capacity planning under failure conditions
Monitoring is one of the key components of OpenShift for understanding the behavior of the cluster. Prometheus is a very memory-intensive application given its current design, and we found that a crash or restart of Prometheus leads to roughly 2.5 times its normal memory usage, because it replays the write-ahead log during the initialization phase. What does this mean for large clusters? On a large cluster, Prometheus scrapes metrics from hundreds or thousands of node exporters and thousands of other objects, so its memory usage can exceed 100 GB. A crash of Prometheus due to any external factor can then drive the Prometheus pod up to 220 GB of memory on the node it is running on. This has a very high potential for getting OOM killed, or even worse, hogging resources and causing other components to be OOM killed.
Again, we would have missed this data point without chaos testing in place. We might have uncovered it in another area of testing, but that would have been an unintentional finding rather than an intentional, proactive one.
Recommendation to mitigate this issue:
- Prometheus memory usage needs to be actively monitored, and the node hosting the monitoring stack needs to be sized for the resource usage spike during failures.
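A simple way to reason about this is to budget for the restart peak rather than the steady state. The sketch below treats the 2.5x write-ahead-log replay factor from our observations as an assumption to validate against your own cluster:

```python
# Capacity check sketch for Prometheus restarts. The 2.5x factor is the
# WAL-replay memory spike observed in our testing; validate it on your
# own cluster before relying on it.

WAL_REPLAY_FACTOR = 2.5

def restart_peak_gb(steady_state_gb: float) -> float:
    """Estimated peak memory during WAL replay after a crash/restart."""
    return steady_state_gb * WAL_REPLAY_FACTOR

def node_has_headroom(steady_state_gb: float,
                      node_allocatable_gb: float,
                      other_usage_gb: float = 0.0) -> bool:
    """True if the node can absorb the restart peak without OOM risk."""
    return restart_peak_gb(steady_state_gb) + other_usage_gb <= node_allocatable_gb

# A Prometheus steady at 88 GB can spike to 220 GB on restart:
print(restart_peak_gb(88))  # 220.0
```

In other words, a node sized only for steady-state Prometheus usage will be badly undersized the first time the pod restarts under load.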
Now that we understand the importance of Chaos testing, let’s take a look at how we improved the resilience of one of the OpenShift variants - Single Node OpenShift.
Single Node OpenShift
Given its general availability and how extensive its use cases are, especially in telco edge environments, it was critical to test and harden Single Node OpenShift. Here are a few of the improvements:
Handling API Server downtime gracefully
During the development of Single Node OpenShift, various components were undergoing leader elections during API downtime, whether from unintentional outages or intentional ones, i.e. upgrades, rollouts during config changes, certificate rotations, etc. This impacted a number of components including etcd, OVN/sdn-controller (the network plugin), kube-controller-manager, kube-scheduler, Monitoring, Machine-API and others. Here is how we addressed it:
- We started with a goal of tolerating 60 seconds of API downtime without triggering leader elections, since a rollout takes about that long after tweaking shutdown-delay-duration and gracefulTerminationDuration to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ) to bring down the termination+startup time.
- Components/cluster operators either have their own lease durations and leader election settings or use the defaults in library-go. We tweaked them to the following to tolerate at least 60 seconds of downtime without undergoing a leader election:
LeaseDuration=137s, Renew Deadline=107s, RetryPeriod=26s.
This gives us
1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s
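The four numbers above can be derived from the three tuned parameters. The sketch below shows the arithmetic; the formulas are our reading of the client-go renewal loop (renew attempts every RetryPeriod, give up at RenewDeadline, lease expires at LeaseDuration), so treat them as an illustration rather than the canonical derivation:

```python
# Arithmetic behind the tuned leader election parameters.
LEASE_DURATION = 137  # seconds a lease is valid before it can be taken over
RENEW_DEADLINE = 107  # seconds the leader keeps retrying renewal before giving up
RETRY_PERIOD = 26     # seconds between renewal/acquisition attempts

# 1. Clock skew tolerance: slack between the leader giving up and the
#    lease actually expiring from other nodes' point of view.
clock_skew_tolerance = LEASE_DURATION - RENEW_DEADLINE          # 30

# 2. API downtime tolerance: renewals happen on RetryPeriod ticks, so the
#    last attempt that can still beat the deadline lands at
#    floor(107 / 26) * 26 = 104s; worst case the outage starts just before
#    the first retry tick at 26s.
last_attempt = (RENEW_DEADLINE // RETRY_PERIOD) * RETRY_PERIOD  # 104
downtime_tolerance = last_attempt - RETRY_PERIOD                # 78

# 3. Worst non-graceful reacquisition: the old lease must fully expire,
#    then a candidate notices on its next retry tick.
worst_non_graceful = LEASE_DURATION + RETRY_PERIOD              # 163

# 4. Worst graceful reacquisition: the old leader released the lease, so a
#    candidate only waits for its next retry tick.
worst_graceful = RETRY_PERIOD                                   # 26

print(clock_skew_tolerance, downtime_tolerance, worst_non_graceful, worst_graceful)
# 30 78 163 26
```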
Improving the startup time of etcd to avoid extended API downtime
In a single-node setup, etcd downtime means the API goes down. It was very important to improve etcd's recovery time during a disruption, given that the API server stopped responding to requests for about 2 minutes during etcd container disruptions. The Krkn container disruption scenario was leveraged to reproduce this, and improvements were made on both the API server and etcd side to handle it.
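Measuring this kind of recovery time usually comes down to probing the API server's health endpoint (for example /readyz) once a second while the disruption runs, then finding the longest run of failed probes. The helper below shows only that window calculation so it stays self-contained and runnable; the probe loop itself, and the endpoint polled, are left as assumptions about your setup:

```python
# Find the longest API outage window from a series of health-probe results.
# In practice the samples would come from polling kube-apiserver /readyz
# every second while Krkn disrupts the etcd container.

def longest_outage(samples):
    """samples: list of (timestamp_seconds, ok) probe results, in order.
    Returns the longest span, in seconds, covered by consecutive failures."""
    longest, start = 0.0, None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts            # outage begins at first failed probe
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None          # outage ended; close the window
    if start is not None and samples:
        longest = max(longest, samples[-1][0] - start)  # still down at the end
    return longest

# One probe per second; the API is down from t=3 through t=7.
probes = [(t, not (3 <= t <= 7)) for t in range(10)]
print(longest_outage(probes))  # 5
```

Running this before and after a fix gives a concrete number (e.g. the ~2 minute gap above) to compare against.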
Chaos testing guide
Here is a guide that covers best practices for achieving the best performance, reliability and user experience with the platform, along with suggestions around test methodology: https://redhat-chaos.github.io/krkn/. It includes:
- Test Strategies and Methodology
- Best Practices
- Test Environment Recommendations - how and where to run chaos tests
- Chaos testing in Practice within the OpenShift Organization
We would love to hear your thoughts and stories from your experience running resilient OpenShift/Kubernetes clusters at scale. Feel free to reach out to us on GitHub: https://github.com/redhat-chaos/krkn. Of course, any feedback and contributions are greatly appreciated. Stay tuned for more stories and findings!
How-tos, Prometheus, Chaos Engineering, etcd, single node