Over the course of this year, I have been working with several development teams that started building applications on OpenShift. My goal was to provide the developers with guidance and best practices that would help them deploy their applications to production successfully. If you are a developer who builds applications on top of OpenShift, this blog might be of interest to you.

This blog includes two categories of best practices. The first category lists practices that increase application reliability; the second includes practices that improve security. Note that there is some overlap between the two categories: you will find reliability practices that, to some extent, improve security, and vice versa.

Update 09/13/2023: The links in this blog were updated to point to the latest version of OpenShift documentation.

Application Reliability

The following nine best practices increase application availability and uptime, and improve the overall user experience.

Recommendation Detail
1. Keep application configuration outside of the image. Container images that include environment-specific configuration cannot be promoted across environments (Dev, QA, Prod). To achieve a reliable release process, the same image that was tested in the lower environments should be deployed into production.

Keep the environment-specific configuration outside of the container image. For example, use ConfigMaps and Secrets to store the application configuration.
2. Specify the resource requests and resource limits in the pod definitions. Applications can run out of memory or suffer CPU starvation when their resource requirements are configured improperly. Specifying the requested memory and CPU resources allows the cluster to make proper scheduling decisions, so that the application has the requested resources available. Limits cap what each container may consume, so a misbehaving pod cannot starve others on the same node.
3. Always define liveness and readiness probes in the pod definitions. Health check probes allow the cluster to provide basic resiliency to your application. They allow the cluster to restart your application when the liveness probe fails, and to avoid routing traffic to it while the readiness probe reports that it is not ready to serve requests. See also Monitoring application health in the OCP documentation.
4. Protect the application with pod disruption budgets. There are situations where the application pods need to be evicted from the cluster node. For example, eviction is needed before the administrator can perform maintenance on the node, or before the cluster autoscaler can remove the node from the cluster while downscaling. To ensure that your application remains available when pods need to be evicted, you must define the respective PodDisruptionBudget objects.
5. Ensure that application pods terminate gracefully. On termination, an application pod should complete all in-flight requests and terminate existing connections gracefully. This allows for restarting the pod without end-users noticing, for example when a new version of the application is deployed.
6. Run one process per container. Avoid running multiple processes in a single container. Running each process in a separate container allows for better isolation of processes, avoids issues with signal routing, and avoids the need to reap zombie processes. See also Avoid multiple processes in the OCP documentation.
7. Implement application monitoring and alerting. Application monitoring and alerting are essential for keeping the application operating well in production and serving its business purpose. Use monitoring tools like Prometheus and Grafana to monitor your application.
8. Configure the applications to write their logs to stdout/stderr. OpenShift will collect those logs and send them to a centralized location (for example, ELK or Splunk). Application logs are an invaluable resource when analyzing production issues, and alerting based on their content helps ensure that the application is performing as expected.
9. Consider implementing the following resiliency measures:
* Circuit breakers
* Timeouts
* Retries
* Rate limiting
These resiliency measures help your application behave better when failures occur. Rate limiting and circuit breakers protect it from being overloaded, while timeouts and retries improve its behavior when it faces connectivity issues. Consider leveraging OpenShift Service Mesh, which implements these measures without the need for code changes in your application.
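
If you are not using a service mesh, two of the measures from item 9 can be approximated in application code. The following is a minimal Python sketch, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `retry`, and all default values, are assumptions made for illustration and do not come from OpenShift or any library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows a
    trial call once `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `func` up to `attempts` times, doubling the delay between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                   # retrying cannot help while the circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A caller might combine the two as `retry(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical function that performs the remote call with its own timeout (for example, the `timeout` argument of `urllib.request.urlopen`). Keeping the timeout on the call itself matters, because retries and circuit breakers only react once a call has actually failed.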

Application Security

This section includes 5 best practices that will improve the security of your application. I strongly recommend that you consider implementing all of these practices in your environment.

Recommendation Detail
10. Use trusted base container images. Use vendor-provided container images where possible. Vendor images are tested, hardened, and supported. If using community-supported images, use only the images provided by the communities that you trust. There are images of unknown origin available in public registries like Docker Hub. Do not use them!
11. Use the latest version of base container images. Only the latest versions of container images include all the available security fixes. Set up your CI pipeline to always pull the latest version of base images when building the application image. Also, set up your CI pipeline to rebuild the application when updated base images become available.
12. Use a separate build image and runtime image. Creating a separate runtime image with minimum dependencies reduces the attack surface and produces a smaller runtime image. The build image contains build dependencies that are required for building the application but are not required for running the application.
13. Stick to the restricted security context constraint where possible. Modify your container images to allow running under the restricted SCC. See also Support arbitrary user ids in the OCP documentation.

If an application is breached, the attacker may take control of it. Enforcing the use of the OpenShift restricted SCC provides the highest level of security and protects the cluster node from being compromised in that case.
14. Protect the communication between application components using TLS. Application components may communicate sensitive data that should be protected. Unless you consider the underlying OpenShift network to be secure, you may want to leverage TLS to protect the traffic between the application components. Consider leveraging OpenShift Service Mesh to offload the TLS management from the application.
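
When the application handles TLS itself rather than offloading it to a service mesh, the client side of item 14 could look like the following Python sketch. It uses only the standard `ssl` module. The helper name `make_client_context` and the `SERVICE_CA` path are assumptions for illustration; check where the CA bundle is actually mounted in your pods before relying on that path.

```python
import ssl

# Assumed location of the cluster's service CA bundle inside a pod.
# Verify this path in your own environment; it is not guaranteed here.
SERVICE_CA = "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"


def make_client_context(ca_file=None):
    """Build a client-side TLS context that verifies the server certificate.

    With ca_file=None the system trust store is used; pass SERVICE_CA (or
    another bundle) to trust certificates issued inside the cluster.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The resulting context can be passed to standard clients, for example `http.client.HTTPSConnection(host, context=make_client_context(SERVICE_CA))`, where `host` is the service name of another application component. The defaults from `ssl.create_default_context` keep certificate and hostname verification switched on, which is the point of the exercise; disabling either would leave the traffic encrypted but unauthenticated.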

Conclusion

In this blog, we reviewed 14 best practices that can help you build more reliable and secure applications on OpenShift. Developers can use this list to derive their own list of mandatory practices that must be followed by all the team members.

The list of practices presented in this blog is a good start. If you are interested in learning more, you can find another set of great recommendations in the section Creating images of the OpenShift documentation.