Recently a cluster I set up in an air-gapped environment underwent a penetration test (PT) and passed with flying colors! I thought I would take the opportunity to share the main items I focused on when setting up the cluster, as well as the day 2 operating procedures we implemented to make our cluster rock solid.
Note that the examples given in this article do not in any way reflect the actual customer systems.
Networking — Let's Start at the Bottom
The first line of defense is the network architecture. An attacker has to get to your system from somewhere, and the two main routes are the network or a user account in the domain (which I’ll cover in another section). The key principle is network isolation. You should have at least the following setup:
- Control Plane VLAN
- Worker VLAN
- Load Balancer VLAN
- Infra Node VLAN (2 to 3 worker nodes designated with a separate “infra” role that host the OCP Routers)
By separating these functionalities at a network level you allow for the following:
- Control of who can talk to whom on which ports.
- Ability to track communication between the different VLANs.
- Ability to lock down parts of the system if needed.
- Isolation of all incoming application traffic to the Infra Node VLAN.
Based on the size and setup of your cluster, you may want to separate Worker groups into separate VLANs as well.
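To make the "who can talk to whom on which ports" idea concrete, here is a minimal sketch of a flow allow-list between VLANs. The VLAN names and the specific flows are illustrative assumptions, and the list is deliberately incomplete (for example, it ignores the cluster overlay network); your real firewall rules should come from the OpenShift network requirements for your version.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str   # source VLAN (hypothetical name)
    dst: str   # destination VLAN (hypothetical name)
    port: int  # destination TCP port


# Illustrative, incomplete allow-list. Names and flows are assumptions.
ALLOWED_FLOWS = {
    Flow("load-balancer", "control-plane", 6443),   # Kubernetes API
    Flow("load-balancer", "control-plane", 22623),  # machine config server
    Flow("load-balancer", "infra", 443),            # HTTPS to the OCP routers
    Flow("load-balancer", "infra", 80),             # HTTP to the OCP routers
    Flow("worker", "load-balancer", 6443),          # nodes reach the API via the LB
    Flow("infra", "load-balancer", 6443),
}


def is_allowed(src: str, dst: str, port: int) -> bool:
    """Return True if the flow is on the allow-list."""
    return Flow(src, dst, port) in ALLOWED_FLOWS


def unexpected(observed):
    """Return the observed flows that are not on the allow-list."""
    return [flow for flow in observed if flow not in ALLOWED_FLOWS]


if __name__ == "__main__":
    seen = [Flow("load-balancer", "infra", 443), Flow("worker", "control-plane", 22)]
    for flow in unexpected(seen):
        print("unexpected flow:", flow)
```

Keeping the allowed flows as data makes it easy to compare them against what your firewall or flow logs actually report.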
Project Creation — NetworkPolicy
Now that we have the external network organized, we need to take care of the internal network. When creating projects in the cluster there are many considerations, such as default LimitRanges, ResourceQuotas, ClusterResourceQuotas, etc. In this article I am not going to discuss resource management; the focus is on defending the system from penetration. In most implementations of OpenShift you have multiple groups working on the cluster, each with their own projects. It is important to isolate the groups so that they only see the projects they work on and are also protected from other applications running on the cluster.
My recommendation when creating projects in the cluster is to do the following:
- Set the default node-selector at the project level to only allow scheduling on worker nodes.
- Create 2 Network Policies that will isolate the project: Deny all traffic from other namespaces and allow traffic only from the OCP router.
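The two recommendations above can be sketched as a small Python script that prints the manifests as JSON. This is only a sketch under assumptions: the project name `team-a` and the policy names are hypothetical, and the `network.openshift.io/policy-group: ingress` namespace label used to match the router is what I would expect on recent OCP versions, so verify it against your cluster before relying on it.

```python
import json


def project_manifests(project: str) -> list:
    """Build a namespace with a default node selector plus two NetworkPolicies."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": project,
            "annotations": {
                # Default node selector: schedule only on worker nodes.
                "openshift.io/node-selector": "node-role.kubernetes.io/worker=",
            },
        },
    }
    # Allowing ingress only from pods in the same namespace implicitly
    # denies traffic from every other namespace.
    allow_same_namespace = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": project},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    # Allow ingress from the namespace that hosts the OCP router.
    # The label below is an assumption; check it on your cluster.
    allow_from_router = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-from-openshift-ingress", "namespace": project},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{
                    "namespaceSelector": {
                        "matchLabels": {"network.openshift.io/policy-group": "ingress"}
                    }
                }]
            }],
        },
    }
    return [namespace, allow_same_namespace, allow_from_router]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": project_manifests("team-a")}
    print(json.dumps(bundle, indent=2))
```

The printed JSON can be reviewed and then applied with `oc apply -f -`, or baked into whatever project template you already use.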
The above implementation locks down each project from external access by default and restricts its workloads to worker nodes in the worker VLAN. Keep in mind that anyone who has admin on these projects can change the above-mentioned settings. To maintain the state, you can introduce a powerful admission controller called Kyverno (https://kyverno.io/), which allows you to enforce policies in the cluster, such as blocking the deletion or editing of particular objects.
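As a rough illustration of the Kyverno idea, the sketch below builds a ClusterPolicy that denies UPDATE and DELETE on named NetworkPolicies for everyone except cluster-admin. It follows the shape of Kyverno's documented "block updates and deletes" sample as I understand it; the policy name and the protected object names are hypothetical, and field names such as `validationFailureAction` vary between Kyverno releases, so check them against the version you install.

```python
import json


def protect_network_policies(names: list) -> dict:
    """Build a Kyverno ClusterPolicy that blocks edits/deletes of the named NetworkPolicies."""
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "protect-project-networkpolicies"},  # hypothetical name
        "spec": {
            "validationFailureAction": "Enforce",  # reject instead of only auditing
            "background": False,
            "rules": [{
                "name": "block-updates-deletes",
                "match": {"any": [{"resources": {
                    "kinds": ["NetworkPolicy"],
                    "names": names,
                }}]},
                # Cluster admins can still manage the protected objects.
                "exclude": {"any": [{"clusterRoles": ["cluster-admin"]}]},
                "validate": {
                    "message": "This NetworkPolicy is protected; ask a cluster admin.",
                    "deny": {"conditions": {"any": [{
                        "key": "{{request.operation || 'BACKGROUND'}}",
                        "operator": "AnyIn",
                        "value": ["DELETE", "UPDATE"],
                    }]}},
                },
            }],
        },
    }


if __name__ == "__main__":
    policy = protect_network_policies(
        ["allow-same-namespace", "allow-from-openshift-ingress"]  # hypothetical names
    )
    print(json.dumps(policy, indent=2))
```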
RBAC — No short-cuts
In my opinion, the recommended way to manage user permissions in an enterprise OCP cluster is through groups. Usually these groups are synced to the cluster from an IDP inside the organization. So that users only see what is relevant to them when connecting to the cluster, and so that their permissions stay under control, the permissions on each project should be granted as follows:
- ProjectAdminGroup — a group of users who will have admin on the given projects.
- ProjectDevGroup — a group of users who will have edit permissions on the projects.
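A minimal sketch of what those two bindings could look like, generated as JSON from Python. It assumes the built-in `admin` and `edit` ClusterRoles and uses the group names from the list above as placeholders; the project name `team-a` is hypothetical.

```python
import json


def group_role_binding(project: str, group: str, cluster_role: str) -> dict:
    """Bind a group to a built-in ClusterRole inside a single project."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            "name": f"{group.lower()}-{cluster_role}",
            "namespace": project,
        },
        "subjects": [{
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Group",
            "name": group,
        }],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
    }


if __name__ == "__main__":
    bindings = [
        group_role_binding("team-a", "ProjectAdminGroup", "admin"),
        group_role_binding("team-a", "ProjectDevGroup", "edit"),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": bindings}, indent=2))
```

Because the binding is a RoleBinding inside the project rather than a ClusterRoleBinding, the group gets no permissions anywhere else on the cluster.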
I love Security Context Constraints (SCCs), the OpenShift feature that controls what a container is allowed to do. In my opinion, this is the cornerstone of protecting the system. A lot of cluster administrators get tired of fighting with their dev teams about locking down permissions, as adapting to the methodology can take some time, but it's worth it! The main approach here should be as follows:
- No container runs privileged!
- No container runs as root!
- No container can access any files on the hosts!
- In the rare cases where extra permissions are needed, grant only those specific permissions (a requirement to run as a particular UID, mounting an external NFS share, etc.).
- Don't be afraid to push back and tell the DevOps guys “No!”
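In OpenShift, rules like these are enforced at admission time by Security Context Constraints. As a complement, here is a small pre-deployment lint, sketched in Python, that flags the three cases from the list in a pod spec. The function name is hypothetical, and the check is intentionally simple: it only catches an explicit `runAsUser: 0`, not images that default to root.

```python
def pod_spec_violations(pod_spec: dict) -> list:
    """Return human-readable problems found in a pod spec (a plain dict)."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})

    # No container can access files on the host.
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name')}: hostPath mount")

    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        name = container.get("name")
        sc = container.get("securityContext", {})

        # No container runs privileged.
        if sc.get("privileged"):
            problems.append(f"container {name}: privileged")

        # No container runs as root (container setting overrides the pod setting).
        run_as_user = sc.get("runAsUser", pod_sc.get("runAsUser"))
        if run_as_user == 0:
            problems.append(f"container {name}: runs as UID 0")

    return problems


if __name__ == "__main__":
    spec = {
        "volumes": [{"name": "host-etc", "hostPath": {"path": "/etc"}}],
        "containers": [
            {"name": "app", "securityContext": {"privileged": True, "runAsUser": 0}}
        ],
    }
    for problem in pod_spec_violations(spec):
        print(problem)
```

A check like this fits well in a CI pipeline, so developers hear "No!" before the manifest ever reaches the cluster.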
Bastion Host — Lock it up
In most clusters that I have worked with there is always a bastion Linux host that is used to manage the system as well as to access all of the OCP Nodes via SSH. As you can imagine, access to this host by a bad actor could be catastrophic. The following rules should be applied to the host:
- Access to the host should be limited to cluster admins.
- If the host is connected to the IDP for login, then sudo on the host should be limited to the cluster admins group.
- The SSH key used to connect to the OCP Nodes should be guarded tightly! It should sit under /root/.ssh/ and, if possible, be stored in a secure vault solution.
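One cheap way to keep an eye on the last rule is a periodic check of the key file's owner and mode. The sketch below uses only the Python standard library; the function name and the example key path are hypothetical.

```python
import os
import stat


def key_permission_problems(path: str, expected_uid: int = 0) -> list:
    """Return problems with the ownership or mode of a private key file."""
    st = os.stat(path)
    problems = []

    # The key should belong to the expected user (root by default).
    if st.st_uid != expected_uid:
        problems.append(f"{path}: owned by UID {st.st_uid}, expected {expected_uid}")

    # No permission bits for group or others (i.e. 0600 or stricter).
    mode = stat.S_IMODE(st.st_mode)
    if mode & 0o077:
        problems.append(f"{path}: mode {mode:04o} is accessible to group/others")

    return problems


if __name__ == "__main__":
    key_path = "/root/.ssh/id_rsa"  # hypothetical file name
    if os.path.exists(key_path):
        for problem in key_permission_problems(key_path):
            print(problem)
```

This does not replace a vault, but it will tell you quickly if someone has loosened the permissions on the key.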
Finally, it is important that you implement one of the many solutions for monitoring and auditing the cluster, such as Red Hat Advanced Cluster Security (aka StackRox). Whichever solution you choose should provide the following:
- Monitoring of network activity within the cluster with the ability to apply strict NetworkPolicies if needed.
- Auditing/tracking of user activity.
- Alerting to a NOC when sensitive areas of the cluster are accessed.
- Scanning of container images running on the cluster. On this point, I would also recommend implementing the scanning process on the external registry that containers are pulled from (OpenSCAP on Quay.io, Xray on JFrog Artifactory, etc.).