This post updates content I previously wrote on Security Context Constraints (SCCs) and Linux capabilities in OpenShift, in light of the Pod Security Standards implementation in Kubernetes and the important SCC improvements introduced in OpenShift 4.11 and later. I will also use this opportunity to collect the most important links and docs in a single place for those interested in the subject.
What is Pod Security Admission?
Pod Security Admission is the process of validating pod creation at the API level based on security options requested in the pod definition. Those security options reside mainly in the securityContext field, available at both the pod and the container levels in the pod's manifest. Other fields related to host resources, such as PIDs, file system paths, and networking, may also be validated.
With Pod Security Policies deprecated (and removed in Kubernetes 1.25) and the Pod Security Standards defined, Kubernetes gained another admission controller called PodSecurity, which reached stable status in version 1.25. This admission controller validates requested workloads against the Pod Security Standards.
The graph below shows the Workload Admission Flow in OpenShift and Kubernetes. Pod security is in the validating phase of the admission process. It's part of a new controller compiled within the kube-apiserver binary, and therefore it's a Kubernetes built-in feature.
How does Pod Security Admission work?
Pod Security Admission separates policies into three levels: privileged, baseline, and restricted. Those levels are activated per namespace by applying labels to the respective namespace. The privileged level is completely unrestricted, baseline is minimally restrictive while preventing known privilege escalations, and restricted is the most restrictive, following current pod hardening best practices.
You can configure the three levels in three different modes:
- enforce: Rejects creation on noncompliant requests.
- audit: Creates an entry in the audit log but allows creation.
- warn: Sends a warning message to the user but allows creation.
You can set all three modes on the same namespace at once, each with its own level. Check the Pod Security Admission docs for more information.
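As a minimal sketch of what those labels look like, the snippet below writes a Namespace manifest that enforces baseline while auditing and warning against restricted. The namespace name `psa-demo` and the file name are placeholders of mine; the label keys follow the standard `pod-security.kubernetes.io/<mode>` pattern.

```shell
# Write a Namespace manifest that sets one level per mode.
# "psa-demo" and the file name are placeholders, not values from a real cluster.
cat > psa-demo-namespace.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: psa-demo
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
EOF

# On a cluster you could then apply it with:
#   oc apply -f psa-demo-namespace.yaml
```

With this combination, a pod that violates baseline is rejected, while a pod that passes baseline but not restricted is admitted with a warning and an audit entry.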
In OpenShift, the privileged level is enforced globally. Another controller synchronizes the warn and audit labels of each namespace to match the most privileged profile available to the service accounts in that namespace. For example, if a service account can use the privileged SCC, the warn and audit labels are set to the privileged level to prevent unnecessary warnings or audit entries. The exceptions to this rule are system namespaces that are part of the cluster installation, such as default and any namespace whose name starts with kube or openshift. Those namespaces have label synchronization permanently disabled. You can find more details on label synchronization here.
The following section describes what happens when you bind a service account to a privileged SCC.
You can see the default rules applied by creating a new empty namespace and then checking its labels. Its audit and warn labels are set to the restricted level, pinned to version 1.24.
oc describe ns test
Annotations: openshift.io/sa.scc.mcs: s0:c26,c10
No resource quota.
No LimitRange resource.
Creating a test service account:
oc create sa test-sa
Binding it to the default privileged SCC:
oc apply -f - <<EOF
- kind: ServiceAccount
Check the namespace again and observe that the label values changed to privileged. By default, the labels are synchronized to the most privileged SCC that a service account in that namespace can use. Pod creation in this namespace no longer generates warnings or audit entries, and Pod Security Admission does not interfere with the SCC policies. This is a good example of label synchronization preventing unnecessary warnings and audit messages in a namespace.
oc describe ns test
Annotations: openshift.io/sa.scc.mcs: s0:c26,c10
No resource quota.
No LimitRange resource.
How does that compare with Security Context Constraints?
Security Context Constraints are evaluated by a different controller under the OpenShift API and are part of the overall pod admission process in OpenShift, giving much more granular control over pod security contexts. You can customize them, and they can be considered a different layer of defense. While Pod Security Standards are applied through labels to all pods created in a given namespace, SCCs provide customizable validation applied to specific pods through Role Based Access Control (RBAC). So, instead of a label applied to a namespace, you need a Role or ClusterRole that allows the use of an SCC and a RoleBinding or ClusterRoleBinding that ties that role to a service account.
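To make the RBAC side concrete, here is a hedged sketch of a Role that grants the `use` verb on one SCC and a RoleBinding that ties it to a service account. It reuses the `test` namespace and `test-sa` service account from the walkthrough above and the `nonroot-v2` SCC discussed later; the Role and RoleBinding names and the file name are my own placeholders.

```shell
# Write a Role that allows using the nonroot-v2 SCC and bind it to test-sa.
# Object names (use-nonroot-v2, test-sa-use-nonroot-v2) are illustrative.
cat > scc-rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-nonroot-v2
  namespace: test
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["nonroot-v2"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: test-sa-use-nonroot-v2
  namespace: test
subjects:
- kind: ServiceAccount
  name: test-sa
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: use-nonroot-v2
EOF

# On a cluster you could then apply it with:
#   oc apply -f scc-rbac.yaml
# A shorter route to a similar result is:
#   oc adm policy add-scc-to-user nonroot-v2 -z test-sa -n test
```

Only pods that run under `test-sa` become eligible for `nonroot-v2`; other service accounts in the namespace keep whatever SCCs they could already use.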
In OpenShift, the default pod admission configuration enforces the privileged level globally and applies the restricted level only in warn and audit modes on all namespaces. As long as those settings are left unchanged, SCCs are still required to validate pod requests. In addition, another controller updates the warn and audit labels on namespaces where service accounts with SCC-granted privileges exist.
I explored the definitions in previous blog posts and explained how SCCs work with pods, containers, and process privileges inside a worker node.
For reference, check the links below:
What is new in Security Context Constraints Version 2
The OpenShift 4.11 release introduced new versions (v2) of some SCCs, which bring important security improvements. They are hostnetwork-v2, nonroot-v2, and restricted-v2. I'll explore what is different in those SCCs and discuss the impact on OpenShift workloads.
|Version 1||Version 2|
|allowPrivilegeEscalation: true||allowPrivilegeEscalation: false|
The description annotation of each v2 SCC explains what differs from the legacy versions of the restricted, hostnetwork, and nonroot SCCs.
kubernetes.io/description: restricted-v2 denies access to all host features and
requires pods to be run with a UID, and SELinux context that are allocated to
the namespace. This is the most restrictive SCC and it is used by default for
authenticated users. On top of the legacy 'restricted' SCC, it also requires
to drop ALL capabilities and does not allow privilege escalation binaries. It
will also default the seccomp profile to runtime/default if unset, otherwise
this seccomp profile is required.
kubernetes.io/description: hostnetwork allows using host networking and host ports
but still requires pods to be run with a UID and SELinux context that are allocated
to the namespace. On top of the legacy 'hostnetwork' SCC, it also requires to
drop ALL capabilities and does not allow privilege escalation binaries. It will
also default the seccomp profile to runtime/default if unset, otherwise this
seccomp profile is required.
kubernetes.io/description: nonroot provides all features of the restricted SCC
but allows users to run with any non-root UID. The user must specify the UID
or it must be specified on the by the manifest of the container runtime. On
top of the legacy 'nonroot' SCC, it also requires to drop ALL capabilities and
does not allow privilege escalation binaries. It will also default the seccomp
profile to runtime/default if unset, otherwise this seccomp profile is required.
The v2 SCC manifests change three areas: privilege escalation, Linux capabilities, and seccomp profiles. I'll examine each below.
A. Privilege Escalation
In this case, privilege escalation does not mean gaining privileges to perform a system attack. Privilege escalation is a normal Linux activity, although it can be exploited. When a process executes a binary, the kernel's credential check looks at the privilege escalation markers on that file: the SUID and SGID mode bits, and any file capabilities stored in its extended attributes. SUID and SGID are the general case, making the process run with the user ID or group ID of the file's owner, which may be root. File capabilities are more specific and allow the process to gain particular privileges beyond those inherited from its parent process.
What does that mean for containers? Containers are processes like any other. They can call binaries available in their file systems and create child processes within their namespace context. If any of those binaries are marked with SUID or SGID bits or have file capabilities embedded on them, they may request privilege escalation to perform certain actions. With privilege escalation set to true on both the SCC applied to a container and its security context, they will be granted that privilege. Even if all capabilities are dropped, and the SCC is restricted and doesn't allow the container to run as root, a child process may be created by the container with elevated privileges if its binary has those magic bits. The first line of defense against a security weakness caused by those types of files is using a heavily restricted and smart image scanning process to prevent the files from getting there in the first place.
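As a complement to image scanning, the following is a rough sketch of how you could look for such files yourself, for example against an unpacked image root or inside a running container. The function names `find_setid_files` and `find_file_caps` are my own, and the second one assumes the `getcap` tool from libcap is installed; it prints nothing otherwise.

```shell
# List regular files under a directory that carry the set-user-ID or
# set-group-ID bit. Unreadable paths are ignored.
find_setid_files() {
  find "$1" -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null || true
}

# File capabilities live in extended attributes; getcap lists them
# recursively when the tool is available.
find_file_caps() {
  if command -v getcap >/dev/null 2>&1; then
    getcap -r "$1" 2>/dev/null || true
  fi
}

# Scan the directory given as first argument, defaulting to /usr/bin.
scan_dir="${1:-/usr/bin}"
find_setid_files "$scan_dir"
find_file_caps "$scan_dir"
```

Any path printed by either function is a candidate for the kind of escalation described above and is worth reviewing before the image is shipped.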
With privilege escalation set to false, no child process can elevate its privileges by this method, even if the file is available to the container, which greatly improves security. Under the hood, this setting controls the Linux no_new_privs flag on the container's processes, so SUID and SGID bits and file capabilities no longer grant anything at exec time. If a container tries to run a process with elevated privileges, it will be denied, and the attempt may be logged for further investigation.
To understand how Linux capabilities in the container world work and how file capabilities could be used to elevate privileges with privilege escalation set to true, read the blog post Linux Capabilities in OpenShift. Some of its examples only work with the legacy restricted SCC (version 1), because it has privilege escalation set to true.
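For illustration, here is a minimal sketch of a pod manifest that sets these fields explicitly instead of relying on SCC defaults. The pod name, container name, image reference, and file name are placeholders of mine; the securityContext fields are standard Kubernetes ones.

```shell
# Write a pod manifest that disables privilege escalation, drops all
# capabilities, and asks for the runtime's default seccomp profile.
# Names and the image reference are placeholders.
cat > restricted-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restricted-demo
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/demo/app:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
EOF

# On a cluster you could then apply it with:
#   oc apply -f restricted-demo-pod.yaml
```

Inside a container started this way, `grep NoNewPrivs /proc/self/status` should report 1, which is the kernel-level effect of allowPrivilegeEscalation: false.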
To demonstrate this, I can rerun one of the examples from the Linux capabilities articles with file capabilities.
Suppose you have a container image with the nc command changed to have the CAP_NET_BIND_SERVICE capability in the binary itself, like below:
RUN setcap 'cap_net_bind_service+ep' /usr/bin/nc
CMD ["/usr/bin/nc", "-lvu", "443"]
Next, create an unprivileged pod exactly as in the previous example:
oc apply -f - <<EOF
- name: nonroot-capset
Notice the pod is in the error state.
oc get pods
NAME READY STATUS RESTARTS AGE
nonroot-capset-6d4b856b6-chdk7 0/1 Error 0 10s
privileged-test-848db66c59-x7k6c 1/1 Running 0 77m
Check the logs:
oc logs nonroot-capset-6d4b856b6-chdk7
exec /usr/bin/nc: operation not permitted
This is due to allowPrivilegeEscalation: false in the restricted-v2 SCC, which is assigned to this pod. When the nc command runs, it "requests" a privilege escalation from the system through its file capability, which is no longer allowed. The result is one more layer blocking that feature. To see the example with the legacy restricted SCC, check the previous blog post on capabilities here.
oc describe pods nonroot-capset-6d4b856b6-chdk7 | grep scc
B. Linux Capabilities
Capabilities are sets of permissions split out of the privileged user. They were conceived to divide superuser permissions into smaller chunks and allow non-privileged processes to execute partially privileged tasks. Some are actually quite privileged but still don't have all the power of root. The legacy SCCs meant for non-root users dropped only the capabilities considered most dangerous: KILL, MKNOD, SETUID, and SETGID. With v2, they drop ALL and allow only NET_BIND_SERVICE (CAP_NET_BIND_SERVICE) to be added back, which lets a container bind to a low port number (below 1024) in its network namespace. This only covers process capabilities requested per pod or container, not file capabilities like the ones above. You can check the differences in the blog post mentioned above.
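To see which process capabilities a container actually ended up with, you can read the Cap* lines in /proc/<pid>/status, which are hexadecimal bitmasks. The helper below is a small sketch for testing one bit of such a mask; `has_cap` is a name I made up, and 10 is the number of CAP_NET_BIND_SERVICE in linux/capability.h.

```shell
# Succeed if capability number $2 is set in the hex mask $1, as printed
# in the CapEff/CapPrm/CapBnd lines of /proc/<pid>/status.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_BIND_SERVICE=10   # capability number from linux/capability.h

# A mask with only bit 10 set, which is what "drop ALL, add
# NET_BIND_SERVICE" would look like.
if has_cap 0000000000000400 "$CAP_NET_BIND_SERVICE"; then
  echo "NET_BIND_SERVICE present"
fi

# On Linux, the masks of the current process can be listed with:
#   grep '^Cap' /proc/self/status
```

A container that dropped ALL and added nothing back shows a mask of all zeros, so the same check fails for every capability number.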
C. Seccomp profiles
Seccomp, or Secure Computing, sometimes called secure computing mode, is a kernel feature capable of filtering system calls. Instead of blocking a container for privileges it requested or blocking a child process from elevating its privileges, it will filter system calls performed by an application from user space to the kernel. It uses BPF to catch system calls on the fly, reducing the kernel surface exposed to user applications. It's a completely different perspective since it's not easy to pinpoint what privileges or capabilities grant access to a specific system call. Those are not one-to-one relationships since some capabilities may give access to multiple system calls simultaneously. Another point is that system calls are not tied to a file or a process as a credential that can be validated but can be performed at any point in the lifetime of a process.
Various concerns exist, including some around the unshare system call. It's so powerful that by using the command that executes it, I can start building a container by hand without podman or docker. Here is a tutorial I wrote several years ago on doing that. So imagine a container running an application that can create child processes able to escape the container. That system call should be blocked by default, especially after some vulnerabilities were discovered (see below).
Seccomp support for containers has been generally available in Kubernetes since version 1.19. Up to OpenShift 4.10, containers were unconfined in terms of system calls, i.e., running without any seccomp profile. From OpenShift 4.11 on, v2 SCCs apply the runtime/default profile, which still allows many system calls but filters the rest. The filtering rules and system call names are arranged in a JSON file that must be available on the node where the containers run. Custom profiles can, and should, be configured with the minimum set of system calls an application needs, providing the maximum level of security. Each user is now able to fine-tune access to system calls.
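As a sketch of what such a JSON file looks like, the snippet below writes an allowlist-style profile: every system call fails with an error unless it is named explicitly. The file name and the list of system calls are purely illustrative and far too small for a real application; a real list would come from tracing what the application actually calls.

```shell
# Write an allowlist-style seccomp profile: deny by default, allow only
# the named system calls. The syscall list is illustrative, not sufficient.
cat > demo-seccomp.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF

# A pod would reference the file, relative to the kubelet's seccomp
# directory on the node, roughly like this:
#   securityContext:
#     seccompProfile:
#       type: Localhost
#       localhostProfile: profiles/demo-seccomp.json
```

Since unshare is not in the list, it is denied along with everything else that is not named. Keep in mind that the SCC assigned to the pod must also permit the chosen profile, which a v2 SCC limited to runtime/default would not.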
Security concerns addressed
Even with seccomp profiles in place, CRI-O, the OpenShift container engine, has gone a step further and removed access to the unshare syscall for unprivileged processes, which addresses the dangerous example I mentioned above. This was good work done by the CRI-O team. You can check that out here and here. So unshare gets filtered from unprivileged processes by mutating the rules. That's a great security improvement in OpenShift.
Read this article published by Aquasec to understand the issue better.
Clearly, there is more to security than Pod Admission and SCCs. Still, those two are the first layer of complexity that users and partners will face trying to run their applications and operators on OpenShift. This brief article shared an overview of the latest developments on those two topics, pointing out some useful links for the ones experimenting with those features for the first time. I hope this information helps you succeed and secure your applications in OpenShift. Thanks for reading it.