Goals

  • Understand the concepts of Taint & Toleration
  • How to use Taint & Toleration in OpenShift 4
  • Understand Toleration in DaemonSet

Taints are a key feature for controlling scheduling based on node capabilities, but there is some confusion about how node labels and taints relate. Are they the same?

What is a Taint?

A taint is a kind of label that allows a node to repel a set of pods that should not run on it. You apply taints to a node through the node specification (NodeSpec).

What is a Toleration?

A toleration is simply a way to overcome a taint. You apply tolerations to a pod through the pod specification (PodSpec).

Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
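To make the PodSpec side concrete, here is a minimal sketch of a pod that carries a toleration. The pod name, container name, image, and the key/value pair are placeholders chosen for illustration, not values from a real cluster.

```yaml
# Hypothetical pod; names, image, and the key/value pair are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  containers:
  - name: app
    image: nginx
  tolerations:              # tolerations live in the PodSpec
  - key: "key"              # must equal the taint key on the node
    operator: "Equal"
    value: "value"          # must equal the taint value when operator is Equal
    effect: "NoSchedule"    # must equal the taint effect
```

A pod like this is allowed onto a node tainted with key=value:NoSchedule. The toleration does not force the pod onto that node; it only removes the restriction.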

Taint a Node

Tainting a node is similar to labeling a node.

oc taint nodes node1 key=value:NoSchedule # (1)
  1. Places a taint on node node1. The taint has key 'key', value 'value', and taint effect 'NoSchedule'.
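Since taints are stored in the NodeSpec, the command above ends up as an entry under spec.taints of the Node object. The fragment below is a sketch of roughly what the relevant part of the node looks like afterwards (for example in the output of oc get node node1 -o yaml); all other node fields are omitted.

```yaml
# Sketch of the relevant part of the Node object; other fields omitted.
apiVersion: v1
kind: Node
metadata:
  name: node1
spec:
  taints:
  - key: "key"
    value: "value"
    effect: "NoSchedule"
```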

Table 1. Taint Effect

Effect             Description

NoSchedule         • New pods that do not match the taint are not scheduled onto that node.
                   • Existing pods on the node remain.

PreferNoSchedule   • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
                   • Existing pods on the node remain.

NoExecute          • New pods that do not match the taint cannot be scheduled on that node.
                   • Existing pods on the node that do not have a matching toleration are removed.

Add a Toleration to a Pod

Because tolerations are set at the pod level, they also work with higher-level objects that create pods, such as deployments, or even projects.

Here are a couple of examples of defining tolerations.

Deployment toleration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 10
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
      tolerations:
      - effect: NoSchedule
        operator: Exists

Project toleration.

Applying a toleration at the namespace level is an interesting option, because all objects created in that namespace (except DaemonSets) inherit the namespace toleration.

kind: Project
apiVersion: "project.openshift.io/v1"
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: admin
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"key": "dedicated-node", "operator": "Equal", "value": "infra", "effect": "NoSchedule"}]' # (1)
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]' # (2)
  name: toleration-prj
spec:
  finalizers:
  - kubernetes
  1. The PodTolerationRestriction admission controller merges the tolerations annotated on the namespace into the tolerations of the pod.
  2. The PodTolerationRestriction admission controller then checks the resulting tolerations against the namespace’s whitelist of tolerations.

Remember:

  • Tolerations are assigned to a namespace via the following annotation keys:
      scheduler.alpha.kubernetes.io/defaultTolerations
      scheduler.alpha.kubernetes.io/tolerationsWhitelist

Daemon Set toleration

Because of the way DaemonSet (DS) pods are scheduled (for more information, see “DaemonSet: Scheduled by default scheduler” and “How Daemon Pods are Scheduled”), a DaemonSet that has no explicit toleration for a node taint will simply fail to be scheduled on that node.

The previous section discussed project tolerations. Even though DaemonSet pods go through the admission chain, node assignment has already been done by the DaemonSet controller before the pods reach it, so tolerations inherited from the namespace arrive too late. Tolerations must therefore be set explicitly in the DS spec template. At the time of writing there is an open RFE for reconciliation between PodTolerationRestriction and the DS controller.
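As a sketch of what "explicitly in the DS spec template" means, the DaemonSet below carries its own toleration for the dedicated-node=infra:NoSchedule taint used in the project example. The DaemonSet name, labels, container name, and image are hypothetical placeholders.

```yaml
# Hypothetical DaemonSet; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-ds
spec:
  selector:
    matchLabels:
      app: example-ds
  template:
    metadata:
      labels:
        app: example-ds
    spec:
      tolerations:                # set here, not inherited from the namespace
      - key: dedicated-node
        operator: Equal
        value: infra
        effect: NoSchedule
      containers:
      - name: agent
        image: registry.example.com/agent:latest   # placeholder image
```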

How Pod Toleration match Taint

  • A toleration “matches” a taint if the keys are the same, the effects are the same, and the operator condition below holds.
  • If the operator is Equal, the values must also be equal.
  • If the operator is Exists, no value should be specified.
  • If the operator is not specified, it defaults to Equal.

Assume we apply the following taint on node “node1”:

oc taint nodes node1 node-type=special:NoSchedule

Table 2. Examples of pod tolerations matching a taint
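The matching rules can be illustrated against the node-type=special:NoSchedule taint applied above. The entries below are an illustration derived from those rules, not a reproduction of the original table; each entry should be read on its own as an alternative, and the comment says whether that single toleration would match the taint.

```yaml
# Each entry is evaluated individually against node-type=special:NoSchedule.
tolerations:
- key: node-type          # matches: same key, value, and effect
  operator: Equal
  value: special
  effect: NoSchedule
- key: node-type          # matches: operator omitted, so it defaults to Equal
  value: special
  effect: NoSchedule
- key: node-type          # matches: Exists ignores the value
  operator: Exists
  effect: NoSchedule
- key: node-type          # no match: value differs
  operator: Equal
  value: general
  effect: NoSchedule
- key: node-type          # no match: effect differs
  operator: Equal
  value: special
  effect: NoExecute
- key: other-key          # no match: key differs
  operator: Exists
  effect: NoSchedule
```

In a real pod, the tolerations list is satisfied for a taint as soon as any one entry matches it.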

Note

There are two special cases:

  • An empty key with operator Exists matches all keys, values, and effects, which means it tolerates everything.
  • A specified key with operator Exists and an empty effect matches all effects with this key.
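The two special cases look like this when written out. The two documents below are separate alternatives, shown as a sketch; node-type is reused from the earlier example.

```yaml
# Case 1: empty key with Exists tolerates every taint on every node.
tolerations:
- operator: Exists
---
# Case 2: key with Exists and no effect tolerates node-type taints
# regardless of effect (NoSchedule, PreferNoSchedule, or NoExecute).
tolerations:
- key: node-type
  operator: Exists
```

The first form is very broad, so it is usually reserved for workloads that genuinely need to run everywhere.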

Debug a Node with taint

OpenShift 4 nodes run RHCOS, and logging in to the hosts over SSH is discouraged. Instead, use:

oc debug node/<node>

When oc debug node is run against a tainted node, the debug pod is terminated.

To get around this, create a project named “debug-project” and apply the namespace defaultTolerations annotation as explained before, so that debug pods created in that project tolerate every taint:

oc patch namespace debug-project --type=merge -p '{"metadata": {"annotations": { "scheduler.alpha.kubernetes.io/defaultTolerations": "[{\"operator\": \"Exists\"}]"}}}'

 

Dedicating resources using Taints and Toleration

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn’t be running.

  • Dedicated Nodes: To dedicate a set of nodes for exclusive use by a particular set of users, add a taint to those nodes and a matching toleration to those users' pods.
  • Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don’t need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware and adding a corresponding toleration to pods that use the special hardware.
  • A common use case in OCP 4 is infra-node dedication (see “Moving resources to infrastructure MachineSets”). You can also refer to the documentation “Using tolerations to control cluster logging pod placement”.
  • Some OpenShift components that should run on masters also carry a toleration for node-role.kubernetes.io/master:NoSchedule.
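The dedicated-nodes case from the first bullet can be sketched as follows. It assumes the nodes were tainted with a hypothetical dedicated=team-a:NoSchedule taint and also labeled dedicated=team-a; the pod name, container name, and image are placeholders. The toleration only lets the pod onto the tainted nodes, so a node selector is added here to keep it from landing on other nodes as well.

```yaml
# Hypothetical pod for nodes tainted and labeled with dedicated=team-a.
apiVersion: v1
kind: Pod
metadata:
  name: team-a-app
spec:
  nodeSelector:
    dedicated: team-a       # attracts the pod to the dedicated nodes (assumed label)
  tolerations:
  - key: dedicated          # allows the pod past the assumed taint
    operator: Equal
    value: team-a
    effect: NoSchedule
  containers:
  - name: app
    image: nginx
```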