Since the inception of OpenShift 4, installer-provisioned-infrastructure clusters have utilised the Machine API to manage worker and infrastructure machines via MachineSet resources.
However, while OpenShift 4 clusters have had control-plane machines, these have been standalone and unmanaged.

Users have not had the ability to scale or modify their control-plane machines without manual work and checks to ensure the health of the control-plane.

As adoption of OpenShift grows within an organisation and its clusters grow in size, the additional worker machines and the workloads running on them put more pressure on the control-plane. Users then look to vertically scale their control-plane instances to cope with the additional load.

Since scaling the control-plane requires management of the OpenShift cluster’s etcd cluster, the process of scaling a control-plane manually is very involved and requires specific knowledge and careful execution of steps in the correct order.

As managed services have grown, this process has not evolved and has become a significant bottleneck for the SREs managing these services. An automated solution is required.

Introducing the ControlPlaneMachineSet

The ControlPlaneMachineSet is a new resource within the OpenShift Machine API ecosystem, introduced in 4.12. It manages the cluster’s control-plane machines and adds new automation on top of the existing Machine API concepts.

The OpenShift team often refer to Machine and MachineSet resources as being analogous to Pod and ReplicaSet resources. The MachineSet is to a Machine as the ReplicaSet is to a Pod. If we extend this analogy, a ControlPlaneMachineSet is similar to a StatefulSet.

Rather than managing an arbitrary number of identical Machine resources, like a MachineSet would, the ControlPlaneMachineSet manages a small number of identical Machines and adds special logic on top of the Machines to provide functionality such as rolling-update replacement of the Machines as well as spreading the Machines across multiple failure domains.

With a ControlPlaneMachineSet installed and active within a cluster, users can now modify parameters of their control-plane specification and watch as the ControlPlaneMachineSet automatically and safely replaces the control-plane machines with new machines built from the updated spec.

What can I use a ControlPlaneMachineSet for?

The ControlPlaneMachineSet can be used to perform rolling update replacements of control-plane Machines within OpenShift. For example, if you need to move the control-plane Machines to a larger instance type, you can edit the provider specification in the ControlPlaneMachineSet spec. This triggers a complete rolling replacement of the control-plane Machines within the cluster, allowing you to make automated changes to the control-plane infrastructure in a safe and controlled manner.
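
As an illustration of what such an edit might look like, the sketch below builds a merge patch that changes only the instance type. The field path (spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.instanceType) and the instance type value are assumptions for an AWS cluster and should be checked against the ControlPlaneMachineSet on your own cluster; the helper name instance_type_patch is hypothetical.

```python
import json

# Assumed key under spec.template for Machine API machines; verify on your cluster.
TEMPLATE_KEY = "machines_v1beta1_machine_openshift_io"


def instance_type_patch(instance_type: str) -> dict:
    """Build a merge patch that changes only the provider spec instance type."""
    return {
        "spec": {
            "template": {
                TEMPLATE_KEY: {
                    "spec": {
                        "providerSpec": {
                            "value": {"instanceType": instance_type}
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # "m6i.2xlarge" is only an example value, not a recommendation.
    print(json.dumps(instance_type_patch("m6i.2xlarge"), indent=2))
```

A patch like this could then be applied with a merge-type oc patch against the ControlPlaneMachineSet resource, or the same field could be changed interactively with oc edit.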

How does the ControlPlaneMachineSet work?

The ControlPlaneMachineSet constantly monitors the control-plane Machines within the cluster.

It compares the desired specification (from within the resource spec) to the existing configuration of the control-plane Machines.

When it detects a difference, it iterates through the control-plane Machines and, one by one, replaces each with an up-to-date Machine. This is an example of the immutable infrastructure concept: rather than modifying a Machine in place, it creates a new Machine, waits for that Machine to join the cluster, and then marks the old Machine for deletion.

Once the old Machine is removed (i.e. there should be no more than one additional Machine in the cluster at any time), it moves on to the next control-plane Machine and repeats the process until all of the Machines have been updated.
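
The replacement loop described above can be sketched as a small simulation. This is not the operator's actual code: the Machine class, the rolling_replace function and the naming of replacement machines are all hypothetical, and waiting for a new Machine to join the cluster is modelled as instantaneous.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Machine:
    name: str
    instance_type: str


def rolling_replace(machines: List[Machine], desired_type: str) -> Tuple[List[str], int]:
    """Replace outdated machines one at a time.

    Returns the list of events and the peak number of machines observed.
    """
    events: List[str] = []
    peak = len(machines)
    for old in list(machines):  # iterate over a snapshot of the original set
        if old.instance_type == desired_type:
            continue  # already matches the desired spec
        replacement = Machine(f"{old.name}-replacement", desired_type)
        machines.append(replacement)  # create the new machine first
        peak = max(peak, len(machines))
        events.append(f"create {replacement.name}")
        # In a real cluster the controller would wait here for the node to join.
        machines.remove(old)  # only then delete the old machine
        events.append(f"delete {old.name}")
    return events, peak


if __name__ == "__main__":
    cluster = [Machine(f"master-{i}", "m5.xlarge") for i in range(3)]
    log, peak = rolling_replace(cluster, "m5.2xlarge")
    print("\n".join(log))
    print("peak machine count:", peak)
```

Because each replacement is created before its predecessor is removed, the simulated cluster never holds more than one extra machine.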

If, at any point, a control-plane Machine is manually marked for deletion, the ControlPlaneMachineSet will attempt to maintain the cluster by creating a replacement for that Machine.

What happens to etcd when scaling my control plane?

Starting in OpenShift 4.11, the etcd operator leverages machine lifecycle hooks to implement a quorum protection mechanism when the Machine API is configured within the cluster.

The lifecycle hooks allow the etcd operator to control when the Machine API drains and removes pods on a control-plane machine. Using this hook, the etcd operator prevents removal of an etcd member until it has had an opportunity to migrate that member onto a new node within the cluster.

While performing a rolling update, the cluster will, for a short period, have 4 control-plane machines. When the 4th control-plane node joins the cluster, the etcd operator starts a new etcd member on the new node. Once it observes that the old control-plane machine has been marked for deletion, it stops the etcd member on the old node and promotes the new etcd member to join the quorum of the cluster.
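
The hand-over described above can be modelled with a toy state machine. This is only a sketch of the ordering, not the etcd operator's implementation: the class EtcdOperatorSketch, its method names and the node names are hypothetical, and a lifecycle hook is reduced to membership in a set of protected nodes.

```python
class EtcdOperatorSketch:
    """Toy model of quorum protection during a control-plane replacement."""

    def __init__(self, voters):
        self.voters = set(voters)   # members currently in the quorum
        self.pending = set()        # members started but not yet promoted
        self.hooks = set(voters)    # nodes whose drain is currently blocked

    def node_joined(self, node):
        """A new control-plane node joined: start a member and protect the node."""
        self.pending.add(node)
        self.hooks.add(node)

    def may_drain(self, node):
        """The Machine API may only drain a node once its hook is released."""
        return node not in self.hooks

    def machine_marked_for_deletion(self, node):
        """Swap the old member for a pending one, then release the hook."""
        if not self.pending:
            return False  # no replacement member yet, keep blocking removal
        replacement = sorted(self.pending)[0]
        self.pending.remove(replacement)
        self.voters.discard(node)       # stop the member on the old node
        self.voters.add(replacement)    # promote the new member
        self.hooks.discard(node)        # allow the Machine API to proceed
        return True


if __name__ == "__main__":
    op = EtcdOperatorSketch({"master-0", "master-1", "master-2"})
    op.node_joined("master-3")
    op.machine_marked_for_deletion("master-0")
    print(sorted(op.voters), op.may_drain("master-0"))
```

The point of the model is the ordering: the old node cannot be drained until a replacement member exists and has taken its place in the quorum.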

This mechanism allows the etcd operator precise control over the members within the quorum and allows the Machine API to safely create and remove control-plane machines without specific operational knowledge of the etcd cluster.

When is the ControlPlaneMachineSet available?

Now!

The ControlPlaneMachineSet will be configured and active on all freshly installed OpenShift clusters on the AWS platform, from 4.12 onwards.

Support for Azure and GCP is being targeted for a future release.

For clusters upgrading to 4.12 on AWS and Azure, an inactive ControlPlaneMachineSet will be created and maintained on the cluster by the operator. The inactive ControlPlaneMachineSet can then be activated by the user should they wish to enable the functionality of the ControlPlaneMachineSet on their cluster.
This functionality will be available for GCP in a future release.

Where can I learn more about the ControlPlaneMachineSet?

The ControlPlaneMachineSet operator project contains some documentation aimed at users of the project.

If you are interested in the background of the design of the new project, the original design proposal on which this project is based is available to read on GitHub.
This design proposal includes detailed descriptions of the various features of the ControlPlaneMachineSet as well as the motivation for decisions taken during the design process.