Since the inception of OpenShift 4, installer-provisioned-infrastructure clusters have utilised the Machine API to manage worker and infrastructure machines via MachineSet resources.
However, while OpenShift 4 clusters have had control-plane machines, these have been standalone and unmanaged.
Users have not had the ability to scale or modify their control-plane machines without manual work and checks to ensure the health of the control-plane.
As adoption of OpenShift grows within an organisation and its clusters grow in size, the additional worker machines and the workloads running on them put more pressure on the control-plane, so users look to vertically scale their control-plane instances to cope with the additional load.
Since scaling the control-plane requires management of the OpenShift cluster’s etcd cluster, the process of scaling a control-plane manually is very involved and requires specific knowledge and careful execution of steps in the correct order.
As managed services have grown, this process has not evolved and has become a significant bottleneck for the SREs managing these services. An automated solution is required.
ControlPlaneMachineSet is a new resource within the OpenShift Machine API ecosystem, introduced in 4.12. It manages the cluster’s control-plane machines and adds new automation on top of the existing Machine API concepts.
The OpenShift team often refer to MachineSet resources as being analogous to ReplicaSet resources. The MachineSet is to a Machine as the ReplicaSet is to a Pod. If we extend this analogy, a ControlPlaneMachineSet is similar to a StatefulSet.
Rather than managing an arbitrary number of identical Machine resources, like a MachineSet would, the ControlPlaneMachineSet manages a small number of identical Machines and adds special logic on top of them to provide functionality such as rolling-update replacement of the Machines and spreading the Machines across multiple failure domains.
With a ControlPlaneMachineSet installed and active within a cluster, users can now modify parameters of their control-plane specification and watch as the ControlPlaneMachineSet automatically and safely replaces the control-plane machines with new machines that match the updated spec.
ControlPlaneMachineSet can be used to perform rolling-update replacements of control-plane Machines within OpenShift. For example, if you need to move the control-plane Machines to a larger instance type, editing the provider specification in the ControlPlaneMachineSet spec triggers a complete rolling replacement of the control-plane Machines within the cluster. This allows you to make automated changes to the control-plane infrastructure in a safe and controlled manner.
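As a rough illustration of the instance-type change described above, the following Python sketch builds a JSON merge patch that touches only the instance type. The field path (`spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.instanceType`) reflects our understanding of the ControlPlaneMachineSet schema on AWS and should be checked against your own cluster; the helper name `instance_type_patch` and the instance type shown are purely illustrative.

```python
import json

# Assumption: this is the machine-type key used inside the
# ControlPlaneMachineSet template; verify it on your cluster.
MACHINE_TYPE_KEY = "machines_v1beta1_machine_openshift_io"


def instance_type_patch(instance_type: str) -> dict:
    """Build a JSON merge patch that changes only the AWS instance type.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    return {
        "spec": {
            "template": {
                MACHINE_TYPE_KEY: {
                    "spec": {
                        "providerSpec": {
                            "value": {"instanceType": instance_type}
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # "m6i.2xlarge" is an example value only.
    print(json.dumps(instance_type_patch("m6i.2xlarge"), indent=2))
```

Assuming the resource is named `cluster` in the `openshift-machine-api` namespace, the printed document could be passed to `oc patch controlplanemachineset/cluster -n openshift-machine-api --type merge -p '<patch>'`. Because a merge patch leaves every other field untouched, the instance type would be the only difference the ControlPlaneMachineSet sees.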
ControlPlaneMachineSet constantly monitors the control-plane Machines within the cluster.
It compares the desired specification (from within the resource spec) to the existing configuration of the control-plane Machines.
When it detects a difference, it iterates through the control-plane Machines and, one by one, replaces each with an up-to-date Machine. This is an example of the immutable infrastructure concept: rather than modifying a Machine in place, it creates a new Machine, waits for that Machine to join the cluster, and then marks the old Machine for deletion.
Once the old Machine is removed (i.e. there should be no more than one additional Machine in the cluster at any time), it moves on to the next control-plane Machine and repeats the process until all of the Machines have been updated.
If, at any point, a control-plane Machine is manually marked for deletion, the ControlPlaneMachineSet will attempt to maintain the cluster by creating a replacement for that Machine.
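The replacement loop described above can be pictured with a small simulation. This is a minimal sketch under simplifying assumptions, not the operator's actual implementation: a single string stands in for the whole provider spec, the "wait for the Machine to join" step is reduced to a comment, and the `Machine` class, `rolling_replace` function and `-new` name suffix are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    spec: str               # stand-in for the full provider spec
    deleting: bool = False  # True once the machine is marked for deletion


def rolling_replace(machines, desired_spec):
    """Replace outdated or deleting machines one at a time.

    Returns the final machine list and the peak machine count observed.
    """
    machines = list(machines)
    peak = len(machines)
    for old in list(machines):
        if old.spec == desired_spec and not old.deleting:
            continue  # already up to date, nothing to do
        # 1. Create the replacement first, so capacity never drops.
        new = Machine(name=f"{old.name}-new", spec=desired_spec)
        machines.append(new)
        peak = max(peak, len(machines))
        # 2. A real controller would wait here until the new machine
        #    has joined the cluster.
        # 3. Mark the old machine for deletion and wait for its removal
        #    before moving on to the next one.
        old.deleting = True
        machines.remove(old)
    return machines, peak


if __name__ == "__main__":
    current = [Machine(f"master-{i}", "m5.xlarge") for i in range(3)]
    updated, peak = rolling_replace(current, "m5.2xlarge")
    print([m.name for m in updated], "peak:", peak)
```

In this model, a machine that has been manually marked for deletion (`deleting=True`) takes the same path as an outdated one, and the machine count never exceeds the original count plus one.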
Starting in OpenShift 4.11, the etcd operator leverages machine lifecycle hooks to implement a quorum protection mechanism when the Machine API is configured within the cluster.
The lifecycle hooks allow the etcd operator to control when the Machine API drains and removes pods on a control-plane machine. Using this hook, the etcd operator prevents removal of an etcd member until it has had an opportunity to migrate that member onto a new node within the cluster.
While performing a rolling update, the cluster will, for a short period, have 4 control-plane machines. When the 4th control-plane node joins the cluster, the etcd operator starts a new etcd member on the new node. Once it observes that the old control-plane machine has been marked for deletion, it stops the etcd member on the old node and promotes the new etcd member to join the quorum of the cluster.
This mechanism allows the etcd operator precise control over the members within the quorum and allows the Machine API to safely create and remove control-plane machines without specific operational knowledge of the etcd cluster.
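To make the hand-off concrete, here is a toy model of the quorum protection described above. It is a sketch only: the class and method names are invented for illustration, and the ordering follows the description in this post rather than the etcd operator's source code.

```python
class EtcdGuard:
    """Toy model of a pre-drain hook owned by an etcd operator."""

    def __init__(self, nodes):
        self.members = set(nodes)  # nodes hosting a voting etcd member
        self.pending = set()       # new members started but not yet promoted
        self.hooks = set(nodes)    # machines still protected by a hook

    def node_joined(self, node):
        # Start a new etcd member on the new node and protect its machine.
        self.pending.add(node)
        self.hooks.add(node)

    def machine_marked_for_deletion(self, old):
        # Release the old machine only if a replacement member is ready.
        if not self.pending:
            return False
        new = self.pending.pop()
        self.members.discard(old)  # stop the member on the old node
        self.members.add(new)      # promote the new member into the quorum
        self.hooks.discard(old)    # let the Machine API proceed
        return True

    def can_drain(self, node):
        # The Machine API may drain a machine only once its hook is gone.
        return node not in self.hooks


if __name__ == "__main__":
    guard = EtcdGuard(["master-0", "master-1", "master-2"])
    print(guard.machine_marked_for_deletion("master-0"))  # no replacement yet
    guard.node_joined("master-3")
    print(guard.machine_marked_for_deletion("master-0"))
    print(guard.can_drain("master-0"), sorted(guard.members))
```

The point of the design is that the Machine API only ever asks whether it may drain a machine; the decision about quorum membership stays entirely with the component that understands etcd.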
ControlPlaneMachineSet will be configured and active on all freshly installed OpenShift 4.12 (and later) clusters on the AWS platform.
Support for Azure and GCP is being targeted for a future release.
For clusters upgrading to 4.12 on AWS and Azure, an inactive ControlPlaneMachineSet will be created and maintained on the cluster by the operator. The inactive ControlPlaneMachineSet can then be activated by the user should they wish to enable the functionality of the ControlPlaneMachineSet on their cluster.
This functionality will be available for GCP in a future release.
The ControlPlaneMachineSet operator project contains some documentation aimed at users of the project.
If you are interested in the background of the design of the new project, the original design proposal, on which the implementation is based, is available to read on GitHub.
This design proposal includes detailed descriptions of the various features of the ControlPlaneMachineSet as well as the motivation for decisions taken during the design process.