Background

Open Cluster Management (OCM) is a community-driven project that is focused on multicluster and multicloud scenarios for Kubernetes applications.

In a multicluster environment, administrators usually need to apply configuration to target clusters, and application developers may want to deploy a workload to specific clusters. The workload might be a Kubernetes Service, Deployment, ConfigMap, or a bundle of different Kubernetes objects. These users typically have requirements for the target clusters, such as the following examples:

  • I want to only configure the clusters on Amazon Web Services (AWS).
  • I want to only deploy this workload to clusters that have the label group=dev.
  • I want the workload always running on the 3 clusters with the most allocatable memory.

To select the target clusters, you can hardcode the cluster names in the deployment pipelines or use some form of label selector. For workloads that have resource requirements, you need a fine-grained scheduler that dispatches workloads to clusters with sufficient resources. The scheduling decision should also update dynamically when cluster attributes change.

In OCM, these scheduling features are provided by the placement component. In this blog, I explain how placement selects the desired clusters, what scheduling capabilities placement provides now, and some best practices for writing a placement that suits your requirements. Some advanced scheduling features, such as support for taints and tolerations and topological selection (spread), are under active discussion in the OCM community.

The placement features are also delivered as new Technology preview features in Red Hat Advanced Cluster Management version 2.4.

Note: The following links provide information about concepts to understand before continuing with this blog:

What is placement?

The Placement API is used to select a set of managed clusters in one or multiple ManagedClusterSets so that the workloads can be deployed to these clusters.

If you define a valid Placement, the placement controller generates a corresponding PlacementDecision with the selected clusters listed in the status. As an end user, you can parse the selected clusters and then operate on the target clusters. You can also integrate a high-level workload orchestrator with the placement decision to leverage its scheduling capabilities.
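
As a minimal sketch of that parsing step, the following Python reads a PlacementDecision that has already been exported as JSON and collects the cluster names from status.decisions. The helper name selected_clusters and the trimmed sample document are illustrative assumptions, not part of OCM.

```python
import json
from typing import List

# Sample PlacementDecision, trimmed to the fields this sketch reads.
SAMPLE = """
{
  "apiVersion": "cluster.open-cluster-management.io/v1alpha1",
  "kind": "PlacementDecision",
  "metadata": {"name": "local-cluster-decision-1"},
  "status": {
    "decisions": [
      {"clusterName": "cluster1", "reason": ""},
      {"clusterName": "cluster2", "reason": ""}
    ]
  }
}
"""


def selected_clusters(decision: dict) -> List[str]:
    """Return the cluster names listed in status.decisions (hypothetical helper)."""
    decisions = decision.get("status", {}).get("decisions", [])
    return [entry["clusterName"] for entry in decisions]


if __name__ == "__main__":
    print(selected_clusters(json.loads(SAMPLE)))
```

A decision without any status yields an empty list, so a caller can treat "nothing scheduled yet" the same way as "no clusters selected".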

Argo, for example, has an integration with Placement. You can specify a ConfigMap, which is associated with PlacementDecision, in the clusterDecisionResource generator of an ApplicationSet, so that Argo can use the scheduling decision of the Placement to automatically assign the application to a set of target clusters:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: book-import
spec:
  generators:
    - clusterDecisionResource:
        configMapRef: ocm-placement
        labelSelector:
          matchLabels:
            cluster.open-cluster-management.io/placement: local-cluster
        requeueAfterSeconds: 30
  template:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ocm-placement
data:
  apiVersion: cluster.open-cluster-management.io/v1alpha1
  kind: placementdecisions
  statusListKey: decisions
  matchKey: clusterName
---
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: PlacementDecision
metadata:
  labels:
    cluster.open-cluster-management.io/placement: local-cluster
  name: local-cluster-decision-1
status:
  decisions:
    - clusterName: cluster1
      reason: ""
    - clusterName: cluster2
      reason: ""

KubeVela, as an implementation of the Open Application Model, can also use the Placement API for workload scheduling.

How does placement select clusters?

Let’s take a deeper look at the Placement API to see how it selects the desired cluster and what scheduling ability it can provide.

The following content is an example of a Placement:

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: Placement
metadata:
  name: placement
  namespace: ns1
spec:
  numberOfClusters: 4
  clusterSets:
    - clusterset1
    - clusterset2
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            vendor: OpenShift
  prioritizerPolicy:
    mode: Exact
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory
      - scoreCoordinate:
          builtIn: Steady
        weight: 3
      - scoreCoordinate:
          type: AddOn
          addOn:
            resourceName: default
            scoreName: cpuratio

The spec contains the following four optional sections:

  • numberOfClusters represents the desired number of ManagedClusters to be selected, which meet the placement requirements.
  • clusterSets represents the ManagedClusterSets from which the ManagedClusters are selected.
  • predicates represents a list of predicates that select ManagedClusters by label and claim selectors. The predicates are ORed.
  • prioritizerPolicy represents the policy of prioritizers. The mode value sets whether or not to use the default prioritizers. Specific prioritizers can be configured in configurations. Currently, the built-in prioritizers include Balance, Steady, ResourceAllocatableCPU, and ResourceAllocatableMemory. Placement also supports selecting clusters based on scores provided by third parties, defined in addOn. The weight is an integer from -10 to 10 that adjusts the effect of each prioritizer score on the total score.

A default value is used if the values in a section are not defined. The details of the values in each field are defined in PlacementSpec.

If the spec is empty, all ManagedClusters from the ManagedClusterSets bound to the placement namespace are selected as possible choices.

The definition of each section plays a role in the scheduling. A typical scheduling process is shown in the following example:

  1. The scheduling framework identifies available ManagedClusters from the ManagedClusterSets that are defined in the clusterSets.

  2. The scheduling filter plugin selects ManagedClusters by using the label and claim selector that are defined in predicates.

  3. The scheduling prioritizer plugins that are enabled in prioritizerPolicy assign a score to each filtered ManagedCluster and prioritize them by the total score from high to low.

  4. Select the top k clusters and list them in PlacementDecision. The value of k is the number of clusters that are defined in numberOfClusters.
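
The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not the placement controller's implementation: the in-memory cluster records, the schedule function, and the precomputed score field are all hypothetical, and claim selectors are left out for brevity.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical in-memory view of managed clusters; field names are illustrative.
CLUSTERS = [
    {"name": "cluster1", "set": "clusterset1", "labels": {"vendor": "OpenShift"}, "score": 50},
    {"name": "cluster2", "set": "clusterset2", "labels": {"vendor": "OpenShift"}, "score": 80},
    {"name": "cluster3", "set": "clusterset1", "labels": {"vendor": "Other"}, "score": 90},
    {"name": "cluster4", "set": "clusterset3", "labels": {"vendor": "OpenShift"}, "score": 100},
    {"name": "cluster5", "set": "clusterset2", "labels": {"vendor": "OpenShift"}, "score": 20},
]


def schedule(
    clusters: List[dict],
    cluster_sets: Set[str],
    predicates: List[Dict[str, str]],
    score_fn: Callable[[dict], float],
    number_of_clusters: Optional[int],
) -> List[str]:
    # Step 1: keep clusters that belong to one of the listed cluster sets.
    candidates = [c for c in clusters if c["set"] in cluster_sets]
    # Step 2: predicates are ORed; inside one predicate every label must match.
    if predicates:
        candidates = [
            c for c in candidates
            if any(all(c["labels"].get(k) == v for k, v in p.items()) for p in predicates)
        ]
    # Step 3: rank the remaining clusters by total score, high to low.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    # Step 4: return the top k names; None means "no limit" in this sketch.
    return [c["name"] for c in ranked[:number_of_clusters]]


if __name__ == "__main__":
    print(schedule(CLUSTERS, {"clusterset1", "clusterset2"},
                   [{"vendor": "OpenShift"}], lambda c: c["score"], 2))
```

Note that cluster4 has the highest score but is never considered, because its cluster set is filtered out in the first step before any scoring happens.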

If applied to the previous example, the scheduling process would be similar to the following example:

  1. The scheduling framework identifies the ManagedClusters in clusterset1 and clusterset2 as the available ManagedClusters.

  2. The scheduling filter plugin filters the clusters with label vendor=OpenShift.

  3. The scheduling prioritizer plugins named ResourceAllocatableMemory and Steady assign a score to each filtered ManagedCluster. When addOn is defined, the placement tries to get the cluster score cpuratio from third-party resources. The total score of a cluster is calculated by the following formula:

    1 (the default weight of the ResourceAllocatableMemory prioritizer because no weight is specified) * prioritizer_ResourceAllocatableMemory_score + 3 (the weight value specified for the Steady prioritizer) * prioritizer_Steady_score + 1 (the default weight of the addOn) * cpuratio (the addOn score cpuratio)

  4. The framework prioritizes them by the total score from high to low, and returns the four clusters with the highest scores.

The score and prioritize step actually combines multiple prioritizers. The algorithm and the weight of each prioritizer both affect the final decision. In the next section, let’s take a deeper look at prioritizers, so that you can better understand how the placement selects the clusters.

How do placement prioritizers work?

At the time this blog was written, there were four available built-in prioritizers: Balance, Steady, ResourceAllocatableCPU, and ResourceAllocatableMemory.

  • ResourceAllocatableCPU and ResourceAllocatableMemory: Make scheduling decisions based on the allocatable CPU or memory of managed clusters. The clusters that have the most allocatable resources are given the highest score (100), while the clusters with the least allocatable resources are given the lowest score (-100).

The addOn prioritizer also supports selecting clusters based on customized scores. This is enabled by a new API, AddOnPlacementScore, which supports a more extensible way to schedule:

  • As a user, you can specify the score in the placement YAML content to select clusters.
  • As a score provider, a third-party controller can run on either the hub or a managed cluster to maintain the lifecycle of AddOnPlacementScore and update the score into it.

See enhancements to learn more.

When making cluster decisions, managed clusters are sorted by their final score, which is the weighted sum of the scores from all prioritizers: final_score = sum(prioritizer_x_weight * prioritizer_x_score), where prioritizer_x_weight is the weight of prioritizer x and prioritizer_x_score is the score returned by prioritizer x for a managed cluster.
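
The weighted sum can be written out directly. The sketch below uses the weights from the earlier Placement example (1 by default for ResourceAllocatableMemory and the addOn score, 3 for Steady); the per-prioritizer scores for the cluster are made-up numbers, and final_score is a hypothetical helper rather than an OCM function.

```python
from typing import Dict


def final_score(weights: Dict[str, int], scores: Dict[str, float]) -> float:
    """final_score = sum(prioritizer_x_weight * prioritizer_x_score)."""
    for name, weight in weights.items():
        # The weight is documented as an integer from -10 to 10.
        if not -10 <= weight <= 10:
            raise ValueError(f"weight of {name} must be between -10 and 10")
    return sum(weights[name] * scores[name] for name in weights)


# Weights from the earlier Placement example.
weights = {"ResourceAllocatableMemory": 1, "Steady": 3, "cpuratio": 1}
# Hypothetical scores that the prioritizers returned for one cluster.
scores = {"ResourceAllocatableMemory": 100, "Steady": 0, "cpuratio": 40}

if __name__ == "__main__":
    print(final_score(weights, scores))  # 1*100 + 3*0 + 1*40 = 140
```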

You can adjust the weights of prioritizers to impact the final score, for example:

  • Enable the resource prioritizers to schedule the placement based on allocatable resources.
  • Make the placement sensitive to resource usage by setting a higher weight for the resource prioritizers.
  • Ignore resource usage changes and pin the placement decisions by increasing the weight of the Steady prioritizer.

Here are some practical examples to illustrate how multiple prioritizers work together to make the final placement decision. These examples can also be treated as some best practices for the specific use cases.

Assumptions:

  • There are three managed clusters bound to the example namespace ns1:
    • cluster1 has 60 MB of allocatable memory
    • cluster2 has 80 MB of allocatable memory
    • cluster3 has 100 MB of allocatable memory

Case 1: Selecting clusters with the largest allocatable memory

In this example, you want to select clusters with the largest allocatable memory. To prioritize clusters by allocatable memory, you can configure ResourceAllocatableMemory in prioritizerPolicy to enable it.

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: Placement
metadata:
  name: demo
  namespace: ns1
spec:
  numberOfClusters: 2
  prioritizerPolicy:
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory

When this placement is created, you can describe the Placement and check the events to understand how clusters are selected by the prioritizers.

# oc describe placement demo -n ns1
Name: demo
Namespace: ns1
Labels: <none>
Annotations: <none>
API Version: cluster.open-cluster-management.io/v1alpha1
Kind: Placement

Status:
Conditions:
Last Transition Time: 2021-11-09T07:02:14Z
Message: All cluster decisions scheduled
Reason: AllDecisionsScheduled
Status: True
Type: PlacementSatisfied
Number Of Selected Clusters: 2
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 10s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 10s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 10s placementController cluster1:0 cluster2:100 cluster3:200

In this example, the Balance and Steady prioritizers are enabled by default with weight value of 1 in Additive mode. ResourceAllocatableMemory is also enabled to make the final decision. The score of a cluster is determined by the following formula:

1 * prioritizer_balance_score + 1 * prioritizer_steady_score + 1 * prioritizer_resourceallocatablememory_score

From the event, the total score of cluster1 is 0, cluster2 is 100 and cluster3 is 200. In this case, cluster2 and cluster3 should be selected.
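
The totals in the event can be reproduced with a small sketch. It rests on assumptions that are not stated in this blog but are consistent with the numbers shown: the memory prioritizer maps allocatable memory linearly onto [-100, 100] and truncates to an integer, Balance gives every cluster 100 because no other placement has decisions on them, and Steady gives every cluster 0 because no decision exists yet.

```python
from typing import Dict


def memory_scores(allocatable: Dict[str, int]) -> Dict[str, int]:
    """Map allocatable memory linearly onto [-100, 100] (assumed formula)."""
    lo, hi = min(allocatable.values()), max(allocatable.values())
    if hi == lo:
        # All clusters are equal; returning 0 is an arbitrary choice for this sketch.
        return {name: 0 for name in allocatable}
    return {name: int((mem - lo) * 200 / (hi - lo) - 100) for name, mem in allocatable.items()}


allocatable_mb = {"cluster1": 60, "cluster2": 80, "cluster3": 100}
mem = memory_scores(allocatable_mb)

# 1 * balance (assumed 100) + 1 * steady (assumed 0) + 1 * memory score
totals = {name: 1 * 100 + 1 * 0 + 1 * mem[name] for name in allocatable_mb}

if __name__ == "__main__":
    print(mem)     # {'cluster1': -100, 'cluster2': 0, 'cluster3': 100}
    print(totals)  # {'cluster1': 0, 'cluster2': 100, 'cluster3': 200}
```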

Describe the PlacementDecision to verify the guess.

# oc describe placementdecision demo-decision-1 -n ns1
Name: demo-decision-1
Namespace: ns1
Labels: cluster.open-cluster-management.io/placement=placement-jkd42
Annotations: <none>
API Version: cluster.open-cluster-management.io/v1alpha1
Kind: PlacementDecision
...
Status:
Decisions:
Cluster Name: cluster2
Reason:
Cluster Name: cluster3
Reason:
Events: <none>

In the PlacementDecision status, cluster2 and cluster3 are listed in the decisions.

Let's try to add a new cluster with allocatable memory a little higher than the selected clusters.

The placement controller watches the managed clusters. When there is a resource change, it starts a reschedule. Now, let's add a new cluster, cluster4 with the allocatable memory of 100 MB, and check the placement event.

# oc describe placement demo -n ns1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 100s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 100s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 100s placementController cluster1:0 cluster2:100 cluster3:200

There's no event update and the placement decision is not updated. So adding a new cluster with 100 MB of allocatable memory, which is a little higher than the 80 MB of allocatable memory on cluster2, has no impact on the placement decision.

Let's try to add a new cluster with allocatable memory much higher than the selected clusters.

Now let's try to add a new cluster cluster4 with allocatable memory 150 MB, and check the placement event again.

# oc describe placement demo -n ns1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 2m10s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 2m10s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 2m10s placementController cluster1:0 cluster2:100 cluster3:200
Normal DecisionUpdate 3s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 3s placementController cluster1:0 cluster2:145 cluster3:189 cluster4:200

This time, the decision is updated with the change and the placement is rescheduled to cluster3 and cluster4.

# oc describe placementdecision demo-decision-1 -n ns1
...
Status:
Decisions:
Cluster Name: cluster3
Reason:
Cluster Name: cluster4
Reason:

In the previous example, a small resource change causes no update in the PlacementDecision, while a large resource change is reflected in the PlacementDecision immediately. This raises two questions:

  • How can I make my PlacementDecision sensitive to resource changes?
  • How do I make my PlacementDecision steady even if the cluster resource changes a lot?

Remember that prioritizerPolicy offers four prioritizers and lets you adjust their weights. Let's solve these two problems by changing the prioritizerPolicy.

Case 2: Selecting clusters with the largest allocatable memory and making placement sensitive to resource changes

To make decisions sensitive to resource changes, this time we explicitly set prioritizer ResourceAllocatableMemory with the weight value of 3.

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: Placement
metadata:
  name: demo
  namespace: ns1
spec:
  numberOfClusters: 2
  prioritizerPolicy:
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory
        weight: 3

When this placement is created, let's describe the Placement and check the PlacementDecision.

# oc describe placement demo -n ns1
...
Status:
Conditions:
Last Transition Time: 2021-11-09T08:58:40Z
Message: All cluster decisions scheduled
Reason: AllDecisionsScheduled
Status: True
Type: PlacementSatisfied
Number Of Selected Clusters: 2
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 35s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 35s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 35s placementController cluster1:-200 cluster2:100 cluster3:400
# oc describe placementdecision demo-decision-1 -n ns1
...
Status:
Decisions:
Cluster Name: cluster2
Reason:
Cluster Name: cluster3
Reason:

The initial placement decision is cluster2 and cluster3.

Now, let's add a new cluster cluster4 with allocatable memory of 100 MB again, and check the placement event.

# oc describe placement demo -n ns1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 3m1s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 3m1s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 3m1s placementController cluster1:-200 cluster2:100 cluster3:400
Normal DecisionUpdate 2s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 2s placementController cluster1:-200 cluster2:200 cluster3:500 cluster4:400

This time, the PlacementDecision is updated. The placement is rescheduled to cluster3 and cluster4.

# oc describe placementdecision demo-decision-1 -n ns1
...
Status:
Decisions:
Cluster Name: cluster3
Reason:
Cluster Name: cluster4
Reason:

Case 3: Selecting clusters with the largest allocatable memory and pinning the placement decisions

To make decisions steady, this time we explicitly set prioritizer Steady with a weight value of 3.

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: Placement
metadata:
  name: demo
  namespace: ns1
spec:
  numberOfClusters: 2
  prioritizerPolicy:
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory
      - scoreCoordinate:
          builtIn: Steady
        weight: 3

When this placement is created, let's describe the Placement and check the PlacementDecision.

# oc describe placement demo -n ns1
...
Status:
Conditions:
Last Transition Time: 2021-11-09T09:05:36Z
Message: All cluster decisions scheduled
Reason: AllDecisionsScheduled
Status: True
Type: PlacementSatisfied
Number Of Selected Clusters: 2
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 15s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 15s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 15s placementController cluster1:0 cluster2:100 cluster3:200
# oc describe placementdecision demo-decision-1 -n ns1
...
Status:
Decisions:
Cluster Name: cluster2
Reason:
Cluster Name: cluster3
Reason:

The initial placement decision is cluster2 and cluster3.

Now, let's add a new cluster with the allocatable memory of 150 MB again, and check the placement event. This time there's no event update, which means there are no changes in the PlacementDecision.

# oc describe placement demo -n ns1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DecisionCreate 80s placementController Decision demo-decision-1 is created with placement demo in namespace ns1
Normal DecisionUpdate 80s placementController Decision demo-decision-1 is updated with placement demo in namespace ns1
Normal ScoreUpdate 80s placementController cluster1:0 cluster2:100 cluster3:200

Double check the PlacementDecision. The decision is unchanged and pinned to cluster2 and cluster3.

# oc describe placementdecision demo-decision-1 -n ns1
...
Status:
Decisions:
Cluster Name: cluster2
Reason:
Cluster Name: cluster3
Reason:

In the previous three examples, we showed how multiple prioritizers work together and how to influence the final decision by adjusting the weight of each prioritizer. You can try adjusting the weight or changing the enabled prioritizers for your own needs.
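
To experiment with weights without a cluster at hand, the three cases can be approximated with the sketch below. It is a simplified model under stated assumptions, not the placement controller's code: the memory score is a linear mapping onto [-100, 100] truncated to an integer, Balance always contributes 100 with weight 1, Steady contributes 100 for clusters in the existing decision and 0 otherwise, and ties are broken by cluster name. The reschedule function and its parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def memory_scores(allocatable: Dict[str, int]) -> Dict[str, int]:
    """Map allocatable memory linearly onto [-100, 100] (assumed formula)."""
    lo, hi = min(allocatable.values()), max(allocatable.values())
    if hi == lo:
        return {name: 0 for name in allocatable}  # arbitrary choice for this sketch
    return {n: int((m - lo) * 200 / (hi - lo) - 100) for n, m in allocatable.items()}


def reschedule(
    allocatable: Dict[str, int],
    selected: Set[str],
    memory_weight: int,
    steady_weight: int,
    k: int = 2,
) -> Tuple[Dict[str, int], List[str]]:
    """Return the total score per cluster and the top-k cluster names."""
    mem = memory_scores(allocatable)
    totals = {}
    for name in allocatable:
        balance = 100                            # assumed: no other placement uses these clusters
        steady = 100 if name in selected else 0  # assumed: favors the existing decision
        totals[name] = balance + steady_weight * steady + memory_weight * mem[name]
    ranked = sorted(totals, key=lambda n: (-totals[n], n))  # high to low, then by name
    return totals, sorted(ranked[:k])


current = {"cluster2", "cluster3"}
base = {"cluster1": 60, "cluster2": 80, "cluster3": 100}

if __name__ == "__main__":
    # Case 1: default weights, new cluster4 with 150 MB.
    print(reschedule({**base, "cluster4": 150}, current, memory_weight=1, steady_weight=1))
    # Case 2: memory weight 3, new cluster4 with 100 MB.
    print(reschedule({**base, "cluster4": 100}, current, memory_weight=3, steady_weight=1))
    # Case 3: Steady weight 3, new cluster4 with 150 MB.
    print(reschedule({**base, "cluster4": 150}, current, memory_weight=1, steady_weight=3))
```

Under these assumptions, the first two calls move the decision to cluster3 and cluster4, while the third keeps it on cluster2 and cluster3, which matches the behavior described in the three cases.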

Summary

This blog showed how you can use the Placement API in different situations. We explained what placement is and how it works with popular open source projects. We examined how placement selects clusters and how multiple placement prioritizers work together to make the decision, using some real examples that also serve as best-practice placements for specific user requirements. Feel free to raise your questions in the Open-cluster-management-io GitHub community or contact us using Slack.
