Working with bare-metal clusters? Looking for a LoadBalancer? MetalLB is what you need. MetalLB offers a network load-balancer implementation that integrates with standard network equipment and allows you to access applications through an external IP address.

In short, it allows you to implement Kubernetes services of type LoadBalancer on bare-metal clusters. It has two features that work together to provide this service: address allocation and external announcement.

Address allocation

You give MetalLB pools of IP addresses that it can use and it will take care of assigning and unassigning individual addresses as services come and go, but it will only ever hand out IPs that are part of its configured pools.
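For illustration, the addresses in a pool can be expressed either as CIDR blocks or as explicit start-end ranges (the values below are hypothetical):

```yaml
# Hypothetical addresses list for a pool: MetalLB accepts both
# CIDR notation and explicit start-end ranges.
addresses:
- 192.168.10.0/24
- 172.18.0.100-172.18.0.255
```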

External announcement

After MetalLB has assigned an external IP address to a service, it needs to make the network beyond the cluster aware that the IP “lives” in the cluster. MetalLB uses standard routing protocols to achieve this: ARP, NDP, or BGP. OpenShift 4.9 introduced General Availability support for ARP and NDP. Starting in OpenShift 4.10, BGP is also a supported mode and is the focus of this blog post.

BGP

With MetalLB BGP mode, you can establish a BGP session between your network routers and your cluster nodes to advertise the IPs of external cluster services. MetalLB has historically offered an in-tree backend that implements parts of the BGP protocol. This backend is commonly referred to as the “native” BGP implementation and is, to date, the default BGP backend upstream. However, maintaining a BGP protocol stack is challenging and costly, and it slows down the delivery of new operator-grade features.

With this in mind, Red Hat, in collaboration with the upstream MetalLB community, contributed a second BGP implementation based on FRRouting (FRR). Using FRR as the backend for the BGP layer enables BGP sessions with BFD support as well as IPv6 support for both BGP and BFD. The FRR backend is the default and the only supported BGP implementation in OpenShift 4.10. The “native” backend is planned to be sunset in the near future, receiving only bug fixes until then.


Figure 1: Example of spine-leaf network topology with BGP sessions between network nodes

MetalLB Operator

To make your life easier when deploying MetalLB in your cluster, we developed the MetalLB Operator. The MetalLB Operator implements the operator pattern for deploying MetalLB and managing its load balancing resources and offers a solid alternative to manifests and Helm charts.

Another great value-add the MetalLB Operator offers is the ability to configure MetalLB via the Kubernetes Custom Resources (CRs) users are accustomed to, instead of the long and complex ConfigMap customary in conventional MetalLB environments. A derived benefit of MetalLB CRs is validation admission control of user-requested configurations – for example, validating IPv6 CIDRs and returning an error to the user when they are malformed or conflicting.

Next, we present the available MetalLB CRs, how to operate MetalLB via the MetalLB Operator, and an example of a distributed HTTP web server behind a scalable, highly available LoadBalancer-type Service on a bare-metal Red Hat OpenShift Container Platform cluster.

More information about MetalLB and how to install it via the MetalLB Operator can be found in the official OpenShift 4.10 documentation.

Once you have installed MetalLB and verified that it is running on the cluster, we can configure it to handle external IP assignments for services and advertise them to the world. Starting with OpenShift Container Platform 4.10, MetalLB includes support for BGP advertisement mode in addition to the L2 advertisement mode (ARP/NDP) available since OpenShift Container Platform 4.9. For the purpose of demonstrating the latest features, we will pick BGP mode.
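As a quick sanity check before continuing, you can confirm that the MetalLB pods are up (the exact pod names vary per deployment, so the output below is only described, not reproduced):

```shell
$ oc get pods -n metallb-system
# Expect a controller pod and one speaker pod per node, all in Running state.
```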

MetalLB Operator custom resources

The MetalLB Operator provides four custom resources.

MetalLB

A MetalLB custom resource is the first resource that needs to be created. Once it is created, the MetalLB Operator deploys all the required MetalLB components, including the controller and speaker pods, on the control-plane and compute cluster nodes.
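A minimal sketch of that resource, following the same pattern as the other examples in this post (the resource name below is an assumption; check the documentation for the exact schema):

```shell
$ cat << EOF | oc apply -f -
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
EOF
```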

AddressPool

MetalLB requires one or more pools of IP addresses that it can assign to a service when you add a service of type LoadBalancer. An address pool includes a list of IP addresses and the protocol (L2 or BGP) of your choice to make the services reachable. When you add an AddressPool custom resource to the cluster, the MetalLB Operator configures MetalLB so that it can assign IP addresses from the pool.

To create an address pool, an AddressPool resource needs to be created. An example of an AddressPool resource is shown below:

$ cat << EOF | oc apply -f -
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: addresspool-sample1
  namespace: metallb-system
spec:
  protocol: bgp
  addresses:
  - 172.18.0.100-172.18.0.255
EOF

When the address pool is successfully added, the Operator seamlessly amends it into the ConfigMap used to configure MetalLB:

$ oc get configmap -n metallb-system config -o yaml
kind: ConfigMap
apiVersion: v1
data:
  config: |
    address-pools:
    - name: addresspool-sample1
      protocol: bgp
      addresses:
      - 172.18.0.100-172.18.0.255

BGPPeer

The BGP peer custom resource identifies the BGP router for MetalLB to communicate with, the Autonomous System Number (ASN) of the router, the ASN for MetalLB, and customizations for route advertisement. MetalLB advertises the routes for service LoadBalancer IP addresses to one or more BGP peers. The service LoadBalancer IP addresses are specified with AddressPool custom resources that set the protocol field to bgp. Examples of BGP peers are Top-of-Rack (ToR), Provider Edge (PE) and Datacenter Gateway (DC-GW) routers.

$ cat << EOF | oc apply -f -
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  namespace: metallb-system
  name: bgppeer-sample1
spec:
  peerAddress: 10.0.0.1
  peerASN: 64501
  myASN: 64500
  routerID: 10.10.10.10
  bfdProfile: bfdprofile-sample-1
EOF

BFDProfile

The BFD profile custom resource configures Bidirectional Forwarding Detection (BFD) for a BGP peer. BFD provides faster path failure detection than BGP alone provides.

$ cat << EOF | oc apply -f -
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: bfdprofile-sample-1
  namespace: metallb-system
spec:
  receiveInterval: 300
  transmitInterval: 300
  detectMultiplier: 3
  minimumTtl: 254
EOF
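To build intuition for these values: the intervals are in milliseconds, and per RFC 5880 the worst-case failure detection time is roughly the detect multiplier times the negotiated packet interval. A small sketch of that arithmetic (the helper function is ours for illustration, not part of MetalLB):

```python
# Rough BFD detection-time arithmetic (RFC 5880): a session is declared
# down after detect_multiplier consecutive packets are missed, where the
# packet interval is the larger of our transmit interval and the peer's
# required minimum receive interval.
def bfd_detection_time_ms(transmit_interval_ms: int,
                          peer_receive_interval_ms: int,
                          detect_multiplier: int) -> int:
    negotiated = max(transmit_interval_ms, peer_receive_interval_ms)
    return detect_multiplier * negotiated

# With the profile above (300 ms intervals, multiplier 3), a path
# failure is detected in about 900 ms, versus the tens of seconds
# typical of plain BGP hold timers.
print(bfd_detection_time_ms(300, 300, 3))  # 900
```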

Deploying a load-balanced HTTP web server

We deploy a LoadBalancer-type Service that fronts an HTTP web server deployment with two replicas running on Compute 1 and Compute 2. The Service's external IP is 172.18.0.100 and is advertised via BGP from FRR-backed MetalLB to its BGP peering router (Router). It is important to note that all nodes running a MetalLB speaker advertise the same external IP. Consequently, the Router adds an Equal-Cost Multi-Path (ECMP) route via the possible paths to the Service.

We also enable Bidirectional Forwarding Detection (BFD) between the compute nodes and the Router. BFD is a network protocol used to quickly detect faults between two network nodes. BGP has a built-in mechanism to detect connection failures, but its shortest detection window is still on the order of seconds, whereas BFD can achieve milliseconds.


Figure 2: MetalLB advertising Service external IPs via BGP to a fabric router

Create a LoadBalancer Service

Create a Service of type LoadBalancer:

$ cat << EOF | oc apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
EOF

Observe the external IP address allocated to the Service by MetalLB:

$ oc get svc nginx
NAME    TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
nginx   LoadBalancer   10.106.88.254   172.18.0.100   80:31641/TCP   53m

Validating peering between OpenShift cluster and Router

Now that MetalLB is configured to peer with the Router and to advertise the Service external IP (172.18.0.100), we can see the BGP and BFD states on the Router side:

$ vtysh -c "show bgp ipv4"
BGP table version is 1, local router ID is 10.0.0.1, vrf id 0
Default local pref 100, local AS 64513
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*= 172.18.0.100/32  10.0.0.2                 0             0 64512 i
*=                  10.0.0.4                 0             0 64512 i
*>                  10.0.0.3                 0             0 64512 i

Displayed 1 routes and 3 total paths

MetalLB honors the .spec.externalTrafficPolicy field in the Service resource. This field can take two values: Cluster (route external traffic to cluster-wide endpoints; default) and Local (route external traffic to node-local endpoints). In this particular case, our Service defaulted to Cluster so every node in the cluster attracts traffic for the service's external IP. On each node, the traffic is subjected to a second layer of load balancing (provided by the default CNI network provider, e.g. OpenShift SDN or OVN-Kubernetes), which directs the traffic to individual pods.
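If you prefer node-local endpoints (for example, to preserve the client source IP), the policy can be switched with a patch along these lines (a sketch, reusing the nginx Service from the example above):

```shell
$ oc patch svc nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```

With Local, only nodes hosting a ready endpoint attract traffic for the external IP, so the router's ECMP set shrinks accordingly.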

$ vtysh -c "show bfd peers"
BFD Peers:
peer 10.0.0.3 vrf default interface eth0
ID: 627585521
Remote ID: 1008119853
Active mode
Status: up
Uptime: 4 minute(s), 59 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: dynamic
[...]

peer 10.0.0.2 vrf default interface eth0
ID: 1404540001
Remote ID: 1534501536
Active mode
Status: up
Uptime: 5 minute(s), 0 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: dynamic
[...]

peer 10.0.0.4 vrf default interface eth0
ID: 2578290078
Remote ID: 380566073
Active mode
Status: up
Uptime: 5 minute(s), 0 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: dynamic
[...]

As can be seen from the command outputs above, there is one BGP and one BFD peering session per node.

Validating application availability and fault-tolerance

To validate that our HTTP web server is routable and reachable, check what the next-hop node is and send an HTTP request from the Router:

$ ip route
172.18.0.100 nhid 26 proto bgp metric 20
	nexthop via 10.0.0.4 dev eth0 weight 1
	nexthop via 10.0.0.2 dev eth0 weight 1
	nexthop via 10.0.0.3 dev eth0 weight 1
[...]
$ ip route get 172.18.0.100
172.18.0.100 via 10.0.0.4 dev eth0 src 10.0.0.1 uid 0
$ curl --head 172.18.0.100 
HTTP/1.1 200 OK
Server: nginx/1.21.6
[..]

Our service is operational and reachable via Compute 2 (10.0.0.4) as the next-hop node.

Now, let’s cause a network link failure so that Compute 2 is no longer a valid next-hop and see how quickly BGP and BFD detect and react to a topology change. We can simulate a failure by setting network interface eth0 down on Compute 2:

$ sudo ip link set eth0 down


Figure 3: Example of a network topology change; application is still accessible

Thanks to the fast link-failure detection provided by BFD, Compute 2's next-hop route was promptly removed from the Router's kernel routing table, yet there are still two possible next-hops to the Service:

$ ip route
[...]
172.18.0.100 nhid 26 proto bgp metric 20
	nexthop via 10.0.0.2 dev eth0 weight 1
	nexthop via 10.0.0.3 dev eth0 weight 1
$ ip route get 172.18.0.100
172.18.0.100 via 10.0.0.3 dev eth0 src 10.0.0.1 uid 0
$ curl --head 172.18.0.100 
HTTP/1.1 200 OK
Server: nginx/1.21.6
[...]

Learn more about MetalLB

MetalLB is an open-source, operator-grade implementation of LoadBalancer-type Kubernetes Services using standard routing protocols. It is a feature-rich system deployed in large-scale clouds, actively developed by its upstream community, and supported by Red Hat in OpenShift Container Platform.

MetalLB offers an extensive and powerful API far beyond the examples presented in this blog post. We encourage you to read more at About MetalLB and the MetalLB Operator.

Do you have a feature request? Please contact your Red Hat account representative, file a Request for Enhancement (RFE) at Red Hat Jira or upstream.

