Introduction

With the release of OpenShift 4.10, the Border Gateway Protocol (BGP) mode for the MetalLB operator became generally available.

BGP mode offers a novel way to statelessly load balance client traffic towards the applications running on bare metal and bare-metal-like OpenShift clusters by using standard routing facilities, while at the same time providing high availability for the services.

The purpose of this article is to introduce MetalLB's design and goals and to cover this new mode in more detail. Throughout the article, I will discuss its configuration and usage and provide ways to verify that it is working properly.

Brief on MetalLB, BGP and BFD

A few concepts need to be introduced before walking through the environment and the configuration procedure in the sections that follow.

First and foremost, I need to address what MetalLB is.

MetalLB is an open source project that makes it possible to create Kubernetes services of type LoadBalancer on top of a bare-metal OpenShift/Kubernetes cluster, providing the same user experience one would get on a public cloud provider such as AWS, GCP, or Azure.

A service of type LoadBalancer in Kubernetes automatically asks the cloud provider to provision a load balancer that directs traffic to the right cluster node and port tuples, and assigns an externally accessible IP address to the service. Information such as the assigned IP address is then updated in the status section of the new service resource. In contrast, in an on-premises environment, Kubernetes creates a ClusterIP service, assigns a node port, and waits for the administrator to configure the load balancer side, which is not very convenient.
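Once a cloud (or MetalLB) controller has allocated an address, it can be read back directly from the service status; a quick sketch, where my-app is a hypothetical service name:

$ oc get svc my-app -o jsonpath='{.status.loadBalancer.ingress[0].ip}'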

To achieve behavior similar to that of public cloud providers, MetalLB handles two main tasks: address allocation, which means managing the address pools that new LoadBalancer services will draw from, and the external announcement of these services' IP addresses, so that entities outside the cluster can learn how to reach them.

In regard to address allocation, MetalLB has to be told which IP addresses or IP address ranges it can assign when a new service is created, and this is accomplished through the AddressPool custom resource.

As to external announcement of the IP addresses, MetalLB offers two alternatives:

1. Layer 2 mode. This mode has been generally available since the OpenShift 4.9 GA release and relies on the ARP protocol for IPv4 and the NDP protocol for IPv6 to advertise which cluster node owns the service IP address. Even though this method is acceptable for many use cases, it does have some drawbacks. First, the node owning the IP becomes a bottleneck and limits performance; second, failover when that node goes down may be slow; and third, the address allocation space has to be part of the cluster nodes' network. This article does not focus on this mode, but you can find out more in the official documentation.

2. Layer 3 mode (BGP mode). This mode overcomes the problems described for layer 2 mode. It is based, of course, on the “two-napkin” protocol, henceforth BGP, which allows the cluster administrator to establish peering sessions between BGP speaker pods placed on a selected set of cluster nodes and an external BGP speaker, such as the upstream router. Incoming requests directed to a service's external IP address that arrive at the router are then forwarded to the appropriate nodes. This method allows the ingress traffic to be load balanced properly, provided the external router distributes the incoming requests across the cluster nodes acting as peers.

In the remaining part of this article, I am going to focus on the latter, BGP mode.

When MetalLB works in BGP mode, it makes use of FRRouting (FRR), an open source project that started as a spin-off of Quagga and provides fully featured IP routing capabilities for GNU/Linux. It supports all sorts of routing protocols, among them the one MetalLB needs: BGP.

BGP is the de facto routing protocol of the internet and the industry, and one you have probably heard of. In brief, BGPv4 was defined in RFC 1654 as an exterior gateway path-vector routing protocol whose goal is to convey network reachability between autonomous systems. An autonomous system (AS) is a flexible term, but it essentially refers to a set of routers under a single technical administration, ideally running a common interior gateway protocol, with common metrics, and so on. Each AS is identified by an autonomous system number (ASN), typically a unique 16-bit number (or 32 bits with RFC 4893).

Once a BGP session is established between two peers, the routers send keepalive messages periodically to monitor each other's status. The minimum hold time BGP allows is 3 seconds, so detecting a failed peer can take at least that long, which may be an unacceptable amount of time for a service to be unavailable. Therefore, in addition to standard BGP speaking capabilities, MetalLB (and FRR) also supports the Bidirectional Forwarding Detection (BFD) protocol. BFD adds further resiliency to the solution by providing faster failure detection: it allows two routers to detect faults in the bidirectional path between them at subsecond intervals. BFD works at layer 3 and is essentially a lightweight liveness detection ("hello") protocol that is independent of the routing protocol itself. It limits itself to notifying the routing protocol about the failure and does not take any corrective action.

Prerequisites

For MetalLB to work in BGP mode, the following prerequisites are needed:

  • First and foremost, a bare metal or bare-metal-like cluster in place, deployed with either IPI or UPI. Other platforms that do not provide a native load balancer can also benefit from MetalLB.
  • A BGP speaking external router to act as a peer, which will be described in the Environment section.
  • The external network infrastructure should be able to route client traffic directed to the LoadBalancer services’ addresses through the external router that is set as a BGP peer.
  • The addresses that will be assigned to the AddressPool custom resources for BGP mode should be reserved and should not be part of the OpenShift nodes’ network (network.machineNetwork[].cidr from install-config.yaml).
  • There should be open communication between the external router and the cluster nodes on ports 179/TCP (BGP), 3784/UDP, and 3785/UDP (BFD); a quick reachability check is sketched below.
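TCP reachability of the BGP port from a node towards the router can be sanity-checked before configuring anything, for example with curl. This is only a sketch: it assumes curl is available on the node, and the node name and router address are taken from the environment described in the next section. A "Connected to ..." line in the verbose output means the TCP port is reachable; the UDP ports used by BFD cannot be verified this way.

$ oc debug node/ice-ocp4-worker-0.lab.local -- chroot /host \
    curl -v --connect-timeout 3 -m 5 telnet://192.168.133.1:179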

Environment

The network where the OpenShift cluster runs has an adjacent router able to speak BGP and BFD. Moreover, the router is able to do ECMP (Equal Cost Multi-Path) routing in order to distribute incoming requests uniformly across multiple OpenShift worker nodes. Thus, if a LoadBalancer service is advertised from multiple worker nodes at the same time, each of these nodes should get a fairly even share of the requests from the router.

To emulate that router, I am going to use an isolated instance of FRR running containerized within an external system.

In this example, the external router lives in the same network segment as the OpenShift nodes (single-hop network topology). Multi-hop network topologies are also possible by setting an extra field, spec.ebgpMultiHop, on the BGPPeer custom resource, which will be introduced in a later section.
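As a preview (the BGPPeer resource itself is covered in the Procedure section), a multi-hop peer definition would simply add that field; the name and peer address below are purely illustrative:

apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-multihop
  namespace: metallb-system
spec:
  myASN: 64520
  peerASN: 64521
  peerAddress: 192.168.200.1
  ebgpMultiHop: true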

Figure 1: Network Architecture

For the purpose of testing MetalLB in BGP mode, I have already installed a Red Hat OpenShift Container Platform 4.10.3 cluster using the bare-metal UPI deployment method, with OpenShift SDN as the CNI plug-in in its default configuration.

MetalLB is supported in both OpenShift SDN and OVN-Kubernetes CNI plug-ins.

IMPORTANT: Red Hat recommends OVN-Kubernetes as the CNI plug-in for Telecom/CNF use cases. This is transparent from the MetalLB configuration perspective, and this blog can safely be reused for Telecom/CNF use cases as long as OVN-Kubernetes CNI is used.

$ oc get clusterversion
NAME  VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.3 True    False     23h Cluster version is 4.10.3

$ oc get nodes
NAME                      STATUS   ROLES AGE   VERSION
ice-ocp4-master-0.lab.local   Ready master   15h   v1.23.3+e419edf
ice-ocp4-master-1.lab.local   Ready master   15h   v1.23.3+e419edf
ice-ocp4-master-2.lab.local   Ready master   14h   v1.23.3+e419edf
ice-ocp4-worker-0.lab.local   Ready worker   28m   v1.23.3+e419edf
ice-ocp4-worker-1.lab.local   Ready worker   28m   v1.23.3+e419edf

$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-03-16T18:51:34Z"
  generation: 2
  name: cluster
  resourceVersion: "5174"
  uid: b4f13c36-bef8-44fb-9cec-780b6c565eda
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

$ oc get co
NAME                             VERSION    AVAILABLE  PROGRESSING DEGRADED SINCE MESSAGE
authentication                   4.10.3 True      False       False   48m
baremetal                        4.10.3 True      False       False  15h
cloud-controller-manager         4.10.3 True      False       False  16h
cloud-credential                 4.10.3 True      False       False  16h
cluster-autoscaler               4.10.3 True      False       False  15h
config-operator                  4.10.3 True      False       False  15h
console                          4.10.3 True      False       False  40m
csi-snapshot-controller          4.10.3 True      False       False  15h
dns                              4.10.3 True      False       False  15h
etcd                             4.10.3 True      False       False  14h
image-registry                   4.10.3 True      False       False  14h
ingress                          4.10.3 True      False       False  53m
insights                         4.10.3 True      False       False  15h
kube-apiserver                   4.10.3 True      False       False  14h
kube-controller-manager          4.10.3 True      False       False  15h
kube-scheduler                   4.10.3 True      False       False  15h
kube-storage-version-migrator    4.10.3 True      False       False  15h
machine-api                      4.10.3 True      False       False  15h
machine-approver                 4.10.3 True      False       False  15h
machine-config                   4.10.3 True      False       False  14h
marketplace                      4.10.3 True      False       False  15h
monitoring                       4.10.3 True      False       False  20m
network                          4.10.3 True      False       False  15h
node-tuning                      4.10.3 True      False       False  14h
openshift-apiserver              4.10.3 True      False       False  14h
openshift-controller-manager     4.10.3 True      False       False  15h
openshift-samples                4.10.3 True      False       False  14h
operator-lifecycle-manager       4.10.3 True      False       False  15h
operator-lifecycle-manager-ca…   4.10.3 True      False       False  15h
operator-lifecycle-manager-pa…   4.10.3 True      False       False  14h
service-ca                       4.10.3 True      False       False  15h
storage                          4.10.3 True      False       False  15h

Since this environment is using OpenShift SDN, and MetalLB respects the externalTrafficPolicy (which defaults to Cluster), the traffic for a new service using this policy will be distributed evenly to every node. Within each node, kube-proxy then takes care of distributing the traffic to the available endpoints for that service. This can be verified simply by inspecting the iptables rules that kube-proxy creates on every node for the LoadBalancer services.
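Once a LoadBalancer service exists (for example, the test-frr service created later in this article), those rules can be listed directly on a node; a sketch, assuming the node name and service name from this environment:

$ oc debug node/ice-ocp4-worker-0.lab.local -- chroot /host \
    sh -c 'iptables-save -t nat | grep test-frr'

The output should show the chains that kube-proxy programs for the service's external IP and node port.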

It is important to understand that, in order to distribute the traffic, the ECMP implementation of the external router will in practice use flow-based hashing. Therefore, all traffic associated with a particular flow will use the same next hop and a consistent path across the network. In other words, ingress traffic from distinct sources will be properly distributed, but a large amount of ingress traffic from a single IP address will always land on the same worker node, which can mean unbalanced traffic.

Procedure

The procedure to test MetalLB consists of three parts: deploying the external router that will act as BGP peer, deploying and configuring the MetalLB operator itself, and finally verifying and testing the resulting scenario.

External Router

To emulate the external router, I will also use FRR deployed in a container on an external system in the same network as the OpenShift worker nodes.

The container will use a configuration directory (in this example $HOME/frr) holding three files: 

  • frr.conf - Main FRRouting config file
  • daemons - Sets which daemons will be enabled
  • vtysh.conf - Configuration file for vtysh (CLI interface to FRR). It has to exist for the container to start but can be empty.

Mainly, we want to set the router's identity (hostname, own ASN, address, and so forth) and then set the parameters needed to establish a BGP session with the speaker pods that will run within the OpenShift cluster: grouping them, listing their IP addresses, setting their ASN, enabling BFD, setting some BGP timers, and so on.

In the Appendix 1 section, you can find the content of these files in detail, along with an explanation of the settings that need to be adjusted in them.

Now the external router container can be started.

$ tree /home/frr/
/home/frr/
├── daemons
├── frr.conf
└── vtysh.conf
$ sudo podman run -d --rm  -v /home/maur0x/frr:/etc/frr:Z --net=host --name frr-upstream --privileged quay.io/frrouting/frr
bdf2b6a9ebe087acc7df8b57f86dbf7e3f253f5df66ced360a18d2d40574b8f1

One final note about the external router setup: if there is a firewall between the OpenShift nodes and the router (especially applicable in multi-hop architectures), the environment must allow BGP (179/TCP) and BFD (3784/UDP and 3785/UDP) traffic through.
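If the external router host itself runs firewalld, opening those ports could look like this (a sketch; the default zone is assumed):

$ sudo firewall-cmd --permanent --add-port=179/tcp
$ sudo firewall-cmd --permanent --add-port=3784/udp
$ sudo firewall-cmd --permanent --add-port=3785/udp
$ sudo firewall-cmd --reload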

The external router is ready.


MetalLB

The following steps install the MetalLB operator and MetalLB itself and configure them: we will reserve some addresses for the LoadBalancer services and finally peer MetalLB with the external router created in the previous section.

STEP 1. Install the MetalLB operator

$ oc get packagemanifest | grep metal
metallb-operator                                Red Hat Operators         10m
$ cat << _EOF_ | oc apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
spec: {}
_EOF_
namespace/metallb-system created
$ cat << _EOF_ | oc apply -f -
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: metallb-operator
  namespace: metallb-system
spec:
  targetNamespaces:
  - metallb-system
_EOF_
operatorgroup.operators.coreos.com/metallb-operator created
$ cat << _EOF_ | oc apply -f -
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: metallb-operator-sub
  namespace: metallb-system
spec:
  name: metallb-operator
  channel: "stable"
  source: redhat-operators
  sourceNamespace: openshift-marketplace
_EOF_
subscription.operators.coreos.com/metallb-operator-sub created

STEP 2. Check the install plan is approved and the installation progress

$ oc get installplan -n metallb-system
NAME            CSV                                         APPROVAL         APPROVED
install-brb9w   metallb-operator.4.10.0-202203081809   Automatic   true
$ oc get csv -n metallb-system -o custom-columns='NAME:.metadata.name, VERSION:.spec.version, PHASE:.status.phase'
NAME                                   VERSION               PHASE
metallb-operator.4.10.0-202203111548   4.10.0-202203111548   Succeeded
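In this environment the install plan approval is Automatic. If the Subscription had been created with spec.installPlanApproval: Manual instead, the pending plan would need to be approved explicitly; a sketch using the plan name from the output above:

$ oc patch installplan install-brb9w -n metallb-system \
    --type merge -p '{"spec":{"approved":true}}'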

STEP 3. Create a MetalLB custom resource and check the controller deployment

Once the operator is fully installed, we can proceed to deploy the MetalLB instance.

The MetalLB custom resource represents MetalLB itself, and there can be only one resource of this kind in the cluster. When it is created, MetalLB is deployed, instantiating a controller Deployment and a speaker DaemonSet.

If the custom resource is deleted, MetalLB is removed from the cluster. The MetalLB operator needs to be explicitly uninstalled afterwards.

In this step, it is possible to select which nodes will run the speaker pods via spec.nodeSelector and, if needed, to add tolerations to the speaker DaemonSet via spec.speakerTolerations. It will typically target worker nodes or infra nodes. More details on the MetalLB custom resource can be found in the official documentation.

$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
_EOF_
metallb.metallb.io/metallb created
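If the speaker pods should instead run on dedicated infra nodes that carry a taint, the same resource could combine spec.nodeSelector with spec.speakerTolerations, mentioned above. A sketch, where the infra label and the taint key/effect are assumptions about how such nodes are typically configured:

apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  speakerTolerations:
  - key: node-role.kubernetes.io/infra
    operator: Exists
    effect: NoSchedule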

Check that the controller deployment is up. This pod is in charge of assigning IP addresses from the address pools (created in the next step) to the LoadBalancer services that are requested.

$ oc get deployment controller -n metallb-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
controller   1/1         1                1               91s
$ oc get deployment -n metallb-system controller -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-03-17T12:42:31Z"
  generation: 1
  labels:
    app: metallb
    component: controller
  name: controller
  namespace: metallb-system
  ownerReferences:
  - apiVersion: metallb.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MetalLB
    name: metallb
    uid: 10fe2c8f-7122-43ee-a735-b3062caf97e6
  resourceVersion: "464444"
  uid: 3bcc9c38-0489-4005-9fcc-ad6341461e5f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 3
[…]
$ oc get pods -n metallb-system
NAME                                        READY   STATUS        RESTARTS   AGE
controller-5bcbccf6d4-lhp95                 2/2     Running   0        71s
metallb-operator-controller-manager-654df86cc5-szk96 1/1 Running   0   4m
speaker-bt8z2                               6/6     Running   0        71s
speaker-czhc5                               6/6     Running   0        71s
$ oc get ds -n metallb-system
NAME  DESIRED CURRENT READY UP-TO-DATE AVAILABLE  NODE SELECTOR  AGE
speaker 2     2       2     2      2 node-role.kubernetes.io/worker= 110s

STEP 4. Create an Address Pool

As mentioned before, the AddressPool custom resource will tell MetalLB which external IP addresses are valid to be assigned to a LoadBalancer service.

These addresses can be specified as a subnet range or as individual addresses, as in the example below.

To avoid collisions, these IP addresses should be available and reserved for this use only. It is also important that the addresses in the pool do not collide with the OpenShift nodes’ network (network.machineNetwork[].cidr).

The protocol has to be set to bgp. The other protocol option is layer2 (explained in a previous section), which relies on the ARP protocol for IPv4 and the NDP protocol for IPv6 to advertise the addresses.

$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: address-pool-bgp
  namespace: metallb-system
spec:
  addresses:
  - 192.168.155.150/32
  - 192.168.155.151/32
  - 192.168.155.152/32
  - 192.168.155.153/32
  - 192.168.155.154/32
  - 192.168.155.155/32
  autoAssign: true
  protocol: bgp
_EOF_
addresspool.metallb.io/address-pool-bgp created
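The same pool could also be expressed as a single range instead of individual /32 entries; a sketch with a hypothetical pool name covering the same reserved block:

apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: address-pool-bgp-range
  namespace: metallb-system
spec:
  protocol: bgp
  autoAssign: true
  addresses:
  - 192.168.155.150-192.168.155.155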

More examples can be found in the official documentation.

STEP 5. Create a BFD profile

The BFDProfile custom resource holds the configuration for the BFD protocol, where the administrator can tune, for instance, intervals, echo and passive modes, and so forth. The profile in the example below sets some basic values in order to pair with another FRR instance. The meaning of each of these parameters can be found in the official documentation and also in the FRR documentation for BFD.

$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: test-bfd-prof
  namespace: metallb-system
spec:
  detectMultiplier: 37
  echoMode: true
  minimumTtl: 10
  passiveMode: true
  receiveInterval: 35
  transmitInterval: 35
_EOF_
bfdprofile.metallb.io/test-bfd-prof created

STEP 6. Create a BGPPeer resource

The last step is to create the BGPPeer resource, which configures the speaker pods, passing to the FRR container within each pod its own ASN, the remote peer's ASN, and the remote peer's IP address.

This custom resource accepts some configuration that is worth bringing up:

  • If the eBGP peer is multiple hops away, spec.ebgpMultiHop has to be set to true.
  • The BFD profile to use via spec.bfdProfile.
  • Which subset of nodes running speaker pods should establish a session with this particular BGP peer via spec.nodeSelector.
  • Setting holdtime via spec.holdTime, and keepalive via spec.keepaliveTime.
  • More parameters are available in the official documentation.
$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-test
  namespace: metallb-system
spec:
  bfdProfile: test-bfd-prof
  myASN: 64520
  peerASN: 64521
  peerAddress: 192.168.133.1
_EOF_
bgppeer.metallb.io/peer-test created

At this point, the DaemonSet for the speaker pods should have the new peer configuration set and be establishing the sessions with the external router. We will review this configuration within the speaker pods in the following section.


Verification

BGP and BFD session status

Now that the environment is up and running, let's verify it is behaving as expected.

First, I will check that there are valid BGP sessions established within the speaker pods, taking speaker-6jsfc as an example.

The BGP state should be Established, and BFD status should be Up.
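As a quick first look before examining the full neighbor details, the summary view lists every configured peer and its state on a single line (the speaker pod name is the one from this environment):

$ oc -n metallb-system exec -it speaker-6jsfc -c frr -- vtysh -c "show bgp summary"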

$ oc -n metallb-system exec -it speaker-6jsfc -c frr -- vtysh -c "show ip bgp neighbor"
BGP neighbor is 192.168.133.1, remote AS 64521, local AS 64520, external link
Hostname: ice-lab-01.lab.local
BGP version 4, remote router ID 192.168.133.1, local router ID 192.168.133.71
BGP state = Established, up for 04:20:09
Last read 00:00:00, Last write 00:00:03
Hold time is 15, keepalive interval is 5 seconds
Configured hold time is 90, keepalive interval is 30 seconds
Neighbor capabilities:
       4 Byte AS: advertised and received
       AddPath:
        IPv4 Unicast: RX advertised IPv4 Unicast and received
        IPv6 Unicast: RX advertised IPv6 Unicast
       Route refresh: advertised and received(old & new)
       Address Family IPv4 Unicast: advertised and received
       Address Family IPv6 Unicast: advertised
       Hostname Capability: advertised (name: ice-ocp4-worker-0.lab.local,domain name: n/a) received (name: ice-lab-01.lab.local,domain name: n/a)
       Graceful Restart Capability: advertised and received
        Remote Restart timer is 120 seconds
        Address families by peer:
          none
Graceful restart information:
       End-of-RIB send: IPv4 Unicast
       End-of-RIB received: IPv4 Unicast
       Local GR Mode: Helper*
       Remote GR Mode: Helper
       R bit: False
       Timers:
        Configured Restart Time(sec): 120
        Received Restart Time(sec): 120
       IPv4 Unicast:
        F bit: False
        End-of-RIB sent: Yes
        End-of-RIB sent after update: Yes
        End-of-RIB received: Yes
        Timers:
          Configured Stale Path Time(sec): 360
       IPv6 Unicast:
        F bit: False
        End-of-RIB sent: No
        End-of-RIB sent after update: No
        End-of-RIB received: No
        Timers:
          Configured Stale Path Time(sec): 360
Message statistics:
       Inq depth is 0
       Outq depth is 0
                           Sent           Rcvd
       Opens:                      1              1
       Notifications:              0              0
       Updates:                    1              1
       Keepalives:              3122           5204
       Route Refresh:              2              0
       Capability:                 0              0
       Total:                   3126           5206
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
Update group 3, subgroup 3
Packet Queue length 0
Community attribute sent to this neighbor(all)
Inbound path policy configured
Route map for incoming advertisements is *192.168.133.1-in
0 accepted prefixes
For address family: IPv6 Unicast
Not part of any update group
Community attribute sent to this neighbor(all)
Inbound path policy configured
Route map for incoming advertisements is *192.168.133.1-in
0 accepted prefixes
Connections established 1; dropped 0
Last reset 04:20:10,  Waiting for peer OPEN
Local host: 192.168.133.71, Local port: 37226
Foreign host: 192.168.133.1, Foreign port: 179
Nexthop: 192.168.133.71
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 22
BFD: Type: single hop
       Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
       Status: Up, Last update: 0:04:19:58

We can also check the FRR instance running within the speaker pod and its configuration.

$ oc -n metallb-system exec -it speaker-bt8z2 -c frr -- vtysh -c "show running"
Building configuration...
Current configuration:
!
frr version 7.5
frr defaults traditional
hostname ice-ocp4-worker-0.lab.local
log file /etc/frr/frr.log informational
log timestamp precision 3
service integrated-vtysh-config
!
router bgp 64520
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 192.168.133.1 remote-as 64521
neighbor 192.168.133.1 bfd profile test-bfd-prof
neighbor 192.168.133.1 timers 30 90
!
address-family ipv4 unicast
neighbor 192.168.133.1 activate
neighbor 192.168.133.1 route-map 192.168.133.1-in in
exit-address-family
!
address-family ipv6 unicast
neighbor 192.168.133.1 activate
neighbor 192.168.133.1 route-map 192.168.133.1-in in
exit-address-family
!
route-map 192.168.133.1-in deny 20
!
route-map 192.168.133.1-out permit 1
!
ip nht resolve-via-default
!
ipv6 nht resolve-via-default
!
line vty
!
bfd
profile test-bfd-prof
detect-multiplier 37
transmit-interval 35
receive-interval 35
passive-mode
echo-mode
minimum-ttl 10
!
!
end

Similarly, checking the external router's FRR container should display an equivalent output for each peer:

$ sudo podman exec -it frr-upstream vtysh -c "show bgp summary"
IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.133.1, local AS number 64521 vrf-id 0
BGP table version 2
RIB entries 1, using 184 bytes of memory
Peers 2, using 1433 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V  AS  MsgRcvd  MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.133.71 4 64520   9        12    0        0     0 00:00:27   1            1 N/A
192.168.133.72 4 64520   15       23    0        0        0 00:00:58   1            1 N/A
Total number of neighbors 2
$ sudo podman exec -it  frr-upstream vtysh -c "show ip bgp neighbor"
BGP neighbor is 192.168.133.71, remote AS 64520, local AS 64521, external link
Hostname: ice-ocp4-worker-0.lab.local
Member of peer-group metallb for session parameters
BGP version 4, remote router ID 192.168.133.71, local router ID 192.168.133.1
BGP state = Established, up for 04:35:09
Last read 00:00:03, Last write 00:00:03
Hold time is 15, keepalive interval is 3 seconds
Configured hold time is 15, keepalive interval is 3 seconds
Neighbor capabilities:
       4 Byte AS: advertised and received
       Extended Message: advertised
       AddPath:
        IPv4 Unicast: RX advertised and received
       Long-lived Graceful Restart: advertised
       Route refresh: advertised and received(old & new)
       Enhanced Route Refresh: advertised
       Address Family IPv4 Unicast: advertised and received
       Address Family IPv6 Unicast: received
       Hostname Capability: advertised (name: ice-lab-01.lab.local,domain name: n/a) received (name: ice-ocp4-worker-0.lab.local,domain name: n/a)
       Graceful Restart Capability: advertised and received
        Remote Restart timer is 120 seconds
        Address families by peer:
          none
Graceful restart information:
       End-of-RIB send: IPv4 Unicast
       End-of-RIB received: IPv4 Unicast
       Local GR Mode: Helper*
       Remote GR Mode: Helper
       R bit: True
       Timers:
        Configured Restart Time(sec): 120
        Received Restart Time(sec): 120
       IPv4 Unicast:
        F bit: False
        End-of-RIB sent: Yes
        End-of-RIB sent after update: Yes
        End-of-RIB received: Yes
        Timers:
          Configured Stale Path Time(sec): 360
Message statistics:
       Inq depth is 0
       Outq depth is 0
                           Sent           Rcvd
       Opens:                      2              2
       Notifications:              0              2
       Updates:                    2              2
       Keepalives:              5550           3330
       Route Refresh:              0              2
       Capability:                 0              0
       Total:                   5554           3338
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
metallb peer-group member
Update group 2, subgroup 2
Packet Queue length 0
NEXT_HOP is always this router
Community attribute sent to this neighbor(all)
0 accepted prefixes
Connections established 2; dropped 1
Last reset 04:35:35,  No AFI/SAFI activated for peer
Local host: 192.168.133.1, Local port: 179
Foreign host: 192.168.133.71, Foreign port: 37226
Nexthop: 192.168.133.1
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 26
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:04:34:58
BGP neighbor is 192.168.133.72, remote AS 64520, local AS 64521, external link
Hostname: ice-ocp4-worker-1.lab.local
Member of peer-group metallb for session parameters
BGP version 4, remote router ID 192.168.133.72, local router ID 192.168.133.1
BGP state = Established, up for 04:35:09
Last read 00:00:03, Last write 00:00:03
Hold time is 15, keepalive interval is 3 seconds
Configured hold time is 15, keepalive interval is 3 seconds
Neighbor capabilities:
       4 Byte AS: advertised and received
       Extended Message: advertised
       AddPath:
        IPv4 Unicast: RX advertised and received
       Long-lived Graceful Restart: advertised
       Route refresh: advertised and received(old & new)
       Enhanced Route Refresh: advertised
       Address Family IPv4 Unicast: advertised and received
       Address Family IPv6 Unicast: received
       Hostname Capability: advertised (name: ice-lab-01.lab.local,domain name: n/a) received (name: ice-ocp4-worker-1.lab.local,domain name: n/a)
       Graceful Restart Capability: advertised and received
        Remote Restart timer is 120 seconds
        Address families by peer:
          none
Graceful restart information:
       End-of-RIB send: IPv4 Unicast
       End-of-RIB received: IPv4 Unicast
       Local GR Mode: Helper*
       Remote GR Mode: Helper
       R bit: True
       Timers:
        Configured Restart Time(sec): 120
        Received Restart Time(sec): 120
       IPv4 Unicast:
        F bit: False
        End-of-RIB sent: Yes
        End-of-RIB sent after update: Yes
        End-of-RIB received: Yes
        Timers:
          Configured Stale Path Time(sec): 360
Message statistics:
       Inq depth is 0
       Outq depth is 0
                           Sent           Rcvd
       Opens:                      2              2
       Notifications:              0              2
       Updates:                    2              2
       Keepalives:              5550           3330
       Route Refresh:              0              2
       Capability:                 0              0
       Total:                   5554           3338
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
metallb peer-group member
Update group 2, subgroup 2
Packet Queue length 0
NEXT_HOP is always this router
Community attribute sent to this neighbor(all)
0 accepted prefixes
Connections established 2; dropped 1
Last reset 04:35:35,  No AFI/SAFI activated for peer
Local host: 192.168.133.1, Local port: 179
Foreign host: 192.168.133.72, Foreign port: 50294
Nexthop: 192.168.133.1
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 27
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:04:34:58

The BFD sessions can also be inspected in more detail with the following command.

$ sudo podman exec -it frr-upstream   vtysh -c "show bfd peers"
BFD Peers:
       peer 192.168.133.71 local-address 192.168.133.1 vrf default interface virbr2
          ID: 2094083731
          Remote ID: 1259657737
          Active mode
          Status: up
          Uptime: 2 day(s), 2 hour(s), 58 minute(s), 5 second(s)
          Diagnostics: ok
          Remote diagnostics: ok
          Peer Type: dynamic
          Local timers:
              Detect-multiplier: 3
              Receive interval: 300ms
              Transmission interval: 300ms
              Echo receive interval: 50ms
              Echo transmission interval: disabled
          Remote timers:
              Detect-multiplier: 37
              Receive interval: 35ms
              Transmission interval: 35ms
              Echo receive interval: 50ms
       peer 192.168.133.72 local-address 192.168.133.1 vrf default interface virbr2
          ID: 1112508781
          Remote ID: 2587821932
          Active mode
          Status: up
          Uptime: 1 day(s), 5 hour(s), 38 minute(s), 0 second(s)
          Diagnostics: ok
          Remote diagnostics: ok
          Peer Type: dynamic
          Local timers:
              Detect-multiplier: 3
              Receive interval: 300ms
              Transmission interval: 300ms
              Echo receive interval: 50ms
              Echo transmission interval: disabled
          Remote timers:
              Detect-multiplier: 37
              Receive interval: 35ms
              Transmission interval: 35ms
              Echo receive interval: 50ms

Creating a New Service to Test MetalLB

To understand how a new service of type LoadBalancer will behave, let's create a simple example service using a hello-node deployment and verify that MetalLB is working as expected.

$ oc new-project test-metallb
Now using project "test-metallb" on server "https://api.t1.lab.local:6443".
[…]
$ oc create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname
deployment.apps/hello-node created
$ cat << __EOF__ | oc apply -f -
---
apiVersion: v1
kind: Service
metadata:
  name: test-frr
spec:
  selector:
    app: hello-node
  ports:
  - port: 80
    protocol: TCP
    targetPort: 9376
  type: LoadBalancer
__EOF__
service/test-frr created

The brand new LoadBalancer service is healthy: it gets the first available external IP from the defined address pool, it has the right endpoint, and it is announced via BGP from both worker nodes at the same time.

$ oc get svc
NAME       TYPE               CLUSTER-IP       EXTERNAL-IP       PORT(S)              AGE
test-frr   LoadBalancer 172.30.169.126   192.168.155.150   80:30194/TCP   33s
$ oc describe svc test-frr
Name:                         test-frr
Namespace:                    test-metallb
Labels:                       app=hello-node
Annotations:                  <none>
Selector:                     app=hello-node
Type:                         LoadBalancer
IP Family Policy:             SingleStack
IP Families:                  IPv4
IP:                           172.30.169.126
IPs:                          172.30.169.126
LoadBalancer Ingress:         192.168.155.150
Port:                         <unset>  80/TCP
TargetPort:                   9376/TCP
NodePort:                     <unset>  30194/TCP
Endpoints:                    10.131.1.165:9376
Session Affinity:             None
External Traffic Policy:  Cluster
Events:
Type        Reason            Age   From                    Message
----        ------            ----  ----                    -------
Normal  nodeAssigned  60s   metallb-speaker         announcing from node "ice-ocp4-worker-0.lab.local"
Normal  IPAllocated   57s   metallb-controller  Assigned IP ["192.168.155.150"]
Normal  nodeAssigned  56s   metallb-speaker         announcing from node "ice-ocp4-worker-1.lab.local"

From the external router point of view, the route to the new service is learned properly via BGP from the two speaker pods running in the worker nodes, and the service is reachable on its external IP from other nodes in the network.

$ sudo podman exec -it frr-upstream   vtysh -c "show ip route"
Codes: K - kernel route, C - connected, S - static, R - RIP,
         O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
         T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
         f - OpenFabric,
         > - selected route, * - FIB route, q - queued, r - rejected, b - backup
         t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/425] via 192.168.3.254, br0, 00:01:55
C>* 192.168.3.250/32 is directly connected, br0, 00:01:55
K>* 192.168.3.254/32 [0/20425] is directly connected, br0, 00:01:55
C>* 192.168.4.0/24 is directly connected, vlan1001, 00:01:55
C>* 192.168.122.0/24 is directly connected, virbr0, 00:01:55
C>* 192.168.133.0/24 is directly connected, virbr2, 00:01:55
B>* 192.168.155.150/32 [20/0] via 192.168.133.71, virbr2, weight 1, 00:01:14
*                               via 192.168.133.72, virbr2, weight 1, 00:01:14
$ sudo podman exec -it frr-upstream ip r
default via 192.168.3.254 dev br0 proto static metric 425
192.168.0.0/24 dev virbr1 proto kernel scope link src 192.168.0.254 linkdown
192.168.3.254 dev br0 proto static scope link metric 20425
192.168.4.0/24 dev vlan1001 proto kernel scope link src 192.168.4.2 metric 400
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
192.168.133.0/24 dev virbr2 proto kernel scope link src 192.168.133.1
192.168.155.150 nhid 575 proto bgp metric 20
  nexthop via 192.168.133.71 dev virbr2 weight 1
  nexthop via 192.168.133.72 dev virbr2 weight 1
$ curl -l 192.168.155.150
hello-node-78bd88f59b-btbpc

When a LoadBalancer service is created, some interesting options can be used to influence the external IP assignment, such as requesting a specific IP address for the service, requesting any IP address from a specific pool, or even sharing one external IP among several services.

For example, the service will be re-created, but this time requesting a specific IP from the pool by indicating spec.loadBalancerIP.

$ oc delete svc/test-frr
service "test-frr" deleted
$ cat << __EOF__ | oc apply -f -
---
apiVersion: v1
kind: Service
metadata:
  name: test-frr
  annotations:
    metallb.universe.tf/address-pool: address-pool-bgp
spec:
  selector:
    app: hello-node
  ports:
  - port: 80
    protocol: TCP
    targetPort: 9376
  type: LoadBalancer
  loadBalancerIP: 192.168.155.151
__EOF__
service/test-frr created

Yet again, the service is announced properly and is reachable externally at its new IP address: 192.168.155.151.

$ oc describe svc/test-frr
Name:                         test-frr
Namespace:                    test-metallb
Labels:                       <none>
Annotations:                  metallb.universe.tf/address-pool: address-pool-bgp
Selector:                     app=hello-node
Type:                         LoadBalancer
IP Family Policy:             SingleStack
IP Families:                  IPv4
IP:                           172.30.140.200
IPs:                          172.30.140.200
IP:                           192.168.155.151
LoadBalancer Ingress:         192.168.155.151
Port:                         <unset>  80/TCP
TargetPort:                   9376/TCP
NodePort:                     <unset>  31605/TCP
Endpoints:                    10.131.0.5:9376
Session Affinity:             None
External Traffic Policy:  Cluster
Events:
Type        Reason            Age   From                    Message
----        ------            ----  ----                    -------
Normal  nodeAssigned  35s   metallb-speaker         announcing from node "ice-ocp4-worker-0.lab.local"
Normal  IPAllocated   35s   metallb-controller  Assigned IP ["192.168.155.151"]
Normal  nodeAssigned  34s   metallb-speaker         announcing from node "ice-ocp4-worker-1.lab.local"
$ curl -l 192.168.155.151
hello-node-78bd88f59b-btbpc
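The third option mentioned earlier, sharing one external IP between several services, relies on the metallb.universe.tf/allow-shared-ip annotation: services carrying the same annotation value can be assigned the same address as long as they expose different ports. A sketch (the service name, sharing key, and port are illustrative, and the existing test-frr service would need the same annotation):

apiVersion: v1
kind: Service
metadata:
  name: test-frr-https
  annotations:
    metallb.universe.tf/allow-shared-ip: "share-151"
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.155.151
  selector:
    app: hello-node
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9376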

For troubleshooting details, check out the official documentation.

Conclusions

MetalLB is becoming more and more mature as a project, and using it in BGP mode offers a novel way to statelessly load balance traffic into OpenShift cluster services using standard routing facilities instead of a dedicated load balancer network device.

As we observed throughout this article, MetalLB also helps achieve an experience substantially similar to the one we would get on a public cloud provider, but on an on-premises platform. Moreover, the even distribution of the traffic can also help achieve higher availability, resiliency, and performance.

Both the Red Hat OpenShift Container Platform documentation and the MetalLB upstream documentation point out some limitations of MetalLB in BGP mode, which should be kept in mind when architecting or implementing the solution. The main one is how BGP handles a peer going down: the active connections associated with that node are redistributed to other nodes, potentially breaking stateful connections. Faster detection via BFD can help speed up the transition.

Moving forward, other interesting configurations could be attempted, like using multi-hop network topology, IPv6 stack, and so on.

Appendix 1: FRR Configuration Files

Annotated FRRouting configuration files needed to start the external router container.

The important parameters to adjust in the frr.conf configuration file are indicated below.

frr version 8.0.1_git
frr defaults traditional
hostname frr-upstream                                  👈[1]
!
debug bgp updates
debug bgp neighbor
debug zebra nht
debug bgp nht
debug bfd peer
log file /tmp/frr.log debugging
log timestamp precision 3
!
interface virbr2                                       👈[2]
 ip address 192.168.133.1/24                           👈[3]
!
router bgp 64521                                       👈[4]
 bgp router-id 192.168.133.1                           👈[5]
 timers bgp 3 15                                       👈[6]
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor metallb peer-group
 neighbor metallb remote-as 64520                      👈[7]
 neighbor 192.168.133.71 peer-group metallb            👈[8]
 neighbor 192.168.133.71 bfd                           👈[9]
 neighbor 192.168.133.72 peer-group metallb
 neighbor 192.168.133.72 bfd
 !
 address-family ipv4 unicast
  neighbor 192.168.133.71 next-hop-self                👈[10]
  neighbor 192.168.133.71 activate                     👈[11]
  neighbor 192.168.133.72 next-hop-self
  neighbor 192.168.133.72 activate
 exit-address-family
!
line vty

 [1] hostname <NAME>: the router hostname, frr-upstream.

 [2] interface <DEV>: the interface name that is in the same subnet as the OpenShift worker nodes.

 [3] ip address <IP/PREFIX>: External host IP address and prefix, 192.168.133.1/24.

 [4] router bgp <ASN>: pick the ASN for the external router, 64521.

 [5] bgp router-id <IP>: pick the IP for the external router host, 192.168.133.1.

 [6] timers bgp 3 15: BGP keepalive interval (3 seconds) and hold time (15 seconds). These can be adjusted to your needs.

 [7] neighbor metallb remote-as <ASN>: the remote (MetalLB) ASN, 64520.

 [8] neighbor <IP> peer-group metallb: each OpenShift node that runs a speaker pod should be identified as a neighbor. I also mark these peers as part of the peer-group metallb.

 [9] neighbor <IP> bfd: Enable BFD with the neighbor in question.

 [10] neighbor <IP> next-hop-self: tells FRR to advertise routes to this neighbor with its own address (the external router) as the BGP next hop.

 [11] neighbor <IP> activate: enables the exchange of IPv4 unicast routes with the listed neighbors, so they will receive announcements from this router.

For more details on FRR BGP configuration, check their documentation.

The daemons file only needs to ensure that the right daemons are enabled: bgpd and bfdd.

# This file tells the frr package which daemons to start.
#
# Sample configurations for these daemons can be found in
# /usr/share/doc/frr/examples/.
#
# ATTENTION:
#
# When activating a daemon for the first time, a config file, even if it is
# empty, has to be present *and* be owned by the user and group "frr", else
# the daemon will not be started by /etc/init.d/frr. The permissions should
# be u=rw,g=r,o=.
# When using "vtysh" such a config file is also needed. It should be owned by
# group "frrvty" and set to ug=rw,o= though. Check /etc/pam.d/frr, too.
#
# The watchfrr, zebra and staticd daemons are always started.
#
bgpd=yes                                          👈
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=yes                                          👈
fabricd=no
vrrpd=no
pathd=no
#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
#
vtysh_enable=yes
zebra_options="  -A 127.0.0.1 -s 90000000"
bgpd_options="   -A 127.0.0.1"
ospfd_options="  -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options="   -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options="  -A 127.0.0.1"
pimd_options="   -A 127.0.0.1"
ldpd_options="   -A 127.0.0.1"
nhrpd_options="  -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options="   -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options="   -A 127.0.0.1"
fabricd_options="-A 127.0.0.1"
vrrpd_options="  -A 127.0.0.1"
pathd_options="  -A 127.0.0.1"
# configuration profile
#
#frr_profile="traditional"
#frr_profile="datacenter"
#
# This is the maximum number of FD's that will be available.
# Upon startup this is read by the control files and ulimit
# is called.  Uncomment and use a reasonable value for your
# setup if you are expecting a large number of peers in
# say BGP.
MAX_FDS=1024
# The list of daemons to watch is automatically generated by the init script.
#watchfrr_options=""
# To make watchfrr create/join the specified netns, use the following option:
#watchfrr_options="--netns"
# This only has an effect in /etc/frr/<somename>/daemons, and you need to
# start FRR with "/usr/lib/frr/frrinit.sh start <somename>".
# for debugging purposes, you can specify a "wrap" command to start instead
# of starting the daemon directly, e.g. to use valgrind on ospfd:
#   ospfd_wrap="/usr/bin/valgrind"
# or you can use "all_wrap" for all daemons, e.g. to use perf record:
#   all_wrap="/usr/bin/perf record --call-graph -"
# the normal daemon command is added to this at the end.

About the author

Mauro Oddi is a Senior Cloud Success Architect with more than 10 years of experience in the Red Hat product portfolio, and has been helping customers within the EMEA region since 2017. Oddi focuses primarily on emerging technologies like Red Hat OpenShift, Red Hat OpenStack and Red Hat Ceph Storage.
