Red Hat blog
Introduction
With the release of OpenShift 4.10, BGP (Border Gateway Protocol) mode for the MetalLB operator became generally available.
BGP mode offers a novel way to statelessly load balance client traffic towards the applications running on bare-metal and bare metal-like OpenShift clusters by using standard routing facilities, while at the same time providing high availability for the services.
The purpose of this article is to introduce MetalLB's design and goals and cover this new mode in more detail. Throughout this article, I will discuss its configuration and usage and, in addition, provide mechanisms to verify that it is working properly.
Brief on MetalLB, BGP and BFD
Some concepts have to be conveyed in order to follow the subsequent sections where I will walk through the environment and the configuration procedure.
First and foremost, I need to address what MetalLB is.
MetalLB is an open source project that makes it possible to create Kubernetes services of type LoadBalancer on top of a bare-metal OpenShift/Kubernetes cluster, providing the same user experience one would get from a public cloud provider such as AWS, GCP, or Azure.
A service of type LoadBalancer in Kubernetes automatically requests that the cloud provider provision a load balancer that directs traffic to the right cluster node and port tuples, and subsequently assigns an externally accessible IP address to it. Information such as the assigned IP address is then updated in the status section of the new service resource. In contrast, in an on-premises environment, Kubernetes creates a ClusterIP service, assigns a node port, and waits for the administrator to create the load balancer side of the configuration, which is not very convenient.
To achieve a similar behavior to the one described for public cloud providers, MetalLB has to handle two main tasks: address allocation, which means managing the address pools that new LoadBalancer services will draw from, and the external announcement of these services' IP addresses, so that entities external to the cluster can learn how to reach them.
In regard to address allocation, MetalLB has to be told which IP addresses or IP address ranges it can assign when a new service is created, and this is accomplished through the AddressPool custom resource.
As to external announcement of the IP addresses, MetalLB offers two alternatives:
1. Layer 2 mode. This mode has been generally available since the OpenShift 4.9 GA release and relies on the ARP protocol for IPv4 and the NDP protocol for IPv6 to advertise which cluster node owns the service IP address. Even though this method is acceptable for many use cases, it does have some drawbacks. First, the node owning the IP becomes a bottleneck and limits performance; second, failover when that node goes down may be slow; and third, the address allocation space has to be part of the cluster nodes' network. This article does not intend to focus on this mode, but you can find out more in the official documentation.
2. Layer 3 mode (BGP mode). This mode overcomes the problems described for layer 2 mode. It is based, of course, on the “two-napkin” protocol, henceforth BGP, which allows the cluster administrator to establish peering sessions between BGP speaker pods, placed on a selected set of cluster nodes, and an external BGP speaker such as the upstream router. Incoming requests directed to a service's external IP address that arrive at the router are then properly forwarded to those nodes. This method achieves appropriate load balancing of the ingress traffic as long as the external router distributes the incoming requests across the cluster nodes acting as peers.
In the remaining part of this article, I am going to focus on the latter, BGP mode.
When MetalLB works in BGP mode, it makes use of FRRouting (FRR), an open source project that started as a spin-off from Quagga and provides a fully featured IP routing stack for GNU/Linux. It supports all sorts of routing protocols, among them the one MetalLB needs: BGP.
BGP is the internet and industry de facto routing protocol, which you have probably heard of. In brief, BGPv4 was defined in RFC 1654 as an exterior gateway path-vector routing protocol, with the goal of conveying network reachability between autonomous systems. Autonomous system is a flexible term, but it essentially refers to a set of routers under a single technical administration, ideally with a common interior gateway protocol in use, common metrics, and so on. Each AS is identified by an autonomous system number (ASN), typically a unique 16-bit number (or 32-bit since RFC 4893).
Once the BGP session is established between two peers, the routers send keep-alive messages periodically to monitor the peer's status. The minimum hold time BGP allows is 3 seconds, so detecting a failed peer can take that long, which may be an unacceptable amount of time for a service to be unavailable. Therefore, in addition to the standard BGP speaking capabilities, MetalLB (and FRR) also supports configuring the Bidirectional Forwarding Detection (BFD) protocol. BFD adds further resiliency to the solution by providing faster failure detection: it allows two routers to detect faults in the bidirectional path between them at subsecond intervals. BFD works at layer 3 and is essentially a lightweight liveness detection (also known as "hello") protocol that is independent of the routing protocol itself. It only notifies the routing protocol about the failure; it does not take any corrective action.
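As a rough worked comparison (using the FRR default BFD timers of a 3-packet detect multiplier and a 300 ms receive interval, which are also the values visible in the verification outputs later in this article), the BFD failure detection time is:

\[
t_{\mathrm{detect}} = \text{detect multiplier} \times \text{receive interval} = 3 \times 300\ \mathrm{ms} = 900\ \mathrm{ms}
\]

well under the 3-second floor imposed by BGP hold timers.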
Prerequisites
For MetalLB to work in BGP mode, the following prerequisites are needed:
- First and foremost, a bare-metal or bare metal-like cluster in place. It can be deployed with either IPI or UPI. Other platforms that do not provide a native load balancer can also benefit from MetalLB.
- A BGP speaking external router to act as a peer, which will be described in the Environment section.
- The external network infrastructure should be able to route client traffic directed to the LoadBalancer services’ addresses through the external router that is set as a BGP peer.
- The addresses that will be assigned to the AddressPool custom resources for BGP mode should be reserved and should not be part of the OpenShift nodes' network (network.machineNetwork[].cidr from install-config.yaml).
- There should be open communication between the external router and the cluster nodes on ports 179/TCP (BGP) and 3784/UDP and 3785/UDP (BFD).
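If a firewall sits between the router and the nodes, those ports could be opened with something like the following sketch, assuming firewalld on the router host (the zone name is an assumption; adjust to your environment):

```shell
# Allow BGP (179/tcp) and BFD (3784-3785/udp) through firewalld.
sudo firewall-cmd --zone=public --add-port=179/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3784/udp --permanent
sudo firewall-cmd --zone=public --add-port=3785/udp --permanent
sudo firewall-cmd --reload
```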
Environment
The network where the OpenShift cluster runs on will have an adjacent router that is able to speak BGP and BFD. Moreover, the router will be able to do ECMP (Equal Cost Multi-Path) routing in order to distribute the incoming requests uniformly across multiple OpenShift worker nodes. Thus, if a LoadBalancer service is advertised from multiple worker nodes at the same time, each of these nodes should get a fairly even amount of requests from the router.
To emulate that router, I am going to use an isolated instance of FRR running containerized within an external system.
In this example, the external router lives in the same network segment as the OpenShift nodes (single-hop network topology). Multi-hop network topologies are also possible by setting an extra field, spec.ebgpMultiHop, in the BGPPeer custom resource, which will be introduced in a later section.
Figure 1: Network Architecture
Also, for the purpose of testing MetalLB in BGP mode, I have already installed a Red Hat OpenShift Container Platform 4.10.3 cluster using the bare-metal UPI deployment method, with OpenShift SDN as the CNI plug-in in its default configuration.
MetalLB is supported in both OpenShift SDN and OVN-Kubernetes CNI plug-ins.
IMPORTANT: Red Hat recommends OVN-Kubernetes as the CNI plug-in for Telecom/CNF use cases. This is transparent from the MetalLB configuration perspective, and this blog can safely be reused for Telecom/CNF use cases as long as OVN-Kubernetes is the CNI plug-in.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.3 True False 23h Cluster version is 4.10.3
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ice-ocp4-master-0.lab.local Ready master 15h v1.23.3+e419edf
ice-ocp4-master-1.lab.local Ready master 15h v1.23.3+e419edf
ice-ocp4-master-2.lab.local Ready master 14h v1.23.3+e419edf
ice-ocp4-worker-0.lab.local Ready worker 28m v1.23.3+e419edf
ice-ocp4-worker-1.lab.local Ready worker 28m v1.23.3+e419edf
$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-03-16T18:51:34Z"
  generation: 2
  name: cluster
  resourceVersion: "5174"
  uid: b4f13c36-bef8-44fb-9cec-780b6c565eda
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.10.3 True False False 48m
baremetal 4.10.3 True False False 15h
cloud-controller-manager 4.10.3 True False False 16h
cloud-credential 4.10.3 True False False 16h
cluster-autoscaler 4.10.3 True False False 15h
config-operator 4.10.3 True False False 15h
console 4.10.3 True False False 40m
csi-snapshot-controller 4.10.3 True False False 15h
dns 4.10.3 True False False 15h
etcd 4.10.3 True False False 14h
image-registry 4.10.3 True False False 14h
ingress 4.10.3 True False False 53m
insights 4.10.3 True False False 15h
kube-apiserver 4.10.3 True False False 14h
kube-controller-manager 4.10.3 True False False 15h
kube-scheduler 4.10.3 True False False 15h
kube-storage-version-migrator 4.10.3 True False False 15h
machine-api 4.10.3 True False False 15h
machine-approver 4.10.3 True False False 15h
machine-config 4.10.3 True False False 14h
marketplace 4.10.3 True False False 15h
monitoring 4.10.3 True False False 20m
network 4.10.3 True False False 15h
node-tuning 4.10.3 True False False 14h
openshift-apiserver 4.10.3 True False False 14h
openshift-controller-manager 4.10.3 True False False 15h
openshift-samples 4.10.3 True False False 14h
operator-lifecycle-manager 4.10.3 True False False 15h
operator-lifecycle-manager-ca… 4.10.3 True False False 15h
operator-lifecycle-manager-pa… 4.10.3 True False False 14h
service-ca 4.10.3 True False False 15h
storage 4.10.3 True False False 15h
Since this environment is using OpenShift SDN, and MetalLB respects externalTrafficPolicy (which defaults to Cluster), the traffic for a new service using this policy should be distributed evenly to every node. Then, within every node, kube-proxy will take care of distributing the traffic to the available endpoints for that service. This can be verified simply by inspecting the iptables rules that kube-proxy creates on every node for the LoadBalancer services.
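For instance, once a LoadBalancer service exists, its kube-proxy NAT rules can be dumped from any node. This is a sketch: the node name and the service name test-frr are borrowed from later in this article, and the exact chain names are generated by kube-proxy:

```shell
# Dump the NAT rules kube-proxy programmed for the service; the rule
# comments include the namespace/name of the service.
oc debug node/ice-ocp4-worker-0.lab.local -- chroot /host \
  sh -c 'iptables-save -t nat | grep test-frr'
```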
It is important to understand that, in order to distribute the traffic, the ECMP implementation of the external router will in fact use flow-based hashing. All traffic associated with a particular flow will therefore use the same next hop and a consistent path across the network. In other words, ingress traffic from distinct sources will be properly distributed, but a large amount of ingress traffic from a single IP address will end up on the same worker node, which potentially means unbalanced traffic.
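The flow-based behavior can be illustrated with a toy sketch (not the router's actual algorithm): hash the flow 5-tuple and use the result, modulo the number of next hops, to pick one. The same 5-tuple always lands on the same next hop; the worker addresses are the ones used later in this article.

```shell
#!/bin/sh
# Toy ECMP next-hop selection: hash the 5-tuple, pick a hop by modulo.
next_hop() {
  # $1 is a "src dst sport dport proto" 5-tuple string.
  h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  set -- 192.168.133.71 192.168.133.72   # candidate next hops
  idx=$(( h % $# + 1 ))
  eval echo \$$idx
}
a=$(next_hop "10.0.0.5 192.168.155.150 40000 80 tcp")
b=$(next_hop "10.0.0.5 192.168.155.150 40000 80 tcp")
c=$(next_hop "10.0.0.5 192.168.155.150 40001 80 tcp")
# The same flow always maps to the same worker; a new source port may not.
echo "flow1 -> $a, repeat -> $b, different source port -> $c"
```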
Procedure
The procedure to test MetalLB consists of three parts: deploying the external router that will act as BGP peer, deploying and configuring the MetalLB operator itself, and finally verifying and testing the resulting scenario.
External Router
To emulate the external router, I will also use FRR deployed in a container on an external system in the same network as the OpenShift worker nodes.
The container will use a configuration directory (in this example $HOME/frr) holding three files:
- frr.conf - Main FRRouting config file
- daemons - Sets which daemons will be enabled
- vtysh.conf - Configuration file for vtysh (CLI interface to FRR). It has to exist for the container to start but can be empty.
Mainly, we want to set the router identity (hostname, its own ASN, address, and so forth) and then set the parameters needed for establishing a BGP session with the speaker pods that will run within the OpenShift cluster: grouping them, passing their IP addresses, setting their ASN, enabling BFD, setting some BGP timers, and so on.
In the Appendix 1 section, you can see the content of the files in detail and find explanations of the settings that need to be adjusted in them.
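For orientation before jumping to the Appendix, here is a minimal sketch of what the router-side frr.conf boils down to, reconstructed from the verification outputs shown later in this article (router ASN 64521, cluster ASN 64520, the two worker addresses as neighbors in a peer-group named metallb, keepalive 3 and hold time 15). Treat it as illustrative rather than the exact file:

```shell
# Write a minimal upstream-router frr.conf (illustrative sketch).
cat > $HOME/frr/frr.conf << 'EOF'
hostname ice-lab-01.lab.local
router bgp 64521
 bgp router-id 192.168.133.1
 neighbor metallb peer-group
 neighbor metallb remote-as 64520
 neighbor metallb bfd
 neighbor metallb timers 3 15
 neighbor 192.168.133.71 peer-group metallb
 neighbor 192.168.133.72 peer-group metallb
 address-family ipv4 unicast
  neighbor metallb activate
 exit-address-family
EOF
```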
Now the external router container can be started.
$ tree /home/frr/
/home/frr/
├── daemons
├── frr.conf
└── vtysh.conf
$ sudo podman run -d --rm -v /home/frr:/etc/frr:Z --net=host --name frr-upstream --privileged quay.io/frrouting/frr
bdf2b6a9ebe087acc7df8b57f86dbf7e3f253f5df66ced360a18d2d40574b8f1
One final comment about the external router setup: the environment should allow BGP (179/TCP) and BFD (3784/UDP and 3785/UDP) communications to go through in case there is a firewall between the OpenShift nodes and the router, which is especially applicable in multi-hop architectures.
The external router is ready.
MetalLB
The following steps install the MetalLB operator and MetalLB itself and configure them. We will reserve some addresses for the LoadBalancer services and finally pair MetalLB with the external router created in the previous section.
STEP 1. Install the MetalLB operator
$ oc get packagemanifest | grep metal
metallb-operator Red Hat Operators 10m
$ cat << _EOF_ | oc apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
spec: {}
_EOF_
namespace/metallb-system created
$ cat << _EOF_ | oc apply -f -
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: metallb-operator
  namespace: metallb-system
spec:
  targetNamespaces:
  - metallb-system
_EOF_
operatorgroup.operators.coreos.com/metallb-operator created
$ cat << _EOF_ | oc apply -f -
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: metallb-operator-sub
  namespace: metallb-system
spec:
  name: metallb-operator
  channel: "stable"
  source: redhat-operators
  sourceNamespace: openshift-marketplace
_EOF_
subscription.operators.coreos.com/metallb-operator-sub created
STEP 2. Check the install plan is approved and the installation progress
$ oc get installplan -n metallb-system
NAME CSV APPROVAL APPROVED
install-brb9w metallb-operator.4.10.0-202203081809 Automatic true
$ oc get csv -n metallb-system -o custom-columns='NAME:.metadata.name, VERSION:.spec.version, PHASE:.status.phase'
NAME VERSION PHASE
metallb-operator.4.10.0-202203111548 4.10.0-202203111548 Succeeded
STEP 3. Create a MetalLB custom resource and check the controller deployment
Once the operator is fully installed, we can proceed to deploy the MetalLB instance.
The MetalLB custom resource represents the MetalLB instance itself, and there can only be one resource of this kind in the cluster. When it is created, MetalLB is deployed: a controller Deployment and a speaker DaemonSet are instantiated.
If the custom resource is deleted, MetalLB is removed from the cluster. The MetalLB operator needs to be explicitly uninstalled afterwards.
In this step, it is possible to select which nodes will run the speaker pods via spec.nodeSelector and, if needed, to add tolerations to the speaker DaemonSet via spec.speakerTolerations. The speakers will typically target worker or infra nodes. More details on the MetalLB custom resource can be found in the official documentation.
$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
_EOF_
metallb.metallb.io/metallb created
Check that the controller deployment is up. This pod is in charge of assigning IP addresses, from the address pools that will be created next, to the LoadBalancer services that get requested.
$ oc get deployment controller -n metallb-system
NAME READY UP-TO-DATE AVAILABLE AGE
controller 1/1 1 1 91s
$ oc get deployment -n metallb-system controller -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-03-17T12:42:31Z"
  generation: 1
  labels:
    app: metallb
    component: controller
  name: controller
  namespace: metallb-system
  ownerReferences:
  - apiVersion: metallb.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MetalLB
    name: metallb
    uid: 10fe2c8f-7122-43ee-a735-b3062caf97e6
  resourceVersion: "464444"
  uid: 3bcc9c38-0489-4005-9fcc-ad6341461e5f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 3
[…]
$ oc get pods -n metallb-system
NAME READY STATUS RESTARTS AGE
controller-5bcbccf6d4-lhp95 2/2 Running 0 71s
metallb-operator-controller-manager-654df86cc5-szk96 1/1 Running 0 4m
speaker-bt8z2 6/6 Running 0 71s
speaker-czhc5 6/6 Running 0 71s
$ oc get ds -n metallb-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
speaker 2 2 2 2 2 node-role.kubernetes.io/worker= 110s
STEP 4. Create an Address Pool
As mentioned before, the AddressPool custom resource will tell MetalLB which external IP addresses are valid to be assigned to a LoadBalancer service.
These addresses can be specified as a subnet range, or individual addresses like the example below.
To avoid collisions, these IP addresses should be available and reserved for this use only. It is also important that the addresses in the pool do not collide with the OpenShift nodes’ network (network.machineNetwork[].cidr).
The protocol has to be set to bgp. The other option for protocol is layer2 (explained in a previous section), which relies on the ARP protocol for IPv4 and the NDP protocol for IPv6 to advertise the addresses.
$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: address-pool-bgp
  namespace: metallb-system
spec:
  addresses:
  - 192.168.155.150/32
  - 192.168.155.151/32
  - 192.168.155.152/32
  - 192.168.155.153/32
  - 192.168.155.154/32
  - 192.168.155.155/32
  autoAssign: true
  protocol: bgp
_EOF_
addresspool.metallb.io/address-pool-bgp created
More examples can be found in the official documentation.
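As noted above, the addresses can also be expressed as a range (or a CIDR subnet) instead of individual /32 entries. The same pool could equivalently be written as this sketch:

```shell
cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: address-pool-bgp
  namespace: metallb-system
spec:
  addresses:
  # One contiguous range instead of six /32 entries.
  - 192.168.155.150-192.168.155.155
  autoAssign: true
  protocol: bgp
_EOF_
```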
STEP 5. Create a BFD profile
The BFDProfile custom resource holds the configuration for the BFD protocol, where the administrator can tune, for instance, intervals, echo and passive modes, and so forth. The profile in the example below sets some basic values in order to pair with another FRR instance. The meaning of each of these parameters can be found in the official documentation and also in FRR documentation for BFD.
$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: test-bfd-prof
  namespace: metallb-system
spec:
  detectMultiplier: 37
  echoMode: true
  minimumTtl: 10
  passiveMode: true
  receiveInterval: 35
  transmitInterval: 35
_EOF_
bfdprofile.metallb.io/test-bfd-prof created
STEP 6. Create a BGPPeer resource
The last step is to create the BGPPeer resource, which configures the speaker pods, passing to the FRR container within each pod its own ASN, the remote peer's ASN, and the remote peer's IP address.
This custom resource accepts some configuration that is worth bringing up:
- If the eBGP peer is multiple hops away, spec.ebgpMultiHop has to be set to true.
- The BFD profile to use via spec.bfdProfile.
- Which subset of nodes running speaker pods should establish a session with this particular BGP peer via spec.nodeSelector.
- The hold time via spec.holdTime and the keepalive interval via spec.keepaliveTime.
- More parameters are available in the official documentation.
$ cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-test
  namespace: metallb-system
spec:
  bfdProfile: test-bfd-prof
  myASN: 64520
  peerASN: 64521
  peerAddress: 192.168.133.1
_EOF_
bgppeer.metallb.io/peer-test created
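For reference, a hypothetical peer definition combining several of the options listed above (multi-hop, custom timers) could look like the sketch below. The name peer-example and the timer values are illustrative, not part of this article's setup:

```shell
cat << _EOF_ | oc apply -f -
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-example
  namespace: metallb-system
spec:
  myASN: 64520
  peerASN: 64521
  peerAddress: 192.168.133.1
  # Required when the eBGP peer is more than one hop away.
  ebgpMultiHop: true
  # Custom BGP session timers.
  holdTime: 90s
  keepaliveTime: 30s
_EOF_
```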
At this point, the DaemonSet for the speaker pods should have the new peer configuration set and be establishing the sessions with the external router. We will review this configuration within the speaker pods in the following section.
Verification
BGP and BFD session status
Now that the environment is up and running, let's verify it is behaving as expected.
First, I will check that there are valid BGP sessions established within the speaker pods, taking speaker-6jsfc as an example.
The BGP state should be Established, and BFD status should be Up.
$ oc -n metallb-system exec -it speaker-6jsfc -c frr -- vtysh -c "show ip bgp neighbor"
BGP neighbor is 192.168.133.1, remote AS 64521, local AS 64520, external link
Hostname: ice-lab-01.lab.local
BGP version 4, remote router ID 192.168.133.1, local router ID 192.168.133.71
BGP state = Established, up for 04:20:09
Last read 00:00:00, Last write 00:00:03
Hold time is 15, keepalive interval is 5 seconds
Configured hold time is 90, keepalive interval is 30 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
AddPath:
IPv4 Unicast: RX advertised IPv4 Unicast and received
IPv6 Unicast: RX advertised IPv6 Unicast
Route refresh: advertised and received(old & new)
Address Family IPv4 Unicast: advertised and received
Address Family IPv6 Unicast: advertised
Hostname Capability: advertised (name: ice-ocp4-worker-0.lab.local,domain name: n/a) received (name: ice-lab-01.lab.local,domain name: n/a)
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: False
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 120
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
IPv6 Unicast:
F bit: False
End-of-RIB sent: No
End-of-RIB sent after update: No
End-of-RIB received: No
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 1 1
Notifications: 0 0
Updates: 1 1
Keepalives: 3122 5204
Route Refresh: 2 0
Capability: 0 0
Total: 3126 5206
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
Update group 3, subgroup 3
Packet Queue length 0
Community attribute sent to this neighbor(all)
Inbound path policy configured
Route map for incoming advertisements is *192.168.133.1-in
0 accepted prefixes
For address family: IPv6 Unicast
Not part of any update group
Community attribute sent to this neighbor(all)
Inbound path policy configured
Route map for incoming advertisements is *192.168.133.1-in
0 accepted prefixes
Connections established 1; dropped 0
Last reset 04:20:10, Waiting for peer OPEN
Local host: 192.168.133.71, Local port: 37226
Foreign host: 192.168.133.1, Foreign port: 179
Nexthop: 192.168.133.71
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on Write thread: on FD used: 22
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:04:19:58
We can also check the FRR instance running within the speaker pod and its configuration.
$ oc -n metallb-system exec -it speaker-bt8z2 -c frr -- vtysh -c "show running"
Building configuration...
Current configuration:
!
frr version 7.5
frr defaults traditional
hostname ice-ocp4-worker-0.lab.local
log file /etc/frr/frr.log informational
log timestamp precision 3
service integrated-vtysh-config
!
router bgp 64520
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 192.168.133.1 remote-as 64521
neighbor 192.168.133.1 bfd profile test-bfd-prof
neighbor 192.168.133.1 timers 30 90
!
address-family ipv4 unicast
neighbor 192.168.133.1 activate
neighbor 192.168.133.1 route-map 192.168.133.1-in in
exit-address-family
!
address-family ipv6 unicast
neighbor 192.168.133.1 activate
neighbor 192.168.133.1 route-map 192.168.133.1-in in
exit-address-family
!
route-map 192.168.133.1-in deny 20
!
route-map 192.168.133.1-out permit 1
!
ip nht resolve-via-default
!
ipv6 nht resolve-via-default
!
line vty
!
bfd
profile test-bfd-prof
detect-multiplier 37
transmit-interval 35
receive-interval 35
passive-mode
echo-mode
minimum-ttl 10
!
!
end
Similarly, checking the external router's FRR container should display an equivalent output for each peer:
$ sudo podman exec -it frr-upstream vtysh -c "show bgp summary"
IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.133.1, local AS number 64521 vrf-id 0
BGP table version 2
RIB entries 1, using 184 bytes of memory
Peers 2, using 1433 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.133.71 4 64520 9 12 0 0 0 00:00:27 1 1 N/A
192.168.133.72 4 64520 15 23 0 0 0 00:00:58 1 1 N/A
Total number of neighbors 2
$ sudo podman exec -it frr-upstream vtysh -c "show ip bgp neighbor"
BGP neighbor is 192.168.133.71, remote AS 64520, local AS 64521, external link
Hostname: ice-ocp4-worker-0.lab.local
Member of peer-group metallb for session parameters
BGP version 4, remote router ID 192.168.133.71, local router ID 192.168.133.1
BGP state = Established, up for 04:35:09
Last read 00:00:03, Last write 00:00:03
Hold time is 15, keepalive interval is 3 seconds
Configured hold time is 15, keepalive interval is 3 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
Extended Message: advertised
AddPath:
IPv4 Unicast: RX advertised and received
Long-lived Graceful Restart: advertised
Route refresh: advertised and received(old & new)
Enhanced Route Refresh: advertised
Address Family IPv4 Unicast: advertised and received
Address Family IPv6 Unicast: received
Hostname Capability: advertised (name: ice-lab-01.lab.local,domain name: n/a) received (name: ice-ocp4-worker-0.lab.local,domain name: n/a)
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: True
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 120
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 2 2
Notifications: 0 2
Updates: 2 2
Keepalives: 5550 3330
Route Refresh: 0 2
Capability: 0 0
Total: 5554 3338
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
metallb peer-group member
Update group 2, subgroup 2
Packet Queue length 0
NEXT_HOP is always this router
Community attribute sent to this neighbor(all)
0 accepted prefixes
Connections established 2; dropped 1
Last reset 04:35:35, No AFI/SAFI activated for peer
Local host: 192.168.133.1, Local port: 179
Foreign host: 192.168.133.71, Foreign port: 37226
Nexthop: 192.168.133.1
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on Write thread: on FD used: 26
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:04:34:58
BGP neighbor is 192.168.133.72, remote AS 64520, local AS 64521, external link
Hostname: ice-ocp4-worker-1.lab.local
Member of peer-group metallb for session parameters
BGP version 4, remote router ID 192.168.133.72, local router ID 192.168.133.1
BGP state = Established, up for 04:35:09
Last read 00:00:03, Last write 00:00:03
Hold time is 15, keepalive interval is 3 seconds
Configured hold time is 15, keepalive interval is 3 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
Extended Message: advertised
AddPath:
IPv4 Unicast: RX advertised and received
Long-lived Graceful Restart: advertised
Route refresh: advertised and received(old & new)
Enhanced Route Refresh: advertised
Address Family IPv4 Unicast: advertised and received
Address Family IPv6 Unicast: received
Hostname Capability: advertised (name: ice-lab-01.lab.local,domain name: n/a) received (name: ice-ocp4-worker-1.lab.local,domain name: n/a)
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: True
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 120
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 2 2
Notifications: 0 2
Updates: 2 2
Keepalives: 5550 3330
Route Refresh: 0 2
Capability: 0 0
Total: 5554 3338
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
metallb peer-group member
Update group 2, subgroup 2
Packet Queue length 0
NEXT_HOP is always this router
Community attribute sent to this neighbor(all)
0 accepted prefixes
Connections established 2; dropped 1
Last reset 04:35:35, No AFI/SAFI activated for peer
Local host: 192.168.133.1, Local port: 179
Foreign host: 192.168.133.72, Foreign port: 50294
Nexthop: 192.168.133.1
Nexthop global: ::
Nexthop local: ::
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on Write thread: on FD used: 27
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:04:34:58
The BFD sessions can also be inspected in more detail with the following command.
$ sudo podman exec -it frr-upstream vtysh -c "show bfd peers"
BFD Peers:
peer 192.168.133.71 local-address 192.168.133.1 vrf default interface virbr2
ID: 2094083731
Remote ID: 1259657737
Active mode
Status: up
Uptime: 2 day(s), 2 hour(s), 58 minute(s), 5 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: dynamic
Local timers:
Detect-multiplier: 3
Receive interval: 300ms
Transmission interval: 300ms
Echo receive interval: 50ms
Echo transmission interval: disabled
Remote timers:
Detect-multiplier: 37
Receive interval: 35ms
Transmission interval: 35ms
Echo receive interval: 50ms
peer 192.168.133.72 local-address 192.168.133.1 vrf default interface virbr2
ID: 1112508781
Remote ID: 2587821932
Active mode
Status: up
Uptime: 1 day(s), 5 hour(s), 38 minute(s), 0 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: dynamic
Local timers:
Detect-multiplier: 3
Receive interval: 300ms
Transmission interval: 300ms
Echo receive interval: 50ms
Echo transmission interval: disabled
Remote timers:
Detect-multiplier: 37
Receive interval: 35ms
Transmission interval: 35ms
Echo receive interval: 50ms
Creating a New Service to Test MetalLB
To understand how a new service of type LoadBalancer will behave, let's create a simple example service using a hello-node deployment and verify that MetalLB is working as expected.
$ oc new-project test-metallb
Now using project "test-metallb" on server "https://api.t1.lab.local:6443".
[…]
$ oc create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname
deployment.apps/hello-node created
$ cat << __EOF__ | oc apply -f -
---
apiVersion: v1
kind: Service
metadata:
  name: test-frr
spec:
  selector:
    app: hello-node
  ports:
    - port: 80
      protocol: TCP
      targetPort: 9376
  type: LoadBalancer
__EOF__
service/test-frr created
The brand-new LoadBalancer service is healthy: it got the first external IP from the defined address pool, it has the right endpoint, and it is being announced via BGP from both worker nodes at the same time.
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-frr LoadBalancer 172.30.169.126 192.168.155.150 80:30194/TCP 33s
$ oc describe svc test-frr
Name: test-frr
Namespace: test-metallb
Labels: app=hello-node
Annotations: <none>
Selector: app=hello-node
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 172.30.169.126
IPs: 172.30.169.126
LoadBalancer Ingress: 192.168.155.150
Port: <unset> 80/TCP
TargetPort: 9376/TCP
NodePort: <unset> 30194/TCP
Endpoints: 10.131.1.165:9376
Session Affinity: None
External Traffic Policy: Cluster
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal nodeAssigned 60s metallb-speaker announcing from node "ice-ocp4-worker-0.lab.local"
Normal IPAllocated 57s metallb-controller Assigned IP ["192.168.155.150"]
Normal nodeAssigned 56s metallb-speaker announcing from node "ice-ocp4-worker-1.lab.local"
From the external router's point of view, the route to the new service is properly learned via BGP from the two speaker pods running on the worker nodes, and the service is reachable at its external IP from other nodes in the network.
$ sudo podman exec -it frr-upstream vtysh -c "show ip route"
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/425] via 192.168.3.254, br0, 00:01:55
C>* 192.168.3.250/32 is directly connected, br0, 00:01:55
K>* 192.168.3.254/32 [0/20425] is directly connected, br0, 00:01:55
C>* 192.168.4.0/24 is directly connected, vlan1001, 00:01:55
C>* 192.168.122.0/24 is directly connected, virbr0, 00:01:55
C>* 192.168.133.0/24 is directly connected, virbr2, 00:01:55
B>* 192.168.155.150/32 [20/0] via 192.168.133.71, virbr2, weight 1, 00:01:14
* via 192.168.133.72, virbr2, weight 1, 00:01:14
$ sudo podman exec -it frr-upstream ip r
default via 192.168.3.254 dev br0 proto static metric 425
192.168.0.0/24 dev virbr1 proto kernel scope link src 192.168.0.254 linkdown
192.168.3.254 dev br0 proto static scope link metric 20425
192.168.4.0/24 dev vlan1001 proto kernel scope link src 192.168.4.2 metric 400
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
192.168.133.0/24 dev virbr2 proto kernel scope link src 192.168.133.1
192.168.155.150 nhid 575 proto bgp metric 20
nexthop via 192.168.133.71 dev virbr2 weight 1
nexthop via 192.168.133.72 dev virbr2 weight 1
$ curl -l 192.168.155.150
hello-node-78bd88f59b-btbpc
When creating a LoadBalancer service, several options are available to control the external IP assignment: requesting a specific IP address for the service, requesting any IP address from a specific pool, or even sharing one external IP among several services.
For example, let's re-create the service, this time requesting a specific IP from the pool by setting spec.loadBalancerIP.
$ oc delete svc/test-frr
service "test-frr" deleted
$ cat << __EOF__ | oc apply -f -
---
apiVersion: v1
kind: Service
metadata:
name: test-frr
annotations:
metallb.universe.tf/address-pool: address-pool-bgp
spec:
selector:
app: hello-node
ports:
- port: 80
protocol: TCP
targetPort: 9376
type: LoadBalancer
loadBalancerIP: 192.168.155.151
__EOF__
service/test-frr created
Yet again, the service is being announced properly and is reachable externally at its new IP address, 192.168.155.151.
$ oc describe svc/test-frr
Name: test-frr
Namespace: test-metallb
Labels: <none>
Annotations: metallb.universe.tf/address-pool: address-pool-bgp
Selector: app=hello-node
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 172.30.140.200
IPs: 172.30.140.200
IP: 192.168.155.151
LoadBalancer Ingress: 192.168.155.151
Port: <unset> 80/TCP
TargetPort: 9376/TCP
NodePort: <unset> 31605/TCP
Endpoints: 10.131.0.5:9376
Session Affinity: None
External Traffic Policy: Cluster
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal nodeAssigned 35s metallb-speaker announcing from node "ice-ocp4-worker-0.lab.local"
Normal IPAllocated 35s metallb-controller Assigned IP ["192.168.155.151"]
Normal nodeAssigned 34s metallb-speaker announcing from node "ice-ocp4-worker-1.lab.local"
$ curl -l 192.168.155.151
hello-node-78bd88f59b-btbpc
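The remaining option mentioned earlier, sharing one external IP among several services, is requested through the metallb.universe.tf/allow-shared-ip annotation: services carrying the same sharing key may be colocated on the same IP as long as their ports do not collide. A minimal sketch (the second service's name and the sharing key are hypothetical; the same annotation would also need to be added to test-frr):

```yaml
---
# Hedged sketch: a second service sharing 192.168.155.151 with test-frr.
# Name and sharing key are assumptions; ports must differ between the services.
apiVersion: v1
kind: Service
metadata:
  name: test-frr-alt
  annotations:
    metallb.universe.tf/allow-shared-ip: "frr-sharing-key"  # same key on every sharing service
spec:
  selector:
    app: hello-node
  ports:
  - port: 443        # must not collide with port 80 used by test-frr
    protocol: TCP
    targetPort: 9376
  type: LoadBalancer
  loadBalancerIP: 192.168.155.151  # same external IP as test-frr
```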
For troubleshooting details, check out the official documentation.
Conclusions
MetalLB continues to mature as a project, and its BGP mode offers a novel way to statelessly load balance traffic into OpenShift cluster services using standard routing facilities instead of a regular load balancer network device.
As we observed throughout this article, MetalLB helps achieve an experience substantially similar to the one we would get from a public cloud provider, but on an on-premises platform. Moreover, the even distribution of traffic can also help achieve higher availability, resiliency, and performance.
Both the Red Hat OpenShift Container Platform documentation and the MetalLB upstream documentation point out some limitations of MetalLB in BGP mode, which should be kept in mind when architecting or implementing the solution. The main one is how BGP handles a peer going down: the active connections associated with that node are redistributed to the remaining nodes, potentially breaking stateful connections. Faster failure detection via BFD can help speed up this transition.
Moving forward, other interesting configurations could be attempted, like using multi-hop network topology, IPv6 stack, and so on.
Appendix 1: FRR Configuration Files
These are the annotated FRRouting configuration files needed to start the external router pod.
The important parameters to adjust in the frr.conf configuration file are indicated below.
frr version 8.0.1_git
frr defaults traditional
hostname frr-upstream 👈[1]
!
debug bgp updates
debug bgp neighbor
debug zebra nht
debug bgp nht
debug bfd peer
log file /tmp/frr.log debugging
log timestamp precision 3
!
interface virbr2 👈[2]
ip address 192.168.133.1/24 👈[3]
!
router bgp 64521 👈[4]
bgp router-id 192.168.133.1 👈[5]
timers bgp 3 15 👈[6]
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor metallb peer-group
neighbor metallb remote-as 64520 👈[7]
neighbor 192.168.133.71 peer-group metallb 👈[8]
neighbor 192.168.133.71 bfd 👈[9]
neighbor 192.168.133.72 peer-group metallb
neighbor 192.168.133.72 bfd
!
address-family ipv4 unicast
neighbor 192.168.133.71 next-hop-self 👈[10]
neighbor 192.168.133.71 activate 👈[11]
neighbor 192.168.133.72 next-hop-self
neighbor 192.168.133.72 activate
exit-address-family
!
line vty
[1] hostname <NAME>: the router hostname, frr-upstream.
[2] interface <DEV>: the interface name that is in the same subnet as the OpenShift worker nodes.
[3] ip address <IP/PREFIX>: External host IP address and prefix, 192.168.133.1/24.
[4] router bgp <ASN>: pick the ASN for the external router, 64521.
[5] bgp router-id <IP>: pick the IP for the external router host, 192.168.133.1.
[6] timers bgp 3 15: BGP keepalive interval (3 secs) and hold time (15 secs). They can be adjusted to your needs.
[7] neighbor metallb remote-as <ASN>: the remote (MetalLB) ASN, 64520.
[8] neighbor <IP> peer-group metallb: each OpenShift node that runs a speaker pod should be identified as a neighbor. I also mark these peers as part of the peer-group metallb.
[9] neighbor <IP> bfd: Enable BFD with the neighbor in question.
[10] neighbor <IP> next-hop-self: tells FRR to set its own address as the next hop on routes it advertises to this neighbor.
[11] neighbor <IP> activate: enables the IPv4 unicast address family for this neighbor, so routes are exchanged with it (required here because no bgp default ipv4-unicast is set).
For more details on FRR BGP configuration, check their documentation.
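For context, the MetalLB side of this peering mirrors the values annotated above. A rough sketch of the corresponding BGPPeer resource (the resource name is an assumption):

```yaml
# Hedged sketch of the MetalLB BGPPeer matching the FRR config above.
# The name "frr-upstream" is an assumption.
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: frr-upstream
  namespace: metallb-system
spec:
  myASN: 64520                 # MetalLB ASN, matching [7] above
  peerASN: 64521               # external router ASN, matching [4] above
  peerAddress: 192.168.133.1   # external router IP, matching [3] and [5] above
```

A spec.bfdProfile field referencing a BFDProfile resource would additionally enable BFD on this session, matching the neighbor <IP> bfd lines in the FRR configuration.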
In the daemons file, we only need to ensure that the bgpd and bfdd daemons are enabled.
# This file tells the frr package which daemons to start.
#
# Sample configurations for these daemons can be found in
# /usr/share/doc/frr/examples/.
#
# ATTENTION:
#
# When activating a daemon for the first time, a config file, even if it is
# empty, has to be present *and* be owned by the user and group "frr", else
# the daemon will not be started by /etc/init.d/frr. The permissions should
# be u=rw,g=r,o=.
# When using "vtysh" such a config file is also needed. It should be owned by
# group "frrvty" and set to ug=rw,o= though. Check /etc/pam.d/frr, too.
#
# The watchfrr, zebra and staticd daemons are always started.
#
bgpd=yes 👈
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=yes 👈
fabricd=no
vrrpd=no
pathd=no
#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
#
vtysh_enable=yes
zebra_options=" -A 127.0.0.1 -s 90000000"
bgpd_options=" -A 127.0.0.1"
ospfd_options=" -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options=" -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options=" -A 127.0.0.1"
pimd_options=" -A 127.0.0.1"
ldpd_options=" -A 127.0.0.1"
nhrpd_options=" -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options=" -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options=" -A 127.0.0.1"
fabricd_options="-A 127.0.0.1"
vrrpd_options=" -A 127.0.0.1"
pathd_options=" -A 127.0.0.1"
# configuration profile
#
#frr_profile="traditional"
#frr_profile="datacenter"
#
# This is the maximum number of FD's that will be available.
# Upon startup this is read by the control files and ulimit
# is called. Uncomment and use a reasonable value for your
# setup if you are expecting a large number of peers in
# say BGP.
MAX_FDS=1024
# The list of daemons to watch is automatically generated by the init script.
#watchfrr_options=""
# To make watchfrr create/join the specified netns, use the following option:
#watchfrr_options="--netns"
# This only has an effect in /etc/frr/<somename>/daemons, and you need to
# start FRR with "/usr/lib/frr/frrinit.sh start <somename>".
# for debugging purposes, you can specify a "wrap" command to start instead
# of starting the daemon directly, e.g. to use valgrind on ospfd:
# ospfd_wrap="/usr/bin/valgrind"
# or you can use "all_wrap" for all daemons, e.g. to use perf record:
# all_wrap="/usr/bin/perf record --call-graph -"
# the normal daemon command is added to this at the end.
About the author
Mauro Oddi is a Senior Cloud Success Architect with more than 10 years of experience in the Red Hat product portfolio, and has been helping customers within the EMEA region since 2017. Oddi focuses primarily on emerging technologies like Red Hat OpenShift, Red Hat OpenStack and Red Hat Ceph Storage.