Resilience on Red Hat OpenShift Service on AWS (ROSA)
Ryan Niksch (AWS) and Charlotte Fung (Red Hat) discuss how Red Hat OpenShift and Amazon Web Services work together to provide customers with the resilience they need for their business goals, including architecture resilience and autoscaling.
To view this video within our in-depth learning path, please visit the Getting started with Red Hat OpenShift Service on AWS (ROSA) page.
Ryan Niksch (00:00):
Greetings. Welcome. My name is Ryan Niksch. I am a Principal Solutions Architect with Amazon Web Services. Joining me here today is Charlotte, who is one of the Red Hat Managed OpenShift Black Belt team members. Charlotte, say hello.
Charlotte Fung (00:16):
Hi everyone. My name is Charlotte Fung, and I'm a Managed OpenShift Black Belt, as Ryan said.
Overview
Ryan Niksch (00:22):
Right. So what we have here, Charlotte, is an architecture diagram of the Red Hat OpenShift Service on AWS, or ROSA. And this is a very generic architecture for a public-facing cluster. So we've got these entry points where developers, customer administrators, or Red Hat’s SREs are coming in over the internet through a collection of AWS load balancers, and we can see that here on the top of the diagram. I want to quickly talk to you about some of the resilience factors of OpenShift itself and how those complement resilience on AWS. One of the things that you've got in this architecture is that OpenShift has a control plane made up of master nodes, and there are three of those. Why three? Why is it always this magical number of three, and what happens if one of those control plane nodes were to fail, whether there is something wrong from a software or hardware perspective?
Control plane
Charlotte Fung (01:32):
Thank you, Ryan, for that question. As you said, the default deployment always comes with three control plane nodes, because that's the brain. This is what controls your cluster, and you want to make sure it is always available, because if one fails, the others have to carry over the job of the one that failed.
Ryan Niksch (01:56):
So it's a continuity element. Is there also a quorum decision-making process that's being aided by there being three? So if one fails, there's no split-brain situation, and failover is facilitated correctly?
Failure detection
Charlotte Fung (02:17):
Yeah. It's more about making sure that there is a minimum of two if one does fail, so that the cluster keeps running while our SRE team detects that there's a failure in one of the control plane nodes and spins up a third. Those two can still handle all the API calls that come into the cluster and do whatever needs to be done.
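[Editor's note: a quick way to see the quorum math behind the three-node control plane is the short sketch below. The member counts and helper functions are illustrative only; they are not part of any OpenShift or etcd tooling.]

```python
# Minimal sketch of majority-quorum math, the reason a control plane
# ships with three members rather than one or two.

def quorum(members: int) -> int:
    """Votes needed for a majority (what etcd needs to keep accepting writes)."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)

for members in (1, 2, 3, 5):
    print(f"{members} members: quorum={quorum(members)}, "
          f"can lose {tolerated_failures(members)}")

# With 3 members, quorum is 2: losing one node still leaves two that can
# agree on writes (no split-brain) while SRE replaces the failed member.
# With only 2 members, quorum is also 2, so a single failure would block writes.
```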
Ryan Niksch (02:50):
And with two remaining, you don't see a performance degradation, you don't see any impact to the customer. In a traditional OpenShift or a self-managed OpenShift – so if we look at OCP, or the OpenShift Container Platform – the customer would be managing all of this. So the customer would detect this and they would be responsible for correcting that failed node. Even though OpenShift continues to work, you'd still need to fix the failure. With managed OpenShift, that's not necessarily the case. With ROSA, the customer doesn't need to worry about this.
Proactive monitoring
Charlotte Fung (03:25):
That's right, because behind the scenes, SRE is proactively monitoring your clusters, and if one fails, you won't even notice that there was a failure in one of your control planes, because the SRE team that manages the clusters will spin up another one for you.
Ryan Niksch (03:45):
Okay. So this is more a case of: I'm not being paged in the middle of the night, I'm not needing to react to tickets. I'm coming back to work and getting an email that states there was this event, these are the actions we've taken, and everything is back in place.
Charlotte Fung (03:59):
That's correct.
Ryan Niksch (04:00):
And that's exactly what I, as a business owner, am hoping for. I want to shift to the infrastructure layer, most importantly this router layer that we have over here. This is facilitating how my customers get to the actual application workloads. So my customers are coming in through a collection of load balancers. They get to the router layer, and that router layer routes traffic to my worker nodes where my pods or my applications would be running. What happens if a router fails? So let's, for example, blow up this one. How does that function from an OpenShift resilience perspective?
Router failure
Charlotte Fung (04:43):
So when a router fails – and that's why we have the two infra nodes, to account for high resiliency – traffic is automatically routed to the other router. That also gives our SREs time to spin up another, because this is still managed for you. So you really don't get to know that there was a failure. A second infra node gets spun up for you and takes over. So usually when one fails, everything gets routed to the other one that's still up and running, and you really don't feel the impact of that failure.
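[Editor's note: the failover behavior described here can be pictured with the small sketch below. The replica names and health-check logic are assumptions made for illustration; they are not the actual AWS load balancer or OpenShift router implementation.]

```python
import random

# Hypothetical router replicas on the two infra nodes; names are made up.
routers = {"router-infra-a": True, "router-infra-b": True}

def healthy_routers():
    """Return only the replicas that currently pass their health check."""
    return [name for name, healthy in routers.items() if healthy]

def route_request(request_id: int) -> str:
    """Send a request to any healthy router replica."""
    targets = healthy_routers()
    if not targets:
        raise RuntimeError("no healthy routers available")
    return f"request {request_id} -> {random.choice(targets)}"

# Simulate one router failing: traffic automatically shifts to the other,
# while (in ROSA) SRE replaces the failed infra node behind the scenes.
routers["router-infra-a"] = False
for i in range(3):
    print(route_request(i))
```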
Ryan Niksch (05:23):
And configuration-wise, the registry, the router, and the monitoring layers are actually a lot simpler than etcd or the actual controllers. So these are even easier and faster to replace should something fail. I would argue that infrastructure teams could probably replace these within a few minutes with the automation they have at their hands.
(05:48):
If we take this architecture, take what OpenShift is bringing, and combine that with AWS, am I correct in saying we would take AWS's Multi-AZ constructs and spread these across multiple Availability Zones? So put one control plane node and one infrastructure node into each AZ, as such?
Deployment
Charlotte Fung (06:09):
You are absolutely correct. So this deployment that we have here is just the default, as we said at the beginning, for anyone who wants to get started. But for production, we highly recommend that you make use of the Multi-AZ deployment, which helps you spread out your resources across three different Availability Zones. That gives you high availability and fault tolerance, such that if one AZ is down, you still have the other two AZs up and running and you don't really notice any effect. And as you said, you have one control plane node per AZ, one infra node per AZ, and one worker node per AZ, which is the minimum.
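[Editor's note: to make that minimum Multi-AZ footprint concrete, here is a small sketch spreading one node of each role across three Availability Zones. The zone names are placeholders, not a recommendation for a specific region.]

```python
# Smallest Multi-AZ footprint described above: one control plane, one infra,
# and one worker node per Availability Zone (AZ names are placeholders).
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
roles = ["control-plane", "infra", "worker"]

layout = {az: [f"{role}-{i}" for role in roles]
          for i, az in enumerate(availability_zones)}

for az, nodes in layout.items():
    print(az, "->", ", ".join(nodes))

# Losing an entire AZ removes one node of each role, but two of each remain
# in the other AZs, so the cluster keeps quorum and keeps serving traffic.
```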
Ryan Niksch (06:53):
Okay. And these Availability Zones are the closest construct that AWS has to a physical data center. They're not actually single data centers, they're collections of physical buildings, but they're the closest construct we have to a physical data center. So what you're actually saying is you're taking that OpenShift cluster and making sure that control, infrastructure, and worker nodes are spread evenly across separate physical data centers. So if I compare this to an on-premises environment, it's like me having three distinct data centers protecting against failure. And really we've got the OpenShift availability constructs complemented by the AWS availability constructs.
(07:40):
Quickly shifting to the actual worker nodes or the compute worker locations. This is where the applications are running. These are a little bit more disposable. If one of these had to fail, my expectation would be that Kubernetes would detect that and try and move the workloads around. OpenShift also has the ability to deal with that, not just from a Kubernetes workload perspective, but also from an infrastructure perspective. We've got an autoscaling mechanism built into OpenShift. How does that work?
Autoscaling
Charlotte Fung (08:16):
So with autoscaling, when you deploy your cluster, you have the option of enabling autoscaling, and you set the minimum number of worker nodes that you want, which will be at least two for a single-AZ deployment. And you can have more than two for a Multi-AZ deployment. And then, based on what your limits are for your autoscaling, when your workload begins to increase, OpenShift or ROSA will detect that you are getting more pods being spun up, and it's going to automatically scale up your worker nodes. It could be based on CPU or whatever metrics you set for your scaling, and it will scale up to the desired amount that you want, up to the max. And once the workload begins to decrease, it will also sense that and scale down automatically to your minimum.
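[Editor's note: here is a simplified sketch of the bounded scale-up/scale-down behavior Charlotte outlines. The CPU thresholds and metric are assumptions for illustration only; the actual ROSA cluster autoscaler reacts to pods that cannot be scheduled and acts through machine sets.]

```python
def desired_workers(current: int, cpu_utilization: float,
                    minimum: int = 3, maximum: int = 9) -> int:
    """Rough sketch of a bounded autoscaling decision.

    The thresholds below are made-up illustration values, not the real
    cluster autoscaler algorithm.
    """
    if cpu_utilization > 0.80:      # workload increasing: add a worker node
        current += 1
    elif cpu_utilization < 0.30:    # workload decreasing: remove a worker node
        current -= 1
    return max(minimum, min(current, maximum))   # never below min or above max

print(desired_workers(current=3, cpu_utilization=0.9))  # scales up toward the max
print(desired_workers(current=4, cpu_utilization=0.2))  # scales back toward the min
```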
Ryan Niksch (09:14):
So we have two things that are working in combination here. We have the Kubernetes Pod Scaler, which is managing the workloads on the nodes and moving them around, creating additional pods. And that is working in combination with, I believe, it is-
Charlotte Fung (09:14):
The machine...
Ryan Niksch (09:39):
... the machine sets.
Charlotte Fung (09:40):
Yeah.
Ryan Niksch (09:45):
And these machine sets are doing two things. They're scaling the compute for when we need more compute, when there's more resources required, but they're also facilitating a recovery or resilience model that, if we lose a node, it will launch a new compute node or a new AWS EC2 instance. And once that new instance comes online, then the Pod Autoscaler will then shift workloads to that, balancing out the workload. So again, we've got that resilience mechanism of OpenShift from an application or a pod perspective, combined with the infrastructure side coming through from that machine set. And this is all in a single region taking advantage of OpenShift and multiple Availability Zones and scaling constructs.
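[Editor's note: the machine-set behavior Ryan summarizes can be pictured as a small reconcile loop that holds the worker count at the desired number of replicas and replaces any node that fails. This is a hedged sketch; the node names and replacement logic are illustrative, not the actual machine-api implementation.]

```python
import itertools

# Illustrative machine-set-style reconcile loop: keep the worker count at the
# desired number of replicas, replacing failed nodes with new instances.
_new_ids = itertools.count(1)

def reconcile(nodes: dict, desired_replicas: int) -> dict:
    """Drop failed nodes and launch replacements until the count matches."""
    healthy = {name: state for name, state in nodes.items() if state == "Running"}
    while len(healthy) < desired_replicas:
        name = f"worker-replacement-{next(_new_ids)}"
        healthy[name] = "Running"   # stands in for launching a new EC2 instance
        print(f"launched {name}")
    return healthy

nodes = {"worker-a": "Running", "worker-b": "Failed", "worker-c": "Running"}
nodes = reconcile(nodes, desired_replicas=3)
print(sorted(nodes))
# Once the replacement node is ready, the pod autoscaler and scheduler can
# shift workloads onto it, balancing things out as described above.
```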
(10:35):
Charlotte, thank you for joining me. As always, a fantastic pleasure to work with you, and thank you for joining us here.