Implementing ROSA into existing enterprise networks
In this video, Red Hat Manager of OpenShift Black Belt Thatcher Hubbard and Amazon Web Services Principal Solutions Architect Ryan Niksch dive into the intricacies of networking multiple clusters at an enterprise level using Red Hat OpenShift Service on Amazon Web Services (ROSA). The two experts weigh the pros and cons of using VPC peering vs. AWS Transit Gateway for a more manageable experience at enterprise scale.
To learn more about applying Red Hat applications for your business, please visit our Learning Hub.
Ryan Niksch (00:00):
Greetings. My name is Ryan Niksch. I am a principal solutions architect with Amazon Web Services. Joining me here today is Thatcher. Thatcher, say hello and give us a brief description of your role.
Thatcher Hubbard (00:12):
My name is Thatcher Hubbard. I'm a manager of OpenShift Black Belt with Red Hat, and our role is to help customers as they transition from on-premises OpenShift Container Platform to managed services in the cloud.
Ryan Niksch (00:26):
Right. So that's an interesting one. Managed services, I'm assuming the managed service that we're most interested in is the Red Hat OpenShift Service on AWS.
Thatcher Hubbard (00:36):
On AWS, or ROSA.
Ryan Niksch (00:37):
And this is OpenShift. Nothing really changes except there's now an SRE team doing the undifferentiated heavy lifting for you, the customer.
Thatcher Hubbard (00:45):
Yes, the platform is managed for you by Red Hat, but all the tools that you know and love, all the features of OCP, they're all still there.
Ryan Niksch (00:54):
I want to drag you into an ever-spiraling complicated journey, and it starts off with none of my customers deploy one ROSA cluster.
Thatcher Hubbard (01:06):
My favorite client.
Networking multiple clusters
Ryan Niksch (01:07):
So every customer I'm working with has a ROSA cluster for non-prod, and they have a separate ROSA cluster for production workloads. These exist in different AWS accounts: they'll have a non-prod account and they'll have a production account. They'll have AWS support attached to that, maybe developer support for the non-prod, and business or enterprise level support for the production environment. Because these are in different AWS accounts, we immediately end up with two challenges: how do I network all of this together in a larger enterprise context, and how do I monitor and get visibility on these? So let's flesh this out. I think what we're typically going to see with customers is, if they're a relatively small customer, there might be some sort of VPC peering, but at an enterprise level that doesn't work so well.
Thatcher Hubbard (02:13):
VPC peering breaks down pretty quickly in terms of manageability.
Ryan Niksch (02:15):
Yeah. It's a scale problem.
Because you're manually creating the links to and from it, it becomes an administrative nightmare.
Thatcher Hubbard (02:25):
Yeah. There's an n times n minus one scaling problem there that gets pretty ugly.
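The scaling problem being described here can be made concrete with a quick calculation: a full mesh of n peered VPCs needs n(n-1)/2 peering connections, while a transit gateway hub needs only one attachment per VPC. A minimal illustration in Python:

```python
def full_mesh_peering_links(n: int) -> int:
    """Peering connections needed for every one of n VPCs to reach every other."""
    return n * (n - 1) // 2


def tgw_attachments(n: int) -> int:
    """With a transit gateway hub, each VPC needs exactly one attachment."""
    return n


# The administrative gap widens quickly as accounts and VPCs are added.
for n in (2, 5, 10, 20):
    print(f"{n} VPCs: {full_mesh_peering_links(n)} peering links "
          f"vs {tgw_attachments(n)} TGW attachments")
```

At 20 VPCs that's 190 peering connections to create and maintain by hand, versus 20 attachments, which is the manageability cliff the speakers are pointing at.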
Ryan Niksch (02:28):
So what are you seeing as the alternative to that in most enterprise customers?
Thatcher Hubbard (02:33):
It is, not surprisingly, the AWS Transit Gateway product, or TGW as we often call it, which, rather than being a peering arrangement, is more like your own little virtualized router that you get to configure in many of the ways that you would configure a router.
Ryan Niksch (02:49):
You come from a previous networking background, where-
Right. Likewise myself. So, when you say I have my own personal routing layer, that is super shiny. So let's slap down a transit gateway over here. To couple with that, many of our customers are not just building a transit gateway, they are also building in a shared services VPC-type environment in there as well. So we have a transit gateway in the environment, or just a TGW.
Thatcher Hubbard (03:20):
Right. I'll go ahead and draw that.
Ryan Niksch (03:22):
You mentioned that it was a router. So we're going to end up with a routing table over here.
And that defines the routes this transit gateway can interact with.
Thatcher Hubbard (03:34):
Right. And then we'll talk about how those get in there in a minute, when we go to actually attach these other environments that we're building.
Ryan Niksch (03:43):
So in this case, non-prod and prod, they don't necessarily talk to each other, but customers do want to separate them. Is there ever a situation where the OpenShift cluster in this account needs to interact with the OpenShift cluster in that account?
Thatcher Hubbard (03:58):
If someone were to show me a diagram of their organization that was labeled thus, non-prod and prod, I would say no, probably not. But often what you might see is different parts of a larger organization might have their own cluster. Those two clusters could have interdependencies on each other for services. I think that's not uncommon.
Ryan Niksch (04:14):
We might see it if there is another business unit with another OpenShift cluster. So you could end up with a separate implementation of ROSA that is again, production-
But where the services, the application workloads are interacting with each other.
Thatcher Hubbard (04:33):
Yeah. Service level interaction, not cluster to cluster.
Ryan Niksch (04:35):
No, not the clusters themselves. So it creates a routing dependency, if anything.
Yeah. So how do we stitch this all together?
Thatcher Hubbard (04:47):
Okay, well, we've got the TGW drawn and you've hopefully drawn the route table. Those little entries in the route table happen when each of these VPCs gets attached to the TGW. And that is how the AWS console actually presents it: as an attachment. It's an object that you create that associates a VPC with the TGW.
Ryan Niksch (05:09):
So each of these would have an attachment.
Thatcher Hubbard (05:11):
Would have an attachment.
Ryan Niksch (05:13):
And that links back to my TGW-
Thatcher Hubbard (05:18):
It creates the association. And in the case of VPCs, because it's all behind the AWS API, the default route table also knows the IP space that that VPC encompasses. So you get a route in your default route table in the TGW that describes all the addressable IPs inside that VPC.
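As a rough mental model of what Thatcher describes (hypothetical class and attachment names, not an AWS API): attaching a VPC both registers the attachment and drops that VPC's CIDR into the TGW's default route table automatically.

```python
class TransitGatewayModel:
    """Toy model of a TGW and its default route table (not an AWS API)."""

    def __init__(self):
        # destination CIDR -> attachment that traffic is forwarded to
        self.default_route_table: dict[str, str] = {}
        self.attachments: set[str] = set()

    def attach_vpc(self, attachment_id: str, vpc_cidr: str) -> None:
        # Because VPCs sit behind the AWS API, the TGW learns their IP
        # space automatically when the attachment is created.
        self.attachments.add(attachment_id)
        self.default_route_table[vpc_cidr] = attachment_id


tgw = TransitGatewayModel()
tgw.attach_vpc("tgw-attach-nonprod", "10.10.0.0/16")
tgw.attach_vpc("tgw-attach-prod", "10.20.0.0/16")
tgw.attach_vpc("tgw-attach-shared", "10.30.0.0/16")
print(tgw.default_route_table)
```

The attachment IDs and CIDRs here are placeholders; the point is just that one attachment yields one propagated route, with no per-link bookkeeping.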
Ryan Niksch (05:38):
Is that always automatic? In my experience, it's not necessarily. I could get the attachment, but I'd need to actually validate bidirectional flow. So the transit gateway would be aware of, say, the VPC; do I need to create a return route somewhere else?
Thatcher Hubbard (05:57):
You do. The responsibility for, and a lot of this I think is because transit gateways can bridge accounts, the responsibility for making sure traffic that originates inside the VPC has a route to describe how to get to the TGW rests with whoever administers that VPC.
Ryan Niksch (06:14):
So we're talking about an additional route table.
Thatcher Hubbard (06:16):
Right. Your VPC route tables need to get entries that describe how to reach other VPCs or environments.
Ryan Niksch (06:24):
We're not talking about thousands of statically managed routes. Really, we're talking about a default gateway to the TGW. The TGW needs to know about each of these and potentially anything outside.
Thatcher Hubbard (06:37):
Right. But the TGW, I described it as a router, and it's the router's job to manage all of that. Really, for each individual VPC, it's more about saying anything that lives inside maybe this big range, which is an IP address management question inside an organization, is reachable behind this interface, and the interface is the VPC attachment.
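What the two are describing for the spoke VPCs amounts to longest-prefix routing: a local route for the VPC's own CIDR, plus one broad route sending the rest of the corporate range to the TGW attachment. A sketch using Python's standard ipaddress module (the CIDRs and target names are placeholders):

```python
import ipaddress

# Hypothetical spoke-VPC route table: the VPC's own /16 stays local,
# everything else in the corporate 10.0.0.0/8 goes to the TGW attachment.
vpc_route_table = {
    "10.10.0.0/16": "local",
    "10.0.0.0/8": "tgw-attachment",
}


def next_hop(ip: str):
    """Pick a target by longest-prefix match, the way a route table does."""
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, target in vpc_route_table.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


print(next_hop("10.10.4.7"))   # inside the VPC's own CIDR: stays local
print(next_hop("10.20.1.9"))   # another spoke's range: forwarded to the TGW
```

This is why the VPC side stays simple: one broad route toward the attachment, and the TGW's own table does the per-destination work.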
Ryan Niksch (06:59):
Now, application owners, developers, and owners of an application running on top of OpenShift, they're dependent on other things. Infrastructure components such as directory services, name resolution, but there's more than that. There are-
Sharing components between owners/teams
Ryan Niksch (07:15):
Things much more relevant to app owners that are also shared. What are those and where do those exist?
Thatcher Hubbard (07:21):
Okay. So if you were to ask me to list some things off the top of my head, a really important one is often a central CI/CD platform: even if the automation runs inside the VPCs, perhaps runners or that sort of thing, oftentimes the platform itself might be shared across various dev teams.
Ryan Niksch (07:40):
What are we talking here? This is-
Jenkins of old?
Thatcher Hubbard (07:44):
Something like a GitLab or a hosted GitLab instance, something like that. That would be possible. Another one that I see a lot is a shared sort of artifact repository. So rather than just being-
Ryan Niksch (07:57):
Like an Artifactory-
Thatcher Hubbard (08:00):
Yes. The thing that most people probably jump to. And then sometimes for developers, centralized logging is also pretty valuable. Knowing that you can go one place and find the logs that your application is generating, especially if you are in a development phase.
Ryan Niksch (08:17):
You're more commonly referring to something like a third party solution, a Splunk type environment.
There's any number of them out there.
Right. Where would that typically exist? I'm assuming it's a shared services VPC within the environment, which is also connected to the transit gateway. It's called [writes ‘Shared SRU’]
Thatcher Hubbard (08:39):
And that's exactly what I would call it too.
Ryan Niksch (08:42):
Now for me, shared services, coming from an infrastructure background, I typically see things such as Active Directory and DNS.
Being in those environments. I don't see security solutions there. Those exist somewhere else. But you mentioned, I think it was two things, Artifactory.
Thatcher Hubbard (09:04):
Yeah. Some sort of artifact repository; usually that goes outside, just like a container registry.
Ryan Niksch (09:10):
Would we see container repos here as well?
Thatcher Hubbard (09:15):
You would if you had a separate one. Artifactory certainly can serve as a container repo. If we're talking about Artifactory specifically, but-
Ryan Niksch (09:22):
I typically separate them because I see my container repo as something where if you're in a non-prod context, could be something that hasn't been scrutinized.
And if we're talking about an artifact repo, it's the same ultimate object, but it's gone through an allowlist and blocklist process.
Thatcher Hubbard (09:40):
Okay. There's another thing that I would add that you might see inside a shared services VPC, especially if you are a customer who's using ROSA, which is the Red Hat Advanced Cluster Management product for fleet management. A sensible place to put that would be your shared services VPC.
Ryan Niksch (10:00):
So that is Red Hat's Advanced Cluster Manager. That creates a sort of layer of visibility over here that allows me to get visibility on all of my OpenShift clusters, irrespective of where they are?
Thatcher Hubbard (10:22):
Yes. As long as they're reachable via the TGW. Yes. It gets you visibility into metrics from the clusters. It also serves as a sort of centralized point for policy definition and enforcement.
Configuring on-premises cluster(s) with Advanced Cluster Manager
Ryan Niksch (10:34):
Let's throw a spanner in the works. So every customer I work with is a hybrid customer. We mentioned RHACM can control things that it can get to. How do I get to my on-premises self-managed OCP cluster that's also registered with RHACM?
Thatcher Hubbard (10:47):
Well, or might be where RHACM is running. That's what I love about transit gateways. Certainly you can attach VPCs to them, but there are a lot of other things you can attach to a transit gateway too. And notably for the thing you're talking about, you can land a VPN gateway, an AWS virtualized VPN gateway, directly on a TGW, and you can also attach a Direct Connect gateway directly to a TGW.
Ryan Niksch (11:14):
And I'm assuming that's going directly to customer premises, or a colo for that matter.
Thatcher Hubbard (11:17):
Right. The VPN is usually, it's a static VPN, and so it is over the public internet, but of course it goes through a tunnel. Direct Connect is actually dedicated physical links that run between a customer and an AWS location.
Ryan Niksch (11:29):
I'm commonly seeing both. I'm seeing Direct Connect for the preferred high speed connection, many customers creating a secondary VPN connection-
Thatcher Hubbard (11:39):
As a backup.
Ryan Niksch (11:39):
Slower, but as a catch all. Keep me honest here, we typically see something like an on-prem environment where we would have, what do you most commonly see? OCP? Self-managed OpenShift?
OCP sitting over there. And you mentioned that we're going to go Direct Connect, and that would come directly into the TGW, or do you more commonly see something like an ingress/egress infrastructure?
Thatcher Hubbard (12:10):
I've seen both. With static VPN, I often see them attached directly to a TGW. And that's a middle ground for organizations that maybe can't justify the expense of a Direct Connect. But notably, these are attachments. It's a different attachment type, but you get the same result as when you do a VPC attachment, which is, a route goes on the route table. I would say when you attach a VPC, because it's behind the AWS API, it automatically knows about the IP space behind that. When you attach a VPN or Direct Connect, there are a couple extra steps there to define the IP ranges that are reachable behind that attachment.
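The asymmetry Thatcher notes, VPC attachments propagating their CIDR automatically while VPN and Direct Connect attachments need you to state what lives behind them, could be sketched like this (toy model with placeholder names, not an AWS API):

```python
# Toy TGW route table: destination CIDR -> attachment (not an AWS API).
tgw_routes: dict[str, str] = {}


def attach_vpc(attachment_id: str, vpc_cidr: str) -> None:
    # VPC attachment: AWS already knows the VPC's IP space,
    # so the route appears without extra steps.
    tgw_routes[vpc_cidr] = attachment_id


def attach_vpn_or_dx(attachment_id: str, reachable_cidrs: list[str]) -> None:
    # VPN / Direct Connect attachment: the TGW cannot discover what is
    # on-premises, so whoever runs the network declares each range by
    # hand, and, as Ryan notes, usually only a selective subset.
    for cidr in reachable_cidrs:
        tgw_routes[cidr] = attachment_id


attach_vpc("tgw-attach-prod", "10.20.0.0/16")
attach_vpn_or_dx("tgw-attach-dx", ["172.16.10.0/24", "172.16.20.0/24"])
print(tgw_routes)
```

The end state is the same kind of route-table entry either way; the difference is purely who supplies the IP ranges.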
Ryan Niksch (12:50):
But likewise, an enterprise probably wouldn't expose every single address range.
They're probably a bit more selective. I just want to come back here very quickly. A lot of customers I work with are creating an ingress or egress sort of VPC, and there would typically be some sort of security layer that they control centrally there. So either making sure that everything coming in is filtered and screened, or in some cases a firewall solution that very meticulously controls what's going out or what we're exposing. So if I look at ROSA, even if it was a production private cluster, something like a PrivateLink cluster, the customer could re-expose that through whatever security device there is.
Thatcher Hubbard (13:38):
Right, even in a VPC environment that has no NAT gateways, no internet gateways, no way to get to the public internet itself, the solution there, as you said, is this ingress/egress VPC. Where the transit gateway becomes the default hub in those cases for those isolated VPCs. And then the transit gateway knows to route traffic for the public internet out via the ingress/egress VPC. And like you said, there might be products, right? There might be firewalls running there. So a variety of things. Usually done for security purposes. Usually quite a lot of complicated configuration. It makes much more sense to isolate that in one space to make sure all that security configuration is there and present and can be audited rather than trying to distribute it over potentially a number of VPCs depending on the size of the organization.
Ryan Niksch (14:30):
Now all of these contexts that we've spoken about, non-prod, prod, different accounts, different VPCs, on premises, that could all be in a single AWS region. What if I had a customer who was either trying to get closer to their customer and they've got workloads running in different regions, or somebody who's exploring something like a multi-region DR sort of context? Long and short, what if there was another region that we needed to select?
Thatcher Hubbard (15:00):
Right. If we assume that all of this is roughly in the same place, say us-east-1, what if somebody had something overseas and they wanted connectivity to it? It's important to note, transit gateways are associated with a specific AWS region when you create them. It's not necessarily super explicit, especially if you look in the GUI, but that is the case. That said, transit gateways support what's called gateway peering, and we could create another TGW in another region and create a peering relationship between them that allows for routing of traffic between anything that is attached to either of these TGWs. So things in region one could reach things in region two this way, and vice versa.
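One detail worth knowing if you go down this road: unlike VPC attachments, routes across a TGW peering attachment are static; each side has to be told what lives behind the other region's TGW. A toy sketch of that bookkeeping (hypothetical names, not an AWS API):

```python
class RegionalTgw:
    """Toy model of one region's TGW route table (not an AWS API)."""

    def __init__(self, region: str):
        self.region = region
        self.routes: dict[str, str] = {}  # destination CIDR -> attachment


def peer(tgw_a: "RegionalTgw", cidr_a: str,
         tgw_b: "RegionalTgw", cidr_b: str) -> None:
    # Peering provides the connectivity, but each TGW needs static
    # routes pointing the other region's ranges at the peering attachment.
    attachment = f"tgw-peer-{tgw_a.region}-{tgw_b.region}"
    tgw_a.routes[cidr_b] = attachment
    tgw_b.routes[cidr_a] = attachment


use1 = RegionalTgw("us-east-1")
euw1 = RegionalTgw("eu-west-1")
peer(use1, "10.0.0.0/10", euw1, "10.64.0.0/10")
print(use1.routes)
print(euw1.routes)
```

Carving each region out of a planned supernet, as in the /10s above, keeps those static routes down to one entry per side.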
Ryan Niksch (15:43):
You would essentially have everything here duplicated in the other region as well?
Thatcher Hubbard (15:50):
Potentially. If you were talking about, like you said, getting closer to your customers, maybe that would be somewhere and all of your services, but you wouldn't necessarily need to.
Ryan Niksch (16:01):
Correct. I think it is very application-specific. What are the application tolerances in terms of latency, those sorts of things.
Also, do I want to have these as hard and fast separations? Do they need to be completely independent of each other? The other element is Advanced Cluster Manager that we have over here. Would you typically see a second implementation of that in the other region or would I just have one RHACM implementation managing everything?
Thatcher Hubbard (16:32):
I would think that because it is more of a monitoring and back plane tool, I would say that it's probably fine to run a single one in the network. Obviously you want some resiliency plan for RHACM, even if you're running one instance. But there's no reason why if you had a big enough fleet, you couldn't actually have... If you had business units that were largely operated independently, perhaps in a different country, it might make more sense to have a second RHACM depending on... That's more about business needs, I think, than technical.
Ryan Niksch (17:03):
I think it boils down to how invested or dependent your business has become on RHACM as a tool set. It's a workload that runs on top of OpenShift. If it's not there, it's not overly complicated to deploy a new RHACM instance and register things to it.
You could bring that up relatively quickly, and you may or may not be using the provisioning capabilities of it. So depending on whether you're rolling out hundreds of clusters, or what your recovery phase is, it might not be that important. What have we missed here? I think we've covered the lion's share of use cases here and connectivity options, but I feel like we may be missing one.
Thatcher Hubbard (17:44):
Okay. Well, something that's not necessarily a use case, but I do think is important to note: we've talked about the default route table that gets created here by the TGW. It's an option to just have every attachment use the default route table, so everything can see everything, or at least knows where it lives. And it's important to note that this doesn't bypass the security groups or network ACLs that exist in each VPC. You still have the ability to do very fine-grained control over which traffic is allowed where. It's really about what's routable. TGWs also allow you to create and associate multiple route tables, and those route tables are associated at the attachment level. So for example, if a VPC down here had no reason to even know about things that existed in region two, you might have that VPC attachment only really know about this data center and the shared services VPC. Those would be the only IP ranges in that route table.
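The segmentation idea Thatcher sketches, giving an attachment its own route table so a VPC only "knows about" a subset of the network, can be modeled like this (hypothetical names and ranges, not an AWS API):

```python
# Each TGW route table is just a set of reachable destinations; associating
# a table with an attachment scopes what that attachment can route to.
route_tables = {
    "rt-default": {"10.10.0.0/16", "10.20.0.0/16",
                   "10.30.0.0/16", "172.16.0.0/12"},
    # Non-prod only needs the on-prem data center and shared services.
    "rt-nonprod": {"172.16.0.0/12", "10.30.0.0/16"},
}
associations = {
    "tgw-attach-prod": "rt-default",
    "tgw-attach-nonprod": "rt-nonprod",
}


def reachable(attachment: str) -> set:
    """Destinations routable from an attachment's associated route table."""
    return route_tables[associations[attachment]]


# Prod sees everything; non-prod cannot even route to prod's range.
print("10.20.0.0/16" in reachable("tgw-attach-prod"))
print("10.20.0.0/16" in reachable("tgw-attach-nonprod"))
```

As the conversation stresses, this is routing-level scoping only; security groups and network ACLs still apply on top of it.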
Ryan Niksch (18:46):
But then what we're talking about is we are creating a transit gateway specific route table here at the attachment level?
And then defining specifically what this non-prod environment should interact with.
And I think it is a very valid call out. We are talking about just connectivity and flow of routing. This is not security related. This is not name resolution. Those are layers that stack on top of this.
Thatcher Hubbard (19:11):
Right. It's literally just what's reachable and known by any given VPC based on what's in its associated route table. And you make a good point; we want to be really clear. The route tables that exist inside the VPC are still separate from that. They still need to know how to reach the TGW. The route table we're talking about is at the attachment level, which is much like an interface on a router, and how you can have virtual route tables that are assigned to different interfaces. If you come from a router background, that's a normal thing.
Ryan Niksch (19:43):
Now I think the other thing that is noteworthy here is, is everybody handcrafting these through the AWS console? No, they're not.
Thatcher Hubbard (19:51):
No, they're not. Hopefully not.
Ryan Niksch (19:54):
They're more likely utilizing something like AWS Organizations, and Organizations allows them to vend these accounts as they are being created. Whether they're non-prod accounts or production accounts, they can define how those accounts link to their billing interfaces. They can specify things such as who the owners are for that. So a lot comes from the organizational management side of things.
Then there is AWS Service Catalog, and Service Catalog really could just be a catalog of infrastructure-as-code templates that defines what each of these VPCs looks like, what this transit gateway implementation looks like, all the attachments to that transit gateway. And a lot of these are simply reusable building blocks, so that as I extend this empire, I am not needing to resort to manual-
Process; it's reusable. From an OpenShift perspective, there's a couple of things on the infrastructure-as-code side as well. I see a lot of customers scripting out the ROSA CLI. I've seen some customers wrap the ROSA CLI in Terraform. And there's a lot of interesting things coming down the pipeline in terms of capability of both ROSA as well as RHACM and OCM. So when you combine all of these things together, there's a lot of potential for customers to get to a zero-touch environment.
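As one small example of the kind of ROSA CLI scripting Ryan mentions, a wrapper could assemble the `rosa create cluster` invocation from parameters. The flag names below are assumptions based on the ROSA CLI and worth verifying against `rosa create cluster --help`:

```python
def rosa_create_cluster_cmd(name: str, region: str,
                            private_link: bool = False) -> list:
    """Build a ROSA CLI invocation as an argument list.

    Flag names are assumptions; check `rosa create cluster --help`
    for the version of the CLI you are running.
    """
    cmd = ["rosa", "create", "cluster",
           "--cluster-name", name,
           "--region", region]
    if private_link:
        # Assumed flag for a PrivateLink (private) cluster.
        cmd.append("--private-link")
    return cmd


print(" ".join(rosa_create_cluster_cmd("prod-east", "us-east-1",
                                       private_link=True)))
```

A script like this could then be invoked from Terraform, a pipeline, or a Service Catalog product, which is how customers tend to inch toward the zero-touch setup described above.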
Thatcher Hubbard (21:24):
Right. And that's certainly, from a product perspective, where we want to get people: where they could pull something out of the service catalog and not just get a VPC and the attachment and the supporting services they need, but potentially a ROSA cluster itself as part of instantiating an entry in the service catalog.
Ryan Niksch (21:44):
Talk to me about, and I might be stepping outside of my bounds here, but in terms of Red Hat Advanced Cluster Manager, it's able to manage the workloads on OpenShift clusters, move them around, do configurations on the clusters. Does that replace something like Red Hat Ansible, or do they complement each other?
Thatcher Hubbard (22:01):
They can complement each other. I would say that for organizations that aren't Ansible users today, Ansible is a configuration management tool. It's very capable. If you have significant Ansible experience, there's very good tooling to hook Ansible into OpenShift and ROSA. And you could easily use that if you already had the experience. But if you aren't already an Ansible user, I would say Red Hat Advanced Cluster Manager would be the place to start with that level of remote management and configurability.
Ryan Niksch (22:36):
Now Thatcher, you and I can sit here and talk about customer use case and architecture for the better part of eight weeks. So I think let's not do that. As always, it's fantastic having you here. Thank you for joining me.
Thatcher Hubbard (22:48):
Thanks for having me.
Ryan Niksch (22:49):
Thank you for joining us.