We recently interviewed several Red Hat OpenShift customers to better understand their path to migrating from OpenShift Container Platform 3.0 to 4.0 and highlight some of their challenges and field-tested best practices. What follows is a condensed version of that discussion, followed by a Q&A, edited for clarity.
Matthew Sweikert - moderator, Red Hat
Morgan Peterman - moderator, Red Hat
Nhat Duong - Software Engineer, Cigna
Josh Dziedzic - Software Engineer, Cigna
Walid Shaari - Analytics Platform Support Engineer, Saudi Aramco
Nhat, if you wouldn't mind, maybe talk a little bit about how many clusters you are managing? What is the size of those clusters and maybe some generic information about some of the workloads that are on there?
I've been an engineer at Cigna on the OpenShift team for about three or four years now, and Josh Dziedzic is also on the team. At Cigna, we started out with OpenShift 3 in 2016. We started with a very small cluster, but we had this objective of getting a good size of applications onto OpenShift and onboarding, and working through all the potholes that we'd run into to deliver.
Before we migrated to OpenShift 4, we had four clusters: two in non-production – which is your development, your testing, your QA clusters – and two clusters in production, and they were pretty sizable at the end of OpenShift 3. I think the clusters were about 100 to 200 nodes each, and they were pretty beefy virtual machines. We're talking like 24 to 32 virtual cores, and some of them were 128 gigs of RAM. Some of them were 256 gigs of RAM. I think towards the end of the OpenShift 3 life, we had probably 200 sizable production applications or namespaces running, and probably two or three times as many in the non-production environment.
Walid, tell us a little bit about your environment.
We started around 2015 and we are mostly on-prem. We needed a Kubernetes distribution that we could install on-prem, and we started with a proof of concept. We started with a small cluster consisting of nine nodes: three infrastructure, three control plane, and three worker nodes. From this, at the end of the year, we moved to a pilot with the same sizing. We had this range of applications and we didn't know which ones were important and which ones were not.
Today, we have four clusters. One is dedicated to what we call IR 4, which is machine learning. It's around 16 nodes. Another one is for development initiatives, which is nine nodes, and two more are for quality assurance and testing. However, I believe that we are going to expand the number of clusters because of security and governance: some of the clusters need data encryption, some require certain applications not to be hosted with others, and things like this. So we are thinking about what the best choice is: should we increase the nodes or should we increase the clusters?
So while you're here, Walid, maybe you could talk a little bit about your journey. What was planning like? What was the timeline like? How did you guys execute? How did you approach app teams to migrate?
The journey was bumpy. We are a large enterprise with some regulation and lots of security controls. Internet access is restricted if not disconnected, and we have to get approvals for everything. We started spreading awareness of containers around 2016, basically by doing workshops. So our journey started with simple awareness workshops for different departments, for different application support teams, for operations, and for the local community. We were very active in the community at that time, and that's how we spread the word about containers and Kubernetes. After I got my certification in 2017, one of my supervisors asked me, "Why don’t we implement it? Why shouldn't we implement OpenShift, and basically, let people try it out?" He said, "The moment we have it, people will come and start using it."
I wanted to document the rules and the security controls and it was impossible to justify it. I worked with the security team. We were able to push it around. I worked with Red Hat. They provided us with free workshops for a couple of weeks, for development, for operations, for security. The proof of concept helped big time. The proof of concept was a sandbox, there was no regulation whatsoever. So people can pull images from a local registry, they have to tell us what images they want. I pull them offline, upload them to the internal OpenShift registry, even though we didn't have Nexus at that time.
The pilot was more controlled, more restricted to certain projects for capacity reasons. And later on, the organization noticed that we do need an enterprise Kubernetes distribution because of the IR 4 activities and the IR 4 initiative. That is a big initiative in my company that basically requires lots of machine learning, requires the use of GPUs, and stuff like that. So OpenShift streamlines these kinds of activities, especially with the GPU operator.
So the journey in the beginning was bumpy. Now, it's still bumpy because of the different mindset and the different technologies. We chose the bimodal Gartner approach that we would like to host only modern applications on this platform and whatever is legacy, let it be, but don't try to mix. But we cannot fight this, and are expected to integrate with Oracle databases, with document processing, workflows, and with things that will not integrate straightforwardly, or the developers think that the integration should be different from the OpenShift way or the Kubernetes way.
So you find there's kind of a culture hurdle, not just a technical hurdle.
Yes. The team is very small, just two or three people from time to time, depending on the assignments. So we don't have time to listen to the customers. We don't have time to communicate back with the customers about their needs, their requirements, and where they want to go, or to hold office hours or awareness sessions where we can improve the communication and set expectations, because our users' expectation is that this will solve everything, which is not true. I mean, especially when we are in a constrained environment, you have certain limits. I cannot, for example, create a test environment that will exactly mimic production. So yes, the culture is basically the number one challenge.
Oh, thank you for sharing. I was going to ask Josh the same question. What was the journey from 3 to 4 like? Did you kind of have a similar issue with the culture that you had to work with internally?
Yeah, so I think Nhat and I would definitely agree that most of the challenges, while there are some technical, are largely cultural. With any large enterprise, more often than not, process gets in the way and internal processes for how projects are managed and how funding is allocated to different areas. So we actually started just before the pandemic, really. In February of 2019, we started building our first clusters on OpenShift 4. And we began to socialize amongst our customers that OpenShift 4 was coming and they needed to start planning a migration for their applications to move over, to kind of get things moving a little bit more because we didn't really have a funding source to provide application teams to move.
What we basically had to do was force their hand. So we began stopping the creation of new projects on OpenShift 3, blocking deployments from happening. So right around May and July of 2019 is when we started to really lock down any changes or new applications moving to OpenShift 3, and we set a mandate that everything going forward would be OpenShift 4. So as you can imagine, that takes a significant amount of time. We had hundreds, if not thousands of applications running on OpenShift 3.
And as time goes on, development teams move on, and they don't have the resources allocated to make some of the changes, even though it's not very significant, I would argue, in a lot of cases. Getting teams to put those resource hours onto moving stuff was challenging. And we had some great project managers that helped us along the way. And we actually just finally shut down OpenShift 3 about a month ago with the last applications being migrated off probably about two to three months ago. So it took a significant amount of time to move the entire workload, but we got there.
Whoever wants to answer this question is fine.
What kind of Red Hat resources did you leverage to help you out, from the people such as your account teams or additional services that you picked up, or the documentation on redhat.com or docs.openshift.com?
A little bit of everything you just mentioned. So on our OpenShift 3 to 4 journey, we knew at Cigna that we wanted to get onto OpenShift 4 just to simplify the environment, make sure we keep up to date, and make sure we're getting all the great new stuff that Red Hat and the OpenShift community is pumping out with Kubernetes. Having Red Hat services, the architects, the engineers that are actually tied to our team, understanding our environment, and what we deal with day in and day out was really great at the time when we were moving because OpenShift 3 to 4 is not an in-place upgrade. This was also a good opportunity to take a look back at our architecture, see if we wanted to change anything, and then leverage those Red Hat architects and the knowledge base to see what the best practices are.
For example, in OpenShift 3, we only had one router or router shard for a huge cluster. And it worked, but sometimes it made us a little bit nervous because it was such a big blast radius. In OpenShift 4, we had the opportunity to say, "Hey, new pattern." Every cluster gets at least three Ingress router shards, so application teams know which shard they are on when we onboard them instead of having to be moved over later. We just tell them, "Hey, you're going to start off on Ingress 3 or something," and that's a lot better than having to migrate them over, so that was a huge plus.
Using the knowledge base is great because you guys have a community of knowledge, and TAM services are great because when we go to you with a question, the first thing I ask is, "Hey, have any of your other customers experienced this?" Or "Hey, is it safe to upgrade to a certain version or to do this?" And usually, there's a pretty good knowledge base behind that or a good reason to do or not to do something. Red Hat’s consulting services, knowledge base, the engineers, architects, and the TAM services were really good from my point of view for Cigna.
Great. Walid, what kind of Red Hat resources did you leverage?
So we worked with our Red Hat account manager, starting with two workshops, one for operators and one for developers, and a series of conference calls with different entities from operations, from security, from everyone, even outside of our business unit to figure out what's the best route, and things like this.
OpenShift 4 was a must for us. I mean, we can see the differences. We were on CRI-O, not on Docker, when we installed OpenShift 3. But we saw so many challenges and so many differences, and our migration was like reprovisioning because we were in pilot and we didn't have many production applications that we needed to move ourselves.
The developers had to back up and restore whatever they wanted in terms of manifest, in terms of data. Mostly we relied on discussions with the Red Hat team.
Great. So let's talk a little bit about the challenges. Walid, what challenges did you really run into with the migration and how did you get past them?
The first migration from 3 to 4 was less challenging. The challenge was, mostly, that we were provisioning under a new environment, VMware Cloud Foundation, for which we didn't really have many skills yet. We had just started the journey there and we started with OpenShift 4. Before OpenShift 4.6, there was a requirement for DHCP. We couldn't get DHCP working. We worked with VMware. And the subnets we were using were mixed, so it was a hurdle. Then, we used the helper node to help us create ISOs and stuff like that to install and provision.
So we spent like one week just on the DHCP issue, before settling on the helper node and the CoreOS ISO maker. We also wanted to control the provisioning of namespaces. We didn't want it to be open like before. So we created a template for annotations and labels so that we know exactly which namespace belongs to whom, and whether there are any issues. For example, in our first migration we didn't do a direct migration, but now I'm doing a direct migration from one infrastructure to another. And one question that comes up is who owns this project and who should I contact?
So now, I just look at the annotation and I know who to contact. And if I'm migrating an application, I want to make sure that there are no crash loops or everything is hunky-dory or everything is working fine.
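The annotation and crash-loop checks described above could be sketched roughly as follows in Python. This is only a minimal illustration: the annotation keys (`example.com/owner`, `example.com/contact`) are hypothetical, since the interview does not name the ones actually used, and the inputs are plain dictionaries shaped like Kubernetes namespace and pod objects.

```python
from typing import Dict, List

# Hypothetical annotation keys; the real template is not described in the interview.
REQUIRED_ANNOTATIONS = ("example.com/owner", "example.com/contact")


def missing_ownership(namespace: Dict) -> List[str]:
    """Return the required annotation keys that a namespace object lacks."""
    annotations = namespace.get("metadata", {}).get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


def crash_looping_pods(pods: List[Dict]) -> List[str]:
    """Return names of pods with a container waiting in CrashLoopBackOff."""
    failing = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = (status.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                failing.append(pod["metadata"]["name"])
                break  # one failing container is enough to flag the pod
    return failing
```

Run before and after moving a project, a check like this answers both questions from the discussion: who to contact, and whether the migrated workload is actually healthy.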
One of the other challenges, not even related to migration, is moving from one storage type to another and trying to keep the same names.
Just as a follow-up here, do you feel like your migration from 3 to 4 is also an opportunity to maybe mature all of your smaller processes and other things at the same time?
What I want to emphasize is that it's not just a single migration from 3 to 4. You might need to migrate again for a different reason, such as migrating to a cloud or migrating to a different infrastructure, and you will face some challenges again. Think about the migration like a disaster recovery, but from a different container runtime, storage, or cloud provider, and think about what you need to have. We didn't have an application mobility tool. We didn't have cluster management. So if you ask me now, “what do I need to overcome the challenges?” I need to really plan it ahead of time and assume I would do a migration at least once a year.
So again, the answer is definitely yes.
Q & A
Question 1: What hurdles did the application team experience and how did they overcome those hurdles?
The first thing was the API. At that time, there were no tools to warn you about API deprecation or the differences between APIs and stuff like this. The second thing is that they have instructions when it comes to storage, like you have to keep the persistent volume claim with the same name, and you have to make sure that there is a PV that will be bound and stuff like that. Other than that, we do the onboarding. It was not really a self-service platform yet, so basically...
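The kind of warning the application teams were missing can be approximated with a small scan over manifests. The Python sketch below is an assumption-laden illustration, not a description of any Red Hat tool: the lookup table covers only a handful of API versions that upstream Kubernetes removed in 1.16, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Illustrative subset of (apiVersion, kind) pairs removed in upstream
# Kubernetes 1.16, mapped to the apiVersion that replaced them.
REPLACEMENTS = {
    ("extensions/v1beta1", "Deployment"): "apps/v1",
    ("apps/v1beta1", "Deployment"): "apps/v1",
    ("apps/v1beta2", "Deployment"): "apps/v1",
    ("extensions/v1beta1", "DaemonSet"): "apps/v1",
    ("extensions/v1beta1", "NetworkPolicy"): "networking.k8s.io/v1",
}


def find_outdated(manifests: List[Dict]) -> List[Tuple[str, str, str]]:
    """Return (name, old apiVersion, suggested apiVersion) for outdated manifests."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REPLACEMENTS:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], REPLACEMENTS[key]))
    return findings
```

Changing the `apiVersion` string is not always sufficient, because required fields can differ between versions, so a scan like this only tells a team where to look.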
From the Cigna side, I would say the biggest asset or the biggest concept that Cigna did really well on the application team side or our customer side is that we developed a pretty active community of OpenShift and Kubernetes subject matter experts for the application teams. We have a community of over a thousand users keeping an eye on it. And the idea is that maybe at the start of our journey in 2016, probably 0% of the company knew anything about Kubernetes, but over time, as people got onto the platform, we want to make sure that if somebody was going to come onto the platform or be migrating from 3 to 4, instead of having to come directly to the platform team, there was an opportunity to leverage the Cigna community, right, to answer some basic questions.
And that's what alleviated a lot of the pain from our side and allowed us to scale to the amount of customers that we had. It was kind of like a snowball effect where somebody teaches somebody else and they learn it. And now, they have the same question and now they're answering their questions. Instead of us, a small team of probably 10 engineers, having to manage a thousand different applications, they were able to basically come together as a community and support themselves, which is really helpful.
I think the other thing that we did early on is we recognized that the applications carried a lot of tech debt with them from OpenShift 3, things that weren't architected well to live in an environment like OpenShift and Kubernetes. A lot of app teams kind of took their apps and just dumped them into a container and called it good enough and ran one pod, and anytime we'd do maintenance, they would start screaming that there was an outage. So we kind of gated entry to production with tech reviews with some of our lead engineers. That allowed us to catch some of the stuff they were doing that's not right or good for a cloud platform, catch it early on, and get them to correct it. So we would review their deployments with them, how they were architecting their app, secret management, stuff like that. It gave us a chance to eliminate some of that tech debt that we had hanging around.
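One item from such a tech review, the single-pod application that goes down during every maintenance window, is easy to check mechanically. The Python sketch below assumes Deployment objects are available as plain dictionaries; the function name is hypothetical and the real review covered far more than this.

```python
from typing import Dict, List


def single_replica_deployments(deployments: List[Dict]) -> List[str]:
    """Return names of Deployments that run fewer than two replicas."""
    flagged = []
    for deployment in deployments:
        # Kubernetes defaults spec.replicas to 1 when the field is omitted.
        replicas = deployment.get("spec", {}).get("replicas", 1)
        if replicas < 2:
            flagged.append(deployment["metadata"]["name"])
    return flagged
```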
Question 2: How did you approach automating post installation configuration of a cluster?
We recognized early on in OpenShift 3 that we ended up with a lot of what we referred to as “snowflake clusters,” where we made one-off changes, whether it be with Ansible or to YAML directly, and they got lost. They didn't always make it between clusters. We ended up with a non-prod that really didn't look like production. So very early on, we recognized that going to OpenShift 4, we had to mandate automation end-to-end. Everything about our clusters is committed to code and is managed from the infrastructure side through Terraform.
And then when we get to our day-two operations, configuring identity management, storage, monitoring, whatever kind of stuff you would configure after the clusters are up and running, we do all of that using Argo CD. That's now packaged as OpenShift GitOps. It makes it so that any changes that we make, we commit to source code once. Argo lives on each one of our clusters, immediately realizes that those changes happened, and updates our clusters. We do absolutely everything that way, right down to even machine configs for the nodes.
Same as Josh, basically. For example, previously, when developers wanted to modify their applications, they didn't write back to the manifest. Instead, they edited the objects directly. This meant they lost what they edited because it was not in the manifest anymore. To fix that, we keep manifests for the clusters' day-two operations like NTP, DNS, certain configurations, Cron, and authentication. We just moved to Terraform for provisioning the nodes. By the end of next month, we should have everything fully automated.
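The core idea behind this workflow is comparing what is committed in Git with what is live on the cluster. A minimal Python sketch of that comparison follows; it is not how any particular GitOps controller is implemented, and the function name and sample objects are hypothetical.

```python
from typing import Dict, List, Optional


def drift(desired: Dict, live: Optional[Dict], path: str = "") -> List[str]:
    """Return dotted paths where the live object differs from the desired one.

    Only fields present in `desired` are checked, so fields the cluster
    adds on its own (status, defaults) do not count as drift.
    """
    diffs = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key) if isinstance(live, dict) else None
        if isinstance(want, dict):
            diffs.extend(drift(want, have if isinstance(have, dict) else {}, here))
        elif want != have:
            diffs.append(here)
    return diffs


# Hypothetical example: someone scaled the live object by hand.
desired = {"spec": {"replicas": 3, "template": {"image": "app:1.2"}}}
live = {"spec": {"replicas": 1, "template": {"image": "app:1.2"}}, "status": {"ready": 1}}
print(drift(desired, live))
```

A one-off manual change of the kind that produced snowflake clusters shows up here as a non-empty list, which is the signal a reconciler would act on by reapplying the committed state.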
Question 3: If you had to go back, what would you do differently?
The biggest thing that bothered me was just the timing of certain architectural decisions that we made. And the context for that is we, as Cigna, had a demand to run and manage, for OpenShift 4, let's say, 10 or 20 clusters. We had to go out there and do our own homegrown automation and engineering to get a system where we could manage 20 clusters, but then Red Hat came out with Advanced Cluster Management a couple of months after we made some pretty important architectural decisions that we can't go back on without a lot of effort.
And I would say the biggest thing that I would have done differently is I would've probably talked to Red Hat a little bit more just to understand the roadmap and the timelines for some of that stuff to see, "Hey, if we waited out maybe three or six more months, we could have been on the same page as Red Hat and not have to diverge," and then have this whole effort of, "Hey, when do we jump back on the Red Hat ship for some of this architectural stuff?"
When it comes to things like OpenStack integration for UPI or IPI deployments, or ATM, or some of the image registry, the decisions that we made are leading to a lot of operational issues.
One of our challenges is that we don't have a global load balancer. When we migrated the users, the API endpoints, and the routes, all the cluster names changed. It would have been better if we had a global load balancer so that we could keep those things I mentioned at universal URLs or endpoints. I am still thinking about this. The other thing is to start with the application, and as Nhat said, invest in some tooling like Red Hat Advanced Cluster Management, or an application mobility tool, like Kasten, that will help us with the transformation.
I think some of our biggest regrets just come from maybe being a little too bleeding edge to meet timelines. I think it's very hard to go back on some of the decisions we made and take advantage of some of these technologies because of the disruption it would cause to implement them.
Question 4: How has OpenShift 4 improved how you guys handle your operations over OpenShift 3?
Some of the biggest changes are in automation. The way we do things now with GitOps and Terraform is significantly different from the way we managed things with Playbooks and Ansible. Upgrades are certainly much easier than they were in the past. Now, we just click the button in the UI and we know that the version we're coming from is a good candidate to go to the version that we're going to. All that dependency mapping is done for us. We don't have to guess and hope that we can upgrade to the next version. We're getting to the point now where not only do we manage all of our operations through GitOps and Argo, we've made an instance available to the application development teams, and we're starting to standardize and enforce that deployments be done through Argo. We had a lot of teams that were just writing shell scripts to call oc commands. There wasn't really a whole lot of repeatability to it, so we've gotten rid of that.
I wanted to just reiterate what Josh said about those Ansible playbooks. We had 200-node clusters where those playbooks would sometimes take more than 24 hours to run. You fire the playbooks and pray. At the end of the run, it might or might not work.
And the biggest peace of mind going from Ansible to the way that the nodes and the control plane are managed in OpenShift 4 is that now I have a better sense at night that, "Hey, if we misconfigured something or something happens to a VM, we can now use Terraform and take it out the back and destroy it and recreate it." And that has worked for most of the scenarios that we've seen where a node behaves poorly. We'll spend a couple of minutes looking into it, gathering logs, and sending them to Red Hat to see what's wrong, but after that it's being destroyed and recreated. So that's the biggest peace of mind that I've gotten from 3 to 4, in my opinion.
What I like about OpenShift 4 is the OperatorHub. The ecosystem of the OperatorHub is very holistic. I don't think anybody can cover everything that is available on OperatorHub these days. I mean, it's growing, and it's very easy for me to install OpenShift Data Foundation, for example. I can install serverless easily. The web console is a big hit with our customers, even though it's in tech preview. So getting software installed, provisioned, supported, and upgraded is straightforward.
The operational excellence via the operator model is awesome.
The OpenShift community, including OpenShift TV, Christian Hernandez’s GitOps series, and other free resources, helps us learn how to do things in OpenShift 4, on top of all the webcasts and RHLS training material. For example, now, if I want to do a migration, there is a migration course on RHLS using the MTC. And this is one of the things that I said was challenging for us, because you need a mobility tool. So there is an open source mobility tool that comes out of the OperatorHub, and there is a training course for it.
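The dependency mapping mentioned above amounts to finding a route through a graph of supported version-to-version updates. The Python sketch below shows that idea with a breadth-first search; the version numbers and edges are made up for illustration and do not reflect any real OpenShift update graph.

```python
from collections import deque
from typing import Dict, List, Optional


def upgrade_path(edges: Dict[str, List[str]], start: str, target: str) -> Optional[List[str]]:
    """Return the shortest chain of supported updates from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical update graph: each key lists the versions it may update to.
EDGES = {"4.6.1": ["4.6.8"], "4.6.8": ["4.7.0"], "4.7.0": []}
print(upgrade_path(EDGES, "4.6.1", "4.7.0"))
```

When no chain of supported updates exists, the function returns None, which corresponds to the case where the current version is not yet a good candidate for the target.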
Question 5: Do you provide persistent storage in OpenShift? If so, what is the backend type, and what challenges did you have, if any, during the migration between OpenShift 3 to 4?
I am of the opinion that application teams should not rely on any persistent backup storage on OpenShift. And what I mean by that is, "Hey, don't..." Even though there are operators out there to run databases and to persist your data and to store long-term data on OpenShift, don't do it. You want to rely on an enterprise service for now, right? That's outside of OpenShift, just because of... I don't think that storage and Kubernetes are at a place right now where I can sleep pretty soundly at night because I've had those calls where you wake up and your volume can't mount to your pod for some reason.
And you're just sitting there trying to figure out what storage engineer to call or how to get a hold of a person to dig into a problem that's very complicated at the hypervisor level. So what we tell every single application team is, "Hey, you can use PVCs and whatnot on OpenShift, but if there's a disaster, don't expect that data to be backed up."
We do use SolidFire from NetApp and the Trident Operator to manage the mounting, unmounting, and creation of those PVCs. So we do make it available. We have different qualities of service that we offer through storage classes, and plenty of application teams do take advantage of it, but we do make recommendations around the persistence of that data.
For ReadWriteOnce, we provide vSphere 10 and vSphere CSI drivers. For ReadWriteMany, which we are against, but some of our developers really love, we steer them towards NFS.
I think one of the biggest things we learned as we moved between 3 and 4 was that it was much better to have lots of smaller size clusters than the gigantic monolith clusters that we had built in 3. We had some clusters that were in excess of 200 nodes and it made upgrades, and things like that very challenging. By moving to many smaller size clusters, there's been a significant advantage to how we manage the platform.
If anybody out there is still on OpenShift 3 and they're trying to look for a reason to go to 4, consider this: operationally, it's a lot simpler and it's going to give you an opportunity to re-architect and do all the things that we talked about, such as fixing tech debt and implementing GitOps and automation throughout. It's worth it to push to OpenShift 4 just because of all the new features that you're eventually going to fall behind on over time. Definitely worth it to upgrade.