What happens in an OpenShift update vs. an upgrade? Red Hat OpenShift Container Platform releases changes regularly, and deploying a new version can be a bit scary: no one wants to push changes to an environment that might break something. This week on Ask an OpenShift Admin, we answered questions about what happens during an update vs. an upgrade, “pet” vs. “cattle” clusters and why we still need upgrade paths, the process of upgrading between minor versions, and how to visualize Red Hat’s recommended updates with the Red Hat OpenShift Container Platform Update Graph.

As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!

If you’re interested in more streaming content, please subscribe to the Red Hat livestreaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord, Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.

Episode 53 recorded stream:


Use this link to jump directly to where we start talking about today’s topic. 

This week’s top of mind topics:

  • Following up from last year, the Log4Shell vulnerability is still in the news and something we should be cognizant of. If you haven’t evaluated the impact of CVE-2021-44228 on your OpenShift clusters and applications, we highly encourage you to do that now!
  • The second topic was resizing nodes in a deployed OpenShift cluster. You can do this, though the method differs depending on the type of node (control plane vs. compute), the infrastructure type, and the deployment method you used. We talked about it during this segment, but you can see an example with AWS/Azure here.
  • As an extension of the previous topic, we talked about using “hot-add” to give more resources to the RHEL CoreOS virtual machines. While this does add resources to the virtual machines, unfortunately the kubelet doesn’t recognize the new capacity without a restart (see the first sketch after this list), so the usefulness of this approach is somewhat limited.
  • Another option for adding resources is to create a new MachineSet, or to modify an existing one and scale it up. In the latter case, you may want to change the deletion policy so the original nodes are removed first, leaving only machines with the new configuration (see the second sketch after this list).
  • Finally, we wanted to highlight that the docs team is working hard to improve the overall experience with search. The first step in this process is to redirect - and sometimes de-index - old documentation pages that are no longer relevant. You can also configure a custom search engine for your browser (e.g. Chrome) to search directly against specific versions of the OpenShift documentation. For OpenShift 4.9, this would be the same as typing in “<search term> site:https://docs.openshift.com/container-platform/4.9/” to get results that only apply to that version of OpenShift.
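
On the hot-add point above, a minimal sketch of how the kubelet restart could look, assuming a hypothetical node named worker-0 (consider draining the node first if workloads are sensitive to disruption):

```bash
# Restart the kubelet so it re-reads the machine's capacity after a
# CPU/memory hot-add; "worker-0" is a placeholder node name.
oc debug node/worker-0 -- chroot /host systemctl restart kubelet.service

# Verify that the node now reports the new capacity
oc get node worker-0 -o jsonpath='{.status.capacity}'
```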
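And for the MachineSet approach, a minimal sketch using the Machine API’s deletePolicy field; the MachineSet name and replica counts are placeholders:

```bash
# Prefer removing the oldest (original) machines when scaling down
oc patch machineset/<machineset-name> -n openshift-machine-api \
  --type merge -p '{"spec":{"deletePolicy":"Oldest"}}'

# After modifying the instance size in the MachineSet, scale up so new,
# larger machines are created...
oc scale machineset/<machineset-name> -n openshift-machine-api --replicas=6

# ...then scale back down; the original machines are deleted first.
oc scale machineset/<machineset-name> -n openshift-machine-api --replicas=3
```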

Questions answered and topics discussed during the stream:

  • A viewer asked how Operators work, specifically how the Kubernetes API understands that an extension has been added and what triggers an action (see the CRD sketch after this list). We also talked about this subject in-depth during episode 41.
  • This topic was inspired by a question Jonny received offline: why do we have “pet” clusters instead of treating them all as disposable? The answer here is, ultimately, specific to each organization. Jonny highlights things like resources and security policies, but we also need to be cognizant of the application’s (and user’s) ability to move between distinct clusters. If the application cannot be seamlessly moved to a new cluster, then updating a cluster in place makes sense.
  • The product management team published a blog post which goes into detail around the update process for OpenShift 4: The Ultimate Guide to OpenShift Release and Upgrade Process for Cluster Administrators.
  • Another viewer asked about updating existing clusters and deploying new ones given the recent changes to the VMware virtual hardware version requirement. Clusters newly deployed with OpenShift 4.9 or later use virtual hardware version 15, while earlier versions of OpenShift use virtual hardware version 13. In the not-too-distant future the in-tree storage provisioner will be removed from Kubernetes; at that point the CSI driver - which requires virtual hardware version 15 or later - will be mandatory for OpenShift on VMware vSphere deployments.
  • A viewer asks, “how do we address management concerns about wanting to update less frequently despite Kubernetes and OpenShift having releases every 3-4 months?”
  • What is the scope of change between an update, e.g. 4.9.11 -> 4.9.13, versus an upgrade, e.g. 4.8.z -> 4.9.z? Upgrades, where the minor version changes, often include updates to Kubernetes and are where API changes happen. Updates, where new z-streams are applied to the current minor release, fix bugs and security issues without making API changes (see the `oc adm upgrade` sketch after this list).
  • We were asked, “does cluster upgrade time vary based on the number of nodes?” Yes, most significantly during the Machine Config Operator update, which reboots the nodes to apply RHCOS updates and other changes. This can be made faster by increasing the number of nodes updated concurrently via the MachineConfigPool’s maxUnavailable value, which can be a specific number or a percentage of the pool (see the patch sketch after this list).
  • I confused my response a little when a viewer commented that EUS -> EUS updates, as of 4.8 upgrading to (in the future) 4.10, can “skip” a reboot for the compute nodes. This was discussed in the recent “What’s Next” roadmap session: the compute node Machine Config Pools are paused while the control plane is updated, serially, through the intermediate version. Once there, the compute node Machine Config Pool is unpaused and the nodes need only a single reboot to go from, in this example, 4.8 -> 4.10 (see the pause/unpause sketch after this list).
  • What are some basic troubleshooting steps if a cluster update/upgrade gets stuck? Start with the cluster Operators overview (`oc get co`) and then investigate any which are not progressing. This may involve doing a describe of the cluster Operator to identify objects related to it, then looking at the logs for those objects (see the troubleshooting sketch after this list).
  • During the stream, we referenced sharing some info about how to trigger updates in multiple ways. Since I was unable to share my screen during the livestream, I’ve instead put the commands into this gist. Please don’t hesitate to leave a comment there with any questions!
  • We revisited the troubleshooting steps in a part two here in the stream, including how to “unstick” a stuck node reboot operation.
  • The update process today does not take into account the infrastructure or unused parts of your cluster when blocking updates/upgrades. For example, if your cluster is deployed to vSphere, but there’s a blocking bug that only affects clusters deployed to AWS, your vSphere-based cluster would still be unable to update. We talked about some details of targeted edge blocking during the stream; once implemented, clusters that aren’t affected by a blocking bug will still have the option to update. You can also see some information in the roadmap presentation from last November.
  • The last thing we talked about (and showed!) was the Upgrade Path Tool. This is a great tool for identifying the specific versions you want to update your cluster to if you’re making significant leaps. For example, you’re on an early 4.8.z and want to update to the current 4.9.z. The tool will show you exactly which 4.8.z you need to update to before updating to 4.9.
  • A viewer asked which versions are stable: `n`, `n-1`, and/or `n-2`. If you’re using the stable channel, you can expect all of them to be stable. Oftentimes the reason an update edge is blocked is the update process itself - some component may experience an issue during the update, so we wait to identify the reason and determine whether a fix is needed.
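
For the Operator question above, here’s a minimal sketch of what “adding an extension” means in practice: applying a CustomResourceDefinition causes the Kubernetes API server to serve a brand-new resource type, which the Operator’s controller then watches to trigger its actions. The group and kind names here are hypothetical:

```bash
# Registering a (hypothetical) CRD adds a new endpoint that the API
# server serves; an Operator's controller watches it and reconciles.
oc apply -f - <<EOF
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
EOF

# The new resource type is discoverable and usable immediately:
oc api-resources --api-group=example.com
```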
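To make the update vs. upgrade distinction concrete, a minimal sketch using `oc adm upgrade`; the versions are examples, and the `channel` subcommand assumes a 4.9-era `oc` client:

```bash
# Show the current version, channel, and available updates
oc adm upgrade

# Update: move to a newer z-stream within the same minor version
oc adm upgrade --to 4.9.13

# Upgrade: switch to the next minor version's channel, then update to it
oc adm upgrade channel stable-4.9
oc adm upgrade --to 4.9.0
```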
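For the maxUnavailable tuning mentioned above, a minimal patch sketch; the 10% value is an arbitrary example:

```bash
# Allow up to 10% of the worker pool's nodes to update concurrently
oc patch machineconfigpool/worker --type merge \
  -p '{"spec":{"maxUnavailable":"10%"}}'
```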
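And the pause/unpause sketch for the EUS -> EUS flow discussed above; pausing the pool holds back the compute node rollout until the control plane has moved through the intermediate version:

```bash
# Hold back compute node updates while the control plane is updated
# serially through the intermediate version
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'

# ...update the control plane through the intermediate version...

# Unpause so the compute nodes roll out with a single reboot
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":false}}'
```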
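Finally, the troubleshooting sketch referenced above; machine-config is just an example of an Operator you might need to inspect:

```bash
# Overview: look for cluster Operators that are degraded or stuck progressing
oc get co

# Inspect a suspect Operator's conditions and related objects
oc describe co machine-config

# Follow the related objects, e.g. the Operator's own logs
oc logs -n openshift-machine-config-operator \
  deployment/machine-config-operator
```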