What’s your Recovery Time Objective (RTO) and Recovery Point Objective (RPO)? Do you have a disaster recovery plan for both OpenShift and the applications? Not sure what they are or how they’re related? Let’s talk about what RPO and RTO are, how they’re similar and how they differ, and why you need to analyze your application priorities to balance resources against application availability.
In our last episode, we discussed high availability for OpenShift. Now we get to the big question: what happens if things go wrong? This week we discuss disaster recovery scenarios and strategies, including some suggestions on what is, or isn’t, important to protect and recover.
As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!
If you’re interested in more streaming content, please subscribe to the Red Hat livestreaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord, Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.
Episode 39 recorded stream:
Use this link to jump directly to where we start talking about today’s topic.
This week’s top of mind topics:
- We first discussed the recent release of Kubernetes 1.22, bringing lots of new features, capabilities, and a new release cadence (every four months instead of three). Earlier this week, the Cloud Tech Tuesday live stream focused entirely on the release, and you can also find a blog post from the OpenShift product management team, which has our perspective on the important aspects of the new Kubernetes release.
- The next topic we discussed is the difference between a release marked as “stable” in the OpenShift CI system vs a release in the stable channel. The difference primarily comes down to update / upgrade stability, but it’s important to always use a release found on console.redhat.com so that you know you’re using a supported version of OpenShift.
- Upgrades from 4.7 to 4.8 are not yet ready for the stable channel. The graph data shows a blocking BZ, but you can still update using the fast channel if desired - which is also fully supported!
- If you haven’t found it yet, the GitOps catalog, published and maintained by the container Community of Practice inside of Red Hat, has a large number of items for getting started with deploying and managing your applications using GitOps. If you’re interested in GitOps, don’t forget to watch the GitOps Guide to the Galaxy live streams every other Thursday at noon Pacific time.
- The last item this week is about OpenShift 4.7 rebasing to use Red Hat Enterprise Linux 8.4 as the basis for Red Hat CoreOS.
Questions answered and topics discussed during the stream:
- If you missed last week’s stream, we do a brief summary here. The important part is to remember that high availability usually refers to keeping an application running during a partial cluster failure, whereas disaster recovery is bringing back the application after a full cluster failure.
- Applications deployed to Kubernetes, and of course OpenShift, are often responsible for their own high availability. We often lump this under the blanket term “cloud native”. But, that doesn’t mean that we don’t want some cluster level capabilities to help them recover.
- When beginning to assess your disaster recovery requirements, it’s extremely important to understand several things: your RTO, your RPO, and the requirements of the application. Maybe the app doesn’t need anything other than a new cluster to deploy to. Maybe it needs storage replicated. You need to know those requirements in order to design and deploy appropriately.
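To make the RTO/RPO assessment concrete, here’s a minimal sketch (not from the stream, just illustrative) of the arithmetic involved: your worst-case data loss is roughly the time between backups, and your recovery time is what a restore drill actually measures. All names and numbers below are hypothetical.

```python
from datetime import timedelta

def check_dr_plan(rpo: timedelta, rto: timedelta,
                  backup_interval: timedelta,
                  measured_restore_time: timedelta) -> dict:
    """Compare a DR plan's measured characteristics against its targets.

    Worst-case data loss is the time between backups: the snapshot you
    restore from could be a full interval old when disaster strikes.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore_time <= rto,
    }

# Hypothetical example: hourly backups and a 4-hour restore drill,
# measured against a 2-hour RPO and 6-hour RTO target.
result = check_dr_plan(
    rpo=timedelta(hours=2),
    rto=timedelta(hours=6),
    backup_interval=timedelta(hours=1),
    measured_restore_time=timedelta(hours=4),
)
print(result)  # {'rpo_met': True, 'rto_met': True}
```

The point of the exercise isn’t the code - it’s that both numbers come from measurement (backup schedules and restore drills), not from hope.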
- What do you need at a disaster recovery site? Well, it depends on your RTO, RPO, and the application team’s plan. Perhaps you only need a small set of hardware compared to the primary site because some parts will be moved to a hyperscaler.
- Make sure the hardware at the DR site is appropriately configured before an event happens. This includes not just installing an operating system, but things like making sure network connections are configured, BIOS/EFI settings are correct, and so on. Finding and troubleshooting these can take a long time, which is the last thing you want during a disaster recovery scenario.
- Additionally, dependent services - like DHCP, DNS, Active Directory, and so on - also need to be available at the recovery site. Remember to order your recovery appropriately so that those services come up first.
- Does the destination cluster need to be exactly the same? Maybe. There are a lot of factors that go into the decision, and you - very importantly! - need to work with the application team to understand aspects of their scaling and other configuration.
- Once your destination cluster is up and running, you’ll want to make sure that the dependent Operators and other functionality are available for the application. For example, if you’re relying on a specific Operator to deploy your database, is it available on the DR cluster? Does the application team plan to rebuild components at the DR site? Do they need access to the artifact repository? If that repository isn’t replicated, will the rebuild time be extended in order to pull or build other dependencies?
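One way to turn that Operator question into a pre-flight check: compare what the application team says they depend on against what the DR cluster’s catalog actually offers (for example, names collected from `oc get packagemanifests`). This sketch is illustrative only - the Operator names are hypothetical.

```python
def missing_operators(required: set[str], dr_catalog: set[str]) -> set[str]:
    """Return the Operators an application needs that the DR cluster lacks."""
    return required - dr_catalog

# Hypothetical inputs: what the app team depends on, and what the
# DR cluster's Operator catalog reports as available.
required = {"postgresql-operator", "cert-manager", "amq-streams"}
available = {"cert-manager", "amq-streams", "grafana-operator"}

print(sorted(missing_operators(required, available)))  # ['postgresql-operator']
```

Running a check like this before a failover drill is much cheaper than discovering a missing Operator during an actual disaster.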
- The last thing we talked about this week, but certainly still important, is to be aware of client side changes that need to happen. For example, if you deployed to a new cluster, did the DNS name change? Do you need to update any other applications to use a different name?
- During the last segment we also talked about how the OpenShift Subscription Guide defines hot, warm, and cold disaster recovery. These are important to understand because they can affect the entitlements you need; specifically, infrastructure used for warm and cold DR does not need entitlements until it’s used.