How do you know if something bad is happening in your cluster? How do you know that a node is down, an application isn’t responding, or the storage backing a PVC has “disappeared”? If your answer to any of those is “when the users tell us there’s an error”, then it may be time to reevaluate your monitoring and alerting strategy.

Fortunately, OpenShift has built-in tools for doing just this. With only a small amount of work you can ensure that you’re receiving the proper alerts and warnings so that you can, hopefully, avoid any sticky situations. This week we are joined by Brian Gottfried, from Red Hat Consulting, to focus on Alertmanager: how to configure it, and how to customize its settings so that you get neither too many alerts nor too few.

As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!

If you’re interested in more streaming content, please subscribe to the OpenShift.tv streaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord or Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.

Episode 31 recorded stream:

Use this link to jump directly to where we start talking about today’s topic.

Supporting links for today:

Questions answered during the stream: