Cloud Experts Documentation

Custom Alerts on ROSA Classic

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration. This guide has been validated on OpenShift 4.18. Operator CRD names, API versions, and console paths may differ on other versions.

This guide shows how to create custom alerts on a ROSA Classic cluster. To keep it grounded in something you actually hit in the field, we focus on a common pain point: high-traffic workloads such as ingress controllers or API gateways that pile onto too few nodes can exhaust nf_conntrack capacity on those workers. The steps that follow show how to observe that pressure with platform metrics, evaluate alerting rules in the user workload monitoring path, and send notifications through user Alertmanager.

Every Kubernetes object in this guide is applied with oc apply -f - <<'EOF' (or equivalent). Use a shell where oc login already targets your cluster as a user who can edit openshift-user-workload-monitoring and create namespaces (for example cluster-admin).

Linux netfilter tracks connections in a fixed-size nf_conntrack table on each node. Ingress controllers, API gateways, and similar edge components terminate or proxy many short-lived or long-lived connections. That volume maps to per-node conntrack state, and busy edge stacks are often where pressure shows up first. The effect is worse if pods lack appropriate anti-affinity (or other spreading rules) and bunch on one or two workers, concentrating connection churn and table usage that would be tolerable if spread across the fleet. Symptoms include timeouts, packet loss, and errors localized to specific nodes, which are easy to misattribute to the network or security groups.

Use alerting here as a signal, not a substitute for capacity and placement work: review scheduling (pod anti-affinity, topology spread constraints, replica counts, and node capacity) so ingress and gateway traffic spreads across workers.

Metrics: OpenShift node_exporter exposes gauges such as node_nf_conntrack_entries, node_nf_conntrack_entries_limit, and kernel stat counters like node_nf_conntrack_stat_drop. They are scraped by platform Prometheus in openshift-monitoring.

Why the platform alert is not enough on ROSA Classic

OpenShift ships a platform rule, NodeHighNumberConntrackEntriesUsed (in the node-exporter-rules group), that fires on high conntrack utilization when:

node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75

On ROSA Classic, customers cannot configure receivers on the platform Alertmanager. Platform alerts follow Red Hat’s managed path; you do not get Slack, PagerDuty, or similar from that stack the way you would on a self-managed cluster.

If you need your own notification channels, evaluate equivalent (or stricter) rules in the user workload monitoring path and send them to user Alertmanager (and related config).

Overlap is OK: duplicating the 75% condition under user monitoring is normal on ROSA Classic. Platform Alertmanager will not double-page you; customer paging should come from user Alertmanager only.

How the pieces fit together

| Piece | Role |
| --- | --- |
| node-exporter | Exposes node_nf_conntrack_* from each node (platform scrape). |
| Platform Prometheus + Thanos sidecars | Store those series. |
| thanos-querier in openshift-monitoring | Merges queries across platform and user-workload Prometheus backends. |
| thanos-ruler-user-workload | Evaluates PrometheusRule objects that are not scoped to leaf-prometheus, running PromQL against Thanos Querier, so expressions can see cluster metrics (including conntrack) and user metrics. |
| namespacesWithoutLabelEnforcement | In user-workload-monitoring-config, lists namespaces (here custom-alert) where Thanos Ruler must not force every query to match namespace="<project>", which would otherwise hide openshift-monitoring series. |
| User Alertmanager | Receives alerts from the user monitoring stack so you can set receivers (Slack, PagerDuty, etc.). |

Do not set openshift.io/prometheus-rule-evaluation-scope: leaf-prometheus on these rules: that path uses user-workload Prometheus only, whose TSDB does not contain node_nf_conntrack_* by default.

Prerequisites

  • oc logged in (for example cluster-admin).
  • User workload monitoring enabled:
    • cluster-monitoring-config in openshift-monitoring must include enableUserWorkload: true
    • (ROSA and OSD usually already satisfy this when UWM is on).
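To confirm the prerequisite, you can inspect the platform monitoring ConfigMap directly. A quick check, assuming the ConfigMap already exists:

```shell
# Print the platform monitoring config and look for enableUserWorkload: true
oc -n openshift-monitoring get configmap cluster-monitoring-config \
  -o jsonpath='{.data.config\.yaml}'
```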

Configure the namespace and user workload monitoring

Use project name custom-alert below. If you change it, update both the Namespace and namespacesWithoutLabelEnforcement consistently.

Create the namespace
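A minimal sketch, assuming the project name custom-alert and the openshift.io/user-monitoring label discussed under Troubleshooting:

```shell
oc apply -f - <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: custom-alert
  labels:
    # Makes the namespace eligible for user workload monitoring
    openshift.io/user-monitoring: "true"
EOF
```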

User workload monitoring ConfigMap

Inspect what is already there

Check whether data.config.yaml has any content:
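One way to do this, matching the wc -c interpretation in the table below:

```shell
# Prints the byte count of data.config.yaml (0 or an error means empty/missing)
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config \
  -o jsonpath='{.data.config\.yaml}' | wc -c
```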

Interpret the result:

| Situation | What it means | What to do |
| --- | --- | --- |
| Error from server (NotFound) | ConfigMap not present yet. | Use the full oc apply heredoc below (creates the object). |
| data: missing, or data: {}, or wc -c prints 0 | The object exists but there is no config.yaml (or it is empty). | Use the full oc apply heredoc below. It only adds data.config.yaml. |
| wc -c is greater than 0 | config.yaml already has body text. | Merge into that YAML by hand (or export, edit, re-apply). Do not paste the heredoc blindly or you will replace the entire config.yaml and drop existing settings. |

Optional: list keys under data (requires jq):
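A sketch of that check (the `// {}` guard handles a missing data field):

```shell
# List every key under .data, if any
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config \
  -o json | jq -r '.data // {} | keys[]'
```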

Empty or missing config.yaml: apply full ConfigMap
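One shape that matches this guide (alertmanager.enabled for the user Alertmanager section later, plus namespacesWithoutLabelEnforcement for custom-alert); adjust to your own settings before applying:

```shell
oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
    # Let Thanos Ruler rules in this project query cluster-wide series
    namespacesWithoutLabelEnforcement:
      - custom-alert
EOF
```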

Non-empty config.yaml: merge by hand

Edit the live object so config.yaml keeps your existing keys, and add or extend:

  • namespacesWithoutLabelEnforcement: include custom-alert (append to the list if the key already exists).
  • alertmanager.enabled: true if you need user Alertmanager and it is not already enabled there.

Then apply your merged file, for example:
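A possible export/edit/re-apply flow (the local filename is illustrative):

```shell
# Export the live object, merge your changes by hand, then re-apply
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config \
  -o yaml > uwm-config.yaml
# ...edit uwm-config.yaml, keeping all existing config.yaml keys...
oc apply -f uwm-config.yaml
```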

A config with the wrong shape will be rejected by the admission webhook (for example, prometheus.enabled: true under user-workload-monitoring-config on 4.18+).

Wait 1 to 3 minutes for the cluster monitoring operator to reconcile Thanos Ruler and Alertmanager pods in openshift-user-workload-monitoring.

Demo: always-firing lab alerts

Use this to prove Thanos Ruler can see node_nf_conntrack_* and that Observe → Alerting with Source: User shows Firing (typically one series per node).

The following PrometheusRule creates two lab-only alerts in custom-alert. Both use conditions that are always true on a healthy cluster (entries >= 0 and limit > 0). They do not detect a real incident; they only confirm that Thanos Ruler can query node_exporter conntrack series through Thanos Querier. Labels use severity: none so you can tell them apart from production alerts.

Expect: within a couple of evaluation intervals, open Administrator → Observe → Alerting → Alerting rules, filter Source: User, and find ConntrackEntriesNonNegative and ConntrackLimitPositive in state Firing with count approximately equal to node count.

Production: real thresholds

Remove the lab rules first (optional but avoids noise), then apply utilization and drop or insert-failure alerts.

Remove demo rules
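Assuming the lab alerts were created in a single PrometheusRule named conntrack-lab (substitute whatever name you used):

```shell
oc -n custom-alert delete prometheusrule conntrack-lab
```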

Apply production-style rules

The next PrometheusRule replaces the lab checks with threshold and failure rules: warning and critical alerts when the conntrack table is above 75% or 90% full (with for delays to reduce noise), and critical alerts when kernel counters show packets dropped, early drops, or insert failures (rates over five minutes). Together those cover high utilization and signs the table is already failing traffic.
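A sketch of those rules under the guide's thresholds (object and alert names are illustrative; the for durations match the Tuning thresholds section below):

```shell
oc apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: conntrack-production
  namespace: custom-alert
spec:
  groups:
    - name: conntrack.rules
      rules:
        - alert: NodeConntrackUtilizationHigh
          expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "conntrack table over 75% full on {{ $labels.instance }}"
        - alert: NodeConntrackUtilizationCritical
          expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "conntrack table over 90% full on {{ $labels.instance }}"
        - alert: NodeConntrackDrops
          # Kernel is already dropping tracked packets
          expr: rate(node_nf_conntrack_stat_drop[5m]) > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "conntrack packet drops on {{ $labels.instance }}"
        - alert: NodeConntrackEarlyDrops
          expr: rate(node_nf_conntrack_stat_early_drop[5m]) > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "conntrack early drops on {{ $labels.instance }}"
        - alert: NodeConntrackInsertFailures
          expr: rate(node_nf_conntrack_stat_insert_failed[5m]) > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "conntrack insert failures on {{ $labels.instance }}"
EOF
```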

Tuning thresholds

  • Warning ratio: change 0.75 or lengthen for: 15m if too noisy.
  • Critical ratio: change 0.90 or for: 5m.
  • Drops: rate(...[5m]) > 0 with for: 2m ignores single-scrape blips; tighten or loosen as needed.
  • If the UI warns that stat_* metrics are not counters, confirm types for your node_exporter build; increase(...[10m]) > 0 is a common alternative for counter-like series.

Notifications (user Alertmanager)

User Alertmanager reads its configuration from Secret/alertmanager-user-workload in openshift-user-workload-monitoring (key alertmanager.yaml). For enabling user Alertmanager, Slack webhooks, and updating that Secret, see Custom Alerts in ROSA 4.11.x.

Note: User Alertmanager configuration is not a fully managed ROSA surface. Toggling User Workload Monitoring in OpenShift Cluster Manager can overwrite related configuration. Keep a copy of your alertmanager.yaml and re-apply after changes.

Prerequisites: Create a Slack Incoming Webhook and set the URL in your shell (do not commit it):
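For example (the URL below is a placeholder; keep the real one out of version control):

```shell
# Replace with your real Slack Incoming Webhook URL
export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/T000/B000/XXXX'
```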

Example Slack receiver (unquoted EOF so the shell substitutes ${SLACK_WEBHOOK_URL}; change #openshift-alerts to your channel):
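A minimal sketch under those assumptions. Note that applying this replaces any existing alertmanager.yaml in the Secret, so merge by hand if you already have receivers configured:

```shell
oc apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-user-workload
  namespace: openshift-user-workload-monitoring
stringData:
  alertmanager.yaml: |
    route:
      receiver: slack
      group_by: ['alertname', 'namespace']
    receivers:
      - name: slack
        slack_configs:
          # The shell substitutes the webhook URL because EOF is unquoted
          - api_url: ${SLACK_WEBHOOK_URL}
            channel: '#openshift-alerts'
            send_resolved: true
EOF
```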

If you prefer not to expand a variable in the shell, use a literal placeholder in api_url: and replace it before oc apply, or use --from-file=alertmanager.yaml=... as in the linked guide.

After applying, wait for user Alertmanager pods to pick up the Secret (typically within about a minute). Firing User alerts (for example the lab rules ConntrackEntriesNonNegative or ConntrackLimitPositive from the demo section) should then appear in Slack.

For AlertmanagerConfig CRs, multiple receivers, or release-specific details, follow your OpenShift version documentation in addition to the Cloud Experts article.

Verify

In the console: Administrator → Observe → Alerting, filter Source: User and search for alert names.

Optional platform check (confirms metrics exist at source):
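One way to query at source, going through the thanos-querier route with your login token (assumes your user can read cluster monitoring data):

```shell
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# A non-empty result set confirms the conntrack series exist
curl -skG -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=node_nf_conntrack_entries / node_nf_conntrack_entries_limit' \
  "https://${HOST}/api/v1/query"
```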

Cleanup

Remove custom-alert from namespacesWithoutLabelEnforcement (edit the ConfigMap and re-apply), then optionally delete the namespace:
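```shell
# Deleting the namespace also removes the PrometheusRule objects in it
oc delete namespace custom-alert
```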

Troubleshooting

| Symptom | Things to check |
| --- | --- |
| Lab rules never fire as User | namespacesWithoutLabelEnforcement includes custom-alert; wait for operator reconcile; rule must not have leaf-prometheus scope. |
| oc apply ConfigMap Forbidden / unknown field | On 4.18+, remove prometheus.enabled from user-workload-monitoring-config. |
| custom-alert not eligible | Namespace needs openshift.io/user-monitoring: "true"; avoid openshift.io/cluster-monitoring: "true" on that namespace for this pattern. |
| oc exec into Prometheus fails | Use Observe → Metrics in the console or fix admission webhooks blocking exec. |
| Duplicate 75% in platform UI | Expected on ROSA Classic; customer paging should come from user Alertmanager only. |

Optional further reading

  • ROSA Classic Managing alerts (user-defined projects, cross-project rules, and related administrator tasks).
  • You can keep the same manifests as files and run oc apply -f instead of heredocs if that fits your GitOps workflow.

This guide aligns with OpenShift 4.18 validation of user-workload-monitoring-config and Thanos Ruler and Querier behavior observed on ROSA Classic.
