ROSA Architecture Decision Checklist

Last edited: April 2, 2026
Published: April 2, 2026
Authors: Red Hat Cloud Experts

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

Use this checklist when planning a new ROSA deployment or reviewing an existing one. Each item captures a key decision, a safe default, and a pointer to the full rationale in the ROSA Best Practices and Recommendations guide.

Work through the phases in order. Phases 1 and 2 lock in decisions that are hard or impossible to change later; phases 3 and 4 can iterate as workloads evolve. The quick-reference summary at the end lists every decision point with its safe default on one page.

Phase 1: Pre-provisioning (before `rosa create cluster`)

Decisions in this phase are difficult to reverse after cluster creation.

1. Cluster model


Decision	ROSA with Hosted Control Planes (HCP) or ROSA Classic?
Safe default	HCP for new deployments.
When to deviate	Large existing Classic fleet mid-migration; Spot Instance machine pools (Classic only).
Full rationale	Fundamental architecture and the paradigm shift

2. VPC, CIDR, and Availability Zones

This is the most consequential network decision you will make. Machine, pod, and service CIDRs cannot be changed after cluster creation. Undersizing them caps your maximum node count and pods-per-node ceiling for the life of the cluster. When a cluster runs out of address space, the only remediation is to stand up additional clusters, which introduces cross-cluster routing, split deployment decisions, service mesh or federation complexity, and operational overhead that compounds over time. Run the numbers with the OpenShift Network Calculator and validate them with your network team before you provision.


Decision	How many AZs? What CIDR ranges for machine, pod, and service networks?
Safe default	3 AZs for production. Use ROSA defaults (machine `10.0.0.0/16`, pod `10.128.0.0/14`, service `172.30.0.0/16`, host prefix `23`) unless IPAM or federation requires otherwise. Plan a machine CIDR of `/22` or larger (a shorter prefix, so more addresses) when approaching 500 workers. Keep pod, service, and machine CIDRs unique across on-premises and cloud networks if you need routable connectivity later.
When to deviate	Dev/test can use fewer AZs. Existing enterprise IPAM may dictate non-default ranges, but never go below the HCP minimums (`/25` single-AZ, `/24` multi-AZ) and always leave headroom for growth beyond your day-1 node count.
Full rationale	VPC and CIDR architecture, OpenShift Network Calculator
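The CIDR arithmetic behind this decision can be sketched with Python's standard `ipaddress` module. This is a rough illustration, not a replacement for the OpenShift Network Calculator: the function name `cidr_capacity` and its return keys are hypothetical, and the figures are raw address counts that ignore addresses reserved by AWS per subnet and the kubelet's own pod limit.

```python
import ipaddress


def cidr_capacity(machine_cidr, pod_cidr, service_cidr, host_prefix):
    """Return raw capacity ceilings implied by the chosen ranges (illustrative only)."""
    nets = {
        "machine": ipaddress.ip_network(machine_cidr),
        "pod": ipaddress.ip_network(pod_cidr),
        "service": ipaddress.ip_network(service_cidr),
    }
    if host_prefix < nets["pod"].prefixlen:
        raise ValueError("host prefix cannot be shorter than the pod CIDR prefix")

    # Report every pair of ranges that overlaps; the result should be empty.
    names = list(nets)
    overlaps = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]
    return {
        "machine_addresses": nets["machine"].num_addresses,
        # Each node receives one per-node subnet of size host_prefix from the pod CIDR.
        "max_nodes_by_pod_cidr": 2 ** (host_prefix - nets["pod"].prefixlen),
        "pod_addresses_per_node": 2 ** (32 - host_prefix),
        "service_addresses": nets["service"].num_addresses,
        "overlaps": overlaps,
    }


# The ROSA defaults quoted in the table above.
print(cidr_capacity("10.0.0.0/16", "10.128.0.0/14", "172.30.0.0/16", 23))
```

With the defaults, the pod CIDR yields 512 per-node subnets of 512 addresses each, which is why shrinking the pod CIDR or lengthening the host prefix changes the node ceiling and the pods-per-node ceiling in opposite directions.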

3. Cluster API and network exposure


Decision	Private or public API? Private or public application ingress?
Safe default	Private API, private default ingress. Add a dedicated edge VPC for internet-facing workloads.
When to deviate	Dev/sandbox where public API simplifies access; non-regulated workloads where public Routes are acceptable.
Full rationale	Private clusters, landing-zone ingress, and application DNS/TLS

4. Egress model


Decision	Zero-egress, proxy/firewall egress, or unrestricted NAT?
Safe default	Zero-egress or centralized firewall/proxy via Transit Gateway for regulated estates.
When to deviate	Teams that need full internet for rapid iteration (dev clusters, PoCs).
Full rationale	Zero-Egress and Secure Egress architectures

5. IAM, STS, and OIDC


Decision	Have you created STS roles, OIDC config, and scoped IAM policies?
Safe default	Always STS mode. Create a reusable OIDC configuration shared across clusters. Scope every role to least privilege.
When to deviate	Rarely. Static IAM keys are a gap to close, not a design choice.
Full rationale	Identity and Access Management through STS and OIDC

6. Encryption at rest


Decision	AWS-managed keys or customer-managed keys (CMK/BYOK) in KMS?
Safe default	CMK for regulated or multi-tenant estates; separate keys for data, backups, and audit.
When to deviate	Non-regulated, single-tenant environments where AWS-managed keys are acceptable.
Full rationale	Security, identity, and encryption on AWS

7. Instance types and machine pools


Decision	Which EC2 family? Graviton (ARM) or x86? Multiple pool sizes?
Safe default	Current-gen general-purpose (e.g. m6i/m7g); evaluate Graviton when images are multi-arch. Use multiple pools to isolate noisy workloads (batch, GPU, ingress).
When to deviate	Memory- or compute-optimized families for specialized tiers (databases, ML).
Full rationale	Instance Type Optimization and Graviton, Worker memory, allocatable capacity, and mixed machine pools

8. AWS service quotas


Decision	Have you reviewed and raised quotas for VPC, ELB, EC2, and ROSA limits in your target region?
Safe default	Review defaults during architecture review, not the day before cutover.
Full rationale	Reliability scope, quotas, and backups

Phase 2: Day-1 cluster configuration (first hours after creation)

9. Identity provider and admin access


Decision	Which external IdP (OIDC, LDAP, Entra ID, Okta)? Who gets `dedicated-admin`? How is break-glass handled?
Safe default	External IdP with MFA. Remove `kubeadmin` after validation. Store break-glass credentials in a managed vault. Reserve `cluster-admin` for exceptional, policy-reviewed grants.
Full rationale	OIDC Configuration and Identity Providers, ROSA customer administration and break-glass

10. Security baselines (SCC and Pod Security)


Decision	Which SCC for workloads? How do you enforce `restricted` as the default?
Safe default	`restricted` (or `restricted-v2`) SCC for all workloads unless a documented exception exists. Custom SCCs over granting `privileged`. Namespace Pod Security labels aligned with SCC admission.
Full rationale	Security Context Constraints (SCC) and Pod Security, Pod security context baselines

11. Project templates and tenant defaults


Decision	Do new Projects get baseline `ResourceQuota`, `LimitRange`, `NetworkPolicy`, and `EgressFirewall` automatically?
Safe default	Yes. Configure a project request template so every Project inherits deny-by-default network policy, quotas, and limit ranges.
Full rationale	Projects, quotas, and project request templates

12. Network isolation


Decision	Default-deny `NetworkPolicy` per namespace? `EgressFirewall` for external destinations?
Safe default	Default-deny ingress and egress per namespace, with allow rules for the ingress controller and approved external APIs.
Full rationale	Network isolation with NetworkPolicies and Egress Firewalls
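As a minimal sketch of the default-deny baseline, the snippet below builds a `NetworkPolicy` manifest as a Python dictionary and prints it as JSON, which `oc apply -f -` accepts. The policy name `default-deny-all` and the namespace `team-a` are hypothetical. Allow rules for the ingress controller, DNS, and approved destinations would still need to be layered on top.

```python
import json


def default_deny_policy(namespace):
    """Build a NetworkPolicy that selects every Pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all Pods in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny_policy("team-a"), indent=2))
```

Because egress is denied too, Pods in the namespace typically lose DNS resolution until an explicit egress allow rule is added, so it is worth shipping that rule in the same project template.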

13. Observability stack


Decision	User workload monitoring enabled? Where do logs land (Loki, CloudWatch, SIEM)? Control plane log forwarding configured?
Safe default	Enable user workload monitoring. Forward cluster and control-plane logs to CloudWatch or your SIEM. Federate metrics to Amazon Managed Service for Prometheus or equivalent for long-term retention.
Full rationale	Centralized logging and metrics federation, Application observability

14. GitOps and CI/CD operators


Decision	Install OpenShift GitOps (Argo CD) and/or OpenShift Pipelines (Tekton)? External CI integration?
Safe default	OpenShift GitOps for declarative desired state. OpenShift Pipelines or external CI for build/test/promote. Pin Subscriptions with Manual `installPlanApproval`.
Full rationale	CI/CD and GitOps (platform-native)

15. Secret management


Decision	How are secrets delivered to workloads? Manual `Secret` YAML, or automated sync from a central store?
Safe default	External Secrets Operator syncing from AWS Secrets Manager (or Vault) with IRSA-backed authentication. Namespace-scoped `SecretStore` with least-privilege IAM.
Full rationale	Configuration, secrets, and external secret management

16. Compliance scanning


Decision	Which compliance profiles (CIS, PCI-DSS, FedRAMP)?
Safe default	Install the Compliance Operator, select profiles matching your regulatory posture, and review scan results on a regular cadence.
Full rationale	The OpenShift Compliance Operator

Phase 3: Workload onboarding (per application or team)

17. Health probes


Decision	Does every container define liveness, readiness, and (where needed) startup probes?
Safe default	Distinct liveness (`/livez`, narrow deadlock detection) and readiness (`/readyz`, dependency-aware) endpoints. Startup probes for slow-init apps.
Don’t	Reuse the same heavy endpoint for both liveness and readiness; that causes restart loops under load.
Full rationale	Health probes and the container lifecycle
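The split between a narrow liveness check and a dependency-aware readiness check might look like the following sketch, using only the standard library. The `dependencies_ready` callable is a hypothetical hook that an application would implement (for example, a cheap database ping); the paths match the `/livez` and `/readyz` endpoints named above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def probe_status(path, dependencies_ready):
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        # Liveness only proves the process can answer; it never checks dependencies.
        return 200
    if path == "/readyz":
        # Readiness fails while dependencies are down, so traffic is withheld
        # without the kubelet restarting the container.
        return 200 if dependencies_ready() else 503
    return 404


def make_handler(dependencies_ready):
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, dependencies_ready))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of the application log

    return ProbeHandler


def serve(port, dependencies_ready):
    """Blocking helper; call it from a background thread in a real application."""
    HTTPServer(("0.0.0.0", port), make_handler(dependencies_ready)).serve_forever()
```

Keeping `/livez` free of dependency checks is the design choice that avoids restart loops: a slow database makes the Pod unready, not dead.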

18. Graceful shutdown


Decision	Does the application handle SIGTERM? Is `terminationGracePeriodSeconds` tuned?
Safe default	Stop accepting new work on SIGTERM, drain in-flight requests, and set the grace period long enough to cover the p99 latency of in-flight requests. Use `preStop` hooks for deregistration when needed.
Full rationale	Graceful shutdown and rolling updates
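One way to picture the SIGTERM contract is the small sketch below. The class `GracefulWorker` and its method names are hypothetical; a real service would wire `try_begin` and `finish` into its request loop and exit once `drained()` returns true, well inside `terminationGracePeriodSeconds`.

```python
import signal
import threading


class GracefulWorker:
    """Tracks in-flight work and refuses new work after SIGTERM."""

    def __init__(self):
        self.stopping = False
        self.in_flight = 0
        self._lock = threading.Lock()

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        # Only flip a flag here; draining happens in the normal control flow.
        self.stopping = True

    def try_begin(self):
        """Return True if a new unit of work may start."""
        with self._lock:
            if self.stopping:
                return False
            self.in_flight += 1
            return True

    def finish(self):
        with self._lock:
            self.in_flight -= 1

    def drained(self):
        """True once shutdown was requested and no work remains."""
        with self._lock:
            return self.stopping and self.in_flight == 0
```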

19. Resource requests, limits, and QoS


Decision	Do all containers have CPU and memory requests and limits?
Safe default	Always set requests. Set memory limits. Be deliberate with CPU limits (they throttle via CFS). Use VPA in recommendation-only mode to right-size before committing.
Don’t	Deploy without requests: the scheduler cannot place Pods fairly and the cluster autoscaler cannot react.
Full rationale	Resource management and QoS
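How requests and limits map to a QoS class can be sketched as follows. This is a simplified model of the Kubernetes rules, assuming each container is described by plain `requests` and `limits` dictionaries and that quantities are compared as strings (so `"500m"` and `"0.5"` would not be treated as equal here). The function name `qos_class` is hypothetical.

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Requests set, memory limit set, CPU limit left open: the pattern suggested above.
pod = [{"requests": {"cpu": "250m", "memory": "256Mi"}, "limits": {"memory": "512Mi"}}]
print(qos_class(pod))
```

A Pod with no requests at all lands in BestEffort, which is the first candidate for eviction under node pressure.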

20. Scheduling and spread


Decision	Are replicas spread across nodes and AZs?
Safe default	Use `topologySpreadConstraints` for node and zone spread. Run 3+ replicas for tier-1 services. Pair with PDBs.
Don’t	Run a single replica and call it “HA” because the cluster is multi-AZ.
Full rationale	Scheduling spread, affinity, and noisy neighbors
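A quick way to reason about spread is to ask how many replicas survive the loss of the busiest zone. The sketch below does that arithmetic; the function name `spread_report` and the zone labels are hypothetical, and the skew is computed as the difference between the fullest and emptiest zone, the quantity that `maxSkew` bounds.

```python
def spread_report(replicas_per_zone):
    """Summarize zone spread for a mapping of zone name to replica count."""
    counts = list(replicas_per_zone.values())
    total = sum(counts)
    return {
        "skew": max(counts) - min(counts),
        # Worst case: the zone holding the most replicas goes away.
        "survivors_after_worst_zone_loss": total - max(counts),
    }


print(spread_report({"zone-a": 1, "zone-b": 0, "zone-c": 0}))
print(spread_report({"zone-a": 1, "zone-b": 1, "zone-c": 1}))
```

The single-replica case leaves zero survivors no matter how many AZs the cluster spans, which is the point of the "Don't" row above.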

21. Pod Disruption Budgets


Decision	Does every stateful or tier-1 workload have a PDB?
Safe default	`maxUnavailable: 1` (or equivalent) so drains and upgrades can proceed.
Don’t	Set `minAvailable` equal to your total replica count; that blocks all node drains and cluster upgrades.
Full rationale	Pod Disruption Budgets (PDBs)
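The reason `minAvailable` equal to the replica count blocks drains follows from simple arithmetic, sketched below. The function name `allowed_disruptions` is hypothetical, and the sketch assumes integer values (not percentages) and that all replicas are currently healthy.

```python
def allowed_disruptions(replicas, min_available=None, max_unavailable=None):
    """Return how many Pods a drain may evict at once under a PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(replicas - min_available, 0)
    return min(max_unavailable, replicas)


print(allowed_disruptions(3, max_unavailable=1))  # one Pod may be evicted
print(allowed_disruptions(3, min_available=3))    # zero: every drain is blocked
```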

22. Storage selection


Decision	EBS (RWO), EFS (RWX), S3 (object), or ephemeral?
Safe default	EBS via CSI (gp3, tuned IOPS) for most RWO workloads. EFS only when true RWX is required. S3 for blobs, data lakes, and off-cluster backups. Avoid large `emptyDir` or `hostPath`.
Don’t	Promise RWX on EBS-backed StorageClasses. Use `hostPath` in shared clusters without security review.
Full rationale	Persistent storage, CSI, and data planes on AWS

23. Backing services


Decision	Managed AWS service (RDS, ElastiCache, DynamoDB) or in-cluster StatefulSet?
Safe default	Managed services for tier-1 data. In-cluster operators are valid for dev/test or when you fully own the support story.
Don’t	Run a single-replica in-cluster database for production without documenting it as a deliberate SPOF.
Full rationale	Managed backing services vs in-cluster state on AWS

24. Application AWS access (IRSA)


Decision	How do Pods authenticate to AWS APIs (S3, Secrets Manager, RDS, SQS)?
Safe default	IRSA: dedicated `ServiceAccount` per app, dedicated IAM role with least-privilege trust policy scoped to the cluster OIDC issuer and exact `sub` claim.
Don’t	Embed `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` in Secrets, ConfigMaps, or Deployment env vars.
Full rationale	Application workloads: IRSA, STS, and AWS credentials
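A trust policy scoped to one issuer and one exact `sub` claim could be generated along these lines. The account ID, issuer host, namespace, and service account name below are placeholders, the function name `irsa_trust_policy` is hypothetical, and the `aws` ARN partition is an assumption that would differ in other partitions.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IAM trust policy that only one ServiceAccount can satisfy."""
    # The provider ARN and condition key use the issuer without its scheme.
    issuer = oidc_issuer
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # StringEquals (not StringLike) pins the exact subject.
                    "StringEquals": {
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}"
                    }
                },
            }
        ],
    }


policy = irsa_trust_policy(
    "123456789012", "https://oidc.example.com/abc123", "billing", "billing-api"
)
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on the full subject string keeps a role from being assumable by other ServiceAccounts in the same cluster.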

25. Service accounts and RBAC


Decision	Dedicated ServiceAccount per workload? Token automount disabled when not needed?
Safe default	One SA per app, `automountServiceAccountToken: false` for Pods that do not call the Kubernetes API. Minimal Role/ClusterRole bindings.
Full rationale	Service accounts and RBAC for workloads

26. Container image hygiene


Decision	Images pinned by digest? Base image rebuild process?
Safe default	Pull by digest or one-to-one tagged builds. Rebuild on CVE fixes as part of normal change cadence.
Don’t	Use unbounded `:latest` in production.
Full rationale	Container images, digests, and CVE response
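A lightweight lint for image references, suitable for a CI step, might look like this sketch. The function name `image_pin_status` and the example registries are hypothetical, and the check only looks for the `@sha256:` marker rather than validating the digest itself.

```python
def image_pin_status(image):
    """Classify an image reference as digest, tag, latest, or untagged."""
    if "@sha256:" in image:
        return "digest"
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as registry.example.com:5000.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "untagged"  # resolves to :latest implicitly
    tag = last_segment.rsplit(":", 1)[1]
    return "latest" if tag == "latest" else "tag"


for ref in (
    "quay.io/org/app@sha256:0123abcd",
    "quay.io/org/app:1.4.2",
    "quay.io/org/app:latest",
    "registry.example.com:5000/team/app",
):
    print(ref, image_pin_status(ref))
```

A pipeline could fail the build for `latest` and `untagged`, and warn on `tag` unless the tag is known to be immutable.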

27. Routes, TLS, and ingress


Decision	TLS mode per Route (edge, passthrough, reencrypt)? Certificate source (cert-manager, ACM, manual)?
Safe default	TLS on every Route. cert-manager Operator for on-cluster certs with automated renewal. ACM for TLS terminated on ALB/CloudFront at the edge. External DNS Operator to sync Route hostnames to Route 53.
Don’t	Expose Routes without TLS. Manually rotate wildcard certs pasted into Secrets. Hardcode IPs instead of FQDNs.
Full rationale	OpenShift Routes, ingress policy, and OVN semantics on ROSA

Phase 4: Steady-state operations

28. Upgrade strategy


Decision	How do you stage control plane and machine pool upgrades?
Safe default	Upgrade the hosted control plane first, then machine pools in sequence. Verify `ClusterOperator` health and Insights findings after each step. Use node surge so capacity is not reduced during upgrades.
Full rationale	Decoupled upgrade strategy, API compatibility and upgrade readiness

29. Autoscaling


Decision	How do you scale nodes, replicas, and per-Pod resources?
Safe default	Cluster autoscaler for node capacity (at least one pool per AZ). HPA for replica scaling on CPU or custom metrics. VPA in recommendation-only mode to inform request/limit tuning. For predictable spikes, schedule capacity ahead of demand.
Full rationale	Multi-Dimensional Autoscaling
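For replica scaling, the core HPA calculation documented by Kubernetes is the ratio of the observed metric to its target, rounded up. The sketch below reproduces that formula with min/max clamping; it deliberately omits the controller's tolerance band and stabilization window, and the function and parameter names are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Return ceil(current * observed / target), clamped to the HPA bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(raw, max_replicas))


# Four replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```

Because the calculation depends on utilization relative to requests, inaccurate CPU requests distort HPA behavior, which is one reason to tune requests from VPA recommendations first.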

30. Cost optimization


Decision	Savings Plans or Reserved Instances for steady pools? Consistent tagging for chargeback?
Safe default	Savings Plans for production workers. Tag VPC, LB, and machine pool resources with environment, cost center, and application keys.
Full rationale	Financial engineering and cost optimization, Performance, FinOps tags, and predictable capacity
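Tag consistency is easy to check mechanically. The following sketch reports which required keys are missing or empty on a resource; the exact key spellings (`environment`, `cost-center`, `application`) are assumptions for illustration, and the function name `missing_tags` is hypothetical.

```python
REQUIRED_TAG_KEYS = ("environment", "cost-center", "application")  # hypothetical spellings


def missing_tags(resource_tags, required=REQUIRED_TAG_KEYS):
    """Return required tag keys that are absent or blank on a resource."""
    present = {
        key.lower()
        for key, value in resource_tags.items()
        if str(value).strip()
    }
    return [key for key in required if key not in present]


print(missing_tags({"Environment": "prod", "application": "billing"}))
```

Running a check like this against VPC, load balancer, and machine pool tags during provisioning is cheaper than reconciling untagged spend later.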

31. Disaster recovery and backup


Decision	What is your RPO and RTO? HA scope (in-region) vs DR scope (cross-region)?
Safe default	Multi-AZ workers + spread Pods + multi-AZ managed data for in-region HA. Workload-scoped backup (RDS snapshots, EBS snapshots, Velero/OADP per namespace) rather than whole-cluster restore as the default story. Rehearse restores on a schedule.
Don’t	Conflate multi-AZ HA with full regional DR. Promise “high availability” without multi-AZ data and enough spread replicas.
Full rationale	Disaster Recovery and business continuity

32. Multi-Region (if applicable)


Decision	Hot/hot, hot/warm, or hot/cold posture? Data replication strategy?
Safe default	Hot/warm for most enterprise DR. Pair with Aurora Global, S3 CRR, Route 53 failover, and ECR cross-region replication. IaC and GitOps to rebuild the secondary cluster within RTO.
Full rationale	Multi-Region and Global Connectivity

33. Proactive health checks


Decision	How do you catch drift before it becomes an incident?
Safe default	Insights Advisor for platform recommendations. Periodic cluster health scripts (operator status, unbounded Pods, privileged SCC, PDB violations). Compliance Operator scans.
Full rationale	Proactive health monitoring with Insights Advisor, Health assessment framework and investigative scripting

34. Infrastructure as code


Decision	How are VPCs, peering, and landing zones provisioned?
Safe default	Terraform, CloudFormation, or ROSA CLI + versioned manifests. Console steps are fine for illustration but should not be the only path to reproduce production.
Full rationale	Operational excellence: IaC, observability, and residency

Quick-reference summary

Download the same rows as best-practices-checklist-decisions.csv (columns: id, phase, decision_point, safe_default).

#	Decision point	Safe default
Pre-provisioning
1	HCP or Classic?	HCP for new deployments
2	AZs and CIDR ranges	3 AZs; ROSA defaults; `/22`+ machine CIDR at scale
3	Private or public API / ingress?	Private API + edge VPC for internet workloads
4	Egress model	Zero-egress or firewall/proxy via TGW
5	STS roles, OIDC, IAM scoping	STS always; reusable OIDC; least-privilege roles
6	AWS-managed or CMK in KMS?	CMK for regulated / multi-tenant estates
7	EC2 families and pool layout	Current-gen GP; evaluate Graviton; multiple pools
8	Quotas reviewed and raised?	Review during architecture review, not cutover
Day-1 configuration
9	IdP, admin model, break-glass	External IdP + MFA; remove kubeadmin; vault for break-glass
10	SCC baseline and enforcement	`restricted` for all; custom SCCs over `privileged`
11	Project template with defaults?	Auto-create quota + NetworkPolicy + LimitRange
12	NetworkPolicy and EgressFirewall	Default-deny per namespace
13	Monitoring, logging, metrics	User workload monitoring + log forwarding to CloudWatch/SIEM
14	GitOps and CI/CD operators	OpenShift GitOps + Pipelines or external CI
15	Secret delivery mechanism	ESO + Secrets Manager via IRSA
16	Compliance profiles	Compliance Operator with regulatory profiles
Workload onboarding
17	Probe design per workload	Distinct liveness (`/livez`) and readiness (`/readyz`)
18	SIGTERM handling and grace period	Handle SIGTERM; tune `terminationGracePeriodSeconds`
19	Requests, limits, QoS	Always set requests; VPA recommend-only to inform sizing
20	Topology spread, replica count	`topologySpreadConstraints` + 3+ replicas for tier-1
21	PDB policy	`maxUnavailable: 1`
22	Storage tier per workload	EBS gp3 (RWO); EFS only for RWX; S3 for objects
23	Managed vs in-cluster state	Managed AWS services for tier-1 data
24	IRSA wiring per app	Dedicated SA + IAM role per app; no static keys
25	SA and RBAC scoping	Dedicated SA; `automountServiceAccountToken: false`
26	Image pinning and rebuild	Pin by digest; rebuild on CVE
27	TLS mode and cert source	TLS always; cert-manager + External DNS
Steady-state operations
28	Upgrade sequencing	Control plane first, then pools; verify ClusterOperators
29	CA / HPA / VPA	CA per AZ + HPA + VPA recommend-only
30	Savings Plans, tagging	Savings Plans for production; consistent FinOps tags
31	RPO, RTO, backup scope	Workload-scoped backup; rehearse restores
32	Multi-Region posture	Hot/warm + Aurora Global / S3 CRR / Route 53
33	Health check tooling	Insights + health scripts + Compliance Operator
34	IaC tooling	Terraform / CloudFormation / ROSA CLI in Git
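The CSV export of this summary can be loaded into review tooling with a few lines of standard-library Python. The sketch below uses the column names stated above (id, phase, decision_point, safe_default); the two sample rows are copied from the table, and the assumption that the `phase` column holds the phase name as text is not confirmed by the source.

```python
import csv
import io
from collections import defaultdict

# Two rows in the documented column layout; the real file would be read from disk.
SAMPLE = """id,phase,decision_point,safe_default
1,Pre-provisioning,HCP or Classic?,HCP for new deployments
21,Workload onboarding,PDB policy,maxUnavailable: 1
"""


def defaults_by_phase(csv_text):
    """Group (id, decision_point, safe_default) tuples by phase."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["phase"]].append(
            (int(row["id"]), row["decision_point"], row["safe_default"])
        )
    return dict(grouped)


for phase, items in defaults_by_phase(SAMPLE).items():
    print(phase, items)
```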