ROSA Architecture Decision Checklist
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
Use this checklist when planning a new ROSA deployment or reviewing an existing one. Each item captures a key decision, a safe default, and a pointer to the full rationale in the ROSA Best Practices and Recommendations guide.
Work through the phases in order. Phases 1 and 2 lock in decisions that are hard or impossible to change later; phases 3 and 4 can iterate as workloads evolve. The quick-reference summary at the end lists every decision point with its safe default on one page.
Phase 1: Pre-provisioning (before rosa create cluster)
Decisions in this phase are difficult to reverse after cluster creation.
1. Cluster model
| Decision | ROSA with Hosted Control Planes (HCP) or ROSA Classic? |
| Safe default | HCP for new deployments. |
| When to deviate | Large existing Classic fleet mid-migration; Spot Instance machine pools (Classic only). |
| Full rationale | Fundamental architecture and the paradigm shift |
2. VPC, CIDR, and Availability Zones
This is the most consequential network decision you will make. Machine, pod, and service CIDRs cannot be changed after cluster creation. Undersizing limits your maximum node count and pods-per-node ceiling for the life of the cluster. When a cluster runs out of address space the only remediation is standing up additional clusters, which introduces cross-cluster routing, split deployment decisions, service mesh or federation complexity, and operational overhead that compounds over time. Run the numbers with the OpenShift Network Calculator and validate with your network team before you provision.
| Decision | How many AZs? What CIDR ranges for machine, pod, and service networks? |
| Safe default | 3 AZs for production. Use ROSA defaults (machine 10.0.0.0/16, pod 10.128.0.0/14, service 172.30.0.0/16, host prefix 23) unless IPAM or federation requires otherwise. Plan /22 or larger for machine CIDR when approaching 500 workers. Keep pod, service, and machine CIDRs unique across on-premises and cloud networks if you need routable connectivity later. |
| When to deviate | Dev/test can use fewer AZs. Existing enterprise IPAM may dictate non-default ranges, but never go below the HCP minimums (/25 single-AZ, /24 multi-AZ) and always leave headroom for growth beyond your day-1 node count. |
| Full rationale | VPC and CIDR architecture , OpenShift Network Calculator |
3. Cluster API and network exposure
| Decision | Private or public API? Private or public application ingress? |
| Safe default | Private API, private default ingress. Add a dedicated edge VPC for internet-facing workloads. |
| When to deviate | Dev/sandbox where public API simplifies access; non-regulated workloads where public Routes are acceptable. |
| Full rationale | Private clusters, landing-zone ingress, and application DNS/TLS |
4. Egress model
| Decision | Zero-egress, proxy/firewall egress, or unrestricted NAT? |
| Safe default | Zero-egress or centralized firewall/proxy via Transit Gateway for regulated estates. |
| When to deviate | Teams that need full internet for rapid iteration (dev clusters, PoCs). |
| Full rationale | Zero-Egress and Secure Egress architectures |
5. IAM, STS, and OIDC
| Decision | Have you created STS roles, OIDC config, and scoped IAM policies? |
| Safe default | Always STS mode. Create a reusable OIDC configuration shared across clusters. Scope every role to least privilege. |
| When to deviate | Rarely. Static IAM keys are a gap to close, not a design choice. |
| Full rationale | Identity and Access Management through STS and OIDC |
6. Encryption at rest
| Decision | AWS-managed keys or customer-managed keys (CMK/BYOK) in KMS? |
| Safe default | CMK for regulated or multi-tenant estates; separate keys for data, backups, and audit. |
| When to deviate | Non-regulated, single-tenant environments where AWS-managed keys are acceptable. |
| Full rationale | Security, identity, and encryption on AWS |
7. Instance types and machine pools
| Decision | Which EC2 family? Graviton (ARM) or x86? Multiple pool sizes? |
| Safe default | Current-gen general-purpose (e.g. m6i/m7g); evaluate Graviton when images are multi-arch. Use multiple pools to isolate noisy workloads (batch, GPU, ingress). |
| When to deviate | Memory- or compute-optimized families for specialized tiers (databases, ML). |
| Full rationale | Instance Type Optimization and Graviton , Worker memory, allocatable capacity, and mixed machine pools |
8. AWS service quotas
| Decision | Have you reviewed and raised quotas for VPC, ELB, EC2, and ROSA limits in your target region? |
| Safe default | Review defaults during architecture review, not the day before cutover. |
| Full rationale | Reliability scope, quotas, and backups |
Phase 2: Day-1 cluster configuration (first hours after creation)
9. Identity provider and admin access
| Decision | Which external IdP (OIDC, LDAP, Entra ID, Okta)? Who gets dedicated-admin? How is break-glass handled? |
| Safe default | External IdP with MFA. Remove kubeadmin after validation. Store break-glass credentials in a managed vault. Reserve cluster-admin for exceptional, policy-reviewed grants. |
| Full rationale | OIDC Configuration and Identity Providers , ROSA customer administration and break-glass |
10. Security baselines (SCC and Pod Security)
| Decision | Which SCC for workloads? How do you enforce restricted as the default? |
| Safe default | restricted (or restricted-v2) SCC for all workloads unless a documented exception exists. Custom SCCs over granting privileged. Namespace Pod Security labels aligned with SCC admission. |
| Full rationale | Security Context Constraints (SCC) and Pod Security , Pod security context baselines |
11. Project templates and tenant defaults
| Decision | Do new Projects get baseline ResourceQuota, LimitRange, NetworkPolicy, and EgressFirewall automatically? |
| Safe default | Yes. Configure a project request template so every Project inherits deny-by-default network policy, quotas, and limit ranges. |
| Full rationale | Projects, quotas, and project request templates |
12. Network isolation
| Decision | Default-deny NetworkPolicy per namespace? EgressFirewall for external destinations? |
| Safe default | Default-deny ingress and egress per namespace, with allow rules for the ingress controller and approved external APIs. |
| Full rationale | Network isolation with NetworkPolicies and Egress Firewalls |
13. Observability stack
| Decision | User workload monitoring enabled? Where do logs land (Loki, CloudWatch, SIEM)? Control plane log forwarding configured? |
| Safe default | Enable user workload monitoring. Forward cluster and control-plane logs to CloudWatch or your SIEM. Federate metrics to Amazon Managed Service for Prometheus or equivalent for long-term retention. |
| Full rationale | Centralized logging and metrics federation , Application observability |
14. GitOps and CI/CD operators
| Decision | Install OpenShift GitOps (Argo CD) and/or OpenShift Pipelines (Tekton)? External CI integration? |
| Safe default | OpenShift GitOps for declarative desired state. OpenShift Pipelines or external CI for build/test/promote. Pin Subscriptions with Manual installPlanApproval. |
| Full rationale | CI/CD and GitOps (platform-native) |
15. Secret management
| Decision | How are secrets delivered to workloads? Manual Secret YAML, or automated sync from a central store? |
| Safe default | External Secrets Operator syncing from AWS Secrets Manager (or Vault) with IRSA-backed authentication. Namespace-scoped SecretStore with least-privilege IAM. |
| Full rationale | Configuration, secrets, and external secret management |
16. Compliance scanning
| Decision | Which compliance profiles (CIS, PCI-DSS, FedRAMP)? |
| Safe default | Install the Compliance Operator, select profiles matching your regulatory posture, and review scan results on a regular cadence. |
| Full rationale | The OpenShift Compliance Operator |
Phase 3: Workload onboarding (per application or team)
17. Health probes
| Decision | Does every container define liveness, readiness, and (where needed) startup probes? |
| Safe default | Distinct liveness (/livez, narrow deadlock detection) and readiness (/readyz, dependency-aware) endpoints. Startup probes for slow-init apps. |
| Don’t | Reuse the same heavy endpoint for both liveness and readiness; that causes restart loops under load. |
| Full rationale | Health probes and the container lifecycle |
18. Graceful shutdown
| Decision | Does the application handle SIGTERM? Is terminationGracePeriodSeconds tuned? |
| Safe default | Stop accepting new work on SIGTERM, drain in-flight requests, and set the grace period to cover p99 latency. Use preStop hooks for deregistration when needed. |
| Full rationale | Graceful shutdown and rolling updates |
19. Resource requests, limits, and QoS
| Decision | Do all containers have CPU and memory requests and limits? |
| Safe default | Always set requests. Set memory limits. Be deliberate with CPU limits (they throttle via CFS). Use VPA in recommendation-only mode to right-size before committing. |
| Don’t | Deploy without requests: the scheduler cannot place Pods fairly and the cluster autoscaler cannot react. |
| Full rationale | Resource management and QoS |
20. Scheduling and spread
| Decision | Are replicas spread across nodes and AZs? |
| Safe default | Use topologySpreadConstraints for node and zone spread. Run 3+ replicas for tier-1 services. Pair with PDBs. |
| Don’t | Run a single replica and call it “HA” because the cluster is multi-AZ. |
| Full rationale | Scheduling spread, affinity, and noisy neighbors |
21. Pod Disruption Budgets
| Decision | Does every stateful or tier-1 workload have a PDB? |
| Safe default | maxUnavailable: 1 (or equivalent) so drains and upgrades can proceed. |
| Don’t | Set minAvailable equal to your total replica count; that blocks all node drains and cluster upgrades. |
| Full rationale | Pod Disruption Budgets (PDBs) |
22. Storage selection
| Decision | EBS (RWO), EFS (RWX), S3 (object), or ephemeral? |
| Safe default | EBS via CSI (gp3, tuned IOPS) for most RWO workloads. EFS only when true RWX is required. S3 for blobs, data lakes, and off-cluster backups. Avoid large emptyDir or hostPath. |
| Don’t | Promise RWX on EBS-backed StorageClasses. Use hostPath in shared clusters without security review. |
| Full rationale | Persistent storage, CSI, and data planes on AWS |
23. Backing services
| Decision | Managed AWS service (RDS, ElastiCache, DynamoDB) or in-cluster StatefulSet? |
| Safe default | Managed services for tier-1 data. In-cluster operators are valid for dev/test or when you fully own the support story. |
| Don’t | Run a single-replica in-cluster database for production without documenting it as a deliberate SPOF. |
| Full rationale | Managed backing services vs in-cluster state on AWS |
24. Application AWS access (IRSA)
| Decision | How do Pods authenticate to AWS APIs (S3, Secrets Manager, RDS, SQS)? |
| Safe default | IRSA: dedicated ServiceAccount per app, dedicated IAM role with least-privilege trust policy scoped to the cluster OIDC issuer and exact sub claim. |
| Don’t | Embed AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in Secrets, ConfigMaps, or Deployment env vars. |
| Full rationale | Application workloads: IRSA, STS, and AWS credentials |
25. Service accounts and RBAC
| Decision | Dedicated ServiceAccount per workload? Token automount disabled when not needed? |
| Safe default | One SA per app, automountServiceAccountToken: false for Pods that do not call the Kubernetes API. Minimal Role/ClusterRole bindings. |
| Full rationale | Service accounts and RBAC for workloads |
26. Container image hygiene
| Decision | Images pinned by digest? Base image rebuild process? |
| Safe default | Pull by digest or one-to-one tagged builds. Rebuild on CVE fixes as part of normal change cadence. |
| Don’t | Use unbounded :latest in production. |
| Full rationale | Container images, digests, and CVE response |
27. Routes, TLS, and ingress
| Decision | TLS mode per Route (edge, passthrough, reencrypt)? Certificate source (cert-manager, ACM, manual)? |
| Safe default | TLS on every Route. cert-manager Operator for on-cluster certs with automated renewal. ACM for TLS terminated on ALB/CloudFront at the edge. External DNS Operator to sync Route hostnames to Route 53. |
| Don’t | Expose Routes without TLS. Manually rotate wildcard certs pasted into Secrets. Hardcode IPs instead of FQDNs. |
| Full rationale | OpenShift Routes, ingress policy, and OVN semantics on ROSA |
Phase 4: Steady-state operations
28. Upgrade strategy
| Decision | How do you stage control plane and machine pool upgrades? |
| Safe default | Upgrade the hosted control plane first, then machine pools in sequence. Verify ClusterOperator health and Insights findings after each step. Use node surge so capacity is not reduced during upgrades. |
| Full rationale | Decoupled upgrade strategy , API compatibility and upgrade readiness |
29. Autoscaling
| Decision | How do you scale nodes, replicas, and per-Pod resources? |
| Safe default | Cluster autoscaler for node capacity (at least one pool per AZ). HPA for replica scaling on CPU or custom metrics. VPA in recommendation-only mode to inform request/limit tuning. For predictable spikes, schedule capacity ahead of demand. |
| Full rationale | Multi-Dimensional Autoscaling |
30. Cost optimization
| Decision | Savings Plans or Reserved Instances for steady pools? Consistent tagging for chargeback? |
| Safe default | Savings Plans for production workers. Tag VPC, LB, and machine pool resources with environment, cost center, and application keys. |
| Full rationale | Financial engineering and cost optimization , Performance, FinOps tags, and predictable capacity |
31. Disaster recovery and backup
| Decision | What is your RPO and RTO? HA scope (in-region) vs DR scope (cross-region)? |
| Safe default | Multi-AZ workers + spread Pods + multi-AZ managed data for in-region HA. Workload-scoped backup (RDS snapshots, EBS snapshots, Velero/OADP per namespace) rather than whole-cluster restore as the default story. Rehearse restores on a schedule. |
| Don’t | Conflate multi-AZ HA with full regional DR. Promise “high availability” without multi-AZ data and enough spread replicas. |
| Full rationale | Disaster Recovery and business continuity |
32. Multi-Region (if applicable)
| Decision | Hot/hot, hot/warm, or hot/cold posture? Data replication strategy? |
| Safe default | Hot/warm for most enterprise DR. Pair with Aurora Global, S3 CRR, Route 53 failover, and ECR cross-region replication. IaC and GitOps to rebuild the secondary cluster within RTO. |
| Full rationale | Multi-Region and Global Connectivity |
33. Proactive health checks
| Decision | How do you catch drift before it becomes an incident? |
| Safe default | Insights Advisor for platform recommendations. Periodic cluster health scripts (operator status, unbounded Pods, privileged SCC, PDB violations). Compliance Operator scans. |
| Full rationale | Proactive health monitoring with Insights Advisor , Health assessment framework and investigative scripting |
34. Infrastructure as code
| Decision | How are VPCs, peering, and landing zones provisioned? |
| Safe default | Terraform, CloudFormation, or ROSA CLI + versioned manifests. Console steps are fine for illustration but should not be the only path to reproduce production. |
| Full rationale | Operational excellence: IaC, observability, and residency |
Quick-reference summary
Download the same rows as
best-practices-checklist-decisions.csv
(columns: id, phase, decision_point, safe_default).
| # | Decision point | Safe default |
|---|---|---|
| Pre-provisioning | ||
| 1 | HCP or Classic? | HCP for new deployments |
| 2 | AZs and CIDR ranges | 3 AZs; ROSA defaults; /22+ machine CIDR at scale |
| 3 | Private or public API / ingress? | Private API + edge VPC for internet workloads |
| 4 | Egress model | Zero-egress or firewall/proxy via TGW |
| 5 | STS roles, OIDC, IAM scoping | STS always; reusable OIDC; least-privilege roles |
| 6 | AWS-managed or CMK in KMS? | CMK for regulated / multi-tenant estates |
| 7 | EC2 families and pool layout | Current-gen GP; evaluate Graviton; multiple pools |
| 8 | Quotas reviewed and raised? | Review during architecture review, not cutover |
| Day-1 configuration | ||
| 9 | IdP, admin model, break-glass | External IdP + MFA; remove kubeadmin; vault for break-glass |
| 10 | SCC baseline and enforcement | restricted for all; custom SCCs over privileged |
| 11 | Project template with defaults? | Auto-create quota + NetworkPolicy + LimitRange |
| 12 | NetworkPolicy and EgressFirewall | Default-deny per namespace |
| 13 | Monitoring, logging, metrics | User workload monitoring + log forwarding to CloudWatch/SIEM |
| 14 | GitOps / CI / CD operators | OpenShift GitOps + Pipelines or external CI |
| 15 | Secret delivery mechanism | ESO + Secrets Manager via IRSA |
| 16 | Compliance profiles | Compliance Operator with regulatory profiles |
| Workload onboarding | ||
| 17 | Probe design per workload | Distinct liveness (/livez) and readiness (/readyz) |
| 18 | SIGTERM handling and grace period | Handle SIGTERM; tune terminationGracePeriodSeconds |
| 19 | Requests, limits, QoS | Always set requests; VPA recommend-only to inform sizing |
| 20 | Topology spread, replica count | topologySpreadConstraints + 3+ replicas for tier-1 |
| 21 | PDB policy | maxUnavailable: 1 |
| 22 | Storage tier per workload | EBS gp3 (RWO); EFS only for RWX; S3 for objects |
| 23 | Managed vs in-cluster state | Managed AWS services for tier-1 data |
| 24 | IRSA wiring per app | Dedicated SA + IAM role per app; no static keys |
| 25 | SA and RBAC scoping | Dedicated SA; automountServiceAccountToken: false |
| 26 | Image pinning and rebuild | Pin by digest; rebuild on CVE |
| 27 | TLS mode and cert source | TLS always; cert-manager + External DNS |
| Steady-state operations | ||
| 28 | Upgrade sequencing | Control plane first, then pools; verify ClusterOperators |
| 29 | CA / HPA / VPA | CA per AZ + HPA + VPA recommend-only |
| 30 | Savings Plans, tagging | Savings Plans for production; consistent FinOps tags |
| 31 | RPO, RTO, backup scope | Workload-scoped backup; rehearse restores |
| 32 | Multi-Region posture | Hot/warm + Aurora Global / S3 CRR / Route 53 |
| 33 | Health check tooling | Insights + health scripts + Compliance Operator |
| 34 | IaC tooling | Terraform / CloudFormation / ROSA CLI in Git |