OpenShift Dedicated Process and Security Overview

Our Process and Security document has moved to docs.openshift.com.

Acronyms and terms

OSD: OpenShift Dedicated
OCM: OpenShift Cluster Manager
AMRO: Amazon Red Hat OpenShift
SRE: Red Hat Site Reliability Engineering
CEE: Customer Experience and Engagement (Red Hat Support)
CVE: Common Vulnerabilities and Exposures
PVs: Persistent Volumes
VPC: Virtual Private Cloud
CI/CD: Continuous Integration / Continuous Delivery

Red Hat responsibilities

This document details the Red Hat responsibilities for the OpenShift Dedicated managed service. For more information about customer or shared responsibilities, please refer to the OpenShift Dedicated Responsibilities document.

For more information about OpenShift Dedicated and its components, please refer to the OpenShift Dedicated Service Definition. This document applies to both OpenShift Dedicated and Amazon Red Hat OpenShift.

Incident and operations management

Architecture

[Figure: OpenShift Dedicated architecture diagram]

Platform monitoring

Red Hat SRE maintains a centralized monitoring and alerting system for all OpenShift Dedicated cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Management) system, where they may trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted when the cluster is deleted.

Incident management

An incident is an event which results in a degradation or outage of one or more Red Hat services. An incident may be raised by a customer or CEE member (such as a Technical Account Manager) through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.

Depending on the impact on the service and customer, the incident is categorized in terms of severity.

Red Hat generally manages a new incident as follows:

  1. An SRE first responder is alerted to a new incident, and begins an initial investigation.
    • If the incident is discovered to be caused by a configuration change made by the customer, Red Hat will send a notification via email and the Cluster History log in OpenShift Cluster Manager requesting further engagement via a support case.
  2. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
  3. The incident lead manages all communication and coordination around recovery, including any relevant notifications and/or support case updates.
  4. The incident is recovered.
  5. The incident is documented and a root cause analysis (RCA) is performed within 3 business days of the incident.
  6. A draft RCA document is shared with the customer within 7 business days of the incident.
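The two deadlines in steps 5 and 6 can be sketched as a small business-day calculation. This is only an illustration: the function names are hypothetical, counting is assumed to start on the day after the incident, and public holidays are ignored because the text does not say how they are treated.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Monday to Friday, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def rca_deadlines(incident_day: date) -> dict:
    """Deadlines from the workflow: RCA within 3 business days, draft shared within 7."""
    return {
        "rca_performed_by": add_business_days(incident_day, 3),
        "rca_draft_shared_by": add_business_days(incident_day, 7),
    }
```

Under these assumptions, an incident on Thursday 2 May 2024 yields an RCA deadline of Tuesday 7 May and a draft-sharing deadline of Monday 13 May.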

Notifications

Platform notifications are sent by email. Any customer notification is also sent to the corresponding Red Hat account team and, if applicable, the Red Hat Technical Account Manager.

The following activities may trigger notifications:

  • Platform incident
  • Performance degradation
  • Cluster capacity warnings
  • Critical vulnerabilities and resolution
  • Upgrade scheduling

Backup and recovery

All OpenShift Dedicated clusters are backed up using cloud provider snapshots. Notably, this does not include customer data stored on persistent volumes. All snapshots are taken using the appropriate cloud provider snapshot APIs and are uploaded to a secure object storage bucket (S3 in AWS, and GCS in Google Cloud) in the same account as the cluster.

Component | Snapshot Frequency | Retention | Notes
Full object store backup, all SRE-managed cluster PVs | Daily | 7 days | This is a full backup of all Kubernetes objects like etcd, as well as all SRE-managed PVs in the cluster.
Full object store backup, all SRE-managed cluster PVs | Weekly | 30 days | Same as the daily backup.
Full object store backup | Hourly | 24 hours | This is a full backup of all Kubernetes objects like etcd. No PVs are backed up in this backup schedule.
Node Root Volume | Never | N/A | Nodes are considered to be ephemeral. Nothing critical should be stored on a node's root volume.

  • Red Hat SRE rehearses recovery processes quarterly
  • Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).
  • Customers should take regular backups of their data.
  • Backups performed by SRE are taken as a precautionary measure only. They are stored in the same Region as the cluster.
  • Customers may request SRE backup data via a support case.
  • Red Hat highly encourages customers to deploy multi-AZ clusters with workloads that follow Kubernetes best practices to ensure high availability within a region. Learn more at: https://www.openshift.com/products/dedicated/understanding-availability 
  • In the event an entire cloud Region is unavailable, customers must install a new cluster in a different region and restore their apps using their backup data.
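The retention periods in the backup table can be expressed as a simple expiry check. This is a minimal sketch, not Red Hat's tooling: the schedule names, the function names, and the tuple layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Retention periods taken from the backup table above.
RETENTION = {
    "hourly": timedelta(hours=24),
    "daily": timedelta(days=7),
    "weekly": timedelta(days=30),
}


def is_expired(schedule: str, taken_at: datetime, now: datetime) -> bool:
    """True if a snapshot from `schedule` is older than its retention period."""
    return now - taken_at > RETENTION[schedule]


def snapshots_to_keep(snapshots, now):
    """Filter (schedule, taken_at) pairs down to those still within retention."""
    return [(s, t) for s, t in snapshots if not is_expired(s, t, now)]
```

Because the hourly schedule does not include PVs, the same check says nothing about customer data on persistent volumes, which customers back up themselves.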

Cluster capacity

Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all master and infrastructure nodes on the cluster.

Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.

Capacity evaluations by SRE staff also happen in response to alerts from the cluster once usage thresholds are exceeded for a certain period of time. Such alerts may also result in a notification to the customer.
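An alert that fires only after a usage threshold has been exceeded for a sustained period can be sketched as follows. The threshold, the number of consecutive samples, and the function name are hypothetical; the actual alert rules used by SRE are not described here.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if `samples` contains at least `min_consecutive` consecutive values above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical values: CPU utilization samples, 85% threshold, 3 samples in a row.
cpu_usage = [0.70, 0.90, 0.95, 0.92, 0.60]
alert = sustained_breach(cpu_usage, threshold=0.85, min_consecutive=3)
```

Requiring consecutive breaches avoids alerting on a single short spike, which matches the "for a certain period of time" wording above.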

Change management

Cluster changes are initiated in one of two ways:

  1. A customer initiates changes via self-service capabilities like cluster deployment, worker node scaling, and cluster deletion.
  2. SRE initiates a change through Operator-driven capabilities such as configuration changes, upgrades, or patching.

Change history is captured in the Cluster History section of the OpenShift Cluster Manager Overview tab and is available to customers. This includes logs from the following changes:

  • Adding or removing identity providers
  • Adding users to or removing users from the dedicated-admins group
  • Scaling the cluster compute nodes
  • Scaling the cluster load balancer
  • Scaling the cluster persistent storage
  • Upgrading the cluster

SRE-initiated changes that require manual intervention generally follow the procedure below. SREs consider manual changes a failure, so this procedure is used only as a fallback.

  • Preparing for Change
    • Change characteristics are identified and a gap analysis against current state is performed.
    • Change steps are documented and validated.
    • The communication plan and schedule are shared with all stakeholders.
    • CI/CD and end-to-end tests are updated to automate change validation.
    • Change request capturing change details is submitted for management approval.
  • Managing Change
    • Automated nightly CI/CD jobs pick up the change and run tests.
    • The change is made to Integration and Stage environments, and manually validated before updating the customer cluster.
    • Major change notifications are sent before and after the event.
  • Reinforcing the Change
    • Feedback on the change is collected and analyzed.
    • Potential gaps are diagnosed in order to understand resistance and automate similar change requests.
    • Corrective actions are implemented.

Configuration management

The infrastructure and configuration of the OSD environment are managed as code. Red Hat SRE manages changes to the OSD environment using a GitOps workflow and an automated CI/CD pipeline.

Each proposed change undergoes a series of automated verifications immediately upon check-in. Changes are then deployed to a Staging environment where they undergo automated integration testing. Finally, changes are deployed to the Production environment. Each step is fully automated.

An authorized SRE reviewer must approve advancement to each step. The reviewer may not be the same individual who proposed the change. All changes and approvals are fully auditable as part of the GitOps workflow.
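The approval rule described above (an authorized reviewer who is not the proposer must approve each advancement, and every approval is recorded) can be sketched as a small gate. The stage names, the `Change` class, and the `advance` function are hypothetical and only model the rule; they are not the actual pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical stage names modeled on the text: check-in verification, Staging, Production.
STAGES = ["verification", "staging", "production"]


@dataclass
class Change:
    author: str
    stage_index: int = 0
    approvals: list = field(default_factory=list)  # audit trail of (stage, reviewer)


def advance(change: Change, reviewer: str, authorized_reviewers: set) -> str:
    """Move `change` to the next stage if the reviewer is authorized and is not the author."""
    if reviewer not in authorized_reviewers:
        raise PermissionError(f"{reviewer} is not an authorized reviewer")
    if reviewer == change.author:
        raise PermissionError("the reviewer may not be the individual who proposed the change")
    if change.stage_index >= len(STAGES) - 1:
        raise ValueError("change is already in production")
    change.stage_index += 1
    change.approvals.append((STAGES[change.stage_index], reviewer))
    return STAGES[change.stage_index]
```

For a change authored by "alice", a call such as `advance(change, "bob", {"alice", "bob"})` moves it to staging, while the same call with "alice" as the reviewer raises `PermissionError`.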

Release management

Refer to OpenShift Dedicated Life Cycle for more information on the upgrade policy and procedures.

Identity and access management

Automated access

Most SRE access is performed through automated configuration management using cluster Operators.

SRE access to OSD clusters

SREs access OSD clusters via the web console or command line tools. Authentication requires Multi-Factor Authentication (MFA) with industry-standard requirements for password complexity and account lockouts. SREs must authenticate as individuals to ensure auditability. All authentication attempts are logged to a Security Information and Event Management (SIEM) system.

SREs access private clusters using an encrypted tunnel through a hardened SRE Support Pod running in the cluster. Connections to the SRE Support Pod are permitted only from a secured Red Hat network using an IP allow-list. In addition to the cluster authentication controls described above, authentication to the SRE Support Pod is controlled using SSH keys. SSH key authorization is limited to SRE staff and automatically synchronized with Red Hat corporate directory data. Corporate directory data is secured and controlled by HR systems, including management review, approval, and audits.

Privileged access controls in OSD

Red Hat SRE adheres to the principle of least privilege when accessing OSD and public cloud provider components. There are four basic categories of manual SRE access:

SRE admin access (via Red Hat Portal): This is normal two-factor authentication with no privileged elevation.

SRE admin access (via Red Hat corporate SSO): This is normal two-factor authentication with no privileged elevation.

OpenShift elevation: This is manual elevation using Red Hat SSO. It is limited to 2 hours, is fully audited, and requires management approval.

Cloud provider access/elevation: This is manual elevation for cloud provider console access. It is limited to 60 minutes, is fully audited, and requires management approval.

Each of these access types has different levels of access to components:

Component | Typical SRE admin access (via Red Hat Portal) | Typical SRE admin access (via Red Hat SSO) | OpenShift elevation | Cloud provider access / elevation
OpenShift Cluster Manager | R/W | No Access | No Access | No Access
OpenShift Console | No Access | R/W | R/W | No Access
Node Operating System | No Access | A specific list of elevated OS and network permissions. | A specific list of elevated OS and network permissions. | No Access
AWS Console | No Access | No Access, but this is the account used to request cloud provider access. | No Access | All cloud provider permissions using the SRE identity.
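The time limits on the two elevation types (2 hours for OpenShift elevation, 60 minutes for cloud provider elevation, both requiring management approval) can be illustrated with a validity check. The dictionary keys and the function name are assumptions for this sketch; the real enforcement mechanism is not described in this document.

```python
from datetime import datetime, timedelta

# Time limits stated above for the two manual elevation types.
ELEVATION_LIMITS = {
    "openshift": timedelta(hours=2),
    "cloud_provider": timedelta(minutes=60),
}


def elevation_active(kind: str, granted_at: datetime, now: datetime,
                     approved_by_management: bool) -> bool:
    """True while an approved elevation of `kind` is still within its time limit."""
    if not approved_by_management:
        return False
    return granted_at <= now < granted_at + ELEVATION_LIMITS[kind]
```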

SRE access to cloud infrastructure accounts

Red Hat personnel do not access cloud infrastructure accounts in the course of routine OSD operations. For emergency troubleshooting purposes, Red Hat SREs have well-defined and auditable procedures to access cloud infrastructure accounts.

In AWS, SREs generate a short-lived AWS access token for the osdManagedAdminSRE user using the AWS Security Token Service (STS). Access to the STS token is audit logged and traceable back to individual users. The osdManagedAdminSRE user has the AdministratorAccess IAM policy attached.

In Google Cloud, SREs access resources after being authenticated against Red Hat's SAML identity provider (IDP). The IDP authorizes tokens that have time-to-live expirations. The issuance of the token is auditable by corporate Red Hat IT and linked back to an individual user.

Red Hat Support access

Members of the Red Hat CEE team will typically have read-only access to parts of the cluster. Specifically, CEE has limited access to the core and product namespaces and does not have access to the customer namespaces.

Role | Core Namespace | Layered Product Namespace | Customer Namespace | Cloud Infrastructure Account*
OpenShift SRE | Read: All, Write: Very Limited (1) | Read: All, Write: None | Read: None (2), Write: None | Read: All (4), Write: All (4)
CEE | Read: All, Write: None | Read: All, Write: None | Read: None (2), Write: None | Read: None, Write: None
Customer Administrator | Read: None, Write: None | Read: None, Write: None | Read: All, Write: All | Read: Limited (5), Write: Limited (5)
Customer User | Read: None, Write: None | Read: None, Write: None | Read: Limited (3), Write: Limited (3) | Read: None, Write: None
Everybody Else | Read: None, Write: None | Read: None, Write: None | Read: None, Write: None | Read: None, Write: None

* - Cloud Infrastructure Account refers to the underlying AWS or Google Cloud account
1 - limited to addressing common use cases such as failing deployments, upgrading a cluster, and replacing bad worker nodes.
2 - Red Hat associates have no access to customer data by default.
3 - limited to namespaces created by the user and to what is granted via RBAC by the Customer Administrator role.
4 - SRE access to the cloud infrastructure account is a "break-glass" procedure for exceptional troubleshooting during a documented incident.
5 - Customer Administrator has limited access to the cloud infrastructure account console via Cloud Infrastructure Access

Customer access

Customer access is limited to namespaces created by the customer and permissions that are granted using RBAC by the Customer Administrator role. Access to the underlying infrastructure or product namespaces is generally not permitted without cluster-admin access. More information on customer access and authentication can be found in the Understanding Authentication section of the documentation.

Access approval and review

New SRE user access requires management approval. Separated or transferred SRE accounts are removed as authorized users through an automated process. Additionally, SRE performs periodic access reviews, including management sign-off of authorized user lists.

Security and regulation compliance

Security and regulation compliance includes tasks such as security controls implementation and compliance certification.

Data classification

Red Hat defines and follows a data classification standard to determine the sensitivity of data and highlight inherent risk to the confidentiality and integrity of that data while it is collected, used, transmitted, stored, or processed. Customer-owned data is classified at the highest level of sensitivity and handling requirements.

Data management

OSD uses cloud provider services to help securely manage keys for encrypted data (AWS KMS and Google Cloud KMS). These keys are used for control plane data volumes which are encrypted by default. Persistent volumes for customer applications also use these cloud services for key management.

When a customer deletes their OSD cluster, all cluster data is permanently deleted, including control plane data volumes, customer application data volumes (PVs), and backup data.

Vulnerability management

Red Hat performs periodic vulnerability scanning of OpenShift Dedicated using industry-standard tools. Identified vulnerabilities are tracked to their remediation according to timelines based on severity. Vulnerability scanning and remediation activities are documented for verification by third-party assessors in the course of compliance certification audits.

Network security

Firewall and DDoS protection

Each OSD cluster is protected by a secure network configuration at the cloud infrastructure level using firewall rules (AWS Security Groups or Google Cloud Compute Engine firewall rules). OSD customers on AWS are also protected against DDoS attacks with AWS Shield Standard.

Private clusters and network connectivity

Customers can optionally configure their OSD cluster endpoints (web console, API, and application router) to be made private so that the cluster control plane and/or applications are not accessible from the Internet. Customers can configure a private network connection to their OSD cluster via AWS VPC peering, AWS VPN, or AWS Direct Connect.

Cluster network access controls

Fine-grained network access control rules can be configured by customers per-project using NetworkPolicy objects and the OpenShift SDN.
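As an illustration of a per-project rule, the sketch below builds a NetworkPolicy manifest that only admits ingress traffic from pods in the same namespace. It is written as a Python dictionary serialized to JSON, which can be applied like any other manifest. The policy name, the helper function, and the "my-project" namespace are hypothetical.

```python
import json


def allow_same_namespace_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that allows ingress only from pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # An empty podSelector under "from" matches all pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical project name; pipe the output to a file to apply it as a manifest.
    print(json.dumps(allow_same_namespace_policy("my-project"), indent=2))
```

Once any NetworkPolicy selects a pod, ingress traffic not matched by a policy rule is denied, so this single object isolates the project from pods in other namespaces.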

Penetration testing

Red Hat performs periodic penetration tests against OpenShift Dedicated. Tests are performed by an independent internal team using industry-standard tools and best practices. Any issues discovered are prioritized based on severity. Any issues found in open source projects are shared with the community for resolution.

Customers may run their own penetration or stress tests of their OpenShift Dedicated clusters after receiving express written approval from Red Hat in a support case.

Compliance

Red Hat OpenShift Dedicated follows common industry best practices for security and controls. At this time, OpenShift Dedicated on both AWS and Google Cloud is PCI-DSS, ISO 27001, and SOC 2 Type 2 certified.

Disaster recovery

OpenShift Dedicated provides disaster recovery for failures that occur at the pod, worker node, infrastructure node, master node, and availability zone levels. All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture (e.g. single-zone deployment vs. multi-zone deployment) to account for the level of desired availability. This document contains more information about OpenShift Dedicated availability and potential points of failure.

One single-zone cluster will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple single-zone clusters with customer-maintained failover can account for outages at the zone or region levels.

One multi-zone cluster will not provide disaster avoidance or recovery in the event of a full region outage. Multiple multi-zone clusters with customer-maintained failover can account for outages at the region level.
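The coverage rules in this section can be summarized as a lookup from deployment topology to the outage levels it can account for. This is a simplified sketch with a hypothetical function name. It assumes the customer follows the high-availability best practices mentioned above, and that multiple clusters are placed in different zones or regions as needed.

```python
def coverable_outages(multi_zone: bool, cluster_count: int, customer_failover: bool) -> set:
    """Outage levels a deployment topology can account for, per the rules above."""
    # Failures at these levels are handled within any single cluster.
    levels = {"pod", "worker node", "infrastructure node", "master node"}
    if multi_zone:
        # A multi-zone cluster also survives the loss of one availability zone.
        levels.add("availability zone")
    if cluster_count > 1 and customer_failover:
        # Multiple clusters with customer-maintained failover can cover zone and region outages.
        levels |= {"availability zone", "region"}
    return levels
```

For example, `coverable_outages(True, 1, False)` includes "availability zone" but not "region", matching the statement that one multi-zone cluster does not protect against a full region outage.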