June 4, 2020 | by Jonathan Beakley
This document summarizes the quay.io incidents that occurred on May 19 and May 28, 2020. It describes the nature of the incidents, then outlines the steps being taken to reduce the likelihood of similar outages in the future.
Red Hat's quay.io container registry service experienced two periods of degraded performance and availability in May 2020. The first incident occurred from approximately 7 a.m. May 19, 2020, to 1 a.m. May 20, 2020, UTC. The second incident occurred from approximately 11 a.m. to 4 p.m. May 28, 2020, UTC.
During these periods, users experienced a range of outcomes, including slow container image access times and inability to retrieve container images. These issues affected several other Red Hat services, including OpenShift Cluster Manager, which is used to deploy and manage OpenShift clusters.
Red Hat Site Reliability Engineering teams concluded that several factors combined to form the root cause of these incidents. These factors include:
Following these two incidents, several actions have been defined to improve quay.io availability, reliability, and continuity. Some of these items are already complete, others are being actively worked on, and some are being researched. They are described in the sections below.
A number of actions have already been completed to address these two incidents. These actions include:
In addition to these completed actions, many others are pending and are described below.
The following actions are in progress and are targeted for completion by June 11, 2020:
The following actions have been proposed and are under consideration: