Quay.io May 2020 Incident Summary

June 4, 2020 | Jonathan Beakley

This document summarizes the quay.io incidents that occurred on May 19 and May 28, 2020. It describes the nature of the incidents and the steps being taken to reduce the likelihood of similar outages in the future.

Incidents Summary

Red Hat's quay.io container registry service experienced two periods of degraded performance and availability in May 2020. The first incident occurred from approximately 7 a.m. May 19, 2020, to 1 a.m. May 20, 2020, UTC. The second incident occurred from approximately 11 a.m. to 4 p.m. May 28, 2020, UTC.

During these periods, users experienced problems ranging from slow container image access to an inability to retrieve container images. These issues affected several other Red Hat services, including OpenShift Cluster Manager, which is used to deploy and manage OpenShift clusters.

Red Hat Site Reliability Engineering teams concluded that several factors combined to cause these incidents. These factors include:

non-optimal quay.io tuning in the areas of process parallelization and database access
traffic surges from simultaneous OpenShift platform upgrades

Following these two incidents, several actions have been defined to improve quay.io availability, reliability, and continuity. Some of these items are already complete, others are being actively worked on, and some are being researched. They are described in the sections below.

Completed Actions

A number of actions have already been completed to address these two incidents. These actions include:

Redeploying quay.io on a 4.3.18 OpenShift Dedicated cluster.
Indefinitely disabling garbage collection to reduce database load.
Optimizing several aspects of quay.io in the areas of process parallelization and database access. These optimizations are intended to prevent the database from being driven to the point of lockup.
Doubling the underlying quay.io database size/capacity to handle more traffic.
Suspending certain OpenShift Dedicated z-stream updates to prevent potential traffic spikes during OpenShift Dedicated upgrades.

In addition to these completed actions, many others are pending and are described below.

Short-term Actions

The following actions are in progress and are targeted for completion by June 11, 2020:

Creating database “read-replicas” that will distribute database load and prevent database lockups
Significantly speeding up the quay.io redeployment process for faster incident response
Creating a quay.io hot standby in a separate region
Adding pod caching for faster quay.io restart time
Improving monitoring for more visibility into quay.io health and the ability to detect potential problems
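
To illustrate how read replicas can distribute database load, the following is a minimal sketch of read/write routing. It is not quay.io's implementation: the `ReplicaRouter` class, the endpoint names, and the rule of routing on the leading SQL keyword are all assumptions made for illustration. Reads are spread across replicas in round-robin order, and everything else goes to the primary.

```python
import itertools


class ReplicaRouter:
    """Hypothetical router: SELECTs go to replicas, all else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin iterator over replicas; None when no replica is configured.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        """Return the endpoint name that should serve this statement."""
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        # Writes, transactions, and anything unrecognized stay on the primary.
        return self.primary


# Endpoint names below are placeholders, not real hosts.
router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
first_read = router.route("SELECT id FROM repository")
second_read = router.route("select id FROM tag")
write = router.route("UPDATE tag SET hidden = 1")
```

A real deployment would also have to account for replication lag: a read that must see a write made moments earlier may still need to be served by the primary.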

Long-term Actions

The following actions have been proposed and are under consideration:

Investigating networking issues that may have exacerbated the incidents
Conducting substantial load and performance analysis to identify additional parallelization/database optimizations
Reducing quay.io logging verbosity
Investigating additional caching methods
Investigating per-account rate-limiting
Upgrading/replacing various database components and sub-systems
Implementing numerous process and documentation improvements
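
As an illustration of what per-account rate-limiting could look like, here is a minimal token-bucket sketch. The source does not say which algorithm is under consideration, so the token bucket itself, the `AccountRateLimiter` name, and the `rate` and `burst` parameters are all assumptions. Each account gets its own bucket, so a surge from one account does not consume capacity intended for others.

```python
import time


class AccountRateLimiter:
    """Hypothetical per-account token bucket (illustrative only)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second, per account
        self.burst = burst    # maximum tokens an account can accumulate
        self.clock = clock    # injectable clock, which simplifies testing
        self._buckets = {}    # account -> (tokens, time of last update)

    def allow(self, account):
        """Consume one token for the account if available; return True if allowed."""
        now = self.clock()
        tokens, last = self._buckets.get(account, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[account] = (tokens - 1.0, now)
            return True
        self._buckets[account] = (tokens, now)
        return False


# Demonstration with a fake clock so the outcome does not depend on timing.
fake_now = [0.0]
limiter = AccountRateLimiter(rate=1.0, burst=2, clock=lambda: fake_now[0])
results = [limiter.allow("account-a") for _ in range(3)]
other_account = limiter.allow("account-b")
fake_now[0] = 1.0
after_refill = limiter.allow("account-a")
```

In this sketch a rejected request is simply refused. A registry would more likely answer with an HTTP 429 response so that well-behaved clients back off and retry later.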
