This is a guest blog by Dr. Ronen Dar, CTO and co-founder of Run:AI.

At each stage of the machine learning / deep learning (ML/DL) process, researchers complete specific tasks, and each task places unique demands on the AI infrastructure. Data scientists’ compute needs typically align to the following tasks:

  • Build sessions - where data scientists consume GPU power in interactive sessions to develop and debug their models. These sessions require instant, always-available GPUs, but consume little compute and memory.
  • Training models - DL models are generally trained in long sessions. Training is highly compute-intensive, can run on multiple GPUs, and typically requires very high GPU utilization. Performance (in terms of time-to-train) is highly important. In a project lifecycle, there are long periods of time during which many concurrent training workloads are running (e.g. while optimizing hyperparameters) but also long periods of idle time in which only a small number of experiments are utilizing GPUs.
  • Inference - in this phase of development, trained DL models serve requests from real-time applications or from periodic systems that apply offline batch inferencing. Inference typically involves low GPU utilization and a small memory footprint (compared to training sessions).


Two for me, two for you

As you can see in the chart above, to accelerate AI research, data scientists need access to as much, or as little, GPU compute as their development phase requires. Unfortunately, the standard Kubernetes scheduler allocates a static, fixed number of GPUs – ‘two for me, two for you’. This “one size fits all” approach to scheduling and allocating GPU compute means that scientists building models or running inference have too many GPUs, while scientists training models often have too few. In these environments, a significant number of GPUs sit idle at any given time. Simple resource sharing, such as borrowing another data scientist’s GPUs while they are idle, is not possible with static allocations.
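The static-allocation problem can be sketched in a few lines of Python. This is a hypothetical illustration (the user names and quota numbers are invented, and this is not Run:AI or Kubernetes code): each user holds a fixed GPU quota, and idle GPUs under another user's quota cannot be borrowed.

```python
# Hypothetical static GPU quotas per user (illustration only).
CLUSTER_QUOTAS = {"alice": 2, "bob": 2}

def can_schedule(user, gpus_needed, in_use):
    """With static allocation, a job must fit within the user's own quota,
    regardless of how many GPUs sit idle elsewhere in the cluster."""
    return in_use.get(user, 0) + gpus_needed <= CLUSTER_QUOTAS[user]

# Bob's 2 GPUs are idle, so 4 GPUs are physically free in the cluster,
# yet Alice's 4-GPU training job is blocked by her static quota of 2.
in_use = {"alice": 0, "bob": 0}
print(can_schedule("alice", 4, in_use))  # False
```

Four GPUs are free, but the job stays pending: the quota boundary, not the hardware, is the limit.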

Making GPU allocations dynamic

At Run:AI, we envisaged a scheduler that implements a concept which we refer to as “guaranteed quotas” to solve the static allocation challenge. Guaranteed quotas let users go over their static quota as long as idle GPUs are available. 

How does a guaranteed quota system work?

Projects with guaranteed quotas, as opposed to projects with static allocations, can use more GPUs than their quota specifies. The system allocates available resources to a job submitted to a queue even if that queue is over quota. When a job is submitted to an under-quota queue and there are not enough available resources to launch it, the scheduler pauses a job from a queue that is over quota, taking priorities and fairness into account.
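The decision rule described above can be sketched as a toy Python model (a simplified illustration under invented names, not Run:AI's actual implementation): run a job whenever enough idle GPUs exist, even if its queue goes over quota; when an under-quota queue cannot fit a job, pause over-quota jobs to reclaim GPUs.

```python
# Toy model of guaranteed-quota scheduling (not Run:AI code).
from dataclasses import dataclass, field

TOTAL_GPUS = 8  # hypothetical cluster size

@dataclass
class Queue:
    name: str
    quota: int                                   # guaranteed GPU count
    running: list = field(default_factory=list)  # GPU counts of running jobs

    @property
    def used(self):
        return sum(self.running)

def free_gpus(queues):
    return TOTAL_GPUS - sum(q.used for q in queues)

def submit(job_gpus, queue, queues):
    """Schedule a job, borrowing idle GPUs or preempting over-quota work."""
    if job_gpus <= free_gpus(queues):
        queue.running.append(job_gpus)   # idle GPUs exist: run, even over quota
        return "started"
    if queue.used < queue.quota:         # under-quota queue: reclaim GPUs
        for other in queues:
            while other.used > other.quota and job_gpus > free_gpus(queues):
                other.running.pop()      # pause an over-quota job
        if job_gpus <= free_gpus(queues):
            queue.running.append(job_gpus)
            return "started after preemption"
    return "pending"

a = Queue("team-a", quota=4)
b = Queue("team-b", quota=4)
queues = [a, b]
print(submit(6, a, queues))  # team-a borrows 2 idle GPUs over its quota
print(submit(4, b, queues))  # team-b is under quota: a team-a job is paused
```

In this toy model the scheduler pauses whole over-quota jobs until enough GPUs are free; a real scheduler would also weigh priorities and fairness when choosing which job to pause, as described above.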

Guaranteed quotas break the boundaries of fixed allocations and make data scientists more productive, freeing them from limits on the number of concurrent experiments they can run or the number of GPUs they can use for multi-GPU training sessions. Researchers accelerate their data science, IT gains control over the full GPU cluster, and better scheduling greatly increases the utilization of the entire cluster.

How do Run:AI and Red Hat OpenShift Container Platform work together?

Run:AI creates an acceleration layer over GPU resources that manages granular scheduling, prioritization and allocation of compute power. A dedicated Kubernetes-based batch scheduler, running on top of OpenShift Container Platform (OCP), manages GPU-based workloads. It includes mechanisms for creating multiple queues; setting fixed and guaranteed resource quotas; and managing priorities, policies, and multi-node training. It provides an elegant solution that simplifies complex ML scheduling processes. Companies using OpenShift-managed Kubernetes clusters can easily install Run:AI using the operator available from the Red Hat Container Catalog.

Learn more about advanced scheduling for GPUs on OpenShift Container Platform at

