Not All Machine Learning Workloads are Alike, So Why are We Treating Them Like They are?
August 31, 2021 | by
This is a guest blog by Dr. Ronen Dar, CTO and co-founder of Run:AI.
At each stage of the machine learning / deep learning (ML/DL) process, researchers are completing specific tasks, and those tasks have unique requirements from their AI infrastructure. Data scientists’ compute needs are typically aligned to the following tasks:
Build sessions - where data scientists consume GPU power in interactive sessions to develop and debug their models. These sessions require instant and always-available GPUs, but require low compute and memory.
Training models - DL models are generally trained in long sessions. Training is highly compute-intensive, can run on multiple GPUs, and typically requires very high GPU utilization. Performance (in terms of time-to-train) is highly important. In a project lifecycle, there are long periods of time during which many concurrent training workloads are running (e.g. while optimizing hyperparameters) but also long periods of idle time in which only a small number of experiments are utilizing GPUs.
Inference - in this phase of development, trained DL models are inferencing on requests from real-time applications or from periodical systems that apply offline batch inferencing. Inference typically induces low GPU utilization and memory footprint (compared to training sessions).
Two for me, two for you
As you can see in the chart above, in order to accelerate the development of AI research, data scientists need access to as much or as little GPU as their development phase requires. Unfortunately, the standard Kubernetes scheduler typically allocates static, fixed numbers of GPU – ‘two for me, two for you’. This “one size fits all” approach to scheduling and allocating GPU compute means that if scientists are building models or working on inferencing they have too many GPUs, but when they try to train models they often have too few GPUs. In these environments, a significant number of GPUs remain idle at any given time. Simple resource sharing, like using another data scientist’s GPUs while they are idle, is not possible with static allocations.
Making GPU allocations dynamic
At Run:AI, we envisaged a scheduler that implements a concept which we refer to as “guaranteed quotas” to solve the static allocation challenge. Guaranteed quotas let users go over their static quota as long as idle GPUs are available.
How does a guaranteed quota system work?
Guaranteed quotas of GPUs, as opposed to projects with a static allocation, can use more GPUs than their quota allows. So ultimately the system allocates available resources to a job submitted to a queue, even if the queue is over quota. In cases where a job is submitted to an under-quota queue and there are not enough available resources to launch it, the scheduler starts to become smarter and pause a job from a queue that is over quota, while taking priorities and fairness into account.
Guaranteed quotas essentially break the boundaries of fixed allocations and make data scientists more productive, freeing them from limitations of the number of concurrent experiments they can run, or the number of GPUs they can use for multi-GPU training sessions. This greatly increases utilization of the overall GPU cluster. Researchers accelerate their data science and IT gains control over the full GPU cluster. Better scheduling greatly increases the utilization of the full cluster.
How do Run:AI and Red Hat Openshift Container Platform work together?
Run:AI creates an acceleration layer over GPU resources that manages granular scheduling, prioritization and allocation of compute power. A kubernetes-based dedicated batch scheduler, running on top of OCP, manages GPU-based workloads. It includes mechanisms for creating multiple queues, setting fixed and guaranteed resource quotas, and managing priorities, policies, and multi-node training. It provides an elegant solution to simplify complex ML scheduling processes. Companies using OpenShift-managed Kubernetes clusters can easily install Run:AI by using the operator available from the Red Hat Container Catalog.
Learn more about advanced scheduling for GPUs on OpenShift Container Platform at www.run.ai
Overview The policy framework in Red Hat Advanced Cluster Management for Kubernetes (RHACM) is a powerful feature that help you to govern your configurations across multiple clusters. You can enforce ...