Keynote: Multicluster Batch Jobs Dispatching with Kueue at CERN - Ricardo Rocha & Marcin Wielgus

Introduction to Kubernetes and Resource Management

Example
Researchers Adam, Brenda, and Chen want to run machine learning training jobs on a vanilla Kubernetes cluster, but submitting jobs at the same time can cause deadlocks and prevent the jobs from running due to insufficient resources (00:01:38).
Kueue is a project developed by the Kubernetes Working Group Batch that can admit and schedule workloads in full or keep them on hold until sufficient resources are available in the cluster (00:02:14).
Kueue lets admins assign resource quotas to teams so that each team has a guaranteed share of resources, and it supports quota borrowing and fair sharing so that unused resources do not sit idle (00:03:17).
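The deadlock and the admission behaviour described above can be illustrated with a toy model. This is a simplified sketch, not Kueue's implementation: the `Workload` class, both functions, and all GPU counts are hypothetical, and reclaiming lent quota through preemption is not modeled. The first function hands out GPUs one pod at a time, so three jobs submitted together each end up with only part of what they need. The second admits a workload only in full and lets a team use quota that other teams in the same group leave idle.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    team: str
    pods: int              # pods that must all run together
    gpus_per_pod: int = 1

    @property
    def gpus(self):
        return self.pods * self.gpus_per_pod


def schedule_pod_by_pod(workloads, free_gpus):
    """Toy stand-in for per-pod scheduling: GPUs go out one pod at a time."""
    placed = {w.team: 0 for w in workloads}
    progress = True
    while progress:
        progress = False
        for w in workloads:
            if placed[w.team] < w.pods and free_gpus >= w.gpus_per_pod:
                free_gpus -= w.gpus_per_pod
                placed[w.team] += 1
                progress = True
    # A training job can start only once every one of its pods is placed.
    return [w.team for w in workloads if placed[w.team] == w.pods]


def admit_with_quota(workloads, quotas):
    """Admit each workload in full or hold it; idle quota may be borrowed."""
    usage = {team: 0 for team in quotas}
    admitted, held = [], []
    for w in workloads:
        group_free = sum(quotas.values()) - sum(usage.values())
        if w.gpus <= group_free:
            usage[w.team] += w.gpus
            admitted.append(w.team)
        else:
            held.append(w.team)
    borrowed = {t: max(usage[t] - quotas[t], 0) for t in quotas}
    return admitted, held, borrowed


# Three 4-pod jobs compete for 8 GPUs: nobody gets a complete set.
same_time = [Workload("adam", 4), Workload("brenda", 4), Workload("chen", 4)]
print(schedule_pod_by_pod(same_time, free_gpus=8))

# Each team is guaranteed 4 GPUs; adam needs 6 and borrows chen's idle share.
queued = [Workload("adam", 6), Workload("brenda", 4),
          Workload("chen", 2), Workload("brenda", 2)]
print(admit_with_quota(queued, {"adam": 4, "brenda": 4, "chen": 4}))
```

In the first call no job can start because every job holds some GPUs and waits for more. In the second, the last workload is held rather than partially started, which is the behaviour that avoids the deadlock.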

Kueue can integrate with Cluster Autoscaler to request additional capacity for pending workloads, allowing for dynamic adjustment of cluster size (00:04:49).

The MultiKueue feature allows users to submit workloads to a single management cluster, which automatically distributes the workload across worker clusters, monitors which cluster admits it, and reflects the status in the management cluster (00:07:30).
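The dispatch flow described above can be sketched as a small simulation. This is an illustrative model only, assuming a "first worker to admit wins" rule; the `WorkerCluster` class, the `dispatch` function, the cluster names, and the GPU counts are all hypothetical and do not correspond to Kueue's API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerCluster:
    name: str
    free_gpus: int
    pending: set = field(default_factory=set)

    def try_admit(self, job, gpus):
        """Admit the job only if its full GPU request fits."""
        if job in self.pending and gpus <= self.free_gpus:
            self.free_gpus -= gpus
            self.pending.discard(job)
            return True
        return False


def dispatch(job, gpus, workers):
    """Toy management-cluster logic: offer the job everywhere, keep one copy."""
    # 1. Offer the workload to every worker cluster.
    for w in workers:
        w.pending.add(job)
    # 2. The first worker that can admit it in full becomes the runner.
    winner = next((w for w in workers if w.try_admit(job, gpus)), None)
    # 3. Once admitted somewhere, withdraw the copies from the other workers.
    if winner is not None:
        for w in workers:
            if w is not winner:
                w.pending.discard(job)
    # 4. Reflect the outcome as the status seen on the management cluster.
    return {
        "job": job,
        "status": "Admitted" if winner else "Pending",
        "cluster": winner.name if winner else None,
    }


onprem = WorkerCluster("onprem", free_gpus=0)
cloud = WorkerCluster("cloud", free_gpus=8)
status = dispatch("train-model", gpus=4, workers=[onprem, cloud])
print(status)
```

The user only interacts with the returned status, which mirrors the point made in the talk: the submitter does not need to know in advance which cluster will run the job.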

CERN uses the MultiKueue feature to support a use case for improved particle flow event reconstruction with scalable neural networks, providing a central place for physicists to submit jobs without worrying about where they will run (00:09:33).
CERN operates multiple clusters, including a Master cluster, an on-premises worker cluster, and a public cloud provider cluster, to manage workloads (00:10:26).

Kueue is used to dispatch batch jobs across multiple clusters, with queues defined for local, on-premises, and external clusters (00:10:52).
In the demo, a job submitted to the Master cluster actually runs on the on-premises cluster, and a second job is submitted to run on a remote cluster at a public cloud provider (00:12:47).
The demo also shows how Kueue can handle multiple jobs, preemption, and job deletion policies (00:13:30).
Another job is submitted to run on a remote cluster with GPUs; after the queue configuration is updated, the job runs successfully on that cluster (00:14:48).
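As a rough illustration of how a job like the ones in the demo is pointed at a queue, the sketch below builds a Kubernetes Job manifest in Python and prints it as JSON. The `kueue.x-k8s.io/queue-name` label and the `suspend` field are the standard way a Job is handed to Kueue; the helper `batch_job`, the queue name `cloud-gpu-queue`, the job name, and the container image are hypothetical and are not taken from the talk.

```python
import json


def batch_job(name, queue, image, gpus=0):
    """Build a Job manifest that Kueue manages through the named queue."""
    resources = {"requests": {"cpu": "1", "memory": "1Gi"}}
    if gpus:
        # Extended resources such as GPUs are requested through limits.
        resources["limits"] = {"nvidia.com/gpu": str(gpus)}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # The queue label decides which queue the job waits in.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Created suspended; the job starts only after it is admitted.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "main", "image": image, "resources": resources}
                    ],
                }
            },
        },
    }


manifest = batch_job(
    name="gpu-training",
    queue="cloud-gpu-queue",            # hypothetical queue name
    image="example.org/train:latest",   # hypothetical image
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be applied directly. Switching the target from one cluster to another then comes down to changing the queue name, assuming the queues are configured as described in the talk.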