- Example
- Researchers Adam, Brenda, and Chen want to run machine learning training jobs on a vanilla Kubernetes cluster, but when they submit jobs at the same time, the jobs can deadlock and fail to run because the cluster does not have enough resources for all of them (00:01:38).
- Kueue is a project developed by the Kubernetes Working Group Batch that can admit and schedule workloads in full or keep them on hold until sufficient resources are available in the cluster (00:02:14).
- Kueue allows admins to assign resource quotas to teams so that each team has a guaranteed share of resources, and it supports quota borrowing and fair sharing so that unused quota does not sit idle (00:03:17).
- Kueue can integrate with Cluster Autoscaler to request additional capacity for pending workloads, allowing for dynamic adjustment of cluster size (00:04:49).
- The MultiKueue feature allows users to submit workloads to a single management cluster, which automatically distributes the workload across worker clusters, monitors which cluster admits it, and reflects the status in the management cluster (00:07:30).
- CERN uses the MultiKueue feature to support a use case for improved particle flow event reconstruction with scalable neural networks, providing a central place for physicists to submit jobs without worrying about where they will run (00:09:33).
- CERN operates multiple clusters, including a Master cluster, an on-premises worker cluster, and a public cloud provider cluster, to manage workloads (00:10:26).
- Kueue is used to dispatch batch jobs across multiple clusters, with queues defined for local, on-premises, and external clusters (00:10:52).
- A demo is shown where a job is submitted to the Master cluster, but actually runs on an on-premises cluster, and another job is submitted to run on a remote cluster in a public cloud provider (00:12:47).
- The demo also shows how Kueue can handle multiple jobs, preemption, and job deletion policies (00:13:30).
- Another job is submitted to run on a remote cluster with GPUs, and after the queue configuration is updated, the job runs successfully on that cluster (00:14:48).
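
The all-or-nothing admission and quota borrowing described in the summary can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Kueue's actual implementation or API: the `Team`, `Workload`, `try_admit`, and `release` names, the job names, and the quota of 4 GPUs per team are all hypothetical, and the sketch leaves out preemption, fair sharing, and autoscaling.

```python
from dataclasses import dataclass


@dataclass
class Team:
    """Hypothetical stand-in for a team's queue with a guaranteed GPU quota."""
    name: str
    quota: int      # guaranteed GPUs (illustrative number, not from the talk)
    used: int = 0   # GPUs currently held by admitted workloads


@dataclass
class Workload:
    name: str
    team: str
    gpus: int       # GPUs the job needs in total before it can start


def free_capacity(teams):
    """GPUs that no admitted workload is using, across all teams."""
    return sum(t.quota for t in teams.values()) - sum(t.used for t in teams.values())


def borrowed(team):
    """GPUs a team is using beyond its own guaranteed quota."""
    return max(team.used - team.quota, 0)


def try_admit(workload, teams):
    """Admit the workload in full or keep it on hold; never reserve part of it."""
    if workload.gpus > free_capacity(teams):
        return False  # on hold: no partial allocation that could cause a deadlock
    teams[workload.team].used += workload.gpus
    return True


def release(workload, teams):
    """Return a finished workload's GPUs to the shared pool."""
    teams[workload.team].used -= workload.gpus


# Three teams with 4 guaranteed GPUs each (12 GPUs in total).
teams = {name: Team(name, quota=4) for name in ("adam", "brenda", "chen")}
jobs = [
    Workload("train-a", "adam", 6),    # needs 2 GPUs more than Adam's quota
    Workload("train-b", "brenda", 4),
    Workload("train-c", "chen", 4),
]

# Jobs are considered in submission order.
results = {job.name: try_admit(job, teams) for job in jobs}
adam_borrowed = borrowed(teams["adam"])

# Once Adam's job finishes, the held job can be admitted in full.
release(jobs[0], teams)
retry = try_admit(jobs[2], teams)
```

In this run, `train-a` borrows unused quota and `train-c` stays on hold rather than starting with only part of its GPUs. A real deployment would typically let Chen reclaim the guaranteed quota instead of waiting, which is where the preemption mentioned in the demo comes in.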