Keynote: Multicluster Batch Jobs Dispatching with Kueue at CERN - Ricardo Rocha & Marcin Wielgus


Introduction to Kubernetes and Resource Management

  • Example
  • Researchers Adam, Brenda, and Chen want to run machine learning training jobs on a vanilla Kubernetes cluster. If they submit their jobs at the same time, the cluster can deadlock: the jobs compete for the same insufficient resources and none of them gets enough to run (00:01:38).
  • Kueue is a project developed by the Kubernetes Working Group Batch that can admit and schedule workloads in full or keep them on hold until sufficient resources are available in the cluster (00:02:14).
  • Kueue lets admins assign resource quotas to teams so that each team has a guaranteed share of resources. Quota borrowing and fair sharing keep resources that one team is not using from sitting idle (00:03:17).
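
The difference between pod-by-pod scheduling and admitting a workload in full can be illustrated with a toy model. The sketch below is not Kueue code: the `Job` class, the slot-based capacity, and the round-robin placement are all assumptions made for illustration. It only shows why three jobs that each need four workers can stall on an eight-slot cluster, and how holding one job back lets the other two run.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int  # pods that must all run together for training to progress


def pod_by_pod(jobs, capacity):
    """Naive scheduling: hand out free slots one pod at a time, round-robin."""
    placed = {job.name: 0 for job in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for job in jobs:
            if free > 0 and placed[job.name] < job.workers:
                placed[job.name] += 1
                free -= 1
                progress = True
    # A job only runs if every one of its pods got a slot.
    return [job.name for job in jobs if placed[job.name] == job.workers]


def all_or_nothing(jobs, capacity):
    """Admit a job only if it fits in full; otherwise keep it on hold."""
    free = capacity
    admitted, on_hold = [], []
    for job in jobs:
        if job.workers <= free:
            free -= job.workers
            admitted.append(job.name)
        else:
            on_hold.append(job.name)
    return admitted, on_hold


jobs = [Job("adam", 4), Job("brenda", 4), Job("chen", 4)]
print(pod_by_pod(jobs, capacity=8))      # no job gets all four workers
print(all_or_nothing(jobs, capacity=8))  # two jobs run, one waits
```

In the naive case all eight slots end up occupied by partially placed jobs, which is the deadlock from the example. In the all-or-nothing case the third job simply waits until capacity frees up.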

Autoscaled Cluster

  • Kueue can integrate with Cluster Autoscaler to request additional capacity for pending workloads, allowing for dynamic adjustment of cluster size (00:04:49).
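
The notes do not describe how the capacity request is computed, so the following is only a back-of-the-envelope sketch of the idea: given a pending workload that must fit in full, how many nodes would have to be added? The function name and its GPU-based parameters are hypothetical and are not part of Kueue or Cluster Autoscaler.

```python
import math


def extra_nodes_needed(pending_gpus: int, free_gpus: int, gpus_per_node: int) -> int:
    """Nodes to add so a pending workload fits in full (0 if it already fits)."""
    shortfall = max(0, pending_gpus - free_gpus)
    return math.ceil(shortfall / gpus_per_node)


# A workload needs 16 GPUs, 3 are free, and each new node brings 4 GPUs.
print(extra_nodes_needed(pending_gpus=16, free_gpus=3, gpus_per_node=4))
```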

Multicluster

  • The MultiKueue feature allows users to submit workloads to a single management cluster, which automatically distributes the workload across worker clusters, monitors which cluster admits it, and reflects the status in the management cluster (00:07:30).
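
The dispatch flow described above can be sketched as a small simulation. This is a toy model rather than the MultiKueue implementation: the `dispatch` function, the dictionary shapes, and the capacity numbers are hypothetical, and the step that removes the copies from the clusters that did not admit the workload is an assumption about how cleanup works.

```python
def dispatch(workload: dict, workers: dict) -> dict:
    """Copy a workload to every worker cluster and keep the first that admits it.

    `workers` maps a cluster name to its free capacity (hypothetical units).
    """
    # Step 1: the management cluster creates a copy on each worker cluster.
    copies = list(workers)

    # Step 2: watch which worker admits the copy (here: the first with enough room).
    winner = next((name for name in copies if workers[name] >= workload["size"]), None)
    if winner is None:
        return {"name": workload["name"], "status": "Pending", "cluster": None}

    # Step 3 (assumption): reserve capacity on the winner and drop the other copies.
    workers[winner] -= workload["size"]
    removed = [name for name in copies if name != winner]

    # Step 4: reflect the outcome on the object in the management cluster.
    return {"name": workload["name"], "status": "Admitted",
            "cluster": winner, "removed_from": removed}


workers = {"on-prem": 2, "cloud-gpu": 8}
print(dispatch({"name": "train", "size": 4}, workers))
```

The user only ever looks at the returned status in the management cluster; which worker cluster ran the workload is a detail of the result, not something chosen at submission time.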

CERN

  • CERN uses the MultiKueue feature to support a use case for improved particle flow event reconstruction with scalable neural networks, giving physicists a central place to submit jobs without worrying about where they will run (00:09:33).
  • CERN operates multiple clusters to manage workloads: a master (management) cluster, an on-premises worker cluster, and a cluster at a public cloud provider (00:10:26).

CERN's Use Case and Multi-Cluster Management

  • Kueue is used to dispatch batch jobs across multiple clusters, with queues defined for local, on-premises, and external clusters (00:10:52).
  • In the demo, a job is submitted to the master cluster but actually runs on the on-premises cluster. A second job is submitted to run on a remote cluster at a public cloud provider (00:12:47).
  • The demo also shows how Kueue can handle multiple jobs, preemption, and job deletion policies (00:13:30).
  • Another job is submitted to run on a remote cluster with GPUs. After the queue configuration is updated, the job runs successfully on that cluster (00:14:48).
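
The queue configuration used in the demo is not reproduced in these notes. As a rough idea of what a queue that forwards work to a remote GPU cluster might look like, the sketch below builds three Kueue objects as Python dictionaries and prints them as JSON. Every name and quota value is hypothetical. The field names are meant to follow the Kueue `v1beta1` API and should be checked against the installed release, since the MultiKueue objects have lived under different API versions over time.

```python
import json

API_VERSION = "kueue.x-k8s.io/v1beta1"  # assumption: depends on the installed Kueue release

# Hypothetical names and quotas; none of them come from the CERN demo.
cluster_queue = {
    "apiVersion": API_VERSION,
    "kind": "ClusterQueue",
    "metadata": {"name": "remote-gpu"},
    "spec": {
        "namespaceSelector": {},  # accept workloads from every namespace
        "resourceGroups": [{
            "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
            "flavors": [{
                "name": "default-flavor",
                "resources": [
                    {"name": "cpu", "nominalQuota": 32},
                    {"name": "memory", "nominalQuota": "128Gi"},
                    {"name": "nvidia.com/gpu", "nominalQuota": 4},
                ],
            }],
        }],
        # Admission is delegated to the MultiKueue admission check below.
        "admissionChecks": ["multikueue-remote"],
    },
}

admission_check = {
    "apiVersion": API_VERSION,
    "kind": "AdmissionCheck",
    "metadata": {"name": "multikueue-remote"},
    "spec": {
        "controllerName": "kueue.x-k8s.io/multikueue",
        "parameters": {
            "apiGroup": "kueue.x-k8s.io",
            "kind": "MultiKueueConfig",
            "name": "remote-clusters",
        },
    },
}

multikueue_config = {
    "apiVersion": API_VERSION,
    "kind": "MultiKueueConfig",
    "metadata": {"name": "remote-clusters"},
    "spec": {"clusters": ["cloud-gpu-worker"]},  # hypothetical worker cluster name
}

for manifest in (cluster_queue, admission_check, multikueue_config):
    print(json.dumps(manifest, indent=2))
```

In this sketch, making GPUs schedulable through the queue amounts to listing `nvidia.com/gpu` under `coveredResources` and giving it a quota. A job requesting a resource that the queue does not cover would not be admitted, which is one plausible reason a queue update would be needed before a GPU job can run.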