# SkyRay: Seamlessly Extending KubeRay to Multi-Cluster Multi-Cloud Operation - Anne Holler, Elotl

Introduction to SkyRay and Sky Computing

  • SkyRay is an extension of KubeRay, aiming to seamlessly extend its operation from a single cluster environment to a multi-cluster, multi-cloud operation (00:00:16).
  • SkyRay builds on the concept of Sky Computing, which calls for a commodity cloud compute layer that makes using multiple clusters as easy as using one (00:02:42).
  • SkyRay works with a policy-driven Kubernetes fleet manager, which presents a Kubernetes API to the user and interoperates with KubeRay (00:03:04).
  • The fleet manager schedules KubeRay deployments on workload clusters according to a policy, and KubeRay handles the deployments on each cluster (00:05:43).

SkyRay's Policy-Driven Management with Nova

  • SkyRay uses the Nova fleet manager, which supports policies such as spread duplicate, specified cluster, priority, and available capacity (00:05:51).
  • SkyRay can be used to achieve various policy objectives, such as training and serving workloads, with examples available in an open-source repository (00:08:27).
  • SkyRay allows users to run jobs on a static cluster; if a job doesn't fit, the available capacity policy schedules it on a dynamic cluster that provisions resources on demand (00:10:11).
  • For experimental jobs, users can set up a cluster with a specified cluster placement policy, which allows for easy rescheduling and future scheduling on a different cluster if needed (00:11:55).
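
The available capacity behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the fleet manager's actual API: the `Cluster` class, its fields, and `place_by_available_capacity` are hypothetical names introduced here, and GPU count stands in for whatever resources a real scheduler would compare.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical model of a workload cluster; not a real fleet-manager type.
    name: str
    free_gpus: int
    dynamic: bool = False  # True if the cluster can add nodes on demand


def place_by_available_capacity(gpus_needed, clusters):
    """Prefer a static cluster with enough free GPUs, else a dynamic cluster.

    Returns the chosen cluster name, or None if no cluster qualifies.
    """
    for cluster in clusters:
        if not cluster.dynamic and cluster.free_gpus >= gpus_needed:
            return cluster.name
    for cluster in clusters:
        if cluster.dynamic:
            return cluster.name
    return None


clusters = [
    Cluster("static-a", free_gpus=4),
    Cluster("dynamic-b", free_gpus=0, dynamic=True),
]
small_job = place_by_available_capacity(2, clusters)  # fits on the static cluster
large_job = place_by_available_capacity(8, clusters)  # falls back to the dynamic one
```

The sketch assumes the dynamic cluster can always grow to fit the job; a real policy would presumably also check quotas and instance availability.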

Use Cases and Policy Objectives in SkyRay

  • SkyRay also supports serving for production and development, allowing users to create static clusters for production and dynamic clusters for development, with different GPU instances and auto-scaling policies (00:13:30).
  • A priority policy can be used to schedule workloads based on cloud provider, allowing users to prioritize certain cloud providers over others (00:15:36).
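
A priority policy of this kind amounts to walking an ordered list of providers and taking the first cluster that can accept the workload. The sketch below shows that idea only; the dictionary keys (`name`, `provider`, `available`) and the function name are assumptions made for illustration, not part of SkyRay or the fleet manager.

```python
def place_by_priority(provider_priority, clusters):
    """Pick the available cluster whose cloud provider ranks highest.

    provider_priority: provider names, most preferred first.
    clusters: dicts with hypothetical keys "name", "provider", "available".
    Returns the cluster name, or None if nothing is available.
    """
    for provider in provider_priority:
        for cluster in clusters:
            if cluster["provider"] == provider and cluster["available"]:
                return cluster["name"]
    return None


clusters = [
    {"name": "gke-1", "provider": "gcp", "available": True},
    {"name": "eks-1", "provider": "aws", "available": False},
    {"name": "aks-1", "provider": "azure", "available": True},
]

# AWS is preferred but its cluster is unavailable, so the next provider is used.
chosen = place_by_priority(["aws", "azure", "gcp"], clusters)
```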

Cost Optimization and Kubernetes Upgrades with SkyRay

  • SkyRay's just-in-time capability in standby mode allows clusters to scale to zero when idle, reducing costs, and can be used with CU to handle Ray jobs and services (00:16:13).
  • SkyRay can facilitate Kubernetes upgrades with no downtime to AI workloads by spreading duplicate workloads across labeled clusters and cloning clusters with the new Kubernetes version (00:18:47).

Multi-Cloud Operation and Cluster Management in SkyRay

  • SkyRay extends KubeRay for multi-cluster, multi-cloud operation, allowing for seamless deployment and management of clusters across different cloud providers (00:19:33).
  • The delete-recreate version of just-in-time clusters involves labeling a cluster, duplicating the workload, deploying the Ray service, and ensuring it serves before switching the load balancer and deleting the old cluster (00:19:56).
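
The ordering of the delete-recreate steps matters: the old cluster should only be removed once the new one is confirmed to serve. A minimal sketch of that sequencing follows, assuming a hypothetical `is_serving` health check supplied by the caller; the action strings are placeholders for real operations, and none of these names come from SkyRay itself.

```python
def replace_cluster(old, new, is_serving):
    """Return the ordered actions for replacing cluster `old` with `new`.

    is_serving: callable that reports whether the Ray service on a cluster
    answers requests; it stands in for a real health check.
    """
    actions = [
        f"label {new}",
        f"duplicate workload to {new}",
        f"deploy Ray service on {new}",
    ]
    if not is_serving(new):
        # Leave traffic on the old cluster if the new one is not ready.
        actions.append(f"keep load balancer on {old}")
        return actions
    actions.append(f"switch load balancer to {new}")
    actions.append(f"delete {old}")
    return actions


healthy = replace_cluster("cluster-v1", "cluster-v2", lambda cluster: True)
unhealthy = replace_cluster("cluster-v1", "cluster-v2", lambda cluster: False)
```

In the unhealthy case the old cluster is never deleted, which is what keeps the workload available throughout the swap.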

Compound AI Example and Efficient Resource Allocation

  • A compound AI example is demonstrated, combining a large language model (LLM) with retrieval-augmented generation (RAG): one cluster is dedicated to serving the LLM and another to ingestion, using GPU and CPU resources respectively (00:20:46).
  • The clusters are managed using labels and policies, allowing for efficient scaling and resource allocation, and can be easily deployed and managed with the right policies in place (00:22:07).
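
Label-based placement of this kind can be pictured as equality matching between a policy's selector and each cluster's labels, in the spirit of Kubernetes label selectors. The label keys, cluster names, and policy names below are invented for the example and are not taken from the talk.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical cluster labels for the two-cluster compound AI setup.
clusters = {
    "gpu-cluster": {"workload": "llm-serving", "accelerator": "gpu"},
    "cpu-cluster": {"workload": "rag-ingestion", "accelerator": "none"},
}

# Hypothetical policies: each maps a workload to a label selector.
policies = {
    "llm-serve": {"workload": "llm-serving"},
    "ingest": {"workload": "rag-ingestion"},
}

placement = {
    workload: [name for name, labels in clusters.items() if matches(selector, labels)]
    for workload, selector in policies.items()
}
```

Relabeling a cluster is then enough to change where a workload lands, without editing the workload itself.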

SkyRay's Goals and Availability

  • SkyRay aims to reduce launch time, increase efficiency, manage costs, enhance robustness, and facilitate cluster maintenance, and is available as an open-source solution for users to try (00:22:51).