| theme | title | info | transition | mdc |
|---|---|---|---|---|
| default | Multi-Cluster Kubernetes: Problems & Solutions | OCM, Sveltos, and the road to a standard API<br/>Guilhem Lettron — SRE France Meetup | slide-left | true |
Guilhem Lettron — SRE France Meetup
You already have one cluster. Why would you want many?
- Blast radius — isolate failure domains (prod / staging / per-team)
- Regulatory / data sovereignty — data must stay in a given region
- Edge & hybrid — workloads close to users or on-prem constraints
- Scaling limits — etcd / API server throughput ceiling
- Organizational — different teams, different upgrade cadences
Multi-cluster is not a choice. It's an inevitability.
You provision clusters with Cluster API, Terraform, cloud consoles…
But then:
- No single inventory of all clusters
- No standard way to describe cluster properties (version, region, labels)
- No health status aggregation
- Each tool has its own registration mechanism
"We have 17 clusters… I think. Let me check three different dashboards."
You need the same stack on every cluster:
- CNI, CSI, cert-manager, monitoring, policies…
- But with per-cluster variations (cloud provider, sizing, feature flags)
- Drift happens: someone `kubectl apply`s in prod
- Ordering matters: CRDs before controllers, Istio before apps
Manual Helm installs x N clusters = chaos
In many setups, managed clusters:
- Are behind NAT / firewalls (edge, on-prem)
- Have no direct inbound connectivity from the hub
- Require mTLS, short-lived tokens, certificate rotation
"Just expose the kubeconfig" is not an option in production.
flowchart LR
dev[You] -->|git push| repo["Git Repo"]
repo -->|pull| flux["Flux / ArgoCD"]
flux -->|apply| c1[Cluster 1]
flux -->|apply| c2[Cluster 2]
flux -->|apply| c3[Cluster ...]
flux -->|apply| cn[Cluster N]
Problem solved. Thank you, good night.
flowchart LR
fresh["Fresh Cluster<br/>(nothing running)"] -.-x flux["Flux / ArgoCD<br/>???"]
flux -->|pull| repo[Git Repo]
style fresh fill:#ff6b6b,color:#fff
style flux fill:#ff6b6b,color:#fff,stroke-dasharray: 5 5
- Flux / ArgoCD must already be running on the cluster to reconcile
- Chicken-and-egg: you need a tool to deploy the tool
- `flux bootstrap` / `argocd install` is... imperative
- You still need a push-based mechanism for day-0
Your "pull-only" workflow starts with a push.
graph TB
repo[Git Repo] -->|"1 branch<br/>1 folder"| boom["All clusters<br/>at once"]
repo -->|"N branches"| branches["branch-cluster-1<br/>branch-cluster-2<br/>branch-cluster-3<br/>...branch-cluster-47"]
repo -->|"N folders"| folders["clusters/prod-eu-1/<br/>clusters/prod-eu-2/<br/>clusters/prod-us-1/<br/>...clusters/staging-ap-3/"]
style boom fill:#ff6b6b,color:#fff
style branches fill:#f5a623,color:#fff
style folders fill:#f5a623,color:#fff
- 1 repo + 1 branch = one merge hits every cluster simultaneously
- N branches / N folders = you're back to managing things one by one
- Progressive rollout? Write custom CI pipelines to promote between folders
- 30 clusters x 15 addons = 450 files to maintain — that's not automation
flowchart LR
dev[Developer] -->|git push| repo[Git Repo]
repo -->|pull| agent1[Agent<br/>Cluster 1]
repo -->|pull| agent2[Agent<br/>Cluster 2]
repo -->|pull| agentN[Agent<br/>Cluster N]
agent1 -->|apply| c1[Cluster 1]
agent2 -->|apply| c2[Cluster 2]
agentN -->|apply| cN[Cluster N]
c1 -.->|"???"| dev
c2 -.->|"???"| dev
cN -.->|"???"| dev
- Git is write-only — no native status feedback channel
- "I merged, did it deploy?" → check N dashboards /
kubectlon N clusters - Errors buried in controller logs, not in your PR
- No aggregated status across the fleet — "deployed to 28/30 clusters" does not exist
GitOps solves single-cluster continuous delivery.
Multi-cluster needs something more:
| Need | GitOps | What's missing |
|---|---|---|
| Bootstrap | `flux bootstrap` (imperative) | Push-based day-0 |
| Blast radius | N folders / branches | Native progressive rollout |
| Feedback | Per-cluster logs | Aggregated fleet status |
| Cluster inventory | Manual cluster list | Dynamic discovery |
GitOps solves delivery. Multi-cluster needs orchestration.
| Concern | Tool |
|---|---|
| Cluster lifecycle | Cluster API, Terraform, cloud CLIs |
| Cluster inventory & registration | OCM, Rancher, Fleet |
| Addon deployment | Sveltos, Flux, ArgoCD |
| Workload scheduling | Karmada, KubeFleet, ArgoCD |
| Standard cluster API | SIG Multicluster ClusterProfile |
None of these tools does everything. Composability is the goal.
graph TB
subgraph Hub["Hub Cluster"]
MC[ManagedCluster API]
PL[Placement API]
MW[ManifestWork]
AD[Addon Framework]
end
subgraph SpokeA["Managed Cluster A"]
KA[klusterlet agent]
end
subgraph SpokeB["Managed Cluster B"]
KB[klusterlet agent]
end
KA -->|registers| MC
KB -->|registers| MC
PL -->|selects| MC
AD -->|deploys via| MW
MW -->|applied by| KA
MW -->|applied by| KB
The ManagedCluster resource on the hub:
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster-eu-west-1
  labels:
    cloud: aws
    region: eu-west-1
    env: production
status:
  conditions:
    - type: ManagedClusterConditionAvailable
      status: "True"
  version:
    kubernetes: v1.30.2

- Auto-registered by the klusterlet agent
- Labels → queryable inventory
- Conditions → aggregated health
"Deploy this to all production clusters in EU"
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: eu-prod
spec:
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            env: production
            region: eu-west-1

→ Produces a PlacementDecision with the list of matching clusters.
Used by addons, policies, and workload controllers.
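For illustration, this is roughly what the generated PlacementDecision looks like on the hub; the object name and cluster names below are illustrative, the controller writes and updates it for you.

```yaml
# Created and kept up to date by the Placement controller, not written by hand.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: eu-prod-decision-1                              # generated name (illustrative)
  labels:
    cluster.open-cluster-management.io/placement: eu-prod
status:
  decisions:
    - clusterName: cluster-eu-west-1
      reason: ""
    - clusterName: cluster-eu-west-2
      reason: ""
```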
An OCM addon is not a Helm chart. It's a framework:
- `ClusterManagementAddOn` — defines the addon globally
- `ManagedClusterAddOn` — enables it per cluster
- Addon controller runs on the hub, deploys agents to spokes via `ManifestWork`
Examples of OCM addons:
- `cluster-proxy` — reverse tunnel for hub → spoke connectivity
- `managed-serviceaccount` — automated token lifecycle
- `sveltos-ocm-addon` — bridges OCM and Sveltos
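A minimal sketch of how an addon is enabled for one cluster, using cluster-proxy as the example; the install namespace and strategy shown here are illustrative defaults:

```yaml
# Hub-wide definition of the addon (usually shipped by the addon's own chart).
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: cluster-proxy
spec:
  addOnMeta:
    displayName: cluster-proxy
  installStrategy:
    type: Manual
---
# Enables the addon for one cluster: the namespace is the ManagedCluster's name.
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: cluster-proxy
  namespace: cluster-eu-west-1
spec:
  installNamespace: open-cluster-management-agent-addon
```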
ClusterProfile: declare what to deploy and where
apiVersion: config.projectsveltos.io/v1beta1
kind: ClusterProfile
metadata:
  name: monitoring-stack
spec:
  clusterSelector:
    matchLabels:
      env: production
  helmCharts:
    - repositoryURL: https://prometheus-community.github.io/helm-charts
      chartName: kube-prometheus-stack
      chartVersion: "65.1.0"
      releaseName: monitoring
      releaseNamespace: monitoring
      values: |
        grafana:
          enabled: true
  syncMode: ContinuousWithDriftDetection

| Feature | What it does |
|---|---|
| Drift detection | Detects & auto-corrects config drift on managed clusters |
| Templating | Values from management or managed cluster (Go templates) |
| Deployment order | Sequential within a profile, dependencies between profiles |
| Progressive rollout | Phased rollout across cluster groups |
| Multi-tenancy | Profile (namespaced) vs ClusterProfile (cluster-wide) |
| Tier / conflict resolution | When two profiles target the same resource, tier wins |
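To make the multi-tenancy row concrete: a namespaced `Profile` carries the same spec as a `ClusterProfile`, but only matches clusters registered in its own namespace, so a platform team can hand a tenant a namespace without exposing the whole fleet. A minimal sketch, with illustrative names and chart version:

```yaml
# Namespaced variant: only matches clusters registered in `team-payments`,
# so this tenant cannot target clusters registered elsewhere on the hub.
apiVersion: config.projectsveltos.io/v1beta1
kind: Profile
metadata:
  name: payments-addons
  namespace: team-payments
spec:
  clusterSelector:
    matchLabels:
      env: staging
  helmCharts:
    - repositoryURL: https://charts.bitnami.com/bitnami
      chartName: redis
      chartVersion: "20.0.0"        # illustrative version
      releaseName: cache
      releaseNamespace: cache
```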
Sveltos watches SveltosCluster resources:
apiVersion: lib.projectsveltos.io/v1beta1
kind: SveltosCluster
metadata:
  name: cluster-eu-west-1
  namespace: default
  labels:
    cloud: aws
    region: eu-west-1
spec:
  kubeconfigKeyName: kubeconfig

Problem: who creates these SveltosCluster objects?
- Manually? Doesn't scale.
- From Cluster API? Works, but only if you use CAPI.
- From OCM? That's where sveltos-ocm-addon comes in.
Automatically registers OCM managed clusters as Sveltos clusters
flowchart LR
MC[OCM ManagedCluster] --> addon[sveltos-ocm-addon]
CP[cluster-proxy<br/>kubeconfig via tunnel] --> addon
addon --> SC[SveltosCluster]
- `ManagedClusterAddOn` deployed to selected clusters (via Placement)
- Controller creates a `ManagedServiceAccount` → gets a token
- Builds a kubeconfig routed through cluster-proxy
- Creates the `SveltosCluster` on the hub with synced labels
→ Sveltos immediately picks up the new cluster and deploys matching profiles.
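For the token step, a rough sketch of the per-cluster object the controller creates; the field names are an assumption based on the managed-serviceaccount addon and should be checked against its CRD:

```yaml
# Assumption: approximate shape of the ManagedServiceAccount created per cluster.
# The resulting token lands in a Secret in the cluster's namespace on the hub.
apiVersion: authentication.open-cluster-management.io/v1beta1
kind: ManagedServiceAccount
metadata:
  name: sveltos-ocm-addon          # illustrative name
  namespace: cluster-eu-west-1     # one namespace per managed cluster on the hub
spec:
  rotation:
    validity: 720h                 # short-lived token, rotated automatically
```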
flowchart LR
subgraph Hub["Hub Cluster"]
MC[ManagedCluster] --> MCA[ManagedClusterAddOn]
MCA --> MSA[ManagedServiceAccount]
MSA --> ctrl[sveltos-ocm-addon<br/>controller]
MC --> ctrl
ctrl --> SC[SveltosCluster]
SC --> sveltos[Sveltos<br/>addon-controller]
ctrl <--> proxy[cluster-proxy<br/>tunnel]
end
subgraph A["Managed Cluster A"]
KA[klusterlet]
end
subgraph B["Managed Cluster B"]
KB[klusterlet]
end
proxy <-.-> KA
proxy <-.-> KB
sveltos -.->|deploy addons| A
sveltos -.->|deploy addons| B
flowchart LR
A["1. Register<br/>cluster with OCM"] --> B["2. Placement<br/>selects clusters"]
B --> C["3. sveltos-ocm-addon<br/>creates SveltosCluster"]
C --> D["4. ClusterProfile<br/>matches & deploys"]
D --> E["5. Drift detection<br/>keeps state"]
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#f5a623,color:#fff
style D fill:#7ed321,color:#fff
style E fill:#7ed321,color:#fff
Each layer does one thing well. Composability > monolith.
Today every project reinvents the cluster object:
| Project | Cluster resource |
|---|---|
| OCM | ManagedCluster |
| Cluster API | Cluster |
| Sveltos | SveltosCluster |
| Karmada | Cluster |
| Rancher | clusters.management.cattle.io |
→ No interoperability. Bridges everywhere. Sound familiar?
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ClusterProfile
metadata:
  name: cluster-eu-west-1
  namespace: fleet-inventory
spec:
  displayName: "EU West Production"
  clusterManager:
    name: ocm
status:
  version:
    kubernetes: v1.30.2
  properties:
    - name: region
      value: eu-west-1
  conditions:
    - type: ControlPlaneHealthy
      status: "True"

- Namespace-scoped (multiple inventories on one hub)
- Cluster Manager creates and updates status
- Consumers read for scheduling / placement decisions
flowchart LR
CAPI[Cluster API] --> CP["ClusterProfile<br/>(standard API)"]
OCM[OCM] --> CP
Karmada --> CP
CP --> Sveltos
CP --> ArgoCD
CP --> Flux
- Cluster managers (OCM, CAPI, Karmada) populate `ClusterProfile`
- Consumers (Sveltos, Argo, Flux) read `ClusterProfile` to discover targets
- No more per-project bridges
The bridge I wrote (sveltos-ocm-addon) should eventually become unnecessary.
We can now:
- Discover clusters (OCM)
- Deploy addons consistently (Sveltos)
- Communicate securely through tunnels (cluster-proxy)
But what about applications?
- Where should this workload run? The cluster with the most available capacity?
- How do I spread replicas across failure domains?
- How do I do a progressive rollout across the fleet?
- What if a cluster goes down mid-rollout?
Addons = same everywhere. Apps = smart placement.
flowchart TB
subgraph Hub["Karmada Control Plane"]
D[Deployment] --> PP[PropagationPolicy]
PP --> OP[OverridePolicy<br/>per-cluster values]
end
PP -->|"spread: 3 clusters<br/>weighted by capacity"| C1[Cluster EU<br/>4 replicas]
PP --> C2[Cluster US<br/>6 replicas]
PP --> C3[Cluster AP<br/>2 replicas]
- PropagationPolicy — where to schedule, how many replicas per cluster
- OverridePolicy — per-cluster customizations (image registry, resource limits)
- Replica scheduling — distribute by capacity (DynamicWeight), region, cost
- Failover — auto-migrate replicas when a cluster fails
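A minimal sketch of the two policies from the diagram, for a Deployment named `web`; cluster names, weights, and the registry are illustrative, and replicas are divided by static weights here, whereas the diagram's capacity-based split would use dynamic weighting:

```yaml
# Propagate the `web` Deployment to three clusters, dividing replicas by weight.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: web-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: web
  placement:
    clusterAffinity:
      clusterNames: [cluster-eu, cluster-us, cluster-ap]
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames: [cluster-us]
            weight: 3
          - targetCluster:
              clusterNames: [cluster-eu]
            weight: 2
          - targetCluster:
              clusterNames: [cluster-ap]
            weight: 1
---
# Per-cluster tweak: pull images from a regional registry in the US cluster.
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: web-us-registry
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: web
  overrideRules:
    - targetCluster:
        clusterNames: [cluster-us]
      overriders:
        imageOverrider:
          - component: Registry
            operator: replace
            value: registry.us.example.com
```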
flowchart LR
subgraph Hub["KubeFleet Hub"]
CRP[ClusterResourcePlacement]
SCH[Scheduler<br/>capacity + affinity + topology]
end
CRP --> SCH
SCH -->|"pick best N"| M1[Member Cluster 1]
SCH --> M2[Member Cluster 2]
SCH --> M3[Member Cluster 3]
M1 -.->|status| Hub
M2 -.->|status| Hub
M3 -.->|status| Hub
- Hub-spoke (agent-initiated, like OCM — works behind NAT)
- Scheduler plugins — capacity, affinity, topology spread, cost, GPU
- Progressive rollout — staged updates with health checks at each step
- Status aggregation — fleet-wide deployment status on the hub
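A hedged sketch of a `ClusterResourcePlacement` under the upstream KubeFleet API group; the namespace, cluster count, and rollout settings are illustrative:

```yaml
# Place everything in the `web` namespace on the 3 best member clusters,
# letting the scheduler pick them, and roll updates out gradually.
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: web-placement
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: web
  policy:
    placementType: PickN
    numberOfClusters: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```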
flowchart TB
subgraph Lifecycle["Cluster Lifecycle"]
CAPI[Cluster API / Terraform]
end
subgraph Inventory["Cluster Inventory"]
OCM["OCM<br/>registration + discovery"]
end
subgraph Addons["Addon Management"]
SV["Sveltos<br/>CNI, monitoring, policies..."]
end
subgraph Apps["App Scheduling"]
KR["Karmada / KubeFleet<br/>intelligent placement"]
end
CAPI --> OCM
OCM --> SV
OCM --> KR
style Lifecycle fill:#8e8e8e,color:#fff
style Inventory fill:#4a9eff,color:#fff
style Addons fill:#7ed321,color:#fff
style Apps fill:#bd10e0,color:#fff
Each layer solves one concern. Pick the tools that fit your needs.
- Multi-cluster is inevitable — plan for it early
- GitOps alone is not enough — bootstrap, blast radius, feedback are unsolved
- Separate concerns: inventory (OCM) vs addons (Sveltos) vs app scheduling (Karmada/KubeFleet)
- Composability wins — no single tool does everything well
- Bridges are necessary today — sveltos-ocm-addon connects OCM ↔ Sveltos
- SIG Multicluster ClusterProfile is the upstream path to eliminate bridges
- sveltos-ocm-addon: github.com/guilhem/sveltos-ocm-addon
- OCM: open-cluster-management.io
- Sveltos: projectsveltos.github.io/sveltos
- Karmada: karmada.io
- KubeFleet: github.com/kubefleet-dev/kubefleet
- SIG Multicluster: multicluster.sigs.k8s.io
- KEP-4322 (Cluster Inventory): github.com/kubernetes/enhancements/tree/master/keps/sig-multicluster/4322-cluster-inventory
@guilhem — github.com/guilhem