
ADR-0006: Cluster lifecycle management

Status: proposed
Date: 2026-03-09
Group: cluster-management
Depends-on: ADR-0002, ADR-0005

Context

With bare-metal provisioning in place (ADR-0005), tenant organizations need dedicated Kubernetes clusters with full lifecycle management: provisioning, upgrades, certificate rotation, node rotation, auto-scaling, and self-healing. At 50,000 physical servers this means potentially hundreds of clusters that must be managed without manual intervention.

Options

Option 1: Gardener

  • Pros: full lifecycle management built in (upgrades, cert rotation, node rotation, auto-scaling, self-healing); proven at scale (SAP runs thousands of clusters); control planes run as pods on Seed cluster (efficient); native metal-stack integration; European origin (SAP/NeoNephos/LF Europe); Apache 2.0

  • Cons: complex to set up and operate; own concept model (Garden/Seed/Shoot) with learning curve; Seed cluster is a critical component; smaller community than CAPI

Option 2: Cluster API (CAPI) with infrastructure providers

  • Pros: Kubernetes-native declarative API; CNCF project; large community; modular — swap infrastructure providers; familiar to Kubernetes operators

  • Cons: no built-in day-2 operations (upgrades, cert rotation, self-healing are separate concerns); requires assembling multiple components; Metal³ provider is less mature than Gardener’s metal-stack integration

Option 3: Kamaji

  • Pros: efficient shared control planes; CNCF Sandbox; fast provisioning

  • Cons: young project; smaller community; central etcd requires careful capacity planning; less mature day-2 tooling

Option 4: Rancher (SUSE)

  • Pros: widely deployed in European government; UI-driven cluster management; supports multiple Kubernetes distributions; large community

  • Cons: commercial vendor dependency (SUSE); less suited for bare-metal at scale; no native metal-stack integration; day-2 operations less mature than Gardener's

Option 5: vCluster

  • Pros: lowest overhead; seconds to provision; works within existing cluster

  • Cons: no kernel isolation (container breakout affects other tenants); sync mechanism adds attack surface; commercial vendor dependency (Loft Labs)

Decision

Gardener. Native metal-stack integration means the full stack (provisioning → cluster lifecycle) is proven in production together. Gardener is the only option with comprehensive built-in day-2 operations at scale. CAPI requires assembling multiple components to achieve what Gardener provides out of the box. Rancher lacks native metal-stack integration. Kamaji is too young for production use at our scale. vCluster does not provide the kernel-level isolation required by ADR-0008 and EUCS SEAL-4. European governance aligns with sovereignty requirements.
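To make the lifecycle model concrete: in Gardener, a tenant cluster is requested declaratively as a Shoot resource, and day-2 operations (upgrades, node rotation, scaling) are driven by editing that single object. The sketch below is illustrative only — the cloud profile name, project namespace, region, machine type, and version are assumptions, not values from this ADR.

```yaml
# Hypothetical Shoot manifest for a tenant cluster on metal-stack.
# All concrete values (names, region, sizing) are illustrative assumptions.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: tenant-a-prod          # illustrative cluster name
  namespace: garden-tenant-a   # assumed Gardener project namespace
spec:
  cloudProfileName: metal      # assumed metal-stack cloud profile
  region: eu-west-1            # illustrative region
  kubernetes:
    version: "1.31"            # upgrades = bumping this field
  provider:
    type: metal                # metal-stack infrastructure provider
    workers:
      - name: workers
        machine:
          type: c1-xlarge      # illustrative machine type
        minimum: 3             # auto-scaling lower bound
        maximum: 10            # auto-scaling upper bound
```

A Kubernetes version upgrade or a change to the worker-pool bounds is applied by updating this manifest; the Gardener reconciliation loop on the Seed cluster carries out the rolling change, which is the built-in day-2 behavior the decision above relies on.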

Consequences

  • Seed cluster(s) must be operated with HA configuration

  • Gardener Extensions model provides the plugin mechanism for per-cluster services (DNS, certs, monitoring)

  • Tenant isolation is at the Shoot cluster level — each organization gets dedicated clusters

  • The platform team needs Gardener expertise