ADR-0009: Availability zones
- Status: proposed
- Date: 2026-03-09
- Group: cross-cutting
- Depends-on: ADR-0002, ADR-0003
Context
Government workloads require high availability and disaster resilience. A single datacenter is a single point of failure (power, cooling, network, physical incidents). The number of availability zones determines the resilience model, the complexity of data replication, and the network architecture between sites.
Options
Option 1: Single AZ (1 datacenter)
- Pros: simplest operations; no cross-site networking; no replication latency; lowest infrastructure cost
- Cons: single point of failure; no disaster resilience
Option 2: 2 AZs
- Pros: survives single-site failure; simpler than 3-site; lower infrastructure investment than 3 AZs
- Cons: split-brain risk for distributed systems (with members spread over only 2 sites, no majority quorum survives the loss of an arbitrary site); failover capacity requires 2x provisioning
Option 3: 3 AZs
- Pros: quorum-based consensus possible (etcd, Ceph, etc.); survives single-site failure without split-brain; capacity can be distributed (each site can run at up to ~66% utilization instead of 50% with 2 AZs)
- Cons: cross-site network complexity; data replication across 3 sites; higher infrastructure investment
Option 4: 4+ AZs
- Pros: can survive multiple simultaneous site failures (for majority-quorum systems this needs 5 or more sites; 4 sites still tolerate only one failure); more granular capacity distribution
- Cons: significantly higher infrastructure and operational cost; diminishing returns beyond 3 for quorum-based systems; cross-site replication complexity increases with each AZ
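The quorum and capacity figures quoted in the options follow from simple arithmetic. The sketch below is illustrative only: it assumes one voting member per AZ, equally sized AZs, and evenly spread load, and the function names are hypothetical rather than taken from any tool referenced in this ADR.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Number of voters that can be lost while a majority still remains."""
    return voters - majority_quorum(voters)


def max_utilization(az_count: int) -> float:
    """Highest steady-state utilization per AZ that still lets the
    remaining AZs absorb the full load of one failed AZ.

    Assumes equally sized AZs and evenly distributed load.
    """
    if az_count < 2:
        return 1.0  # single AZ: there is no failover target to reserve for
    return (az_count - 1) / az_count


if __name__ == "__main__":
    print("AZs  quorum  tolerated failures  max utilization")
    for n in range(1, 6):
        print(f"{n:>3}  {majority_quorum(n):>6}  "
              f"{tolerated_failures(n):>18}  {max_utilization(n):>15.0%}")
```

Under these assumptions, 2 AZs tolerate no voter loss and cap utilization at 50%, 3 AZs tolerate one loss at roughly two thirds utilization, and a fourth AZ raises the utilization ceiling without raising the number of tolerated quorum failures.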
Decision
Minimum 3 availability zones across physically separate government datacenters (ODCs). A single AZ does not provide disaster resilience, which is a hard requirement for government continuity. Two AZs leave quorum-based systems (etcd, Ceph, Gardener) without a majority that survives the loss of an arbitrary site, which forces a choice between an outage and split-brain risk. Three is the minimum number of sites at which quorum-based consensus survives a single-site failure. Each AZ must be independently operational (separate power, cooling, network uplinks). Four or more AZs may be added later, but 3 is the design target.
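The quorum argument behind this decision can be checked with a small sketch. It is a simplified model, not a description of how etcd or Ceph are actually deployed here: the AZ names and member counts are hypothetical, and a site failure is modeled as losing every voting member placed in that AZ.

```python
def survives_any_single_az_failure(members_per_az: dict[str, int]) -> bool:
    """Return True if a strict majority of voting members remains
    after the loss of any one AZ."""
    total = sum(members_per_az.values())
    quorum = total // 2 + 1
    # Losing an AZ removes all members placed there.
    return all(total - lost >= quorum for lost in members_per_az.values())


# Hypothetical placements of a 3-member cluster.
two_az = {"az-1": 2, "az-2": 1}
three_az = {"az-1": 1, "az-2": 1, "az-3": 1}

print("2 AZs:", survives_any_single_az_failure(two_az))
print("3 AZs:", survives_any_single_az_failure(three_az))
```

With two sites, whichever AZ holds two of the three members takes the quorum down with it; with one member per site across three AZs, any single loss still leaves two of three.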
Consequences
- Cross-AZ networking must be low-latency and high-bandwidth (separate ADR)
- Storage replication strategy must span 3 AZs (separate ADR)
- Gardener Seed placement across AZs needs to be defined
- AZs must be sized so that the two surviving AZs can together absorb the load of one failed AZ (roughly 66% maximum utilization per AZ in normal operation, assuming equally sized AZs)
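The capacity consequence could be validated with a check along the following lines. This is a minimal sketch: the capacity and load figures are made-up units for illustration, the function name is hypothetical, and it assumes workloads can be freely rescheduled onto any surviving AZ.

```python
def can_absorb_single_az_failure(capacity: dict[str, float],
                                 load: dict[str, float]) -> bool:
    """Return True if, for every possible single-AZ failure, the
    surviving AZs have enough combined capacity for the total load."""
    total_load = sum(load.values())
    for failed in capacity:
        surviving = sum(c for az, c in capacity.items() if az != failed)
        if surviving < total_load:
            return False
    return True


# Hypothetical figures: three equally sized AZs of 100 units each.
capacity = {"az-1": 100.0, "az-2": 100.0, "az-3": 100.0}

ok_load = {"az-1": 66.0, "az-2": 66.0, "az-3": 66.0}        # 198 <= 200
too_much_load = {"az-1": 70.0, "az-2": 70.0, "az-3": 70.0}  # 210 > 200

print(can_absorb_single_az_failure(capacity, ok_load))
print(can_absorb_single_az_failure(capacity, too_much_load))
```

Checking every failure case separately also covers AZs of unequal size, where the loss of the largest AZ is the binding constraint.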