ArgoCD Agent HA Failover¶
ai disclosure: diagrams and tables were formatted/generated with ai
Overview¶
This proposal adds active/passive High Availability to the argocd-agent principal, enabling cross-region disaster recovery with operator-driven failover.
Design philosophy: All promotion decisions are external (operator CLI or future coordinator). Principals never autonomously decide to go ACTIVE. This eliminates the need for self-fencing, term/epoch systems, and heartbeat protocols — significantly reducing split-brain risk at the cost of requiring operator intervention for failover.
| Decision | Choice | Rationale |
|---|---|---|
| Recovery Target | 1–5 minutes | Acceptable for DR; limited by DNS TTL + operator response |
| Replication | Principal-to-principal streaming | Reuses existing CloudEvent/gRPC patterns |
| Agent connectivity | Single GSLB/DNS endpoint | Transparent to agents, zero agent changes |
| Failover trigger | Operator CLI (ha promote/demote) | No autonomous promotion, so no automatic split-brain (operator error is still possible; see Limitations) |
| Consistency | Safety over availability | Brief outage during partition is acceptable |
Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│                        Global DNS / GSLB                        │
│                  principal.argocd.example.com                   │
│         Health checks: /healthz (200 only when ACTIVE)          │
└──────────────┬──────────────────────────────┬───────────────────┘
               │                              │
               ▼                              ▼
   ┌──────────────────────┐      ┌──────────────────────┐
   │  REGION A (Primary)  │      │  REGION B (Replica)  │
   │                      │      │                      │
   │  Principal Server    │◄─────│  Replication Client  │
   │  - gRPC :8443        │      │  - gRPC :8443        │
   │  - HAAdmin :8405     │─────►│  - Mirrors state     │
   │  - /healthz :8003    │      │  - /healthz :8003    │
   │                      │      │                      │
   │  ArgoCD Instance     │      │  ArgoCD Instance     │
   │  (source of truth)   │      │  (standby, receives  │
   │                      │      │   replicated state)  │
   └──────────────────────┘      └──────────────────────┘
               ▲                               ▲
               │      Replication Stream:      │
               │     All events forwarded      │
               │       Primary → Replica       │
               │                               │
       ┌───────┴───────────────────────────────┴───────┐
       │                Remote Clusters                │
       │     ┌─────────┐  ┌─────────┐  ┌─────────┐     │
       │     │ Agent 1 │  │ Agent 2 │  │ Agent N │     │
       │     └─────────┘  └─────────┘  └─────────┘     │
       └───────────────────────────────────────────────┘
Primary and Replica run in separate Kubernetes clusters. The Replica's cluster starts empty — Applications and AppProjects are populated entirely via replication.
State Machine¶
Five states. Apart from the startup role decision made in RECOVERING, promotion to ACTIVE is operator-triggered only:
RECOVERING ──┬── config=primary & peer not ACTIVE ──→ ACTIVE
             └── config=replica OR peer is ACTIVE ──→ SYNCING → REPLICATING
REPLICATING ──── stream breaks ──→ DISCONNECTED
DISCONNECTED ─── stream reconnects ──→ REPLICATING
Operator-only transitions:
{REPLICATING, DISCONNECTED} ── ha promote ──→ ACTIVE
ACTIVE ── ha demote ──→ REPLICATING
| State | /healthz | Accepts Agents | Description |
|---|---|---|---|
| RECOVERING | 503 | No | Startup, determining role |
| SYNCING | 503 | No | Initial catch-up to primary |
| REPLICATING | 503 | No | Receiving events, in sync |
| DISCONNECTED | 503 | No | Lost replication stream |
| ACTIVE | 200 | Yes | Serving agents |
Only ACTIVE returns healthy. GSLB routes agents exclusively to the active principal.
The promote command checks whether the local principal is still actively replicating from a peer. If so, it refuses (the peer is likely alive) unless --force is passed. This prevents the most common operator error — promoting while the primary is still running.
Replication¶
Model¶
The Replica runs a Replication Client that connects to the Primary's main gRPC server (port 8443) via a bidirectional gRPC stream. The replication service is registered on the same server as the agent API, with per-method auth routing in the interceptors. Unlike regular agents (namespace-scoped), the replication peer receives ALL events across all agents.
What Gets Replicated¶
| Data | Method |
|---|---|
| Applications | Full objects in snapshot + incremental CloudEvents. Written to replica's K8s cluster. |
| AppProjects | Full objects in snapshot + incremental CloudEvents. Written to replica's K8s cluster. |
| Agent connection metadata | Snapshot (agent name, mode, connected state) |
| Resource keys | Snapshot + event-driven |
| Queue state | Queue pairs created on snapshot; events flow as queued |
Protocol¶
Three RPCs defined in principal/apis/replication/replication.proto:
| RPC | Direction | Purpose |
|---|---|---|
| Subscribe | Bidi stream | Replica receives ReplicatedEvents, sends ReplicationAcks |
| GetSnapshot | Unary | Initial full-state sync on connect |
| Status | Unary | Sequence number + lag for monitoring and reconciliation |
Each ReplicatedEvent wraps a CloudEvent with: agent name, direction (inbound/outbound), sequence number, and timestamp. Events are tagged with direction so the replica knows whether to update its local state (inbound) or queue for future agent delivery (outbound).
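As a rough illustration of the direction tagging, the sketch below models a ReplicatedEvent as a plain Go struct and dispatches on its direction. The field names, the `Replica` type, and the in-memory maps are assumptions made for the example; the real message is defined in replication.proto and the real replica writes to its Kubernetes cluster and agent queues.

```go
package main

import (
	"fmt"
	"time"
)

// Direction says whether an event travelled toward or away from the principal.
type Direction int

const (
	Inbound Direction = iota
	Outbound
)

// ReplicatedEvent carries the fields listed in the text. Payload stands in
// for the wrapped CloudEvent (hypothetical representation).
type ReplicatedEvent struct {
	AgentName string
	Direction Direction
	Sequence  uint64
	Timestamp time.Time
	Payload   []byte
}

// Replica holds in-memory stand-ins for local state and per-agent queues.
type Replica struct {
	State   map[string][]byte   // last inbound payload per agent
	Pending map[string][][]byte // outbound payloads awaiting agent delivery
}

// NewReplica returns a Replica with initialized maps.
func NewReplica() *Replica {
	return &Replica{State: map[string][]byte{}, Pending: map[string][][]byte{}}
}

// Apply updates local state for inbound events and queues outbound ones.
func (r *Replica) Apply(ev ReplicatedEvent) error {
	switch ev.Direction {
	case Inbound:
		r.State[ev.AgentName] = ev.Payload
	case Outbound:
		r.Pending[ev.AgentName] = append(r.Pending[ev.AgentName], ev.Payload)
	default:
		return fmt.Errorf("unknown direction %d", ev.Direction)
	}
	return nil
}

func main() {
	r := NewReplica()
	now := time.Now()
	_ = r.Apply(ReplicatedEvent{AgentName: "agent-1", Direction: Inbound, Sequence: 1, Timestamp: now, Payload: []byte("app-status")})
	_ = r.Apply(ReplicatedEvent{AgentName: "agent-1", Direction: Outbound, Sequence: 2, Timestamp: now, Payload: []byte("app-spec")})
	fmt.Println(len(r.State), len(r.Pending["agent-1"]))
}
```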
Sync Flow¶
- Replica connects to primary
- Calls GetSnapshot and receives all agent states with full serialized resources
- Writes Applications/AppProjects to its local K8s cluster (upsert)
- Opens Subscribe stream and receives incremental events
- Sends periodic ACKs (every 5s) with last processed sequence number
- Runs periodic reconciliation (every 1m): compares sequences via Status RPC, re-fetches snapshot if gaps detected
Gap Recovery¶
The forwarder queue (1000 events) drops events on overflow. The client detects sequence gaps and marks itself for reconciliation. On the next reconciliation tick, it re-fetches a full snapshot to catch up. This bounds drift to at most 1 minute.
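The client-side gap detection described above might look roughly like the following. `GapTracker` and its methods are hypothetical names, and the sketch assumes sequence numbers increase by exactly one per event and that the tracker is reset to the snapshot's sequence number after each snapshot fetch.

```go
package main

import "fmt"

// GapTracker remembers the last processed sequence number and flags gaps.
type GapTracker struct {
	last        uint64
	gaps        int
	needsResync bool
}

// ResetTo is called after a snapshot has been applied.
func (g *GapTracker) ResetTo(seq uint64) {
	g.last = seq
	g.needsResync = false
}

// Observe records one incoming event's sequence number.
func (g *GapTracker) Observe(seq uint64) {
	switch {
	case seq <= g.last:
		return // duplicate or stale event; nothing to do
	case seq > g.last+1:
		g.gaps++ // would increment the sequence-gap counter metric
		g.needsResync = true
	}
	g.last = seq
}

// NeedsResync reports whether the next reconciliation tick should
// re-fetch a full snapshot.
func (g *GapTracker) NeedsResync() bool { return g.needsResync }

// Gaps returns the number of gaps seen so far.
func (g *GapTracker) Gaps() int { return g.gaps }

// Last returns the most recent sequence number processed.
func (g *GapTracker) Last() uint64 { return g.last }

func main() {
	g := &GapTracker{}
	g.ResetTo(100) // snapshot taken at sequence 100
	g.Observe(101)
	g.Observe(105) // 102..104 were dropped by the forwarder
	fmt.Println(g.NeedsResync(), g.Gaps(), g.Last())
}
```

A gap only becomes visible when a later event arrives, which is one reason the periodic Status comparison is useful in addition to per-event checks.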
Failover Scenarios¶
Primary Dies¶
T+0s   Primary dies. Replica detects stream break → DISCONNECTED.
T+30s  Operator notified via alert. Both principals unhealthy from GSLB perspective.
T+31s  Operator runs: argocd-agentctl ha promote [--force]
       Replica → ACTIVE. Health → 200.
T+60s  DNS TTL expires. Agents reconnect to Region B via GSLB.
T+90s  Fully operational.
No full resync needed — replica already has all resources written to its K8s cluster.
Clean Switchover (Primary Alive)¶
$ argocd-agentctl ha demote # on Region A — drops agents, stops serving
$ argocd-agentctl ha promote # on Region B — becomes ACTIVE
# Update DNS to point to Region B
Failback¶
T+0   Old primary restarts → RECOVERING → SYNCING (peer is ACTIVE)
      Connects to Region B as replica → REPLICATING
T+2m  Caught up (lag: 0s)
      Operator runs:
      $ argocd-agentctl ha demote # on Region B
      $ argocd-agentctl ha promote # on Region A
Update DNS back to Region A.
Agent Reconnection¶
Because the replica has been continuously replicating and writing resources to its cluster:
- Agent authenticates (same shared CA)
- Replica already has all resources — no full resync
- Quick checksum verification confirms state
- Only in-flight events during failover need delta sync
Security¶
Replication RPCs are served on the main gRPC port (8443) alongside agent traffic. The main server's interceptors route replication methods (/replicationapi.Replication/*) through the server's --auth method and check the extracted identity against --ha-allowed-replication-clients. This reuses the existing mTLS infrastructure — no separate port, listener, or TLS config needed.
The admin gRPC server (ha status, ha promote, ha demote) binds to 127.0.0.1:8405 only and has no TLS. Access requires kubectl port-forward — Kubernetes RBAC is the gate.
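The per-method routing for replication RPCs can be pictured with the small check below. It is a sketch only: the function names are hypothetical, the comma-separated format of --ha-allowed-replication-clients is an assumption, and a real interceptor would take the identity from the server's configured auth method rather than from a string argument.

```go
package main

import (
	"fmt"
	"strings"
)

// replicationPrefix is the gRPC method prefix named in the text.
const replicationPrefix = "/replicationapi.Replication/"

// ParseAllowlist turns a flag value such as "region-b,region-c" into a set.
// The comma-separated format is an assumption.
func ParseAllowlist(flagValue string) map[string]bool {
	allowed := make(map[string]bool)
	for _, id := range strings.Split(flagValue, ",") {
		if id = strings.TrimSpace(id); id != "" {
			allowed[id] = true
		}
	}
	return allowed
}

// AuthorizeReplication returns nil for non-replication methods (regular
// agent authorization applies to those) and checks the allowlist otherwise.
func AuthorizeReplication(fullMethod, identity string, allowed map[string]bool) error {
	if !strings.HasPrefix(fullMethod, replicationPrefix) {
		return nil
	}
	if !allowed[identity] {
		return fmt.Errorf("identity %q is not an allowed replication client", identity)
	}
	return nil
}

func main() {
	allowed := ParseAllowlist("region-b")
	fmt.Println(AuthorizeReplication("/replicationapi.Replication/Subscribe", "region-b", allowed))
	fmt.Println(AuthorizeReplication("/replicationapi.Replication/Subscribe", "agent-1", allowed))
}
```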
Configuration¶
Flags for Region A (preferred primary):
--ha-enabled
--ha-preferred-role=primary
--ha-peer-address=principal.region-b.internal:8443
# Peer identity allowlist (replication uses the server's --auth method)
--ha-allowed-replication-clients=region-b
Flags for Region B (preferred replica) are symmetric — swap peer address and allowed client identity.
All flags have ARGOCD_PRINCIPAL_HA_* environment variable equivalents.
Agent configuration is unchanged — agents connect to a single GSLB/DNS endpoint:
--address=principal.argocd.example.com:8443
Ports¶
| Port | Bind | TLS | Purpose |
|---|---|---|---|
| 8443 | 0.0.0.0 | mTLS | Agent gRPC + replication (shared) |
| 8405 | 127.0.0.1 | None | HAAdmin gRPC (status/promote/demote) |
Override admin port with --ha-admin-port.
CLI¶
The argocd-agentctl ha subcommand manages HA state. It auto port-forwards to the principal pod's admin port (8405) via --principal-context, or accepts --address for a direct connection.
| Command | Description |
|---|---|
| ha status | Show state, peer status, replication lag, sequence numbers |
| ha promote | Transition to ACTIVE. Refuses while the local replication stream is still active (peer likely alive) unless --force. |
| ha demote | Transition ACTIVE → REPLICATING. Disconnects all agents. |
GSLB / DNS Setup¶
Any GSLB or DNS provider that supports health checks works. Requirements:
| Requirement | Detail |
|---|---|
| Health check | Poll /healthz on each principal (port 8003) |
| Failover routing | Route to healthy endpoint |
| DNS TTL | Recommend 60s |
| Single endpoint | Agents resolve one DNS name |
DNS is operator-managed. The principal's health endpoint reflects HA state — only ACTIVE returns 200.
For environments that only have simple DNS (no GSLB health checks), the operator manually updates the DNS A record as part of the failover procedure.
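The health contract that the GSLB relies on (200 only when ACTIVE, 503 otherwise) could be served by a handler along these lines. The `haState` type and `probe` helper are hypothetical, and the response bodies are placeholders; only the status codes come from the proposal.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// haState is a hypothetical holder for whether this principal is ACTIVE.
type haState struct {
	active atomic.Bool
}

// ServeHTTP answers 200 only when the principal is ACTIVE, 503 otherwise.
func (h *haState) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if h.active.Load() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "active")
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
	fmt.Fprintln(w, "not active")
}

// probe issues one in-memory GET /healthz and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	h := &haState{}
	fmt.Println(probe(h)) // 503 while in any replica state
	h.active.Store(true)  // e.g. after "ha promote"
	fmt.Println(probe(h)) // 200 once ACTIVE
}
```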
Observability¶
Prometheus metrics exposed for monitoring:
| Metric | Type | Description |
|---|---|---|
| argocd_agent_ha_state | Gauge | Current HA state (labeled) |
| argocd_agent_ha_state_transitions_total | Counter | State transition count |
| argocd_agent_ha_failovers_total | Counter | Failover events |
| argocd_agent_replication_forwarder_events_total | Counter | Events forwarded |
| argocd_agent_replication_forwarder_events_dropped_total | Counter | Events dropped (queue full) |
| argocd_agent_replication_forwarder_queue_depth | Gauge | Pending events in queue |
| argocd_agent_replication_forwarder_replicas_connected | Gauge | Connected replicas |
| argocd_agent_replication_client_events_total | Counter | Events received by client |
| argocd_agent_replication_client_lag_seconds | Gauge | Replication lag |
| argocd_agent_replication_client_sequence_gaps_total | Counter | Sequence gaps detected |
| argocd_agent_replication_client_reconciliations_total | Counter | Snapshot re-fetches from gap recovery |
Recommended alerts:
- ReplicationLagHigh: argocd_agent_replication_client_lag_seconds > 5 for 1m
- ReplicaDisconnected: argocd_agent_replication_forwarder_replicas_connected == 0 for 30s
- QueueNearCapacity: argocd_agent_replication_forwarder_queue_depth > 900 for 1m
Design Decisions¶
| Decision | Choice | Rationale |
|---|---|---|
| Failover mode | Manual only (operator promote/demote) | No autonomous promotion; operator controls RTO vs safety |
| Split-brain prevention | Promote refuses if replication stream active | Common error caught; --force for emergencies (see Limitations) |
| Peer health detection | Replication stream state | No separate heartbeat/ACK protocol needed |
| State count | 5 | RECOVERING, SYNCING, REPLICATING, DISCONNECTED, ACTIVE |
| Resource replication | Full objects written to replica K8s cluster | Separate clusters, no shared state |
| Replication backpressure | Drop + metric + reconcile | Simple for v1; bounded 1-min drift via reconciliation |
| Secrets | Operator configures manually | Avoid replicating sensitive data |
| Agent changes | None | GSLB/DNS transparent failover |
| DNS integration | None (operator-managed) | Works with any provider |
Limitations and Edge Cases¶
Split-brain is possible with operator error. The promote safety check only verifies that the local replication stream is broken (DISCONNECTED state). It does not RPC to the peer to confirm the peer is down. If two operators in separate regions independently promote during a network partition, both principals go ACTIVE. Mitigation: confirm the peer is down before promoting, and use --force only when the peer is confirmed dead. Future work: add an external coordinator (see below) for stronger guarantees, or add a peer Status RPC check before promotion.
RPO depends on replication lag at time of failure. Events processed by the primary but not yet delivered to the replica are lost on failover. RPO = time since last replica ACK + any events in the forwarder queue. Under normal operation this is sub-second; under burst load with queue overflow, it can be up to 60s (bounded by reconciliation interval).
Full snapshot on gap recovery is O(N). When sequence gaps are detected, the client re-fetches a complete snapshot of all agents and resources. At scale (thousands of Applications), this can be expensive. Future work: incremental catch-up using since_sequence_num to fetch only missing events.
Forwarder queue has no backpressure. The 1000-event queue drops events on overflow with only a metric increment. Sustained burst traffic can cause repeated gaps and reconciliation storms. Future work: configurable queue size, backpressure signaling, or ring buffer with eviction.
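The drop-on-overflow behaviour can be illustrated with a buffered channel and a non-blocking send. This is an assumed shape, not the actual forwarder: the `Forwarder` and `Event` types are hypothetical, and the counter stands in for the events-dropped metric.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Event is a placeholder for a replicated event.
type Event struct{ Sequence uint64 }

// Forwarder buffers events for the replica in a fixed-size queue.
type Forwarder struct {
	queue   chan Event
	dropped atomic.Uint64
}

// NewForwarder creates a forwarder with the given queue capacity.
func NewForwarder(capacity int) *Forwarder {
	return &Forwarder{queue: make(chan Event, capacity)}
}

// Enqueue never blocks: when the queue is full the event is dropped and
// only a counter is incremented, so the producer sees no backpressure.
func (f *Forwarder) Enqueue(ev Event) bool {
	select {
	case f.queue <- ev:
		return true
	default:
		f.dropped.Add(1)
		return false
	}
}

// Depth returns the number of events currently waiting in the queue.
func (f *Forwarder) Depth() int { return len(f.queue) }

// Dropped returns how many events were discarded because the queue was full.
func (f *Forwarder) Dropped() uint64 { return f.dropped.Load() }

func main() {
	f := NewForwarder(1000)
	for seq := uint64(1); seq <= 1005; seq++ {
		f.Enqueue(Event{Sequence: seq})
	}
	fmt.Println(f.Depth(), f.Dropped()) // 1000 queued, 5 dropped
}
```

Every dropped event later shows up on the replica as a sequence gap, which is what ties this queue to the reconciliation cost described above.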
Demote→promote sequence has a brief window. During clean switchover, after demoting the primary and before promoting the replica, both principals are unhealthy. Agents cannot connect during this window. The window can be very short when the two commands are scripted back to back, but in practice it depends on operator speed. Future work: atomic switchover command.
Both-die scenario requires manual intervention. If both principals die and restart simultaneously, both enter RECOVERING. The one configured as preferredRole: primary goes ACTIVE; the other goes SYNCING. If configuration is identical or missing, behavior is undefined. Ensure preferredRole is always set.
Future Work¶
- External coordinator interface: Pluggable Coordinator interface that the principal polls to determine whether it should be ACTIVE. This removes operator error as a split-brain vector, since the coordinator is the single source of truth for which principal serves traffic.
// Coordinator reports whether this principal should currently be ACTIVE.
type Coordinator interface {
	ShouldBeActive(ctx context.Context) (bool, error)
}
The principal polls periodically and transitions accordingly: if the coordinator says "active" and the principal isn't, it promotes; if it says "not active" and the principal is, it demotes and disconnects agents.
Candidate implementations:
| Backend | How It Works |
|---|---|
| AWS Route53 ARC | Routing controls provide explicit on/off switches per region with safety rules preventing both from being ON simultaneously. Failover = flip the routing control via console, CLI, or ARC's health-check automation. |
| Consul | Distributed lock / session-based leader election. Principal holds a Consul session; losing the session triggers demotion. |
| Kubernetes Lease | Lease object in a shared control plane (multi-AZ). Principal that holds the lease is ACTIVE. Works for single-cluster or multi-AZ setups, not cross-region. |
| HashiCorp Vault | Vault's HA backend (Consul/Raft) for distributed lock. |
| etcd | Direct etcd lease for environments already running etcd. |
Configuration would look like:
principal:
  ha:
    enabled: true
    mode: coordinator # "manual" (default) or "coordinator"
    coordinator:
      type: aws-arc # or: consul, k8s-lease
      pollInterval: 10s
      aws:
        routingControlArn: arn:aws:route53-recovery-control::123456789012:...
With a coordinator, failover becomes: flip the external control (via cloud console, CLI, or the coordinator's own health-check automation). The principal sees the change on next poll and transitions automatically. Manual ha promote/demote commands remain available as an override.
- Peer status RPC in promote: Before allowing promotion, call peer's Status RPC to confirm it is not ACTIVE. Strengthens the safety check beyond local state.
- Incremental gap recovery: Use since_sequence_num in snapshot requests to fetch only missing events instead of full state.
- Multi-replica: Multiple replicas for additional redundancy.
- Automatic failback: Auto-failback when preferred primary is synced and healthy for N minutes.
- CLI-integrated DNS: ha failover optionally updates DNS records directly.