HA Failover Operations¶
This page covers operating HA principals day-to-day: checking status, performing failovers, and monitoring replication health. See HA concepts and HA configuration for setup.
CLI Reference¶
The argocd-agentctl ha commands manage HA state. By default they auto-detect the principal pod via --principal-context and port-forward to the admin port (8405). Use --address for a direct connection.
ha status¶
Show current HA state, peer reachability, and replication info.
argocd-agentctl ha status
argocd-agentctl ha status -o json
argocd-agentctl ha status -o yaml
Example output:
HA Status
---------
State: active
Preferred Role: primary
Peer Address: principal.region-b.internal:8443
Connected Replicas: 1
Connected Agents: 12
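For scripting, a single field can be pulled out of the human-readable output. The helper below is a minimal sketch that assumes the "Label: value" layout shown in the example output; `ha_field` is a hypothetical name, not part of the CLI.

```shell
# ha_field <label>
# Reads `ha status` text output on stdin and prints the value for one label,
# e.g. "State" or "Connected Agents". Assumes one "Label: value" pair per line.
ha_field() {
  awk -v key="$1" '
    { sub(/^[ \t]+/, "") }                       # tolerate indented output
    index($0, key ": ") == 1 {                   # line starts with "<label>: "
      print substr($0, length(key) + 3)          # everything after ": "
      exit
    }'
}
```

A typical call would be `argocd-agentctl ha status | ha_field "State"`. Matching on the full label keeps values that contain colons, such as the peer address, intact.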
ha promote¶
Transition to ACTIVE. Refuses if the replication stream is still connected (the peer is likely alive) unless --force is passed.
argocd-agentctl ha promote
argocd-agentctl ha promote --force # skip safety check
ha demote¶
Transition from ACTIVE to REPLICATING. This disconnects all connected agents.
argocd-agentctl ha demote
argocd-agentctl ha demote --force # skip confirmation prompt
Procedures¶
Unplanned Failover (Primary Dies)¶
T+0s   Primary dies. Replica detects stream break -> DISCONNECTED.
T+30s  Operator notified via alert.
T+31s  Operator runs: argocd-agentctl ha promote --force
       Replica -> ACTIVE. /healthz -> 200.
T+60s  DNS TTL expires. Agents reconnect to Region B.
T+90s  Fully operational.
No full resync is needed: the replica already has all resources written to its K8s cluster.
Steps:
- Confirm the primary is actually down (check pod, node, network)
- Check replica status: `argocd-agentctl ha status`
- Promote: `argocd-agentctl ha promote --force`
- Verify: `argocd-agentctl ha status` shows `active`
- If using manual DNS, update the A record to point to Region B
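The promote-and-verify part of these steps can be wrapped in a small script. This is a sketch only: `failover_promote` and the `AGENTCTL` override are hypothetical names, and the state is read from the `State:` line of the text output shown earlier. Confirming that the primary is really down remains a manual step.

```shell
# CLI to call; overridable so the function can be exercised against a stub.
AGENTCTL="${AGENTCTL:-argocd-agentctl}"

# Print the value of the "State:" line from `ha status`.
ha_state() {
  $AGENTCTL ha status |
    awk '{ sub(/^[ \t]+/, "") } index($0, "State: ") == 1 { print substr($0, 8); exit }'
}

# Promote the replica, then confirm that it reports "active".
failover_promote() {
  before="$(ha_state)"
  if [ "$before" = "active" ]; then
    echo "already active, nothing to do"
    return 0
  fi
  echo "state before promote: $before"
  $AGENTCTL ha promote --force || return 1
  after="$(ha_state)"
  if [ "$after" != "active" ]; then
    echo "promote ran but state is '$after'" >&2
    return 1
  fi
  echo "promoted: state is active"
}
```

Run `failover_promote` against the replica. A non-zero exit means the promotion did not take effect and DNS should not be updated yet.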
Clean Switchover (Primary Alive)¶
Use this procedure when both principals are healthy and you want to move the active role to the other one.
Steps:
- Check both principals: `argocd-agentctl ha status` on each
- Demote the current primary: `argocd-agentctl ha demote`
- Promote the replica: `argocd-agentctl ha promote`
- Update DNS if not using GSLB with health checks
- Verify agents reconnect: `argocd-agentctl ha status` shows agents connecting
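The ordering matters: demoting first means two principals are never ACTIVE at once. The sketch below encodes that order. It assumes `--principal-context` accepts a kube context name for each region. `switchover` and `AGENTCTL` are hypothetical names.

```shell
# CLI to call; overridable so the function can be exercised against a stub.
AGENTCTL="${AGENTCTL:-argocd-agentctl}"

# switchover <current-primary-context> <replica-context>
switchover() {
  primary_ctx="$1"
  replica_ctx="$2"

  # --force skips the confirmation prompt so the script can run unattended.
  $AGENTCTL ha demote --force --principal-context "$primary_ctx" || {
    echo "demote failed; not promoting the replica" >&2
    return 1
  }

  # No --force here: keep the replication-stream safety check in place.
  $AGENTCTL ha promote --principal-context "$replica_ctx" || {
    echo "promote failed; no principal is active, investigate now" >&2
    return 1
  }

  $AGENTCTL ha status --principal-context "$replica_ctx"
}
```

If the demote fails, the function stops before promoting, so the original primary remains the only active principal.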
Failback¶
Restore the original primary after an unplanned failover.
- Old primary restarts -> RECOVERING -> SYNCING -> REPLICATING
- Wait for replication to catch up: `argocd-agentctl ha status` (run on the old primary; look for "replicating")
- Once caught up, perform a clean switchover (demote Region B, promote Region A)
- Update DNS back to Region A
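Waiting for the old primary to reach the replicating state can be automated with a polling loop. This is a sketch under assumptions: `wait_for_state` and `AGENTCTL` are hypothetical names, the default attempt count and delay are arbitrary, and the state is read from the `State:` line of the text output.

```shell
# CLI to call; overridable so the function can be exercised against a stub.
AGENTCTL="${AGENTCTL:-argocd-agentctl}"

# wait_for_state <state> [attempts] [sleep-seconds]
# Polls `ha status` until the "State:" line equals <state>.
wait_for_state() {
  want="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  state=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$($AGENTCTL ha status |
      awk '{ sub(/^[ \t]+/, "") } index($0, "State: ") == 1 { print substr($0, 8); exit }')"
    if [ "$state" = "$want" ]; then
      echo "state is $want"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $want (last state: $state)" >&2
  return 1
}
```

For example, `wait_for_state replicating` on the old primary returns once it is safe to start the clean switchover. Reaching the replicating state does not by itself prove the lag is low, so check the replication lag metric as well.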
Monitoring¶
Metrics¶
Replication metrics are exposed at the principal's metrics endpoint (default port 8000).
Forwarder (active principal):
| Metric | Type | Description |
|---|---|---|
| `argocd_agent_replication_events_queued_total` | Counter | Events queued for replication |
| `argocd_agent_replication_events_dropped_total` | Counter | Events dropped (queue full) |
| `argocd_agent_replication_events_forwarded_total` | Counter | Events sent to replicas |
| `argocd_agent_replication_replicas_connected` | Gauge | Connected replica count |
| `argocd_agent_replication_queue_size` | Gauge | Pending events in queue |
| `argocd_agent_replication_forwarding_errors_total` | Counter | Send errors |
| `argocd_agent_replication_last_sequence_number` | Gauge | Latest sequence number |
Client (replica principal):
| Metric | Type | Description |
|---|---|---|
| `argocd_agent_replication_client_events_received_total` | Counter | Events received from primary |
| `argocd_agent_replication_client_lag_seconds` | Gauge | Replication lag |
| `argocd_agent_replication_client_sequence_gaps_total` | Counter | Sequence gaps detected |
| `argocd_agent_replication_client_reconciliations_total` | Counter | Full snapshot re-fetches |
HA state:
| Metric | Type | Description |
|---|---|---|
| `argocd_agent_ha_state` | Gauge | Current HA state (labeled) |
| `argocd_agent_ha_transitions_total` | Counter | State transition count |
| `argocd_agent_ha_failovers_total` | Counter | Promotion events |
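For a quick spot check without a Prometheus server, a metric can be read straight from the endpoint. The helper below is a sketch that parses the Prometheus text exposition format. `metric_value` is a hypothetical name, and the `/metrics` path in the usage note is an assumption based on the usual Prometheus convention.

```shell
# metric_value <metric-name>
# Reads Prometheus text exposition format on stdin and prints the value of
# every sample whose metric name matches exactly. Assumes samples carry no
# trailing timestamp, which is the usual case for a scraped endpoint.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                                  # skip HELP/TYPE comments
    NF >= 2 {
      sample = $1
      brace = index(sample, "{")                   # strip the label set, if any
      if (brace > 0) sample = substr(sample, 1, brace - 1)
      if (sample == name) print $NF
    }'
}
```

With the principal's metrics port reachable locally, `curl -s http://localhost:8000/metrics | metric_value argocd_agent_replication_queue_size` prints the current queue depth.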
Recommended Alerts¶
| Alert | PromQL | For |
|---|---|---|
| ReplicationLagHigh | `argocd_agent_replication_client_lag_seconds > 5` | 1m |
| ReplicaDisconnected | `argocd_agent_replication_replicas_connected == 0` | 30s |
| QueueNearCapacity | `argocd_agent_replication_queue_size > 900` | 1m |
| BothPrincipalsActive | `count(argocd_agent_ha_state{state="active"}) > 1` | 0s |
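The table maps directly onto a Prometheus rule file. The function below is a sketch that prints such a file. The expressions and durations come from the table, while the group name, the `severity` labels, and the function name `ha_alert_rules` are illustrative assumptions.

```shell
# Print a Prometheus rule group for the alerts listed in the table.
ha_alert_rules() {
  cat <<'EOF'
groups:
  - name: argocd-agent-ha
    rules:
      - alert: ReplicationLagHigh
        expr: argocd_agent_replication_client_lag_seconds > 5
        for: 1m
        labels:
          severity: warning
      - alert: ReplicaDisconnected
        expr: argocd_agent_replication_replicas_connected == 0
        for: 30s
        labels:
          severity: warning
      - alert: QueueNearCapacity
        expr: argocd_agent_replication_queue_size > 900
        for: 1m
        labels:
          severity: warning
      - alert: BothPrincipalsActive
        expr: count(argocd_agent_ha_state{state="active"}) > 1
        for: 0s
        labels:
          severity: critical
EOF
}
```

Writing the output to a file (`ha_alert_rules > ha-alerts.yaml`) and validating it with `promtool check rules ha-alerts.yaml` catches syntax errors before the rules are loaded. BothPrincipalsActive only works if one Prometheus instance scrapes both principals.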
Troubleshooting¶
Replica stuck in DISCONNECTED¶
Check network connectivity between regions:
# From the replica pod
kubectl exec -it <replica-pod> -- curl -sS http://<primary-addr>:8003/healthz
Check that the server's `--auth` method is configured correctly: the replica needs a client certificate that the primary trusts, and the identity extracted from that certificate must be listed in `--ha-allowed-replication-clients`.
Promote refuses (replication stream active)¶
The safety check prevents promoting while still receiving events from a peer. If you're certain the primary is dead but the stream hasn't timed out:
argocd-agentctl ha promote --force
Agents not reconnecting after failover¶
- Check DNS TTL: agents won't resolve the new address until the TTL expires (default recommendation: 60s)
- Verify `/healthz` returns 200 on the newly promoted principal
- Check that the GSLB health check configuration points to port 8003
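The first two checks can be scripted with `dig` and `curl`. This is a sketch: the function names and the `DIG`/`CURL` overrides are hypothetical, and plain HTTP is assumed for the health port, matching the `curl` example in the DISCONNECTED section.

```shell
# Tools to call; overridable so the functions can be exercised against stubs.
DIG="${DIG:-dig}"
CURL="${CURL:-curl}"

# dns_ttl <hostname>
# Prints the remaining TTL (seconds) of the first A record in the answer.
# dig answer lines have the form: name TTL class type data
dns_ttl() {
  $DIG +noall +answer "$1" A | awk '$4 == "A" { print $2; exit }'
}

# healthz_code <host:port>
# Prints the HTTP status code returned by the principal's /healthz endpoint.
healthz_code() {
  $CURL -s -o /dev/null -w '%{http_code}' "http://$1/healthz"
}
```

A TTL from `dns_ttl` well above 60 explains slow reconnects, and anything other than `200` from `healthz_code <new-principal>:8003` means the GSLB will not route agents to the promoted principal.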
Sequence gaps and frequent reconciliation¶
An incrementing `argocd_agent_replication_client_sequence_gaps_total` means the replica detected missing sequence numbers, which happens when events are dropped from the forwarder queue. This is normal during burst traffic and self-corrects on the next reconciliation cycle (default 1 minute). Sustained gaps indicate the replica can't keep up; check network latency and resource utilization.
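To tell a burst from a sustained problem, compare several successive readings of the gap counter. The helper below is a sketch. `gap_trend` is a hypothetical name, and treating "grew in every interval" as sustained is an assumption rather than a documented threshold.

```shell
# gap_trend
# Reads successive counter samples on stdin (one number per line, oldest
# first) and prints:
#   none       the counter never increased
#   burst      it increased in some intervals but not all
#   sustained  it increased in every interval
gap_trend() {
  awk '
    NR > 1 { intervals++; if ($1 > prev) grew++ }
    { prev = $1 }
    END {
      if (intervals == 0 || grew == 0) print "none"
      else if (grew == intervals)      print "sustained"
      else                             print "burst"
    }'
}
```

Sampling once per reconciliation cycle works well, for example one reading per minute over ten minutes. A counter reset after a principal restart shows up as a decrease and is counted as "no increase" here.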