HA Failover Operations¶

This page covers operating HA principals day-to-day: checking status, performing failovers, and monitoring replication health. See HA concepts and HA configuration for setup.

CLI Reference¶

The argocd-agentctl ha commands manage HA state. By default they auto-detect the principal pod via --principal-context and port-forward to the admin port (8405). Use --address for a direct connection.

ha status¶

Show current HA state, peer reachability, and replication info.

argocd-agentctl ha status
argocd-agentctl ha status -o json
argocd-agentctl ha status -o yaml

Example output:

HA Status
---------
State:             active
Preferred Role:    primary
Peer Address:      principal.region-b.internal:8443
Connected Replicas: 1
Connected Agents:  12

ha promote¶

Transition to ACTIVE. Refuses if the replication stream is still connected (the peer is likely alive) unless --force is passed.

argocd-agentctl ha promote
argocd-agentctl ha promote --force   # skip safety check

ha demote¶

Transition ACTIVE to REPLICATING. Disconnects all connected agents.

argocd-agentctl ha demote
argocd-agentctl ha demote --force    # skip confirmation prompt

Procedures¶

Unplanned Failover (Primary Dies)¶

T+0s   Primary dies. Replica detects stream break -> DISCONNECTED.
T+30s  Operator notified via alert.
T+31s  Operator runs:
         argocd-agentctl ha promote --force
       Replica -> ACTIVE. /healthz -> 200.
T+60s  DNS TTL expires. Agents reconnect to Region B.
T+90s  Fully operational.

No full resync — the replica already has all resources written to its K8s cluster.

Steps:

Confirm the primary is actually down (check pod, node, network)
Check replica status: argocd-agentctl ha status
Promote: argocd-agentctl ha promote --force
Verify: argocd-agentctl ha status shows active
If using manual DNS, update the A record to point to Region B

Clean Switchover (Primary Alive)¶

When both principals are healthy and you want to switch the active role.

Steps:

Check both principals: argocd-agentctl ha status on each
Demote the current primary:
```
argocd-agentctl ha demote
```
Promote the replica:
```
argocd-agentctl ha promote
```
Update DNS if not using GSLB with health checks
Verify agents reconnect: argocd-agentctl ha status shows agents connecting

Failback¶

Restore the original primary after an unplanned failover.

Old primary restarts -> RECOVERING -> SYNCING -> REPLICATING

Wait for replication to catch up:

argocd-agentctl ha status   # on old primary, look for "replicating"

Once caught up, perform a clean switchover (demote Region B, promote Region A)
Update DNS back to Region A

Monitoring¶

Metrics¶

Replication metrics are exposed at the principal's metrics endpoint (default port 8000).

Forwarder (active principal):

Metric	Type	Description
`argocd_agent_replication_events_queued_total`	Counter	Events queued for replication
`argocd_agent_replication_events_dropped_total`	Counter	Events dropped (queue full)
`argocd_agent_replication_events_forwarded_total`	Counter	Events sent to replicas
`argocd_agent_replication_replicas_connected`	Gauge	Connected replica count
`argocd_agent_replication_queue_size`	Gauge	Pending events in queue
`argocd_agent_replication_forwarding_errors_total`	Counter	Send errors
`argocd_agent_replication_last_sequence_number`	Gauge	Latest sequence number

Client (replica principal):

Metric	Type	Description
`argocd_agent_replication_client_events_received_total`	Counter	Events received from primary
`argocd_agent_replication_client_lag_seconds`	Gauge	Replication lag
`argocd_agent_replication_client_sequence_gaps_total`	Counter	Sequence gaps detected
`argocd_agent_replication_client_reconciliations_total`	Counter	Full snapshot re-fetches

HA state:

Metric	Type	Description
`argocd_agent_ha_state`	Gauge	Current HA state (labeled)
`argocd_agent_ha_transitions_total`	Counter	State transition count
`argocd_agent_ha_failovers_total`	Counter	Promotion events

Recommended Alerts¶

Alert	PromQL	For
ReplicationLagHigh	`argocd_agent_replication_client_lag_seconds > 5`	1m
ReplicaDisconnected	`argocd_agent_replication_replicas_connected == 0`	30s
QueueNearCapacity	`argocd_agent_replication_queue_size > 900`	1m
BothPrincipalsActive	`count(argocd_agent_ha_state{state="active"}) > 1`	0s

Troubleshooting¶

Replica stuck in DISCONNECTED¶

Check network connectivity between regions:

# From the replica pod
kubectl exec -it <replica-pod> -- curl -sS http://<primary-addr>:8003/healthz

Check that the server's --auth method is configured correctly — the replica needs a client certificate the primary trusts, and the extracted identity must be in --ha-allowed-replication-clients.

Promote refuses (replication stream active)¶

The safety check prevents promoting while still receiving events from a peer. If you're certain the primary is dead but the stream hasn't timed out:

argocd-agentctl ha promote --force

Agents not reconnecting after failover¶

Check DNS TTL — agents won't resolve the new address until TTL expires (default recommendation: 60s)
Verify /healthz returns 200 on the newly promoted principal
Check GSLB health check configuration points to port 8003

Sequence gaps and frequent reconciliation¶

The sequence_gaps_total metric incrementing means events are being dropped in the forwarder queue. This is normal during burst traffic and self-corrects on the next reconciliation cycle (default 1 minute). Sustained gaps indicate the replica can't keep up. Check network latency and resource utilization.