Synchronization Protocol¶
This document provides a detailed technical explanation of the synchronization protocol used by argocd-agent to maintain consistency between the principal (control plane) and agents (workload clusters). The target audience is engineers and architects who need to understand the underlying mechanisms.
Overview¶
The argocd-agent synchronization protocol is built on a bidirectional streaming communication model that enables reliable data synchronization between a central principal and distributed agents. The protocol handles:
- Configuration distribution (managed mode) or status aggregation (autonomous mode)
- Failure recovery through resync mechanisms
- Conflict detection using checksums
- Event-driven updates with guaranteed ordering
Architecture¶
Communication Model¶
The protocol uses a hub-and-spoke architecture where:
- Principal: Runs on the control plane, acts as the central coordination point
- Agents: Run on workload clusters, initiate all connections to the principal
- Connections: Always initiated by agents (never principal-to-agent)
- Streams: Bidirectional gRPC streams over HTTP/2 (with HTTP/1.1 websocket fallback)
┌─────────────────┐    gRPC/HTTP2     ┌─────────────────┐
│     Agent 1     │──────────────────►│    Principal    │
│   (Workload)    │                   │ (Control Plane) │
└─────────────────┘                   └─────────────────┘
                                               ▲
┌─────────────────┐                            │
│     Agent 2     │────────────────────────────┘
│   (Workload)    │
└─────────────────┘
Protocol Stack¶
┌─────────────────────────────────────┐
│ Application Events                  │ ← Create, Delete, SpecUpdate, Status, etc.
├─────────────────────────────────────┤
│ CloudEvents Format                  │ ← Event envelope with metadata
├─────────────────────────────────────┤
│ gRPC Streaming                      │ ← Bidirectional streams
├─────────────────────────────────────┤
│ HTTP/2 (or HTTP/1+WS)               │ ← Transport layer
├─────────────────────────────────────┤
│ TLS + mTLS                          │ ← Security layer
└─────────────────────────────────────┘
Communication Protocol¶
gRPC Service Definition¶
The core communication is defined by the EventStream service:
service EventStream {
    rpc Subscribe(stream Event) returns (stream Event);
    rpc Push(stream Event) returns (PushSummary);
    rpc Ping(PingRequest) returns (PongReply);
}
Connection Lifecycle¶
- Authentication: Agent presents a JWT and, optionally, a client certificate
- Authorization: Principal validates agent identity and creates queue pair
- Stream Establishment: Bidirectional gRPC stream created
- Resync: Initial synchronization based on agent mode
- Event Processing: Continuous bidirectional event exchange
- Graceful Shutdown: Connection cleanup and queue removal
Event Format¶
All synchronization data is exchanged using CloudEvents format:
{
  "specversion": "1.0",
  "source": "agent-name" | "principal",
  "type": "io.argoproj.argocd-agent.event.*",
  "dataschema": "application" | "appproject" | "resource" | "resourceResync",
  "extensions": {
    "resourceid": "uuid",
    "eventid": "uuid"
  },
  "data": { /* event-specific payload */ }
}
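To make the envelope concrete, here is a minimal Go sketch that builds and serializes one event using only the standard library. The `Envelope` type, the `newEnvelope` and `encode` helpers, the placeholder IDs, and the concrete type suffix `spec-update` are illustrative assumptions; the queue definitions in this document reference `cloudevents.Event`, so the project itself presumably relies on a CloudEvents library rather than a hand-written struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the fields shown above. It is an illustrative stand-in,
// not the project's actual event type.
type Envelope struct {
	SpecVersion string            `json:"specversion"`
	Source      string            `json:"source"`
	Type        string            `json:"type"`
	DataSchema  string            `json:"dataschema"`
	Extensions  map[string]string `json:"extensions"`
	Data        json.RawMessage   `json:"data"`
}

// newEnvelope is a hypothetical helper that fills in the fixed fields and
// the two extension attributes.
func newEnvelope(source, eventType, schema, resourceID, eventID string, data []byte) Envelope {
	return Envelope{
		SpecVersion: "1.0",
		Source:      source,
		Type:        eventType,
		DataSchema:  schema,
		Extensions: map[string]string{
			"resourceid": resourceID,
			"eventid":    eventID,
		},
		Data: json.RawMessage(data),
	}
}

// encode serializes the envelope to a JSON string.
func encode(e Envelope) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Assumption: the event name is appended to the type prefix.
	e := newEnvelope("principal", "io.argoproj.argocd-agent.event.spec-update",
		"application", "resource-id-placeholder", "event-id-placeholder",
		[]byte(`{"name":"guestbook"}`))
	out, err := encode(e)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```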
Event Types and Flow¶
Core Event Types¶
The protocol defines several event types for different synchronization scenarios:
Resource Management Events¶
- create: Create new resource (managed mode only)
- delete: Remove resource (managed mode only)
- spec-update: Update resource specification
- status-update: Update resource status
Synchronization Events¶
- request-synced-resource-list: Request list of managed resources
- response-synced-resource: Response with resource metadata
- request-update: Request latest version of specific resource
- request-resource-resync: Trigger full resync process
Control Events¶
- ping/pong: Keepalive mechanism
- processed: Event acknowledgment
Event Flow Patterns¶
Managed Mode Flow¶
Principal                              Agent
    │                                    │
    │─── create/update/delete ──────────►│  (Configuration)
    │                                    │
    │◄─── status-update ─────────────────│  (Status feedback)
    │                                    │
    │─── request-resource-resync ───────►│  (On principal restart)
    │                                    │
    │◄─── request-synced-resource-list ──│  (Agent sends inventory)
    │                                    │
    │─── response-synced-resource ──────►│  (For each resource)
Autonomous Mode Flow¶
Principal                              Agent
    │                                    │
    │◄─── create/update/delete ──────────│  (Configuration changes)
    │                                    │
    │─── status-update ─────────────────►│  (Status sync)
    │                                    │
    │─── request-synced-resource-list ──►│  (On principal restart)
    │                                    │
    │◄─── response-synced-resource ──────│  (For each resource)
    │                                    │
    │─── request-update ────────────────►│  (Request specific updates)
Synchronization Modes¶
Managed Mode¶
In managed mode, the principal is the source of truth for configuration:
Characteristics:
- Principal creates/updates/deletes resources
- Agent receives configuration and applies it locally
- Agent sends status updates back to principal
- Principal initiates resync after restarts
Event Flow:
- Configuration changes made on principal
- Principal sends create/update/delete events to agent
- Agent applies changes to local Argo CD
- Agent sends status-update events back to principal
- Principal updates UI/API with current status
Namespace Mapping:
- Namespace-based (default): Applications on principal are placed in namespaces named after target agents. Example: Applications in namespace production-cluster sync to agent production-cluster.
- Destination-based: Applications use spec.destination.name on the principal to specify the target agent, allowing multiple namespaces per agent. Example: An application with destination.name: production-cluster syncs to agent production-cluster regardless of its namespace.
See Agent Mapping Modes for detailed configuration.
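The two mapping modes come down to a choice of which field names the target agent. The following Go sketch illustrates that rule only; `MappingMode`, `App`, and `targetAgent` are hypothetical names introduced for this example and are not taken from the argocd-agent code base.

```go
package main

import (
	"errors"
	"fmt"
)

// MappingMode selects how an application is routed to an agent.
// The type and its constants are hypothetical names for this sketch.
type MappingMode int

const (
	NamespaceBased   MappingMode = iota // agent name = application namespace
	DestinationBased                    // agent name = spec.destination.name
)

// App holds the two fields that matter for routing.
type App struct {
	Namespace       string
	DestinationName string // stands in for spec.destination.name
}

// targetAgent returns the agent an application maps to under the given mode.
func targetAgent(mode MappingMode, app App) (string, error) {
	switch mode {
	case NamespaceBased:
		if app.Namespace == "" {
			return "", errors.New("application has no namespace")
		}
		return app.Namespace, nil
	case DestinationBased:
		if app.DestinationName == "" {
			return "", errors.New("application has no spec.destination.name")
		}
		return app.DestinationName, nil
	default:
		return "", errors.New("unknown mapping mode")
	}
}

func main() {
	app := App{Namespace: "team-a", DestinationName: "production-cluster"}
	for _, mode := range []MappingMode{NamespaceBased, DestinationBased} {
		agent, err := targetAgent(mode, app)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("target agent:", agent)
	}
}
```

In destination-based mode the namespace plays no part in routing, which is what allows several namespaces to feed one agent.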
Autonomous Mode¶
In autonomous mode, the agent is the source of truth for configuration:
Characteristics:
- Agent creates/updates/deletes resources locally
- Principal receives configuration updates and mirrors them
- Principal acts as read-only observer with limited control capabilities
- Agent initiates resync after restarts
Event Flow:
- Configuration changes made on agent (via Git, local API, etc.)
- Agent detects changes via informers
- Agent sends create/update/delete events to principal
- Principal mirrors changes for UI/API visibility
- Principal can still trigger sync/refresh operations
Resync Process¶
Purpose¶
The resync process ensures data consistency when either the principal or agent restarts and may have missed events.
Resync Triggers¶
Agent Restart (Managed Mode)¶
- Agent connects and is detected as needing resync
- Principal sends request-resource-resync event
- Agent responds with request-synced-resource-list containing checksum
- Principal compares checksums and sends missing/updated resources
Principal Restart (Autonomous Mode)¶
- Principal detects agent connection after restart
- Principal sends request-synced-resource-list with its checksum
- Agent compares checksums and sends response-synced-resource for each resource
- Principal rebuilds its state from agent responses
Principal Restart (Managed Mode)¶
- Principal detects agent connection after restart
- Principal sends request-resource-resync event
- Agent sends request-synced-resource-list with checksum
- Principal validates and sends any needed updates
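In all three triggers the principal sends the opening event, and the agent's mode decides which event that is and how the agent replies. The sketch below encodes just that lookup in Go. The event names come from the lists above, while `Mode` and `resyncExchange` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is the agent's operating mode (hypothetical type for this sketch).
type Mode string

const (
	Managed    Mode = "managed"
	Autonomous Mode = "autonomous"
)

// resyncExchange returns the event the principal sends to start a resync
// and the event the agent answers with, as described in the triggers above.
func resyncExchange(mode Mode) (principalSends, agentReplies string, err error) {
	switch mode {
	case Managed:
		// The principal is the source of truth: it asks the agent to report
		// its inventory, then pushes whatever is missing or outdated.
		return "request-resource-resync", "request-synced-resource-list", nil
	case Autonomous:
		// The agent is the source of truth: the principal sends its own
		// inventory checksum and the agent answers per resource.
		return "request-synced-resource-list", "response-synced-resource", nil
	default:
		return "", "", errors.New("unknown agent mode")
	}
}

func main() {
	for _, m := range []Mode{Managed, Autonomous} {
		sent, reply, err := resyncExchange(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: principal sends %s, agent replies %s\n", m, sent, reply)
	}
}
```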
Resync State Management¶
The principal maintains resync state to avoid redundant resync operations:
type resyncStatus struct {
    mu     sync.RWMutex
    agents map[string]bool // tracks which agents have been resynced
}
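A structure like this is typically wrapped in small accessor methods so callers never touch the map without holding the lock. The following is a sketch under that assumption; the method names `markResynced` and `isResynced` and the constructor `newResyncStatus` are hypothetical and may not match the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// resyncStatus repeats the struct shown above.
type resyncStatus struct {
	mu     sync.RWMutex
	agents map[string]bool // tracks which agents have been resynced
}

// newResyncStatus (hypothetical) returns an empty tracker.
func newResyncStatus() *resyncStatus {
	return &resyncStatus{agents: make(map[string]bool)}
}

// markResynced (hypothetical) records that a resync ran for the agent.
func (r *resyncStatus) markResynced(agent string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.agents[agent] = true
}

// isResynced (hypothetical) reports whether the agent was already resynced.
// It takes only the read lock, so concurrent lookups do not block each other.
func (r *resyncStatus) isResynced(agent string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.agents[agent]
}

func main() {
	status := newResyncStatus()
	agent := "production-cluster"
	if !status.isResynced(agent) {
		fmt.Println("first connection since start: running resync for", agent)
		status.markResynced(agent)
	}
	if status.isResynced(agent) {
		fmt.Println("reconnect: skipping redundant resync for", agent)
	}
}
```

Because the map lives in memory, it starts empty after a restart of the principal, so the first connection of every agent triggers a resync.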
Checksum-based Sync Detection¶
Resource Identification¶
Resources are uniquely identified using a composite key:
type ResourceKey struct {
    Name      string // Resource name
    Namespace string // Resource namespace
    Kind      string // Resource type (Application, AppProject)
    UID       string // Source UID for tracking
}
Checksum Calculation¶
Checksums are calculated from resource keys to efficiently detect synchronization drift:
func (r *Resources) Checksum() []byte {
    resources := make([]string, 0, len(r.resources))
    for res := range r.resources {
        // Namespace omitted for cross-cluster compatibility
        resources = append(resources, fmt.Sprintf("%s/%s/%s",
            res.Kind, res.Name, res.UID))
    }
    sort.Strings(resources) // Ensure deterministic order
    hash := sha256.Sum256([]byte(strings.Join(resources, "")))
    return hash[:]
}
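The properties that matter here are that the checksum ignores ordering and namespace but changes when the set of resources changes. The standalone Go sketch below reproduces the same calculation over a slice so those properties can be checked directly. The free function `checksum`, its hex string return value, and the sample UIDs are assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// ResourceKey repeats the composite key shown earlier.
type ResourceKey struct {
	Name      string
	Namespace string
	Kind      string
	UID       string
}

// checksum applies the calculation above to a slice of keys and returns
// the digest as a hex string for easy comparison.
func checksum(keys []ResourceKey) string {
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		// Namespace is deliberately left out, as in the original.
		parts = append(parts, fmt.Sprintf("%s/%s/%s", k.Kind, k.Name, k.UID))
	}
	sort.Strings(parts) // same digest regardless of input order
	sum := sha256.Sum256([]byte(strings.Join(parts, "")))
	return hex.EncodeToString(sum[:])
}

func main() {
	principal := []ResourceKey{
		{Name: "guestbook", Namespace: "production-cluster", Kind: "Application", UID: "uid-1"},
		{Name: "default", Namespace: "production-cluster", Kind: "AppProject", UID: "uid-2"},
	}
	// Same resources, different order and namespace: checksums match.
	agent := []ResourceKey{
		{Name: "default", Namespace: "argocd", Kind: "AppProject", UID: "uid-2"},
		{Name: "guestbook", Namespace: "argocd", Kind: "Application", UID: "uid-1"},
	}
	fmt.Println("in sync:", checksum(principal) == checksum(agent))

	// Dropping a resource on one side changes the checksum.
	fmt.Println("in sync after drift:", checksum(principal) == checksum(agent[:1]))
}
```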
Sync Detection Flow¶
- Checksum Exchange: Peer sends checksum of all managed resources
- Comparison: Recipient compares with local checksum
- Delta Sync: If different, recipient requests specific resource updates
- Resource-level Checksums: Individual resources compared via spec checksums
Resource-level Sync¶
For individual resources, spec checksums detect when updates are needed:
type RequestUpdate struct {
    Name      string
    Namespace string
    UID       string
    Kind      string
    Checksum  []byte // SHA256 of resource spec
}
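The receiving side can decide whether to send a fresh copy by hashing its own spec and comparing digests. The Go sketch below shows that comparison. How the spec is serialized before hashing is not stated in this document, so the raw byte slice used here is an assumption, as are the helper names `specChecksum` and `needsUpdate`.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// RequestUpdate repeats the message shown above.
type RequestUpdate struct {
	Name      string
	Namespace string
	UID       string
	Kind      string
	Checksum  []byte // SHA256 of resource spec
}

// specChecksum (hypothetical) hashes an already serialized spec.
func specChecksum(spec []byte) []byte {
	sum := sha256.Sum256(spec)
	return sum[:]
}

// needsUpdate (hypothetical) reports whether the requester's copy differs
// from the local spec and therefore has to be resent.
func needsUpdate(req RequestUpdate, localSpec []byte) bool {
	return !bytes.Equal(req.Checksum, specChecksum(localSpec))
}

func main() {
	local := []byte(`{"project":"default","source":{"targetRevision":"v2"}}`)
	stale := []byte(`{"project":"default","source":{"targetRevision":"v1"}}`)

	req := RequestUpdate{Name: "guestbook", Kind: "Application", Checksum: specChecksum(stale)}
	fmt.Println("stale copy needs update:", needsUpdate(req, local))

	req.Checksum = specChecksum(local)
	fmt.Println("current copy needs update:", needsUpdate(req, local))
}
```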
Queue Management¶
Queue Architecture¶
Each agent connection has a dedicated queue pair:
type QueuePair struct {
    sendQ workqueue.TypedRateLimitingInterface[*cloudevents.Event]
    recvQ workqueue.TypedRateLimitingInterface[*cloudevents.Event]
}
Queue Operations¶
Send Queue (Principal → Agent)¶
- Purpose: Buffer outgoing events to agent
- Processing: FIFO order with rate limiting
- Blocking: Sender blocks if queue is full
Receive Queue (Agent → Principal)¶
- Purpose: Buffer incoming events from agent
- Processing: Parallel processing with semaphore control
- Ordering: Per-queue ordering maintained via named locks
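The combination of a semaphore and named locks on the receive side can be sketched as follows: a buffered channel caps how many handlers run at once, and a mutex per queue name ensures two events of the same queue never run concurrently. All names here (`namedLocks`, `event`, `processAll`) are hypothetical. Note also that a mutex only serializes handlers and does not by itself guarantee FIFO acquisition order, so this sketch does not show how the project enforces ordering.

```go
package main

import (
	"fmt"
	"sync"
)

// namedLocks hands out one mutex per name (hypothetical helper).
type namedLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for name, creating it on first use.
func (n *namedLocks) get(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = make(map[string]*sync.Mutex)
	}
	l, ok := n.locks[name]
	if !ok {
		l = &sync.Mutex{}
		n.locks[name] = l
	}
	return l
}

// event is a minimal stand-in for a received event.
type event struct {
	queue string
	id    int
}

// processAll runs handle for every event with at most limit handlers in
// flight, and never two handlers for the same queue name at the same time.
func processAll(events []event, limit int, handle func(event)) {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var locks namedLocks
	var wg sync.WaitGroup
	for _, ev := range events {
		sem <- struct{}{} // blocks while limit handlers are running
		wg.Add(1)
		go func(ev event) {
			defer wg.Done()
			defer func() { <-sem }()
			l := locks.get(ev.queue)
			l.Lock()
			defer l.Unlock()
			handle(ev)
		}(ev)
	}
	wg.Wait()
}

func main() {
	events := []event{{"agent-a", 1}, {"agent-b", 2}, {"agent-a", 3}, {"agent-b", 4}}
	a, b := 0, 0
	// Each counter is only touched under its own queue's lock.
	processAll(events, 2, func(ev event) {
		if ev.queue == "agent-a" {
			a++
		} else {
			b++
		}
	})
	fmt.Println("agent-a events:", a, "agent-b events:", b)
}
```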
Event Processing Pipeline¶
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Receive   │───►│    Queue    │───►│   Process   │
│    Event    │    │    Event    │    │    Event    │
└─────────────┘    └─────────────┘    └─────────────┘
                                             │
                                             ▼
                                      ┌─────────────┐
                                      │    Send     │
                                      │  Response   │
                                      └─────────────┘
Queue Lifecycle¶
- Creation: Queue pair created when agent connects
- Processing: Continuous event processing while connected
- Cleanup: Queues drained and removed on disconnect
- Persistence: Events may be held during brief disconnections
Error Handling and Reliability¶
Connection Resilience¶
Reconnection Logic¶
- Exponential Backoff: Agent implements backoff strategy for reconnections
- State Preservation: In-flight events preserved across reconnections
- Automatic Resume: Processing resumes where it left off
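Exponential backoff doubles the wait after each failed attempt up to a ceiling. The Go sketch below shows one way to compute the delay. The function name `backoffDelay`, the one-second base, and the thirty-second cap are illustrative assumptions, since this document does not state the agent's actual parameters or whether it adds jitter.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay (hypothetical) returns base * 2^attempt, capped at max.
// attempt is zero-based: the first retry waits base.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		// Stop doubling once the next step would reach or pass the cap;
		// this also avoids overflow for large attempt counts.
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	base, max := time.Second, 30*time.Second // assumed values
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoffDelay(attempt, base, max))
	}
}
```

Production implementations usually add random jitter to the computed delay so that many agents reconnecting after a principal restart do not retry in lockstep.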
Error Classifications¶
- Temporary Errors: Network issues, temporary unavailability
- Permanent Errors: Authentication failures, invalid configurations
- Recoverable Errors: Resource conflicts, validation failures
Event Reliability¶
Delivery Guarantees¶
- At-least-once delivery: Events may be processed multiple times
- Ordering guarantees: Per-queue FIFO ordering maintained
- Idempotency: Event handlers designed to be idempotent
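With at-least-once delivery, a handler may see the same event twice, so applying it a second time must not change the outcome. One common way to get there is to remember event IDs that were already applied, as in the sketch below. This is an illustration of the general technique, not a description of how argocd-agent's handlers achieve idempotency, and `dedupHandler` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupHandler (hypothetical) remembers which event IDs were applied.
// The set grows without bound here; a real system would expire entries.
type dedupHandler struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedupHandler() *dedupHandler {
	return &dedupHandler{seen: make(map[string]bool)}
}

// handle runs apply only the first time an event ID is seen and reports
// whether it ran. Redelivered events are skipped.
func (h *dedupHandler) handle(eventID string, apply func()) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.seen[eventID] {
		return false
	}
	apply()
	h.seen[eventID] = true
	return true
}

func main() {
	h := newDedupHandler()
	applied := 0
	// The second delivery of "event-1" simulates a retry after a lost ack.
	for _, id := range []string{"event-1", "event-2", "event-1"} {
		ran := h.handle(id, func() { applied++ })
		fmt.Printf("%s applied=%v\n", id, ran)
	}
	fmt.Println("total applied:", applied)
}
```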
Failure Scenarios¶
Agent Disconnection¶
- Send queue preserves pending events
- Receive queue continues processing
- On reconnection, resync process handles missed events
Principal Restart¶
- All agent connections dropped
- Agents automatically reconnect
- Resync process rebuilds principal state
Network Partitions¶
- Agents operate autonomously during partition
- Resync resolves conflicts when connectivity restored
- Checksums detect and resolve data drift
Monitoring and Observability¶
Metrics¶
- Connection metrics: Active connections, connection duration
- Event metrics: Events sent/received, processing latency
- Error metrics: Failed events, reconnection attempts
- Queue metrics: Queue depth, processing rate
Logging¶
- Structured logging: JSON format with contextual fields
- Trace IDs: Event correlation across components
- Debug levels: Configurable verbosity for troubleshooting
Performance Characteristics¶
Scalability¶
- Concurrent agents: Principal can handle hundreds of simultaneous agents
- Event throughput: Thousands of events per second per agent
- Memory usage: Bounded by queue sizes and connection count
Latency¶
- Event propagation: Sub-second latency for most events
- Resync duration: Proportional to resource count and network latency
- Connection establishment: Typically < 5 seconds including auth
Tuning Parameters¶
- Queue sizes: Configurable per-queue limits
- Processing concurrency: Adjustable semaphore limits
- Connection timeouts: Configurable keepalive and timeout settings