High Availability¶
SMG supports high-availability cluster deployments using mesh networking for fault tolerance, scalability, and zero-downtime updates.
Overview¶
Fault Tolerance¶
Continue serving requests when individual router nodes fail. Automatic failover with zero manual intervention.
Scalability¶
Distribute load across multiple router instances. Add nodes without downtime.
State Synchronization¶
Share worker states, policy configurations, and rate limits across the cluster in real time.
Zero Downtime Updates¶
Perform rolling updates without service interruption. Graceful shutdown with request draining.
Mesh Architecture¶
Gossip Protocol¶
SWIM-based protocol for membership and failure detection.
- 1-second heartbeat interval
- Automatic peer discovery
- Failure detection in seconds
Cluster Coordination¶
Node coordination for cluster operations.
- Membership tracking
- Node status management
- Graceful shutdown coordination
CRDT Stores¶
Conflict-free Replicated Data Types for eventual consistency.
- No coordination locks
- Partition tolerant
- Automatic conflict resolution
State Replication¶
Real-time synchronization of all cluster state.
- Worker registry
- Rate limit counters
- Cache-aware routing trees
Configuration¶
Command Line Options¶
| Flag | Default | Description |
|---|---|---|
| --enable-mesh | false | Enable mesh networking for HA deployments |
| --mesh-server-name | (auto) | Unique identifier for this node in the cluster |
| --mesh-host | 0.0.0.0 | Host address for mesh communication |
| --mesh-port | 39527 | Port for mesh gRPC communication |
| --mesh-peer-urls | (none) | Initial peer URLs for cluster bootstrap |
Basic Configuration¶
Node 1 (Bootstrap)
Node 2 (Join)
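A basic two-node bring-up can be sketched as shell commands. The binary name `smg` and the hostnames `node1`/`node2` are assumptions for illustration; the flags come from the table above.

```bash
# Node 1 (bootstrap): starts the mesh with no initial peers.
# Binary name `smg` is assumed for illustration.
smg --enable-mesh \
    --mesh-server-name node1 \
    --mesh-host 0.0.0.0 \
    --mesh-port 39527

# Node 2 (join): points --mesh-peer-urls at the bootstrap node.
smg --enable-mesh \
    --mesh-server-name node2 \
    --mesh-host 0.0.0.0 \
    --mesh-port 39527 \
    --mesh-peer-urls node1:39527
```

Once node 2 reaches node 1, peer discovery propagates the full membership list, so later nodes can list any subset of existing members in `--mesh-peer-urls`.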
Environment Variables¶
```bash
export SMG_ENABLE_MESH=true
export SMG_MESH_SERVER_NAME=node1
export SMG_MESH_HOST=0.0.0.0
export SMG_MESH_PORT=39527
export SMG_MESH_PEER_URLS="node1:39527,node2:39527"
```
Gossip Protocol¶
State Synchronization¶
SMG uses a SWIM-based gossip protocol for cluster membership and state propagation:
- Ping/Ping-Req: Each node periodically pings random peers to check health
- State Sync: Healthy nodes exchange state information during pings
- Failure Detection: Unreachable nodes are marked as suspected, then down
- Broadcast: Status changes are broadcast to all cluster members
Node Status States¶
| Status | Description |
|---|---|
| INIT | Node is starting up |
| ALIVE | Node is healthy and reachable |
| SUSPECTED | Node may be unreachable (failed ping) |
| DOWN | Node confirmed unreachable (failed ping-req) |
| LEAVING | Node is gracefully shutting down |
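The transitions between these states follow the ping/ping-req sequence described above: a missed direct ping moves a node from ALIVE to SUSPECTED, and a failed indirect ping-req confirms it as DOWN. A minimal illustrative model (not SMG's actual implementation; the event names are assumptions):

```bash
# Sketch of the status transitions: state + event -> next state.
next_status() {
  case "$1:$2" in
    ALIVE:ping_failed)        echo SUSPECTED ;;  # missed direct probe
    SUSPECTED:pingreq_failed) echo DOWN ;;       # indirect probe also failed
    *:ping_ok)                echo ALIVE ;;      # any successful probe revives
    *)                        echo "$1" ;;       # otherwise, state is unchanged
  esac
}

next_status ALIVE ping_failed         # SUSPECTED
next_status SUSPECTED pingreq_failed  # DOWN
```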
Failure Detection Timing¶
| Phase | Duration | Action |
|---|---|---|
| Ping | 1s interval | Direct probe to peer |
| Down | After missed pings | Remove from active cluster |
State Synchronization¶
Synchronized State Types¶
Worker Registry¶
All nodes share worker discovery and health status.
- Worker URLs and metadata
- Health check results
- Circuit breaker states
Rate Limits¶
Cluster-wide rate limiting coordination.
- Token bucket state
- Request counters
- Quota synchronization
Routing Trees¶
Cache-aware routing state shared across nodes.
- Radix tree operations
- Prefix match data
- LRU eviction coordination
Policy State¶
Routing policy configuration and state.
- Policy parameters
- Load balancing weights
- Session affinity mappings
CRDT Implementation¶
SMG uses several CRDT types for conflict-free synchronization:
| CRDT Type | Used For | Merge Strategy |
|---|---|---|
| G-Counter | Request counts | Sum of all increments |
| PN-Counter | Token buckets | Increment sum minus decrement sum |
| LWW-Register | Worker state | Last-writer-wins by timestamp |
| OR-Set | Worker sets | Union with tombstones |
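The merge strategies above can be illustrated with the simplest case, the G-Counter: each replica keeps one slot per node, merge takes the per-slot maximum, and the counter value is the sum of the slots. A minimal sketch in plain shell arithmetic (node names and counts are illustrative):

```bash
# Two replicas' views of the per-node increment slots (illustrative values).
a_node1=3; a_node2=1   # replica A's view
b_node1=2; b_node2=4   # replica B's view

# Merge = element-wise maximum of the two views; this is commutative,
# associative, and idempotent, so replicas converge in any merge order.
m_node1=$(( a_node1 > b_node1 ? a_node1 : b_node1 ))
m_node2=$(( a_node2 > b_node2 ? a_node2 : b_node2 ))

# Counter value = sum of the merged slots.
total=$(( m_node1 + m_node2 ))
echo "merged G-Counter value: $total"   # max(3,2) + max(1,4) = 7
```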
Deployment Patterns¶
Three-Node Cluster (Minimum HA)¶
Characteristics
- Tolerates 1 node failure
- Quorum of 2 for leader election
- Recommended for most deployments
Five-Node Cluster (Higher Availability)¶
Characteristics
- Tolerates 2 node failures
- Quorum of 3 for leader election
- Suitable for critical workloads
Kubernetes Deployment¶
StatefulSet Configuration¶
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: smg
spec:
  serviceName: smg-mesh
  replicas: 3
  selector:
    matchLabels:
      app: smg
  template:
    metadata:
      labels:
        app: smg
    spec:
      containers:
        - name: smg
          image: ghcr.io/lightseekorg/smg:latest
          args:
            - --enable-mesh
            - --mesh-server-name=$(POD_NAME)
            - --mesh-host=0.0.0.0
            - --mesh-port=39527
            - --mesh-peer-urls=smg-0.smg-mesh:39527,smg-1.smg-mesh:39527,smg-2.smg-mesh:39527
            - --worker-urls=$(WORKER_URLS)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 39527
              name: mesh
```
Headless Service¶
```yaml
apiVersion: v1
kind: Service
metadata:
  name: smg-mesh
spec:
  clusterIP: None
  selector:
    app: smg
  ports:
    - port: 39527
      name: mesh
```
HA Management API¶
Health Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /ha/health | GET | Node health status |
| /ha/status | GET | Cluster status information |
| /ha/workers | GET | Worker states across cluster |
| /ha/policies | GET | Policy states across cluster |
| /ha/shutdown | POST | Graceful shutdown trigger |
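For example, a rolling-restart script might drain a node before stopping its process; the hostname and port below are placeholders for your deployment.

```bash
# Ask node1 to drain in-flight requests and leave the mesh gracefully
# before the process is stopped.
curl -X POST http://node1:8000/ha/shutdown
```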
Cluster Status Response¶
```json
{
  "node_name": "node1",
  "node_count": 3,
  "nodes": [
    {"name": "node1", "status": "ALIVE", "address": "node1:39527"},
    {"name": "node2", "status": "ALIVE", "address": "node2:39527"},
    {"name": "node3", "status": "ALIVE", "address": "node3:39527"}
  ],
  "stores": {
    "workers": {"entry_count": 5, "last_sync": "2024-01-15T10:30:00Z"},
    "policies": {"entry_count": 2, "last_sync": "2024-01-15T10:30:00Z"}
  }
}
```
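A health-check script can consume this response directly. A minimal sketch that counts ALIVE members without depending on jq (the embedded JSON mirrors the example response above):

```bash
# Saved /ha/status response (normally fetched with curl; inlined here).
status='{"node_count": 3, "nodes": [{"name": "node1", "status": "ALIVE"}, {"name": "node2", "status": "ALIVE"}, {"name": "node3", "status": "DOWN"}]}'

# Count occurrences of "status": "ALIVE" in the response.
alive=$(printf '%s' "$status" | grep -o '"status": "ALIVE"' | wc -l)
alive=$((alive))   # normalize any whitespace padding from wc

echo "alive nodes: $alive"   # alive nodes: 2
```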
Monitoring¶
Mesh Metrics¶
| Metric | Description |
|---|---|
| smg_mesh_peers_total | Number of connected peers |
| smg_mesh_peer_status | Status of each peer (1=alive, 0=down) |
| smg_mesh_sync_operations_total | State sync operations by type |
| smg_mesh_sync_latency_seconds | State sync latency histogram |
| smg_mesh_leader_elections_total | Leader election events |
| smg_mesh_gossip_messages_total | Gossip messages sent/received |
Alerting Rules¶
```yaml
groups:
  - name: smg-mesh
    rules:
      - alert: SMGClusterDegraded
        expr: smg_mesh_peers_total < 2
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "SMG cluster has fewer than 3 nodes"
      - alert: SMGNodeDown
        expr: smg_mesh_peer_status == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "SMG mesh node {{ $labels.peer }} is down"
```
Best Practices¶
Odd Node Counts¶
Use 3, 5, or 7 nodes to avoid split-brain scenarios during network partitions.
Availability Zones¶
Distribute nodes across availability zones for resilience against zone failures.
Network Latency¶
Keep mesh nodes in the same region (< 10ms RTT) for optimal state sync performance.
Monitoring¶
Monitor smg_mesh_peers_total and alert when cluster size drops below threshold.
Troubleshooting¶
Common Issues¶
| Symptom | Cause | Solution |
|---|---|---|
| Node stuck in INIT | Cannot reach peers | Check firewall rules for mesh port |
| Frequent leader elections | Network instability | Increase gossip timeouts |
| State inconsistency | Clock skew | Synchronize NTP across nodes |
| High sync latency | Large state | Increase sync interval |
Debug Logging¶
Verify Cluster Health¶
```bash
# Check cluster status
curl http://node1:8000/ha/status | jq

# Check individual node health
curl http://node1:8000/ha/health | jq

# Check worker states
curl http://node1:8000/ha/workers | jq

# Check policy states
curl http://node1:8000/ha/policies | jq
```