Health Checks¶

Background health checks continuously monitor worker availability, removing unhealthy workers from the selection pool before they can cause request failures.

Overview¶

Proactive Monitoring¶

Detect worker failures before they impact requests—not after.

Automatic Isolation¶

Unhealthy workers are removed from the pool without manual intervention.

Self-Healing¶

Workers in the NotReady state automatically rejoin the pool once they pass enough consecutive probes.

Configurable Sensitivity¶

Tune detection speed vs. tolerance for temporary issues.

Why Health Checks?¶

Without proactive health checks:

Reactive detection: Failures are discovered only when real requests fail
Wasted requests: Several requests may fail before a worker is marked unhealthy
Slower recovery: The only way to learn that a worker has recovered is to send it real traffic

With health checks:

Proactive detection: Unhealthy workers are removed before they cause request failures
Fast recovery: Workers rejoin the pool once they pass --health-success-threshold consecutive probes
Fewer wasted requests: Real requests go only to workers that are currently passing health checks

How It Works¶

SMG sends periodic HTTP requests to each worker's health endpoint:

[Figure: health check sequence diagram]
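The probing cycle can be sketched in a few lines of Python. This is an illustrative sketch, not SMG's implementation: the function name `run_probe_rounds`, the injected `probe` callable, and the sequential per-worker loop are all assumptions made for the example, and a real router would likely probe workers concurrently.

```python
import time
from typing import Callable, Dict, List, Sequence


def run_probe_rounds(
    worker_urls: Sequence[str],
    probe: Callable[[str], bool],
    interval_secs: float = 60.0,
    rounds: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[bool]]:
    """Probe every worker once per round and collect the results per worker.

    `probe` is a hypothetical callable that returns True when the worker's
    health endpoint answers successfully.
    """
    results: Dict[str, List[bool]] = {url: [] for url in worker_urls}
    for round_index in range(rounds):
        for url in worker_urls:
            # Each call stands in for one HTTP request to the health endpoint.
            results[url].append(probe(url))
        if round_index < rounds - 1:
            # Wait one interval before starting the next round.
            sleep(interval_secs)
    return results
```

Passing `sleep` and `probe` as parameters keeps the loop easy to exercise without real workers or real waiting.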

Worker States¶

State	Meaning	Traffic
Pending	Freshly registered, not yet verified	No requests
Ready	Passing health checks	Receives requests
NotReady	Consecutive probe failures reached the readiness threshold	No requests
Failed	Consecutive failures reached the liveness threshold, or `Pending` ran out of probe attempts	Terminal — receives no requests and is not probed further

The smg_worker_health gauge collapses these to 1 (Ready) and 0 (anything else), so existing dashboards continue to work.

State Transitions¶

Pending → Ready: When consecutive successful probes reach --health-success-threshold.

Pending → Failed: If the worker accumulates 10 × --health-failure-threshold total probes without ever reaching the success threshold (this prevents workers with misconfigured URLs from lingering forever).

Ready → NotReady: When consecutive failed probes reach --health-failure-threshold.

NotReady → Ready: When consecutive successful probes reach --health-success-threshold.

NotReady → Failed: When consecutive failures reach 3 × --health-failure-threshold (the liveness threshold — analogous to a Kubernetes liveness probe, tolerating longer outages than the readiness threshold).

Failed is terminal: Successful probes do not recover a Failed worker. If --remove-unhealthy-workers is set, SMG submits a removal job for the worker; otherwise the worker must be re-registered manually.
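The transitions above can be modeled as a small state machine. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that the consecutive-failure counter keeps running when a worker moves from Ready to NotReady, so that the liveness threshold counts failures from the start of the outage.

```python
from dataclasses import dataclass
from enum import Enum


class WorkerState(Enum):
    PENDING = "Pending"
    READY = "Ready"
    NOT_READY = "NotReady"
    FAILED = "Failed"


@dataclass
class HealthTracker:
    failure_threshold: int = 3  # --health-failure-threshold
    success_threshold: int = 2  # --health-success-threshold
    state: WorkerState = WorkerState.PENDING
    consecutive_successes: int = 0
    consecutive_failures: int = 0
    pending_probes: int = 0

    def record(self, success: bool) -> WorkerState:
        """Apply one probe result and return the resulting state."""
        if self.state is WorkerState.FAILED:
            return self.state  # terminal: later probes change nothing
        if success:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0

        if self.state is WorkerState.PENDING:
            self.pending_probes += 1
            if self.consecutive_successes >= self.success_threshold:
                self.state = WorkerState.READY
            elif self.pending_probes >= 10 * self.failure_threshold:
                self.state = WorkerState.FAILED  # ran out of probe attempts
        elif self.state is WorkerState.READY:
            if self.consecutive_failures >= self.failure_threshold:
                self.state = WorkerState.NOT_READY  # readiness threshold
        elif self.state is WorkerState.NOT_READY:
            if self.consecutive_successes >= self.success_threshold:
                self.state = WorkerState.READY
            elif self.consecutive_failures >= 3 * self.failure_threshold:
                self.state = WorkerState.FAILED  # liveness threshold
        return self.state
```

Under these assumptions and the default thresholds, a Ready worker that starts failing becomes NotReady on the third consecutive failure and Failed on the ninth.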

Configuration¶

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --health-check-interval-secs 60 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5 \
  --health-check-endpoint /health

Parameters¶

Parameter	Default	Description
`--health-check-interval-secs`	`60`	Interval between health checks, in seconds
`--health-failure-threshold`	`3`	Consecutive failed probes before a `Ready` worker becomes `NotReady`
`--health-success-threshold`	`2`	Consecutive successful probes before a `Pending` or `NotReady` worker becomes `Ready`
`--health-check-timeout-secs`	`5`	Timeout for each health check request, in seconds
`--health-check-endpoint`	`/health`	Endpoint path for health checks
`--disable-health-check`	`false`	Disable background health checks
`--remove-unhealthy-workers`	`false`	Submit a removal job when a worker reaches the terminal `Failed` state
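To see how these parameters translate into detection and recovery times, the rough arithmetic can be sketched as follows. The function `estimate_transition_times` is hypothetical, and the numbers are approximations: the sketch assumes exactly one probe per interval, ignores the probe timeout and where in the interval a failure begins, and assumes the failure counter keeps running from Ready through NotReady.

```python
from typing import Dict


def estimate_transition_times(
    interval_secs: int = 60,
    failure_threshold: int = 3,
    success_threshold: int = 2,
) -> Dict[str, int]:
    """Approximate seconds spent probing before each state transition."""
    return {
        # Ready -> NotReady: failure_threshold consecutive failed probes.
        "ready_to_not_ready": failure_threshold * interval_secs,
        # Ready -> Failed: 3 x failure_threshold consecutive failed probes.
        "ready_to_failed": 3 * failure_threshold * interval_secs,
        # NotReady -> Ready: success_threshold consecutive successful probes.
        "not_ready_to_ready": success_threshold * interval_secs,
        # Pending -> Failed: 10 x failure_threshold probes without success.
        "pending_to_failed": 10 * failure_threshold * interval_secs,
    }


if __name__ == "__main__":
    for name, seconds in estimate_transition_times().items():
        print(f"{name}: ~{seconds}s")
```

With the defaults this gives roughly 180 seconds before a failing worker becomes NotReady, 540 seconds before it becomes Failed, and 120 seconds for a recovered NotReady worker to return to Ready.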

Recommended Configurations¶

Fast Detection¶

Sensitive to failures—detect issues quickly.

smg \
  --health-check-interval-secs 10 \
  --health-failure-threshold 2 \
  --health-check-timeout-secs 3

Use when: Critical availability, rapid failure response needed

Conservative Detection¶

Tolerant of network blips.

smg \
  --health-check-interval-secs 120 \
  --health-failure-threshold 5 \
  --health-success-threshold 3

Use when: Flaky networks, workers with occasional slow responses

Production Balanced¶

Balanced detection for typical deployments.

smg \
  --health-check-interval-secs 30 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5

Use when: Standard production environments

No Health Checks¶

Disable health checks entirely.

smg --disable-health-check

Use when: External health monitoring, testing scenarios

Worker Health Endpoint¶

SMG expects workers to provide a health endpoint that returns:

2xx status code: Worker is healthy
Any other status or timeout: Worker is unhealthy
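The pass/fail rule above can be expressed as a small probe function. This is a minimal sketch using Python's standard library, not SMG's actual probe code: the names `status_is_healthy` and `probe_once` are hypothetical, and `urlopen` follows redirects, so a redirect that ends in a 2xx response counts as healthy here.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def status_is_healthy(status: Optional[int]) -> bool:
    """Apply the 2xx rule; None means no response (timeout, refused, etc.)."""
    return status is not None and 200 <= status < 300


def probe_once(
    base_url: str,
    endpoint: str = "/health",
    timeout_secs: float = 5.0,
) -> bool:
    """Send one GET to the health endpoint and classify the outcome."""
    url = base_url.rstrip("/") + endpoint
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            status: Optional[int] = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for error statuses such as 404 or 503.
        status = err.code
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections, DNS failures, malformed responses.
        status = None
    return status_is_healthy(status)
```

For example, `probe_once("http://worker:8000", "/api/health", timeout_secs=3.0)` would return `True` only if the worker answered with a 2xx status within three seconds.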

Example Health Endpoint (vLLM)¶

vLLM workers expose /health by default:

# vLLM automatically provides /health endpoint
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Example Health Endpoint (SGLang)¶

SGLang workers expose /health by default:

# SGLang automatically provides /health endpoint
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000

Custom Health Endpoint¶

If your worker uses a different health endpoint:

smg \
  --worker-urls http://worker:8000 \
  --health-check-endpoint /api/health

Interaction with Circuit Breakers¶

Health checks and circuit breakers work together for comprehensive fault detection:

Health Check	Circuit Breaker	Worker State
Passing	Closed	Healthy, receiving traffic
Failing	Open	Unhealthy, no traffic
Passing	Open	Recovering, limited traffic (half-open)

Key differences:

Health checks: Proactive background monitoring (no request impact)
Circuit breakers: Reactive detection based on real request failures

Both are recommended for production deployments.

Monitoring¶

Metrics¶

Metric	Description
`smg_worker_health_checks_total`	Count of health check probes, labeled by worker type and result
`smg_worker_health`	Current health status per worker (1 = `Ready`, 0 = any other state)

Useful PromQL Queries¶

Health Status¶

# Current health status per worker
smg_worker_health

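# Count of unhealthy workers (the `or vector(0)` fallback returns 0 instead of no data when all workers are healthy)
count(smg_worker_health == 0) or vector(0)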

Check Results¶

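# Health check success rate, aggregated across all workers
# (sum() drops the result label so the two sides of the division match)
sum(rate(smg_worker_health_checks_total{result="success"}[5m])) /
sum(rate(smg_worker_health_checks_total[5m]))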

# Failed checks per minute
rate(smg_worker_health_checks_total{result="failure"}[1m]) * 60

Alert Thresholds¶

Metric	Warning	Critical	Action
Unhealthy workers	1 worker	>50% workers	Investigate worker health
Health check success rate	<90%	<70%	Check network connectivity
Check duration	>timeout/2	>timeout	Workers may be overloaded

Alerting Example¶

groups:
  - name: smg-health-checks
    rules:
      - alert: WorkerUnhealthy
        expr: smg_worker_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} is unhealthy"

      - alert: MajorityUnhealthy
        expr: count(smg_worker_health == 0) > count(smg_worker_health) / 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Majority of workers are unhealthy"

Tuning Guidelines¶

Symptom	Potential Adjustment
Workers marked unhealthy too quickly	Increase `--health-failure-threshold`
Slow failure detection	Decrease `--health-check-interval-secs`
Health checks timing out	Increase `--health-check-timeout-secs`
Workers slow to rejoin	Decrease `--health-success-threshold`
Too many health check requests	Increase `--health-check-interval-secs`

What's Next?¶

Circuit Breakers¶

Reactive failure detection based on real request failures.

Circuit Breakers →

Retries¶

Automatic retry with exponential backoff for transient failures.

Retries →

Graceful Shutdown¶

Allow in-flight requests to complete during shutdown.

Graceful Shutdown →