Health Checks¶

Background health checks continuously monitor worker availability, removing unhealthy workers from the selection pool before they can cause request failures.

Overview¶

Proactive Monitoring¶

Detect worker failures before they impact requests, not after.

Automatic Isolation¶

Unhealthy workers are removed from the pool without manual intervention.

Self-Healing¶

Workers automatically rejoin the pool when they recover.

Configurable Sensitivity¶

Tune detection speed vs. tolerance for temporary issues.

Why Health Checks?¶

Without proactive health checks:

Reactive detection: Failures are discovered only when real requests fail
Wasted requests: Multiple requests may fail before the worker is marked unhealthy
Slower recovery: There is no way to know a worker has recovered without sending it traffic

With health checks:

Proactive detection: Unhealthy workers removed before they cause failures
Fast recovery: Workers rejoin the pool as soon as they're healthy
No wasted requests: Real requests go only to workers that are currently passing health checks

How It Works¶

SMG sends periodic HTTP requests to each worker's health endpoint and tracks consecutive successes and failures per worker.

Health States¶

State	Meaning	Traffic
Healthy	Passing health checks	Receives requests
Unhealthy	Consecutive failures ≥ threshold	No requests

State Transitions¶

Healthy → Unhealthy: When consecutive failed health checks reach --health-failure-threshold

Unhealthy → Healthy: When consecutive successful health checks reach --health-success-threshold
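The two transitions above amount to a pair of consecutive-result counters. The following is a minimal sketch of that logic, not SMG's actual implementation: the `WorkerHealth` class and its field names are hypothetical, and it assumes that workers start out healthy and that any opposite result resets the current streak.

```python
from dataclasses import dataclass


@dataclass
class WorkerHealth:
    """Hypothetical tracker for one worker's health state."""

    failure_threshold: int = 3  # mirrors --health-failure-threshold
    success_threshold: int = 2  # mirrors --health-success-threshold
    healthy: bool = True        # assumption: workers start out healthy
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0  # a success breaks the failure streak
            if not self.healthy and self.consecutive_successes >= self.success_threshold:
                self.healthy = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0  # a failure breaks the success streak
            if self.healthy and self.consecutive_failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


worker = WorkerHealth()
# Two failures, one success, then three failures: only the last streak
# reaches the failure threshold and flips the worker to unhealthy.
history = [worker.record(ok) for ok in (False, False, True, False, False, False)]
print(history)
```

Because the counters track consecutive results, an isolated failure between successes never removes a worker from the pool, which is what makes the thresholds a tolerance knob.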

Configuration¶

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --health-check-interval-secs 60 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5 \
  --health-check-endpoint /health

Parameters¶

Parameter	Default	Description
`--health-check-interval-secs`	`60`	Interval between health checks
`--health-failure-threshold`	`3`	Consecutive failures before marking unhealthy
`--health-success-threshold`	`2`	Consecutive successes to mark healthy again
`--health-check-timeout-secs`	`5`	Timeout for each health check request
`--health-check-endpoint`	`/health`	Endpoint path for health checks
`--disable-health-check`	`false`	Disable background health checks
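These parameters together bound how long a dead worker can stay in the pool. The sketch below gives a rough upper-bound estimate, assuming checks start on a fixed schedule every interval, the worker fails just after passing a check, and every failing probe runs until the timeout. The function names are illustrative and SMG's actual scheduling may differ.

```python
def worst_case_detection_secs(interval_secs: int, failure_threshold: int, timeout_secs: int) -> int:
    """Approximate upper bound on the time from failure to being marked unhealthy.

    The worker must fail `failure_threshold` consecutive checks, each starting
    one interval apart, and the last one may take the full timeout to fail.
    """
    return failure_threshold * interval_secs + timeout_secs


def worst_case_recovery_secs(interval_secs: int, success_threshold: int) -> int:
    """Approximate upper bound on the time from recovery to rejoining the pool.

    Ignores the (normally small) response time of a passing check.
    """
    return success_threshold * interval_secs


# Defaults from the table above: interval 60, failure threshold 3, timeout 5.
default_detection = worst_case_detection_secs(60, 3, 5)
default_recovery = worst_case_recovery_secs(60, 2)
print(default_detection, default_recovery)
```

Under these assumptions the defaults can leave a failed worker in the pool for about three minutes, so lowering the interval shortens both detection and recovery while the thresholds trade speed for tolerance.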

Recommended Configurations¶

Fast Detection¶

Sensitive to failures: detects issues quickly.

smg \
  --health-check-interval-secs 10 \
  --health-failure-threshold 2 \
  --health-check-timeout-secs 3

Use when: Critical availability, rapid failure response needed

Conservative Detection¶

Tolerant of network blips.

smg \
  --health-check-interval-secs 120 \
  --health-failure-threshold 5 \
  --health-success-threshold 3

Use when: Flaky networks, workers with occasional slow responses

Production Balanced¶

Balanced detection for typical deployments.

smg \
  --health-check-interval-secs 30 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5

Use when: Standard production environments

No Health Checks¶

Disable health checks entirely.

smg --disable-health-check

Use when: External health monitoring, testing scenarios

Worker Health Endpoint¶

SMG expects workers to provide a health endpoint that returns:

2xx status code: Worker is healthy
Any other status or timeout: Worker is unhealthy
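To check that a worker satisfies this contract before adding it to the pool, a probe along these lines can help. It is a sketch using only the Python standard library: the helper names are hypothetical, the use of GET is an assumption (the HTTP method SMG uses is not stated here), and note that `urlopen` follows redirects, so the status reported is the final one.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def is_healthy(status: Optional[int]) -> bool:
    """Apply the contract: 2xx is healthy, anything else (or no response) is not."""
    return status is not None and 200 <= status < 300


def probe_status(url: str, timeout_secs: float = 5.0) -> Optional[int]:
    """Return the HTTP status for a GET on `url`, or None on timeout or connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as 503 are raised as HTTPError; keep the code.
        return err.code
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, DNS failures, malformed replies.
        return None
```

For example, `is_healthy(probe_status("http://w1:8000/health"))` would report whether the first worker from the configuration example currently passes under these assumptions.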

Example Health Endpoint (vLLM)¶

vLLM workers expose /health by default:

# vLLM automatically provides /health endpoint
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Example Health Endpoint (SGLang)¶

SGLang workers expose /health by default:

# SGLang automatically provides /health endpoint
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000

Custom Health Endpoint¶

If your worker uses a different health endpoint:

smg \
  --worker-urls http://worker:8000 \
  --health-check-endpoint /api/health
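For a worker that does not already ship a health route, the endpoint itself can be very small. The following standard-library sketch serves `/api/health` to match the flag above. The handler, the helper names, and the `ready` flag are illustrative assumptions, and it assumes the check is a GET request.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

HEALTH_PATH = "/api/health"  # must match the value passed to --health-check-endpoint


def health_response(path: str, ready: bool = True) -> Tuple[int, bytes]:
    """Decide the status code and body for a request path."""
    if path != HEALTH_PATH:
        return 404, b"not found\n"
    if ready:
        return 200, b"ok\n"           # 2xx: the worker is reported healthy
    return 503, b"unavailable\n"      # non-2xx: the worker is reported unhealthy


class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler that answers GET requests with the health status."""

    def do_GET(self) -> None:
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def build_server(host: str = "0.0.0.0", port: int = 8000) -> HTTPServer:
    """Create (but do not start) an HTTP server exposing the health endpoint."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `build_server().serve_forever()` starts the endpoint. A real worker would compute `ready` from its own state, for example whether the model has finished loading.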

Interaction with Circuit Breakers¶

Health checks and circuit breakers work together for comprehensive fault detection:

Health Check	Circuit Breaker	Worker State
Passing	Closed	Healthy, receiving traffic
Failing	Open	Unhealthy, no traffic
Passing	Open	Recovering, limited traffic (half-open)

Key differences:

Health checks: Proactive background monitoring (no request impact)
Circuit breakers: Reactive detection based on real request failures

Both are recommended for production deployments.

Monitoring¶

Metrics¶

Metric	Description
`smg_health_check_total`	Health check results by worker and status
`smg_worker_health_status`	Current health status per worker (0=unhealthy, 1=healthy)

Useful PromQL Queries¶

Health Status¶

# Current health status per worker
smg_worker_health_status

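# Count of unhealthy workers (falls back to 0 when every worker is healthy,
# because count() over an empty result returns no data)
count(smg_worker_health_status == 0) or vector(0)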

Check Results¶

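# Overall health check success rate (aggregate both sides before dividing;
# otherwise the status label keeps failure series from matching and the
# ratio is always 1)
sum(rate(smg_health_check_total{status="success"}[5m])) /
sum(rate(smg_health_check_total[5m]))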

# Failed checks per minute
rate(smg_health_check_total{status="failure"}[1m]) * 60

Alert Thresholds¶

Metric	Warning	Critical	Action
Unhealthy workers	1 worker	>50% workers	Investigate worker health
Health check success rate	<90%	<70%	Check network connectivity
Check duration	>timeout/2	>timeout	Workers may be overloaded

Alerting Example¶

groups:
  - name: smg-health-checks
    rules:
      - alert: WorkerUnhealthy
        expr: smg_worker_health_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker_id }} is unhealthy"

      - alert: MajorityUnhealthy
        expr: count(smg_worker_health_status == 0) > count(smg_worker_health_status) / 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Majority of workers are unhealthy"

Tuning Guidelines¶

Symptom	Potential Adjustment
Workers marked unhealthy too quickly	Increase `--health-failure-threshold`
Slow failure detection	Decrease `--health-check-interval-secs`
Health checks timing out	Increase `--health-check-timeout-secs`
Workers slow to rejoin	Decrease `--health-success-threshold`
Too many health check requests	Increase `--health-check-interval-secs`

What's Next?¶

Circuit Breakers¶

Reactive failure detection based on real request failures.

Circuit Breakers →

Retries¶

Automatic retry with exponential backoff for transient failures.

Retries →

Graceful Shutdown¶

Allow in-flight requests to complete during shutdown.

Graceful Shutdown →