Health Checks

Background health checks continuously monitor worker availability, removing unhealthy workers from the selection pool before they can cause request failures.


Overview

Proactive Monitoring

Detect worker failures before they impact requests—not after.

Automatic Isolation

Unhealthy workers are removed from the pool without manual intervention.

Self-Healing

Workers automatically rejoin the pool when they recover.

Configurable Sensitivity

Tune detection speed vs. tolerance for temporary issues.


Why Health Checks?

Without proactive health checks:

  • Reactive detection: Failures only discovered when real requests fail
  • Wasted requests: Multiple requests may fail before worker is marked unhealthy
  • Slower recovery: No way to know when a worker has recovered without trying it

With health checks:

  • Proactive detection: Unhealthy workers removed before they cause failures
  • Fast recovery: Workers rejoin the pool as soon as they're healthy
  • No wasted requests: Real requests only go to verified healthy workers

How It Works

SMG sends periodic HTTP requests to each worker's health endpoint:

[Figure: Health Check Sequence Diagram]

Worker States

| State | Meaning | Traffic |
|---|---|---|
| Pending | Freshly registered, not yet verified | No requests |
| Ready | Passing health checks | Receives requests |
| NotReady | Consecutive probe failures reached the readiness threshold | No requests |
| Failed | Consecutive failures reached the liveness threshold, or Pending ran out of probe attempts | Terminal: receives no requests and is not probed further |

The smg_worker_health gauge collapses these to 1 (Ready) and 0 (anything else), so existing dashboards continue to work.

State Transitions

Pending → Ready: When consecutive successful probes reach --health-success-threshold.

Pending → Failed: When the worker accumulates 10 × --health-failure-threshold total probes (30 with the default threshold of 3) without ever reaching the success threshold. This prevents misconfigured URLs from lingering forever.

Ready → NotReady: When consecutive failed probes reach --health-failure-threshold.

NotReady → Ready: When consecutive successful probes reach --health-success-threshold.

NotReady → Failed: When consecutive failures reach 3 × --health-failure-threshold (9 with the default threshold of 3). This is the liveness threshold, analogous to a Kubernetes liveness probe: it tolerates longer outages than the readiness threshold.

Failed is terminal: Successful probes do not recover a Failed worker. A Failed worker is either removed automatically (when --remove-unhealthy-workers is set) or must be re-registered manually.
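The transitions above can be sketched as a small state machine. This is an illustrative model only, not SMG's implementation: the class and field names are hypothetical, and it assumes the consecutive-failure counter keeps running across the Ready → NotReady transition, so the liveness threshold counts the failures that caused NotReady as well.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    READY = "Ready"
    NOT_READY = "NotReady"
    FAILED = "Failed"


@dataclass
class WorkerHealth:
    """Hypothetical model of the documented worker states."""
    failure_threshold: int = 3   # mirrors --health-failure-threshold
    success_threshold: int = 2   # mirrors --health-success-threshold
    state: State = State.PENDING
    successes: int = 0           # consecutive successful probes
    failures: int = 0            # consecutive failed probes
    total_probes: int = 0        # all probes seen; only consulted while Pending

    def record(self, ok: bool) -> State:
        """Apply one probe result and return the resulting state."""
        if self.state is State.FAILED:
            return self.state  # terminal: later successes are ignored

        self.total_probes += 1
        if ok:
            self.successes += 1
            self.failures = 0
        else:
            self.failures += 1
            self.successes = 0

        if self.state is State.PENDING:
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.total_probes >= 10 * self.failure_threshold:
                self.state = State.FAILED  # ran out of probe attempts
        elif self.state is State.READY:
            if self.failures >= self.failure_threshold:
                self.state = State.NOT_READY  # readiness threshold
        elif self.state is State.NOT_READY:
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.failures >= 3 * self.failure_threshold:
                self.state = State.FAILED  # liveness threshold
        return self.state
```

Feeding a sequence of probe results into `record` shows how a flapping worker (alternating success and failure) never leaves Pending until the probe budget runs out, while a steadily failing Ready worker passes through NotReady before reaching Failed.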


Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --health-check-interval-secs 60 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5 \
  --health-check-endpoint /health

Parameters

| Parameter | Default | Description |
|---|---|---|
| --health-check-interval-secs | 60 | Interval between health checks, in seconds |
| --health-failure-threshold | 3 | Consecutive failed probes before a Ready worker is marked NotReady |
| --health-success-threshold | 2 | Consecutive successful probes before a worker is marked Ready |
| --health-check-timeout-secs | 5 | Timeout for each health check request, in seconds |
| --health-check-endpoint | /health | Endpoint path for health checks |
| --disable-health-check | false | Disable background health checks |
| --remove-unhealthy-workers | false | Submit a removal job when a worker reaches the terminal Failed state |
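To get a feel for how the interval, failure threshold, and timeout combine, the following back-of-the-envelope sketch estimates how long a worker can be down before it leaves the pool. It assumes probes fire at a fixed interval and that a failing probe takes at most the timeout to report; the function name is hypothetical and SMG's actual scheduling may differ.

```python
from typing import Tuple


def detection_window_secs(
    interval: int = 60,          # --health-check-interval-secs
    failure_threshold: int = 3,  # --health-failure-threshold
    timeout: int = 5,            # --health-check-timeout-secs
) -> Tuple[int, int]:
    """Return (best_case, worst_case) seconds from failure to detection.

    Best case: the worker dies just before a probe and every probe fails
    immediately, so only the gaps between the N probes elapse.
    Worst case: the worker dies just after a probe, and the final failing
    probe runs until it times out.
    """
    best = (failure_threshold - 1) * interval
    worst = failure_threshold * interval + timeout
    return best, worst


# Defaults: roughly 2 to 3 minutes before a worker becomes NotReady.
print(detection_window_secs())                      # (120, 185)
# Liveness threshold with defaults (3 x 3 = 9 consecutive failures).
print(detection_window_secs(failure_threshold=9))   # (480, 545)
```

Under these assumptions, lowering the interval shortens both bounds proportionally, while raising the threshold trades slower detection for more tolerance of isolated failures.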

Fast Detection

Sensitive to failures—detect issues quickly.

smg \
  --health-check-interval-secs 10 \
  --health-failure-threshold 2 \
  --health-check-timeout-secs 3

Use when: Critical availability, rapid failure response needed

Conservative Detection

Tolerant of network blips.

smg \
  --health-check-interval-secs 120 \
  --health-failure-threshold 5 \
  --health-success-threshold 3

Use when: Flaky networks, workers with occasional slow responses

Production Balanced

Balanced detection for typical deployments.

smg \
  --health-check-interval-secs 30 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5

Use when: Standard production environments

No Health Checks

Disable health checks entirely.

smg --disable-health-check

Use when: External health monitoring, testing scenarios


Worker Health Endpoint

SMG expects workers to provide a health endpoint that returns:

  • 2xx status code: Worker is healthy
  • Any other status or timeout: Worker is unhealthy
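The rule above (2xx is healthy, anything else or a timeout is unhealthy) can be illustrated with a small probe written against the Python standard library. This is a sketch of the contract a worker has to satisfy, not SMG's probe code; the function name and defaults are assumptions chosen to mirror the documented flags. It can be handy for checking a worker by hand before registering it.

```python
import http.client
import urllib.request


def probe(base_url: str, endpoint: str = "/health", timeout_secs: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    url = base_url.rstrip("/") + endpoint
    try:
        # Note: urlopen follows redirects, so the final response is judged.
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers HTTPError (4xx/5xx), URLError (refused, DNS failure),
        # and timeouts, all of which are OSError subclasses.
        return False
    except http.client.HTTPException:
        # Malformed or truncated HTTP response.
        return False


if __name__ == "__main__":
    # Hypothetical worker address; replace with a real worker URL.
    print(probe("http://w1:8000", "/health", timeout_secs=5.0))
```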

Example Health Endpoint (vLLM)

vLLM workers expose /health by default:

# vLLM automatically provides /health endpoint
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Example Health Endpoint (SGLang)

SGLang workers expose /health by default:

# SGLang automatically provides /health endpoint
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000

Custom Health Endpoint

If your worker uses a different health endpoint:

smg \
  --worker-urls http://worker:8000 \
  --health-check-endpoint /api/health

Interaction with Circuit Breakers

Health checks and circuit breakers work together for comprehensive fault detection:

| Health Check | Circuit Breaker | Worker State |
|---|---|---|
| Passing | Closed | Healthy, receiving traffic |
| Failing | Open | Unhealthy, no traffic |
| Passing | Open | Recovering, limited traffic (half-open) |

Key differences:

  • Health checks: Proactive background monitoring (no request impact)
  • Circuit breakers: Reactive detection based on real request failures

Both are recommended for production deployments.


Monitoring

Metrics

| Metric | Description |
|---|---|
| smg_worker_health_checks_total | Health check results by worker type and result |
| smg_worker_health | Current health status per worker (1=healthy, 0=unhealthy) |

Useful PromQL Queries

Health Status

# Current health status per worker
smg_worker_health

# Count of unhealthy workers (falls back to 0 when none are unhealthy)
count(smg_worker_health == 0) or vector(0)

Check Results

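# Health check success rate across all workers
# (sum() is needed so the two sides match despite the differing result label)
sum(rate(smg_worker_health_checks_total{result="success"}[5m])) /
sum(rate(smg_worker_health_checks_total[5m]))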

# Failed checks per minute
rate(smg_worker_health_checks_total{result="failure"}[1m]) * 60

Alert Thresholds

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Unhealthy workers | 1 worker | >50% workers | Investigate worker health |
| Health check success rate | <90% | <70% | Check network connectivity |
| Check duration | >timeout/2 | >timeout | Workers may be overloaded |
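The first two rows of the threshold table can be expressed as simple classification functions, for example when building a custom status page on top of the metrics. The function names and the "ok" / "warning" / "critical" return values are illustrative assumptions, not part of SMG.

```python
def classify_unhealthy(unhealthy: int, total: int) -> str:
    """Map an unhealthy-worker count to a severity, per the table above."""
    if total > 0 and unhealthy / total > 0.5:
        return "critical"   # more than 50% of workers unhealthy
    if unhealthy >= 1:
        return "warning"    # at least one worker unhealthy
    return "ok"


def classify_success_rate(rate: float) -> str:
    """Map a health check success rate (0.0 to 1.0) to a severity."""
    if rate < 0.70:
        return "critical"
    if rate < 0.90:
        return "warning"
    return "ok"


print(classify_unhealthy(1, 4))     # warning
print(classify_unhealthy(3, 4))     # critical
print(classify_success_rate(0.85))  # warning
```

Note that exactly half of the workers being unhealthy still counts as a warning here, because the table's critical threshold is strictly greater than 50%.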

Alerting Example

groups:
  - name: smg-health-checks
    rules:
      - alert: WorkerUnhealthy
        expr: smg_worker_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} is unhealthy"

      - alert: MajorityUnhealthy
        expr: count(smg_worker_health == 0) > count(smg_worker_health) / 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Majority of workers are unhealthy"

Tuning Guidelines

| Symptom | Potential Adjustment |
|---|---|
| Workers marked unhealthy too quickly | Increase --health-failure-threshold |
| Slow failure detection | Decrease --health-check-interval-secs |
| Health checks timing out | Increase --health-check-timeout-secs |
| Workers slow to rejoin | Decrease --health-success-threshold |
| Too many health check requests | Increase --health-check-interval-secs |

What's Next?

Circuit Breakers

Reactive failure detection based on real request failures.

Retries

Automatic retry with exponential backoff for transient failures.

Graceful Shutdown

Allow in-flight requests to complete during shutdown.