Skip to content

Circuit Breakers

Circuit breakers prevent cascade failures by stopping traffic to unhealthy workers. They're essential for maintaining system stability when workers fail.


Overview

Automatic Isolation

Automatically isolate unhealthy workers to prevent cascade failures across your inference fleet.

Fail Fast

Detect failing workers and stop sending traffic immediately—no wasted requests or timeout waits.

Self-Healing

Automatically test recovery and restore traffic when workers become healthy again.

Observable

Full Prometheus metrics for state transitions, failure counts, and recovery timing.


Why Circuit Breakers?

Without circuit breakers, a failing worker can cause:

  1. Wasted requests: Requests sent to failing workers timeout
  2. Increased latency: Clients wait for timeouts before retry
  3. Resource exhaustion: Connections pile up to dead workers
  4. Cascade failures: Retry storms overwhelm remaining workers

Circuit breakers fail fast—they detect failing workers and stop sending traffic immediately.


How It Works

Circuit Breaker State Machine

Each worker has its own circuit breaker with three states:

Closed State

Normal operation - requests flow through.

  • Failures increment counter
  • Success resets counter to zero
  • Opens when failures ≥ threshold

Open State

Circuit tripped - requests rejected immediately.

  • Worker isolated from pool
  • No traffic sent to worker
  • Transitions to half-open after timeout

Half-Open State

Testing recovery - limited probe requests allowed.

  • Success → close circuit
  • Any failure → reopen circuit
  • Gradual traffic restoration

State Transitions

Closed → Open

The circuit opens when:

consecutive_failures >= failure_threshold

Within the sliding window (--cb-window-duration-secs).

Open → Half-Open

After the circuit has been open for --cb-timeout-duration-secs, it transitions to half-open to test if the worker has recovered.

Half-Open → Closed

If --cb-success-threshold consecutive requests succeed, the circuit closes and normal operation resumes.

Half-Open → Open

If any request fails during half-open, the circuit immediately reopens.


Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --cb-failure-threshold 5 \
  --cb-success-threshold 2 \
  --cb-timeout-duration-secs 30 \
  --cb-window-duration-secs 60

Parameters

Parameter Default Description
--cb-failure-threshold 10 Consecutive failures before circuit opens
--cb-success-threshold 3 Successes in half-open state to close circuit
--cb-timeout-duration-secs 60 Seconds before open circuit transitions to half-open
--cb-window-duration-secs 120 Sliding window for counting failures
--disable-circuit-breaker false Disable circuit breakers entirely

Configuration Examples

Fast Circuit Opening

Sensitive to failures—isolate workers quickly.

smg \
  --cb-failure-threshold 3 \
  --cb-timeout-duration-secs 30

Use when: Critical availability, latency-sensitive applications

Tolerant Configuration

Allow occasional failures before tripping.

smg \
  --cb-failure-threshold 20 \
  --cb-success-threshold 5 \
  --cb-timeout-duration-secs 120

Use when: Flaky workers, network instability, batch processing

Tuning Guidelines

Scenario Recommendation
Flaky workers Higher failure_threshold, shorter timeout
Critical availability Lower failure_threshold, longer timeout
Fast recovery workers Lower timeout, lower success_threshold
Slow recovery workers Higher timeout, higher success_threshold

Example Scenarios

Normal Operation

Circuit Breaker Normal Operation

Worker Fails

Circuit Breaker Worker Failure

Recovery

Circuit Breaker Recovery

Monitoring

Metrics

Metric Description
smg_circuit_breaker_state Current state per worker (0=closed, 1=open, 2=half-open)
smg_circuit_breaker_transitions_total State transitions by worker and direction
smg_circuit_breaker_consecutive_failures Current failure count per worker
smg_circuit_breaker_consecutive_successes Current success count per worker

Useful PromQL Queries

Current States

# Current circuit breaker states
smg_circuit_breaker_state

# Workers with open circuits
count(smg_circuit_breaker_state == 1)

Transitions

# State transitions rate
rate(smg_circuit_breaker_transitions_total[5m])

# Consecutive failures per worker
smg_circuit_breaker_consecutive_failures

Alert Thresholds

Metric Warning Critical Action
Open circuits 1 worker All workers Investigate worker health
Transition rate >10/min >50/min Check for flapping
Consecutive failures >5 >threshold Worker likely failing

Alerting Example

groups:
  - name: smg-circuit-breakers
    rules:
      - alert: CircuitBreakerOpen
        expr: smg_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.worker_id }}"

      - alert: AllCircuitsOpen
        expr: count(smg_circuit_breaker_state == 1) == count(smg_circuit_breaker_state)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "All worker circuit breakers are open"

Interaction with Other Features

Retries

When a circuit is open:

  • Requests are rejected immediately (no retry to that worker)
  • Other workers may be tried if available

When a circuit is half-open:

  • Only a limited number of test requests are sent
  • Failures don't count against retry budget

Health Checks

Circuit breakers and health checks work together:

Health Check Circuit Breaker Worker State
Passing Closed Healthy, receiving traffic
Failing Open Unhealthy, no traffic
Passing Open Recovering, limited traffic

Disabling Circuit Breakers

In some cases, you may want to disable circuit breakers:

smg --worker-urls http://w1:8000 --disable-circuit-breaker

Not Recommended

Disabling circuit breakers removes an important safety mechanism. Only do this if you have another layer providing similar protection.


What's Next?

Retries

Automatic retry with exponential backoff for transient failures.

Retries →

Health Checks

Proactive worker monitoring and failure detection.

Health Checks →

Rate Limiting

Protect workers from overload with token bucket rate limiting.

Rate Limiting →

Metrics Reference

Complete list of circuit breaker metrics.

Metrics Reference →