Load Balancing
SMG provides multiple load balancing policies to distribute requests across workers. Select one with the --policy flag.
Before you begin
- Completed the Getting Started guide
- Two or more workers running
Policy Comparison
| Policy | Load Aware | Cache Affinity | Session Affinity | Best For |
|---|---|---|---|---|
| cache_aware | Yes | Yes | No | Production LLM |
| bucket | Yes | No | No | PD disaggregation |
| power_of_two | Yes | No | No | General load balancing |
| consistent_hashing | No | No | Yes | Session affinity |
| prefix_hash | Yes | Partial | No | Lightweight caching |
| manual | No | No | Yes | Stateful chat |
| round_robin | No | No | No | Even distribution |
| random | No | No | No | Testing |
Cache-Aware (Recommended)
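The production default. It maintains a radix tree that mirrors backend KV cache state, routes each request to the worker with the best prefix match, and falls back to load balancing when worker loads drift apart. This maximizes KV cache hits (60-90% hit rate) and reduces time to first token (TTFT) by 70-75%.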
smg \
--policy cache_aware \
--worker-urls http://w1:8000 http://w2:8000 \
--cache-threshold 0.3 \
--balance-abs-threshold 64 \
--balance-rel-threshold 1.5
| Parameter | Default | Description |
|---|---|---|
| --cache-threshold | 0.3 | Minimum prefix match ratio (0.0–1.0) to route to the highest-match worker. Below this, routes to the worker with the most available cache space |
| --balance-abs-threshold | 64 | Absolute load difference threshold: triggers load balancing when exceeded |
| --balance-rel-threshold | 1.5 | Relative load ratio threshold: triggers load balancing when max_load > min_load × ratio |
| --eviction-interval | 120 | Seconds between LRU eviction cycles for the radix trees |
| --max-tree-size | 67108864 | Maximum nodes per radix tree. Excess nodes are evicted during maintenance cycles |
Best for multi-turn conversations, RAG applications, and batch processing with shared templates.
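The routing decision described above can be sketched roughly as follows. This is an illustrative model, not SMG's implementation: a plain list of cached prompts per worker stands in for the radix tree, the function names are hypothetical, and two details are assumptions that the text above does not state: that both balance thresholds must be exceeded together, and that "most available cache space" means the worker holding the least cached text.

```python
def is_imbalanced(loads, abs_threshold=64, rel_threshold=1.5):
    """Assumption: both thresholds must be exceeded before load balancing takes over."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold


def common_prefix_len(a, b):
    """Length of the shared leading run of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_aware_route(text, cached, loads, cache_threshold=0.3,
                      abs_threshold=64, rel_threshold=1.5):
    """Pick a worker name; `cached` maps worker -> list of previously routed prompts."""
    # 1. Load balancing fallback: under imbalance, ignore the cache entirely.
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        return min(loads, key=loads.get)

    # 2. Find the worker whose cached prompts share the longest prefix.
    best_worker, best_len = None, -1
    for worker, entries in cached.items():
        match = max((common_prefix_len(text, e) for e in entries), default=0)
        if match > best_len:
            best_worker, best_len = worker, match

    # 3. Route to it only if the match ratio clears the threshold.
    if text and best_len / len(text) >= cache_threshold:
        return best_worker

    # 4. Otherwise pick the worker holding the least cached text, a stand-in
    #    for "most available cache space".
    return min(cached, key=lambda w: sum(len(e) for e in cached[w]))


cached = {"w1": ["SYSTEM: be concise. USER: hello"], "w2": ["short"]}
loads = {"w1": 10, "w2": 8}
print(cache_aware_route("SYSTEM: be concise. USER: what is 2+2?", cached, loads))  # w1
```

A real radix tree gives the same longest-prefix answer without scanning every cached entry, which is why it suits workloads with many shared prefixes.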
Power of Two Choices
Samples two random workers and routes to the one with lower load. Good load distribution with minimal overhead.
Best for heterogeneous workers with varying response times.
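A minimal sketch of the selection step, assuming load is measured as an in-flight request count per worker (the load metric and the function name are illustrative, not taken from SMG):

```python
import random


def power_of_two_choice(loads, rng=random):
    """Return a worker index: sample two distinct workers, keep the less loaded one.

    `loads[i]` is the current load of worker i (hypothetical metric).
    """
    if not loads:
        raise ValueError("no workers available")
    if len(loads) == 1:
        return 0
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b


loads = [12, 3, 7, 30]
rng = random.Random(0)
picks = [power_of_two_choice(loads, rng) for _ in range(1000)]
print(3 in picks)  # False: the most loaded worker always loses its pairing
```

Only two load values are read per request, so the cost stays constant as the worker pool grows, while the busiest worker is never chosen over a less busy one.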
Consistent Hashing
Header-based routing that remaps only a small share of keys when workers are added or removed. The routing key comes from the X-SMG-Routing-Key header or, when that header is absent, from implicit keys (Authorization, X-Forwarded-For, Cookie).
Routing Headers
| Header | Description |
|---|---|
| X-SMG-Target-Worker | Direct routing by worker index (0-based) |
| X-SMG-Routing-Key | Consistent hash routing for session affinity |
Priority: X-SMG-Target-Worker > X-SMG-Routing-Key > Implicit keys > Random fallback
Best for session affinity and user-to-worker pinning.
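The priority order and the "minimal redistribution" property can be illustrated with a small hash ring. This is a sketch under assumptions: the virtual-node count, the SHA-256 hash, and the order in which implicit keys are tried are all choices made for the example, and header names are matched case-sensitively for brevity.

```python
import bisect
import hashlib
import random

# Assumed lookup order among implicit keys; the source lists them but does not rank them.
IMPLICIT_KEYS = ("Authorization", "X-Forwarded-For", "Cookie")


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points on the ring to smooth the distribution.
        ring = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._points = [p for p, _ in ring]
        self._owners = [w for _, w in ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[i]


def route(headers, workers, ring, rng=random):
    """Apply: X-SMG-Target-Worker > X-SMG-Routing-Key > implicit keys > random."""
    target = headers.get("X-SMG-Target-Worker")
    if target is not None:
        return workers[int(target)]  # 0-based worker index
    key = headers.get("X-SMG-Routing-Key")
    if key is None:
        key = next((headers[h] for h in IMPLICIT_KEYS if h in headers), None)
    if key is not None:
        return ring.lookup(key)
    return rng.choice(workers)


workers = ["http://w1:8000", "http://w2:8000"]
ring = HashRing(workers)
print(route({"X-SMG-Routing-Key": "user-42"}, workers, ring))
```

When a third worker joins, any key that changes owner moves to the new worker only; keys never shuffle between the existing workers, which is what keeps most sessions pinned during scaling.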
Prefix Hash
A lightweight alternative to full cache-aware routing. Routes based on a hash of the first N tokens using consistent hashing with bounded load balancing.
smg \
--policy prefix_hash \
--worker-urls http://w1:8000 http://w2:8000 \
--prefix-token-count 256 \
--prefix-hash-load-factor 1.25
| Parameter | Default | Description |
|---|---|---|
| --prefix-token-count | 256 | Number of prefix tokens to hash. Longer = more precise routing, shorter = more requests grouped together |
| --prefix-hash-load-factor | 1.25 | Load threshold ratio. If a worker's load exceeds avg_load × factor, the router walks the hash ring to find a less loaded worker |
Lower memory than cache_aware with predictable O(log n) performance.
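The following sketch shows one way consistent hashing with bounded load can work for this policy. The class name, the virtual-node count, and the way token IDs are serialized before hashing are assumptions made for illustration; only the two tunables mirror the parameters in the table above.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class PrefixHashRouter:
    def __init__(self, workers, prefix_token_count=256, load_factor=1.25, vnodes=64):
        ring = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._points = [p for p, _ in ring]
        self._owners = [w for _, w in ring]
        self.prefix_token_count = prefix_token_count
        self.load_factor = load_factor

    def pick(self, token_ids, loads):
        """`loads` maps worker -> current load (hypothetical metric)."""
        # Only the first N tokens decide the ring position.
        prefix = token_ids[: self.prefix_token_count]
        key = _hash(",".join(map(str, prefix)))
        start = bisect.bisect(self._points, key) % len(self._points)  # O(log n) lookup

        # Bounded load: skip owners whose load exceeds avg_load * factor.
        limit = self.load_factor * sum(loads.values()) / len(loads)
        for step in range(len(self._owners)):
            worker = self._owners[(start + step) % len(self._owners)]
            if loads[worker] <= limit:
                return worker
        return self._owners[start]  # only reachable when load_factor < 1


router = PrefixHashRouter(["w1", "w2", "w3"])
idle = {"w1": 0, "w2": 0, "w3": 0}
shared = list(range(256))
print(router.pick(shared + [1, 2, 3], idle) == router.pick(shared + [9, 9], idle))  # True
```

Requests that share their first 256 tokens land on the same worker regardless of what follows, and an overloaded worker hands its keys to the next ring neighbor instead of accumulating a backlog.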
Bucket
Routes requests based on text length with adaptive boundaries. Periodically adjusts boundaries based on observed load distribution.
smg \
--policy bucket \
--worker-urls http://w1:8000 http://w2:8000 http://w3:8000 \
--balance-abs-threshold 64 \
--balance-rel-threshold 1.5
| Parameter | Default | Description |
|---|---|---|
| --balance-abs-threshold | 64 | Absolute load difference threshold for load balancing |
| --balance-rel-threshold | 1.5 | Relative load ratio threshold for balancing decisions |
Best for PD disaggregation where prefill workers handle different request sizes.
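To make length-based bucketing concrete, here is a hypothetical sketch. The source does not describe how boundaries are adapted or how the balance thresholds feed into that step, so the equal-count quantile adjustment below is purely an assumption chosen for illustration, as are the class and method names.

```python
import bisect


class BucketRouter:
    def __init__(self, workers, boundaries):
        if len(boundaries) != len(workers) - 1:
            raise ValueError("need exactly one boundary between adjacent workers")
        self.workers = list(workers)
        self.boundaries = sorted(boundaries)
        self._seen = []  # request lengths observed since the last adjustment

    def pick(self, text):
        """Route by text length: worker i serves lengths inside bucket i."""
        self._seen.append(len(text))
        return self.workers[bisect.bisect_right(self.boundaries, len(text))]

    def rebalance(self):
        """Assumed policy: move boundaries so each bucket saw an equal share of requests."""
        if not self._seen:
            return
        lengths = sorted(self._seen)
        n = len(self.workers)
        self.boundaries = [lengths[len(lengths) * i // n] for i in range(1, n)]
        self._seen.clear()


router = BucketRouter(["w1", "w2", "w3"], boundaries=[100, 1000])
print(router.pick("x" * 10), router.pick("x" * 500), router.pick("x" * 5000))  # w1 w2 w3
```

With fixed boundaries, a workload dominated by short prompts would pile onto the first worker; recomputing boundaries from observed lengths spreads that traffic across all buckets.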
Manual
Sticky session routing with an explicit mapping from routing key to worker. Sessions stay with their assigned worker even when new workers are added. Requires the X-SMG-Routing-Key header.
smg \
--policy manual \
--worker-urls http://w1:8000 http://w2:8000 \
--assignment-mode min_load \
--max-idle-secs 14400 \
--eviction-interval 120
| Parameter | Default | Description |
|---|---|---|
| --assignment-mode | random | Strategy for assigning new routing keys: random, min_load (fewest active requests), or min_group (fewest routing keys) |
| --max-idle-secs | 14400 | Maximum idle time (seconds) before a routing entry is evicted. Default is 4 hours |
| --eviction-interval | 120 | Seconds between TTL eviction cycles |
Best for stateful chat where context is stored on workers.
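The behavior above amounts to a key-to-worker table with idle eviction. The sketch below models it with hypothetical names; the injectable clock and the `active` counter (which a caller would update as requests start and finish) are assumptions added to keep the example self-contained.

```python
import random
import time


class ManualRouter:
    def __init__(self, workers, assignment_mode="random",
                 max_idle_secs=14400, clock=time.monotonic):
        self.workers = list(workers)
        self.mode = assignment_mode
        self.max_idle_secs = max_idle_secs
        self.clock = clock
        self.assignments = {}  # routing key -> (worker, last_seen)
        self.active = {w: 0 for w in self.workers}  # in-flight requests per worker

    def _assign(self):
        if self.mode == "min_load":  # fewest active requests
            return min(self.workers, key=lambda w: self.active[w])
        if self.mode == "min_group":  # fewest routing keys
            counts = {w: 0 for w in self.workers}
            for worker, _ in self.assignments.values():
                counts[worker] += 1
            return min(self.workers, key=counts.get)
        return random.choice(self.workers)

    def route(self, routing_key):
        """Reuse the existing assignment if present; otherwise create one."""
        entry = self.assignments.get(routing_key)
        worker = entry[0] if entry else self._assign()
        self.assignments[routing_key] = (worker, self.clock())  # refresh idle timer
        return worker

    def add_worker(self, worker):
        # Existing assignments are untouched, so sessions stay where they are.
        self.workers.append(worker)
        self.active[worker] = 0

    def evict_idle(self):
        """Drop entries idle longer than max_idle_secs; returns how many were removed."""
        now = self.clock()
        stale = [k for k, (_, seen) in self.assignments.items()
                 if now - seen > self.max_idle_secs]
        for k in stale:
            del self.assignments[k]
        return len(stale)
```

Unlike a hash ring, the table never remaps an existing key when the pool grows; the cost is per-key state, which the periodic eviction pass keeps bounded.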
Round Robin
Rotates through workers sequentially. Skips unhealthy workers automatically.
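A small sketch of sequential rotation with health skipping, assuming the set of healthy workers is supplied by a separate health checker (the class name and the `healthy` argument are illustrative):

```python
class RoundRobin:
    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0  # index of the next candidate

    def pick(self, healthy):
        """Return the next healthy worker in rotation order."""
        n = len(self.workers)
        for offset in range(n):
            idx = (self._next + offset) % n
            if self.workers[idx] in healthy:
                self._next = (idx + 1) % n
                return self.workers[idx]
        raise RuntimeError("no healthy workers")


rr = RoundRobin(["w1", "w2", "w3"])
print([rr.pick({"w1", "w3"}) for _ in range(4)])  # ['w1', 'w3', 'w1', 'w3']
```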
Random
Each healthy worker has equal probability of selection. Zero state overhead.
Choosing a Policy
| Requirement | Recommended Policy |
|---|---|
| Production LLM inference | cache_aware |
| Session affinity (sticky sessions) | manual or consistent_hashing |
| PD disaggregation | bucket |
| Load balancing without cache | power_of_two |
| Lightweight cache locality | prefix_hash |
| Even distribution | round_robin |
| Testing/development | random |
Next Steps
- Load Balancing Concepts — Detailed policy architecture, advantages/limitations, scenario guides
- Cache-Aware Routing Concepts — Radix tree architecture and routing algorithm deep dive
- Tokenizer Caching — Reduce tokenization overhead with two-level caching