
Load Balancing

SMG provides multiple load balancing policies to distribute requests across workers. Choosing the right policy depends on your workload characteristics.


Overview

Cache-Aware

Production default. Maintains a radix tree that mirrors the backend KV cache for prefix-aware routing, with load balancing as a fallback.

Bucket

Request-length-based routing with adaptive boundaries. Designed for PD disaggregation workloads.

Power of Two

Load-aware selection without global state. Samples two workers, routes to the lighter one.

Consistent Hashing

Header-based routing with minimal redistribution on scaling. Ideal for session affinity.


Policy Comparison

Policy              Load Aware  Cache Affinity  Session Affinity  Complexity  Best For
cache_aware                                                       O(prefix)   Production LLM
bucket                                                            O(n)        PD disaggregation
power_of_two                                                      O(1)        Load balancing
consistent_hashing                                                O(log n)    Session affinity
prefix_hash                     Partial                           O(log n)    Lightweight caching
manual                                                            O(1)        Stateful chat
round_robin                                                       O(1)        Even distribution
random                                                            O(1)        Testing

Cache-Aware

The recommended policy for production LLM inference. Maintains a multi-tenant radix tree that mirrors backend KV cache state, enabling perfect cache prediction with integrated load balancing.

smg --policy cache_aware --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Maximizes KV cache hits (60-90% hit rate)
  • Reduces TTFT by 70-75%
  • Integrated load balancing fallback
  • 100% accurate prefix matching

Limitations

  • Higher memory usage (radix tree per worker)
  • O(prefix) selection time
  • Requires tokenization

Use when: Production workloads with repeated prefixes—multi-turn conversations, RAG applications, batch processing with templates.
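
The prefix-matching idea behind this policy can be illustrated with a minimal sketch. It is not SMG's implementation: it uses an uncompressed per-token trie instead of a radix tree, and the names PrefixRouter and min_match_ratio are hypothetical. The sketch routes to the least-loaded worker that already holds a long enough prefix, and otherwise falls back to plain load balancing.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # token id -> TrieNode
    workers: set = field(default_factory=set)     # workers assumed to cache this prefix


class PrefixRouter:
    """Toy per-token trie; a real radix tree would also compress single-child chains."""

    def __init__(self, workers, min_match_ratio=0.5):
        self.root = TrieNode()
        self.load = {w: 0 for w in workers}      # in-flight requests per worker
        self.min_match_ratio = min_match_ratio   # hypothetical threshold, not an SMG flag

    def _longest_match(self, tokens):
        """Return (matched_length, workers recorded at the deepest matched node)."""
        node, depth, owners = self.root, 0, set()
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            depth, owners = depth + 1, node.workers
        return depth, owners

    def _record(self, tokens, worker):
        """Mark every prefix of `tokens` as cached on `worker`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.workers.add(worker)

    def route(self, tokens):
        depth, owners = self._longest_match(tokens)
        if owners and depth / len(tokens) >= self.min_match_ratio:
            # Good cache match: least-loaded worker among those holding the prefix.
            worker = min(owners, key=lambda w: self.load[w])
        else:
            # Weak or no match: fall back to plain load balancing.
            worker = min(self.load, key=self.load.get)
        self._record(tokens, worker)
        self.load[worker] += 1
        return worker

    def finish(self, worker):
        """Call when a request completes so load reflects in-flight work only."""
        self.load[worker] -= 1
```

Lookup cost grows with the length of the matched prefix, which is where the O(prefix) selection time comes from. The sketch never evicts entries, so unlike a real router it does not track what the backend has dropped from its cache.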

Learn more about Cache-Aware Routing →


Bucket

Routes requests based on request text length using adaptive boundaries. Periodically adjusts boundaries based on observed load distribution.

smg --policy bucket --worker-urls http://w1:8000 http://w2:8000 http://w3:8000

Advantages

  • Request-length awareness
  • Adaptive boundary adjustment
  • Falls back to load balancing when imbalanced

Limitations

  • O(n) complexity
  • No cache locality
  • Requires understanding of length distribution

Use when: PD disaggregation where prefill workers handle different request sizes, or workloads with bimodal request length distribution.
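
As a rough illustration of length-based routing with adaptive boundaries, the sketch below maps each request to a bucket by text length and periodically moves the boundaries to equal-count quantiles of the lengths seen so far. The quantile rule and the names BucketRouter and rebalance are assumptions made for this example; the source does not specify how SMG adjusts boundaries, and the load-balancing fallback is not modeled.

```python
import bisect


class BucketRouter:
    def __init__(self, workers, boundaries):
        # boundaries[i] is the exclusive upper length limit of bucket i;
        # the last bucket is unbounded, so one fewer boundary than workers.
        if len(boundaries) != len(workers) - 1:
            raise ValueError("need len(workers) - 1 boundaries")
        self.workers = list(workers)
        self.boundaries = sorted(boundaries)
        self.seen = []  # request lengths observed since the last rebalance

    def route(self, text):
        n = len(text)
        self.seen.append(n)
        # bisect_right sends a length equal to a boundary to the next bucket.
        return self.workers[bisect.bisect_right(self.boundaries, n)]

    def rebalance(self):
        """Assumed policy: move boundaries to equal-count quantiles of seen lengths."""
        if not self.seen:
            return
        lengths = sorted(self.seen)
        k = len(self.workers)
        self.boundaries = [lengths[(i * len(lengths)) // k] for i in range(1, k)]
        self.seen.clear()
```

With three workers and boundaries [100, 1000], requests shorter than 100 characters go to the first worker, those from 100 up to 999 to the second, and the rest to the third.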


Power of Two Choices

Samples two random workers and selects the one with lower load. Provides good load distribution with minimal coordination overhead—a proven algorithm from distributed systems research.

smg --policy power_of_two --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Load-aware without global state
  • O(1) selection time
  • Exponentially lower maximum load than purely random selection

Limitations

  • No cache locality
  • Requires load metrics from workers
  • May not find optimal worker

Use when: Heterogeneous workers with varying response times, or when cache locality doesn't matter.
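
The selection step is small enough to show in full. This is a generic sketch of the power-of-two-choices technique, not SMG's code; the function name and the loads mapping (worker to current load) are hypothetical.

```python
import random


def pick_power_of_two(loads, rng=random):
    """Sample two distinct workers and return the one with the lower load.

    loads: dict mapping worker name -> current load (e.g. in-flight requests).
    """
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    return a if loads[a] <= loads[b] else b
```

Because only two load values are read per request, the router needs no global ordering of workers. The trade-off is that the lightest worker overall is chosen only when it happens to be sampled.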


Consistent Hashing

Provides header-based consistent routing using a hash ring. Minimizes redistribution when workers scale: only about 1/N of the keys move when a worker is added or removed, where N is the number of workers.

smg --policy consistent_hashing --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Minimal redistribution on scaling
  • Automatic failover to next healthy worker
  • O(log n) lookup time

Limitations

  • No load awareness
  • No cache locality
  • Requires routing key header
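
The hash ring described above can be sketched as follows. This is a generic consistent-hashing example under stated assumptions: SHA-256 as the hash, 100 virtual nodes per worker, and the names HashRing and lookup are all choices made for illustration, not details taken from SMG.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, worker) points on the ring
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_h(f"{worker}#{i}"), worker))

    def remove(self, worker):
        self._ring = [(h, w) for h, w in self._ring if w != worker]

    def lookup(self, key, healthy=None):
        """Return the first healthy worker clockwise from hash(key)."""
        if not self._ring:
            raise LookupError("no workers")
        start = bisect.bisect_left(self._ring, (_h(key), ""))  # O(log n) search
        for off in range(len(self._ring)):
            w = self._ring[(start + off) % len(self._ring)][1]
            if healthy is None or w in healthy:
                return w
        raise LookupError("no healthy workers")
```

Adding a worker only inserts new points on the ring, so a key either keeps its worker or moves to the new one. Walking clockwise past unhealthy workers gives the failover behavior listed under Advantages.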

Routing Headers

Header                 Description
X-SMG-Target-Worker    Direct routing by worker index (0-based)
X-SMG-Routing-Key      Consistent hash routing for session affinity

Priority order: X-SMG-Target-Worker → X-SMG-Routing-Key → Implicit keys (Authorization, X-Forwarded-For, Cookie) → Random fallback
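
One way to read this priority order is as a chain of checks over the request headers. The sketch below is an interpretation, not SMG's code: the function name resolve_route is hypothetical, the implicit keys are tried in the order they are listed, and an out-of-range or non-numeric target index is assumed to fall through to the next rule.

```python
def resolve_route(headers, num_workers):
    """Return ("direct", index), ("hash", key) or ("random", None)."""
    h = {name.lower(): value for name, value in headers.items()}  # header names are case-insensitive

    # 1. Explicit worker index (0-based).
    target = h.get("x-smg-target-worker")
    if target is not None and target.isdigit() and int(target) < num_workers:
        return ("direct", int(target))

    # 2. Explicit routing key for consistent hashing.
    if h.get("x-smg-routing-key"):
        return ("hash", h["x-smg-routing-key"])

    # 3. Implicit keys, tried in the listed order (assumption).
    for name in ("authorization", "x-forwarded-for", "cookie"):
        if h.get(name):
            return ("hash", h[name])

    # 4. Nothing to hash on: random fallback.
    return ("random", None)
```

For example, a request carrying both X-SMG-Target-Worker: 1 and X-SMG-Routing-Key: user-42 resolves to direct routing, because the target header takes precedence.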

Use when: Session affinity needed, user-to-worker pinning, or consistent routing for stateful applications.


Prefix Hash

A lightweight alternative to full cache-aware routing. Routes requests based on a hash of the first N tokens, using consistent hashing with load factor override.

smg --policy prefix_hash --prefix-token-count 256 --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Predictable O(log n) performance
  • Lower memory than cache_aware
  • Groups similar prefixes together

Limitations

  • Prefix grouping, not exact matching
  • Less precise than cache_aware
  • Load factor can cause redistribution

Comparison with Cache-Aware

Aspect       prefix_hash                   cache_aware
Lookup       O(log n)                      O(prefix_len)
Memory       O(workers × virtual_nodes)    O(total_tokens)
Precision    Prefix grouping               Exact matching

Use when: Need some cache locality with predictable performance and lower memory footprint.
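
A minimal sketch of the idea, assuming a bounded-load style override: the first N token IDs are hashed onto a ring, and a worker whose load exceeds load_factor times the mean load is skipped in favor of the next one clockwise. The names build_ring, prefix_key, route_prefix_hash, and the load_factor default of 1.25 are hypothetical; only the prefix length of 256 comes from the command above.

```python
import bisect
import hashlib


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def build_ring(workers, vnodes=64):
    """Sorted (hash, worker) points; vnodes per worker is an assumed value."""
    return sorted((_h(f"{w}#{i}".encode()), w) for w in workers for i in range(vnodes))


def prefix_key(token_ids, prefix_token_count=256):
    """Hash only the first prefix_token_count token IDs."""
    prefix = token_ids[:prefix_token_count]
    return _h(",".join(map(str, prefix)).encode())


def route_prefix_hash(ring, token_ids, loads, prefix_token_count=256, load_factor=1.25):
    """Walk clockwise from the prefix hash, skipping overloaded workers."""
    mean = sum(loads.values()) / len(loads)
    limit = max(1.0, load_factor * mean)  # assumed override rule
    start = bisect.bisect_left(ring, (prefix_key(token_ids, prefix_token_count), ""))
    for off in range(len(ring)):
        worker = ring[(start + off) % len(ring)][1]
        if loads[worker] <= limit:
            return worker
    return min(loads, key=loads.get)  # defensive fallback
```

Two requests that share their first 256 tokens hash to the same ring position regardless of what follows, which is the "prefix grouping" in the table. When the override skips a worker, the request lands on a worker that may not hold the prefix, which is the redistribution noted under Limitations.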


Manual

Provides sticky session routing with explicit routing key mapping. Unlike consistent hashing, sessions stay with their assigned worker even when new workers are added.

smg --policy manual --assignment-mode min_load --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Strong session stickiness
  • Automatic failover with recovery
  • TTL-based eviction bounds memory growth from idle sessions

Limitations

  • No load balancing for existing sessions
  • Requires X-SMG-Routing-Key header
  • Memory grows with active sessions

Assignment Modes

Mode         Description
random       Randomly select from healthy workers
min_load     Select worker with fewest active requests
min_group    Select worker with fewest routing keys assigned

Use when: Stateful chat sessions where context is stored on workers, or when session continuity is critical.
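
The sticky mapping, the three assignment modes, and TTL eviction can be sketched together. This is an illustrative model only: the class ManualRouter, the ttl_seconds parameter, and the failover rule (route temporarily to another worker while keeping the original assignment, so the session returns when its worker recovers) are assumptions, not documented SMG behavior.

```python
import random
import time


class ManualRouter:
    def __init__(self, workers, assignment_mode="min_load", ttl_seconds=3600.0,
                 clock=time.monotonic):
        self.workers = list(workers)
        self.mode = assignment_mode
        self.ttl = ttl_seconds                       # hypothetical parameter name
        self.clock = clock
        self.active = {w: 0 for w in self.workers}   # in-flight requests per worker
        self.sessions = {}                           # routing key -> (worker, last_seen)

    def _assign(self, healthy):
        if self.mode == "random":
            return random.choice(healthy)
        if self.mode == "min_load":
            return min(healthy, key=lambda w: self.active[w])
        if self.mode == "min_group":
            counts = {w: 0 for w in healthy}
            for worker, _ in self.sessions.values():
                if worker in counts:
                    counts[worker] += 1
            return min(healthy, key=lambda w: counts[w])
        raise ValueError(f"unknown assignment mode: {self.mode}")

    def route(self, routing_key, healthy_set=None):
        now = self.clock()
        # TTL eviction: drop sessions not seen within the last ttl seconds.
        self.sessions = {k: v for k, v in self.sessions.items() if now - v[1] <= self.ttl}
        healthy = [w for w in self.workers if healthy_set is None or w in healthy_set]
        if not healthy:
            raise LookupError("no healthy workers")
        entry = self.sessions.get(routing_key)
        if entry is None:
            worker = self._assign(healthy)
            self.sessions[routing_key] = (worker, now)
            return worker
        self.sessions[routing_key] = (entry[0], now)  # refresh last_seen, keep assignment
        # Assumed failover: temporary detour, original assignment is kept.
        return entry[0] if entry[0] in healthy else self._assign(healthy)
```

Because the mapping is an explicit table rather than a hash ring, adding a worker changes nothing for existing keys; only new keys can be assigned to it.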


Round Robin

Rotates through workers sequentially, guaranteeing even distribution over time. Skips unhealthy workers automatically.

smg --policy round_robin --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Guaranteed even distribution
  • Predictable routing pattern
  • Minimal state (counter only)

Limitations

  • No load awareness
  • No cache locality
  • Ignores request characteristics

Use when: All workers have equal capacity and you want predictable, even distribution.
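
A minimal sketch of the rotation with health skipping, assuming the only state is a counter as noted above. The class and method names are hypothetical.

```python
class RoundRobin:
    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0  # the only state: position of the next candidate

    def pick(self, healthy=None):
        """Return the next worker in rotation, skipping unhealthy ones."""
        n = len(self.workers)
        for _ in range(n):
            worker = self.workers[self._next % n]
            self._next += 1
            if healthy is None or worker in healthy:
                return worker
        raise LookupError("no healthy workers")
```

With three healthy workers, successive calls cycle w1, w2, w3, w1, and so on. When a worker is marked unhealthy, its turn passes to the next worker in the rotation.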


Random

The simplest policy—each healthy worker has equal probability of selection. Zero state overhead.

smg --policy random --worker-urls http://w1:8000 http://w2:8000

Advantages

  • Zero state overhead
  • O(1) selection time
  • Completely stateless

Limitations

  • No load awareness
  • No cache locality
  • Can create hot spots

Use when: Testing environments or completely homogeneous workloads where simplicity is preferred.


Choosing a Policy

Decision Guide

Requirement                           Recommended Policy
Production LLM inference              cache_aware
Session affinity (sticky sessions)    manual or consistent_hashing
PD disaggregation                     bucket
Load balancing without cache          power_of_two
Lightweight cache locality            prefix_hash
Even distribution                     round_robin
Testing/development                   random

Scenario Guide

Conversational AI

Recommended: cache_aware

Maximizes KV cache reuse for multi-turn conversations with shared system prompts.

RAG Applications

Recommended: cache_aware

Exploits common document prefixes for faster Time to First Token.

Multi-Tenant Platform

Recommended: consistent_hashing or manual

User-to-worker affinity for tenant isolation or stateful sessions.

PD Disaggregation

Recommended: bucket (prefill) + power_of_two (decode)

Length-based routing for prefill, load-based for decode workers.


What's Next?

Cache-Aware Routing

Deep dive into the radix tree architecture and routing algorithm.


Circuit Breakers

How SMG handles worker failures gracefully.
