KV Events Cache-Aware Routing

This guide shows how to connect an SGLang worker that emits KV cache events to an SMG gateway running the cache-aware policy in event-driven mode, so that the gateway routes each request to the worker whose KV cache already holds the longest matching prefix.

Before you begin

  • Completed the Getting Started guide
  • Read Cache-Aware Routing for the routing concepts
  • A machine that can run an SGLang worker (GPU + CUDA-capable Python environment)
  • smg-grpc-servicer[sglang] installed alongside SGLang

Why event-driven?

Cache-aware routing has three internal flavours. The one this guide configures is the most accurate of the three because it routes against the worker's actual KV cache state rather than an approximation.

| Flavour | Tree | Input | Worker connection | Triggered when |
|---|---|---|---|---|
| Event-driven | PositionalIndexer (event-built) | Token IDs | gRPC | Worker emits KV events |
| Approximate token tree | TokenTree (prefix observed at routing time) | Token IDs | gRPC | Worker is gRPC but emits no events |
| Approximate string tree | Tree (prefix observed at routing time) | Raw text | HTTP | Worker is HTTP |

Selection is automatic and per-worker: enabling events on one worker upgrades that worker's routing path; the others keep using the approximate tree.
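
The per-worker selection rule in the table can be read as a small decision function. The sketch below is only an illustration of that rule: `WorkerInfo` and `select_flavour` are hypothetical names, not part of SMG's API, and the URL-scheme check is an assumption about how a gRPC worker is recognised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerInfo:
    """Hypothetical view of a worker; not an SMG type."""
    url: str
    emits_kv_events: bool


def select_flavour(worker: WorkerInfo) -> str:
    """Return the routing flavour the table assigns to this worker."""
    if not worker.url.startswith("grpc://"):
        return "approximate string tree"  # HTTP worker, raw-text input
    if worker.emits_kv_events:
        return "event-driven"             # gRPC worker publishing KV events
    return "approximate token tree"       # gRPC worker without events
```

Because the decision takes a single worker as input, a fleet that mixes all three kinds of worker simply ends up with three routing paths side by side.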


How the pieces fit together

┌────────────┐     ┌────────────────────────┐     ┌────────────────────┐
│   client   │ ──▶ │ smg gateway            │ ──▶ │ smg-grpc-servicer  │
│            │     │ ─ cache_aware policy   │     │ + sglang scheduler │
│            │     │ ─ KvEventMonitor       │ ◀── │ ZMQ PUB ─ KV evt   │
└────────────┘     └────────────────────────┘     └────────────────────┘
                         gRPC                            ZMQ (in-process)
                         SubscribeKvEvents

  1. SGLang's scheduler publishes block-stored / block-removed events on a ZMQ PUB socket configured by --kv-events-config.
  2. smg-grpc-servicer (running in the same process, launched via --grpc-mode) subscribes to that ZMQ socket and re-publishes the events as a gRPC server-streaming RPC (SubscribeKvEvents).
  3. SMG's KvEventMonitor opens one gRPC subscription per worker, feeds the events into a per-model PositionalIndexer, and the cache_aware policy queries that indexer at routing time.
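
To make the third step concrete, here is a toy index that consumes block-stored and block-removed events and answers "how many leading blocks of this request does each worker hold?". It is a minimal sketch, not SMG's PositionalIndexer: the class name, the method names, and the use of plain integers as block hashes are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class ToyIndexer:
    """Hypothetical per-model index: block hash -> workers that hold it."""

    def __init__(self) -> None:
        self._holders: Dict[int, Set[str]] = defaultdict(set)

    def on_blocks_stored(self, worker: str, block_hashes: Iterable[int]) -> None:
        for h in block_hashes:
            self._holders[h].add(worker)

    def on_blocks_removed(self, worker: str, block_hashes: Iterable[int]) -> None:
        for h in block_hashes:
            self._holders[h].discard(worker)

    def overlap(self, request_hashes: List[int]) -> Dict[str, int]:
        """Count, per worker, the leading request blocks that worker holds."""
        scores: Dict[str, int] = {}
        candidates: Optional[Set[str]] = None
        for depth, h in enumerate(request_hashes, start=1):
            holders = self._holders.get(h, set())
            # A worker stays a candidate only while it holds every block so far.
            candidates = set(holders) if candidates is None else candidates & holders
            if not candidates:
                break
            for worker in candidates:
                scores[worker] = depth
        return scores


index = ToyIndexer()
index.on_blocks_stored("grpc://worker-1:50051", [1, 2, 3])
index.on_blocks_stored("grpc://worker-2:50052", [1])
scores = index.overlap([1, 2, 4])
```

Here `scores` maps worker-1 to 2 and worker-2 to 1, so a policy that prefers the longest prefix would choose worker-1.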

Step 1 — Launch the SGLang worker

Install the SGLang extra of the servicer, then launch the SGLang server with both --grpc-mode and --kv-events-config:

pip install "smg-grpc-servicer[sglang]"

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode \
  --page-size 16 \
  --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:5557","topic":"kv-events"}'

What each flag does:

| Flag | Why |
|---|---|
| --grpc-mode | Hands the request loop off to smg-grpc-servicer's gRPC SglangScheduler service instead of SGLang's default HTTP server. Required for SMG to talk to this worker in gRPC mode. |
| --page-size 16 | The KV cache block size, in tokens. Mirror this in SMG's worker config so the gateway can align its overlap scoring to the right page boundaries (see Block size alignment). |
| --kv-events-config | A JSON object parsed by SGLang's KVEventsConfig.from_cli. Setting publisher: "zmq" is what actually turns on event publishing; the default publisher: "null" is a no-op. |
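
Hand-writing JSON inside shell quotes is easy to get wrong, so one option is to generate the flag value from a dictionary. The sketch below only rebuilds the command shown in this step; it is a convenience, not something SGLang or SMG requires.

```python
import json
import shlex

kv_events_config = {
    "publisher": "zmq",
    "endpoint": "tcp://*:5557",
    "topic": "kv-events",
}

# Compact separators reproduce the single-line JSON used in the launch command.
flag_value = json.dumps(kv_events_config, separators=(",", ":"))

argv = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
    "--host", "0.0.0.0",
    "--port", "50051",
    "--grpc-mode",
    "--page-size", "16",
    "--kv-events-config", flag_value,
]

# shlex.join quotes the JSON so it can be pasted into a POSIX shell as one argument.
print(shlex.join(argv))
```

Passing `argv` directly to `subprocess.run` avoids the shell altogether, which sidesteps quoting issues entirely.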

--kv-events-config field reference

All fields and defaults match SGLang's KVEventsConfig (see python/sglang/srt/disaggregation/kv_events.py upstream):

| Field | Default | Notes |
|---|---|---|
| publisher | "null" | Set to "zmq" to enable. Any other value disables event bridging in the servicer. |
| endpoint | "tcp://*:5557" | ZMQ PUB socket address. The publisher binds when the endpoint contains * or ::, or starts with ipc:// or inproc://; otherwise it connects. |
| topic | "" | ZMQ topic prefix. Match this on the subscriber side; SMG accepts any topic, so the value here matters only if you wire other subscribers in parallel. |
| replay_endpoint | null | Optional REQ/REP socket for replaying missed events. SMG does not currently use replay. |
| buffer_steps | 10000 | Size of the in-publisher replay buffer (events). |
| hwm | 100000 | ZMQ high-water mark. Once this many events are queued and the consumer has not drained them, new events are dropped. |
| max_queue_size | 100000 | Internal queue between SGLang and the ZMQ thread. |

For data-parallel deployments, the actual TCP port becomes endpoint_port + dp_rank (rank 0 keeps the configured port).
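
The bind-or-connect rule for endpoint and the per-rank port offset are both simple string rules, sketched below. The helper names are hypothetical and the code is a restatement of the behaviour described in this section, not a copy of SGLang's implementation; it deliberately handles only tcp:// endpoints for the offset.

```python
def publisher_binds(endpoint: str) -> bool:
    """True if the publisher would bind rather than connect, per the rule above."""
    return (
        "*" in endpoint
        or "::" in endpoint
        or endpoint.startswith(("ipc://", "inproc://"))
    )


def endpoint_for_rank(endpoint: str, dp_rank: int) -> str:
    """Offset the TCP port by dp_rank; rank 0 keeps the configured port."""
    if not endpoint.startswith("tcp://"):
        raise ValueError("this sketch only covers tcp:// endpoints")
    base, _, port = endpoint.rpartition(":")
    return f"{base}:{int(port) + dp_rank}"


for rank in range(3):
    print(rank, endpoint_for_rank("tcp://*:5557", rank))
```

With the default endpoint, ranks 0, 1, and 2 map to ports 5557, 5558, and 5559, so leave that many consecutive ports free on the host.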


Step 2 — Launch SMG

Point SMG at the gRPC worker and select cache_aware:

smg \
  --worker-urls grpc://worker-1:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy cache_aware \
  --block-size 16 \
  --host 0.0.0.0 \
  --port 30000

The flags that matter for event-driven routing:

| Flag | Why |
|---|---|
| grpc://... worker URL | Event subscription only runs over gRPC; HTTP workers are skipped silently. |
| --policy cache_aware | The only policy that consults the PositionalIndexer. |
| --block-size 16 | Fallback block size used until the first event arrives. After events start flowing, SMG learns the worker's true block size from the event payload and uses the learned value automatically. |

--model-path is still required for tokenization at the gateway, as in any gRPC-worker deployment (see gRPC Workers).

Block size alignment

The cache-aware policy chunks an incoming request's token IDs into blocks of block_size tokens to look them up in the PositionalIndexer. If the block size does not match what SGLang actually wrote to its cache, the lookup misses every block and the policy silently falls back to load-only routing.
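
The effect of a mismatch is easy to reproduce with a toy model. The sketch below chunks token IDs into blocks and hashes each block chained with its predecessor; this hashing scheme is an assumption chosen for illustration and is not the hash SGLang or SMG actually uses. The point is only that blocks cut at a different size never line up with what the worker reported.

```python
import hashlib
from typing import List, Sequence


def block_hashes(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Hash each full block together with its parent hash (hypothetical scheme)."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


tokens = list(range(64))
worker_blocks = set(block_hashes(tokens, 16))  # what a --page-size 16 worker would hold

aligned_hits = sum(h in worker_blocks for h in block_hashes(tokens, 16))
misaligned_hits = sum(h in worker_blocks for h in block_hashes(tokens, 32))

print(aligned_hits)     # 4
print(misaligned_hits)  # 0
```

The same 64 tokens produce four hits when the gateway uses the worker's block size and none when it uses 32, which is the silent miss described above.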

Order of precedence inside SMG:

  1. Event-learned block size (highest priority — discovered per-model from the event stream).
  2. Per-worker kv_block_size in the worker spec, if you load workers from a config file.
  3. --block-size CLI flag (router-wide default).

In practice: keep --page-size (SGLang) and --block-size (SMG) numerically equal, and let SMG correct itself once events arrive.
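
The precedence order can be written as a three-way fallback. `effective_block_size` is a hypothetical helper that mirrors the list above; it is not a function exposed by SMG.

```python
from typing import Optional


def effective_block_size(
    event_learned: Optional[int],
    worker_kv_block_size: Optional[int],
    cli_block_size: int,
) -> int:
    """Resolve the block size in the documented order of precedence."""
    if event_learned is not None:
        return event_learned         # 1. learned from the event stream
    if worker_kv_block_size is not None:
        return worker_kv_block_size  # 2. per-worker kv_block_size from the config file
    return cli_block_size            # 3. router-wide --block-size default
```

Before the first event arrives `event_learned` is `None`, which is why an accurate second or third value matters for the very first requests.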

Worker config file

If you load workers from a config file rather than CLI, pin the block size per worker so event-driven routing works on the very first request:

workers:
  - url: grpc://worker-1:50051
    connection_mode: grpc
    kv_block_size: 16
  - url: grpc://worker-2:50052
    connection_mode: grpc
    kv_block_size: 16

Step 3 — Send a request

The API surface is unchanged:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'

Send the same prompt twice. On the second call the request should land on the worker that already holds the first call's prefix in its KV cache.
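
If you prefer to script the repeated request, the following sketch builds the same POST as the curl command using only the standard library. The gateway URL and model name are taken from this guide; `build_request` and `send_twice` are hypothetical helper names.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:30000/v1/chat/completions"


def build_request(prompt: str) -> urllib.request.Request:
    """Build the chat-completions POST used in this step."""
    body = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_twice(prompt: str) -> list:
    """POST the same prompt twice and return both decoded JSON responses."""
    responses = []
    for _ in range(2):
        with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
            responses.append(json.load(resp))
    return responses
```

Calling `send_twice("Hello, who are you?")` against a running gateway issues both requests back to back; which worker served each one is visible in the gateway logs rather than in the response body.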


Verifying event delivery

The gateway emits three log messages that prove the path is live.

1. Subscription started. When SMG registers a gRPC worker, KvEventMonitor::on_worker_added logs:

INFO Starting KV event subscription worker_url=grpc://worker-1:50051 model_id=meta-llama/Llama-3.1-8B-Instruct

If you do not see this line for a worker, either that worker is HTTP or the subscription task crashed before the first connect; check the worker logs.

2. Backend block size learned. Once the first event arrives, SMG records the backend's actual block size:

DEBUG Learned block_size=16 model_id=meta-llama/Llama-3.1-8B-Instruct

3. Routing decision uses the indexer. With RUST_LOG=model_gateway::policies::cache_aware=debug, a routed request prints the overlap count and the chosen worker.

If events never arrive, the policy keeps working — it falls back to the approximate TokenTree for that worker — so cache hits will still happen, just less accurately.
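
The first two checks can be automated by scanning captured gateway output. The sketch below assumes the log lines contain the message text exactly as shown in this section (any timestamp or target prefix is ignored); `summarize` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable, List, Tuple

SUBSCRIBED = re.compile(r"Starting KV event subscription worker_url=(\S+) model_id=(\S+)")
LEARNED = re.compile(r"Learned block_size=(\d+) model_id=(\S+)")


def summarize(log_lines: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
    """Return (workers with a subscription, learned block size per model)."""
    subscribed: List[str] = []
    learned: Dict[str, int] = {}
    for line in log_lines:
        match = SUBSCRIBED.search(line)
        if match:
            subscribed.append(match.group(1))
            continue
        match = LEARNED.search(line)
        if match:
            learned[match.group(2)] = int(match.group(1))
    return subscribed, learned


sample = [
    "INFO Starting KV event subscription worker_url=grpc://worker-1:50051 "
    "model_id=meta-llama/Llama-3.1-8B-Instruct",
    "DEBUG Learned block_size=16 model_id=meta-llama/Llama-3.1-8B-Instruct",
]
workers, block_sizes = summarize(sample)
```

A worker that appears in `workers` but whose model never shows up in `block_sizes` has subscribed without receiving an event yet, which usually points at the publisher configuration on the SGLang side.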


Tuning

| Knob | Where | Effect |
|---|---|---|
| --cache-threshold | SMG | Minimum prefix overlap ratio before cache affinity overrides load. Default 0.5. Lower for more aggressive cache stickiness. |
| --balance-abs-threshold / --balance-rel-threshold | SMG | Imbalance triggers. When workers diverge in load past both thresholds, the policy switches to shortest-queue regardless of cache. |
| hwm | SGLang --kv-events-config | Raise if you see SGLang logs reporting dropped events under bursty load. |
| buffer_steps | SGLang --kv-events-config | Raise if SMG ever reports gap-detected reconnects on its KV event stream. |
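
How the three SMG knobs interact is easier to see in code. The sketch below is a guess at the shape of the decision, not the policy's implementation: the imbalance formula (absolute gap and relative ratio both exceeded), the use of `>=` for the cache threshold, and the fallback to the least-loaded worker are all assumptions, and `choose_worker` is a hypothetical name.

```python
from typing import Dict


def choose_worker(
    overlap_blocks: Dict[str, int],
    loads: Dict[str, int],
    request_blocks: int,
    cache_threshold: float,
    balance_abs_threshold: int,
    balance_rel_threshold: float,
) -> str:
    """Pick a worker; the exact formulas are assumptions for illustration."""
    least_loaded = min(loads, key=loads.get)
    max_load, min_load = max(loads.values()), loads[least_loaded]

    # Assumed imbalance test: both the absolute and the relative gap are exceeded.
    imbalanced = (
        max_load - min_load > balance_abs_threshold
        and max_load > balance_rel_threshold * min_load
    )
    if imbalanced:
        return least_loaded  # shortest queue wins regardless of cache

    best = max(loads, key=lambda w: overlap_blocks.get(w, 0))
    ratio = overlap_blocks.get(best, 0) / request_blocks if request_blocks else 0.0
    return best if ratio >= cache_threshold else least_loaded
```

Under these assumptions, lowering `cache_threshold` makes the last line favour the cached worker more often, while raising either balance threshold delays the switch to shortest-queue routing.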

Caveats

  • gRPC only. Event-driven routing requires a gRPC worker — smg-grpc-servicer is the bridge that turns SGLang's in-process ZMQ feed into a gRPC server-streaming surface SMG can subscribe to. HTTP workers fall back to the approximate string tree automatically.
  • Block size is assumed to be homogeneous within a model. If workers serving the same model use different --page-size values, the policy uses whichever block size the most recent event reported, so keep page sizes identical across those workers.
  • Mesh mode synchronizes the approximate trees, not events. When multiple SMG instances cluster via --enable-mesh, the event-driven indexer is local to each gateway. Each gateway independently subscribes to each worker.
  • No replay on reconnect today. SMG reconnects with exponential backoff on stream drops, but does not currently consume SGLang's replay_endpoint. A drop window may briefly degrade routing to load-only until events resume.
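
For readers unfamiliar with the reconnect behaviour mentioned in the last caveat, exponential backoff looks roughly like the generator below. The initial delay, growth factor, and cap are placeholder values chosen for the example; this guide does not state the values SMG uses.

```python
import itertools
from typing import Iterator


def backoff_delays(
    initial: float = 0.5, factor: float = 2.0, cap: float = 30.0
) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically up to a cap (values assumed)."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


print(list(itertools.islice(backoff_delays(), 8)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The sum of the delays up to a successful reconnect approximates the drop window during which events are missed and routing for that worker is degraded.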

Reference

  • Policy implementation: model_gateway/src/policies/cache_aware.rs
  • Event subscription manager: model_gateway/src/worker/kv_event_monitor.rs
  • KV event proto: crates/grpc_client/proto/common.proto (messages KvEventBatch, KvCacheEvent, KvBlocksStored, KvBlocksRemoved)
  • Servicer bridge: grpc_servicer/smg_grpc_servicer/sglang/servicer.py (SubscribeKvEvents)
  • SGLang upstream config: python/sglang/srt/disaggregation/kv_events.py (class KVEventsConfig)