KV Events Cache-Aware Routing¶
This guide shows how to connect an SGLang worker that emits KV cache events to an SMG gateway running the cache-aware policy in event-driven mode, so that the gateway routes each request to the worker whose KV cache already holds the longest matching prefix.
Before you begin¶
- Completed the Getting Started guide
- Read Cache-Aware Routing for the routing concepts
- A machine that can run an SGLang worker (GPU + CUDA-capable Python environment)
- smg-grpc-servicer[sglang] installed alongside SGLang
Why event-driven?¶
Cache-aware routing has three internal flavours. The one this guide configures is the most accurate of the three because it routes against the worker's actual KV cache state rather than an approximation.
| Flavour | Tree | Input | Worker connection | Triggered when |
|---|---|---|---|---|
| Event-driven | PositionalIndexer (event-built) | Token IDs | gRPC | Worker emits KV events |
| Approximate token tree | TokenTree (prefix observed at routing time) | Token IDs | gRPC | Worker is gRPC but emits no events |
| Approximate string tree | Tree (prefix observed at routing time) | Raw text | HTTP | Worker is HTTP |
Selection is automatic and per-worker: enabling events on one worker upgrades that worker's routing path; the others keep using the approximate tree.
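The per-worker selection in the table can be read as a small decision function. The sketch below is illustrative only: `Flavour` and `select_flavour` are hypothetical names, not SMG's API, and the logic simply restates the table rows.

```python
from enum import Enum


class Flavour(Enum):
    EVENT_DRIVEN = "event-driven (PositionalIndexer)"
    APPROX_TOKEN_TREE = "approximate token tree (TokenTree)"
    APPROX_STRING_TREE = "approximate string tree (Tree)"


def select_flavour(connection_mode: str, emits_kv_events: bool) -> Flavour:
    """Restate the table above for a single worker (hypothetical helper)."""
    if connection_mode == "http":
        # HTTP workers are always routed on raw text.
        return Flavour.APPROX_STRING_TREE
    if connection_mode == "grpc":
        # gRPC workers upgrade to the event-built index only if they emit events.
        if emits_kv_events:
            return Flavour.EVENT_DRIVEN
        return Flavour.APPROX_TOKEN_TREE
    raise ValueError(f"unknown connection mode: {connection_mode!r}")


# Each worker is classified independently of its neighbours.
for mode, events in [("grpc", True), ("grpc", False), ("http", False)]:
    print(mode, events, "->", select_flavour(mode, events).value)
```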
How the pieces fit together¶
┌────────────┐     ┌────────────────────────┐     ┌────────────────────┐
│   client   │ ──▶ │      smg gateway       │ ──▶ │ smg-grpc-servicer  │
│            │     │  ─ cache_aware policy  │     │ + sglang scheduler │
│            │     │  ─ KvEventMonitor      │ ◀── │ ZMQ PUB ─ KV evt   │
└────────────┘     └────────────────────────┘     └────────────────────┘
                                            gRPC    ZMQ (in-process)
                                      SubscribeKvEvents
- SGLang's scheduler publishes block-stored / block-removed events on a ZMQ PUB socket configured by --kv-events-config.
- smg-grpc-servicer (running in the same process, launched via --grpc-mode) subscribes to that ZMQ socket and re-publishes the events as a gRPC server-streaming RPC (SubscribeKvEvents).
- SMG's KvEventMonitor opens one gRPC subscription per worker, feeds the events into a per-model PositionalIndexer, and the cache_aware policy queries that indexer at routing time.
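To make the last step concrete, here is a toy model of an event-built index. It is a sketch under stated assumptions, not SMG's PositionalIndexer: `ToyIndexer`, its method names, and the use of plain integers as block hashes are all hypothetical. It only shows how stored/removed events can be turned into a per-worker count of leading request blocks that are already cached.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class ToyIndexer:
    """Tracks which workers hold which KV block hashes (hypothetical model)."""

    def __init__(self) -> None:
        self._holders: Dict[int, Set[str]] = defaultdict(set)

    def on_blocks_stored(self, worker: str, block_hashes: Iterable[int]) -> None:
        for block_hash in block_hashes:
            self._holders[block_hash].add(worker)

    def on_blocks_removed(self, worker: str, block_hashes: Iterable[int]) -> None:
        for block_hash in block_hashes:
            self._holders[block_hash].discard(worker)

    def overlap(self, request_hashes: List[int]) -> Dict[str, int]:
        """Per worker, count how many leading blocks of the request are cached."""
        scores: Dict[str, int] = {}
        candidates: Optional[Set[str]] = None
        for depth, block_hash in enumerate(request_hashes, start=1):
            holders = self._holders.get(block_hash, set())
            # A worker stays a candidate only while it holds every block so far.
            candidates = set(holders) if candidates is None else candidates & holders
            if not candidates:
                break
            for worker in candidates:
                scores[worker] = depth
        return scores


index = ToyIndexer()
index.on_blocks_stored("worker-1", [101, 102, 103])
index.on_blocks_stored("worker-2", [101])
index.on_blocks_removed("worker-1", [103])  # eviction shortens worker-1's prefix
scores = index.overlap([101, 102, 103])
print(max(scores, key=scores.get))  # worker with the longest cached prefix
```

In this toy run worker-1 still wins with two leading blocks, even after the eviction of its third block.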
Step 1 — Launch the SGLang worker¶
Install the SGLang extra of the servicer, then launch the SGLang server with both --grpc-mode and --kv-events-config:
pip install "smg-grpc-servicer[sglang]"
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--page-size 16 \
--kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:5557","topic":"kv-events"}'
What each flag does:
| Flag | Why |
|---|---|
| --grpc-mode | Hands the request loop off to smg-grpc-servicer's gRPC SglangScheduler service instead of SGLang's default HTTP server. Required for SMG to talk to this worker in gRPC mode. |
| --page-size 16 | The KV cache block size, in tokens. Mirror this in SMG's worker config so the gateway can align its overlap scoring to the right page boundaries (see Block size alignment). |
| --kv-events-config | A JSON object parsed by SGLang's KVEventsConfig.from_cli. Setting publisher: "zmq" is what actually turns on event publishing; the default publisher: "null" is a no-op. |
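Hand-quoting JSON on a shell command line is error-prone. As a convenience sketch (not part of SGLang or SMG), the flag value can be generated with the standard library so that the JSON and the shell quoting are both guaranteed to be well formed. The field values are the ones used in the launch command above.

```python
import json
import shlex

kv_events_config = {
    "publisher": "zmq",          # anything other than "zmq" leaves events off
    "endpoint": "tcp://*:5557",
    "topic": "kv-events",
}

# Compact separators reproduce the single-line form used in the launch command.
flag_value = json.dumps(kv_events_config, separators=(",", ":"))

# shlex.quote wraps the value so the shell passes it through as one argument.
print("--kv-events-config " + shlex.quote(flag_value))
```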
--kv-events-config field reference¶
All fields and defaults match SGLang's KVEventsConfig (see python/sglang/srt/disaggregation/kv_events.py upstream):
| Field | Default | Notes |
|---|---|---|
| publisher | "null" | Set to "zmq" to enable. Any other value disables event bridging in the servicer. |
| endpoint | "tcp://*:5557" | ZMQ PUB socket address. The publisher binds when the endpoint contains *, ::, or starts with ipc:// / inproc://; otherwise it connects. |
| topic | "" | ZMQ topic prefix. Match this on the subscriber side; SMG accepts any topic, so the value here matters only if you wire other subscribers in parallel. |
| replay_endpoint | null | Optional REQ/REP socket for replaying missed events. SMG does not currently use replay. |
| buffer_steps | 10000 | Size of the in-publisher replay buffer (events). |
| hwm | 100000 | ZMQ high-water mark. Once this many events are queued and the consumer has not drained them, new events are dropped. |
| max_queue_size | 100000 | Internal queue between SGLang and the ZMQ thread. |
For data-parallel deployments, the actual TCP port becomes endpoint_port + dp_rank (rank 0 keeps the configured port).
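The bind-versus-connect rule and the data-parallel port offset can be restated as two small helpers. This is a sketch of the rules as described above, not SGLang's code: `publisher_binds` and `endpoint_for_dp_rank` are hypothetical names, and leaving non-TCP endpoints unchanged is an assumption.

```python
def publisher_binds(endpoint: str) -> bool:
    """True when the publisher would bind rather than connect (rule from the table)."""
    return (
        "*" in endpoint
        or "::" in endpoint
        or endpoint.startswith(("ipc://", "inproc://"))
    )


def endpoint_for_dp_rank(endpoint: str, dp_rank: int) -> str:
    """Offset the TCP port by dp_rank; rank 0 keeps the configured port."""
    if not endpoint.startswith("tcp://"):
        # Assumption: only TCP endpoints carry a port that can be offset.
        return endpoint
    base, sep, port = endpoint.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"endpoint has no numeric port: {endpoint!r}")
    return f"{base}:{int(port) + dp_rank}"


for rank in range(3):
    print(rank, endpoint_for_dp_rank("tcp://*:5557", rank))
```

With the default endpoint, ranks 0, 1, and 2 map to ports 5557, 5558, and 5559, so a firewall or container port mapping has to expose the whole range.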
Step 2 — Launch SMG¶
Point SMG at the gRPC worker and select cache_aware:
smg \
--worker-urls grpc://worker-1:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy cache_aware \
--block-size 16 \
--host 0.0.0.0 \
--port 30000
The flags that matter for event-driven routing:
| Flag | Why |
|---|---|
| grpc://... worker URL | Event subscription only runs over gRPC; HTTP workers are skipped silently. |
| --policy cache_aware | The only policy that consults the PositionalIndexer. |
| --block-size 16 | Fallback block size used until the first event arrives. After events start flowing, SMG learns the worker's true block size from the event payload and uses the learned value automatically. |
--model-path is still required for tokenization at the gateway, as in any gRPC-worker deployment (see gRPC Workers).
Block size alignment¶
The cache-aware policy chunks an incoming request's token IDs into blocks of block_size tokens to look them up in the PositionalIndexer. If the block size does not match what SGLang actually wrote to its cache, the lookup misses every block and the policy silently falls back to load-only routing.
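A small experiment shows why a mismatched block size misses every block. The sketch below assumes a chained hash per full block, which is a common scheme but not necessarily the one SGLang or SMG uses; `block_keys` is a hypothetical helper. The point is only that keys computed over 32-token blocks never equal keys computed over 16-token blocks.

```python
import hashlib
from typing import List


def block_keys(token_ids: List[int], block_size: int) -> List[str]:
    """Chain-hash each full block so a key depends on the whole prefix before it."""
    keys: List[str] = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        keys.append(parent.hex()[:12])
    return keys


tokens = list(range(64))
worker_keys = set(block_keys(tokens, block_size=16))  # what the worker cached

aligned = block_keys(tokens, block_size=16)      # gateway agrees with the worker
misaligned = block_keys(tokens, block_size=32)   # gateway uses the wrong size

print(sum(key in worker_keys for key in aligned))     # 4
print(sum(key in worker_keys for key in misaligned))  # 0
```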
Order of precedence inside SMG:
- Event-learned block size (highest priority; discovered per-model from the event stream).
- Per-worker kv_block_size in the worker spec, if you load workers from a config file.
- --block-size CLI flag (router-wide default).
In practice: keep --page-size (SGLang) and --block-size (SMG) numerically equal, and let SMG correct itself once events arrive.
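The precedence order amounts to a first-non-empty lookup. A minimal sketch, with `resolve_block_size` as a hypothetical name rather than an SMG function:

```python
from typing import Optional


def resolve_block_size(
    event_learned: Optional[int],
    per_worker: Optional[int],
    cli_default: int,
) -> int:
    """Pick the block size using the precedence order listed above."""
    if event_learned is not None:
        return event_learned  # learned from the event stream: always wins
    if per_worker is not None:
        return per_worker     # kv_block_size from the worker spec
    return cli_default        # --block-size


# Before the first event arrives, the configured values apply.
print(resolve_block_size(None, None, 16))
# After events arrive, a learned value overrides a stale CLI default.
print(resolve_block_size(32, None, 16))
```

This is why a wrong --block-size only hurts until the first event arrives, while matching values avoid the degraded window entirely.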
Worker config file¶
If you load workers from a config file rather than CLI, pin the block size per worker so event-driven routing works on the very first request:
workers:
  - url: grpc://worker-1:50051
    connection_mode: grpc
    kv_block_size: 16
  - url: grpc://worker-2:50052
    connection_mode: grpc
    kv_block_size: 16
Step 3 — Send a request¶
The API surface is unchanged:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello, who are you?"}
]
}'
Send the same prompt twice. On the second call, the request should land on the worker whose KV cache already holds the first call's prefix.
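The same two-call check can be scripted with the standard library. This sketch assumes the gateway from Step 2 is listening on localhost:30000 and returns an OpenAI-style chat completion body (`choices[0].message.content`); `build_chat_request` and `send_twice` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST that the curl command above sends."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_twice(base_url: str = "http://localhost:30000") -> None:
    """Send one prompt twice; the second call should reuse the cached prefix."""
    request = build_chat_request(
        base_url,
        "meta-llama/Llama-3.1-8B-Instruct",
        "Hello, who are you?",
    )
    for attempt in (1, 2):
        with urllib.request.urlopen(request, timeout=60) as response:
            reply = json.load(response)
        # Assumption: OpenAI-compatible response shape.
        print(attempt, reply["choices"][0]["message"]["content"][:80])
```

Call `send_twice()` once the gateway is up, then inspect the gateway logs to see which worker served each call.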
Verifying event delivery¶
Three gateway log entries confirm that the path is live.
1. Subscription started. When SMG registers a gRPC worker, KvEventMonitor::on_worker_added logs:
INFO Starting KV event subscription worker_url=grpc://worker-1:50051 model_id=meta-llama/Llama-3.1-8B-Instruct
If you do not see this line for a worker, either that worker is HTTP or the subscription task crashed before the first connect; check the worker logs.
2. Backend block size learned. Once the first event arrives, SMG records the backend's actual block size.
3. Routing decision uses the indexer. With RUST_LOG=model_gateway::policies::cache_aware=debug, a routed request prints the overlap count and the chosen worker.
If events never arrive, the policy keeps working — it falls back to the approximate TokenTree for that worker — so cache hits will still happen, just less accurately.
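With many workers, checking the first log line by eye gets tedious. The sketch below scans gateway log lines for the "Starting KV event subscription" message shown above and reports which expected workers never logged it. The helper names are hypothetical, and the pattern assumes the `worker_url=` and `model_id=` fields appear exactly as in the sample line; any timestamp or other prefix is ignored.

```python
import re
from typing import Dict, Iterable, List

_SUBSCRIPTION = re.compile(
    r"Starting KV event subscription\s+"
    r"worker_url=(?P<url>\S+)\s+model_id=(?P<model>\S+)"
)


def subscribed_workers(log_lines: Iterable[str]) -> Dict[str, str]:
    """Map worker URL to model ID for every subscription-start line found."""
    found: Dict[str, str] = {}
    for line in log_lines:
        match = _SUBSCRIPTION.search(line)
        if match:
            found[match.group("url")] = match.group("model")
    return found


def missing_subscriptions(log_lines: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return expected worker URLs that never logged a subscription start."""
    return sorted(set(expected) - set(subscribed_workers(log_lines)))


sample = [
    "INFO Starting KV event subscription worker_url=grpc://worker-1:50051 "
    "model_id=meta-llama/Llama-3.1-8B-Instruct",
]
print(missing_subscriptions(sample, ["grpc://worker-1:50051", "grpc://worker-2:50052"]))
```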
Tuning¶
| Knob | Where | Effect |
|---|---|---|
| --cache-threshold | SMG | Minimum prefix overlap ratio before cache affinity overrides load. Default 0.5. Lower for more aggressive cache stickiness. |
| --balance-abs-threshold / --balance-rel-threshold | SMG | Imbalance triggers. When workers diverge in load past both thresholds, the policy switches to shortest-queue regardless of cache. |
| hwm | SGLang --kv-events-config | Raise if you see SGLang logs reporting dropped events under bursty load. |
| buffer_steps | SGLang --kv-events-config | Raise if SMG ever reports gap-detected reconnects on its KV event stream. |
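How the three SMG knobs interact can be sketched as a single decision function. This is an illustration under assumptions, not the policy's implementation: `is_imbalanced` and `pick_worker` are hypothetical names, the imbalance test (absolute gap above one threshold and max load above the other threshold times the min load) is an assumed reading of "past both thresholds", and the threshold values in the example are arbitrary.

```python
from typing import Dict


def is_imbalanced(loads: Dict[str, int], abs_threshold: int, rel_threshold: float) -> bool:
    """Assumed test: both the absolute and the relative gap must be exceeded."""
    highest, lowest = max(loads.values()), min(loads.values())
    return (highest - lowest) > abs_threshold and highest > rel_threshold * lowest


def pick_worker(
    loads: Dict[str, int],
    overlap_ratio: Dict[str, float],
    abs_threshold: int,
    rel_threshold: float,
    cache_threshold: float = 0.5,
) -> str:
    least_loaded = min(loads, key=loads.get)
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        return least_loaded  # shortest queue wins regardless of cache
    best = max(loads, key=lambda worker: overlap_ratio.get(worker, 0.0))
    if overlap_ratio.get(best, 0.0) >= cache_threshold:
        return best          # enough overlap: cache affinity overrides load
    return least_loaded


print(pick_worker({"a": 10, "b": 12}, {"a": 0.2, "b": 0.8}, 32, 1.5))  # b
print(pick_worker({"a": 5, "b": 100}, {"a": 0.0, "b": 0.9}, 32, 1.5))  # a
```

Lowering `cache_threshold` widens the set of requests that stick to a cached worker; lowering the balance thresholds makes the load override kick in sooner.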
Caveats¶
- gRPC only. Event-driven routing requires a gRPC worker: smg-grpc-servicer is the bridge that turns SGLang's in-process ZMQ feed into a gRPC server-streaming surface SMG can subscribe to. HTTP workers fall back to the approximate string tree automatically.
- Per-worker block size assumed homogeneous within a model. If you mix workers serving the same model with different --page-size values, the policy uses whichever block size the most recent event reported. Keep page sizes homogeneous within a model.
- mesh mode synchronizes the approximate trees, not events. When multiple SMG instances cluster via --enable-mesh, the event-driven indexer is local to each gateway. Each gateway independently subscribes to each worker.
- No replay on reconnect today. SMG reconnects with exponential backoff on stream drops, but does not currently consume SGLang's replay_endpoint. A drop window may briefly degrade routing to load-only until events resume.
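For readers unfamiliar with the reconnect behaviour mentioned in the last caveat, exponential backoff looks roughly like the generator below. The base delay, growth factor, and cap are hypothetical values chosen for illustration; SMG's actual parameters are not stated in this guide.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically up to a ceiling."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First few waits between reconnect attempts (hypothetical parameters).
print(list(islice(backoff_delays(base=1.0, factor=2.0, cap=8.0), 5)))
```

The cap bounds how long routing can stay degraded after the worker comes back, at the cost of more frequent reconnect attempts during a long outage.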
Reference¶
- Policy implementation: model_gateway/src/policies/cache_aware.rs
- Event subscription manager: model_gateway/src/worker/kv_event_monitor.rs
- KV event proto: crates/grpc_client/proto/common.proto (messages KvEventBatch, KvCacheEvent, KvBlocksStored, KvBlocksRemoved)
- Servicer bridge: grpc_servicer/smg_grpc_servicer/sglang/servicer.py (SubscribeKvEvents)
- SGLang upstream config: python/sglang/srt/disaggregation/kv_events.py (class KVEventsConfig)