KV Events Cache-Aware Routing

This guide walks through wiring an SGLang worker emitting KV cache events to SMG running the cache-aware policy in event-driven mode, so the gateway routes each request to the worker whose KV cache already holds the longest prefix.

Before you begin

Complete the Getting Started guide
Read Cache-Aware Routing for the routing concepts
Have a machine that can run an SGLang worker (GPU + CUDA-capable Python environment)
Install smg-grpc-servicer[sglang] alongside SGLang

Why event-driven?

Cache-aware routing has three internal flavours. The one this guide configures is the most accurate of the three because it routes against the worker's actual KV cache state rather than an approximation.

Flavour	Tree	Input	Worker connection	Triggered when
Event-driven	`PositionalIndexer` (event-built)	Token IDs	gRPC	Worker emits KV events
Approximate token tree	`TokenTree` (prefix observed at routing time)	Token IDs	gRPC	Worker is gRPC but emits no events
Approximate string tree	`Tree` (prefix observed at routing time)	Raw text	HTTP	Worker is HTTP

Selection is automatic and per-worker: enabling events on one worker upgrades that worker's routing path; the others keep using the approximate tree.
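The per-worker selection in the table can be summarized as a small decision function. This is only an illustrative sketch: the `Flavour` enum and `select_flavour` function are hypothetical names, not SMG's actual Rust implementation.

```python
from enum import Enum


class Flavour(Enum):
    # Hypothetical labels that mirror the rows of the table above.
    EVENT_DRIVEN = "event-driven (PositionalIndexer)"
    APPROX_TOKEN_TREE = "approximate token tree (TokenTree)"
    APPROX_STRING_TREE = "approximate string tree (Tree)"


def select_flavour(connection_mode: str, emits_kv_events: bool) -> Flavour:
    """Return the routing flavour one worker would get, per the table."""
    if connection_mode == "http":
        # HTTP workers are routed on raw text; events are not available.
        return Flavour.APPROX_STRING_TREE
    if connection_mode == "grpc":
        if emits_kv_events:
            return Flavour.EVENT_DRIVEN
        return Flavour.APPROX_TOKEN_TREE
    raise ValueError(f"unknown connection mode: {connection_mode!r}")
```

Because the choice is made per worker, a mixed fleet simply yields a different flavour for each worker.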

How the pieces fit together

┌────────────┐     ┌────────────────────────┐     ┌───────────────────┐
│   client   │ ──▶ │ smg gateway            │ ──▶ │ smg-grpc-servicer │
│            │     │ ─ cache_aware policy   │     │ + sglang scheduler│
│            │     │ ─ KvEventMonitor       │ ◀── │ ZMQ PUB ─ KV evt  │
└────────────┘     └────────────────────────┘     └───────────────────┘
                         gRPC                            ZMQ (in-process)
                         SubscribeKvEvents

SGLang's scheduler publishes block-stored / block-removed events on a ZMQ PUB socket configured by --kv-events-config.
smg-grpc-servicer (running in the same process, launched via --grpc-mode) subscribes to that ZMQ socket and re-publishes the events as a gRPC server-streaming RPC (SubscribeKvEvents).
SMG's KvEventMonitor opens one gRPC subscription per worker, feeds the events into a per-model PositionalIndexer, and the cache_aware policy queries that indexer at routing time.
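To make the last step concrete, here is a toy model of an index that is built from block-stored / block-removed events and queried at routing time. It is a deliberately simplified sketch: `ToyIndexer` and `prefix_keys` are hypothetical names, and keying blocks by their full token prefix is an assumption made for readability, not how SMG's PositionalIndexer is implemented.

```python
from collections import defaultdict


def prefix_keys(token_ids, block_size):
    """One key per full block; each key covers the whole prefix up to that block."""
    token_ids = list(token_ids)
    usable = len(token_ids) - len(token_ids) % block_size
    return [tuple(token_ids[:end]) for end in range(block_size, usable + 1, block_size)]


class ToyIndexer:
    """Toy stand-in for a per-model index fed by KV events."""

    def __init__(self):
        # block key -> set of workers currently holding that block
        self._holders = defaultdict(set)

    def on_stored(self, worker, block_keys):
        for key in block_keys:
            self._holders[key].add(worker)

    def on_removed(self, worker, block_keys):
        for key in block_keys:
            self._holders[key].discard(worker)

    def overlap(self, block_keys):
        """Return {worker: number of leading request blocks it holds}."""
        scores = {}
        alive = None
        for depth, key in enumerate(block_keys, start=1):
            holders = self._holders.get(key, set())
            # A worker only keeps scoring while it holds every earlier block too.
            alive = set(holders) if alive is None else alive & holders
            if not alive:
                break
            for worker in alive:
                scores[worker] = depth
        return scores
```

A routing policy in this model would call `overlap(prefix_keys(request_tokens, block_size))` and prefer the worker with the highest score.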

Step 1: Launch the SGLang worker

Install the SGLang extra of the servicer, then launch the SGLang server with both --grpc-mode and --kv-events-config:

pip install "smg-grpc-servicer[sglang]"

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode \
  --page-size 16 \
  --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:5557","topic":"kv-events"}'

What each flag does:

Flag	Why
`--grpc-mode`	Hands the request loop off to `smg-grpc-servicer`'s gRPC `SglangScheduler` service instead of SGLang's default HTTP server. Required for SMG to talk to this worker in gRPC mode.
`--page-size 16`	The KV cache block size, in tokens. Mirror this in SMG's worker config so the gateway can align its overlap scoring to the right page boundaries (see Block size alignment).
`--kv-events-config`	A JSON object parsed by SGLang's `KVEventsConfig.from_cli`. Setting `publisher: "zmq"` is what actually turns on event publishing — the default `publisher: "null"` is a no-op.
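If you generate launch commands from a script, building the JSON with a serializer avoids quoting mistakes. The sketch below uses only the Python standard library; `kv_events_flag` is a hypothetical helper name, and the field values are the ones from the launch command above.

```python
import json
import shlex

kv_events_config = {
    "publisher": "zmq",          # "null" (the default) would publish nothing
    "endpoint": "tcp://*:5557",
    "topic": "kv-events",
}


def kv_events_flag(config: dict) -> str:
    """Render the config as a shell-safe --kv-events-config argument."""
    payload = json.dumps(config, separators=(",", ":"))
    return f"--kv-events-config {shlex.quote(payload)}"


print(kv_events_flag(kv_events_config))
```

The printed flag can be appended to the `python -m sglang.launch_server` command line as-is.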

`--kv-events-config` field reference

All fields and defaults match SGLang's KVEventsConfig (see python/sglang/srt/disaggregation/kv_events.py upstream):

Field	Default	Notes
`publisher`	`"null"`	Set to `"zmq"` to enable. Any other value disables event bridging in the servicer.
`endpoint`	`"tcp://*:5557"`	ZMQ `PUB` socket address. The publisher binds when the endpoint contains `*`, `::`, or starts with `ipc://` / `inproc://`; otherwise it connects.
`topic`	`""`	ZMQ topic prefix. Match this on the subscriber side; SMG accepts any topic, so the value here matters only if you wire other subscribers in parallel.
`replay_endpoint`	`null`	Optional REQ/REP socket for replaying missed events. SMG does not currently use replay.
`buffer_steps`	`10000`	Size of the in-publisher replay buffer (events).
`hwm`	`100000`	ZMQ high-water mark. Once this many events are queued without the consumer draining them, new events are dropped.
`max_queue_size`	`100000`	Internal queue between SGLang and the ZMQ thread.
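The bind-versus-connect rule described for `endpoint` can be expressed as a short predicate. This is a sketch of the rule as stated in the table, with a hypothetical function name; it is not copied from SGLang's source.

```python
def publisher_binds(endpoint: str) -> bool:
    """True if the publisher would bind to this endpoint, False if it would connect."""
    return (
        "*" in endpoint
        or "::" in endpoint
        or endpoint.startswith(("ipc://", "inproc://"))
    )
```

For example, `publisher_binds("tcp://*:5557")` is true, while a concrete address such as `tcp://127.0.0.1:5557` yields false, meaning the publisher would try to connect rather than bind.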

For data-parallel deployments, the actual TCP port becomes endpoint_port + dp_rank (rank 0 keeps the configured port).
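As a worked example of the port arithmetic, the following sketch derives the per-rank endpoint from the configured one. `dp_endpoint` is a hypothetical helper; it handles only `tcp://` endpoints, and rejecting other schemes is an assumption of this sketch rather than documented behavior.

```python
def dp_endpoint(endpoint: str, dp_rank: int) -> str:
    """Return the endpoint a given data-parallel rank would use (tcp:// only)."""
    if not endpoint.startswith("tcp://"):
        # Assumption: non-TCP schemes are out of scope for this sketch.
        raise ValueError("only tcp:// endpoints carry a port to offset")
    base, _, port = endpoint.rpartition(":")
    if not port.isdigit():
        raise ValueError(f"no numeric port in {endpoint!r}")
    return f"{base}:{int(port) + dp_rank}"


for rank in range(3):
    print(rank, dp_endpoint("tcp://*:5557", rank))
```

With the default endpoint, ranks 0, 1, and 2 map to ports 5557, 5558, and 5559.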

Alternative: launch a vLLM worker

vLLM publishes KV cache events on a ZMQ socket; enable them with --kv-events-config and run the worker in SMG gRPC mode:

pip install "smg-grpc-servicer[vllm]"

# `--grpc` runs vLLM in SMG gRPC mode (loads smg-grpc-servicer);
# --kv-events-config turns on KV-event publishing:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --kv-events-config '{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557", "topic": "kv-events"}'

Event-driven routing needs the worker in SMG gRPC mode (--grpc) — KV events stream over the SubscribeKvEvents RPC, so an HTTP worker can't participate. See gRPC Workers for additional launch flags (host, port, TP size).

Field	Why
`enable_kv_cache_events: true`	vLLM-specific master switch. Without it no events are published even if a publisher is set.
`publisher: "zmq"`	Selects the ZMQ publisher the servicer bridges.
`endpoint` / `topic`	ZMQ `PUB` address and topic prefix. For data-parallel, the port is `endpoint_port + dp_rank`; SMG currently consumes rank 0.

SMG learns the block size from the BlockStored events themselves, so you needn't set it; pass vLLM's --block-size N (the analogue of SGLang's --page-size) only to pin a non-default value.

Everything downstream — SMG flags, block-size learning, and the verification logs — is unchanged; KvEventMonitor consumes the events the same way for any gRPC worker.

Alternative: launch a TokenSpeed worker

TokenSpeed's scheduler publishes KV cache events on a ZMQ socket; enable them with --kv-events-config. The TokenSpeed gRPC server is the SMG gRPC entrypoint, so there is no separate --grpc flag:

# TokenSpeed is installed from source (engine + kernel + scheduler); see
# scripts/ci_install_tokenspeed.sh. Install the bridge's extra deps:
pip install "smg-grpc-servicer[tokenspeed]"

# --kv-events-config turns on KV-event publishing in the scheduler:
python -m smg_grpc_servicer.tokenspeed \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --kv-events-config '{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557", "topic": "kv-events"}'

Field	Why
`enable_kv_cache_events: true`	TokenSpeed master switch. Without it the scheduler records no events even if a publisher is set.
`publisher: "zmq"`	Selects the ZMQ publisher the servicer bridges. Unset defaults to `"zmq"` when events are enabled; `"null"` (or any other value) disables bridging.
`endpoint` / `topic`	ZMQ `PUB` address and topic prefix. Use a bind-style endpoint (`tcp://*:PORT`): TokenSpeed only binds when the endpoint contains `*`, `::`, `ipc://`, or `inproc://`, so a concrete address like `tcp://127.0.0.1:PORT` makes it connect instead, leaving nothing bound and the stream idle. For data-parallel deployments the port is `endpoint_port + dp_rank`, and SMG currently consumes rank 0.

--kv-events-config is parsed by TokenSpeed's KVEventsConfig.from_cli. SMG learns the block size from the BlockStored events themselves, so you needn't set it; pass TokenSpeed's --page-size N only to pin a non-default value. Everything downstream is identical to the SGLang and vLLM paths.

Step 2: Launch SMG

Point SMG at the gRPC worker and select cache_aware:

smg \
  --worker-urls grpc://worker-1:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy cache_aware \
  --block-size 16 \
  --host 0.0.0.0 \
  --port 30000

The flags that matter for event-driven routing:

Flag	Why
`grpc://...` worker URL	Event subscription only runs over gRPC; HTTP workers are skipped silently.
`--policy cache_aware`	The only policy that consults the `PositionalIndexer`.
`--block-size 16`	Fallback block size used until the first event arrives. After events start flowing, SMG learns the worker's true block size from the event payload and uses the learned value automatically.

--model-path is still required for tokenization at the gateway, the same as any gRPC-worker deployment (gRPC Workers).

Block size alignment

The cache-aware policy chunks an incoming request's token IDs into blocks of block_size tokens to look them up in the PositionalIndexer. If the block size does not match what SGLang actually wrote to its cache, the lookup misses every block and the policy silently falls back to load-only routing.
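The effect of a mismatch is easy to see in a toy model. The sketch below chunks token IDs into fixed-size blocks and counts how many leading blocks of a request are found in a worker's cached set. It assumes that a trailing partial block is ignored and compares raw token tuples instead of block hashes; both are simplifications, and the function names are hypothetical.

```python
def chunk_blocks(token_ids, block_size):
    """Split token IDs into full blocks; a trailing partial block is dropped."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [tuple(token_ids[i:i + block_size]) for i in range(0, usable, block_size)]


def count_hits(cached_blocks, token_ids, block_size):
    """Count leading request blocks present in the worker's cached set."""
    hits = 0
    for block in chunk_blocks(token_ids, block_size):
        if block not in cached_blocks:
            break
        hits += 1
    return hits


prompt = list(range(64))
worker_cache = set(chunk_blocks(prompt, 16))  # worker ran with --page-size 16

print(count_hits(worker_cache, prompt, 16))   # 4: gateway block size matches
print(count_hits(worker_cache, prompt, 32))   # 0: gateway block size mismatched
```

Even though the worker holds the entire prompt, the mismatched lookup finds nothing, which is why the policy would see no cache affinity at all.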

Order of precedence inside SMG:

Event-learned block size (highest priority — discovered per-model from the event stream).
Per-worker kv_block_size in the worker spec, if you load workers from a config file.
--block-size CLI flag (router-wide default).

In practice: keep --page-size (SGLang) and --block-size (SMG) numerically equal, and let SMG correct itself once events arrive.
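The order of precedence amounts to a first-non-empty lookup. A minimal sketch, with hypothetical parameter names standing in for the three sources listed above:

```python
from typing import Optional


def effective_block_size(
    event_learned: Optional[int],
    worker_kv_block_size: Optional[int],
    cli_block_size: int,
) -> int:
    """Resolve the block size using the precedence order listed above."""
    if event_learned is not None:
        return event_learned          # discovered from the event stream
    if worker_kv_block_size is not None:
        return worker_kv_block_size   # kv_block_size from the worker spec
    return cli_block_size             # --block-size, the router-wide default
```

Before the first event arrives, `effective_block_size(None, None, 16)` returns the CLI value; once an event reports 32, `effective_block_size(32, None, 16)` returns 32.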

Worker config file

If you load workers from a config file rather than CLI, pin the block size per worker so event-driven routing works on the very first request:

workers:
  - url: grpc://worker-1:50051
    connection_mode: grpc
    kv_block_size: 16
  - url: grpc://worker-2:50052
    connection_mode: grpc
    kv_block_size: 16

Step 3: Send a request

The API surface is unchanged:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'

Send the same prompt twice. On the second call the request should land on the worker whose KV cache already holds the first call's prefix.
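The same check can be scripted. The sketch below uses only the Python standard library and assumes the gateway from Step 2 is listening on localhost:30000; `build_request` and `send_twice` are hypothetical helper names.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:30000/v1/chat/completions"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_request(prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl example above."""
    body = json.dumps(
        {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_twice(prompt: str) -> list:
    """Send an identical prompt twice and return both parsed JSON replies."""
    replies = []
    for _ in range(2):
        with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
            replies.append(json.load(resp))
    return replies
```

Calling `send_twice("Hello, who are you?")` against a running gateway issues the two requests back to back, so the second one arrives after the first call's blocks have had a chance to be reported.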

Verifying event delivery

The gateway logs three events that prove the path is live.

1. Subscription started. When SMG registers a gRPC worker, KvEventMonitor::on_worker_added logs:

INFO Starting KV event subscription worker_url=grpc://worker-1:50051 model_id=meta-llama/Llama-3.1-8B-Instruct

If you do not see this line for a worker, either that worker is HTTP or the subscription task crashed before the first connect; check the worker logs.

2. Backend block size learned. Once the first event arrives, SMG records the backend's actual block size:

DEBUG Learned block_size=16 model_id=meta-llama/Llama-3.1-8B-Instruct

3. Routing decision uses the indexer. With RUST_LOG=model_gateway::policies::cache_aware=debug, a routed request prints the overlap count and the chosen worker.

If events never arrive, the policy keeps working — it falls back to the approximate TokenTree for that worker — so cache hits will still happen, just less accurately.
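When checking many workers, scanning the gateway log for the first two lines can be automated. The sketch below matches the two log messages quoted above; the `summarize` helper is hypothetical, and the block-size line only appears if the gateway logs at DEBUG level.

```python
import re

SUBSCRIBED = re.compile(r"Starting KV event subscription worker_url=(\S+)")
LEARNED = re.compile(r"Learned block_size=(\d+) model_id=(\S+)")


def summarize(log_text: str):
    """Return (subscribed worker URLs, {model_id: learned block size})."""
    workers = SUBSCRIBED.findall(log_text)
    block_sizes = {model: int(size) for size, model in LEARNED.findall(log_text)}
    return workers, block_sizes


sample = """\
INFO Starting KV event subscription worker_url=grpc://worker-1:50051 model_id=meta-llama/Llama-3.1-8B-Instruct
DEBUG Learned block_size=16 model_id=meta-llama/Llama-3.1-8B-Instruct
"""
workers, block_sizes = summarize(sample)
print(workers)
print(block_sizes)
```

A worker URL missing from the first list points at the subscription step; a model missing from the second means no event has arrived yet.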

Tuning

Knob	Where	Effect
`--cache-threshold`	SMG	Minimum prefix overlap ratio before cache affinity overrides load. Default 0.5. Lower for more aggressive cache stickiness.
`--balance-abs-threshold` / `--balance-rel-threshold`	SMG	Imbalance triggers. When workers diverge in load past both thresholds, the policy switches to shortest-queue regardless of cache.
`hwm`	SGLang `--kv-events-config`	Raise if you see SGLang logs reporting dropped events under bursty load.
`buffer_steps`	SGLang `--kv-events-config`	Raise if SMG ever reports gap-detected reconnects on its KV event stream.
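How the three SMG knobs interact can be sketched as a single decision function. This is an illustration under stated assumptions, not SMG's actual logic: the exact imbalance formulas, the fallback to the least-loaded worker when overlap is below the threshold, and the function and parameter names are all hypothetical.

```python
def choose_worker(loads, overlaps, cache_threshold,
                  balance_abs_threshold, balance_rel_threshold):
    """Pick a worker from per-worker load and prefix-overlap ratios.

    loads:    {worker: in-flight request count}
    overlaps: {worker: matched_tokens / prompt_tokens, in [0, 1]}
    """
    least_loaded = min(loads, key=loads.get)
    max_load, min_load = max(loads.values()), loads[least_loaded]

    # Assumed imbalance test: BOTH the absolute and the relative gap are exceeded.
    imbalanced = (
        max_load - min_load > balance_abs_threshold
        and max_load > balance_rel_threshold * min_load
    )
    if imbalanced:
        return least_loaded  # shortest queue wins regardless of cache

    best = max(overlaps, key=overlaps.get, default=None)
    if best is not None and overlaps[best] >= cache_threshold:
        return best  # cache affinity overrides load
    return least_loaded  # assumed fallback when overlap is too small
```

In this model, lowering `cache_threshold` makes the second branch fire more often, which is what "more aggressive cache stickiness" means in the table.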

Caveats

gRPC only. Event-driven routing requires a gRPC worker — smg-grpc-servicer is the bridge that turns SGLang's in-process ZMQ feed into a gRPC server-streaming surface SMG can subscribe to. HTTP workers fall back to the approximate string tree automatically.
Block size is assumed homogeneous within a model. If workers serving the same model run with different --page-size values, the policy uses whichever block size the most recent event reported, so keep page sizes identical across those workers.
Mesh mode synchronizes the approximate trees, not events. When multiple SMG instances cluster via --enable-mesh, the event-driven indexer is local to each gateway, and each gateway subscribes to each worker independently.
No replay on reconnect today. SMG reconnects with exponential backoff on stream drops, but does not currently consume SGLang's replay_endpoint. A drop window may briefly degrade routing to load-only until events resume.
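For readers unfamiliar with the reconnect pattern mentioned in the last caveat, exponential backoff means each failed attempt waits longer than the previous one, up to a cap. The sketch below is generic: the base delay, growth factor, cap, and the `backoff_delays` name are all hypothetical and do not reflect SMG's actual values.

```python
def backoff_delays(base=1, factor=2, cap=10, attempts=5):
    """Yield the wait (in seconds) before each reconnect attempt."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)   # never wait longer than the cap
        delay *= factor         # grow the delay after every failed attempt


print(list(backoff_delays()))   # [1, 2, 4, 8, 10]
```

During those waits no events are consumed, which is the drop window the caveat refers to.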

Reference

Policy implementation: model_gateway/src/policies/cache_aware.rs
Event subscription manager: model_gateway/src/worker/kv_event_monitor.rs
KV event proto: crates/grpc_client/proto/common.proto (messages KvEventBatch, KvCacheEvent, KvBlocksStored, KvBlocksRemoved)
Servicer bridge: grpc_servicer/smg_grpc_servicer/sglang/servicer.py (SubscribeKvEvents)
Shared ZMQ→proto conversion: grpc_servicer/smg_grpc_servicer/kv_events.py (engine-neutral; used by the vLLM and TokenSpeed bridges)
TokenSpeed servicer bridge: grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (SubscribeKvEvents) + config resolver grpc_servicer/smg_grpc_servicer/tokenspeed/kv_events.py
SGLang upstream config: python/sglang/srt/disaggregation/kv_events.py (class KVEventsConfig)
TokenSpeed upstream config: tokenspeed/runtime/pd/kv_events.py (class KVEventsConfig)