PD Disaggregation¶

Prefill-Decode (PD) disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto specialized workers. This optimizes Time to First Token (TTFT) and throughput independently.

Before you begin¶

Completed the Getting Started guide
At least one prefill worker and one decode worker
For vLLM PD: workers started with gRPC entrypoint and KV transfer backend

Why Disaggregate?¶

Phase	Compute Pattern	Bottleneck
Prefill	Compute-bound, parallel	GPU compute
Decode	Memory-bound, sequential	Memory bandwidth

Running both on the same worker creates contention — prefill batches wait for decode slots, and decode batches stay small due to memory pressure. Dedicating workers to each phase removes this conflict.

SGLang PD¶

SMG sends the request to both prefill and decode workers simultaneously, and they coordinate KV cache transfer through a bootstrap mechanism.

Start SGLang Workers¶

# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --prefill-only

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8001 \
  --decode-only

Start SMG¶

Each prefill worker needs a bootstrap port for coordination:

smg \
  --pd-disaggregation \
  --prefill http://prefill:8000 9001 \
  --decode http://decode:8001 \
  --host 0.0.0.0 \
  --port 30000

Multiple Workers¶

smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8001 \
  --decode http://decode2:8001 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two

vLLM PD¶

SMG sends to prefill first with max_tokens=1, then sends the original request to decode. The KV backend (NIXL or Mooncake) transfers the cache transparently.

Start vLLM Workers with NIXL¶

# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Start SMG¶

vLLM workers use grpc:// URLs and require --model-path for tokenizer loading:

smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model \
  --host 0.0.0.0 \
  --port 30000

Alternative: Mooncake Backend¶

Mooncake supports TCP transport (no RDMA required). Each prefill worker needs a unique bootstrap port:

# Prefill worker
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 8998 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model

Helper Script¶

Use the provided script to launch workers with either backend:

# NIXL (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

Verify¶

# Check workers and their roles
curl http://localhost:30000/workers | jq

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

SGLang vs vLLM PD at a Glance¶

	SGLang PD	vLLM PD
Protocol	HTTP	gRPC
Dispatch	Both workers receive request simultaneously	Prefill first, then decode
KV Transfer	Bootstrap-based coordination	NIXL (RDMA) or Mooncake (TCP/RDMA)
SMG flags	`--prefill http://... <bootstrap_port>`	`--prefill grpc://...` + `--model-path`

Next Steps¶

For sizing guidelines, per-phase routing policies, Kubernetes service discovery, and monitoring, see the full PD Disaggregation Concepts page.