
PD Disaggregation

Prefill-Decode (PD) disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto specialized workers, so Time to First Token (TTFT) and generation throughput can be tuned and scaled independently.

Before you begin

  • Completed the Getting Started guide
  • At least one prefill worker and one decode worker
  • For vLLM PD: workers started with gRPC entrypoint and KV transfer backend

Why Disaggregate?

| Phase   | Compute Pattern          | Bottleneck       |
|---------|--------------------------|------------------|
| Prefill | Compute-bound, parallel  | GPU compute      |
| Decode  | Memory-bound, sequential | Memory bandwidth |

Running both on the same worker creates contention — prefill batches wait for decode slots, and decode batches stay small due to memory pressure. Dedicating workers to each phase removes this conflict.


SGLang PD

SMG sends the request to both prefill and decode workers simultaneously, and they coordinate KV cache transfer through a bootstrap mechanism.
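The coordination is visible in the request payload itself. As an illustration only, here is roughly what a worker receives when SMG fans a request out; the field names (bootstrap_host, bootstrap_port, bootstrap_room) follow SGLang's disaggregation API and may vary between versions, and all values here are placeholders:

```shell
# Hypothetical request as dispatched by SMG to the prefill worker. The decode
# worker receives the same bootstrap metadata, which is how the two sides
# rendezvous for the KV cache transfer.
curl http://prefill:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello!",
    "sampling_params": {"max_new_tokens": 32},
    "bootstrap_host": "prefill",
    "bootstrap_port": 9001,
    "bootstrap_room": 42
  }'
```

You never send these fields yourself in normal operation; SMG injects them per request.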

Start SGLang Workers

# Prefill worker (bootstrap port must match the one passed to SMG below)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --disaggregation-mode prefill \
  --disaggregation-bootstrap-port 9001

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8001 \
  --disaggregation-mode decode

Start SMG

Each prefill worker needs a bootstrap port for coordination:

smg \
  --pd-disaggregation \
  --prefill http://prefill:8000 9001 \
  --decode http://decode:8001 \
  --host 0.0.0.0 \
  --port 30000

Multiple Workers

smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8001 \
  --decode http://decode2:8001 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two

vLLM PD

SMG first sends the request to the prefill worker with max_tokens=1, then sends the original request to the decode worker. The configured KV backend (NIXL or Mooncake) transfers the cache between the two workers transparently.
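The prefill-side request is just the original body with max_tokens clamped to 1, so the prefill worker populates the KV cache without generating usable output. SMG performs this rewrite internally; the jq sketch below only illustrates the transformation, and the request body shown is a made-up example:

```shell
# An illustrative client request body.
ORIGINAL='{"model":"m","messages":[{"role":"user","content":"Hi"}],"max_tokens":256}'

# What the prefill worker effectively receives: same request, max_tokens = 1.
echo "$ORIGINAL" | jq -c '.max_tokens = 1'
# → {"model":"m","messages":[{"role":"user","content":"Hi"}],"max_tokens":1}
```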

Start vLLM Workers with NIXL

# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Start SMG

vLLM workers use grpc:// URLs and require --model-path for tokenizer loading:

smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model \
  --host 0.0.0.0 \
  --port 30000

Alternative: Mooncake Backend

Mooncake supports TCP transport (no RDMA required). Each prefill worker needs a unique bootstrap port:

# Prefill worker
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
Then start SMG, passing the Mooncake bootstrap port after the prefill URL:

smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 8998 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model

Helper Script

Use the provided script to launch workers with either backend:

# NIXL (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

Verify

# Check workers and their roles
curl http://localhost:30000/workers | jq

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
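Streaming exercises the full path — prefill handoff followed by incremental decode output — so it is a useful additional check. This assumes the standard OpenAI-compatible stream parameter:

```shell
# Stream tokens through the PD router; -N disables curl's output buffering
# so server-sent event chunks appear as they arrive.
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true
  }'
```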

SGLang vs vLLM PD at a Glance

|             | SGLang PD                                       | vLLM PD                                 |
|-------------|-------------------------------------------------|-----------------------------------------|
| Protocol    | HTTP                                            | gRPC                                    |
| Dispatch    | Both workers receive the request simultaneously | Prefill first, then decode              |
| KV Transfer | Bootstrap-based coordination                    | NIXL (RDMA) or Mooncake (TCP/RDMA)      |
| SMG flags   | --prefill http://... <bootstrap_port>           | --prefill grpc://... plus --model-path  |

Next Steps

For sizing guidelines, per-phase routing policies, Kubernetes service discovery, and monitoring, see the full PD Disaggregation Concepts page.