PD Disaggregation¶
Prefill-Decode (PD) disaggregation separates the two phases of LLM inference onto specialized workers, optimizing Time to First Token (TTFT) and Time Per Output Token (TPOT) independently.
Overview¶
Optimized TTFT¶
Dedicated prefill workers process prompts with maximum throughput.
Optimized TPOT¶
Dedicated decode workers generate tokens with minimal latency.
Independent Scaling¶
Scale prefill and decode workers based on their specific resource needs.
KV Cache Transfer¶
Automatic coordination of KV cache transfer between worker types.
Why Disaggregate?¶
Traditional LLM inference has two distinct phases with different characteristics:
| Phase | Compute Pattern | Bottleneck | Optimization |
|---|---|---|---|
| Prefill | Compute-bound, parallel | GPU compute | Batch similar-length prompts |
| Decode | Memory-bound, sequential | Memory bandwidth | Maximize batch size |
Running both phases on the same worker creates inefficiencies:
- Prefill batches are delayed waiting for decode slots
- Decode batches are small due to prefill memory pressure
- Neither phase is optimally configured
PD disaggregation solves this by dedicating workers to each phase.
Supported Runtimes¶
SMG supports PD disaggregation with two inference backends:
| Runtime | Protocol | Dispatch | KV Transfer | Best For |
|---|---|---|---|---|
| SGLang | HTTP | Parallel | Bootstrap-based coordination | Production deployments with SGLang |
| vLLM | gRPC | Sequential | NIXL or Mooncake | High-performance with RDMA/TCP networking |
vLLM KV Transfer Backends¶
vLLM supports two backends for KV cache transfer:
| Backend | Transport | Configuration | Best For |
|---|---|---|---|
| NIXL | RDMA | `VLLM_NIXL_SIDE_CHANNEL_PORT` env var | High-bandwidth RDMA networks |
| Mooncake | TCP/RDMA | `MOONCAKE_MASTER` env var or config file | Flexible deployment, TCP fallback |
How It Works¶
SGLang PD (Parallel Dispatch)¶
SGLang uses parallel dispatch with bootstrap-based coordination:
- SMG sends the request to both prefill and decode workers simultaneously
- Metadata (bootstrap host/port) enables workers to coordinate
- Prefill completes and transfers KV cache to decode
- Decode streams tokens back to client
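The parallel flow above can be sketched as follows. This is an illustrative stub, not SMG's implementation: the worker calls, metadata field names (`bootstrap_host`, `bootstrap_port`, `bootstrap_room`), and returned values are assumptions standing in for real HTTP requests.

```python
import asyncio

async def send_to_prefill(request):
    # Stand-in for an HTTP POST to the prefill worker; in reality it
    # transfers the KV cache to decode via the bootstrap "room".
    await asyncio.sleep(0)
    return {"worker": "prefill", "status": "kv_transferred"}

async def send_to_decode(request):
    # Stand-in for an HTTP POST to the decode worker; in reality it
    # waits for the KV cache, then streams generated tokens back.
    await asyncio.sleep(0)
    return {"worker": "decode", "tokens": ["Hello", "!"]}

async def dispatch(request):
    # Inject coordination metadata before fanning out.
    request = {**request,
               "bootstrap_host": "prefill1",
               "bootstrap_port": 9001,
               "bootstrap_room": 42}
    # Both workers receive the request simultaneously.
    prefill_result, decode_result = await asyncio.gather(
        send_to_prefill(request), send_to_decode(request))
    return decode_result["tokens"]

tokens = asyncio.run(dispatch({"prompt": "Hi", "max_tokens": 2}))
print(tokens)
```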
vLLM PD (Sequential Dispatch)¶
vLLM uses sequential dispatch with NIXL or Mooncake KV transfer:
- SMG sends the request to the prefill worker with `max_tokens=1`
- Prefill computes the KV cache and returns (the response is discarded)
- SMG sends original request to decode worker
- KV backend (NIXL/Mooncake) transparently transfers KV cache
- Decode streams tokens back to client
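The sequential flow above can be sketched as a simple proxy. The worker call is a stub and the request shape is illustrative; a real gateway would use the vLLM gRPC API.

```python
def call_worker(url, request):
    # Stand-in for a gRPC call to a vLLM worker.
    return {"url": url, "echo": request}

def pd_dispatch(request, prefill_url, decode_url):
    # Step 1: force the prefill worker to compute only the KV cache.
    prefill_req = {**request, "max_tokens": 1}
    call_worker(prefill_url, prefill_req)   # response is discarded
    # Step 2: the KV backend (NIXL/Mooncake) transfers the cache
    # transparently; decode sees the original request unchanged.
    return call_worker(decode_url, request)

resp = pd_dispatch({"prompt": "Hi", "max_tokens": 128},
                   "grpc://prefill-0:50051", "grpc://decode-0:50053")
print(resp["echo"]["max_tokens"])  # decode keeps the original token budget
```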
Request Flow¶
- Request arrives at SMG gateway
- Find P/D pair: Select a prefill worker and decode worker
- Prefill phase: Prefill worker processes the prompt
- KV transfer: KV cache is transferred to decode worker
- Decode phase: Decode worker generates tokens
- Stream response: Tokens are streamed back to client
Configuration¶
SGLang PD Setup¶
SGLang workers use HTTP and require a bootstrap port for coordination:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8000 \
  --decode http://decode2:8000
```
vLLM PD Setup¶
vLLM workers use gRPC and NIXL/Mooncake for KV transfer (no bootstrap port needed):
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill1:50051 \
  --prefill grpc://prefill2:50052 \
  --decode grpc://decode1:50053 \
  --decode grpc://decode2:50054 \
  --model-path /path/to/model
```
Model Path Required
The `--model-path` parameter is required for vLLM PD mode to load the tokenizer for request processing.
Parameters¶
| Parameter | Description |
|---|---|
| `--pd-disaggregation` | Enable PD disaggregated mode |
| `--prefill` | Prefill worker URL (and optional bootstrap port for SGLang) |
| `--decode` | Decode worker URLs |
| `--prefill-policy` | Routing policy for prefill workers |
| `--decode-policy` | Routing policy for decode workers |
Per-Phase Policies¶
Configure different routing policies for each phase:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 \
  --prefill http://prefill2:8000 \
  --decode http://decode1:8000 http://decode2:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```
Supported Policies¶
Both prefill and decode support these policies:
| Policy | Prefill Use Case | Decode Use Case |
|---|---|---|
| `cache_aware` | Maximize prompt cache hits | Less beneficial |
| `power_of_two` | Balance prefill load | Balance decode load |
| `round_robin` | Even distribution | Even distribution |
| `random` | Simple distribution | Simple distribution |
Recommended: `cache_aware` for prefill, `power_of_two` for decode.
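As an illustration of the recommended decode policy, here is a minimal power-of-two-choices sketch: sample two workers at random and pick the less loaded one. The load numbers are made up, and SMG's actual implementation may differ.

```python
import random

def power_of_two(loads):
    # Sample two distinct workers, keep the one with fewer
    # in-flight requests.
    a, b = random.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

random.seed(0)
loads = [12, 3, 7, 9]   # in-flight requests per decode worker
picks = [power_of_two(loads) for _ in range(1000)]
# The least-loaded worker (index 1) wins every pairing it appears in,
# while the most-loaded worker (index 0) is never selected.
print(picks.count(1), picks.count(0))
```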
Kubernetes Service Discovery¶
Use label selectors to automatically discover prefill and decode workers.
Configuration¶
```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --service-discovery-namespace inference
```
Worker Deployments¶
```yaml
# Prefill workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-prefill
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
      role: prefill
  template:
    metadata:
      labels:
        app: sglang
        role: prefill
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --prefill-only
          resources:
            limits:
              nvidia.com/gpu: 4
---
# Decode workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-decode
  namespace: inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: sglang
      role: decode
  template:
    metadata:
      labels:
        app: sglang
        role: decode
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --decode-only
          resources:
            limits:
              nvidia.com/gpu: 2
```
P/D Pair Selection¶
SMG maintains awareness of which prefill and decode workers can communicate.
Pairing Strategies¶
| Strategy | Description | Best For |
|---|---|---|
| Any-to-Any | Any prefill can send to any decode | Network with uniform latency |
| Affinity | Prefer co-located pairs | Reduce KV transfer latency |
| Load-Based | Select least loaded pair | Maximize throughput |
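A load-based strategy, the last row above, can be sketched as picking the least-loaded worker on each side. Worker names and load values here are hypothetical.

```python
def select_pair(prefill_loads, decode_loads):
    # Choose the prefill and decode workers with the fewest
    # in-flight requests; together they form the least-loaded pair.
    p = min(prefill_loads, key=prefill_loads.get)
    d = min(decode_loads, key=decode_loads.get)
    return p, d

pair = select_pair({"prefill-0": 4, "prefill-1": 1},
                   {"decode-0": 9, "decode-1": 2})
print(pair)  # ('prefill-1', 'decode-1')
```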
KV Cache Transfer¶
The KV cache is transferred between workers using the backend's native mechanism:
| Backend | Transfer Method | Coordination |
|---|---|---|
| SGLang | NCCL/Gloo over network | Bootstrap metadata (host/port/room) |
| vLLM + NIXL | RDMA | Automatic prefix matching |
| vLLM + Mooncake | TCP/RDMA | P2P handshake via master server |
SGLang: SMG injects bootstrap metadata (DisaggregatedParams) into requests, enabling workers to coordinate KV transfer through a shared "room".
vLLM: SMG uses the simple proxy pattern: it sends `max_tokens=1` to the prefill worker to trigger KV cache computation, then the KV backend (NIXL or Mooncake) automatically discovers and transfers the cache to the decode worker. No protocol changes are required.
- NIXL: Uses RDMA for high-bandwidth KV transfer with automatic prefix matching
- Mooncake: Supports both TCP and RDMA, uses P2P handshake for coordination (no external metadata server required)
Sizing Guidelines¶
Prefill Workers¶
Prefill is compute-bound:
- More GPUs per worker = faster prefill
- Fewer workers with more GPUs is often better
- Size for your longest prompts
| Prompt Length | Recommended GPUs |
|---|---|
| < 4K tokens | 1-2 GPUs |
| 4K - 16K tokens | 2-4 GPUs |
| 16K - 64K tokens | 4-8 GPUs |
| > 64K tokens | 8+ GPUs |
Decode Workers¶
Decode is memory-bandwidth-bound:
- More workers = higher throughput
- Smaller workers can batch more requests
- Size for your target concurrency
| Concurrent Users | Recommended Setup |
|---|---|
| < 50 | 2 decode workers |
| 50 - 200 | 4 decode workers |
| 200 - 500 | 8 decode workers |
| > 500 | 16+ decode workers |
Ratio Guidelines¶
| Workload Type | Prefill:Decode Ratio |
|---|---|
| Short prompts, long outputs | 1:4 |
| Balanced prompts/outputs | 1:2 |
| Long prompts, short outputs | 1:1 or 2:1 |
| RAG with large context | 2:1 |
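A quick way to apply these ratios, sketched in Python; the 10-worker budget is illustrative.

```python
def split_workers(total, prefill_ratio, decode_ratio):
    # Divide a worker budget according to a prefill:decode ratio.
    unit = total // (prefill_ratio + decode_ratio)
    return prefill_ratio * unit, decode_ratio * unit

# Short prompts, long outputs (1:4) with a budget of 10 workers:
print(split_workers(10, 1, 4))  # (2, 8)
```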
Monitoring¶
Metrics¶
| Metric | Description |
|---|---|
| `smg_pd_prefill_duration_seconds` | Prefill phase duration |
| `smg_pd_decode_duration_seconds` | Decode phase duration |
| `smg_pd_kv_transfer_duration_seconds` | KV cache transfer time |
| `smg_pd_pair_selections_total` | P/D pair selection count |
Key Performance Indicators¶
| KPI | Target | Indicates |
|---|---|---|
| TTFT | < 500ms | Prefill performance |
| TPOT | < 50ms | Decode performance |
| KV transfer time | < 100ms | Network performance |
PromQL Queries¶
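Example queries over the metrics above. These assume the duration metrics are exported as Prometheus histograms (hence the `_bucket` suffix), which may not match your build; adjust label matchers to your deployment.

```promql
# p95 prefill duration (compare against the TTFT budget)
histogram_quantile(0.95, rate(smg_pd_prefill_duration_seconds_bucket[5m]))

# p95 KV cache transfer time (compare against the < 100ms target)
histogram_quantile(0.95, rate(smg_pd_kv_transfer_duration_seconds_bucket[5m]))

# P/D pair selection rate
rate(smg_pd_pair_selections_total[5m])
```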
Troubleshooting¶
| Symptom | Cause | Solution |
|---|---|---|
| High TTFT | Prefill workers overloaded | Add prefill workers or GPUs |
| High TPOT | Decode workers overloaded | Add decode workers |
| KV transfer timeout | Network congestion | Check network bandwidth |
| Uneven load | Poor pairing | Adjust routing policy |
| Decode queue buildup | Prefill too fast | Balance P:D ratio |
Debug Logging¶
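The exact mechanism depends on your SMG build; check `smg --help` for the real flag. As a hypothetical example, if SMG follows the common Rust convention of reading `RUST_LOG`:

```bash
# Hypothetical: enable verbose gateway logging via RUST_LOG
RUST_LOG=debug smg --pd-disaggregation --prefill http://prefill-0:8000 9001 --decode http://decode-0:8000
```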
Verify Configuration¶
```bash
# Check discovered workers
curl http://smg:3001/workers | jq

# Check worker roles
curl http://smg:3001/workers | jq '.[] | {url, role}'
```
Complete Example¶
SGLang PD (Kubernetes Service Discovery)¶
```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --prefill-policy cache_aware \
  --decode-policy power_of_two \
  --cb-failure-threshold 3 \
  --health-check-interval-secs 10 \
  --host 0.0.0.0 \
  --port 8000
```
SGLang PD (Static Workers)¶
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill-0:8000 9001 \
  --prefill http://prefill-1:8000 9002 \
  --decode http://decode-0:8000 \
  --decode http://decode-1:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```
vLLM PD (Static Workers with NIXL)¶
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 \
  --prefill grpc://prefill-1:50052 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```
vLLM PD (Static Workers with Mooncake)¶
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 8998 \
  --prefill grpc://prefill-1:50052 8999 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```
Note: For Mooncake, each prefill worker needs a unique bootstrap port (8998, 8999, etc.) passed to SMG.
Workers should be started with Mooncake configuration:
```bash
# Launch workers using the helper script (auto-assigns unique bootstrap ports)
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```
Launching vLLM PD Workers¶
Option 1: NIXL Backend (RDMA)¶
```bash
# Prefill worker (kv_producer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```
Option 2: Mooncake Backend (TCP/RDMA)¶
Mooncake requires each prefill worker to have a unique bootstrap port for P2P coordination:
```bash
# Prefill worker 1 (kv_producer) - bootstrap port 8998
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Prefill worker 2 (kv_producer) - bootstrap port 8999 (must be unique!)
VLLM_MOONCAKE_BOOTSTRAP_PORT=8999 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer) - no bootstrap port needed
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50053 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
Mooncake environment variables:
| Variable | Description | Default |
|---|---|---|
| `VLLM_MOONCAKE_BOOTSTRAP_PORT` | Bootstrap port for prefill workers (must be unique per worker) | Required for prefill |
| `MOONCAKE_PROTOCOL` | Transport protocol: `tcp` or `rdma` | `tcp` |
| `MOONCAKE_DEVICE` | RDMA device name (for RDMA protocol) | `""` |
Helper Script¶
Use the provided helper script to launch workers with either backend:
```bash
# NIXL backend (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake backend
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake with custom bootstrap port
KV_BACKEND=mooncake MOONCAKE_BOOTSTRAP_PORT=9000 \
./scripts/launch-pd-workers.sh vllm /path/to/model
```