PD Disaggregation¶
Prefill-Decode (PD) disaggregation separates the two phases of LLM inference onto specialized workers, optimizing Time to First Token (TTFT) and Time Per Output Token (TPOT) independently.
Overview¶
Optimized TTFT¶
Dedicated prefill workers process prompts with maximum throughput.
Optimized TPOT¶
Dedicated decode workers generate tokens with minimal latency.
Independent Scaling¶
Scale prefill and decode workers based on their specific resource needs.
KV Cache Transfer¶
Automatic coordination of KV cache transfer between worker types.
Why Disaggregate?¶
Traditional LLM inference has two distinct phases with different characteristics:
| Phase | Compute Pattern | Bottleneck | Optimization |
|---|---|---|---|
| Prefill | Compute-bound, parallel | GPU compute | Batch similar-length prompts |
| Decode | Memory-bound, sequential | Memory bandwidth | Maximize batch size |
Running both phases on the same worker creates inefficiencies:
- Prefill batches are delayed waiting for decode slots
- Decode batches are small due to prefill memory pressure
- Neither phase is optimally configured
PD disaggregation solves this by dedicating workers to each phase.
Supported Runtimes¶
SMG supports PD disaggregation with two inference backends:
| Runtime | Protocol | Dispatch | KV Transfer | Best For |
|---|---|---|---|---|
| SGLang | HTTP | Parallel | Bootstrap-based coordination | Production deployments with SGLang |
| vLLM | gRPC | Sequential | NIXL or Mooncake | High-performance with RDMA/TCP networking |
vLLM KV Transfer Backends¶
vLLM supports two backends for KV cache transfer:
| Backend | Transport | Configuration | Best For |
|---|---|---|---|
| NIXL | RDMA | `VLLM_NIXL_SIDE_CHANNEL_PORT` env var | High-bandwidth RDMA networks |
| Mooncake | TCP/RDMA | `MOONCAKE_MASTER` env var or config file | Flexible deployment, TCP fallback |
How It Works¶
SGLang PD (Parallel Dispatch)¶
SGLang uses parallel dispatch with bootstrap-based coordination:
- SMG sends the request to both prefill and decode workers simultaneously
- Metadata (bootstrap host/port) enables workers to coordinate
- Prefill completes and transfers KV cache to decode
- Decode streams tokens back to client
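The parallel flow above can be sketched as follows. This is an illustrative stub, not SMG's implementation: the worker calls, metadata field names (`bootstrap_host`, `bootstrap_port`, `bootstrap_room`), and returned values are assumptions standing in for real HTTP requests.

```python
import asyncio

async def send_to_prefill(request):
    # Stand-in for an HTTP POST to the prefill worker; in reality it
    # transfers the KV cache to decode via the bootstrap "room".
    await asyncio.sleep(0)
    return {"worker": "prefill", "status": "kv_transferred"}

async def send_to_decode(request):
    # Stand-in for an HTTP POST to the decode worker; in reality it
    # waits for the KV cache, then streams generated tokens back.
    await asyncio.sleep(0)
    return {"worker": "decode", "tokens": ["Hello", "!"]}

async def dispatch(request):
    # Inject coordination metadata before fanning out.
    request = {**request,
               "bootstrap_host": "prefill1",
               "bootstrap_port": 9001,
               "bootstrap_room": 42}
    # Both workers receive the request simultaneously.
    prefill_result, decode_result = await asyncio.gather(
        send_to_prefill(request), send_to_decode(request))
    return decode_result["tokens"]

tokens = asyncio.run(dispatch({"prompt": "Hi", "max_tokens": 2}))
print(tokens)
```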
vLLM PD (Sequential Dispatch)¶
vLLM uses sequential dispatch with NIXL or Mooncake KV transfer:
- SMG sends the request to the prefill worker with `max_tokens=1`
- Prefill computes the KV cache and returns (the response is discarded)
- SMG sends original request to decode worker
- KV backend (NIXL/Mooncake) transparently transfers KV cache
- Decode streams tokens back to client
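The sequential flow above can be sketched as a simple proxy. The worker call is a stub and the request shape is illustrative; a real gateway would use the vLLM gRPC API.

```python
def call_worker(url, request):
    # Stand-in for a gRPC call to a vLLM worker.
    return {"url": url, "echo": request}

def pd_dispatch(request, prefill_url, decode_url):
    # Step 1: force the prefill worker to compute only the KV cache.
    prefill_req = {**request, "max_tokens": 1}
    call_worker(prefill_url, prefill_req)   # response is discarded
    # Step 2: the KV backend (NIXL/Mooncake) transfers the cache
    # transparently; decode sees the original request unchanged.
    return call_worker(decode_url, request)

resp = pd_dispatch({"prompt": "Hi", "max_tokens": 128},
                   "grpc://prefill-0:50051", "grpc://decode-0:50053")
print(resp["echo"]["max_tokens"])  # decode keeps the original token budget
```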
Request Flow¶
- Request arrives at SMG gateway
- Find P/D pair: Select a prefill worker and decode worker
- Prefill phase: Prefill worker processes the prompt
- KV transfer: KV cache is transferred to decode worker
- Decode phase: Decode worker generates tokens
- Stream response: Tokens are streamed back to client
Configuration¶
SGLang PD Setup¶
SGLang workers use HTTP and require a bootstrap port for coordination:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8000 \
  --decode http://decode2:8000
```
vLLM PD Setup¶
vLLM workers use gRPC and NIXL/Mooncake for KV transfer (no bootstrap port needed):
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill1:50051 \
  --prefill grpc://prefill2:50052 \
  --decode grpc://decode1:50053 \
  --decode grpc://decode2:50054 \
  --model-path /path/to/model
```
Model Path Required
The `--model-path` parameter is required for vLLM PD mode to load the tokenizer for request processing.
Parameters¶
| Parameter | Description |
|---|---|
| `--pd-disaggregation` | Enable PD disaggregated mode |
| `--prefill` | Prefill worker URL (and optional bootstrap port for SGLang) |
| `--decode` | Decode worker URLs |
| `--prefill-policy` | Routing policy for prefill workers |
| `--decode-policy` | Routing policy for decode workers |
Per-Phase Policies¶
Configure different routing policies for each phase:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 \
  --prefill http://prefill2:8000 \
  --decode http://decode1:8000 http://decode2:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```
Supported Policies¶
Both prefill and decode support these policies:
| Policy | Prefill Use Case | Decode Use Case |
|---|---|---|
| `cache_aware` | Maximize prompt cache hits | Less beneficial |
| `power_of_two` | Balance prefill load | Balance decode load |
| `round_robin` | Even distribution | Even distribution |
| `random` | Simple distribution | Simple distribution |
Recommended: `cache_aware` for prefill, `power_of_two` for decode.
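As an illustration of the recommended decode policy, here is a minimal power-of-two-choices sketch: sample two workers at random and pick the less loaded one. The load numbers are made up, and SMG's actual implementation may differ.

```python
import random

def power_of_two(loads):
    # Sample two distinct workers, keep the one with fewer
    # in-flight requests.
    a, b = random.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

random.seed(0)
loads = [12, 3, 7, 9]   # in-flight requests per decode worker
picks = [power_of_two(loads) for _ in range(1000)]
# The least-loaded worker (index 1) wins every pairing it appears in,
# while the most-loaded worker (index 0) is never selected.
print(picks.count(1), picks.count(0))
```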
Kubernetes Service Discovery¶
Use label selectors to automatically discover prefill and decode workers.
Configuration¶
```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --service-discovery-namespace inference
```
Worker Deployments¶
```yaml
# Prefill workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-prefill
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
      role: prefill
  template:
    metadata:
      labels:
        app: sglang
        role: prefill
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --prefill-only
          resources:
            limits:
              nvidia.com/gpu: 4
---
# Decode workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-decode
  namespace: inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: sglang
      role: decode
  template:
    metadata:
      labels:
        app: sglang
        role: decode
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --decode-only
          resources:
            limits:
              nvidia.com/gpu: 2
```
P/D Pair Selection¶
SMG maintains awareness of which prefill and decode workers can communicate.
Pairing Strategies¶
| Strategy | Description | Best For |
|---|---|---|
| Any-to-Any | Any prefill can send to any decode | Network with uniform latency |
| Affinity | Prefer co-located pairs | Reduce KV transfer latency |
| Load-Based | Select least loaded pair | Maximize throughput |
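A load-based strategy, the last row above, can be sketched as picking the least-loaded worker on each side. Worker names and load values here are hypothetical.

```python
def select_pair(prefill_loads, decode_loads):
    # Choose the prefill and decode workers with the fewest
    # in-flight requests; together they form the least-loaded pair.
    p = min(prefill_loads, key=prefill_loads.get)
    d = min(decode_loads, key=decode_loads.get)
    return p, d

pair = select_pair({"prefill-0": 4, "prefill-1": 1},
                   {"decode-0": 9, "decode-1": 2})
print(pair)  # ('prefill-1', 'decode-1')
```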
KV Cache Transfer¶
The KV cache is transferred between workers using the backend's native mechanism:
| Backend | Transfer Method | Coordination |
|---|---|---|
| SGLang | NCCL/Gloo over network | Bootstrap metadata (host/port/room) |
| vLLM + NIXL | RDMA | Automatic prefix matching |
| vLLM + Mooncake | TCP/RDMA | P2P handshake via master server |
SGLang: SMG injects bootstrap metadata (DisaggregatedParams) into requests, enabling workers to coordinate KV transfer through a shared "room".
vLLM: SMG uses the simple proxy pattern: it sends `max_tokens=1` to the prefill worker to trigger KV cache computation, then the KV backend (NIXL or Mooncake) automatically discovers and transfers the cache to the decode worker. No protocol changes are required.
- NIXL: Uses RDMA for high-bandwidth KV transfer with automatic prefix matching
- Mooncake: Supports both TCP and RDMA, uses P2P handshake for coordination (no external metadata server required)
Sizing Guidelines¶
Prefill Workers¶
Prefill is compute-bound:
- More GPUs per worker = faster prefill
- Fewer workers with more GPUs is often better
- Size for your longest prompts
| Prompt Length | Recommended GPUs |
|---|---|
| < 4K tokens | 1-2 GPUs |
| 4K - 16K tokens | 2-4 GPUs |
| 16K - 64K tokens | 4-8 GPUs |
| > 64K tokens | 8+ GPUs |
Decode Workers¶
Decode is memory-bandwidth-bound:
- More workers = higher throughput
- Smaller workers can batch more requests
- Size for your target concurrency
| Concurrent Users | Recommended Setup |
|---|---|
| < 50 | 2 decode workers |
| 50 - 200 | 4 decode workers |
| 200 - 500 | 8 decode workers |
| > 500 | 16+ decode workers |
Ratio Guidelines¶
| Workload Type | Prefill:Decode Ratio |
|---|---|
| Short prompts, long outputs | 1:4 |
| Balanced prompts/outputs | 1:2 |
| Long prompts, short outputs | 1:1 or 2:1 |
| RAG with large context | 2:1 |
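A quick way to apply these ratios, sketched in Python; the 10-worker budget is illustrative.

```python
def split_workers(total, prefill_ratio, decode_ratio):
    # Divide a worker budget according to a prefill:decode ratio.
    unit = total // (prefill_ratio + decode_ratio)
    return prefill_ratio * unit, decode_ratio * unit

# Short prompts, long outputs (1:4) with a budget of 10 workers:
print(split_workers(10, 1, 4))  # (2, 8)
```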
Monitoring¶
Metrics¶
| Metric | Description |
|---|---|
| `smg_pd_prefill_duration_seconds` | Prefill phase duration |
| `smg_pd_decode_duration_seconds` | Decode phase duration |
| `smg_pd_kv_transfer_duration_seconds` | KV cache transfer time |
| `smg_pd_pair_selections_total` | P/D pair selection count |
Key Performance Indicators¶
| KPI | Target | Indicates |
|---|---|---|
| TTFT | < 500ms | Prefill performance |
| TPOT | < 50ms | Decode performance |
| KV transfer time | < 100ms | Network performance |
PromQL Queries¶
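Example queries over the metrics above. These assume the duration metrics are exported as Prometheus histograms (hence the `_bucket` suffix), which may not match your build; adjust label matchers to your deployment.

```promql
# p95 prefill duration (compare against the TTFT budget)
histogram_quantile(0.95, rate(smg_pd_prefill_duration_seconds_bucket[5m]))

# p95 KV cache transfer time (compare against the < 100ms target)
histogram_quantile(0.95, rate(smg_pd_kv_transfer_duration_seconds_bucket[5m]))

# P/D pair selection rate
rate(smg_pd_pair_selections_total[5m])
```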
Troubleshooting¶
| Symptom | Cause | Solution |
|---|---|---|
| High TTFT | Prefill workers overloaded | Add prefill workers or GPUs |
| High TPOT | Decode workers overloaded | Add decode workers |
| KV transfer timeout | Network congestion | Check network bandwidth |
| Uneven load | Poor pairing | Adjust routing policy |
| Decode queue buildup | Prefill too fast | Balance P:D ratio |
Debug Logging¶
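The exact mechanism depends on your SMG build; check `smg --help` for the real flag. As a hypothetical example, if SMG follows the common Rust convention of reading `RUST_LOG`:

```bash
# Hypothetical: enable verbose gateway logging via RUST_LOG
RUST_LOG=debug smg --pd-disaggregation --prefill http://prefill-0:8000 9001 --decode http://decode-0:8000
```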
Verify Configuration¶
```bash
# Check discovered workers
curl http://smg:3001/workers | jq

# Check worker roles
curl http://smg:3001/workers | jq '.[] | {url, role}'
```
Complete Example¶
SGLang PD (Kubernetes Service Discovery)¶
```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --prefill-policy cache_aware \
  --decode-policy power_of_two \
  --cb-failure-threshold 3 \
  --health-check-interval-secs 10 \
  --host 0.0.0.0 \
  --port 8000
```
SGLang PD (Static Workers)¶
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill-0:8000 9001 \
  --prefill http://prefill-1:8000 9002 \
  --decode http://decode-0:8000 \
  --decode http://decode-1:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```
vLLM PD (Static Workers with NIXL)¶
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 \
  --prefill grpc://prefill-1:50052 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```
vLLM PD (Static Workers with Mooncake)¶
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 8998 \
  --prefill grpc://prefill-1:50052 8999 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```
Note: For Mooncake, each prefill worker needs a unique bootstrap port (8998, 8999, etc.) passed to SMG.
Workers should be started with Mooncake configuration:
```bash
# Launch workers using the helper script (auto-assigns unique bootstrap ports)
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```
Launching vLLM PD Workers¶
Option 1: NIXL Backend (RDMA)¶
```bash
# Prefill worker (kv_producer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```
Option 2: Mooncake Backend (TCP/RDMA)¶
Mooncake requires each prefill worker to have a unique bootstrap port for P2P coordination:
```bash
# Prefill worker 1 (kv_producer) - bootstrap port 8998
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Prefill worker 2 (kv_producer) - bootstrap port 8999 (must be unique!)
VLLM_MOONCAKE_BOOTSTRAP_PORT=8999 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer) - no bootstrap port needed
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50053 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
Mooncake environment variables:
| Variable | Description | Default |
|---|---|---|
| `VLLM_MOONCAKE_BOOTSTRAP_PORT` | Bootstrap port for prefill workers (must be unique per worker) | Required for prefill |
| `MOONCAKE_PROTOCOL` | Transport protocol: `tcp` or `rdma` | `tcp` |
| `MOONCAKE_DEVICE` | RDMA device name (for RDMA protocol) | `""` |
Helper Script¶
Use the provided helper script to launch workers with either backend:
```bash
# NIXL backend (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake backend
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake with custom bootstrap port
KV_BACKEND=mooncake MOONCAKE_BOOTSTRAP_PORT=9000 \
./scripts/launch-pd-workers.sh vllm /path/to/model
```