Skip to content

PD Disaggregation

Prefill-Decode (PD) disaggregation separates the two phases of LLM inference onto specialized workers, optimizing Time to First Token (TTFT) and Time Per Output Token (TPOT) independently.


Overview

Optimized TTFT

Dedicated prefill workers process prompts with maximum throughput.

Optimized TPOT

Dedicated decode workers generate tokens with minimal latency.

Independent Scaling

Scale prefill and decode workers based on their specific resource needs.

KV Cache Transfer

Automatic coordination of KV cache transfer between worker types.


Why Disaggregate?

Traditional LLM inference has two distinct phases with different characteristics:

Phase Compute Pattern Bottleneck Optimization
Prefill Compute-bound, parallel GPU compute Batch similar-length prompts
Decode Memory-bound, sequential Memory bandwidth Maximize batch size

Running both phases on the same worker creates inefficiencies:

  • Prefill batches are delayed waiting for decode slots
  • Decode batches are small due to prefill memory pressure
  • Neither phase is optimally configured

PD disaggregation solves this by dedicating workers to each phase.


Supported Runtimes

SMG supports PD disaggregation with two inference backends:

Runtime Protocol Dispatch KV Transfer Best For
SGLang HTTP Parallel Bootstrap-based coordination Production deployments with SGLang
vLLM gRPC Sequential NIXL or Mooncake High-performance with RDMA/TCP networking

vLLM KV Transfer Backends

vLLM supports two backends for KV cache transfer:

Backend Transport Configuration Best For
NIXL RDMA VLLM_NIXL_SIDE_CHANNEL_PORT env var High-bandwidth RDMA networks
Mooncake TCP/RDMA MOONCAKE_MASTER env var or config file Flexible deployment, TCP fallback

How It Works

SGLang PD (Parallel Dispatch)

SGLang PD Parallel Dispatch Sequence

SGLang uses parallel dispatch with bootstrap-based coordination:

  1. SMG sends the request to both prefill and decode workers simultaneously
  2. Metadata (bootstrap host/port) enables workers to coordinate
  3. Prefill completes and transfers KV cache to decode
  4. Decode streams tokens back to client

vLLM PD (Sequential Dispatch)

vLLM PD Sequential Dispatch Sequence

vLLM uses sequential dispatch with NIXL or Mooncake KV transfer:

  1. SMG sends request to prefill worker with max_tokens=1
  2. Prefill computes KV cache and returns (response discarded)
  3. SMG sends original request to decode worker
  4. KV backend (NIXL/Mooncake) transparently transfers KV cache
  5. Decode streams tokens back to client

Request Flow

  1. Request arrives at SMG gateway
  2. Find P/D pair: Select a prefill worker and decode worker
  3. Prefill phase: Prefill worker processes the prompt
  4. KV transfer: KV cache is transferred to decode worker
  5. Decode phase: Decode worker generates tokens
  6. Stream response: Tokens are streamed back to client

Configuration

SGLang PD Setup

SGLang workers use HTTP and require a bootstrap port for coordination:

smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8000 \
  --decode http://decode2:8000

vLLM PD Setup

vLLM workers use gRPC and NIXL/Mooncake for KV transfer (no bootstrap port needed):

smg \
  --pd-disaggregation \
  --prefill grpc://prefill1:50051 \
  --prefill grpc://prefill2:50052 \
  --decode grpc://decode1:50053 \
  --decode grpc://decode2:50054 \
  --model-path /path/to/model

Model Path Required

The --model-path parameter is required for vLLM PD mode to load the tokenizer for request processing.

Parameters

Parameter Description
--pd-disaggregation Enable PD disaggregated mode
--prefill Prefill worker URL (and optional bootstrap port for SGLang)
--decode Decode worker URLs
--prefill-policy Routing policy for prefill workers
--decode-policy Routing policy for decode workers

Per-Phase Policies

Configure different routing policies for each phase:

smg \
  --pd-disaggregation \
  --worker-urls http://prefill1:8000 http://prefill2:8000 \
  --decode http://decode1:8000 http://decode2:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two

Supported Policies

Both prefill and decode support these policies:

Policy Prefill Use Case Decode Use Case
cache_aware Maximize prompt cache hits Less beneficial
power_of_two Balance prefill load Balance decode load
round_robin Even distribution Even distribution
random Simple distribution Simple distribution

Recommended: cache_aware for prefill, power_of_two for decode.


Kubernetes Service Discovery

Use label selectors to automatically discover prefill and decode workers.

Configuration

smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --service-discovery-namespace inference

Worker Deployments

# Prefill workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-prefill
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
      role: prefill
  template:
    metadata:
      labels:
        app: sglang
        role: prefill
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --prefill-only
          resources:
            limits:
              nvidia.com/gpu: 4

---
# Decode workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-decode
  namespace: inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: sglang
      role: decode
  template:
    metadata:
      labels:
        app: sglang
        role: decode
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --decode-only
          resources:
            limits:
              nvidia.com/gpu: 2

P/D Pair Selection

SMG maintains awareness of which prefill and decode workers can communicate.

Pairing Strategies

Strategy Description Best For
Any-to-Any Any prefill can send to any decode Network with uniform latency
Affinity Prefer co-located pairs Reduce KV transfer latency
Load-Based Select least loaded pair Maximize throughput

KV Cache Transfer

The KV cache is transferred between workers using the backend's native mechanism:

Backend Transfer Method Coordination
SGLang NCCL/Gloo over network Bootstrap metadata (host/port/room)
vLLM + NIXL RDMA Automatic prefix matching
vLLM + Mooncake TCP/RDMA P2P handshake via master server

SGLang: SMG injects bootstrap metadata (DisaggregatedParams) into requests, enabling workers to coordinate KV transfer through a shared "room".

vLLM: SMG uses the simple proxy pattern—sends max_tokens=1 to prefill to trigger KV cache computation, then the KV backend (NIXL or Mooncake) automatically discovers and transfers the cache to decode. No protocol changes required.

  • NIXL: Uses RDMA for high-bandwidth KV transfer with automatic prefix matching
  • Mooncake: Supports both TCP and RDMA, uses P2P handshake for coordination (no external metadata server required)

Sizing Guidelines

Prefill Workers

Prefill is compute-bound:

  • More GPUs per worker = faster prefill
  • Fewer workers with more GPUs is often better
  • Size for your longest prompts
Prompt Length Recommended GPUs
< 4K tokens 1-2 GPUs
4K - 16K tokens 2-4 GPUs
16K - 64K tokens 4-8 GPUs
> 64K tokens 8+ GPUs

Decode Workers

Decode is memory-bandwidth-bound:

  • More workers = higher throughput
  • Smaller workers can batch more requests
  • Size for your target concurrency
Concurrent Users Recommended Setup
< 50 2 decode workers
50 - 200 4 decode workers
200 - 500 8 decode workers
> 500 16+ decode workers

Ratio Guidelines

Workload Type Prefill:Decode Ratio
Short prompts, long outputs 1:4
Balanced prompts/outputs 1:2
Long prompts, short outputs 1:1 or 2:1
RAG with large context 2:1

Monitoring

Metrics

Metric Description
smg_pd_prefill_duration_seconds Prefill phase duration
smg_pd_decode_duration_seconds Decode phase duration
smg_pd_kv_transfer_duration_seconds KV cache transfer time
smg_pd_pair_selections_total P/D pair selection count

Key Performance Indicators

KPI Target Indicates
TTFT < 500ms Prefill performance
TPOT < 50ms Decode performance
KV transfer time < 100ms Network performance

PromQL Queries

Phase Duration

# Average prefill time
rate(smg_pd_prefill_duration_seconds_sum[5m]) /
rate(smg_pd_prefill_duration_seconds_count[5m])

# Average decode time per token
rate(smg_pd_decode_duration_seconds_sum[5m]) /
rate(smg_pd_decode_duration_seconds_count[5m])

Worker Utilization

# Prefill worker utilization
smg_worker_requests_active{role="prefill"} /
smg_worker_max_concurrent{role="prefill"}

# Decode worker utilization
smg_worker_requests_active{role="decode"} /
smg_worker_max_concurrent{role="decode"}

Troubleshooting

Symptom Cause Solution
High TTFT Prefill workers overloaded Add prefill workers or GPUs
High TPOT Decode workers overloaded Add decode workers
KV transfer timeout Network congestion Check network bandwidth
Uneven load Poor pairing Adjust routing policy
Decode queue buildup Prefill too fast Balance P:D ratio

Debug Logging

RUST_LOG=smg::pd=debug smg --pd-disaggregation ...

Verify Configuration

# Check discovered workers
curl http://smg:3001/workers | jq

# Check worker roles
curl http://smg:3001/workers | jq '.[] | {url, role}'

Complete Example

SGLang PD (Kubernetes Service Discovery)

smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector "app=sglang,role=prefill" \
  --decode-selector "app=sglang,role=decode" \
  --prefill-policy cache_aware \
  --decode-policy power_of_two \
  --cb-failure-threshold 3 \
  --health-check-interval-secs 10 \
  --host 0.0.0.0 \
  --port 8000

SGLang PD (Static Workers)

smg \
  --pd-disaggregation \
  --prefill http://prefill-0:8000 9001 \
  --prefill http://prefill-1:8000 9002 \
  --decode http://decode-0:8000 \
  --decode http://decode-1:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two

vLLM PD (Static Workers with NIXL)

smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 \
  --prefill grpc://prefill-1:50052 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin

vLLM PD (Static Workers with Mooncake)

smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 8998 \
  --prefill grpc://prefill-1:50052 8999 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin

Note: For Mooncake, each prefill worker needs a unique bootstrap port (8998, 8999, etc.) passed to SMG.

Workers should be started with Mooncake configuration:

# Launch workers using helper script (auto-assigns unique bootstrap ports)
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

Launching vLLM PD Workers

Option 1: NIXL Backend (RDMA)

# Prefill worker (kv_producer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Option 2: Mooncake Backend (TCP/RDMA)

Mooncake requires each prefill worker to have a unique bootstrap port for P2P coordination:

# Prefill worker 1 (kv_producer) - bootstrap port 8998
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Prefill worker 2 (kv_producer) - bootstrap port 8999 (must be unique!)
VLLM_MOONCAKE_BOOTSTRAP_PORT=8999 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer) - no bootstrap port needed
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50053 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Mooncake environment variables:

Variable Description Default
VLLM_MOONCAKE_BOOTSTRAP_PORT Bootstrap port for prefill workers (must be unique per worker) Required for prefill
MOONCAKE_PROTOCOL Transport protocol: tcp or rdma tcp
MOONCAKE_DEVICE RDMA device name (for RDMA protocol) ""

Helper Script

Use the provided helper script to launch workers with either backend:

# NIXL backend (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake backend
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake with custom bootstrap port
KV_BACKEND=mooncake MOONCAKE_BOOTSTRAP_PORT=9000 \
  ./scripts/launch-pd-workers.sh vllm /path/to/model

What's Next?

Service Discovery

Automatic worker discovery in Kubernetes.

Service Discovery →

Cache-Aware Routing

Optimize prefill with cache-aware routing.

Cache-Aware Routing →

Load Balancing

All available routing policies.

Load Balancing →