Getting Started¶
Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.
Install¶
Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.
Installing the package provides two commands:
- smg serve (Python orchestration command for workers + gateway)
- smg launch (router launch path in the Rust CLI)
Docker images are also available, with multi-architecture support for x86_64 and ARM64. Available tags: latest (stable), v0.3.x (specific version), main (development).
Step 1: Start SMG¶
Choose one of these startup paths.
Option A: All-in-one with smg serve¶
smg serve launches backend worker process(es) and then starts SMG with generated worker URLs.
This starts --data-parallel-size worker replicas, waits for readiness, then starts the gateway.
| Option | Default | Description |
|---|---|---|
| --backend | sglang | Inference backend: sglang, vllm, or trtllm |
| --connection-mode | grpc | Worker connection mode: grpc or http (vLLM and TensorRT-LLM only support gRPC) |
| --data-parallel-size | 1 | Number of worker replicas (one per GPU) |
| --worker-base-port | 31000 | Base port for worker processes |
| --host | 127.0.0.1 | Router host |
| --port | 8080 | Router port |
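The resulting worker port layout can be previewed with plain shell arithmetic. This is a sketch under one assumption not stated in the options table: that replicas occupy consecutive ports counting up from --worker-base-port, one per replica.

```shell
# Preview the gRPC worker URLs for --data-parallel-size 4 with the
# default --worker-base-port 31000.
# Assumption: replicas take consecutive ports from the base port.
base_port=31000
dp_size=4
for i in $(seq 0 $((dp_size - 1))); do
  echo "grpc://127.0.0.1:$((base_port + i))"
done
```

This is the shape of worker URL you would pass to smg launch yourself in Option B below.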
Option B: Launch gateway only with smg launch¶
Use this when workers are already running or managed by another platform.
For gRPC workers, pass --model-path so the gateway can load the tokenizer (tokenization is handled gateway-side in gRPC mode):
smg launch \
--worker-urls grpc://localhost:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
For HTTP workers:
smg launch \
--worker-urls http://localhost:8000 \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
Step 2: Verify Core Endpoints¶
Health:
curl http://localhost:30000/health
OpenAI-compatible chat completions:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
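When scripting, the same request body can be built with python3 instead of an inline heredoc. This is a minimal sketch; sending the payload still requires a gateway listening on localhost:30000.

```shell
# Build the chat-completions request body programmatically.
payload=$(python3 -c '
import json
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
print(json.dumps(body))
')
echo "$payload"
# Then send it:
#   curl http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$payload"
```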
Responses API:
curl http://localhost:30000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"input": "Say hello in one sentence."
}'
Step 3: Choose Your Setup Track¶
SMG's setup guides are grouped into four tracks: core deployment, operations and security, reliability and data, and advanced features. The Navigate by Category section below links each guide.
Worker Startup Recipes (Standalone)¶
Use these when workers are not started via smg serve.
PD Disaggregation Workers¶
For prefill-decode disaggregation, start separate prefill and decode workers. For SGLang gRPC workers:
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--grpc-mode \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Start SMG with bootstrap ports for SGLang coordination (see PD Disaggregation for the exact launch flags).
For SGLang HTTP workers, the pattern is the same without --grpc-mode:

# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001 \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Again, start SMG with bootstrap ports for SGLang coordination (see PD Disaggregation for the exact launch flags).
vLLM uses NIXL for KV cache transfer between prefill and decode workers:
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
Start SMG without bootstrap ports (NIXL handles the KV transfer; see PD Disaggregation for the exact launch flags).
See PD Disaggregation for full details including Mooncake backend and scaling.
Send a Request¶
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50
}'
Expected response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 8,
"total_tokens": 22
}
}
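When scripting against the gateway, the reply text lives at choices[0].message.content, and the usage counts should sum (prompt_tokens + completion_tokens = total_tokens). A minimal check over a response body of the shape above, using python3 rather than jq:

```shell
# Parse a chat.completion response body and sanity-check the usage counts.
response='{"choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris."},"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"completion_tokens":8,"total_tokens":22}}'
echo "$response" | python3 -c '
import json, sys
body = json.load(sys.stdin)
# Print the assistant reply.
print(body["choices"][0]["message"]["content"])
# Token accounting should be internally consistent.
usage = body["usage"]
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
'
```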
Verify Health¶
# Gateway health
curl http://localhost:30000/health
# Worker status
curl http://localhost:30000/workers
Deploy with Docker¶
For local deployment, run SMG in a container and point it at your worker:
docker pull lightseekorg/smg:latest
docker run -d \
--name smg \
-p 30000:30000 \
-p 29000:29000 \
lightseekorg/smg:latest \
--worker-urls http://host.docker.internal:8000 \
--policy cache_aware \
--prometheus-port 29000
Verify:
curl http://localhost:30000/health
Deploy to Kubernetes (Quick Start)¶
Run SMG in-cluster and use service discovery to pick up worker pods automatically.
Start SMG with service discovery:
smg \
--service-discovery \
--selector app=sglang-worker \
--service-discovery-namespace inference \
--service-discovery-port 8000 \
--policy cache_aware
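Service discovery matches worker pods by label, so the pods must carry the label the selector names and expose the discovery port. A sketch of the relevant pod metadata (the pod name and image are illustrative; the label and container port must match --selector and --service-discovery-port):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-worker-0        # illustrative name
  namespace: inference         # matches --service-discovery-namespace
  labels:
    app: sglang-worker         # matches --selector app=sglang-worker
spec:
  containers:
    - name: worker
      image: sglang-worker:latest   # illustrative image
      ports:
        - containerPort: 8000       # matches --service-discovery-port
```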
Required RBAC permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Verify that discovered workers were registered:
curl http://localhost:30000/workers
Navigate by Category¶
Core Setup¶
- Multiple Workers — connect local or external worker endpoints
- gRPC Workers — gateway-side tokenization, parsing, and tool handling
- PD Disaggregation — split prefill and decode paths
- Service Discovery — Kubernetes pod-based worker registration
Operations¶
- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints
Reliability and Data¶
- Reliability Controls — concurrency limits, retries, and circuit breakers
- Data Connections — history backend setup for Postgres, Redis, and Oracle
- Tokenization and Parsing APIs — tokenize, detokenize, and parser endpoints
Advanced Features¶
- Load Balancing — policy selection and tuning
- Tokenizer Caching — L0/L1 cache setup for gRPC mode
- MCP in Responses API — configure and execute MCP tools through /v1/responses
Troubleshooting¶
Gateway starts but can't connect to worker
Symptoms: Gateway logs show connection errors.
Solutions:
- Verify the worker is running: curl http://localhost:8000/health
- Check network connectivity between gateway and worker
- If using Docker, ensure proper network configuration (--network host or a Docker network)
Request times out
Symptoms: Requests hang or return 504 errors.
Solutions:
- Check worker health: curl http://localhost:30000/workers
- Increase the timeout: --request-timeout-secs 120
- Check worker logs for errors
Model not found error
Symptoms: the response contains a model not found error.
Solutions:
- The model field in the request must match the model loaded on the worker
- Check available models: curl http://localhost:30000/v1/models