Getting Started¶
Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.
Install¶
Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.
Installing the package provides two commands:
- smg serve (Python orchestration command for workers + gateway)
- smg launch (router launch path in the Rust CLI)
Docker images are also available, with multi-architecture support for x86_64 and ARM64. Available tags: latest (stable), v0.3.x (specific version), main (development).
Step 1: Start SMG¶
Choose one of these startup paths.
Option A: All-in-one with smg serve¶
smg serve launches backend worker process(es) and then starts SMG with generated worker URLs.
This starts --data-parallel-size worker replicas, waits for readiness, then starts the gateway.
| Option | Default | Description |
|---|---|---|
| --backend | sglang | Inference backend: sglang, vllm, or trtllm |
| --connection-mode | grpc | Worker connection mode: grpc or http (vLLM and TensorRT-LLM only support gRPC) |
| --data-parallel-size | 1 | Number of worker replicas (one per GPU) |
| --worker-base-port | 31000 | Base port for worker processes |
| --host | 127.0.0.1 | Router host |
| --port | 8080 | Router port |
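The resulting worker port layout can be previewed with plain shell arithmetic. This is a sketch under one assumption not stated in the options table: that replicas occupy consecutive ports counting up from --worker-base-port, one per replica.

```shell
# Preview the gRPC worker URLs for --data-parallel-size 4 with the
# default --worker-base-port 31000.
# Assumption: replicas take consecutive ports from the base port.
base_port=31000
dp_size=4
for i in $(seq 0 $((dp_size - 1))); do
  echo "grpc://127.0.0.1:$((base_port + i))"
done
```

This is the shape of worker URL you would pass to smg launch yourself in Option B below.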
Option B: Launch gateway only with smg launch¶
Use this when workers are already running or managed by another platform.
For gRPC workers, pass --model-path so the gateway can load the tokenizer (tokenization is handled gateway-side in gRPC mode):
smg launch \
--worker-urls grpc://localhost:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
For HTTP workers:
smg launch \
--worker-urls http://localhost:8000 \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
Step 2: Verify Core Endpoints¶
Health:
curl http://localhost:30000/health
OpenAI-compatible chat completions:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
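When scripting, the same request body can be built with python3 instead of an inline heredoc. This is a minimal sketch; sending the payload still requires a gateway listening on localhost:30000.

```shell
# Build the chat-completions request body programmatically.
payload=$(python3 -c '
import json
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
print(json.dumps(body))
')
echo "$payload"
# Then send it:
#   curl http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$payload"
```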
Responses API:
curl http://localhost:30000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"input": "Say hello in one sentence."
}'
Step 3: Choose Your Setup Track¶
SMG's setup guides are grouped into four tracks: core deployment, operations and security, reliability and data, and advanced features. The Navigate by Category section below links each guide.
Worker Startup Recipes (Standalone)¶
Use these when workers are not started via smg serve.
PD Disaggregation Workers¶
For prefill-decode disaggregation, start separate prefill and decode workers. For SGLang gRPC workers:
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--grpc-mode \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Start SMG with bootstrap ports for SGLang coordination (see PD Disaggregation for the exact launch flags).
For SGLang HTTP workers, the pattern is the same without --grpc-mode:

# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001 \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Again, start SMG with bootstrap ports for SGLang coordination (see PD Disaggregation for the exact launch flags).
vLLM uses NIXL for KV cache transfer between prefill and decode workers:
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
Start SMG without bootstrap ports (NIXL handles the KV transfer; see PD Disaggregation for the exact launch flags).
See PD Disaggregation for full details including Mooncake backend and scaling.
Send a Request¶
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50
}'
Expected response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 8,
"total_tokens": 22
}
}
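When scripting against the gateway, the reply text lives at choices[0].message.content, and the usage counts should sum (prompt_tokens + completion_tokens = total_tokens). A minimal check over a response body of the shape above, using python3 rather than jq:

```shell
# Parse a chat.completion response body and sanity-check the usage counts.
response='{"choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris."},"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"completion_tokens":8,"total_tokens":22}}'
echo "$response" | python3 -c '
import json, sys
body = json.load(sys.stdin)
# Print the assistant reply.
print(body["choices"][0]["message"]["content"])
# Token accounting should be internally consistent.
usage = body["usage"]
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
'
```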
Verify Health¶
# Gateway health
curl http://localhost:30000/health
# Worker status
curl http://localhost:30000/workers
Deploy with Docker¶
For local deployment, run SMG in a container and point it at your worker:
docker pull lightseekorg/smg:latest
docker run -d \
--name smg \
-p 30000:30000 \
-p 29000:29000 \
lightseekorg/smg:latest \
--worker-urls http://host.docker.internal:8000 \
--policy cache_aware \
--prometheus-port 29000
Verify:
curl http://localhost:30000/health
Deploy to Kubernetes (Quick Start)¶
Run SMG in-cluster and use service discovery to pick up worker pods automatically.
Start SMG with service discovery:
smg \
--service-discovery \
--selector app=sglang-worker \
--service-discovery-namespace inference \
--service-discovery-port 8000 \
--policy cache_aware
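Service discovery matches worker pods by label, so the pods must carry the label the selector names and expose the discovery port. A sketch of the relevant pod metadata (the pod name and image are illustrative; the label and container port must match --selector and --service-discovery-port):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-worker-0        # illustrative name
  namespace: inference         # matches --service-discovery-namespace
  labels:
    app: sglang-worker         # matches --selector app=sglang-worker
spec:
  containers:
    - name: worker
      image: sglang-worker:latest   # illustrative image
      ports:
        - containerPort: 8000       # matches --service-discovery-port
```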
Required RBAC permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Verify that discovered workers were registered:
curl http://localhost:30000/workers
Navigate by Category¶
Core Setup¶
- Multiple Workers — connect local or external worker endpoints
- gRPC Workers — gateway-side tokenization, parsing, and tool handling
- PD Disaggregation — split prefill and decode paths
- Service Discovery — Kubernetes pod-based worker registration
Operations¶
- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints
Reliability and Data¶
- Reliability Controls — concurrency limits, retries, and circuit breakers
- Data Connections — history backend setup for Postgres, Redis, and Oracle
- Tokenization and Parsing APIs — tokenize, detokenize, and parser endpoints
Advanced Features¶
- Load Balancing — policy selection and tuning
- Tokenizer Caching — L0/L1 cache setup for gRPC mode
- MCP in Responses API — configure and execute MCP tools through /v1/responses
Troubleshooting¶
Gateway starts but can't connect to worker
Symptoms: Gateway logs show connection errors.
Solutions:
- Verify the worker is running: curl http://localhost:8000/health
- Check network connectivity between gateway and worker
- If using Docker, ensure proper network configuration (--network host or a Docker network)
Request times out
Symptoms: Requests hang or return 504 errors.
Solutions:
- Check worker health: curl http://localhost:30000/workers
- Increase the timeout: --request-timeout-secs 120
- Check worker logs for errors
Model not found error
Symptoms: the response contains a model not found error.
Solutions:
- The model field in the request must match the model loaded on the worker
- Check available models: curl http://localhost:30000/v1/models