Multiple Workers¶
SMG can route across many workers simultaneously — local inference servers, remote cloud APIs, or a mix of both. This guide covers how to add workers and balance traffic across them.
Before you begin¶
- Completed the Getting Started guide
Supported Worker Types¶
SMG connects to workers over HTTP or gRPC, and supports both local inference servers and remote API providers:
| Worker Type | Protocol | Example URL |
|---|---|---|
| SGLang | HTTP / gRPC | http://worker:8000 or grpc://worker:50051 |
| vLLM | gRPC | grpc://worker:50051 |
| TensorRT-LLM | gRPC | grpc://worker:50051 |
| OpenAI (GPT) | HTTP | https://api.openai.com |
| Anthropic (Claude) | HTTP | https://api.anthropic.com |
| xAI (Grok) | HTTP | https://api.x.ai |
| Google (Gemini) | HTTP | https://generativelanguage.googleapis.com |
| Any OpenAI-compatible API | HTTP | https://your-provider.com |
Static Workers via CLI¶
Pass multiple URLs to --worker-urls:
```shell
smg \
  --worker-urls http://worker1:8000 http://worker2:8000 http://worker3:8000 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```
For gRPC workers, use the grpc:// scheme and provide --model-path so the gateway can load the tokenizer:
```shell
smg \
  --worker-urls grpc://worker1:50051 grpc://worker2:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin
```
See gRPC Workers for details on what gRPC mode enables.
Cloud API Workers¶
Route to cloud providers by setting --backend openai and passing the provider URL. API keys are read from environment variables. SMG auto-detects the provider (OpenAI, Anthropic, xAI, Gemini) from the model name and applies the correct API transformations.
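As a sketch, here is what routing to a cloud provider might look like. The exact environment variable each provider reads (e.g. `OPENAI_API_KEY` below) is an assumption; check your provider's documentation:

```shell
# Sketch: route to OpenAI's API as a cloud worker.
# OPENAI_API_KEY is the assumed variable name for the OpenAI provider.
export OPENAI_API_KEY=sk-...
smg \
  --backend openai \
  --worker-urls https://api.openai.com \
  --port 30000
```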
Dynamic Workers with IGW Mode¶
In Inference Gateway (IGW) mode, SMG starts with no workers and you add or remove them at runtime via the REST API:
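A minimal startup sketch. The flag name that enables IGW mode is an assumption here; verify it against `smg --help` for your version:

```shell
# --enable-igw is an assumed flag name; confirm with `smg --help`.
# No --worker-urls are passed: workers are registered at runtime instead.
smg --enable-igw --host 0.0.0.0 --port 30000
```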
Add a worker¶
```shell
curl -X POST http://localhost:30000/workers \
  -H "Content-Type: application/json" \
  -d '{"url": "http://worker1:8000"}'
```
Response:
```json
{
  "status": "accepted",
  "worker_id": "a1b2c3d4",
  "url": "http://worker1:8000",
  "location": "/workers/a1b2c3d4",
  "message": "Worker addition queued for background processing"
}
```
List workers¶
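The same `GET /workers` call used in the Verify section lists every worker currently registered:

```shell
curl http://localhost:30000/workers
```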
Remove a worker¶
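The add-worker response returns a `location` path of the form `/workers/{worker_id}`. A natural sketch for removal, assuming the endpoint accepts DELETE on that path, is:

```shell
# Assumes DELETE is supported on the worker's location path;
# the worker_id comes from the add-worker response.
curl -X DELETE http://localhost:30000/workers/a1b2c3d4
```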
Worker configuration options¶
The POST /workers endpoint accepts additional fields:
```json
{
  "url": "http://worker:8000",
  "api_key": "optional-key",
  "model_id": "llama-70b",
  "runtime": "sglang",
  "worker_type": "regular",
  "priority": 50,
  "cost": 1.0,
  "disable_health_check": false,
  "labels": {"region": "us-east"}
}
```
| Field | Default | Description |
|---|---|---|
| url | (required) | Worker URL (http://, grpc://, or https:// for cloud) |
| api_key | — | API key for authenticated workers |
| model_id | auto-discovered | Model identifier |
| runtime | sglang | Runtime: sglang, vllm, trtllm, or external |
| worker_type | regular | Type: regular, prefill, or decode |
| priority | 50 | Routing priority (0–100, higher = preferred) |
| cost | 1.0 | Cost multiplier for cost-aware routing |
| disable_health_check | false | Skip health checking for this worker |
| labels | {} | Arbitrary metadata |
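Putting a few of these fields together, a sketch of registering a gRPC worker with routing metadata (the URL and field values below are illustrative, not required settings):

```shell
# Sketch: add a gRPC worker with a higher routing priority and a region label.
curl -X POST http://localhost:30000/workers \
  -H "Content-Type: application/json" \
  -d '{
    "url": "grpc://worker4:50051",
    "model_id": "llama-70b",
    "runtime": "sglang",
    "priority": 80,
    "labels": {"region": "us-east"}
  }'
```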
Verify¶
```shell
# List connected workers
curl http://localhost:30000/workers

# Check health
curl http://localhost:30000/health

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Next Steps¶
- Monitoring — Track request rates, latency, and worker health
- gRPC Workers — Enable tokenization, chat templates, and tool parsing at the gateway
- PD Disaggregation — Separate prefill and decode onto specialized workers