Admin API Reference¶
SMG provides administrative endpoints for managing tokenizers, workers, cache, and cluster operations.
Related Documentation
For health checks, worker status, and monitoring endpoints, see Gateway Extensions.
Tokenizer Management¶
Register, inspect, and remove the tokenizers the gateway uses for tokenization.
Authentication Required
These endpoints require admin authentication via API key or control plane credentials.
Add Tokenizer¶
Adds a new tokenizer from a local path or HuggingFace model ID.
Request Body:
{
"name": "llama3-tokenizer",
"source": "meta-llama/Meta-Llama-3-8B",
"chat_template_path": "/path/to/template.jinja"
}
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique tokenizer identifier |
| source | string | Yes | HuggingFace model ID or local path |
| chat_template_path | string | No | Path to custom Jinja2 chat template |
Response: 202 Accepted
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Tokenizer 'llama3-tokenizer' registration job submitted. Loading from: meta-llama/Meta-Llama-3-8B"
}
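As an illustration, the request body can be assembled so that the optional field is sent only when it is set. This is a minimal sketch: the helper name `build_add_tokenizer_payload` is hypothetical and not part of SMG, and sending the request is left to whichever HTTP client you use.

```python
import json
from typing import Optional


def build_add_tokenizer_payload(
    name: str, source: str, chat_template_path: Optional[str] = None
) -> dict:
    """Build the Add Tokenizer request body (hypothetical helper)."""
    if not name or not source:
        raise ValueError("'name' and 'source' are required")
    payload = {"name": name, "source": source}
    # chat_template_path is optional, so include it only when provided.
    if chat_template_path is not None:
        payload["chat_template_path"] = chat_template_path
    return payload


# Serialize the body for the Add Tokenizer request.
body = json.dumps(
    build_add_tokenizer_payload("llama3-tokenizer", "meta-llama/Meta-Llama-3-8B")
)
```

Because the endpoint answers 202 Accepted, a successful call only means the job was queued; the tokenizer is usable once its status reaches completed.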
List Tokenizers¶
Returns all registered tokenizers.
Response: 200 OK
{
"tokenizers": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "llama3-tokenizer",
"source": "meta-llama/Meta-Llama-3-8B",
"vocab_size": 128256
}
]
}
Get Tokenizer¶
Returns details for a specific tokenizer.
Response: 200 OK
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "llama3-tokenizer",
"source": "meta-llama/Meta-Llama-3-8B",
"vocab_size": 128256
}
Response: 404 Not Found
Get Tokenizer Status¶
Returns the loading status of a tokenizer.
Response: 200 OK
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"message": "Tokenizer 'llama3-tokenizer' is loaded and ready",
"vocab_size": 128256
}
| Status | Description |
|---|---|
| pending | Tokenizer loading queued |
| processing | Tokenizer currently loading |
| completed | Tokenizer ready for use |
| failed | Loading failed (see message) |
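Since registration is asynchronous, a client typically polls the status until it reaches a terminal value. The sketch below shows one way to do that. It assumes the caller supplies `fetch_status`, a hypothetical callable that performs the status request and returns the decoded JSON body; the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_tokenizer(
    fetch_status: Callable[[], dict],
    timeout_secs: float = 60.0,
    interval_secs: float = 1.0,
) -> dict:
    """Poll until the status is terminal ('completed' or 'failed')."""
    deadline = time.monotonic() + timeout_secs
    while True:
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            # The 'message' field explains why loading failed.
            raise RuntimeError(status.get("message", "tokenizer loading failed"))
        # 'pending' and 'processing' mean the job is still in flight.
        if time.monotonic() >= deadline:
            raise TimeoutError("tokenizer did not finish loading in time")
        time.sleep(interval_secs)
```

Injecting the fetch function keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.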
Remove Tokenizer¶
Removes a tokenizer.
Response: 200 OK
Worker Management¶
Manage backend inference workers.
Tip
For listing workers and viewing metrics, see Gateway Extensions.
Create Worker¶
Registers a new backend worker.
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Worker base URL |
| worker_type | string | No | regular, prefill, or decode (default: regular) |
| connection_mode | string | No | http or grpc (default: http) |
| runtime_type | string | No | sglang, vllm, trtllm, mlx, external, or unspecified (default: unspecified, which triggers auto-detection) |
| models | array | No | Model cards served by this worker (empty = wildcard) |
| api_key | string | No | API key for worker authentication |
| priority | integer | No | Routing priority (higher = preferred, default: 50) |
Response: 202 Accepted
{
"status": "accepted",
"worker_id": "worker-abc123",
"url": "http://gpu1:8000",
"location": "/workers/worker-abc123",
"message": "Worker addition queued for background processing"
}
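To catch typos before a request is queued, the enumerated fields can be checked on the client side. The following is a minimal sketch built from the field table above; `build_worker_spec` is a hypothetical helper, and unset fields are omitted so the gateway applies its own defaults.

```python
from typing import Any

# Allowed values for the enumerated fields, taken from the field table.
ALLOWED = {
    "worker_type": {"regular", "prefill", "decode"},
    "connection_mode": {"http", "grpc"},
    "runtime_type": {"sglang", "vllm", "trtllm", "mlx", "external", "unspecified"},
}
KNOWN_FIELDS = set(ALLOWED) | {"models", "api_key", "priority"}


def build_worker_spec(url: str, **options: Any) -> dict:
    """Build a Create Worker request body (hypothetical helper)."""
    if not url:
        raise ValueError("'url' is required")
    spec = {"url": url}
    for field, value in options.items():
        if field not in KNOWN_FIELDS:
            raise ValueError(f"unknown field: {field}")
        if field in ALLOWED and value not in ALLOWED[field]:
            raise ValueError(f"invalid {field}: {value!r}")
        spec[field] = value
    return spec
```

For example, `build_worker_spec("http://gpu1:8000", worker_type="prefill", priority=80)` produces a body with exactly those three keys.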
Update Worker (partial)¶
Partially updates worker configuration. Only the fields you include are changed.
Request Body:
| Field | Type | Description |
|---|---|---|
| priority | integer | New routing priority |
| cost | number | New cost factor |
| labels | object | Updated labels |
| api_key | string | New API key (for key rotation) |
| health | object | Partial health-check overrides (timeout_secs, check_interval_secs, success_threshold, failure_threshold, disable_health_check) |
Response: 202 Accepted
{
"status": "accepted",
"worker_id": "worker-abc123",
"message": "Worker update queued for background processing"
}
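Because only the included fields change, a partial-update body should contain nothing but the fields being modified. The sketch below builds such a body and rejects names that are not in the table above; `build_worker_patch` is a hypothetical helper, not an SMG API.

```python
from typing import Any

PATCH_FIELDS = {"priority", "cost", "labels", "api_key", "health"}
HEALTH_FIELDS = {
    "timeout_secs",
    "check_interval_secs",
    "success_threshold",
    "failure_threshold",
    "disable_health_check",
}


def build_worker_patch(**changes: Any) -> dict:
    """Build a partial-update body containing only the changed fields."""
    if not changes:
        raise ValueError("a partial update needs at least one field")
    unknown = set(changes) - PATCH_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # 'health' is itself partial: only the listed override keys are accepted.
    bad_health = set(changes.get("health") or {}) - HEALTH_FIELDS
    if bad_health:
        raise ValueError(f"unknown health fields: {sorted(bad_health)}")
    return dict(changes)


# Key rotation: send only the new key and leave everything else untouched.
rotation_body = build_worker_patch(api_key="new-secret")
```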
Replace Worker (full)¶
Re-runs the full worker registration workflow, including model discovery. The request body must be a complete WorkerSpec whose url matches the existing worker's URL. URL changes are not supported via PUT; use DELETE followed by POST instead.
Response: 202 Accepted with the same shape as PATCH.
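A client can enforce the URL constraint before issuing the replace request. This is a small sketch under the assumption that the caller already knows the worker's current URL; `check_replace_spec` is a hypothetical helper.

```python
def check_replace_spec(existing_url: str, spec: dict) -> dict:
    """Return the spec unchanged if it is safe to send as a full replace."""
    if spec.get("url") != existing_url:
        # A different URL means a different worker: delete and re-create instead.
        raise ValueError(
            "PUT cannot change the worker URL; delete the worker and create a new one"
        )
    return spec
```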
Delete Worker¶
Removes a worker from the pool.
Response: 202 Accepted
{
"status": "accepted",
"worker_id": "worker-abc123",
"message": "Worker removal queued for background processing"
}
Cache Management¶
Flush worker KV caches and inspect per-worker load.
Flush Cache¶
Flushes the KV cache on all HTTP workers. gRPC workers are skipped. The response status is 200 OK on full success and 206 Partial Content when some workers fail.
Response: 200 OK
{
"status": "success",
"message": "Successfully flushed cache on all 3 HTTP workers",
"workers_flushed": 3,
"total_http_workers": 3,
"total_workers": 3
}
On partial failure, the response additionally includes successful (list of worker URLs) and failed (list of {worker, error} entries), and status becomes "partial_success".
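Callers need to handle both success codes. The sketch below reduces a flush response to the list of workers that still need attention, based on the fields described above; `failed_flushes` is a hypothetical helper, and the HTTP status and decoded body are assumed to come from your own client.

```python
from typing import List, Tuple


def failed_flushes(http_status: int, body: dict) -> List[Tuple[str, str]]:
    """Return (worker, error) pairs for workers that failed to flush."""
    if http_status == 200:
        # Full success: every HTTP worker was flushed.
        return []
    if http_status == 206:
        # Partial success: each 'failed' entry is {"worker": ..., "error": ...}.
        return [(entry["worker"], entry["error"]) for entry in body.get("failed", [])]
    raise RuntimeError(f"unexpected flush status: {http_status}")
```

Treating 206 as a non-error result matters here, since many HTTP clients only raise for 4xx and 5xx and would otherwise let a partial flush pass unnoticed.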
Get Loads¶
Returns the current load distribution across workers. The gateway fans out to every registered worker (HTTP and gRPC) and returns whatever each backend reports. The load field is the total number of KV-cache tokens in use across all data-parallel ranks for that worker; -1 indicates the worker failed to respond.
Response: 200 OK
{
"workers": [
{
"worker": "http://gpu1:8000",
"load": 1234,
"details": {
"timestamp": "2024-01-15T12:00:00Z",
"dp_rank_count": 1,
"loads": [
{
"dp_rank": 0,
"num_running_reqs": 5,
"num_waiting_reqs": 2,
"num_total_reqs": 7,
"num_used_tokens": 1234,
"max_total_num_tokens": 16384,
"token_usage": 0.075,
"gen_throughput": 45.2,
"cache_hit_rate": 0.82,
"utilization": 0.31,
"max_running_requests": 256
}
]
}
}
]
}
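When consuming this response, the -1 sentinel has to be filtered out before loads are compared, otherwise an unresponsive worker would look like the least loaded one. A minimal sketch, with hypothetical helper names:

```python
from typing import List, Optional, Tuple


def split_by_responsiveness(loads_response: dict) -> Tuple[List[dict], List[str]]:
    """Separate workers that reported a load from those that did not respond."""
    responsive, unresponsive = [], []
    for entry in loads_response["workers"]:
        if entry["load"] == -1:
            # -1 means the worker failed to respond to the load query.
            unresponsive.append(entry["worker"])
        else:
            responsive.append(entry)
    return responsive, unresponsive


def least_loaded(loads_response: dict) -> Optional[str]:
    """Return the URL of the responsive worker with the fewest tokens in use."""
    responsive, _ = split_by_responsiveness(loads_response)
    if not responsive:
        return None
    return min(responsive, key=lambda entry: entry["load"])["worker"]
```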
Model Information¶
Query model and server information.
List Models¶
Returns available models (proxied to workers).
Response: 200 OK
{
"object": "list",
"data": [
{
"id": "llama3-70b",
"object": "model",
"created": 1700000000,
"owned_by": "meta"
}
]
}
Get Model Info¶
Returns detailed model information (proxied to workers).
Response: 200 OK
Get Server Info¶
Returns server information (proxied to workers).
Response: 200 OK
WASM Module Management¶
Manage WebAssembly plugins. Modules are registered from files accessible to the gateway process; the request body contains descriptors with paths, not binary payloads.
Add WASM Module¶
Registers one or more WASM modules.
Request Body: JSON WasmModuleAddRequest
{
"modules": [
{
"name": "custom-middleware",
"file_path": "/etc/smg/wasm/custom-middleware.wasm",
"module_type": "Middleware",
"attach_points": [
{"Middleware": "OnRequest"},
{"Middleware": "OnResponse"}
]
}
]
}
The only supported module_type today is Middleware. Valid Middleware attach points are OnRequest, OnResponse, and OnError.
Response: 200 OK on full success, 400 Bad Request if any module failed to register. The response body echoes every requested module with an add_result field indicating success (carrying the assigned UUID) or failure (carrying the error message).
{
"modules": [
{
"name": "custom-middleware",
"file_path": "/etc/smg/wasm/custom-middleware.wasm",
"module_type": "Middleware",
"attach_points": [
{"Middleware": "OnRequest"},
{"Middleware": "OnResponse"}
],
"add_result": {
"Success": "550e8400-e29b-41d4-a716-446655440000"
}
}
]
}
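Since a single request can register several modules, the per-module add_result is the reliable place to check outcomes. The sketch below splits a response into registered and failed modules. Only the Success variant appears in this reference, so the code assumes that any add_result without a Success key is a failure and keeps it as-is; `registration_results` is a hypothetical helper.

```python
from typing import Tuple


def registration_results(response: dict) -> Tuple[dict, dict]:
    """Map module names to UUIDs (registered) or raw add_result values (failed)."""
    registered, failed = {}, {}
    for module in response["modules"]:
        result = module["add_result"]
        if isinstance(result, dict) and "Success" in result:
            # The Success variant carries the assigned module UUID.
            registered[module["name"]] = result["Success"]
        else:
            # Assumption: anything else is a failure carrying the error message.
            failed[module["name"]] = result
    return registered, failed
```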
List WASM Modules¶
Returns all registered WASM modules together with aggregate execution metrics.
Response: 200 OK
{
"modules": [
{
"module_uuid": "550e8400-e29b-41d4-a716-446655440000",
"module_meta": {
"name": "custom-middleware",
"file_path": "/etc/smg/wasm/custom-middleware.wasm",
"sha256_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"size_bytes": 65536,
"created_at": "2024-01-15T12:00:00.000000000Z",
"last_accessed_at": "2024-01-15T12:05:00.000000000Z",
"access_count": 42,
"attach_points": [
{"Middleware": "OnRequest"}
]
}
}
],
"metrics": {
"total_executions": 42,
"successful_executions": 42,
"failed_executions": 0,
"total_execution_time_ms": 125,
"max_execution_time_ms": 8,
"average_execution_time_ms": 2.97
}
}
Remove WASM Module¶
Removes a WASM module. The response body is a plain-text status message, not JSON.
Response: 200 OK
On failure returns 400 Bad Request with the error text as the body.
Error Responses¶
Unless noted otherwise, endpoints report errors with the following HTTP statuses and error types:
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | bad_request | Invalid request format or parameters |
| 401 | unauthorized | Missing or invalid authentication |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 409 | conflict | Resource already exists |
| 503 | service_unavailable | No healthy workers available |
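A client can mirror this table to turn status codes into error types and decide whether a retry is worthwhile. This is a sketch only: treating 503 as the sole retryable status is an assumption made here, not something this reference states, and both function names are hypothetical.

```python
# Status-to-type mapping copied from the table above.
ERROR_TYPES = {
    400: "bad_request",
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    409: "conflict",
    503: "service_unavailable",
}


def classify_error(status: int) -> str:
    """Return the documented error type, or 'unknown' for unlisted statuses."""
    return ERROR_TYPES.get(status, "unknown")


def should_retry(status: int) -> bool:
    """Assumption: only 'no healthy workers' (503) is a transient condition."""
    return status == 503
```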
Authentication¶
Admin endpoints require authentication via one of:
- API Key: Pass via the Authorization: Bearer <api-key> header
- Control Plane Key: For cluster management operations
Public endpoints (health checks, model info) do not require authentication.
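As a sketch of the API key option, the following builds an authenticated request with the standard library without sending it. The gateway address is a placeholder, `build_admin_request` is a hypothetical helper, and the path is the `location` value from the Create Worker example response.

```python
import urllib.request


def build_admin_request(
    base_url: str, path: str, api_key: str, method: str = "GET"
) -> urllib.request.Request:
    """Build (but do not send) a request carrying the admin API key."""
    if not api_key:
        raise ValueError("an API key is required for admin endpoints")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method=method,
    )


# Placeholder gateway address; substitute the address of your deployment.
request = build_admin_request(
    "http://localhost:30000", "/workers/worker-abc123", "my-api-key", method="DELETE"
)
```

Passing the result to `urllib.request.urlopen` would send it; keeping construction separate makes the header handling easy to verify without a running gateway.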