Admin API Reference¶

SMG provides administrative endpoints for managing tokenizers, workers, cache, and cluster operations.

Tokenizer Management¶

Manage tokenizers for text processing and tokenization.

Authentication Required

These endpoints require admin authentication via API key or control plane credentials.

Add Tokenizer¶

POST /v1/tokenizers

Adds a new tokenizer from a local path or HuggingFace model ID.

Request Body:

{
  "name": "llama3-tokenizer",
  "source": "meta-llama/Meta-Llama-3-8B",
  "chat_template_path": "/path/to/template.jinja"
}

Field	Type	Required	Description
`name`	string	Yes	Unique tokenizer identifier
`source`	string	Yes	HuggingFace model ID or local path
`chat_template_path`	string	No	Path to custom Jinja2 chat template

Response: 202 Accepted

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Tokenizer 'llama3-tokenizer' registration job submitted. Loading from: meta-llama/Meta-Llama-3-8B"
}

List Tokenizers¶

GET /v1/tokenizers

Returns all registered tokenizers.

Response: 200 OK

{
  "tokenizers": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "llama3-tokenizer",
      "source": "meta-llama/Meta-Llama-3-8B",
      "vocab_size": 128256
    }
  ]
}

Get Tokenizer¶

GET /v1/tokenizers/{tokenizer_id}

Returns details for a specific tokenizer.

Response: 200 OK

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "llama3-tokenizer",
  "source": "meta-llama/Meta-Llama-3-8B",
  "vocab_size": 128256
}

Response: 404 Not Found

{
  "error": {
    "message": "Tokenizer 'llama3-tokenizer' not found",
    "type": "tokenizer_not_found"
  }
}

Get Tokenizer Status¶

GET /v1/tokenizers/{tokenizer_id}/status

Returns the loading status of a tokenizer.

Response: 200 OK

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "message": "Tokenizer 'llama3-tokenizer' is loaded and ready",
  "vocab_size": 128256
}

Status	Description
`pending`	Tokenizer loading queued
`processing`	Tokenizer currently loading
`completed`	Tokenizer ready for use
`failed`	Loading failed (see message)

Remove Tokenizer¶

DELETE /v1/tokenizers/{tokenizer_id}

Removes a tokenizer.

Response: 200 OK

{
  "success": true,
  "message": "Tokenizer 'llama3-tokenizer' removed successfully"
}

Worker Management¶

Manage backend inference workers.

Tip

For listing workers and viewing metrics, see Gateway Extensions.

Create Worker¶

POST /workers

Registers a new backend worker.

Request Body:

{
  "url": "http://gpu1:8000",
  "models": [{ "id": "llama3-70b" }],
  "api_key": "worker-secret-key"
}

Field	Type	Required	Description
`url`	string	Yes	Worker base URL
`worker_type`	string	No	`regular`, `prefill`, or `decode` (default: `regular`)
`connection_mode`	string	No	`http` or `grpc` (default: `http`)
`runtime_type`	string	No	`sglang`, `vllm`, `trtllm`, `mlx`, `external`, or `unspecified` (default: `unspecified`, which triggers auto-detection)
`models`	array	No	Model cards served by this worker (empty = wildcard)
`api_key`	string	No	API key for worker authentication
`priority`	integer	No	Routing priority (higher = preferred, default: 50)

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "url": "http://gpu1:8000",
  "location": "/workers/worker-abc123",
  "message": "Worker addition queued for background processing"
}

Update Worker (partial)¶

PATCH /workers/{worker_id}

Partially updates worker configuration. Only the fields you include are changed.

Request Body:

{
  "priority": 75,
  "api_key": "new-api-key"
}

Field	Type	Description
`priority`	integer	New routing priority
`cost`	number	New cost factor
`labels`	object	Updated labels
`api_key`	string	New API key (for key rotation)
`health`	object	Partial health-check overrides (`timeout_secs`, `check_interval_secs`, `success_threshold`, `failure_threshold`, `disable_health_check`)

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "message": "Worker update queued for background processing"
}

Replace Worker (full)¶

PUT /workers/{worker_id}

Re-runs the full worker registration workflow (model discovery and all). The request body must be a complete WorkerSpec whose url matches the existing worker's URL — URL changes are not supported via PUT; use DELETE + POST instead.

Response: 202 Accepted with the same shape as PATCH.

Delete Worker¶

DELETE /workers/{worker_id}

Removes a worker from the pool.

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "message": "Worker removal queued for background processing"
}

Cache Management¶

Manage the routing cache and load information.

Flush Cache¶

POST /flush_cache

Flushes the KV cache on all HTTP workers. gRPC workers are skipped. The response status is 200 OK on full success and 206 Partial Content when some workers fail.

Response: 200 OK

{
  "status": "success",
  "message": "Successfully flushed cache on all 3 HTTP workers",
  "workers_flushed": 3,
  "total_http_workers": 3,
  "total_workers": 3
}

On partial failure, the response additionally includes successful (list of worker URLs) and failed (list of {worker, error} entries), and status becomes "partial_success".

Get Loads¶

GET /get_loads

Returns the current load distribution across workers. The gateway fans out to every registered worker (HTTP and gRPC) and returns whatever each backend reports. The load field is the total number of KV-cache tokens in use across all data-parallel ranks for that worker; -1 indicates the worker failed to respond.

Response: 200 OK

id=__span-24-1>{ "workers": [ { "worker": "http://gpu1:8000", "load": 1234, "details": { "timestamp": "2024-01-15T12:00:00Z", "dp_rank_count": 1, "loads": [ { "dp_rank": 0, "num_running_reqs": 5, "num_waiting_reqs": 2, "num_total_reqs": 7, "num_used_tokens": 1234, "max_total_num_tokens": 16384, "token_usage": 0.075, "gen_throughput": 45.2, "cache_hit_rate": 0.82, "utilization": 0.31, "max_running_requests": 256 } ] } } ] class=p>}

Model Information¶

Query model and server information.

List Models¶

GET /v1/models

Returns available models (proxied to workers).

Response: 200 OK

{
  "object": "list",
  "data": [
    {
      "id": "llama3-70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "meta"
    }
  ]
}

Get Model Info¶

GET /get_model_info

Returns detailed model information (proxied to workers).

Response: 200 OK

{
  "model_name": "llama3-70b",
  "max_tokens": 8192,
  "vocab_size": 128256
}

Get Server Info¶

GET /get_server_info

Returns server information (proxied to workers).

Response: 200 OK

{
  "version": "0.1.0",
  "backend": "vllm",
  "gpu_count": 8
}

WASM Module Management¶

Manage WebAssembly plugins. Modules are registered from files accessible to the gateway process; the request body contains descriptors with paths, not binary payloads.

Add WASM Module¶

POST /wasm

Registers one or more WASM modules.

Request Body: JSON WasmModuleAddRequest

{
  "modules": [
    {
      "name": "custom-middleware",
      "file_path": "/etc/smg/wasm/custom-middleware.wasm",
      "module_type": "Middleware",
      "attach_points": [
        {"Middleware": "OnRequest"},
        {"Middleware": "OnResponse"}
      ]
    }
  ]
}

The only supported module_type today is Middleware. Valid Middleware attach points are OnRequest, OnResponse, and OnError.

Response: 200 OK on full success, 400 Bad Request if any module failed to register. The response body echoes every requested module with an add_result field indicating success (carrying the assigned UUID) or failure (carrying the error message).

{
  "modules": [
    {
      "name": "custom-middleware",
      "file_path": "/etc/smg/wasm/custom-middleware.wasm",
      "module_type": "Middleware",
      "attach_points": [
        {"Middleware": "OnRequest"},
        {"Middleware": "OnResponse"}
      ],
      "add_result": {
        "Success": "550e8400-e29b-41d4-a716-446655440000"
      }
    }
  ]
}

List WASM Modules¶

GET /wasm

Returns all registered WASM modules together with aggregate execution metrics.

Response: 200 OK

{
  "modules": [
    {
      "module_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "module_meta": {
        "name": "custom-middleware",
        "file_path": "/etc/smg/wasm/custom-middleware.wasm",
        "sha256_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "size_bytes": 65536,
        "created_at": "2024-01-15T12:00:00.000000000Z",
        "last_accessed_at": "2024-01-15T12:05:00.000000000Z",
        "access_count": 42,
        "attach_points": [
          {"Middleware": "OnRequest"}
        ]
      }
    }
  ],
  "metrics": {
    "total_executions": 42,
    "successful_executions": 42,
    "failed_executions": 0,
    "total_execution_time_ms": 125,
    "max_execution_time_ms": 8,
    "average_execution_time_ms": 2.97
  }
}

Remove WASM Module¶

DELETE /wasm/{module_uuid}

Removes a WASM module. The body is a plain text status message, not JSON.

Response: 200 OK

Module removed successfully

On failure returns 400 Bad Request with the error text as the body.

Error Responses¶

All endpoints return errors in a consistent format:

{
  "error": {
    "message": "Detailed error description",
    "type": "error_type"
  }
}

HTTP Status	Error Type	Description
`400`	`bad_request`	Invalid request format or parameters
`401`	`unauthorized`	Missing or invalid authentication
`403`	`forbidden`	Insufficient permissions
`404`	`not_found`	Resource not found
`409`	`conflict`	Resource already exists
`503`	`service_unavailable`	No healthy workers available

Authentication¶

Admin endpoints require authentication via one of:

API Key: Pass via Authorization: Bearer <api-key> header
Control Plane Key: For cluster management operations

Public endpoints (health checks, model info) do not require authentication.