Skip to content

Admin API Reference

SMG provides administrative endpoints for managing tokenizers, workers, cache, and cluster operations.

Related Documentation

For health checks, worker status, and monitoring endpoints, see Gateway Extensions.


Tokenizer Management

Manage tokenizers for text processing and tokenization.

Authentication Required

These endpoints require admin authentication via API key or control plane credentials.

Add Tokenizer

POST /v1/tokenizers

Adds a new tokenizer from a local path or HuggingFace model ID.

Request Body:

{
  "name": "llama3-tokenizer",
  "source": "meta-llama/Meta-Llama-3-8B",
  "chat_template_path": "/path/to/template.jinja"
}

Field Type Required Description
name string Yes Unique tokenizer identifier
source string Yes HuggingFace model ID or local path
chat_template_path string No Path to custom Jinja2 chat template

Response: 202 Accepted

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Tokenizer 'llama3-tokenizer' registration job submitted. Loading from: meta-llama/Meta-Llama-3-8B"
}


List Tokenizers

GET /v1/tokenizers

Returns all registered tokenizers.

Response: 200 OK

{
  "tokenizers": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "llama3-tokenizer",
      "source": "meta-llama/Meta-Llama-3-8B",
      "vocab_size": 128256
    }
  ]
}


Get Tokenizer

GET /v1/tokenizers/{tokenizer_id}

Returns details for a specific tokenizer.

Response: 200 OK

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "llama3-tokenizer",
  "source": "meta-llama/Meta-Llama-3-8B",
  "vocab_size": 128256
}

Response: 404 Not Found

{
  "error": {
    "message": "Tokenizer 'llama3-tokenizer' not found",
    "type": "tokenizer_not_found"
  }
}


Get Tokenizer Status

GET /v1/tokenizers/{tokenizer_id}/status

Returns the loading status of a tokenizer.

Response: 200 OK

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "message": "Tokenizer 'llama3-tokenizer' is loaded and ready",
  "vocab_size": 128256
}

Status Description
pending Tokenizer loading queued
processing Tokenizer currently loading
completed Tokenizer ready for use
failed Loading failed (see message)

Remove Tokenizer

DELETE /v1/tokenizers/{tokenizer_id}

Removes a tokenizer.

Response: 200 OK

{
  "success": true,
  "message": "Tokenizer 'llama3-tokenizer' removed successfully"
}


Worker Management

Manage backend inference workers.

Tip

For listing workers and viewing metrics, see Gateway Extensions.

Create Worker

POST /workers

Registers a new backend worker.

Request Body:

{
  "url": "http://gpu1:8000",
  "models": [{ "id": "llama3-70b" }],
  "api_key": "worker-secret-key"
}

Field Type Required Description
url string Yes Worker base URL
worker_type string No regular, prefill, or decode (default: regular)
connection_mode string No http or grpc (default: http)
runtime_type string No sglang, vllm, trtllm, mlx, external, or unspecified (default: unspecified, which triggers auto-detection)
models array No Model cards served by this worker (empty = wildcard)
api_key string No API key for worker authentication
priority integer No Routing priority (higher = preferred, default: 50)

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "url": "http://gpu1:8000",
  "location": "/workers/worker-abc123",
  "message": "Worker addition queued for background processing"
}


Update Worker (partial)

PATCH /workers/{worker_id}

Partially updates worker configuration. Only the fields you include are changed.

Request Body:

{
  "priority": 75,
  "api_key": "new-api-key"
}

Field Type Description
priority integer New routing priority
cost number New cost factor
labels object Updated labels
api_key string New API key (for key rotation)
health object Partial health-check overrides (timeout_secs, check_interval_secs, success_threshold, failure_threshold, disable_health_check)

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "message": "Worker update queued for background processing"
}


Replace Worker (full)

PUT /workers/{worker_id}

Re-runs the full worker registration workflow (model discovery and all). The request body must be a complete WorkerSpec whose url matches the existing worker's URL — URL changes are not supported via PUT; use DELETE + POST instead.

Response: 202 Accepted with the same shape as PATCH.


Delete Worker

DELETE /workers/{worker_id}

Removes a worker from the pool.

Response: 202 Accepted

{
  "status": "accepted",
  "worker_id": "worker-abc123",
  "message": "Worker removal queued for background processing"
}


Cache Management

Manage the routing cache and load information.

Flush Cache

POST /flush_cache

Flushes the KV cache on all HTTP workers. gRPC workers are skipped. The response status is 200 OK on full success and 206 Partial Content when some workers fail.

Response: 200 OK

{
  "status": "success",
  "message": "Successfully flushed cache on all 3 HTTP workers",
  "workers_flushed": 3,
  "total_http_workers": 3,
  "total_workers": 3
}

On partial failure, the response additionally includes successful (list of worker URLs) and failed (list of {worker, error} entries), and status becomes "partial_success".


Get Loads

GET /get_loads

Returns the current load distribution across workers. The gateway fans out to every registered worker (HTTP and gRPC) and returns whatever each backend reports. The load field is the total number of KV-cache tokens in use across all data-parallel ranks for that worker; -1 indicates the worker failed to respond.

Response: 200 OK

{
  "workers": [
    {
      "worker": "http://gpu1:8000",
      "load": 1234,
      "details": {
        "timestamp": "2024-01-15T12:00:00Z",
        "dp_rank_count": 1,
        "loads": [
          {
            "dp_rank": 0,
            "num_running_reqs": 5,
            "num_waiting_reqs": 2,
            "num_total_reqs": 7,
            "num_used_tokens": 1234,
            "max_total_num_tokens": 16384,
            "token_usage": 0.075,
            "gen_throughput": 45.2,
            "cache_hit_rate": 0.82,
            "utilization": 0.31,
            "max_running_requests": 256
          }
        ]
      }
    }
  ]
}


Model Information

Query model and server information.

List Models

GET /v1/models

Returns available models (proxied to workers).

Response: 200 OK

{
  "object": "list",
  "data": [
    {
      "id": "llama3-70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "meta"
    }
  ]
}


Get Model Info

GET /get_model_info

Returns detailed model information (proxied to workers).

Response: 200 OK

{
  "model_name": "llama3-70b",
  "max_tokens": 8192,
  "vocab_size": 128256
}


Get Server Info

GET /get_server_info

Returns server information (proxied to workers).

Response: 200 OK

{
  "version": "0.1.0",
  "backend": "vllm",
  "gpu_count": 8
}


WASM Module Management

Manage WebAssembly plugins. Modules are registered from files accessible to the gateway process; the request body contains descriptors with paths, not binary payloads.

Add WASM Module

POST /wasm

Registers one or more WASM modules.

Request Body: JSON WasmModuleAddRequest

{
  "modules": [
    {
      "name": "custom-middleware",
      "file_path": "/etc/smg/wasm/custom-middleware.wasm",
      "module_type": "Middleware",
      "attach_points": [
        {"Middleware": "OnRequest"},
        {"Middleware": "OnResponse"}
      ]
    }
  ]
}

The only supported module_type today is Middleware. Valid Middleware attach points are OnRequest, OnResponse, and OnError.

Response: 200 OK on full success, 400 Bad Request if any module failed to register. The response body echoes every requested module with an add_result field indicating success (carrying the assigned UUID) or failure (carrying the error message).

{
  "modules": [
    {
      "name": "custom-middleware",
      "file_path": "/etc/smg/wasm/custom-middleware.wasm",
      "module_type": "Middleware",
      "attach_points": [
        {"Middleware": "OnRequest"},
        {"Middleware": "OnResponse"}
      ],
      "add_result": {
        "Success": "550e8400-e29b-41d4-a716-446655440000"
      }
    }
  ]
}

List WASM Modules

GET /wasm

Returns all registered WASM modules together with aggregate execution metrics.

Response: 200 OK

{
  "modules": [
    {
      "module_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "module_meta": {
        "name": "custom-middleware",
        "file_path": "/etc/smg/wasm/custom-middleware.wasm",
        "sha256_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "size_bytes": 65536,
        "created_at": "2024-01-15T12:00:00.000000000Z",
        "last_accessed_at": "2024-01-15T12:05:00.000000000Z",
        "access_count": 42,
        "attach_points": [
          {"Middleware": "OnRequest"}
        ]
      }
    }
  ],
  "metrics": {
    "total_executions": 42,
    "successful_executions": 42,
    "failed_executions": 0,
    "total_execution_time_ms": 125,
    "max_execution_time_ms": 8,
    "average_execution_time_ms": 2.97
  }
}


Remove WASM Module

DELETE /wasm/{module_uuid}

Removes a WASM module. The body is a plain text status message, not JSON.

Response: 200 OK

Module removed successfully

On failure returns 400 Bad Request with the error text as the body.


Error Responses

All endpoints return errors in a consistent format:

{
  "error": {
    "message": "Detailed error description",
    "type": "error_type"
  }
}
HTTP Status Error Type Description
400 bad_request Invalid request format or parameters
401 unauthorized Missing or invalid authentication
403 forbidden Insufficient permissions
404 not_found Resource not found
409 conflict Resource already exists
503 service_unavailable No healthy workers available

Authentication

Admin endpoints require authentication via one of:

  1. API Key: Pass via Authorization: Bearer <api-key> header
  2. Control Plane Key: For cluster management operations

Public endpoints (health checks, model info) do not require authentication.