
Configuration Reference

Complete configuration reference for tuning SMG behavior.


Configuration Methods

SMG can be configured through:

  1. Command-line arguments (highest priority)
  2. Environment variables
  3. Default values (lowest priority)
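
A minimal bash sketch of this precedence rule, assuming an empty value means "not provided". The function name `resolve_option` is hypothetical and is not part of SMG. `--redis-url` and `REDIS_URL` are used only because they are a documented flag/variable pair.

```bash
#!/usr/bin/env bash
# Hypothetical helper illustrating the documented precedence:
# command-line argument > environment variable > built-in default.
resolve_option() {
  local cli_value="$1" env_value="$2" default_value="$3"
  if [[ -n "$cli_value" ]]; then
    printf '%s\n' "$cli_value"
  elif [[ -n "$env_value" ]]; then
    printf '%s\n' "$env_value"
  else
    printf '%s\n' "$default_value"
  fi
}

# REDIS_URL is set in the environment, but --redis-url was also passed.
REDIS_URL="redis://env-host:6379"
resolve_option "redis://cli-host:6379" "$REDIS_URL" ""   # prints redis://cli-host:6379
```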

Worker Configuration

Host

Network interface to bind to.

Option --host
Environment -
Default 0.0.0.0
Value Description
127.0.0.1 Localhost only
0.0.0.0 All IPv4 interfaces
:: All IPv6 interfaces
::1 IPv6 localhost

Port

Port for the main API server.

Option --port
Environment -
Default 30000

Worker URLs

List of worker URLs to route requests to.

Option --worker-urls
Environment -
Default Empty
Format Space-separated URLs

Examples:

--worker-urls http://worker1:8000 http://worker2:8000
--worker-urls http://[::1]:8000 http://192.168.1.1:8000  # IPv6 and IPv4
--worker-urls grpc://worker1:50051  # gRPC mode


Routing Policy Configuration

Load Balancing Policy

Controls how requests are distributed across workers.

Option --policy
Environment -
Default cache_aware
Values random, round_robin, cache_aware, power_of_two, prefix_hash, consistent_hashing, bucket, manual

Policy Comparison:

Policy Use Case KV Cache Load Balance
random Simple deployments Poor Fair
round_robin Uniform workloads Poor Good
power_of_two Variable workloads Poor Excellent
cache_aware LLM inference Excellent Good
prefix_hash Consistent routing by prefix Good Good
consistent_hashing Session affinity via hash ring Good Good
bucket Load balancing with bucket boundaries Poor Excellent
manual Sticky sessions with LRU eviction Good Manual

Recommendation: Use cache_aware for LLM workloads to maximize KV cache hit rates.

Cache-Aware Policy Options

Option Description Default
--cache-threshold Minimum prefix-match ratio (0.0-1.0) required to route to the worker with the best cache match 0.3
--balance-abs-threshold Absolute load difference between workers above which load balancing can trigger 64
--balance-rel-threshold Load ratio between workers above which load balancing can trigger 1.5
--eviction-interval Interval in seconds between cache eviction operations 120
--max-tree-size Maximum size of the approximation tree 67108864
--block-size KV cache block size for event-driven cache-aware routing 16
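
The thresholds above combine into a single routing decision. The bash sketch below assumes the behavior documented for the upstream SGLang router, which the cache-aware policy is understood to follow. If load is imbalanced by both the absolute and the relative threshold, the request is routed by load. Otherwise it goes to the worker with the best prefix match when the match ratio exceeds `--cache-threshold`, and to the worker with the most free cache capacity when it does not. The function `cache_aware_mode` and its output labels are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the cache-aware routing decision (not SMG source code).
# Defaults mirror the documented flag defaults.
CACHE_THRESHOLD=0.3        # --cache-threshold
BALANCE_ABS_THRESHOLD=64   # --balance-abs-threshold
BALANCE_REL_THRESHOLD=1.5  # --balance-rel-threshold

# Usage: cache_aware_mode MAX_LOAD MIN_LOAD MATCH_RATE
# Prints "balance" (route by load), "cache" (route to best prefix match),
# or "capacity" (route to the worker with the most free cache space).
cache_aware_mode() {
  awk -v max="$1" -v min="$2" -v rate="$3" \
      -v abs="$BALANCE_ABS_THRESHOLD" -v rel="$BALANCE_REL_THRESHOLD" \
      -v thr="$CACHE_THRESHOLD" 'BEGIN {
    max += 0; min += 0; rate += 0   # force numeric comparison
    if (max - min > abs && max > min * rel) print "balance"
    else if (rate > thr + 0) print "cache"
    else print "capacity"
  }'
}

cache_aware_mode 100 10 0.9   # gap 90 > 64 and 100 > 15: prints balance
cache_aware_mode 50 10 0.9    # gap 40 is below 64: prints cache
```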

Prefix Hash Policy Options

Option Description Default
--prefix-token-count Number of prefix tokens to use for hashing 256
--prefix-hash-load-factor Load factor threshold for rebalancing 1.25
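
As a rough illustration of prefix hashing, the sketch below sends prompts that share their first `--prefix-token-count` tokens to the same worker. It assumes whitespace-separated words stand in for tokens and uses `cksum` as the hash. The function name is hypothetical, and the rebalancing controlled by `--prefix-hash-load-factor` is not modeled.

```bash
#!/usr/bin/env bash
# Hypothetical prefix-hash routing sketch (not SMG source code).
# Assumption: "tokens" are approximated by whitespace-separated words;
# a real router would hash tokenizer output.
PREFIX_TOKEN_COUNT=256   # --prefix-token-count

# Usage: prefix_hash_worker NUM_WORKERS PROMPT ; prints a worker index.
prefix_hash_worker() {
  local num_workers="$1" prompt="$2" prefix checksum
  # Keep only the first PREFIX_TOKEN_COUNT words as the routing prefix.
  prefix=$(printf '%s\n' "$prompt" | tr -s '[:space:]' '\n' \
           | head -n "$PREFIX_TOKEN_COUNT" | tr '\n' ' ')
  # cksum yields a deterministic CRC; its first field is the checksum.
  checksum=$(printf '%s' "$prefix" | cksum | cut -d' ' -f1)
  echo $(( checksum % num_workers ))
}
```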

Manual Policy Options

Option Description Default
--max-idle-secs Maximum idle time before eviction 14400 (4 hours)
--assignment-mode Mode for new routing key assignment random

Assignment Modes:

  - random: Assign to a random worker
  - min_load: Assign to the worker with the fewest active requests
  - min_group: Assign to the worker with the fewest routing keys
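
The following bash 4+ sketch shows sticky routing-key assignment under the three modes. The worker URLs, array names, and `route_key` function are hypothetical, active-request counts are set by hand, and idle eviction via `--max-idle-secs` is left out.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of manual-policy sticky routing (requires bash 4+).
# Idle eviction (--max-idle-secs) is not modeled.
WORKERS=(http://w1:8000 http://w2:8000 http://w3:8000)
declare -A KEY_TO_WORKER=()   # routing key -> worker URL
declare -A ACTIVE=()          # worker URL -> active requests (updated elsewhere)
declare -A KEY_COUNT=()       # worker URL -> routing keys assigned to it
for w in "${WORKERS[@]}"; do ACTIVE[$w]=0; KEY_COUNT[$w]=0; done

# Usage: route_key KEY MODE ; the result is stored in ROUTED_WORKER.
# MODE mirrors --assignment-mode: random, min_load, or min_group.
route_key() {
  local key="$1" mode="$2" w best=""
  if [[ -n "${KEY_TO_WORKER[$key]:-}" ]]; then
    ROUTED_WORKER="${KEY_TO_WORKER[$key]}"   # sticky: reuse the existing assignment
    return 0
  fi
  case "$mode" in
    random)    best="${WORKERS[RANDOM % ${#WORKERS[@]}]}" ;;
    min_load)  for w in "${WORKERS[@]}"; do
                 if [[ -z "$best" ]] || [[ ${ACTIVE[$w]} -lt ${ACTIVE[$best]} ]]; then best="$w"; fi
               done ;;
    min_group) for w in "${WORKERS[@]}"; do
                 if [[ -z "$best" ]] || [[ ${KEY_COUNT[$w]} -lt ${KEY_COUNT[$best]} ]]; then best="$w"; fi
               done ;;
    *) echo "unknown assignment mode: $mode" >&2; return 1 ;;
  esac
  KEY_TO_WORKER[$key]="$best"
  KEY_COUNT[$best]=$(( ${KEY_COUNT[$best]} + 1 ))
  ROUTED_WORKER="$best"
}
```

Because `route_key` updates global arrays, call it directly rather than inside `$(...)`; a command substitution runs in a subshell and would discard the new assignment.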

Advanced Routing Options

Option Description Default
--dp-aware Enable data parallelism aware scheduling false
--enable-igw Enable IGW (Inference Gateway) mode for multi-model support false
--dp-minimum-tokens-scheduler Enable minimum tokens scheduler for data parallel group false
--load-monitor-interval Interval in seconds between load monitor checks for power_of_two routing 10
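
The power_of_two policy name refers to the well-known "power of two choices" technique: sample two distinct workers at random and send the request to the less loaded one. A minimal bash sketch under that assumption, with loads passed as arguments (in SMG they presumably come from the load monitor refreshed every `--load-monitor-interval` seconds). `power_of_two_pick` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical power-of-two-choices sketch (not SMG source code).

# Usage: power_of_two_pick LOAD... ; prints the index of the chosen worker.
power_of_two_pick() {
  local -a loads=("$@")
  local n=$# i j
  (( n > 0 )) || return 1                      # no workers to choose from
  if (( n == 1 )); then echo 0; return 0; fi
  i=$(( RANDOM % n ))
  j=$(( RANDOM % (n - 1) ))
  if (( j >= i )); then j=$(( j + 1 )); fi     # guarantees j != i
  # Send the request to the less loaded of the two candidates.
  if (( loads[i] <= loads[j] )); then echo "$i"; else echo "$j"; fi
}
```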

PD Disaggregation Configuration

Prefill-Decode (PD) disaggregated mode runs the prefill and decode phases of inference on separate workers.

Enable PD Mode

Option --pd-disaggregation
Environment -
Default false

Prefill Servers

Option --prefill
Format URL [BOOTSTRAP_PORT]
Multiple Yes (specify multiple times)

Examples:

--prefill http://prefill1:30001 9001 \
--prefill http://prefill2:30002 9002 \
--prefill http://prefill3:30003 none  # No bootstrap port

Decode Servers

Option --decode
Format URL
Multiple Yes (specify multiple times)

Example:

--decode http://decode1:30003 \
--decode http://decode2:30004

PD-Specific Policies

Option Description Default
--prefill-policy Specific policy for prefill nodes Uses main --policy
--decode-policy Specific policy for decode nodes Uses main --policy

Worker Startup Configuration

Option Description Default
--worker-startup-timeout-secs Timeout for worker startup and registration 1800 (30 min)
--worker-startup-check-interval Interval in seconds between worker startup checks 30

Service Discovery (Kubernetes)

Enable Service Discovery

Option --service-discovery
Environment -
Default false

Note: Enabling service discovery automatically enables IGW mode.

Label Selector

Option --selector
Format key=value (space-separated for multiple)

Example:

--selector app=sglang-worker tier=inference
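
A sketch of how a multi-label selector could be matched against a pod's labels, assuming the Kubernetes equality-based semantics where every key=value pair must be present. `selector_matches` is hypothetical, and labels are passed as a space-separated string rather than read from the Kubernetes API.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: does a pod's label set satisfy a --selector value?
# Assumes AND semantics: every key=value pair in the selector must match.

# Usage: selector_matches "SELECTOR_PAIRS" "POD_LABEL_PAIRS"
# Both arguments are space-separated key=value lists. Returns 0 on a match.
selector_matches() {
  local selector="$1" labels="$2" pair
  for pair in $selector; do
    # Pad with spaces so "app=x" does not match inside "myapp=x".
    [[ " $labels " == *" $pair "* ]] || return 1
  done
  return 0
}
```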

Namespace

Option --service-discovery-namespace
Environment -
Default All namespaces

Worker Port

Option --service-discovery-port
Environment -
Default 80

PD Service Discovery Selectors

Option Description
--prefill-selector Label selector for prefill server pods
--decode-selector Label selector for decode server pods

HA Mesh Router Discovery

Option Description
--router-selector Label selector for router pod discovery in HA mesh mode (format: key=value)

Per-Worker Model ID Override

Option Description
--model-id-from Override each worker's model_id from pod metadata. Accepted values: namespace, label:<key>, or annotation:<key>.
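
The three accepted `--model-id-from` forms map to different pod fields. This hypothetical bash function resolves a value from metadata passed in as arguments. How SMG handles a missing label or annotation is not covered here, so the sketch simply returns a non-zero status.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of resolving a --model-id-from value (not SMG source code).
# Pod metadata is passed as arguments instead of being read from Kubernetes.

# Usage: model_id_from SPEC NAMESPACE "LABELS" "ANNOTATIONS"
# LABELS and ANNOTATIONS are space-separated key=value lists.
model_id_from() {
  local spec="$1" namespace="$2" labels="$3" annotations="$4" pairs key pair
  case "$spec" in
    namespace)     echo "$namespace"; return 0 ;;
    label:?*)      pairs="$labels";      key="${spec#label:}" ;;
    annotation:?*) pairs="$annotations"; key="${spec#annotation:}" ;;
    *) echo "unsupported --model-id-from value: $spec" >&2; return 1 ;;
  esac
  for pair in $pairs; do
    # Split each entry at the first '=' and compare the key.
    if [[ "${pair%%=*}" == "$key" ]]; then echo "${pair#*=}"; return 0; fi
  done
  return 1   # the requested key is not present on the pod
}
```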

Tokenizer Configuration

Model Path

Option --model-path
Environment -
Default None
Description HuggingFace model ID or local path for loading tokenizer

Tokenizer Path

Option --tokenizer-path
Environment -
Default None
Description Explicit tokenizer path (overrides the tokenizer loaded from --model-path)

Chat Template

Option --chat-template
Environment -
Default None
Description Path to chat template file

Disable Tokenizer Autoload

Option --disable-tokenizer-autoload
Environment -
Default false
Description Disable automatic tokenizer loading at startup and during worker registration. Useful when tokenizers are loaded on-demand via the API.

Tokenizer Cache (L0 - Exact Match)

Option Description Default
--tokenizer-cache-enable-l0 Enable L0 exact match cache false
--tokenizer-cache-l0-max-entries Maximum entries in L0 cache 10000

Tokenizer Cache (L1 - Prefix Matching)

Option Description Default
--tokenizer-cache-enable-l1 Enable L1 prefix matching cache false
--tokenizer-cache-l1-max-memory Maximum memory for L1 cache (bytes) 52428800 (50MB)

Parser Configuration

Reasoning Parser

Option --reasoning-parser
Environment -
Default None
Values deepseek-r1, qwen3, etc.
Description Parser for reasoning models with thinking tokens

Tool Call Parser

Option --tool-call-parser
Environment -
Default None
Values json, qwen, etc.
Description Parser for tool-call/function-calling interactions

MCP Configuration

MCP Config Path

Option --mcp-config-path
Environment -
Default None
Description Path to MCP (Model Context Protocol) server configuration file

Backend Configuration

Backend Runtime

Option --backend
Environment -
Default None (auto-detected)
Values sglang, vllm, trtllm, openai, anthropic, gemini

History Backend

Option --history-backend
Environment -
Default memory
Values memory, none, oracle, postgres, redis

Storage Configuration

Oracle Database

Option Environment Description
--oracle-wallet-path ATP_WALLET_PATH Path to Oracle ATP wallet directory
--oracle-tns-alias ATP_TNS_ALIAS Oracle TNS alias from tnsnames.ora
--oracle-dsn ATP_DSN Oracle connection descriptor/DSN
--oracle-user ATP_USER Oracle database username
--oracle-password ATP_PASSWORD Oracle database password
--oracle-external-auth ATP_EXTERNAL_AUTH Enable Oracle external authentication (default: false)
--oracle-pool-min ATP_POOL_MIN Minimum connection pool size (default: 1)
--oracle-pool-max ATP_POOL_MAX Maximum connection pool size (default: 16)
--oracle-pool-timeout-secs ATP_POOL_TIMEOUT_SECS Pool timeout in seconds (default: 30)

PostgreSQL Database

Option Environment Description Default
--postgres-db-url POSTGRES_DB_URL PostgreSQL connection URL -
--postgres-pool-max-size POSTGRES_POOL_MAX Maximum pool size 16

Redis Database

Option Environment Description Default
--redis-url REDIS_URL Redis connection URL -
--redis-pool-max-size REDIS_POOL_MAX Maximum pool size 16
--redis-retention-days REDIS_RETENTION_DAYS Data retention (-1 for persistent) 30

WASM Configuration

Enable WebAssembly

Option --enable-wasm
Environment -
Default false
Description Enable WebAssembly support

Storage Hook WASM Component

Option --storage-hook-wasm-path
Environment -
Default None
Description Path to a WASM component implementing storage hooks. When set, wraps all storage backends with hook-based interceptors.

Schema Config File

Option --schema-config
Environment -
Default None
Description Path to a YAML schema config file for storage table/column remapping.

WebRTC Configuration

Option Description Default
--webrtc-bind-addr Bind address for WebRTC UDP sockets (client-facing ICE candidate IP). Set to 127.0.0.1 for local development on the same machine. 0.0.0.0 (auto-detect via routing table)
--webrtc-stun-server STUN server for ICE candidate gathering (host:port). Set to your own STUN server for enterprise deployments that restrict outbound traffic to external STUN servers. stun.l.google.com:19302

Mesh Server Configuration

High-availability mesh networking for multi-router coordination.

Option Description Default
--enable-mesh Enable mesh server for HA multi-router coordination. Requires at least two SMG instances. false
--mesh-server-name Name for this mesh node. If not set, a random name is generated (e.g., Mesh_a1b2). Auto-generated
--mesh-host Bind address for the mesh server. 0.0.0.0
--mesh-advertise-host Routable address advertised to other mesh peers. Required when --mesh-host is an unspecified bind address such as 0.0.0.0. Same as --mesh-host
--mesh-port Port for the mesh server. 39527
--mesh-peer-urls Peer mesh node addresses to join (format: host:port). Used for initial cluster formation. (none)

Example:

smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527


Request Handling Configuration

Request Timeout

Option --request-timeout-secs
Environment -
Default 1800 (30 minutes)
Description Maximum time for request processing

Shutdown Grace Period

Option --shutdown-grace-period-secs
Environment -
Default 180 (3 minutes)
Description Time to wait for in-flight requests during shutdown

Maximum Payload Size

Option --max-payload-size
Environment -
Default 536870912 (512MB)
Description Maximum request payload size in bytes

CORS Configuration

Option --cors-allowed-origins
Environment -
Default Empty
Format Space-separated URLs

Example:

--cors-allowed-origins http://localhost:3000 https://example.com

Request ID Headers

Option --request-id-headers
Environment -
Default None (uses common defaults)
Description Custom HTTP headers to check for request IDs

Example:

--request-id-headers x-request-id x-trace-id x-correlation-id

Storage Context Headers

Option --storage-context-headers
Environment -
Default Empty
Format Space-separated header=context_key entries
Description Maps request headers into storage hook request context

Example:

--storage-context-headers x-tenant-id=tenant_id x-user-id=user_id

This lets storage hooks read values such as tenant_id and user_id from the request context without hard-coding specific headers in the gateway.

Only map headers that are injected or sanitized by a trusted upstream. Client-supplied headers can otherwise spoof storage hook request context values.
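
A sketch of the mapping step: each `header=context_key` entry is split at the first `=`, and the matching request header's value is stored under the context key. It assumes case-insensitive header matching (as HTTP requires for header names) and passes headers as "Name: value" lines. `build_storage_context` is a hypothetical name, not an SMG API.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of --storage-context-headers mapping (requires bash 4+).

# Usage: build_storage_context "MAPPINGS" "REQUEST_HEADERS"
# MAPPINGS: space-separated header=context_key entries (the flag's format).
# REQUEST_HEADERS: newline-separated "Name: value" lines.
# Prints one context_key=value line per mapped header that is present.
build_storage_context() {
  local mappings="$1" headers="$2" entry header ctx_key line name value
  for entry in $mappings; do
    header="${entry%%=*}"; ctx_key="${entry#*=}"
    while IFS= read -r line; do
      name="${line%%:*}"
      value="${line#*:}"; value="${value# }"
      # HTTP header names are case-insensitive, so compare lowercased.
      if [[ "${name,,}" == "${header,,}" ]]; then
        echo "$ctx_key=$value"
        break
      fi
    done <<< "$headers"
  done
}
```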


Rate Limiting Configuration

Concurrent Request Limit

Option --max-concurrent-requests
Environment -
Default -1 (unlimited)
Range -1 or 1+

Sizing Guide:

max_concurrent_requests = num_workers * requests_per_worker_capacity
Worker GPU Memory Suggested Concurrent Requests per Worker
16GB 4-8
40GB 8-16
80GB 16-32
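
A worked example of the sizing formula, assuming four workers on 80GB GPUs and taking the low end of the suggested range (16 requests per worker). The variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Worked example of: max_concurrent_requests = num_workers * requests_per_worker_capacity
num_workers=4                    # assumed: four 80GB workers
requests_per_worker_capacity=16  # low end of the suggested 16-32 range
max_concurrent_requests=$(( num_workers * requests_per_worker_capacity ))
echo "--max-concurrent-requests $max_concurrent_requests"   # prints: --max-concurrent-requests 64
```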

Queue Configuration

Option Description Default
--queue-size Maximum number of requests that can wait in the queue once the concurrency limit is reached 100
--queue-timeout-secs Maximum time a request can wait in queue 60

Token Bucket Rate Limiting

Option --rate-limit-tokens-per-second
Environment -
Default Same as --max-concurrent-requests
Description Token bucket refill rate
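
A minimal token-bucket sketch. It assumes the bucket holds at most `--max-concurrent-requests` tokens, that each request takes one token, and that time advances in whole seconds supplied by the caller. The variable and function names are hypothetical, and whether SMG also returns tokens when a request completes is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical token-bucket sketch (not SMG source code).
CAPACITY=4           # assumed: --max-concurrent-requests
REFILL_PER_SEC=2     # --rate-limit-tokens-per-second
TOKENS=$CAPACITY
LAST_REFILL=0

# Usage: try_acquire NOW_SECONDS ; returns 0 if a token was taken, 1 otherwise.
# Call directly (not inside $(...)) so the bucket state persists.
try_acquire() {
  local now="$1"
  # Refill in proportion to elapsed time, never exceeding capacity.
  TOKENS=$(( TOKENS + (now - LAST_REFILL) * REFILL_PER_SEC ))
  if (( TOKENS > CAPACITY )); then TOKENS=$CAPACITY; fi
  LAST_REFILL=$now
  if (( TOKENS > 0 )); then
    TOKENS=$(( TOKENS - 1 ))
    return 0
  fi
  return 1   # the caller would queue the request (see --queue-size)
}
```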

Retry Configuration

Retry Options

Option Description Default
--retry-max-retries Maximum retry attempts 5
--retry-initial-backoff-ms Initial backoff delay (ms) 50
--retry-max-backoff-ms Maximum backoff delay (ms) 30000
--retry-backoff-multiplier Exponential backoff multiplier 1.5
--retry-jitter-factor Jitter factor (0.0-1.0) 0.2
--disable-retries Disable automatic retries false

Backoff Formula:

delay = min(initial_backoff * multiplier^attempt, max_backoff) * (1 + random(0, jitter_factor))
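
The formula can be evaluated directly. This bash/awk sketch plugs in the documented defaults. `retry_delay_ms` is a hypothetical helper, the optional second argument is a uniform sample u in [0, 1) used for the jitter term, and numbering attempts from 0 is an assumption.

```bash
#!/usr/bin/env bash
# Evaluates the documented backoff formula with the default settings.
INITIAL_BACKOFF_MS=50   # --retry-initial-backoff-ms
MAX_BACKOFF_MS=30000    # --retry-max-backoff-ms
MULTIPLIER=1.5          # --retry-backoff-multiplier
JITTER_FACTOR=0.2       # --retry-jitter-factor

# Usage: retry_delay_ms ATTEMPT [U]
# random(0, jitter_factor) is modeled as U * jitter_factor; pass U=0 to
# see the un-jittered delay, or omit U to draw a random sample.
retry_delay_ms() {
  awk -v attempt="$1" -v u="${2:-}" -v init="$INITIAL_BACKOFF_MS" \
      -v max="$MAX_BACKOFF_MS" -v mult="$MULTIPLIER" -v jitter="$JITTER_FACTOR" 'BEGIN {
    if (u == "") { srand(); u = rand() }   # no sample given: draw one
    base = init * mult ^ attempt           # exponential growth
    if (base > max) base = max             # cap at max_backoff
    printf "%.1f\n", base * (1 + u * jitter)
  }'
}

for attempt in 0 1 2 3 4; do retry_delay_ms "$attempt" 0; done
```

Because the jitter is applied after the cap, a delay computed this way can exceed `--retry-max-backoff-ms` by up to the jitter factor (up to 36000 ms with the defaults).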


Circuit Breaker Configuration

Option Description Default
--cb-failure-threshold Failures before circuit opens 10
--cb-success-threshold Successes needed to close in half-open state 3
--cb-timeout-duration-secs Time before attempting recovery 60
--cb-window-duration-secs Sliding window for tracking failures 120
--disable-circuit-breaker Disable circuit breaker false

Circuit Breaker States:

  - Closed: Normal operation, tracking failures
  - Open: All requests fail fast, circuit tripped
  - Half-Open: Testing if service recovered
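
The three states can be sketched as a small state machine. This is a hypothetical bash model, not SMG source code: it counts consecutive failures rather than failures inside the `--cb-window-duration-secs` window, takes the current time in seconds as an argument, and reopens the circuit on any failure while half-open, which is conventional behavior but assumed here.

```bash
#!/usr/bin/env bash
# Hypothetical circuit-breaker state machine (not SMG source code).
FAILURE_THRESHOLD=10   # --cb-failure-threshold
SUCCESS_THRESHOLD=3    # --cb-success-threshold
TIMEOUT_SECS=60        # --cb-timeout-duration-secs

CB_STATE=closed; CB_FAILURES=0; CB_SUCCESSES=0; CB_OPENED_AT=0

# Usage: cb_allow NOW ; returns 0 if a request may be sent.
cb_allow() {
  local now="$1"
  if [[ $CB_STATE == open ]] && (( now - CB_OPENED_AT >= TIMEOUT_SECS )); then
    CB_STATE=half_open; CB_SUCCESSES=0   # timeout elapsed: start probing
  fi
  [[ $CB_STATE != open ]]                # fail fast while open
}

# Usage: cb_record RESULT NOW ; RESULT is "success" or "failure".
cb_record() {
  local result="$1" now="$2"
  if [[ $result == failure ]]; then
    CB_FAILURES=$(( CB_FAILURES + 1 ))
    if [[ $CB_STATE == half_open ]] || (( CB_FAILURES >= FAILURE_THRESHOLD )); then
      CB_STATE=open; CB_OPENED_AT=$now; CB_FAILURES=0
    fi
  else
    CB_FAILURES=0
    if [[ $CB_STATE == half_open ]]; then
      CB_SUCCESSES=$(( CB_SUCCESSES + 1 ))
      if (( CB_SUCCESSES >= SUCCESS_THRESHOLD )); then CB_STATE=closed; fi
    fi
  fi
}
```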


Health Check Configuration

Option Description Default
--health-failure-threshold Failures before marking unhealthy 3
--health-success-threshold Successes before marking healthy 2
--health-check-timeout-secs Timeout for health check requests 5
--health-check-interval-secs Interval between health checks 60
--health-check-endpoint Health check endpoint path /health
--disable-health-check Disable all health checks false
--remove-unhealthy-workers Remove workers from the registry when marked unhealthy by health checks. Useful for ephemeral worker pools where failed workers should be deregistered. false

Prometheus Metrics Configuration

Metrics Server

Option Description Default
--prometheus-port Port for Prometheus metrics endpoint 29000
--prometheus-host Host for Prometheus metrics server 0.0.0.0
--prometheus-duration-buckets Custom histogram buckets Default buckets

Example:

--prometheus-duration-buckets 0.001 0.005 0.01 0.025 0.05 0.1 0.25 0.5 1.0 2.5 5.0 10.0


OpenTelemetry Configuration

Enable Tracing

Option --enable-trace
Environment -
Default false

OTLP Endpoint

Option --otlp-traces-endpoint
Environment -
Default localhost:4317
Format host:port

Example:

smg --enable-trace --otlp-traces-endpoint jaeger:4317


TLS/mTLS Security Configuration

Server TLS

For HTTPS on the gateway:

Option Description
--tls-cert-path Path to server certificate (PEM format)
--tls-key-path Path to server private key (PEM format)

Client mTLS

For secure communication to workers (Python bindings):

Option Description
--client-cert-path Path to client certificate
--client-key-path Path to client private key
--ca-cert-paths Path(s) to CA certificate(s)

Control Plane Authentication

API Key (Worker Authorization)

Option --api-key
Environment -
Default None
Description API key for worker authorization (useful with --dp-aware scheduling)

Control Plane API Keys

Option --control-plane-api-keys
Environment CONTROL_PLANE_API_KEYS
Format id:name:role:key
Multiple Yes

Example:

--control-plane-api-keys 'key1:Admin:admin:secret123' 'key2:ReadOnly:user:secret456'
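
Each entry splits on `:` into the four documented fields. A small bash sketch, assuming the last field may itself contain `:` (read assigns the remainder of the line to its final variable); whether SMG accepts such secrets is not stated here, and `parse_api_key_entry` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: split a --control-plane-api-keys entry (id:name:role:key).

# Usage: parse_api_key_entry ENTRY ; prints "id=... name=... role=..." (secret omitted).
parse_api_key_entry() {
  local id name role key
  # The final variable receives the rest of the line, including any ':'.
  IFS=: read -r id name role key <<< "$1"
  if [[ -z "$id" || -z "$name" || -z "$role" || -z "$key" ]]; then
    echo "malformed entry, expected id:name:role:key" >&2
    return 1
  fi
  echo "id=$id name=$name role=$role"
}

parse_api_key_entry 'key1:Admin:admin:secret123'   # prints id=key1 name=Admin role=admin
```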

JWT/OIDC Authentication

Option Environment Description
--jwt-issuer JWT_ISSUER OIDC issuer URL
--jwt-audience JWT_AUDIENCE Expected audience claim
--jwt-jwks-uri JWT_JWKS_URI Explicit JWKS URI (auto-discovered if not set)
--jwt-role-claim - JWT claim containing role (default: roles)
--jwt-role-mapping - Role mapping from IDP to gateway role

JWT Role Mapping Example:

--jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user'

Audit Logging

Option --disable-audit-logging
Environment -
Default false (audit logging enabled)

Logging Configuration

Log Level

Option --log-level
Environment RUST_LOG
Default info
Values debug, info, warn, error

Per-Module Logging:

RUST_LOG=smg=debug,hyper=warn smg ...

Log Directory

Option --log-dir
Environment -
Default None (console only)
Description Directory to store log files

JSON Logs

Option --log-json
Environment -
Default false
Description Emit structured JSON logs instead of the default human-readable text logs.

Configuration Examples

Minimal Configuration

smg --worker-urls http://localhost:8000

High-Throughput Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 http://w3:8000 http://w4:8000 \
  --policy cache_aware \
  --max-concurrent-requests 200 \
  --queue-size 400 \
  --queue-timeout-secs 60 \
  --retry-max-retries 3

Low-Latency Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --policy power_of_two \
  --max-concurrent-requests 50 \
  --queue-size 25 \
  --queue-timeout-secs 5 \
  --health-check-interval-secs 5 \
  --request-timeout-secs 30

PD Disaggregated Mode

smg \
  --pd-disaggregation \
  --prefill http://prefill1:30001 9001 \
  --prefill http://prefill2:30002 9002 \
  --decode http://decode1:30003 \
  --decode http://decode2:30004 \
  --prefill-policy cache_aware \
  --decode-policy round_robin

Kubernetes Service Discovery

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware

High-Availability Mesh

# Router 1
smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527 \
  --worker-urls http://worker1:8000

# Router 2
smg \
  --enable-mesh \
  --mesh-server-name router-2 \
  --mesh-advertise-host 192.168.1.11 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.10:39527 \
  --worker-urls http://worker2:8000

Secure Production Configuration

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --policy cache_aware \
  --max-concurrent-requests 100 \
  --tls-cert-path /etc/certs/server.crt \
  --tls-key-path /etc/certs/server.key \
  --jwt-issuer https://login.microsoftonline.com/tenant/v2.0 \
  --jwt-audience api://smg-gateway \
  --jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user' \
  --enable-trace \
  --otlp-traces-endpoint jaeger:4317 \
  --host 0.0.0.0 \
  --port 443

With Tokenizer and Parsers

smg \
  --worker-urls http://localhost:8000 \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser json

With Database Backend

# PostgreSQL
smg \
  --worker-urls http://localhost:8000 \
  --history-backend postgres \
  --postgres-db-url "postgres://user:pass@localhost:5432/smg" \
  --postgres-pool-max-size 32

# Redis
smg \
  --worker-urls http://localhost:8000 \
  --history-backend redis \
  --redis-url "redis://localhost:6379" \
  --redis-pool-max-size 32 \
  --redis-retention-days 7

Environment Variable Reference

Environment Variable CLI Option Description
RUST_LOG --log-level Log level
ATP_WALLET_PATH --oracle-wallet-path Oracle wallet path
ATP_TNS_ALIAS --oracle-tns-alias Oracle TNS alias
ATP_DSN --oracle-dsn Oracle DSN
ATP_USER --oracle-user Oracle username
ATP_PASSWORD --oracle-password Oracle password
ATP_EXTERNAL_AUTH --oracle-external-auth Enable Oracle external authentication
ATP_POOL_MIN --oracle-pool-min Oracle min pool size
ATP_POOL_MAX --oracle-pool-max Oracle max pool size
ATP_POOL_TIMEOUT_SECS --oracle-pool-timeout-secs Oracle pool timeout
POSTGRES_DB_URL --postgres-db-url PostgreSQL URL
POSTGRES_POOL_MAX --postgres-pool-max-size PostgreSQL max pool
REDIS_URL --redis-url Redis URL
REDIS_POOL_MAX --redis-pool-max-size Redis max pool
REDIS_RETENTION_DAYS --redis-retention-days Redis retention
JWT_ISSUER --jwt-issuer JWT issuer URL
JWT_AUDIENCE --jwt-audience JWT audience
JWT_JWKS_URI --jwt-jwks-uri JWKS URI
CONTROL_PLANE_API_KEYS --control-plane-api-keys Control plane API keys