Architecture Overview¶

SMG is a high-performance inference gateway that sits between your applications and LLM workers. It provides unified routing, enterprise features, and full observability across heterogeneous model deployments.

System Architecture¶

Registries & State¶

Registries hold the configuration and state needed for request processing.

Registry	Purpose	Used By
Model Registry	Maps model names to backends and capabilities	Router Manager
LB Policy Registry	Load balancing configurations per model	All routing paths
Tokenizer Registry	Tokenizers for gateway-side processing	gRPC path
Chat History	Multi-turn conversation context	Responses API
WASM Plugins	Custom request/response transformations	Middleware

API Endpoints¶

SMG exposes three categories of endpoints:

Inference APIs¶

Endpoint	Description
`POST /v1/chat/completions`	OpenAI-compatible chat completions
`POST /v1/completions`	Text completions
`POST /v1/responses`	Agentic workflows with tool execution
`POST /v1/embeddings`	Embedding generation
`POST /v1/rerank`	Reranking API
`POST /messages`	Anthropic Messages API

Utility APIs¶

Endpoint	Description
`POST /tokenize`	Tokenize text using model's tokenizer
`POST /detokenize`	Convert token IDs back to text
`POST /v1/parser/tool`	Parse tool calls from text
`POST /v1/parser/reasoning`	Parse reasoning chains

Admin APIs¶

Endpoint	Description
`GET/POST /workers`	Worker management
`GET/POST /tokenizers`	Tokenizer management
`GET/POST /wasm`	WASM plugin management
`GET/POST /mcp`	MCP server management

Gateway Layer¶

The gateway layer handles cross-cutting concerns before requests reach the router.

Middleware Pipeline¶

Component	Function
Rate Limiter	Multi-tenant token bucket with per-user quotas
OIDC Auth	JWT validation and tenant extraction
WASM Plugins	Custom request transformation logic
Request ID	Assigns unique ID for tracing
Metrics	Records latency, throughput, error rates
OpenTelemetry	Distributed tracing spans

Router Layer¶

The router layer handles LLM-specific request processing. It selects one of three routing paths based on worker type.

Router Manager¶

Worker Type	Path Selected	Gateway Behavior
gRPC workers	gRPC Path	Full server - tokenization, chat templates, tool parsing
HTTP workers	HTTP Path	Smart proxy - load balancing, PD disaggregation
External APIs	3^rd Party Path	Unified router - provider abstraction

gRPC Path (Token-Level Streaming)¶

The gRPC path provides maximum performance by handling all text processing at the gateway.

Pipeline Stages¶

Stage	Function
Chat Template	Apply model-specific chat template (Jinja2)
Tokenization	Convert text to token IDs using model tokenizer
Token Cache	Cache tokenized prefixes for reuse
Load Balance	Select worker using cache-aware policy
Detokenize	Convert streaming tokens back to text
Reasoning Parser	Extract thinking/reasoning from output (DeepSeek-R1, etc.)
Tool Parser	Parse function/tool calls from output

Supported Backends¶

SGLang (gRPC)
vLLM (gRPC)
TensorRT-LLM (gRPC)

HTTP Path (OpenAI Compatible)¶

The HTTP path supports two modes for OpenAI-compatible backends.

Regular HTTP Mode¶

Standard load balancing across HTTP workers running full inference.

PD (Prefill-Decode) Mode¶

Disaggregated inference with separate prefill and decode workers:

Find P/D Pair - Select a prefill worker and decode worker pair
Mutate Headers - Add routing headers for KV cache transfer
Prefill Worker - Processes prompt, transfers KV cache
Decode Worker - Generates tokens using transferred KV cache

Supported Backends¶

SGLang (HTTP)
vLLM (HTTP)
TensorRT-LLM (HTTP)

3^rd Party Path¶

The 3^rd party path routes to external LLM providers through a unified interface.

Model Discovery¶

The gateway discovers available models from external providers and exposes them through /v1/models.

Supported Providers¶

Provider	API Style
OpenAI	OpenAI
Anthropic	Messages
Google Gemini	Gemini
xAI Grok	OpenAI
Together AI	OpenAI
OpenRouter	OpenAI
AWS Bedrock	Bedrock
OCI Generative AI	OCI

Response Processing¶

All paths converge at response processing for tool handling and MCP execution.

Components¶

Component	Function
Tool Parser	Extracts function/tool calls from model output
MCP Handler	Executes tools via Model Context Protocol servers
Response Builder	Assembles final response with tool results

MCP Loop¶

When the model requests tool execution:

Tool parser extracts the tool call
MCP handler executes the tool
Result is re-routed through the router for continued generation
Loop continues until model produces final response

Load Balancing¶

All paths use the same load balancing infrastructure with multiple policies.

Policy	Algorithm	Best For
`cache_aware`	Radix tree prefix matching + load	Production default
`bucket`	Request-length buckets	PD disaggregation
`power_of_two`	Sample two, pick lighter	Load-aware routing
`consistent_hashing`	Hash ring with virtual nodes	Session affinity
`prefix_hash`	Prefix token hash	Lightweight cache locality
`manual`	Explicit routing key mapping	Stateful chat
`round_robin`	Sequential cycling	Even distribution
`random`	Uniform random	Testing

Cache-Aware Routing¶

The cache-aware policy optimizes for KV cache reuse:

Tokenize the request prefix
Search radix tree for longest matching prefix per worker
If match ratio ≥ threshold, route to matched worker
Otherwise, route to worker with most cache capacity
Falls back to least-loaded when system is imbalanced

This integrates with SGLang, vLLM, and TensorRT-LLM's native KV cache management.

Resilience¶

Built-in resilience features protect against failures.

Feature	Function
Circuit Breaker	Stops routing to failing workers
Retry Handler	Retries failed requests with exponential backoff
Health Checker	Periodic worker health probes
Timeout Manager	Request and connection timeouts

What's Next?¶

Service Discovery - Automatic worker discovery in Kubernetes
gRPC Pipeline - Token-level streaming implementation
High Availability - Multi-instance mesh networking
Load Balancing - Routing policy deep dive
Cache-Aware Routing - KV cache optimization
PD Disaggregation - Prefill-decode separation

Architecture Overview¶

System Architecture¶

Registries & State¶

API Endpoints¶

Inference APIs¶

Utility APIs¶

Admin APIs¶

Gateway Layer¶

Middleware Pipeline¶

Router Layer¶

Router Manager¶

gRPC Path (Token-Level Streaming)¶

Pipeline Stages¶

Supported Backends¶

HTTP Path (OpenAI Compatible)¶

Regular HTTP Mode¶

PD (Prefill-Decode) Mode¶

Supported Backends¶

3rd Party Path¶

Model Discovery¶

Supported Providers¶

Response Processing¶

Components¶

MCP Loop¶

Load Balancing¶

Cache-Aware Routing¶

Resilience¶

What's Next?¶

3^rd Party Path¶