Architecture Overview¶
SMG is a high-performance inference gateway that sits between your applications and LLM workers. It provides unified routing, enterprise features, and full observability across heterogeneous model deployments.
System Architecture¶
Registries & State¶
Registries hold the configuration and state needed for request processing.
| Registry | Purpose | Used By |
|---|---|---|
| Model Registry | Maps model names to backends and capabilities | Router Manager |
| LB Policy Registry | Load balancing configurations per model | All routing paths |
| Tokenizer Registry | Tokenizers for gateway-side processing | gRPC path |
| Chat History | Multi-turn conversation context | Responses API |
| WASM Plugins | Custom request/response transformations | Middleware |
API Endpoints¶
SMG exposes three categories of endpoints:
Inference APIs¶
| Endpoint | Description |
|---|---|
POST /v1/chat/completions | OpenAI-compatible chat completions |
POST /v1/completions | Text completions |
POST /v1/responses | Agentic workflows with tool execution |
POST /v1/embeddings | Embedding generation |
POST /v1/rerank | Reranking API |
POST /messages | Anthropic Messages API |
Utility APIs¶
| Endpoint | Description |
|---|---|
POST /tokenize | Tokenize text using model's tokenizer |
POST /detokenize | Convert token IDs back to text |
POST /v1/parser/tool | Parse tool calls from text |
POST /v1/parser/reasoning | Parse reasoning chains |
Admin APIs¶
| Endpoint | Description |
|---|---|
GET/POST /workers | Worker management |
GET/POST /tokenizers | Tokenizer management |
GET/POST /wasm | WASM plugin management |
GET/POST /mcp | MCP server management |
Gateway Layer¶
The gateway layer handles cross-cutting concerns before requests reach the router.
Middleware Pipeline¶
| Component | Function |
|---|---|
| Rate Limiter | Multi-tenant token bucket with per-user quotas |
| OIDC Auth | JWT validation and tenant extraction |
| WASM Plugins | Custom request transformation logic |
| Request ID | Assigns unique ID for tracing |
| Metrics | Records latency, throughput, error rates |
| OpenTelemetry | Distributed tracing spans |
Router Layer¶
The router layer handles LLM-specific request processing. It selects one of three routing paths based on worker type.
Router Manager¶
| Worker Type | Path Selected | Gateway Behavior |
|---|---|---|
| gRPC workers | gRPC Path | Full server - tokenization, chat templates, tool parsing |
| HTTP workers | HTTP Path | Smart proxy - load balancing, PD disaggregation |
| External APIs | 3rd Party Path | Unified router - provider abstraction |
gRPC Path (Token-Level Streaming)¶
The gRPC path provides maximum performance by handling all text processing at the gateway.
Pipeline Stages¶
| Stage | Function |
|---|---|
| Chat Template | Apply model-specific chat template (Jinja2) |
| Tokenization | Convert text to token IDs using model tokenizer |
| Token Cache | Cache tokenized prefixes for reuse |
| Load Balance | Select worker using cache-aware policy |
| Detokenize | Convert streaming tokens back to text |
| Reasoning Parser | Extract thinking/reasoning from output (DeepSeek-R1, etc.) |
| Tool Parser | Parse function/tool calls from output |
Supported Backends¶
- SGLang (gRPC)
- vLLM (gRPC)
- TensorRT-LLM (gRPC)
HTTP Path (OpenAI Compatible)¶
The HTTP path supports two modes for OpenAI-compatible backends.
Regular HTTP Mode¶
Standard load balancing across HTTP workers running full inference.
PD (Prefill-Decode) Mode¶
Disaggregated inference with separate prefill and decode workers:
- Find P/D Pair - Select a prefill worker and decode worker pair
- Mutate Headers - Add routing headers for KV cache transfer
- Prefill Worker - Processes prompt, transfers KV cache
- Decode Worker - Generates tokens using transferred KV cache
Supported Backends¶
- SGLang (HTTP)
- vLLM (HTTP)
- TensorRT-LLM (HTTP)
3rd Party Path¶
The 3rd party path routes to external LLM providers through a unified interface.
Model Discovery¶
The gateway discovers available models from external providers and exposes them through /v1/models.
Supported Providers¶
| Provider | API Style |
|---|---|
| OpenAI | OpenAI |
| Anthropic | Messages |
| Google Gemini | Gemini |
| xAI Grok | OpenAI |
| Together AI | OpenAI |
| OpenRouter | OpenAI |
| AWS Bedrock | Bedrock |
| OCI Generative AI | OCI |
Response Processing¶
All paths converge at response processing for tool handling and MCP execution.
Components¶
| Component | Function |
|---|---|
| Tool Parser | Extracts function/tool calls from model output |
| MCP Handler | Executes tools via Model Context Protocol servers |
| Response Builder | Assembles final response with tool results |
MCP Loop¶
When the model requests tool execution:
- Tool parser extracts the tool call
- MCP handler executes the tool
- Result is re-routed through the router for continued generation
- Loop continues until model produces final response
Load Balancing¶
All paths use the same load balancing infrastructure with multiple policies.
| Policy | Algorithm | Best For |
|---|---|---|
cache_aware | Radix tree prefix matching + load | Production default |
bucket | Request-length buckets | PD disaggregation |
power_of_two | Sample two, pick lighter | Load-aware routing |
consistent_hashing | Hash ring with virtual nodes | Session affinity |
prefix_hash | Prefix token hash | Lightweight cache locality |
manual | Explicit routing key mapping | Stateful chat |
round_robin | Sequential cycling | Even distribution |
random | Uniform random | Testing |
Cache-Aware Routing¶
The cache-aware policy optimizes for KV cache reuse:
- Tokenize the request prefix
- Search radix tree for longest matching prefix per worker
- If match ratio ≥ threshold, route to matched worker
- Otherwise, route to worker with most cache capacity
- Falls back to least-loaded when system is imbalanced
This integrates with SGLang, vLLM, and TensorRT-LLM's native KV cache management.
Resilience¶
Built-in resilience features protect against failures.
| Feature | Function |
|---|---|
| Circuit Breaker | Stops routing to failing workers |
| Retry Handler | Retries failed requests with exponential backoff |
| Health Checker | Periodic worker health probes |
| Timeout Manager | Request and connection timeouts |
What's Next?¶
- Service Discovery - Automatic worker discovery in Kubernetes
- gRPC Pipeline - Token-level streaming implementation
- High Availability - Multi-instance mesh networking
- Load Balancing - Routing policy deep dive
- Cache-Aware Routing - KV cache optimization
- PD Disaggregation - Prefill-decode separation