
Shepherd Model Gateway

The high-performance inference gateway for production LLM deployments

Route, balance, and orchestrate traffic across your LLM fleet with enterprise-grade reliability.

Get Started · View on GitHub

70% TTFT Reduction
<1ms Routing Latency
40+ Metrics
100% OpenAI Compatible
Works With
vLLM · SGLang · TensorRT-LLM · OpenAI · Claude · Gemini

Why Shepherd Model Gateway?

SMG sits between your applications and LLM workers, providing a unified control and data plane for managing inference at scale. Whether you're running a single model or orchestrating hundreds of workers across multiple clusters, SMG gives you the tools to do it reliably.
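Because the API surface is OpenAI-compatible, a request body built for api.openai.com works unchanged against the gateway; only the base URL differs. A minimal standard-library sketch — the gateway address and model name below are placeholders, not SMG defaults:

```python
import json
import urllib.request

# Hypothetical gateway address; substitute your SMG endpoint.
BASE_URL = "http://localhost:8080/v1"

body = {
    "model": "llama-3.1-8b",  # whatever model your workers serve
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}

# The payload is a standard OpenAI chat-completions request; only the
# URL distinguishes "talk to SMG" from "talk to OpenAI".
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; omitted here since it
# requires a running gateway.
print(req.full_url)
```

In practice this means existing OpenAI SDK clients can point at SMG by changing their base URL, with no code changes to request construction.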

Full OpenAI Server Mode

With gRPC workers, SMG becomes a complete OpenAI-compatible server — handling tokenization, chat templates, tool calling, MCP (Model Context Protocol), reasoning loops, and detokenization at the gateway level.

High Performance

Native Rust implementation with gateway-side tokenization caching, token-level streaming, and sub-millisecond routing. Built for throughput at scale.

Enterprise Reliability

Circuit breakers, automatic retries with exponential backoff, rate limiting, and health monitoring. Your inference stack stays up.
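Automatic retries with exponential backoff are a standard reliability pattern; a minimal sketch of the idea in Python (the base delay and cap here are illustrative, not SMG's actual defaults):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0) -> list[float]:
    """Exponential backoff with full jitter: each retry waits a random
    amount up to min(cap, base * 2**attempt), which spreads retries out
    and avoids synchronized retry storms against a recovering worker."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

delays = backoff_delays(5)
# Every delay is bounded both by the cap and by the exponential schedule.
assert all(0 <= d <= min(5.0, 0.1 * 2 ** i) for i, d in enumerate(delays))
```

Full jitter (random in [0, bound]) rather than a fixed exponential schedule is the usual choice for fleets, since it decorrelates retries across many clients.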

Full Observability

40+ Prometheus metrics, OpenTelemetry distributed tracing, and structured logging. Know exactly what's happening.
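Prometheus metrics use a plain-text exposition format, so a scrape is easy to inspect or post-process by hand. A small sketch parsing a sample scrape — the metric names below are illustrative, not SMG's actual metric names:

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {series: value},
    skipping blank lines and # HELP / # TYPE comments."""
    out: dict[str, float] = {}
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)  # "name{labels} value"
        out[name] = float(value)
    return out

sample = """\
# HELP smg_requests_total Total requests handled (illustrative name)
# TYPE smg_requests_total counter
smg_requests_total{route="/v1/chat/completions"} 1042
smg_request_duration_seconds_sum 12.5
"""
metrics = parse_prometheus(sample)
```

In production you would point Prometheus at the gateway's metrics endpoint rather than parsing by hand; the sketch just shows how readable the format is.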


How It Works

SMG Architecture

gRPC Mode

Gateway = Full Server

SMG handles everything: tokenization, chat templates, tool parsing, MCP loops, detokenization, and prefill/decode (PD) routing. Workers run raw inference on SGLang, vLLM, or TensorRT-LLM.

HTTP Mode

Gateway = Smart Proxy

SMG handles routing, load balancing, and failover. Workers run full OpenAI-compatible servers (SGLang, vLLM, TensorRT-LLM). Supports PD disaggregation.

External Mode

Gateway = Unified Router

Route to OpenAI, Claude, Gemini through one endpoint. Mix self-hosted and cloud models seamlessly.
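Conceptually, unified routing is a lookup from model name to backend. A toy sketch of the idea — the prefixes and backend labels are assumptions for illustration, not SMG's configuration format:

```python
# Map model-name prefixes to backends; anything unmatched falls through
# to the self-hosted fleet. Prefixes and labels are illustrative only.
ROUTES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}

def route(model: str) -> str:
    """Pick a backend for a request based on its model name."""
    for prefix, backend in ROUTES.items():
        if model.startswith(prefix):
            return backend
    return "self-hosted"

assert route("gpt-4o") == "openai"
assert route("llama-3.1-8b") == "self-hosted"
```

Clients see one endpoint and one API shape; the gateway decides per request whether the model lives in the cloud or on your own workers.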


Choose Your Path

New to SMG?

Start here to understand what SMG does and get it running in minutes.

Getting Started →

Learn the Concepts

Understand SMG's architecture, routing strategies, and reliability features.

Read Concepts →

Ops Setup

Harden your deployment with the monitoring, logging, and TLS guides.

View Getting Started Guides →

API Reference

Complete reference for the OpenAI-compatible API and SMG extensions.

View Reference →