Concepts

This section explains the core concepts behind Shepherd Model Gateway (SMG). Understanding them helps you design, deploy, and operate SMG effectively.

What is SMG?

Shepherd Model Gateway is an inference gateway that adapts to your deployment:

With gRPC Workers	With HTTP Workers	With External APIs
Full OpenAI-compatible server	Intelligent proxy	Unified routing
Tokenization + caching	Load balancing	Model discovery
Tool parsing + MCP	PD disaggregation	Provider abstraction
Reasoning loops	Health-aware failover	API translation

Unlike generic load balancers, SMG understands LLM-specific traffic patterns such as prefix caching, token streaming, and KV cache affinity.

Core Components

Layer	Purpose
API Layer	Inference, utility, and admin endpoints
Router Manager	Selects path based on worker type
Service Discovery	Health monitoring, worker registration
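As a rough illustration of what "selects path based on worker type" means, the sketch below maps each worker category from the table under "What is SMG?" to a processing path. The `WorkerType` enum, the `PATHS` mapping, the path names, and `select_path` are all hypothetical and chosen for this example; they are not SMG's internal API.

```python
from enum import Enum


class WorkerType(Enum):
    # Hypothetical worker categories, mirroring the worker-type table above.
    GRPC = "grpc"
    HTTP = "http"
    EXTERNAL = "external"


# Hypothetical mapping from worker type to the processing path the gateway uses.
PATHS = {
    WorkerType.GRPC: "grpc-pipeline",     # gateway tokenizes, parses tools/reasoning, streams tokens
    WorkerType.HTTP: "http-proxy",        # gateway forwards OpenAI-style HTTP requests
    WorkerType.EXTERNAL: "external-api",  # gateway translates requests for a provider API
}


def select_path(worker_type: WorkerType) -> str:
    """Return the processing path for a worker's type (illustrative only)."""
    return PATHS[worker_type]


print(select_path(WorkerType.GRPC))  # grpc-pipeline
```

Keeping the dispatch as a single lookup means a new backend category only needs a new enum member and one mapping entry.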

Key Concepts

Architecture

How SMG's control plane and data plane work together.

Learn about Architecture →

Load Balancing

Routing strategies from simple random selection to cache-aware algorithms.

Learn about Routing →
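To make the range of strategies concrete, here is a minimal sketch of three generic selection policies: uniform random, round-robin, and power-of-two-choices on in-flight load. These are textbook policies written for illustration; the function and class names and the `load` dictionary are hypothetical, and the policies SMG actually ships (and their parameters) are described in the routing documentation.

```python
import itertools
import random


def random_policy(workers: list[str], rng: random.Random) -> str:
    """Pick any worker uniformly at random."""
    return rng.choice(workers)


class RoundRobinPolicy:
    """Cycle through workers in a fixed order."""

    def __init__(self, workers: list[str]) -> None:
        self._cycle = itertools.cycle(workers)

    def pick(self) -> str:
        return next(self._cycle)


def power_of_two_policy(load: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct workers and keep the one with fewer in-flight requests.

    Assumes at least two workers; `load` maps worker name -> in-flight count.
    """
    a, b = rng.sample(sorted(load), 2)
    return a if load[a] <= load[b] else b


rng = random.Random(0)
rr = RoundRobinPolicy(["w1", "w2", "w3"])
print([rr.pick() for _ in range(4)])                  # ['w1', 'w2', 'w3', 'w1']
print(power_of_two_policy({"w1": 3, "w2": 0}, rng))   # w2
```

Power-of-two-choices is a common middle ground: it reacts to load like a least-loaded policy but only inspects two workers per decision.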

Performance

Tokenizer caching and optimization strategies for high-throughput deployments.

Learn about Performance →

Extensibility

WASM plugins and MCP integration for custom middleware and external tools.

Learn about Extensibility →

Reliability

Circuit breakers, retries, and rate limiting for resilient deployments.

Learn about Reliability →
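Retries pair naturally with failover: a failed attempt is retried on a different worker after a short backoff. The sketch below shows that pattern under stated assumptions: `call_with_retries`, the `send` callable, the attempt limit, the 50 ms base delay, and treating only `ConnectionError` as retryable are all hypothetical choices, not SMG's documented behavior. The `sleep` parameter is injectable so the backoff can be observed without waiting.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    workers: list[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try up to max_attempts distinct workers, backing off exponentially between attempts."""
    last_error: Optional[Exception] = None
    for attempt, worker in enumerate(workers[:max_attempts]):
        if attempt > 0:
            # Exponential backoff before each retry: base, 2*base, 4*base, ...
            sleep(base_delay_s * (2 ** (attempt - 1)))
        try:
            return send(worker)
        except ConnectionError as exc:  # assumption: only transport errors are retryable
            last_error = exc
    raise RuntimeError("all retry attempts failed") from last_error


def fake_send(worker: str) -> str:
    # Hypothetical transport: the first worker is unreachable.
    if worker == "w1":
        raise ConnectionError("w1 unreachable")
    return f"ok from {worker}"


delays: list[float] = []
print(call_with_retries(["w1", "w2", "w3"], fake_send, sleep=delays.append))  # ok from w2
print(delays)  # [0.05]
```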

Concept Categories

Architecture

Understand how SMG is structured internally:

Architecture Overview — High-level system design
Service Discovery — Kubernetes integration and worker registration
gRPC Pipeline — Token-level streaming implementation
High Availability — Multi-instance mesh networking

Routing

Learn how SMG selects workers for requests:

Load Balancing — Available routing policies
Cache-Aware Routing — Prefix-based worker affinity
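The idea behind prefix-based affinity can be sketched in a few lines: send a request to the worker that has already seen the longest matching prefix (so its KV cache is likely warm), unless the match is weak, in which case balance load instead. This is a simplified illustration, not SMG's algorithm: `CacheAwareRouter`, the `cache_threshold` of 0.5, character-level matching, and the linear scan (where a production router would likely use a tree structure) are all assumptions.

```python
import os


class CacheAwareRouter:
    """Hypothetical prefix-affinity router (character-level, linear scan)."""

    def __init__(self, workers: list[str], cache_threshold: float = 0.5) -> None:
        self.history = {w: [] for w in workers}  # prompts previously routed to each worker
        self.load = {w: 0 for w in workers}      # in-flight requests per worker
        self.cache_threshold = cache_threshold

    def _best_match(self, prompt: str) -> tuple[str, int]:
        best_worker, best_len = next(iter(self.history)), 0
        for worker, prompts in self.history.items():
            for seen in prompts:
                n = len(os.path.commonprefix([prompt, seen]))
                if n > best_len:
                    best_worker, best_len = worker, n
        return best_worker, best_len

    def route(self, prompt: str) -> str:
        worker, matched = self._best_match(prompt)
        if not prompt or matched / len(prompt) < self.cache_threshold:
            # Weak prefix match: fall back to the least-loaded worker.
            worker = min(self.load, key=self.load.__getitem__)
        self.history[worker].append(prompt)
        self.load[worker] += 1
        return worker

    def finish(self, worker: str) -> None:
        # Call when a request completes so load reflects in-flight work.
        self.load[worker] -= 1


router = CacheAwareRouter(["w1", "w2"])
system = "You are a helpful assistant. "
print(router.route(system + "Summarize this article."))  # w1
print(router.route("Translate to French: hello"))        # w2
print(router.route(system + "Write a haiku."))           # w1
```

The third request shares the system-prompt prefix with the first, so it follows it to `w1` even though `w1` is no less loaded than `w2`.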

Performance

Optimize SMG for high-throughput deployments:

Tokenizer Caching — Two-level cache for tokenization
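The sketch below shows one generic way to put a two-level cache in front of a tokenizer: a small "hot" LRU tier backed by a larger LRU tier, with hits in the second tier promoted and first-tier evictions demoted. It is an assumption-laden illustration; `TwoLevelTokenCache`, the tier sizes, and the exact-match keying are hypothetical, and SMG's actual cache levels may be organized differently (see the Tokenizer Caching page).

```python
from collections import OrderedDict
from typing import Callable


class TwoLevelTokenCache:
    """Hypothetical two-tier LRU cache in front of a tokenizer callable."""

    def __init__(self, tokenize: Callable[[str], list[int]],
                 l1_size: int = 2, l2_size: int = 8) -> None:
        self.tokenize = tokenize
        self.l1 = OrderedDict()  # small, hottest entries
        self.l2 = OrderedDict()  # larger, recently demoted entries
        self.l1_size, self.l2_size = l1_size, l2_size

    def _put_l1(self, text: str, ids: list[int]) -> None:
        self.l1[text] = ids
        self.l1.move_to_end(text)
        if len(self.l1) > self.l1_size:
            old_text, old_ids = self.l1.popitem(last=False)  # demote L1's LRU entry
            self.l2[old_text] = old_ids
            self.l2.move_to_end(old_text)
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)  # evict from the cache entirely

    def encode(self, text: str) -> list[int]:
        if text in self.l1:
            self.l1.move_to_end(text)  # refresh recency
            return self.l1[text]
        if text in self.l2:
            ids = self.l2.pop(text)    # promote an L2 hit into L1
        else:
            ids = self.tokenize(text)  # miss in both tiers: run the tokenizer
        self._put_l1(text, ids)
        return ids


calls: list[str] = []


def fake_tokenize(text: str) -> list[int]:
    calls.append(text)  # record real tokenizer invocations
    return [ord(c) for c in text]


cache = TwoLevelTokenCache(fake_tokenize, l1_size=1, l2_size=1)
for text in ["a", "b", "a"]:
    cache.encode(text)
print(calls)  # ['a', 'b']
```

The third `encode("a")` is served from the second tier after `"a"` was demoted, so the tokenizer runs only twice.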

Extensibility

Extend SMG with custom logic and external tools:

WASM Plugins — Dynamic middleware with WebAssembly
Model Context Protocol — External tool integration via MCP

Reliability

Understand how SMG handles failures:

Circuit Breakers — Preventing cascade failures
Rate Limiting — Protecting workers from overload
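A circuit breaker prevents cascade failures by temporarily refusing traffic to a worker that keeps failing. The sketch below implements the classic closed, open, and half-open states; the `CircuitBreaker` class, `failure_threshold`, and `cooldown_s` are hypothetical names and defaults, not SMG's configuration. For simplicity it admits every request while half-open, whereas real breakers usually admit only a limited number of probe requests. The `clock` parameter is injectable so the cooldown can be tested without waiting.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical per-worker breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: let trial traffic through
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        # Any success (including a half-open trial) closes the breaker.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            self.opened_at = self.clock()  # trial failed: reopen and restart cooldown
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # too many consecutive failures: open


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=5.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state)  # open
now[0] = 5.0
print(breaker.state)  # half-open
```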

Design Principles

SMG is built on five core principles:

Transparency: SMG should be invisible to well-behaved applications. The same requests that work against a single worker should work through SMG.
Performance: Routing decisions happen in microseconds. SMG never becomes the bottleneck.
Reliability: Individual worker failures don't cause application failures. SMG routes around problems automatically.
Observability: You can always understand what SMG is doing through metrics, traces, and logs.
Simplicity: Common cases are simple. Advanced features are available but not required.