gRPC Workers

When workers connect via gRPC instead of HTTP, SMG becomes a full OpenAI-compatible server — handling tokenization, chat templates, reasoning extraction, and tool calling at the gateway level. Workers run raw inference only.

Before you begin

  • Completed the Getting Started guide
  • A gRPC-capable inference worker (e.g., SGLang, vLLM, or TensorRT-LLM with a gRPC entrypoint)
  • Access to the model weights or a HuggingFace model path (for tokenizer loading)

What gRPC Mode Enables

Capability | HTTP Mode (worker handles) | gRPC Mode (gateway handles)
Chat templates | Worker | Gateway
Tokenization | Worker | Gateway (with caching)
Load balancing | Request-level | Token-aware
Reasoning extraction | Worker | Gateway
Tool call parsing | Worker | Gateway
MCP tool execution (Responses API) | N/A | Gateway

In HTTP mode, SMG is a smart proxy — routing and failover only. In gRPC mode, SMG takes over the full request processing pipeline.


Start a gRPC Worker

Launch the worker in gRPC mode with the engine of your choice:

SGLang:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode

vLLM:

python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051

TensorRT-LLM:

python -m tensorrt_llm.commands.serve serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --host 0.0.0.0 \
  --port 50051 \
  --backend pytorch

Connect SMG

Point SMG at the gRPC worker using grpc:// URLs and provide --model-path so the gateway can load the tokenizer:

smg \
  --worker-urls grpc://localhost:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

--model-path is required

The gateway needs the tokenizer to apply chat templates, count tokens for load balancing, and parse tool calls. This can be a HuggingFace model ID or a local path.
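
To see what this involves, here is a rough sketch of the equivalent steps using the transformers library; it illustrates the gateway's job and is not SMG's actual implementation:

# Illustrative sketch: what "the gateway handles chat templates and
# tokenization" means, reproduced with Hugging Face transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]

# Apply the model's chat template (in HTTP mode the worker does this).
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tokenize so the prompt length can feed token-aware load balancing.
input_ids = tokenizer(prompt).input_ids
print(f"{len(input_ids)} prompt tokens")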

The API is still OpenAI-compatible — clients send the same requests as with HTTP workers:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
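
The same request via the official OpenAI Python SDK (assuming the openai package is installed; the API key is a placeholder unless you configured one on the gateway):

from openai import OpenAI

# Point the SDK at the gateway instead of api.openai.com.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)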

Multiple gRPC Workers

Pass every worker's grpc:// URL and pick a load-balancing policy:

smg \
  --worker-urls grpc://worker1:50051 grpc://worker2:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin
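
Conceptually, round_robin just cycles through the worker list, one request per worker in turn. A toy sketch, not the gateway's actual scheduler:

import itertools

# Hypothetical worker list matching the command above.
workers = ["grpc://worker1:50051", "grpc://worker2:50052"]
cycle = itertools.cycle(workers)

for request_id in range(4):
    # Requests alternate: worker1, worker2, worker1, worker2, ...
    print(request_id, "->", next(cycle))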

Reasoning Extraction

For thinking models (DeepSeek-R1, Qwen3, etc.), SMG can extract chain-of-thought content into a separate field:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path deepseek-ai/DeepSeek-R1 \
  --reasoning-parser deepseek_r1

The parser is normally auto-detected from the model name; pass --reasoning-parser explicitly (as above) only to override the detection.

Request with separate_reasoning: true:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "separate_reasoning": true
  }'

Response includes both fields:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "925",
      "reasoning_content": "Let me calculate 25 * 37 step by step..."
    }
  }]
}
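
Reading both fields from Python (a sketch using the requests library, mirroring the request and response above):

import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "What is 25 * 37?"}],
        "separate_reasoning": True,
    },
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))
print("answer:   ", message["content"])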

Supported Reasoning Parsers

Auto-detected from the model name. Override with --reasoning-parser if needed.

Parser | Models
deepseek_r1 | DeepSeek-R1
qwen3 | Qwen3, Nemotron Nano
qwen_thinking | Qwen3-Thinking
kimi | Kimi
glm45 | GLM-4.5, GLM-4.7
step3 | Step-3
minimax | MiniMax, MiniMax-M2
cohere_cmd | Command-R, Command-A, C4AI
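
As an illustration of what these parsers do: DeepSeek-R1-style models wrap their chain of thought in <think>...</think> tags, and the parser splits the raw output into the two fields shown earlier. A simplified sketch, not SMG's parser:

import re

raw = "<think>25 * 40 = 1000, minus 25 * 3 = 75, so 925.</think>925"

# Split the raw completion into reasoning and final answer.
match = re.match(r"<think>(.*?)</think>(.*)", raw, re.DOTALL)
reasoning_content, content = match.group(1), match.group(2).strip()
print(reasoning_content)  # the chain of thought
print(content)            # "925"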

Tool Calling

In gRPC mode, SMG parses function calls from model output:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --tool-call-parser llama
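
A standard OpenAI-style tools request then works unchanged; the gateway parses the model's raw output into structured tool_calls. SDK example (get_weather is a made-up function for illustration):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)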

For MCP tool execution in the Responses API (/v1/responses together with --mcp-config-path), see the dedicated guide: Getting Started → MCP in Responses API.

Supported Tool Call Parsers

Auto-detected from the model name. Override with --tool-call-parser if needed.

Parser | Models
json | GPT-4/4o, Claude, Gemini, Gemma, Llama (generic)
llama | Llama 3.2
pythonic | Llama 4, DeepSeek (generic)
deepseek | DeepSeek-V3
mistral | Mistral, Mixtral
qwen | Qwen
qwen_coder | Qwen3-Coder, Qwen2.5-Coder
glm45_moe | GLM-4.5, GLM-4.6
glm47_moe | GLM-4.7
step3 | Step-3
kimik2 | Kimi-K2
minimax_m2 | MiniMax
cohere | Command-R, Command-A, C4AI
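
The parser names describe the raw format the model emits. For example, json-style models produce a JSON object while pythonic-style models (Llama 4) produce a Python-like call; a simplified sketch of the two shapes:

import ast
import json

# json-style raw output (generic `json` parser):
raw_json = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(raw_json)
print(call["name"], call["arguments"])

# pythonic-style raw output (`pythonic` parser):
raw_pythonic = '[get_weather(city="Paris")]'
node = ast.parse(raw_pythonic, mode="eval").body.elts[0]
name = node.func.id
args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
print(name, args)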

HTTP vs gRPC: When to Use Which

Use Case | Recommended Mode
Workers already run OpenAI servers (SGLang, vLLM HTTP) | HTTP
You need gateway-level tool parsing or Responses MCP | gRPC
You want token-aware load balancing | gRPC
You use thinking models and want reasoning extraction | gRPC
Simplest possible setup | HTTP

Next Steps