gRPC Pipeline¶
When workers communicate via gRPC, SMG becomes a complete OpenAI-compatible server with a sophisticated request processing pipeline for reasoning extraction, tool call parsing, and MCP execution.
Overview¶
Chat Templates¶
Apply model-specific chat templates with full Jinja2 support for all major model families.
Tokenization Caching¶
Two-level tokenization cache reduces CPU overhead by 60-90% for repeated content.
Reasoning Extraction¶
Extract chain-of-thought content from thinking models (DeepSeek-R1, Qwen3, etc.).
Tool Call Parsing¶
Parse function calls and execute MCP tools with automatic result injection.
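To make the two-level tokenization cache mentioned above concrete, here is one plausible design sketched in Python: a level-1 exact-string cache plus a level-2 prefix cache that reuses tokens for a shared prefix such as the system prompt. All class and method names here are illustrative assumptions, not SMG's implementation, and the toy word-split tokenizer sidesteps the fact that real BPE tokenizers cannot be safely split at arbitrary text boundaries.

```python
class TwoLevelTokenCache:
    """Illustrative two-level tokenization cache (hypothetical design)."""

    def __init__(self, tokenize):
        self._tokenize = tokenize
        self._exact = {}    # level 1: full string -> token list
        self._prefix = {}   # level 2: registered prefix -> token list

    def register_prefix(self, prefix):
        # Pre-tokenize a frequently repeated prefix (e.g. the system prompt).
        self._prefix[prefix] = self._tokenize(prefix)

    def encode(self, text):
        if text in self._exact:                    # level 1 hit: no work at all
            return self._exact[text]
        for prefix, ids in self._prefix.items():   # level 2: reuse prefix tokens
            if text.startswith(prefix):
                tokens = ids + self._tokenize(text[len(prefix):])
                break
        else:
            tokens = self._tokenize(text)          # full miss: tokenize everything
        self._exact[text] = tokens
        return tokens


# Toy tokenizer: one "token" per whitespace-separated word.
cache = TwoLevelTokenCache(lambda s: s.split())
cache.register_prefix("You are a helpful assistant. ")
toks = cache.encode("You are a helpful assistant. Hi")
```

With this shape, repeated requests that share a system prompt only pay tokenization cost for the new suffix, which is where the large CPU savings for repeated content would come from.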
Pipeline Architecture¶
gRPC Mode¶
Gateway = Full Server
SMG handles tokenization, chat templates, tool parsing, MCP loops, and detokenization. Workers run raw inference.
HTTP Mode¶
Gateway = Smart Proxy
SMG handles routing, load balancing, and failover. Workers run full OpenAI-compatible servers.
Responsibility Comparison¶
| Capability | gRPC Mode (Gateway) | HTTP Mode (Worker) |
|---|---|---|
| Chat template | Gateway | Worker |
| Tokenization | Gateway (cached) | Worker |
| Load balancing | Token-aware | Request count |
| Reasoning extraction | Gateway | Worker |
| Tool call parsing | Gateway | Worker |
| MCP execution | Gateway | N/A |
Reasoning Parsers¶
Reasoning parsers extract chain-of-thought content from model outputs. This is essential for models that emit thinking tokens before producing their final response.
Configuration¶
| Setting | Value |
|---|---|
| CLI option | `--reasoning-parser` |
| Environment variable | `SMG_REASONING_PARSER` |
| Default | Auto-detected from model name |
Supported Parsers¶
**DeepSeek-R1**

- Pattern: `*deepseek-r1*`
- Initial state: In reasoning
- Tokens: `</think>` to exit

**Qwen3**

- Pattern: `*qwen3*`
- Initial state: Not in reasoning
- Tokens: `<think>` / `</think>`

**Kimi**

- Pattern: `*kimi*`
- Initial state: Not in reasoning
- Tokens: Unicode markers
Complete Parser Reference¶
| Parser | Model Pattern | Initial State | Tokens |
|---|---|---|---|
| `deepseek_r1` | `*deepseek-r1*` | In reasoning | `</think>` |
| `qwen3` | `*qwen3*` | Not in reasoning | `<think>` / `</think>` |
| `qwen3_thinking` | `*qwen-thinking*` | In reasoning | `<think>` / `</think>` |
| `kimi` | `*kimi*` | Not in reasoning | Unicode markers |
| `glm45` | `*glm45*`, `*glm47*` | Not in reasoning | `<think>` / `</think>` |
| `step3` | `*step3*` | In reasoning | `<think>` / `</think>` |
| `minimax` | `*minimax*`, `*mm-m2*` | In reasoning | `<think>` appended |
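The "initial state" column matters because some models (e.g. DeepSeek-R1) start already inside a reasoning block and only emit a closing token, while others wrap reasoning in an explicit open/close pair. The following Python sketch illustrates that state machine; the function is purely illustrative and is not SMG's API.

```python
def extract_reasoning(text: str, start_tok: str = "<think>",
                      end_tok: str = "</think>",
                      initially_reasoning: bool = False):
    """Split raw model output into (reasoning_content, content)."""
    reasoning_parts, content_parts = [], []
    in_reasoning = initially_reasoning
    pos = 0
    while pos < len(text):
        # Look for the token that would flip the current state.
        tok = end_tok if in_reasoning else start_tok
        idx = text.find(tok, pos)
        if idx == -1:
            # No more state changes: the rest belongs to the current state.
            (reasoning_parts if in_reasoning else content_parts).append(text[pos:])
            break
        (reasoning_parts if in_reasoning else content_parts).append(text[pos:idx])
        pos = idx + len(tok)
        in_reasoning = not in_reasoning
    return "".join(reasoning_parts).strip(), "".join(content_parts).strip()


# DeepSeek-R1 style: starts inside reasoning, only `</think>` appears.
r1 = extract_reasoning("Let me think...</think>The answer is 42.",
                       initially_reasoning=True)

# Qwen3 style: reasoning is wrapped in an explicit `<think>...</think>` pair.
q3 = extract_reasoning("<think>hmm</think>ok")
```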
Output Format¶
When `separate_reasoning: true` is set in the request, the extracted reasoning is returned in a separate `reasoning_content` field alongside the final answer:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }]
}
```
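As a client-side sketch, the request below sets the `separate_reasoning` flag and reads both fields from the response. The payload shape assumes a standard OpenAI-compatible `/v1/chat/completions` endpoint; the model name is a placeholder.

```python
import json

# Illustrative request body asking SMG to split reasoning from the answer.
payload = {
    "model": "deepseek-r1",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 6 * 7?"}],
    "separate_reasoning": True,
}
body = json.dumps(payload)

# A client would POST `body` to /v1/chat/completions; the abbreviated
# response shape from the example above then exposes both fields:
response = {
    "choices": [{"message": {
        "role": "assistant",
        "content": "The answer is 42.",
        "reasoning_content": "Let me think step by step...",
    }}]
}
msg = response["choices"][0]["message"]
```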
Tool Call Parsers¶
Tool call parsers extract function calls from model output and validate arguments against schemas.
Configuration¶
| Setting | Value |
|---|---|
| CLI option | `--tool-call-parser` |
| Environment variable | `SMG_TOOL_CALL_PARSER` |
| Default | Auto-detected from model name |
Supported Parsers¶
**Llama**

Native Llama 3.2 function calling format.
Complete Parser Reference¶
| Parser | Model Pattern | Format |
|---|---|---|
| `passthrough` | Default fallback | No parsing (returns text unchanged) |
| `json` | `gpt-*`, `claude-*`, `gemini-*` | Standard JSON function calls |
| `mistral` | `mistral-*`, `mixtral-*` | Mistral-specific format |
| `qwen` | `qwen*`, `Qwen*` | JSON tool calls |
| `qwen_coder` | `Qwen*-Coder*`, `qwen*-coder*` | XML with parameter tags |
| `pythonic` | `llama-4*`, `deepseek-*` | Python-style function syntax |
| `llama` | `llama-3.2*` | Python tag with JSON |
| `deepseek` | `deepseek-v3*` | XML with function syntax |
| `glm45_moe` | `glm-4.5*`, `glm-4.6*` | GLM 4.5/4.6 MoE format |
| `glm47_moe` | `glm-4.7*` | GLM 4.7 MoE format |
| `step3` | `step3*`, `Step-3*` | Step-3 model format |
| `kimik2` | `kimi-k2*`, `Kimi-K2*` | Kimi K2 model format |
| `minimax_m2` | `minimax*`, `MiniMax*` | MiniMax M2 model format |
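To illustrate the `json` parser style from the table (a JSON object carrying the function name and arguments), here is a minimal extraction sketch. The regex, field names, and function are assumptions for illustration only, not SMG's implementation.

```python
import json
import re

# Greedily grab the outermost {...} span in the model output.
TOOL_CALL_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_json_tool_call(text: str):
    """Return (name, arguments) if the output contains a JSON tool call."""
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    try:
        obj = json.loads(m.group(0))
    except json.JSONDecodeError:
        return None
    if "name" not in obj:
        return None  # fails schema validation: no function name
    return obj["name"], obj.get("arguments", {})


call = parse_json_tool_call(
    'I will check the weather. {"name": "get_weather", "arguments": {"city": "Paris"}}'
)
```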
Tool Execution Flow¶
- Parse: Extract tool calls from model output
- Validate: Check arguments against tool schema
- Execute: Run MCP tools or return to client
- Inject: Add tool results back to conversation
- Continue: Resume generation if needed
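The five steps above can be sketched as a loop; the stub functions below stand in for SMG's model workers, parsers, and MCP client, and all names are illustrative.

```python
def run_tool_loop(prompt, generate, parse, execute, max_rounds=4):
    """Generate, execute any tool calls, inject results, and continue."""
    conversation = [{"role": "user", "content": prompt}]
    output = ""
    for _ in range(max_rounds):
        output = generate(conversation)           # raw model output
        calls = parse(output)                     # parse + validate tool calls
        if not calls:
            return output                         # no tools: final answer
        for name, args in calls:
            result = execute(name, args)          # run the MCP tool
            conversation.append(                  # inject the result
                {"role": "tool", "name": name, "content": result})
    return output                                 # give up after max_rounds


# Toy stand-ins: one tool round, then a final answer.
def fake_generate(conv):
    return "CALL add 2 3" if len(conv) == 1 else "The sum is 5."

def fake_parse(out):
    if out.startswith("CALL"):
        _, name, a, b = out.split()
        return [(name, (int(a), int(b)))]
    return []

def fake_execute(name, args):
    return str(sum(args))

answer = run_tool_loop("What is 2 + 3?", fake_generate, fake_parse, fake_execute)
```

The `max_rounds` cap guards against models that keep requesting tools indefinitely; a production gateway would also enforce per-call timeouts.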
Configuration¶
Parser CLI Options¶
| Option | Default | Description |
|---|---|---|
| `--reasoning-parser` | Auto | Reasoning parser type to use |
| `--tool-call-parser` | Auto | Tool call parser type to use |
| `--mcp-config-path` | None | Path to MCP server configuration file |
MCP Integration¶
When MCP is configured, SMG executes parsed tool calls automatically and injects the results back into the conversation before resuming generation.
See the MCP Guide for detailed configuration.
Recommended Configurations¶
Thinking Model¶
DeepSeek-R1 with reasoning extraction.
Tool Calling Model¶
Llama with MCP tool execution.
Full Pipeline¶
Complete configuration with all features.
Monitoring¶
Pipeline Metrics¶
| Metric | Description |
|---|---|
| `smg_pipeline_stage_duration_seconds` | Time spent in each pipeline stage |
| `smg_reasoning_extractions_total` | Reasoning tokens extracted |
| `smg_tool_calls_total` | Tool calls parsed, by parser type |
| `smg_tool_execution_duration_seconds` | Tool execution time |
| `smg_mcp_tool_calls_total` | MCP tool invocations |
Debug Logging¶
```bash
# Enable pipeline debug logging
RUST_LOG=smg::pipeline=debug smg ...

# Enable parser debug logging
RUST_LOG=smg::parsers=debug smg ...
```
Troubleshooting¶
| Symptom | Cause | Solution |
|---|---|---|
| Reasoning not extracted | Wrong parser | Check model and parser match |
| Tool calls not parsed | Format mismatch | Verify tool parser selection |
| MCP tools timeout | Slow tool execution | Check MCP server configuration |
| Empty `reasoning_content` | Model not thinking | Set `separate_reasoning: true` in the request |