Monitoring¶
Set up Prometheus monitoring, OpenTelemetry tracing, and Grafana dashboards for SMG.
Before you begin¶
- Completed the Getting Started guide
- Prometheus server (or follow steps below to deploy)
- Grafana (optional, for dashboards)
- OTLP collector (optional, for distributed tracing)
Enable Metrics¶
SMG exposes Prometheus metrics on a dedicated port with a 6-layer metric hierarchy.
Start SMG with metrics¶
Verify metrics endpoint¶
You should see Prometheus-formatted metrics:
# HELP smg_http_requests_total Total HTTP requests
# TYPE smg_http_requests_total counter
smg_http_requests_total{method="POST",path="/v1/chat/completions"} 1234
...
OpenTelemetry Tracing¶
SMG supports distributed tracing via OpenTelemetry.
Enable tracing¶
Configuration¶
| Flag | Default | Description |
|---|---|---|
--enable-trace | false | Enable OpenTelemetry tracing |
--otlp-traces-endpoint | localhost:4317 | OTLP gRPC collector endpoint |
Trace propagation¶
SMG automatically propagates W3C TraceContext headers to workers:
traceparent— Trace ID and span IDtracestate— Vendor-specific trace data
Prometheus Configuration¶
Basic configuration¶
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'smg'
static_configs:
- targets: ['localhost:29000']
metrics_path: /metrics
Kubernetes ServiceMonitor¶
For Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: smg
namespace: inference
labels:
app: smg
spec:
selector:
matchLabels:
app: smg
endpoints:
- port: metrics
interval: 15s
path: /metrics
namespaceSelector:
matchNames:
- inference
Key Metrics by Layer¶
Layer 1: HTTP Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_http_requests_total | Counter | Requests by method, path |
smg_http_request_duration_seconds | Histogram | Request latency |
smg_http_responses_total | Counter | Responses by status_code, error_code |
smg_http_connections_active | Gauge | Active connections |
smg_http_rate_limit_total | Counter | Rate limit decisions |
Layer 2: Router Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_router_requests_total | Counter | Requests by router_type, model, endpoint |
smg_router_ttft_seconds | Histogram | Time to first token (gRPC) |
smg_router_tpot_seconds | Histogram | Time per output token (gRPC) |
smg_router_tokens_total | Counter | Tokens by type (input/output) |
smg_router_stage_duration_seconds | Histogram | Pipeline stage durations |
Layer 3: Worker Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_worker_health | Gauge | Health status (1=healthy, 0=unhealthy) |
smg_worker_requests_active | Gauge | Active requests per worker |
smg_worker_cb_state | Gauge | Circuit breaker state |
smg_worker_retries_total | Counter | Retry attempts |
Layer 5: MCP Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_mcp_tool_calls_total | Counter | Tool invocations by tool_name, result |
smg_mcp_tool_duration_seconds | Histogram | Tool execution time |
smg_mcp_servers_active | Gauge | Active MCP servers |
Grafana Dashboards¶
Essential panels¶
Request Rate
P99 Latency
Error Rate
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
/ sum(rate(smg_http_responses_total[5m]))
Time to First Token (TTFT)
Tokens per Second
Worker Health
Alerting Rules¶
groups:
- name: smg
rules:
- alert: SMGHighErrorRate
expr: |
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
/ sum(rate(smg_http_responses_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on SMG"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: SMGWorkerUnhealthy
expr: smg_worker_health == 0
for: 1m
labels:
severity: warning
annotations:
summary: "SMG worker unhealthy"
description: "Worker {{ $labels.worker }} is unhealthy"
- alert: SMGHighLatency
expr: |
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High latency on SMG"
description: "P99 latency is {{ $value }}s"
- alert: SMGCircuitBreakerOpen
expr: smg_worker_cb_state == 1
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker open"
description: "Circuit breaker for {{ $labels.worker }} is open"
- alert: SMGHighTTFT
expr: |
histogram_quantile(0.95, rate(smg_router_ttft_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High time to first token"
description: "P95 TTFT is {{ $value }}s"
- alert: SMGRateLimitRejections
expr: rate(smg_http_rate_limit_total{decision="rejected"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High rate limit rejections"
description: "{{ $value }} rejections/sec"
Useful Queries¶
Request analysis¶
# Request rate by endpoint
sum by (path) (rate(smg_http_requests_total[5m]))
# Success rate
sum(rate(smg_http_responses_total{status_code="200"}[5m]))
/ sum(rate(smg_http_responses_total[5m]))
# Latency percentiles
histogram_quantile(0.50, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m]))
LLM performance¶
# Tokens per second by model
sum by (model) (rate(smg_router_tokens_total[5m]))
# TTFT by model
histogram_quantile(0.5, sum by (model, le) (rate(smg_router_ttft_seconds_bucket[5m])))
# Input/output token ratio
sum(rate(smg_router_tokens_total{type="output"}[5m]))
/ sum(rate(smg_router_tokens_total{type="input"}[5m]))
Worker analysis¶
# Load distribution
smg_worker_requests_active / ignoring(worker) group_left sum(smg_worker_requests_active)
# Unhealthy workers
count(smg_worker_health == 0)
# Circuit breaker states
count by (worker) (smg_worker_cb_state == 1)
MCP tool analysis¶
# Tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m]))
/ sum(rate(smg_mcp_tool_calls_total[5m]))
# Most used tools
topk(10, sum by (tool_name) (rate(smg_mcp_tool_calls_total[5m])))
# Slowest tools
topk(5, histogram_quantile(0.95, sum by (tool_name, le) (rate(smg_mcp_tool_duration_seconds_bucket[5m]))))
Verification¶
# Check metrics are being scraped
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="smg")'
# Query a metric
curl -s 'http://prometheus:9090/api/v1/query?query=smg_http_requests_total' | jq
# Check alerts
curl -s http://prometheus:9090/api/v1/alerts | jq
Troubleshooting¶
Metrics endpoint not responding
-
Verify SMG is running with
--prometheus-port: -
Check the port is listening:
-
Check firewall rules allow access
Traces not appearing
-
Verify OTLP endpoint is reachable:
-
Check SMG was started with
--enable-trace -
Verify collector is receiving spans
Missing metrics
-
Ensure the feature generating metrics is enabled
-
Some metrics only appear for specific router types (e.g., TTFT is gRPC-only)
-
Verify metric name spelling in queries
What's Next?¶
- Configure Logging — Structured log aggregation
- Configure TLS — Secure client-to-gateway traffic
- Metrics Reference — Complete metrics documentation