Metrics Reference¶
Complete reference for Prometheus metrics exposed by SMG. Metrics are organized in six layers matching the request lifecycle.
Metrics Endpoint¶
Metrics are exposed on the Prometheus port (default: 29000):
The same listener also serves a WebSocket stream of real-time metric updates at /ws/metrics (used by the TUI and dashboards that need live state).
Configure via CLI:
Layer 1: HTTP Metrics¶
Metrics for incoming HTTP requests at the gateway edge.
smg_http_requests_total¶
Total HTTP requests received by the gateway.
| Type | Labels |
|---|---|
| Counter | method, path |
# Request rate by endpoint
sum by (path) (rate(smg_http_requests_total[5m]))
# Total request rate
sum(rate(smg_http_requests_total[5m]))
smg_http_request_duration_seconds¶
HTTP request duration from receipt to response.
| Type | Labels |
|---|---|
| Histogram | method, path |
# P99 latency by endpoint
histogram_quantile(0.99, sum by (path, le) (rate(smg_http_request_duration_seconds_bucket[5m])))
# Average latency
rate(smg_http_request_duration_seconds_sum[5m]) / rate(smg_http_request_duration_seconds_count[5m])
smg_http_responses_total¶
HTTP responses by status and error code.
| Type | Labels |
|---|---|
| Counter | status_code, error_code |
# Error rate (5xx responses)
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m])) / sum(rate(smg_http_responses_total[5m]))
# Success rate (2xx responses)
sum(rate(smg_http_responses_total{status_code=~"2.."}[5m])) / sum(rate(smg_http_responses_total[5m]))
smg_http_connections_active¶
Currently active HTTP connections.
| Type | Labels |
|---|---|
| Gauge | None |
smg_http_inflight_request_age_count¶
Distribution of in-flight request ages for Grafana heatmaps.
| Type | Labels |
|---|---|
| Gauge | gt, le |
Age buckets (seconds): 30, 60, 180, 300, 600, 1200, 3600, 7200, 14400, 28800, 86400
smg_http_rate_limit_total¶
Rate limiting decisions.
| Type | Labels |
|---|---|
| Counter | result |
Values: allowed, rejected
# Rejection rate
sum(rate(smg_http_rate_limit_total{result="rejected"}[5m])) / sum(rate(smg_http_rate_limit_total[5m]))
Layer 2: Router Metrics¶
Metrics for request routing and processing.
smg_router_requests_total¶
Requests processed by the router.
| Type | Labels |
|---|---|
| Counter | router_type, backend_type, connection_mode, model, endpoint, streaming |
Router types: openai, http, grpc
Backend types: regular, pd, external, harmony
Endpoints: chat, generate, responses, completions, rerank, embeddings, classify, messages, realtime, realtime_sessions, realtime_client_secrets, realtime_transcription
Streaming: true, false
# Request rate by model
sum by (model) (rate(smg_router_requests_total[5m]))
# Streaming vs non-streaming
sum by (streaming) (rate(smg_router_requests_total[5m]))
smg_router_request_duration_seconds¶
Total router request duration.
| Type | Labels |
|---|---|
| Histogram | router_type, backend_type, connection_mode, model, endpoint |
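This histogram can be queried like the HTTP duration metric above. A minimal sketch, assuming it is exported with the usual _bucket, _sum, and _count series:

```promql
# P95 router latency by model
histogram_quantile(0.95, sum by (model, le) (rate(smg_router_request_duration_seconds_bucket[5m])))
# Average router latency by endpoint
sum by (endpoint) (rate(smg_router_request_duration_seconds_sum[5m])) / sum by (endpoint) (rate(smg_router_request_duration_seconds_count[5m]))
```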
smg_router_request_errors_total¶
Router errors by type.
| Type | Labels |
|---|---|
| Counter | router_type, backend_type, connection_mode, model, endpoint, error_type |
Error types: no_workers, timeout, backend_error, validation_error, internal_error
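The error counter shares most labels with smg_router_requests_total, so the two can be combined. The ratio below is a sketch that assumes every failed request is also counted in smg_router_requests_total, which this reference does not state explicitly:

```promql
# Error rate by type
sum by (error_type) (rate(smg_router_request_errors_total[5m]))
# Error ratio by model (assumes errors are a subset of smg_router_requests_total)
sum by (model) (rate(smg_router_request_errors_total[5m])) / sum by (model) (rate(smg_router_requests_total[5m]))
```

Models with no error series in the window are absent from the ratio rather than reported as 0.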
smg_router_stage_duration_seconds¶
Duration of individual pipeline stages (gRPC mode only).
| Type | Labels |
|---|---|
| Histogram | router_type, stage |
Stage names are emitted by the gRPC pipeline (e.g., tokenize, route, inference, detokenize, tool_parse).
# Tokenization latency
histogram_quantile(0.99, rate(smg_router_stage_duration_seconds_bucket{stage="tokenize"}[5m]))
smg_router_ttft_seconds¶
Time to first token (gRPC streaming only).
| Type | Labels |
|---|---|
| Histogram | router_type, backend_type, model, endpoint |
# P50 TTFT by model
histogram_quantile(0.5, sum by (model, le) (rate(smg_router_ttft_seconds_bucket[5m])))
smg_router_tpot_seconds¶
Time per output token (gRPC streaming only).
| Type | Labels |
|---|---|
| Histogram | router_type, backend_type, model, endpoint |
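TPOT can be charted the same way as TTFT. A sketch, assuming the standard _bucket, _sum, and _count series exist for this histogram:

```promql
# P99 time per output token by model
histogram_quantile(0.99, sum by (model, le) (rate(smg_router_tpot_seconds_bucket[5m])))
# Average time per output token by model
sum by (model) (rate(smg_router_tpot_seconds_sum[5m])) / sum by (model) (rate(smg_router_tpot_seconds_count[5m]))
```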
smg_router_tokens_total¶
Token counts by type.
| Type | Labels |
|---|---|
| Counter | router_type, backend_type, model, endpoint, token_type |
Token types: input, output
# Tokens per second
sum by (token_type) (rate(smg_router_tokens_total[5m]))
# Input/output ratio
sum(rate(smg_router_tokens_total{token_type="output"}[5m])) / sum(rate(smg_router_tokens_total{token_type="input"}[5m]))
smg_router_generation_duration_seconds¶
Total generation time (first token to last token).
| Type | Labels |
|---|---|
| Histogram | router_type, backend_type, model, endpoint |
smg_router_upstream_responses_total¶
HTTP responses from upstream workers.
| Type | Labels |
|---|---|
| Counter | router_type, status_code, error_code |
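Upstream failures can be separated from gateway-edge failures by querying this counter directly. A sketch, assuming status_code holds the numeric HTTP status as it does for smg_http_responses_total:

```promql
# Share of upstream responses that are 5xx, per router type
sum by (router_type) (rate(smg_router_upstream_responses_total{status_code=~"5.."}[5m])) / sum by (router_type) (rate(smg_router_upstream_responses_total[5m]))
# Upstream responses broken down by error code
sum by (error_code) (rate(smg_router_upstream_responses_total[5m]))
```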
Layer 3: Worker Metrics¶
Metrics for worker pool management and resilience.
smg_worker_pool_size¶
Number of workers in the pool.
| Type | Labels |
|---|---|
| Gauge | worker_type, connection_mode, model |
smg_worker_connections_active¶
Active connections per worker pool.
| Type | Labels |
|---|---|
| Gauge | worker_type, connection_mode |
smg_worker_requests_active¶
Active requests per worker.
| Type | Labels |
|---|---|
| Gauge | worker |
# Load distribution across workers (each worker's share of all active requests)
smg_worker_requests_active / on() group_left sum(smg_worker_requests_active)
smg_worker_health¶
Worker health status.
| Type | Labels | Values |
|---|---|---|
| Gauge | worker | 1 = healthy, 0 = unhealthy |
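Because the gauge is 1 or 0 per worker, simple comparisons and aggregations are enough to find unhealthy workers. For example:

```promql
# Workers currently reported unhealthy
smg_worker_health == 0
# Fraction of known workers that are healthy
sum(smg_worker_health) / count(smg_worker_health)
```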
smg_worker_health_checks_total¶
Health check results.
| Type | Labels |
|---|---|
| Counter | worker_type, result |
Results: success, failure
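A failure ratio per worker type can be derived from the result label. A minimal sketch using the two result values listed above:

```promql
# Health check failure ratio by worker type
sum by (worker_type) (rate(smg_worker_health_checks_total{result="failure"}[5m])) / sum by (worker_type) (rate(smg_worker_health_checks_total[5m]))
```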
smg_worker_selection_total¶
Worker selection events by load balancer.
| Type | Labels |
|---|---|
| Counter | worker_type, connection_mode, model, policy |
smg_worker_errors_total¶
Worker-level errors by type.
| Type | Labels |
|---|---|
| Counter | worker_type, connection_mode, error_type |
Circuit Breaker Metrics¶
smg_worker_cb_state¶
Circuit breaker state per worker.
| Type | Labels | Values |
|---|---|---|
| Gauge | worker | 0 = closed, 1 = open, 2 = half-open |
smg_worker_cb_transitions_total¶
Circuit breaker state transitions.
| Type | Labels |
|---|---|
| Counter | worker, from, to |
smg_worker_cb_outcomes_total¶
Request outcomes tracked by circuit breaker.
| Type | Labels |
|---|---|
| Counter | worker, outcome |
Outcomes: success, failure
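The outcome counter shows how close each worker is to tripping its breaker. A sketch of a per-worker failure ratio, using the outcome values listed above:

```promql
# Failure ratio seen by the circuit breaker, per worker
sum by (worker) (rate(smg_worker_cb_outcomes_total{outcome="failure"}[5m])) / sum by (worker) (rate(smg_worker_cb_outcomes_total[5m]))
```

Pairing this with smg_worker_cb_state on one dashboard makes it easier to see whether a rising ratio preceded an open circuit.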
smg_worker_cb_consecutive_failures¶
Consecutive failures per worker.
| Type | Labels |
|---|---|
| Gauge | worker |
smg_worker_cb_consecutive_successes¶
Consecutive successes per worker.
| Type | Labels |
|---|---|
| Gauge | worker |
Retry Metrics¶
smg_worker_retries_total¶
Retry attempts.
| Type | Labels |
|---|---|
| Counter | worker_type, endpoint |
smg_worker_retries_exhausted_total¶
Requests that exhausted all retries.
| Type | Labels |
|---|---|
| Counter | worker_type, endpoint |
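The two retry counters are most useful side by side: one shows how often requests are retried, the other how often retrying did not help. A sketch:

```promql
# Retry attempts per second by endpoint
sum by (endpoint) (rate(smg_worker_retries_total[5m]))
# Requests per second that exhausted all retries, by endpoint
sum by (endpoint) (rate(smg_worker_retries_exhausted_total[5m]))
```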
smg_worker_retry_backoff_seconds¶
Retry backoff durations by attempt number.
| Type | Labels |
|---|---|
| Histogram | attempt |
Layer 4: Discovery Metrics¶
Metrics for service discovery.
smg_discovery_registrations_total¶
Worker registrations.
| Type | Labels |
|---|---|
| Counter | source, result |
Sources: static, kubernetes, consul, manual
smg_discovery_deregistrations_total¶
Worker deregistrations.
| Type | Labels |
|---|---|
| Counter | source, reason |
smg_discovery_sync_duration_seconds¶
Discovery sync duration.
| Type | Labels |
|---|---|
| Histogram | source |
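Slow discovery syncs can be spotted per source. A sketch, assuming the standard _bucket series for this histogram:

```promql
# P99 discovery sync duration by source
histogram_quantile(0.99, sum by (source, le) (rate(smg_discovery_sync_duration_seconds_bucket[5m])))
```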
smg_discovery_workers_discovered¶
Workers discovered per source.
| Type | Labels |
|---|---|
| Gauge | source |
Layer 5: MCP Tool Metrics¶
Metrics for Model Context Protocol tool execution.
smg_mcp_tool_calls_total¶
MCP tool invocations.
| Type | Labels |
|---|---|
| Counter | model, tool_name, result |
Results: success, error
# Tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m])) / sum(rate(smg_mcp_tool_calls_total[5m]))
# Most used tools
topk(10, sum by (tool_name) (rate(smg_mcp_tool_calls_total[5m])))
smg_mcp_tool_duration_seconds¶
Tool execution duration.
| Type | Labels |
|---|---|
| Histogram | model, tool_name |
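Tool latency is usually most informative when split by tool. A sketch, assuming the standard _bucket, _sum, and _count series:

```promql
# P95 execution time by tool
histogram_quantile(0.95, sum by (tool_name, le) (rate(smg_mcp_tool_duration_seconds_bucket[5m])))
# Five slowest tools by average execution time
topk(5, sum by (tool_name) (rate(smg_mcp_tool_duration_seconds_sum[5m])) / sum by (tool_name) (rate(smg_mcp_tool_duration_seconds_count[5m])))
```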
smg_mcp_servers_active¶
Active MCP servers.
| Type | Labels |
|---|---|
| Gauge | None |
smg_mcp_tool_iterations_total¶
Tool loop iterations in Responses API.
| Type | Labels |
|---|---|
| Counter | model |
Layer 6: Database Metrics¶
Metrics for storage operations.
smg_db_operations_total¶
Database operations.
| Type | Labels |
|---|---|
| Counter | storage_type, operation, result |
Storage types: response, conversation, conversation_item
Operations: get, put, delete, list
smg_db_operation_duration_seconds¶
Database operation duration.
| Type | Labels |
|---|---|
| Histogram | storage_type, operation |
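Storage latency and throughput can be broken down by the labels above. A sketch, assuming the standard _bucket series; the values of the result label on smg_db_operations_total are not listed in this reference, so the second query groups by it rather than filtering on a specific value:

```promql
# P99 operation latency by storage type and operation
histogram_quantile(0.99, sum by (storage_type, operation, le) (rate(smg_db_operation_duration_seconds_bucket[5m])))
# Operation rate split by result
sum by (operation, result) (rate(smg_db_operations_total[5m]))
```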
smg_db_connections_active¶
Active database connections.
| Type | Labels |
|---|---|
| Gauge | storage_type |
smg_db_items_stored¶
Items stored in database.
| Type | Labels |
|---|---|
| Counter | storage_type |
Cache Routing Metrics¶
smg_manual_policy_cache_entries¶
Entries in the cache-aware routing cache.
| Type | Labels |
|---|---|
| Gauge | None |
smg_worker_routing_keys_active¶
Active routing keys per worker (used by cache-aware policies).
| Type | Labels |
|---|---|
| Gauge | worker |
smg_manual_policy_branch_total¶
Manual policy execution branch counts for routing decisions.
| Type | Labels |
|---|---|
| Counter | branch |
smg_consistent_hashing_policy_branch_total¶
Consistent hashing policy execution branch counts for routing decisions.
| Type | Labels |
|---|---|
| Counter | branch |
smg_prefix_hash_policy_branch_total¶
Prefix hash policy execution branch counts for routing decisions.
| Type | Labels |
|---|---|
| Counter | branch |
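The three branch counters have the same shape, so one query pattern covers all of them. The sketch below shows each branch's share of routing decisions for the manual policy; substitute the other metric names as needed. The branch label values are not listed in this reference, so the query avoids naming any:

```promql
# Share of routing decisions taken by each branch of the manual policy
sum by (branch) (rate(smg_manual_policy_branch_total[5m])) / ignoring(branch) group_left sum(rate(smg_manual_policy_branch_total[5m]))
```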
Dashboard Queries Summary¶
| Metric | Query |
|---|---|
| Request rate | sum(rate(smg_http_requests_total[5m])) |
| Error rate | sum(rate(smg_http_responses_total{status_code=~"5.."}[5m])) / sum(rate(smg_http_responses_total[5m])) |
| P99 latency | histogram_quantile(0.99, sum by (le) (rate(smg_http_request_duration_seconds_bucket[5m]))) |
| TTFT P50 | histogram_quantile(0.5, sum by (le) (rate(smg_router_ttft_seconds_bucket[5m]))) |
| Tokens/sec | sum(rate(smg_router_tokens_total[5m])) |
| Healthy workers | sum(smg_worker_health) |
| Open circuits | count(smg_worker_cb_state == 1) |
| Rate limit rejections | rate(smg_http_rate_limit_total{result="rejected"}[5m]) |
| MCP tool success rate | sum(rate(smg_mcp_tool_calls_total{result="success"}[5m])) / sum(rate(smg_mcp_tool_calls_total[5m])) |
Histogram Buckets¶
Default histogram buckets (29 buckets from 1ms to 7200s) applied to every metric whose name ends with duration_seconds:
0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0,
10.0, 15.0, 30.0, 45.0, 60.0, 90.0, 120.0, 180.0, 240.0, 300.0,
480.0, 900.0, 1200.0, 1800.0, 2700.0, 3600.0, 5400.0, 7200.0
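Bucket boundaries matter for threshold queries: histogram_quantile interpolates linearly inside a bucket, while a threshold that coincides with a boundary can be read directly from the bucket counter. A sketch for the 1-second boundary, with the caveat that the exact rendering of the le label ("1" or "1.0") depends on the exporter and is an assumption here:

```promql
# Fraction of HTTP requests completing within 1 second
# (the regex accepts either "1" or "1.0" as the le value)
sum(rate(smg_http_request_duration_seconds_bucket{le=~"1|1\\.0"}[5m])) / sum(rate(smg_http_request_duration_seconds_count[5m]))
```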
Configure custom buckets via CLI: