Rate Limiting¶
Rate limiting protects workers from being overwhelmed by too many concurrent requests. SMG uses a token bucket algorithm with optional request queuing.
Overview¶
Token Bucket¶
Smooth rate limiting with burst capacity using the token bucket algorithm.
Request Queuing¶
Queue excess requests instead of rejecting them immediately.
Configurable Timeouts¶
Bound request and queue wait times to maintain system responsiveness.
Observable¶
Full Prometheus metrics for queue depth, wait times, and rejection rates.
Why Rate Limit?¶
Without rate limiting:
- Worker overload: Too many concurrent requests degrade performance
- Memory exhaustion: Workers run out of GPU memory
- Cascading timeouts: Slow responses cause client timeouts
- Poor user experience: Some users get fast responses, others wait forever
Rate limiting ensures fair access and predictable performance.
How It Works¶
SMG uses a token bucket algorithm:
Token Bucket¶
- Bucket capacity: Maximum concurrent requests (--max-concurrent-requests)
- Refill rate: Tokens added per second (--rate-limit-tokens-per-second)
- Request cost: Each request consumes one token
Request Queue¶
When no tokens are available, requests can wait in a queue:
- Queue size: Maximum waiting requests (--queue-size)
- Queue timeout: Maximum wait time (--queue-timeout-secs)
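The mechanics can be sketched in a few lines of Python. This is an illustrative model of the algorithm, not SMG's implementation: tokens are refilled lazily from elapsed wall-clock time, and a request is admitted only when a whole token is available.

```python
import time

class TokenBucket:
    """Illustrative token bucket: capacity bounds bursts,
    refill_rate bounds sustained throughput."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full: allows an initial burst
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Lazily refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request costs one token
            return True
        return False          # no token: queue or reject

bucket = TokenBucket(capacity=2, refill_rate=1)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
# True True False: the burst of 2 is spent, refill has not caught up
```

In SMG the `False` branch corresponds to entering the bounded queue (or a 429 when the queue is full or disabled).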
Configuration¶
```bash
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --max-concurrent-requests 100 \
  --rate-limit-tokens-per-second 50 \
  --queue-size 200 \
  --queue-timeout-secs 30
```
Rate Limit Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --max-concurrent-requests | -1 (disabled) | Token bucket capacity. When <= 0 the limiter is disabled entirely and requests pass through. |
| --rate-limit-tokens-per-second | unset (refills at max_concurrent_requests) | Token bucket refill rate in tokens per second. |
| --queue-size | 100 | Maximum queued requests |
| --queue-timeout-secs | 60 | Maximum queue wait time |
Timeout Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --request-timeout-secs | 1800 (30 min) | Maximum time for a request to complete |
| --queue-timeout-secs | 60 | Maximum time a request waits in queue |
| --worker-startup-timeout-secs | 1800 (30 min) | Timeout for worker startup/model loading |
Concurrency vs. Rate Limiting
Setting --max-concurrent-requests alone creates a token bucket whose capacity and refill rate both equal max_concurrent_requests, so it enforces both burst capacity and a sustained rate. Set --rate-limit-tokens-per-second when you want the sustained rate to differ from the burst capacity (for example, capacity 100 with refill 50 allows short bursts of 100 while sustaining 50 req/s).
Response Codes¶
| Code | Meaning | When |
|---|---|---|
| 429 | Too Many Requests | Queue is full, or queuing is disabled and no token is available |
| 408 | Request Timeout | Queue wait exceeded timeout |
The local rate limiter returns a status-only response with no JSON body, so clients should rely on the HTTP status code to distinguish the two cases (and on the X-Request-Id header for correlation). SMG does not currently emit a Retry-After header with the response.
When the mesh global rate limit is enabled and exceeded, the 429 response carries a JSON body.
Sizing Guidelines¶
Concurrent Requests¶
Base on worker capacity:
| Worker Type | Requests per Worker |
|---|---|
| Small GPU (16GB) | 4-8 |
| Medium GPU (40GB) | 8-16 |
| Large GPU (80GB) | 16-32 |
Queue Size¶
Base on acceptable latency:
| Latency Tolerance | Queue Depth Factor |
|---|---|
| Low (interactive) | 0.5-1x |
| Medium (batch) | 2-4x |
| High (async) | 4-8x |
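Putting the two tables together as a worked example (the cluster size and chosen factors here are hypothetical):

```python
# Hypothetical cluster: 4 medium (40GB) GPUs at 16 concurrent requests each.
workers = 4
requests_per_worker = 16
max_concurrent = workers * requests_per_worker  # --max-concurrent-requests

# Batch workload: a queue depth of 2x the concurrency limit is tolerable.
queue_factor = 2
queue_size = max_concurrent * queue_factor      # --queue-size

print(max_concurrent, queue_size)  # 64 128
```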
Token Refill Rate¶
Base the refill rate on sustainable throughput: estimate the request rate your workers can sustain, then multiply by 1.2. The 1.2 factor provides headroom for bursts.
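A sketch of that calculation; the worker count and average request duration are hypothetical, and the refill rate is the sustainable request rate scaled by 1.2:

```python
workers = 4
requests_per_worker = 16   # hypothetical per-worker concurrency
avg_request_secs = 2.0     # hypothetical mean request duration

# Sustainable throughput: concurrent slots turned over per second.
sustained_rps = workers * requests_per_worker / avg_request_secs  # 32.0
tokens_per_second = sustained_rps * 1.2  # 1.2 factor adds burst headroom
print(round(tokens_per_second, 1))  # 38.4
```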
Example Configurations¶
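Two illustrative setups built only from the flags documented above; the worker URLs and specific values are hypothetical starting points, not recommendations:

```bash
# Interactive workload: small queue, short waits, fail fast.
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --max-concurrent-requests 64 \
  --rate-limit-tokens-per-second 40 \
  --queue-size 32 \
  --queue-timeout-secs 5

# Batch workload: deep queue, long waits, reject rarely.
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --max-concurrent-requests 64 \
  --queue-size 256 \
  --queue-timeout-secs 120
```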
Monitoring¶
Metrics¶
| Metric | Description |
|---|---|
| smg_http_rate_limit_total | Rate limit decisions by result (allowed/rejected) |
| smg_http_request_duration_seconds | Request duration histogram |
Useful PromQL Queries¶
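A few starting points over the metrics above. These assume the decisions carry a result label with allowed/rejected values, per the table, and the standard _bucket series for the histogram; verify the exact names your deployment exports:

```promql
# Fraction of requests rejected by the rate limiter (5m window)
sum(rate(smg_http_rate_limit_total{result="rejected"}[5m]))
  /
sum(rate(smg_http_rate_limit_total[5m]))

# p95 request duration
histogram_quantile(0.95,
  sum(rate(smg_http_request_duration_seconds_bucket[5m])) by (le))
```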
Alert Thresholds¶
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Queue utilization | >70% | >90% | Increase queue size or capacity |
| Rejection rate | >5% | >20% | Increase limits or scale workers |
| Avg queue wait | >10s | >30s | Reduce load or increase capacity |
| Queue timeouts | >1/min | >10/min | Investigate bottlenecks |
Client-Side Handling¶
Retry Strategy¶
Clients should implement exponential backoff when receiving 429. SMG does not set a Retry-After header today, so clients must compute their own wait:
```python
import time
import requests

def request_with_retry(url, data, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, json=data)
        if response.status_code == 429:
            # SMG does not emit Retry-After; fall back to exponential backoff.
            time.sleep(2 ** attempt)
            continue
        return response
    raise Exception("Max retries exceeded")
```
Adaptive Rate¶
Monitor 429 responses and adjust request rate:
```python
import time

class AdaptiveClient:
    def __init__(self, base_rate=10):
        self.rate = base_rate  # target requests per second

    def pace(self):
        time.sleep(1 / self.rate)  # wait before sending the next request

    def on_success(self):
        self.rate = min(self.rate * 1.1, 100)  # increase slowly, capped

    def on_rate_limit(self):
        self.rate = max(self.rate * 0.5, 1)  # decrease quickly, with a floor
```