Graceful Shutdown

Graceful shutdown allows in-flight requests to complete before the gateway terminates, preventing request failures during deployments and restarts.

Overview

Clean Termination

Allow existing requests to finish rather than abruptly closing connections.

Zero-Downtime Deployments

Deploy updates without causing client-visible errors.

Configurable Grace Period

Control how long to wait for in-flight requests.

API Control

Trigger shutdown programmatically via HTTP API.

Why Graceful Shutdown?

Without graceful shutdown:

Abrupt termination: Active requests are immediately disconnected
Client errors: In-flight requests return connection errors
Data loss: Streaming responses may be truncated
Deployment failures: Rolling updates cause visible errors

With graceful shutdown:

Request completion: Active requests finish normally
No client errors: Users don't see deployment-related failures
Clean streaming: Streaming responses complete before shutdown
Smooth deployments: Zero-downtime rolling updates

How It Works

Shutdown Sequence

1. Shutdown signal received (SIGTERM, SIGINT, or API call)
2. Stop accepting new requests - New requests are rejected with 503
3. Drain in-flight requests - Existing requests continue processing
4. Grace period timer runs - If requests are still in flight after --shutdown-grace-period-secs, shutdown is forced
5. Clean exit - The gateway exits once all requests complete (or the grace period expires)
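The drain-then-force pattern in this sequence can be modelled in a few lines. The following is a minimal Python sketch, not SMG's actual implementation: the function name `drain` is hypothetical, and asyncio tasks stand in for in-flight requests.

```python
import asyncio


async def drain(in_flight, grace_period_secs):
    """Wait for in-flight tasks, cancelling whatever outlives the grace period.

    Returns a (completed, forced) pair of counts. Illustrative only.
    """
    if not in_flight:
        return 0, 0
    # Drain: wait for every task, but no longer than the grace period.
    done, pending = await asyncio.wait(in_flight, timeout=grace_period_secs)
    for task in pending:
        task.cancel()  # Grace period expired: force shutdown.
    # Let the cancellations finish so nothing is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def demo():
    # Two short "requests" and one that outlives the grace period.
    tasks = [asyncio.create_task(asyncio.sleep(d)) for d in (0.01, 0.02, 5)]
    return await drain(tasks, grace_period_secs=0.2)
```

Running `asyncio.run(demo())` reports two completed requests and one forced cancellation, which is the trade-off the grace period controls.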

Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --shutdown-grace-period-secs 180

Parameters

Parameter	Default	Description
`--shutdown-grace-period-secs`	`180` (3 min)	Maximum time, in seconds, to wait for in-flight requests before forcing shutdown

Recommended Configurations

Fast Shutdown

Quick termination for development.

smg --shutdown-grace-period-secs 10

Use when: Development, testing, quick restarts

Production Standard

Balanced grace period for typical workloads.

smg --shutdown-grace-period-secs 180

Use when: Standard production deployments

Batch Processing

Long grace period for long-running requests.

smg --shutdown-grace-period-secs 600

Use when: Batch inference, long-running generations

Critical Low-Latency

Short grace period for latency-sensitive systems.

smg --shutdown-grace-period-secs 30

Use when: Very short requests, rapid scaling

Triggering Shutdown

Via Signal

# Find the SMG process
pgrep -f smg

# Send SIGTERM for graceful shutdown
kill -TERM <pid>

# Or SIGINT (Ctrl+C in terminal)
kill -INT <pid>
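A supervisor script can apply the same signal-then-wait pattern from Python. The sketch below is illustrative only: `stop_gracefully` is a hypothetical helper, and the demo stops a stand-in child process rather than a real SMG instance.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, grace_period_secs):
    """Send SIGTERM, wait up to the grace period, then fall back to a hard kill."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_period_secs)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # The process did not exit in time.
        proc.wait()
        return "forced"


def demo():
    # Stand-in child process; a real caller would hold the gateway's process handle.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return stop_gracefully(child, grace_period_secs=10)
```

The wait timeout should be at least as long as `--shutdown-grace-period-secs`; otherwise the supervisor kills the gateway before it has finished draining.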

Via API

# Trigger graceful shutdown via HTTP
curl -X POST http://gateway:3001/ha/shutdown
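The same call can be made with Python's standard library. This is a sketch under assumptions: `build_shutdown_request` and `trigger_shutdown` are hypothetical helper names, the host, port, and path come from the curl example, and the endpoint's response is not documented here, so the code only returns the status code.

```python
import urllib.request


def build_shutdown_request(base_url):
    """Build (but do not send) a POST to the shutdown endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/ha/shutdown",
        method="POST",
    )


def trigger_shutdown(base_url, timeout_secs=5.0):
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_shutdown_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout_secs) as response:
        return response.status
```

For example, `trigger_shutdown("http://gateway:3001")` issues the same request as the curl command.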

Kubernetes Integration

Kubernetes sends SIGTERM by default when terminating pods. Set terminationGracePeriodSeconds higher than your SMG grace period so the gateway can finish draining before the pod is killed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: smg
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 210  # SMG grace + buffer
      containers:
        - name: smg
          args:
            - --shutdown-grace-period-secs=180

Kubernetes timeout

Kubernetes will force-kill the pod after terminationGracePeriodSeconds. Set this higher than --shutdown-grace-period-secs to ensure SMG has time to complete its graceful shutdown.
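A quick way to sanity-check the two settings is to compute the remaining slack. The helper below is a hypothetical sketch, not part of SMG or Kubernetes. It also subtracts any preStop hook duration, because Kubernetes counts time spent in preStop against terminationGracePeriodSeconds.

```python
def termination_slack_secs(termination_grace_secs, smg_grace_secs, pre_stop_secs=0):
    """Seconds left over after the preStop hook and SMG's grace period.

    A result of zero or less means the pod may be force-killed mid-drain.
    """
    return termination_grace_secs - (smg_grace_secs + pre_stop_secs)


# Values from the Deployment above, plus an assumed 5-second preStop sleep.
slack = termination_slack_secs(210, 180, pre_stop_secs=5)
```

With these inputs the slack is 25 seconds. Setting terminationGracePeriodSeconds to 180 with the same preStop sleep would leave a negative slack.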

Sizing the Grace Period

Consider these factors when setting the grace period:

Factor	Impact on Grace Period
Average request duration	Grace period should exceed typical request time
Longest expected request	Batch jobs may need longer grace periods
Streaming responses	Long streams need extended grace periods
Deployment frequency	Frequent deployments may need shorter periods
Scaling responsiveness	Autoscaling may need faster termination

Calculation Guidelines

grace_period = max(
    avg_request_duration * 3,
    p99_request_duration * 1.5,
    max_streaming_duration
)

Example: If your average request is 30s, p99 is 60s, and max streaming is 120s:

grace_period = max(90, 90, 120) = 120 seconds
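The guideline translates directly into a small helper. This is a sketch: the function name `recommended_grace_period` is hypothetical, and rounding up to whole seconds is an added assumption, since the flag takes a value in seconds.

```python
import math


def recommended_grace_period(avg_secs, p99_secs, max_stream_secs):
    """Apply the guideline formula and round up to whole seconds."""
    return math.ceil(max(avg_secs * 3, p99_secs * 1.5, max_stream_secs))


# The worked example above: average 30s, p99 60s, max streaming 120s.
example = recommended_grace_period(30, 60, 120)
```

Here `example` is 120, matching the calculation above. The result can be passed to `--shutdown-grace-period-secs`.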

Integration with Load Balancers

For zero-downtime deployments, coordinate with your load balancer:

Pre-Stop Hook (Kubernetes)

Delay SIGTERM with a preStop hook so the pod can be removed from the load balancer before SMG starts shutting down:

spec:
  containers:
    - name: smg
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]

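The sleep gives the load balancer time to stop sending new traffic before SMG receives SIGTERM and begins its graceful shutdown. Time spent in the preStop hook counts against terminationGracePeriodSeconds, so include it in the buffer.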

Health Check Coordination

During shutdown, SMG's health endpoint can return an unhealthy status so that load balancers stop routing traffic to the instance:

# Health check during normal operation
curl http://gateway:3001/health
# Returns 200 OK

# During graceful shutdown
curl http://gateway:3001/health
# Returns 503 Service Unavailable
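A deployment script can poll the health endpoint to confirm that an instance has started draining. The following is a sketch under assumptions: `health_status` and `wait_until_draining` are hypothetical helpers, and the polling interval and attempt count are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def health_status(base_url, timeout_secs=2.0):
    """Return the HTTP status of GET /health (a 503 is returned, not raised)."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_draining(get_status, attempts=30, interval_secs=1.0, sleep=time.sleep):
    """Poll until the health check reports 503; return True if it did."""
    for attempt in range(attempts):
        if get_status() == 503:
            return True
        if attempt < attempts - 1:
            sleep(interval_secs)
    return False
```

A typical call would be `wait_until_draining(lambda: health_status("http://gateway:3001"))`. Connection errors raised once the gateway has exited are deliberately left unhandled in this sketch.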

Monitoring

Shutdown Events

Watch logs for shutdown-related messages:

# Graceful shutdown initiated
[INFO] Received shutdown signal, starting graceful shutdown
[INFO] Stopping new request acceptance
[INFO] Waiting for 5 in-flight requests to complete

# Requests completing
[INFO] In-flight requests: 5 -> 4
[INFO] In-flight requests: 4 -> 3
...

# Clean exit
[INFO] All requests completed, shutting down
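Drain progress can be extracted from these messages with a small parser. This is an illustrative sketch that assumes the log wording matches the sample above exactly; `remaining_in_flight` is a hypothetical helper, and the patterns would need adjusting if the real messages differ.

```python
import re

# Patterns assume the log wording shown in the sample above.
WAITING = re.compile(r"Waiting for (\d+) in-flight requests?")
PROGRESS = re.compile(r"In-flight requests: (\d+) -> (\d+)")


def remaining_in_flight(log_lines):
    """Return the latest in-flight count seen in the logs, or None if unknown."""
    remaining = None
    for line in log_lines:
        if (match := WAITING.search(line)):
            remaining = int(match.group(1))
        elif (match := PROGRESS.search(line)):
            remaining = int(match.group(2))
        elif "All requests completed" in line:
            remaining = 0
    return remaining
```

A result of `0` indicates a clean drain, while `None` means no shutdown-related message has been seen yet.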

Metrics During Shutdown

Metric	Observation
`smg_requests_active`	Should decrease towards 0
`smg_requests_total`	Should stop increasing (no new requests)
`smg_shutdown_in_progress`	Set to 1 during graceful shutdown
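These metrics can be checked programmatically from a scrape in Prometheus text format. The sketch below makes several assumptions: the metrics are exposed in that format, samples with different labels should be summed per metric name, and `parse_metrics` and `drain_complete` are hypothetical helpers. How the text is fetched is left out because the metrics endpoint is not specified here.

```python
def parse_metrics(text):
    """Sum sample values per metric name from Prometheus text format."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        # The first field is the value; an optional timestamp may follow.
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def drain_complete(totals):
    """True when shutdown is in progress and no requests remain active."""
    return (
        totals.get("smg_shutdown_in_progress", 0) == 1
        and totals.get("smg_requests_active", 0) == 0
    )
```

Calling `drain_complete(parse_metrics(scrape_text))` in a loop gives a simple readiness signal for the next step of a rollout.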

Tuning Guidelines

Symptom	Potential Adjustment
Requests failing during deployment	Increase `--shutdown-grace-period-secs`
Slow scaling down	Decrease `--shutdown-grace-period-secs`
Kubernetes force-killing pods	Increase `terminationGracePeriodSeconds`
Streaming responses truncated	Match grace period to max stream duration

What's Next?

Circuit Breakers

Isolate failing workers to prevent cascade failures.

Health Checks

Proactive worker monitoring and failure detection.

Rate Limiting

Protect workers from overload with token bucket rate limiting.
