
TokenSpeed
Speed-of-light LLM inference

Production-oriented docs for launching, tuning, and operating low-latency OpenAI-compatible serving.

Start Here

Common Workflow

  1. Install the runtime and kernel packages.
  2. Pick a launch recipe close to your model family and hardware.
  3. Set model loading, memory, scheduler, and parallelism parameters explicitly.
  4. Validate correctness and throughput together, changing only one tuning dimension at a time (a validation sketch follows this list).
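
As a minimal sketch of step 4: with a server already running (see Minimal Server below), a single timed request checks that output looks sane and gives a first end-to-end latency number. `curl`'s built-in timing stands in for a real benchmark harness here, and the model name follows the minimal example.

```bash
# Smoke-test correctness and latency together against a running server.
# Assumes the Minimal Server example below (openai/gpt-oss-20b on port 8000).
curl -s -w '\ntotal: %{time_total}s\n' \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```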

Minimal Server

```bash
tokenspeed serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1
```

The server exposes an OpenAI-compatible API under /v1.
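
A quick liveness check is to list the served models; `/v1/models` is part of the standard OpenAI API surface, so an OpenAI-compatible server should answer it with the model id passed at launch.

```bash
# Expect a JSON model list containing openai/gpt-oss-20b.
curl -s http://localhost:8000/v1/models
```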

High-Performance Shape

Large MoE deployments usually settle the same set of decisions (sketched after this list):

  • model path and revision
  • context length and KV cache dtype
  • scheduler token and sequence budgets
  • attention and MoE backends
  • tensor, data, and expert parallelism
  • reasoning, tool-call, and speculative decoding parsers
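
Put together, a launch for a large MoE model might look like the sketch below. Apart from the flags already shown in the minimal example, every flag name and value here is a hypothetical stand-in for one of the decisions above; Server Parameters has the authoritative spellings and Model Recipes has vetted combinations.

```bash
# Illustrative sketch only: flag names other than those in the minimal example
# are hypothetical placeholders for the decision list above.
tokenspeed serve some-org/some-moe-model \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 256 \
  --attention-backend flashattn \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --reasoning-parser openai
```

Each flag maps onto one bullet above; setting all of them explicitly rather than relying on defaults keeps a deployment reproducible, which is what workflow step 3 asks for.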

See Model Recipes for concrete examples and Server Parameters for the parameter reference.
