
TokenSpeed
Speed-of-light LLM inference

Production-oriented docs for launching, tuning, and operating low-latency OpenAI-compatible serving.

Start Here

Common Workflow

  1. Install the runtime and kernel packages.
  2. Pick a launch recipe close to your model family and hardware.
  3. Set model loading, memory, scheduler, and parallelism parameters explicitly.
  4. Validate correctness and throughput together, changing only one tuning dimension at a time (a validation sketch follows this list).
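
As a minimal sketch of step 4: with a server already running (see Minimal Server below), a single timed request checks that output looks sane and gives a first end-to-end latency number. `curl`'s built-in timing stands in for a real benchmark harness here, and the model name follows the minimal example.

```bash
# Smoke-test correctness and latency together against a running server.
# Assumes the Minimal Server example below (openai/gpt-oss-20b on port 8000).
curl -s -w '\ntotal: %{time_total}s\n' \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```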

Minimal Server

```bash
tokenspeed serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1
```

The server exposes an OpenAI-compatible API under /v1.
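
A quick liveness check is to list the served models; `/v1/models` is part of the standard OpenAI API surface, so an OpenAI-compatible server should answer it with the model id passed at launch.

```bash
# Expect a JSON model list containing openai/gpt-oss-20b.
curl -s http://localhost:8000/v1/models
```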

High-Performance Shape

Large MoE deployments usually settle the same set of decisions (sketched after this list):

  • model path and revision
  • context length and KV cache dtype
  • scheduler token and sequence budgets
  • attention and MoE backends
  • tensor, data, and expert parallelism
  • reasoning, tool-call, and speculative decoding parsers
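
Put together, a launch for a large MoE model might look like the sketch below. Apart from the flags already shown in the minimal example, every flag name and value here is a hypothetical stand-in for one of the decisions above; Server Parameters has the authoritative spellings and Model Recipes has vetted combinations.

```bash
# Illustrative sketch only: flag names other than those in the minimal example
# are hypothetical placeholders for the decision list above.
tokenspeed serve some-org/some-moe-model \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 256 \
  --attention-backend flashattn \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --reasoning-parser openai
```

Each flag maps onto one bullet above; setting all of them explicitly rather than relying on defaults keeps a deployment reproducible, which is what workflow step 3 asks for.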

See Model Recipes for concrete examples and Server Parameters for the parameter reference.
