
Compatible Parameters

TokenSpeed keeps familiar serving parameter names wherever the operational meaning is the same, so existing recipes stay portable; behavior that is TokenSpeed-specific is documented explicitly below.

Directly Aligned

| Parameter | TokenSpeed behavior |
| --- | --- |
| model (positional) | Model path or Hugging Face repo ID. |
| --model | Equivalent to the positional model argument. |
| --tokenizer | Tokenizer path. |
| --tokenizer-mode | Tokenizer implementation mode. |
| --skip-tokenizer-init | Skip tokenizer initialization. |
| --load-format | Weight loading format. |
| --trust-remote-code | Allow custom model code from the model repository. |
| --dtype | Weight and activation dtype. |
| --kv-cache-dtype | KV cache storage dtype. |
| --quantization | Weight quantization method. |
| --quantization-param-path | Path to a KV cache scaling-factor file. |
| --max-model-len | Maximum sequence length. |
| --device | Device type; TokenSpeed currently supports CUDA only. |
| --served-model-name | Model name exposed on the OpenAI-compatible API. |
| --revision | Model revision. |
| --download-dir | Model download directory. |
| --hf-overrides | JSON model config overrides. |
| --host | HTTP bind host. |
| --port | HTTP bind port. |
| --api-key | API key for the server. |
| --chat-template | Chat template name or path. |
| --gpu-memory-utilization | Fraction of GPU memory used for weights and KV cache. |
| --max-num-seqs | Maximum concurrent sequences. |
| --block-size | KV cache block size. |
| --enable-prefix-caching | Enable prefix cache reuse. |
| --no-enable-prefix-caching | Disable prefix cache reuse. |
| --enforce-eager | Disable CUDA graph execution. |
| --max-cudagraph-capture-size | Largest CUDA graph capture size. |
| --tensor-parallel-size, --tp | Attention tensor parallel size. |
| --data-parallel-size | Data parallel size. |
| --enable-expert-parallel | Enable expert parallelism. |
| --speculative-config | JSON speculative decoding config. |
| --tool-call-parser | Tool-call parser for the OpenAI-compatible API. |
| --reasoning-parser | Reasoning-output parser. |
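
For orientation, here is a launch that uses only directly aligned names. This is a minimal sketch: the repo ID, port, and numeric values are illustrative placeholders, not recommendations.

```bash
# Minimal sketch using only directly aligned parameter names.
# The repo ID and all values below are illustrative placeholders.
tokenspeed serve meta-llama/Llama-3.1-8B-Instruct \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name llama-3.1-8b
```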

Similar But Not Identical

| Recipe parameter | TokenSpeed parameter | Difference |
| --- | --- | --- |
| --max-num-batched-tokens | --chunked-prefill-size | TokenSpeed uses this as the scheduler's per-iteration issue budget. |
| --max-num-batched-tokens | --max-total-tokens | TokenSpeed uses this to override the global token pool size. |
| --tensor-parallel-size, --tp | --attn-tp-size | The familiar alias maps to attention TP; TokenSpeed can split attention, dense, and MoE tensor parallelism independently. |
| --expert-parallel-size | --expert-parallel-size, --ep-size | TokenSpeed accepts the familiar name as well as its existing short form. |
| --attention-backend | --attention-backend | Name is aligned; the available backend values are TokenSpeed-specific. |
| --moe-backend | --moe-backend | Name is aligned; the available backend values are TokenSpeed-specific. |
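
The --max-num-batched-tokens split above is the one most likely to change a recipe's meaning. A hedged sketch of both translations, with placeholder model path and values:

```bash
# Recipe line (vLLM-style), shown for illustration only:
#   ... --max-num-batched-tokens 8192 ...

# Translation A: the recipe meant the per-iteration scheduler budget
# (the common case). Path and value are placeholders.
tokenspeed serve ./my-model --chunked-prefill-size 8192

# Translation B: the recipe meant a global token pool override.
tokenspeed serve ./my-model --max-total-tokens 8192
```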

Recipe Translation Notes

  • Use tokenspeed serve as the launcher; a complete worked translation follows this list.
  • Pass the model path positionally, then keep --trust-remote-code, --max-model-len, --kv-cache-dtype, --gpu-memory-utilization, --max-num-seqs, --tensor-parallel-size, --reasoning-parser, and --tool-call-parser when the model needs them.
  • Review --max-num-batched-tokens before copying it: TokenSpeed usually wants --chunked-prefill-size for per-iteration scheduling (see the table above).
  • Review backend names. TokenSpeed backends are optimized for its runtime and kernel packages.
  • Keep TokenSpeed-specific --attn-tp-size, --moe-tp-size, --disaggregation-*, and --kvstore-* only when the deployment needs those features.
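
Putting the notes together, a complete translation might look like the following. The source invocation is an assumed vLLM-style recipe; the model ID, parser name, and numeric values are placeholders.

```bash
# Assumed source recipe (vLLM-style), shown for contrast:
#   vllm serve org/model --trust-remote-code --max-model-len 32768 \
#     --max-num-batched-tokens 8192 --tensor-parallel-size 4 \
#     --tool-call-parser hermes

# TokenSpeed translation; only --max-num-batched-tokens changes name:
tokenspeed serve org/model \
  --trust-remote-code \
  --max-model-len 32768 \
  --chunked-prefill-size 8192 \
  --tensor-parallel-size 4 \
  --tool-call-parser hermes
```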
