Server Parameters

This page documents the parameters operators usually set directly. TokenSpeed reuses familiar serving parameter names wherever the semantics match, and reserves TokenSpeed-specific knobs for runtime features whose semantics differ.

For a compact compatibility table, see Compatible Parameters.

Model Loading

| Parameter | Purpose |
| --- | --- |
| positional model | Model path or Hugging Face repo ID. |
| --model | Equivalent to positional model. |
| --tokenizer | Tokenizer path when it differs from the model path. |
| --tokenizer-mode | Select tokenizer behavior. auto uses fast tokenizers and model-specific hooks when available. |
| --skip-tokenizer-init | Skip tokenizer initialization for input-ID-only serving paths. |
| --load-format | Weight loading format: auto, pt, safetensors, npcache, dummy, or extensible. |
| --trust-remote-code | Allow custom model code from the model repository. |
| --revision | Model branch, tag, or commit. |
| --download-dir | Hugging Face download/cache directory. |
| --hf-overrides | JSON overrides for model configuration values. |
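
A minimal loading sketch. The tokenspeed serve entry point and the repo ID are assumptions for illustration; only the flags come from the table above:

```bash
# Hypothetical entry point and placeholder repo ID; flags as documented above.
tokenspeed serve my-org/my-model \
  --tokenizer-mode auto \
  --load-format safetensors \
  --download-dir /data/hf-cache \
  --revision main
```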

Precision And Quantization

| Parameter | Purpose |
| --- | --- |
| --dtype | Model weight and activation dtype. auto follows model metadata. |
| --kv-cache-dtype | KV cache dtype. Lower precision reduces KV memory and may require scaling factors. |
| --kv-cache-quant-method | KV cache quantization method. |
| --quantization | Weight quantization mode such as fp8, nvfp4, w8a8_fp8, or compressed-tensors. |
| --quantization-param-path | JSON file for KV cache scaling factors, commonly needed with FP8 KV cache. |
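
A hedged FP8 sketch under the same assumed entry point; the fp8 value for --kv-cache-dtype and the scales file path are illustrative:

```bash
# fp8 weights plus an FP8 KV cache; the KV cache dtype value and the
# scales file path are placeholders for this sketch.
tokenspeed serve my-org/my-model \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --quantization-param-path kv_cache_scales.json
```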

API Surface

| Parameter | Purpose |
| --- | --- |
| --host | HTTP bind host. |
| --port | HTTP bind port. |
| --served-model-name | Model name returned by the OpenAI-compatible API. |
| --api-key | API key required by the server. |
| --chat-template | Built-in chat template name or template file path. |
| --completion-template | Completion template for code-completion-style serving. |
| --stream-interval | Streaming buffer interval in generated tokens. Smaller values stream more frequently. |
| --stream-output | Return generated text as disjoint streaming segments. |
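
A typical API launch sketch under the same assumptions; the host, port, and environment variable name are illustrative:

```bash
# Bind publicly, require a key, and advertise a stable model name.
tokenspeed serve my-org/my-model \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my-model \
  --api-key "$TOKENSPEED_API_KEY"
```

Clients then pass the key in the standard OpenAI-style Authorization: Bearer header.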

Scheduler And Memory

| Parameter | Purpose |
| --- | --- |
| --max-model-len | Maximum sequence length. If omitted, TokenSpeed uses the model config. |
| --gpu-memory-utilization | Fraction of GPU memory used for model weights and KV cache. Lower it to leave headroom. |
| --max-num-seqs | Maximum number of active sequences the scheduler may process concurrently. |
| --chunked-prefill-size | Token budget the scheduler may issue in one iteration. Defaults to 8192. Set to -1 to disable chunked prefill. |
| --max-prefill-tokens | Prefill token budget used when chunked prefill is disabled. Defaults to 8192. |
| --max-total-tokens | Override the automatically calculated token pool size. |
| --block-size | KV cache block size. |
| --enable-prefix-caching / --no-enable-prefix-caching | Enable or disable prefix cache reuse. |
| --enforce-eager | Disable CUDA graph execution. |
| --max-cudagraph-capture-size | Largest batch size to capture with CUDA graphs. |
| --cudagraph-capture-sizes | Explicit CUDA graph capture sizes. |

--chunked-prefill-size is intentionally separate from the familiar --max-num-batched-tokens: in TokenSpeed it sets the scheduler's per-iteration issue budget, while --max-total-tokens controls the global token pool size.
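
A scheduler-tuning sketch under the same assumed entry point; the numeric values are illustrative starting points, not recommendations:

```bash
# Cap context length, leave GPU headroom, and keep the default prefill budget.
tokenspeed serve my-org/my-model \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 256 \
  --chunked-prefill-size 8192 \
  --enable-prefix-caching
```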

Parallelism

| Parameter | Purpose |
| --- | --- |
| --tensor-parallel-size, --tp | Familiar alias for setting the attention tensor parallel size. |
| --attn-tp-size | Tensor parallel size for attention. |
| --dense-tp-size | Tensor parallel size for dense layers. |
| --moe-tp-size | Tensor parallel size for MoE layers. |
| --data-parallel-size | Number of data-parallel replicas. |
| --enable-expert-parallel | Enable expert parallelism across the selected world size. |
| --expert-parallel-size, --ep-size | Explicit expert parallel size. |
| --world-size | Total worker process count across all nodes. |
| --nprocs-per-node | Worker process count per node. |
| --nnodes | Number of nodes. |
| --node-rank | Rank of the current node. |
| --dist-init-addr | Distributed initialization address. |

Use --tensor-parallel-size for simple launches. Use the TokenSpeed-specific split knobs (--attn-tp-size, --dense-tp-size, --moe-tp-size) when attention, dense, and MoE layers need different process groups.
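
Two contrasting sketches, again with the assumed entry point; sizes and addresses are illustrative:

```bash
# Simple launch: one tensor-parallel group covers everything.
tokenspeed serve my-org/my-model --tensor-parallel-size 8

# Split launch: attention and dense layers at TP 8 within each node, experts
# spread across all 16 ranks of a two-node job. All values are illustrative.
tokenspeed serve my-org/my-model \
  --attn-tp-size 8 \
  --dense-tp-size 8 \
  --enable-expert-parallel --ep-size 16 \
  --nnodes 2 --node-rank 0 --nprocs-per-node 8 \
  --dist-init-addr 10.0.0.1:5000
```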

Backend Selection

| Parameter | Purpose |
| --- | --- |
| --attention-backend | Attention kernel backend. Common values include trtllm_mla, tokenspeed_mla, fa3, and mha. |
| --drafter-attention-backend | Attention backend for the speculative decoding draft model. |
| --moe-backend | MoE backend. |
| --draft-moe-backend | MoE backend for the speculative decoding draft model. |
| --all2all-backend | MoE all-to-all backend. |
| --deepep-mode | DeepEP mode: auto, normal, or low_latency. |
| --sampling-backend | Sampling backend: greedy, flashinfer, or flashinfer_full. |

Set backend choices explicitly in production. auto is useful for bring-up, but explicit values make benchmark comparisons and regression triage easier.
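
A pinned-backend sketch using only values enumerated in the table above (entry point still assumed):

```bash
# Explicit backends keep benchmark runs reproducible across restarts.
tokenspeed serve my-org/my-model \
  --attention-backend fa3 \
  --sampling-backend flashinfer
```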

Reasoning And Tool Calling

| Parameter | Purpose |
| --- | --- |
| --reasoning-parser | Parser for extracting reasoning content from model outputs. |
| --tool-call-parser | Parser for OpenAI-compatible tool-call payloads. |
| --tool-server | Built-in demo tool server. |
| --enable-custom-logit-processor | Allow custom logit processors. Keep disabled unless the deployment needs it. |
| --think-end-token | End marker for thinking models. |

Common parser values include kimi_k2 and gpt-oss.
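
A parser-wiring sketch; which parser value matches which model is deployment-specific, so both values here are illustrative:

```bash
# Wire reasoning extraction and tool-call parsing for the OpenAI-compatible API.
tokenspeed serve my-org/my-model \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss
```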

Speculative Decoding

| Parameter | Purpose |
| --- | --- |
| --speculative-config | JSON speculative decoding configuration. |
| --speculative-algorithm | Speculative algorithm, such as EAGLE3 or MTP. |
| --speculative-draft-model-path | Draft model path or repo ID. |
| --speculative-draft-model-quantization | Draft model quantization. Defaults to unquant. |
| --speculative-num-steps | Number of draft model steps. Defaults to 3. |
| --speculative-num-draft-tokens | Number of draft tokens. Defaults to --speculative-num-steps + 1. |
| --speculative-eagle-topk | EAGLE top-k. Defaults to 1. |
| --eagle3-layers-to-capture | EAGLE3 layers to capture. |

Prefer --speculative-config for recipe-style launches because it keeps method, draft model, and token count together.
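
A recipe-style sketch; the JSON key names below are assumptions inferred from the individual flag names, not a confirmed schema:

```bash
# One JSON blob keeps algorithm, draft model, and token counts together.
# Key names are assumed from the flag names above; check your build's schema.
tokenspeed serve my-org/my-model \
  --speculative-config '{
    "algorithm": "EAGLE3",
    "draft_model_path": "my-org/my-draft-model",
    "num_steps": 3,
    "num_draft_tokens": 4
  }'
```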

Observability

| Parameter | Purpose |
| --- | --- |
| --log-level | Runtime log level. |
| --log-level-http | HTTP server log level. Defaults to --log-level when unset. |
| --enable-log-requests | Log request metadata and optionally payloads. |
| --log-requests-level | Request logging verbosity. |
| --enable-metrics | Enable metrics reporting. |
| --metrics-reporters | Metrics reporter, such as prometheus. |
| --decode-log-interval | Decode batch log interval. |
| --enable-cache-report | Include cached-token counts in OpenAI-compatible usage details. |
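
An observability sketch; the info log level is an assumed value, the rest comes from the table:

```bash
# Prometheus metrics plus cached-token usage accounting.
tokenspeed serve my-org/my-model \
  --log-level info \
  --enable-metrics \
  --metrics-reporters prometheus \
  --enable-cache-report
```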

TokenSpeed-Specific Runtime Knobs

These parameters are TokenSpeed-specific. They expose runtime features directly:

  • --max-total-tokens
  • --max-prefill-tokens
  • --chunked-prefill-size
  • --attn-tp-size
  • --dense-tp-size
  • --moe-tp-size
  • --kvstore-*
  • --enable-mla-l1-5-cache
  • --mla-chunk-multiplier
  • --disaggregation-*
  • --comm-fusion-max-num-tokens
  • --enable-allreduce-fusion
