# Server Parameters
This page documents the parameters operators usually set directly. TokenSpeed reuses familiar serving parameter names where the semantics match, and reserves TokenSpeed-specific names for runtime features whose semantics differ.
For a compact compatibility table, see Compatible Parameters.
## Model Loading
| Parameter | Purpose |
|---|---|
| model (positional) | Model path or Hugging Face repo ID. |
| --model | Equivalent to the positional model argument. |
| --tokenizer | Tokenizer path when it differs from the model path. |
| --tokenizer-mode | Select tokenizer behavior. auto uses fast tokenizers and model-specific hooks when available. |
| --skip-tokenizer-init | Skip tokenizer initialization for input-ID-only serving paths. |
| --load-format | Weight loading format: auto, pt, safetensors, npcache, or dummy. The format set is extensible. |
| --trust-remote-code | Allow custom model code from the model repository. |
| --revision | Model branch, tag, or commit. |
| --download-dir | Hugging Face download/cache directory. |
| --hf-overrides | JSON overrides for model configuration values. |
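A minimal launch sketch follows; the tokenspeed serve entrypoint is a placeholder assumption, since this page documents the flags rather than the launcher itself.

```bash
# NOTE: "tokenspeed serve" is a placeholder entrypoint, not confirmed by
# this page; only the flags below are documented above.
tokenspeed serve meta-llama/Llama-3.1-8B-Instruct \
  --revision main \
  --download-dir /data/hf-cache \
  --load-format safetensors
```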
## Precision And Quantization
| Parameter | Purpose |
|---|---|
| --dtype | Model weight and activation dtype. auto follows model metadata. |
| --kv-cache-dtype | KV cache dtype. Lower precision reduces KV memory and may require scaling factors. |
| --kv-cache-quant-method | KV cache quantization method. |
| --quantization | Weight quantization mode such as fp8, nvfp4, w8a8_fp8, or compressed-tensors. |
| --quantization-param-path | JSON file for KV cache scaling factors, commonly needed with FP8 KV cache. |
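A sketch of an FP8 deployment, assuming fp8 is an accepted value for both --quantization and --kv-cache-dtype, with the placeholder launcher from above:

```bash
# Placeholder launcher; the fp8 values and the scales path are assumptions.
tokenspeed serve my-org/my-fp8-model \
  --dtype auto \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --quantization-param-path /models/kv_cache_scales.json
```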
## API Surface
| Parameter | Purpose |
|---|---|
| --host | HTTP bind host. |
| --port | HTTP bind port. |
| --served-model-name | Model name returned by the OpenAI-compatible API. |
| --api-key | API key required by the server. |
| --chat-template | Built-in chat template name or template file path. |
| --completion-template | Completion template for code-completion style serving. |
| --stream-interval | Streaming buffer interval in generated tokens. Smaller values stream more frequently. |
| --stream-output | Return generated text as disjoint streaming segments. |
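A sketch of a keyed deployment and a matching client request; the /v1/chat/completions path is assumed from the API's OpenAI compatibility, and the launcher name remains a placeholder:

```bash
# Placeholder launcher; bind the OpenAI-compatible server behind a key.
tokenspeed serve my-org/my-model \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my-model \
  --api-key "$TOKENSPEED_API_KEY"

# Query it with a standard OpenAI-style request.
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKENSPEED_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```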
## Scheduler And Memory
| Parameter | Purpose |
|---|---|
| --max-model-len | Maximum sequence length. If omitted, TokenSpeed uses the model config. |
| --gpu-memory-utilization | Fraction of GPU memory used for model weights and KV cache. Lower it to leave headroom. |
| --max-num-seqs | Maximum number of active sequences the scheduler may process concurrently. |
| --chunked-prefill-size | Token budget the scheduler may issue in one iteration. Defaults to 8192. Set -1 to disable chunked prefill. |
| --max-prefill-tokens | Prefill token budget used when chunked prefill is disabled. Defaults to 8192. |
| --max-total-tokens | Override the automatically calculated token pool size. |
| --block-size | KV cache block size. |
| --enable-prefix-caching / --no-enable-prefix-caching | Enable or disable prefix cache reuse. |
| --enforce-eager | Disable CUDA graph execution. |
| --max-cudagraph-capture-size | Largest batch size to capture with CUDA graphs. |
| --cudagraph-capture-sizes | Explicit CUDA graph capture sizes. |
Note that --chunked-prefill-size is intentionally separate from the --max-num-batched-tokens name used by other servers: in TokenSpeed it is the scheduler's per-iteration issue budget, while --max-total-tokens controls the global token pool.
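As an illustration, a latency-sensitive deployment might trade peak throughput for headroom; all values below are illustrative, not tuned recommendations:

```bash
# Placeholder launcher; illustrative values, not recommendations.
tokenspeed serve my-org/my-model \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 256 \
  --chunked-prefill-size 4096 \
  --enable-prefix-caching
```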
## Parallelism
| Parameter | Purpose |
|---|---|
| --tensor-parallel-size, --tp | Familiar alias that sets the attention tensor parallel size. |
| --attn-tp-size | Tensor parallel size for attention layers. |
| --dense-tp-size | Tensor parallel size for dense layers. |
| --moe-tp-size | Tensor parallel size for MoE layers. |
| --data-parallel-size | Number of data-parallel replicas. |
| --enable-expert-parallel | Enable expert parallelism across the selected world size. |
| --expert-parallel-size, --ep-size | Explicit expert parallel size. |
| --world-size | Total worker process count across all nodes. |
| --nprocs-per-node | Worker process count per node. |
| --nnodes | Number of nodes. |
| --node-rank | Rank of the current node. |
| --dist-init-addr | Distributed initialization address. |
Use --tensor-parallel-size for simple launches. Use the TokenSpeed-specific split knobs when attention, dense, and MoE layers need different process groups.
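Two launch shapes as a sketch, using the placeholder launcher from earlier; whether a particular combination of sizes composes depends on the model and the world size:

```bash
# Simple case: one 8-way tensor-parallel group.
tokenspeed serve my-org/my-moe-model --tensor-parallel-size 8

# Split case (TokenSpeed-specific knobs): smaller attention groups and
# expert parallelism for MoE layers. Sizes are illustrative only.
tokenspeed serve my-org/my-moe-model \
  --attn-tp-size 2 \
  --dense-tp-size 4 \
  --moe-tp-size 1 \
  --enable-expert-parallel \
  --ep-size 8
```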
## Backend Selection
| Parameter | Purpose |
|---|---|
| --attention-backend | Attention kernel backend. Common values include trtllm_mla, tokenspeed_mla, fa3, and mha. |
| --drafter-attention-backend | Attention backend for the speculative-decoding draft model. |
| --moe-backend | MoE backend. |
| --draft-moe-backend | MoE backend for the speculative-decoding draft model. |
| --all2all-backend | MoE all-to-all backend. |
| --deepep-mode | DeepEP mode: auto, normal, or low_latency. |
| --sampling-backend | Sampling backend: greedy, flashinfer, or flashinfer_full. |
Set backend choices explicitly in production. auto is useful during bring-up, but explicit values make benchmark comparisons and regression triage easier to reason about.
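For example, a production launch might pin the documented values like this; --deepep-mode presumably matters only when the all-to-all backend is DeepEP, and the launcher name is still a placeholder:

```bash
# Placeholder launcher; backend values taken from the table above.
tokenspeed serve my-org/my-model \
  --attention-backend fa3 \
  --sampling-backend flashinfer \
  --deepep-mode low_latency
```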
## Reasoning And Tool Calling
| Parameter | Purpose |
|---|---|
| --reasoning-parser | Parser for extracting reasoning content from model outputs. |
| --tool-call-parser | Parser for OpenAI-compatible tool-call payloads. |
| --tool-server | Built-in demo tool server. |
| --enable-custom-logit-processor | Allow custom logit processors. Keep disabled unless the deployment needs it. |
| --think-end-token | End marker for thinking models. |
Common parser values include kimi_k2 and gpt-oss.
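A sketch for a reasoning model with tool calling; the kimi_k2 parser name comes from the note above, while the </think> end marker is an assumed convention, not something this page documents:

```bash
# Placeholder launcher; the </think> end marker is an assumption.
tokenspeed serve my-org/my-reasoning-model \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --think-end-token "</think>"
```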
## Speculative Decoding
| Parameter | Purpose |
|---|---|
| --speculative-config | JSON speculative decoding configuration. |
| --speculative-algorithm | Speculative algorithm, such as EAGLE3 or MTP. |
| --speculative-draft-model-path | Draft model path or repo ID. |
| --speculative-draft-model-quantization | Draft model quantization. Defaults to unquant. |
| --speculative-num-steps | Number of draft model steps. Defaults to 3. |
| --speculative-num-draft-tokens | Number of draft tokens. Defaults to --speculative-num-steps + 1. |
| --speculative-eagle-topk | EAGLE top-k. Defaults to 1. |
| --eagle3-layers-to-capture | EAGLE3 layers to capture. |
Prefer --speculative-config for recipe-style launches because it keeps method, draft model, and token count together.
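A hedged example of that style; the JSON keys below are assumptions chosen to mirror the individual flags, not a documented schema:

```bash
# Placeholder launcher; JSON keys are assumptions mirroring the flags.
tokenspeed serve my-org/my-model \
  --speculative-config '{
    "algorithm": "EAGLE3",
    "draft_model_path": "my-org/my-eagle3-draft",
    "num_steps": 3,
    "num_draft_tokens": 4
  }'
```

The draft-token count of 4 matches the documented default of --speculative-num-steps + 1.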
## Observability
| Parameter | Purpose |
|---|---|
| --log-level | Runtime log level. |
| --log-level-http | HTTP server log level. Defaults to --log-level when unset. |
| --enable-log-requests | Log request metadata and optionally payloads. |
| --log-requests-level | Request logging verbosity. |
| --enable-metrics | Enable metrics reporting. |
| --metrics-reporters | Metrics reporters to enable, such as prometheus. |
| --decode-log-interval | Decode batch log interval. |
| --enable-cache-report | Include cached-token counts in OpenAI-compatible usage details. |
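A logging-plus-metrics sketch; the interval value is illustrative and the launcher name remains a placeholder:

```bash
# Placeholder launcher; reporter value from the table above,
# interval value illustrative.
tokenspeed serve my-org/my-model \
  --log-level info \
  --enable-metrics \
  --metrics-reporters prometheus \
  --enable-log-requests \
  --decode-log-interval 40
```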
## TokenSpeed-Specific Runtime Knobs
These parameters are TokenSpeed-specific and expose runtime features directly:
- --max-total-tokens
- --max-prefill-tokens
- --chunked-prefill-size
- --attn-tp-size
- --dense-tp-size
- --moe-tp-size
- --kvstore-*
- --enable-mla-l1-5-cache
- --mla-chunk-multiplier
- --disaggregation-*
- --comm-fusion-max-num-tokens
- --enable-allreduce-fusion