Server Parameters

This page documents the parameters operators usually set directly. TokenSpeed uses familiar serving parameter names where the semantics match, and keeps TokenSpeed-specific knobs for runtime features whose semantics differ.

For a compact compatibility table, see Compatible Parameters.

Model Loading

| Parameter | Purpose |
| --- | --- |
| positional model | Model path or Hugging Face repo ID. |
| --model | Equivalent to positional model. |
| --tokenizer | Tokenizer path when it differs from the model path. |
| --tokenizer-mode | Select tokenizer behavior. auto uses fast tokenizers and model-specific hooks when available. |
| --skip-tokenizer-init | Skip tokenizer initialization for input-ID-only serving paths. |
| --load-format | Weight loading format: auto, pt, safetensors, npcache, dummy, or extensible. |
| --trust-remote-code | Allow custom model code from the model repository. |
| --revision | Model branch, tag, or commit. |
| --download-dir | Hugging Face download/cache directory. |
| --hf-overrides | JSON overrides for model configuration values. |

Precision And Quantization

| Parameter | Purpose |
| --- | --- |
| --dtype | Model weight and activation dtype. auto follows model metadata. |
| --kv-cache-dtype | KV cache dtype. Lower precision reduces KV memory and may require scaling factors. |
| --kv-cache-quant-method | KV cache quantization method. |
| --quantization | Weight quantization mode such as fp8, nvfp4, w8a8_fp8, or compressed-tensors. |
| --quantization-param-path | JSON file for KV cache scaling factors, commonly needed with FP8 KV cache. |

API Surface

| Parameter | Purpose |
| --- | --- |
| --host | HTTP bind host. |
| --port | HTTP bind port. |
| --served-model-name | Model name returned by the OpenAI-compatible API. |
| --api-key | API key required by the server. |
| --chat-template | Built-in chat template name or template file path (handled by the smg gateway). |
| --stream-interval | Streaming buffer interval in generated tokens. Smaller values stream more frequently. |
| --stream-output | Return generated text as disjoint streaming segments. |

Scheduler And Memory

| Parameter | Purpose |
| --- | --- |
| --max-model-len | Maximum sequence length. If omitted, TokenSpeed uses the model config. |
| --gpu-memory-utilization | Fraction of GPU memory used for model weights and KV cache. Lower it to leave headroom. |
| --max-num-seqs | Maximum number of active sequences the scheduler may process concurrently. |
| --chunked-prefill-size | Token budget the scheduler may issue in one iteration. Defaults to 8192. Set -1 to disable chunked prefill. |
| --max-prefill-tokens | Prefill token budget used when chunked prefill is disabled. Defaults to 8192. |
| --max-total-tokens | Override the automatically calculated token pool size. |
| --block-size | KV cache block size. |
| --enable-prefix-caching / --no-enable-prefix-caching | Enable or disable prefix cache reuse. |
| --enforce-eager | Disable CUDA graph execution. |
| --max-cudagraph-capture-size | Largest batch size to capture with CUDA graphs. |
| --cudagraph-capture-sizes | Explicit CUDA graph capture sizes. |

--chunked-prefill-size is intentionally kept separate from --max-num-batched-tokens. In TokenSpeed, --chunked-prefill-size is the scheduler's per-iteration issue budget, while --max-total-tokens controls the global token pool.
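
As a rough illustration of the per-iteration budget, the sketch below splits one long prompt into prefill chunks no larger than --chunked-prefill-size. It assumes a single request that has the whole budget to itself; the real scheduler shares the budget across requests, and `prefill_chunks` is a hypothetical helper, not a TokenSpeed API.

```python
def prefill_chunks(prompt_len: int, chunk_budget: int = 8192) -> list[int]:
    """Return the chunk sizes used to prefill one prompt.

    chunk_budget mirrors --chunked-prefill-size. A value of -1 means chunked
    prefill is disabled, so the prompt is issued as a single chunk
    (simplification: the --max-prefill-tokens budget is not modeled here).
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    if chunk_budget == -1:
        return [prompt_len] if prompt_len else []
    if chunk_budget <= 0:
        raise ValueError("chunk_budget must be positive or -1")

    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, chunk_budget)  # never exceed the budget
        chunks.append(step)
        remaining -= step
    return chunks


print(prefill_chunks(20000))  # [8192, 8192, 3616]
```

Under these assumptions, a 20,000-token prompt needs three scheduler iterations at the default budget.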

Parallelism

| Parameter | Purpose |
| --- | --- |
| --tensor-parallel-size, --tp | Familiar alias for setting attention tensor parallel size. |
| --attn-tp-size | Tensor parallel size for attention. |
| --dense-tp-size | Tensor parallel size for dense layers. |
| --moe-tp-size | Tensor parallel size for MoE layers. |
| --data-parallel-size | Number of data-parallel replicas. |
| --enable-expert-parallel | Enable expert parallelism across the selected world size. |
| --expert-parallel-size, --ep-size | Explicit expert parallel size. |
| --world-size | Total worker process count across all nodes. |
| --nprocs-per-node | Worker process count per node. |
| --nnodes | Number of nodes. |
| --node-rank | Rank of the current node. |
| --dist-init-addr | Distributed initialization address. |

Use --tensor-parallel-size for simple launches. Use the TokenSpeed-specific split knobs when attention, dense, and MoE layers need different process groups.
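
The multi-node flags have to agree with each other. The sketch below derives them for one node, assuming every node runs the same number of worker processes (so the world size is nodes times processes per node) and that node ranks start at 0. `topology_flags` is a hypothetical helper, and the address `10.0.0.1:5000` is a placeholder whose host:port form is also an assumption.

```python
def topology_flags(nnodes: int, nprocs_per_node: int, node_rank: int,
                   dist_init_addr: str) -> list[str]:
    """Build consistent multi-node flags for one node of a uniform cluster."""
    if nnodes < 1 or nprocs_per_node < 1:
        raise ValueError("nnodes and nprocs_per_node must be positive")
    # Assumption: node ranks are 0-based.
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes})")

    # Assumption: uniform nodes, so the total is a simple product.
    world_size = nnodes * nprocs_per_node
    return [
        "--world-size", str(world_size),
        "--nprocs-per-node", str(nprocs_per_node),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
    ]


# Flags for the second node of a 2-node, 8-process-per-node launch.
print(" ".join(topology_flags(2, 8, 1, "10.0.0.1:5000")))
```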

Backend Selection

| Parameter | Purpose |
| --- | --- |
| --attention-backend | Attention kernel backend. Common values include mha, fa3, fa4, triton, flashinfer, trtllm_mla, and tokenspeed_mla. |
| --drafter-attention-backend | Attention backend for the speculative decoding drafter model. |
| --moe-backend | MoE backend. |
| --draft-moe-backend | MoE backend for the speculative decoding draft model. |
| --all2all-backend | MoE all-to-all backend. |
| --deepep-mode | DeepEP mode: auto, normal, or low_latency. |
| --sampling-backend | Sampling backend: greedy, flashinfer, or flashinfer_full. |

Set backend choices explicitly in production. auto is useful for bring-up, but explicit values make benchmark comparisons and regressions easier to reason about.

When --dp-sampling is enabled, the logits processor owns the per-forward logits layout decision and carries the resulting plan to the sampling backend with the logits output.

Reasoning And Tool Calling

| Parameter | Purpose |
| --- | --- |
| --reasoning-parser | Parser for extracting reasoning content from model outputs (handled by the smg gateway). |
| --tool-call-parser | Parser for OpenAI-compatible tool-call payloads (handled by the smg gateway). |
| --enable-custom-logit-processor | Allow custom logit processors. Keep disabled unless the deployment needs it. |

Common reasoning parser values include kimi_k25, base, qwen3, deepseek_r1, and deepseek_v31. Common tool-call parser values include kimik2, qwen, deepseek_v4, json, and passthrough. Parser names are validated by the smg gateway, so use the values accepted by the bundled tokenspeed-smg package.

Speculative Decoding

| Parameter | Purpose |
| --- | --- |
| --speculative-config | JSON speculative decoding configuration. |
| --speculative-algorithm | Speculative algorithm, such as EAGLE3, MTP, or DFLASH. |
| --speculative-draft-model-path | Draft model path or repo ID. |
| --speculative-draft-model-quantization | Draft model quantization. Defaults to unquant. |
| --speculative-num-steps | Number of draft model steps. Defaults to 3. |
| --speculative-num-draft-tokens | Number of draft tokens. Defaults to --speculative-num-steps + 1. |
| --speculative-eagle-topk | EAGLE top-k. Defaults to 1. |
| --eagle3-layers-to-capture | EAGLE3 layers to capture. |

Prefer --speculative-config for recipe-style launches because it keeps method, draft model, and token count together.
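
For the individual flags in the table above, the documented default chain can be sketched as follows. `resolve_speculative` is a hypothetical helper that only mirrors the stated defaults; it is not how TokenSpeed parses its arguments, and it says nothing about the keys accepted inside --speculative-config.

```python
from typing import Optional


def resolve_speculative(num_steps: Optional[int] = None,
                        num_draft_tokens: Optional[int] = None,
                        eagle_topk: Optional[int] = None) -> dict:
    """Fill in unset speculative decoding values with the documented defaults."""
    steps = 3 if num_steps is None else num_steps            # default: 3
    # The draft token count defaults to one more than the step count.
    draft_tokens = steps + 1 if num_draft_tokens is None else num_draft_tokens
    topk = 1 if eagle_topk is None else eagle_topk           # default: 1
    return {
        "speculative_num_steps": steps,
        "speculative_num_draft_tokens": draft_tokens,
        "speculative_eagle_topk": topk,
    }


print(resolve_speculative())             # all defaults
print(resolve_speculative(num_steps=5))  # draft tokens follow the step count
```

Note that the draft token default tracks --speculative-num-steps, so raising the step count alone also raises the draft token count.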

Observability

| Parameter | Purpose |
| --- | --- |
| --log-level | Runtime log level. |
| --log-level-http | HTTP server log level. Defaults to --log-level when unset. |
| --enable-log-requests | Log request metadata and optionally payloads. |
| --log-requests-level | Request logging verbosity. |
| --enable-metrics | Enable metrics reporting. |
| --metrics-reporters | Metrics reporter, such as prometheus. |
| --decode-log-interval | Decode batch log interval. |
| --enable-cache-report | Include cached-token counts in OpenAI-compatible usage details. |
| --kv-events-config | JSON config for KV cache mutation events. Set enable_kv_cache_events and a publisher such as zmq to publish device prefix-cache stores and removals. |

KV Cache Events

KV cache events publish reusable device prefix-cache mutations from the live C++ scheduler path. Host/L2 loadback events are not published by this initial stream. Block hash lineage is cached on prefix-cache nodes, so publishing a stored block uses the parent node's cached hash instead of rebuilding the full ancestor prefix.

Example:

```bash
--kv-events-config '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://*:5557","topic":"kv-events"}'
```
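
When the launch command is generated by a script, it can be safer to serialize the JSON and quote it programmatically than to hand-escape it. A minimal sketch using only the Python standard library and the same keys as the example above:

```python
import json
import shlex

# Same settings as the hand-written example above.
kv_events = {
    "enable_kv_cache_events": True,
    "publisher": "zmq",
    "endpoint": "tcp://*:5557",
    "topic": "kv-events",
}

# Compact separators keep the value free of spaces; shlex.quote wraps it
# in single quotes so the shell passes it through as one argument.
value = json.dumps(kv_events, separators=(",", ":"))
flag = "--kv-events-config " + shlex.quote(value)
print(flag)
```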

The ZMQ publisher sends three frames: topic bytes, an 8-byte big-endian sequence number, and a msgpack payload. The payload is an array-like KVEventBatch:

```python
[timestamp, [["BlockStored", [block_hash], parent_hash, token_ids, block_size]], attn_dp_rank]
[timestamp, [["BlockRemoved", [block_hash]]], attn_dp_rank]
```
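
A consumer could decode one message along these lines. This is a sketch under stated assumptions: the sequence number is treated as unsigned, the names `KVEventBatch` fields, `parse_kv_event_frames`, and `removed_hashes` are hypothetical, and the msgpack decoder is passed in as a callable so the block itself needs only the standard library.

```python
from typing import Any, Callable, List, NamedTuple, Sequence


class KVEventBatch(NamedTuple):
    topic: bytes
    seq: int
    timestamp: Any
    events: List[list]
    attn_dp_rank: Any


def parse_kv_event_frames(frames: Sequence[bytes],
                          unpack: Callable[[bytes], Any]) -> KVEventBatch:
    """Decode the three frames of one KV event message."""
    if len(frames) != 3:
        raise ValueError(f"expected 3 frames, got {len(frames)}")
    topic, seq_bytes, payload = frames
    if len(seq_bytes) != 8:
        raise ValueError("sequence frame must be 8 bytes")
    # Assumption: the big-endian sequence number is unsigned.
    seq = int.from_bytes(seq_bytes, "big")
    # The payload is array-like: [timestamp, events, attn_dp_rank].
    timestamp, events, attn_dp_rank = unpack(payload)
    return KVEventBatch(topic, seq, timestamp, list(events), attn_dp_rank)


def removed_hashes(batch: KVEventBatch) -> list:
    """Collect block hashes from every BlockRemoved event in a batch."""
    hashes = []
    for event in batch.events:
        if event[0] == "BlockRemoved":
            hashes.extend(event[1])
    return hashes
```

With the third-party pyzmq and msgpack packages, the frames would typically come from a SUB socket's `recv_multipart()` and `unpack` would be `msgpack.unpackb`. Tracking `seq` lets a consumer notice gaps in the stream.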

With attention data parallelism, each attention DP rank publishes on its own port, offset from the port in the configured endpoint.
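
The exact offset rule is not spelled out here. The sketch below assumes the simplest scheme, base port plus attention DP rank, purely for illustration; `offset_endpoint` is a hypothetical helper, so verify the actual ports against a running deployment before relying on it.

```python
def offset_endpoint(endpoint: str, attn_dp_rank: int) -> str:
    """Return the endpoint a given attention DP rank would publish on.

    Assumption (not confirmed by this page): port = base port + rank.
    """
    if attn_dp_rank < 0:
        raise ValueError("attn_dp_rank must be non-negative")
    head, sep, port = endpoint.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"endpoint has no numeric port: {endpoint!r}")
    return f"{head}:{int(port) + attn_dp_rank}"


# Under the assumed rule, rank 2 of the example endpoint uses port 5559.
print(offset_endpoint("tcp://*:5557", 2))
```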

TokenSpeed-Specific Runtime Knobs

These parameters are TokenSpeed-specific. They expose runtime features directly:

  • --max-total-tokens
  • --max-prefill-tokens
  • --chunked-prefill-size
  • --attn-tp-size
  • --dense-tp-size
  • --moe-tp-size
  • --kvstore-*
  • --enable-mla-l1-5-cache
  • --kv-events-config
  • --mla-chunk-multiplier
  • --disaggregation-*
  • --comm-fusion-max-num-tokens
  • --enable-allreduce-fusion
