# Model Recipes
These recipes start from a known model family, pick the hardware topology, then set only the parameters that change runtime behavior.
The commands below are templates. Validate exact model IDs, checkpoint formats, and backend choices against the build you deploy.
## Kimi K2.5 / K2.6
Kimi-style MoE launches usually need trusted remote code, a long context window, reasoning and tool-call parsers, and explicit MLA and MoE backends.
```bash
tokenspeed serve nvidia/Kimi-K2.5-NVFP4 \
--served-model-name kimi-k2.5 \
--trust-remote-code \
--max-model-len 262144 \
--kv-cache-dtype fp8 \
--quantization nvfp4 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--chunked-prefill-size 8192 \
--max-num-seqs 256 \
--attention-backend trtllm_mla \
--moe-backend flashinfer_trtllm \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 \
--port 8000
```

For K2.6, keep the same parameter shape: swap in the K2.6 checkpoint, and change the parsers only if the model card requires different values.
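If the K2.6 checkpoint follows the same naming scheme, the launch is a straight substitution. A minimal sketch, assuming a checkpoint ID of `nvidia/Kimi-K2.6-NVFP4` and carried-over `kimi_k2` parsers (both are assumptions; confirm against the model card):

```bash
# Hypothetical K2.6 launch: same parameter shape as K2.5.
# The checkpoint ID is an assumption; the kimi_k2 parsers carry over
# unless the K2.6 model card says otherwise.
tokenspeed serve nvidia/Kimi-K2.6-NVFP4 \
--served-model-name kimi-k2.6 \
--trust-remote-code \
--max-model-len 262144 \
--kv-cache-dtype fp8 \
--quantization nvfp4 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--chunked-prefill-size 8192 \
--max-num-seqs 256 \
--attention-backend trtllm_mla \
--moe-backend flashinfer_trtllm \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 \
--port 8000
```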
## GPT-OSS 20B / 120B
Small GPT-OSS launches can start simple. Large GPT-OSS launches usually tune tensor parallelism, scheduler token budget, and KV cache dtype.
```bash
tokenspeed serve openai/gpt-oss-20b \
--served-model-name gpt-oss-20b \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--chunked-prefill-size 8192 \
--reasoning-parser gpt-oss \
--tool-call-parser gpt-oss \
--host 0.0.0.0 \
--port 8000
```

For the 120B model, raise tensor parallelism, cap concurrency, and move the KV cache to fp8:

```bash
tokenspeed serve openai/gpt-oss-120b \
--served-model-name gpt-oss-120b \
--tensor-parallel-size 4 \
--max-model-len 131072 \
--kv-cache-dtype fp8 \
--chunked-prefill-size 8192 \
--max-num-seqs 256 \
--reasoning-parser gpt-oss \
--tool-call-parser gpt-oss \
--host 0.0.0.0 \
--port 8000
```

## Tuning Order
- Set model ID, trust policy, tokenizer mode, and served model name.
- Set context length and KV cache dtype.
- Set tensor, data, and expert parallelism to match the node topology.
- Set scheduler budgets: `--chunked-prefill-size`, `--max-num-seqs`, and only then `--max-total-tokens`.
- Set attention, MoE, and sampling backends explicitly for benchmark runs.
- Add reasoning, tool-call, grammar, or speculative decoding only when the model and workload need them.
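Put together, the order reads as a single launch. Here is a minimal sketch applying it to the Kimi K2.5 recipe above, with the flags grouped by step; the `--max-total-tokens` value and the OpenAI-style `/v1/chat/completions` smoke test are assumptions, so validate both against the build you deploy:

```bash
# Sketch: a fully explicit benchmark launch assembled in tuning order.
# The --max-total-tokens value and the /v1/chat/completions endpoint are
# assumptions; validate them against the build you deploy.
args=(
  # 1. Model ID, trust policy, served model name.
  nvidia/Kimi-K2.5-NVFP4
  --served-model-name kimi-k2.5
  --trust-remote-code
  # 2. Context length, KV cache dtype, and the checkpoint's quantization.
  --max-model-len 262144
  --kv-cache-dtype fp8
  --quantization nvfp4
  # 3. Parallelism to match the node topology.
  --tensor-parallel-size 4
  --enable-expert-parallel
  # 4. Scheduler budgets: prefill chunk, then concurrency, then total tokens.
  --chunked-prefill-size 8192
  --max-num-seqs 256
  --max-total-tokens 32768   # assumed value; size to your workload
  # 5. Explicit backends so benchmark runs are reproducible.
  --attention-backend trtllm_mla
  --moe-backend flashinfer_trtllm
  # 6. Parsers, because this model and workload need them.
  --reasoning-parser kimi_k2
  --tool-call-parser kimi_k2
  --host 0.0.0.0
  --port 8000
)
tokenspeed serve "${args[@]}"

# In another shell, once the server is up (assumes an OpenAI-compatible API):
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-k2.5", "messages": [{"role": "user", "content": "ping"}]}'
```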