LightSeek Foundation

TokenSpeed enables Day 0 support for Moonshot AI's 2.8T-parameter Kimi K3 on NVIDIA Blackwell and AMD CDNA4 within one week, with a unified flat KV cache, disaggregated serving, and specialized CuteDSL, TensorRT-LLM, and Gluon kernels.

2026/07/27

TokenSpeed brings native NVFP4 and MXFP4 serving to Thinking Machines Lab’s 975B-parameter open-source MoE model Inkling across NVIDIA and AMD accelerators.

2026/07/15

TokenSpeed-kernel is a standalone open-source subsystem for LLM inference kernels, using layered APIs and registry-based dispatch to separate runtime logic from hardware-specific backend implementations.

PyTorch Blog · 2026/06/25

The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This performance for agentic workloads comes from systematic elimination of memory copies, advanced kernel fusions, and fully overlapped CPU-GPU execution, which together keep the GPU saturated at all times. On the functionality side, TokenSpeed also supports hybrid prefix caching and unified Prefill-Decode state transfers to handle complex agentic serving scenarios.

PyTorch Blog · 2026/05/27

EAGLE 3.1 improves speculative decoding robustness with FC normalization and post-norm hidden-state feedback, with training support in TorchSpec and deployment support in vLLM.

2026/05/26

Slides for the PyCon US 2026 talk on building high-performance LLM inference in pure Python with PyTorch custom ops.

PyCon US 2026 · 2026/05/16

LightSeek Foundation shares the TokenSpeed project ethos and how external contributions fit into the project.

X Article · 2026/05/14

TokenSpeed is a speed-of-light LLM inference engine designed from first principles for agentic workloads, with a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a restriction that keeps KV resource reuse safe, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration.

2026/05/06

SMG moves CPU-bound serving work such as tokenization, detokenization, reasoning parsing, tool orchestration, multimodal preprocessing, and routing into a Rust gateway layer, keeping GPU inference engines focused on tensor execution while avoiding Python GIL bottlenecks.

PyTorch Blog · 2026/04/30

We’re excited to announce TorchSpec, a torch-native framework for scalable speculative decoding training. TorchSpec streams hidden states directly from inference engines to training workers via Mooncake, eliminating the need to materialize massive tensors on disk or co-locate training with the target model. This design enables fully disaggregated pipelines where inference and training scale independently.

2026/03/19

This post is jointly published with the PyTorch Foundation and is also available on the official PyTorch Blog.

PyTorch Blog · 2026/03/19

Blog

Kimi K3 at Day 0: Frontier Model Enablement on Leading Platforms with TokenSpeed

TML Inkling at Day 0: FP4 Inference on NVIDIA and AMD with TokenSpeed

TokenSpeed-Kernel: Portable APIs and High-Performance Kernels for Multi-Silicon LLM Inference

Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

High-Performance LLM Inference in Pure Python with PyTorch Custom Ops

TokenSpeed Project Ethos and External Contributions

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads

SMG: The Case for Disaggregating CPU from GPU in LLM Serving

TorchSpec: Speculative Decoding Training at Scale

TorchSpec: Speculative Decoding Training at Scale