Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding tools such as Claude Code, Codex, and Cursor scale from developer utilities to infrastructure powering software development at large, the underlying inference engines serving these requests are under increasing pressure. Researchers at the LightSeek Foundation have released TokenSpeed, an open-source LLM inference engine published under the MIT license and designed specifically for the demands of agentic workloads. TokenSpeed is currently in preview.
Why Agentic Inference Is a Different Problem
To understand what makes TokenSpeed’s design choices meaningful, it helps to understand what makes agentic inference hard. Coding agents don’t behave like a typical chatbot turn: contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks don’t fully capture this behavior.
TokenSpeed is designed to maximize both. The objective is to maximize per-GPU TPM while sustaining a per-user TPS floor: typically 70 TPS, and sometimes 200 TPS or higher.
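As a back-of-envelope sketch of how the two metrics relate (the concurrency figure below is illustrative, not a TokenSpeed number):

```python
# Per-GPU TPM is roughly concurrent users x per-user TPS x 60.
# With the per-user floor fixed, a GPU's throughput can only grow
# by serving more concurrent users, which is why the engine must
# optimize both metrics at once rather than trading one for the other.
tps_floor = 70            # per-user responsiveness floor
concurrent_users = 24     # illustrative number of active agents
tpm_per_gpu = concurrent_users * tps_floor * 60
print(tpm_per_gpu)        # 100800
```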
Architecture: Five Interlocking Subsystems
TokenSpeed’s architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a safe KV resource reuse constraint, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.
The modeling layer uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program on different subsets of the data, a common pattern in distributed deep learning. Rather than requiring developers to hand-write the communication logic between processes, TokenSpeed lets them specify I/O placement annotations at module boundaries; a lightweight static compiler then automatically generates the required collective operations during model construction.
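TokenSpeed’s annotation API is not shown in the source, so the following is a toy sketch of the idea under assumed names: each module declares its input/output placement, and a tiny “compiler” pass inserts a collective wherever an output layout does not match the next module’s expected input layout.

```python
# Toy illustration of compiler-generated collectives from placement
# annotations (module and layout names are hypothetical, not
# TokenSpeed's real API).
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    in_layout: str   # "sharded" or "replicated" across ranks
    out_layout: str

def compile_pipeline(modules):
    """Build an execution plan, inserting an all-gather at layout mismatches."""
    plan = []
    for prev, nxt in zip(modules, modules[1:]):
        plan.append(prev.name)
        if prev.out_layout == "sharded" and nxt.in_layout == "replicated":
            plan.append("all_gather")   # collective generated automatically
    plan.append(modules[-1].name)
    return plan

plan = compile_pipeline([
    Module("attn", "replicated", "sharded"),
    Module("moe", "replicated", "replicated"),
])
print(plan)   # ['attn', 'all_gather', 'moe']
```

The point is that the developer only annotates layouts at module boundaries; the communication pattern falls out of the compiler pass.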
The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. Encoding these constraints in the type system means that errors in KV cache management, one of the most error-prone areas in LLM serving, are caught earlier. The execution plane is implemented in Python to maintain development velocity, enabling faster feature iteration and lower cognitive load for developers.
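A minimal sketch of the control-plane idea, in Python for readability (TokenSpeed implements this in C++ with type-level enforcement; the state names and ownership rules below are assumptions for illustration): the request lifecycle is an explicit FSM, and illegal transitions are rejected up front instead of silently corrupting KV state.

```python
# Request lifecycle as an explicit finite-state machine.
# Hypothetical states; the real TokenSpeed FSM may differ.
ALLOWED = {
    "queued":     {"prefilling"},
    "prefilling": {"decoding", "preempted"},
    "preempted":  {"prefilling"},            # KV blocks must be re-acquired
    "decoding":   {"finished", "preempted"},
    "finished":   set(),
}

class Request:
    def __init__(self):
        self.state = "queued"
        self.kv_blocks = []                  # KV cache blocks this request owns

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "preempted":
            self.kv_blocks.clear()           # ownership released on preemption
        self.state = new_state

r = Request()
r.transition("prefilling")
r.transition("decoding")
try:
    r.transition("queued")                   # illegal: rejected immediately
except ValueError as e:
    print(e)
```

In C++ the same constraints can be pushed into the type system (e.g., move-only handles for KV blocks), so a whole class of use-after-release bugs fails to compile rather than failing in production.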
The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, meaning the engine is not locked to NVIDIA hardware. The dev team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, since num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, TokenSpeed MLA has been adopted by vLLM.
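The registry-plus-plugin pattern can be sketched as follows (function and backend names are hypothetical; TokenSpeed’s actual API is not shown in the source): kernels register for an (operation, backend) pair, and the engine selects the best candidate at dispatch time, so a new accelerator only needs a plugin, not engine-core changes.

```python
# Minimal pluggable kernel registry with priority-based selection.
_REGISTRY = {}

def register_kernel(op, backend, priority=0):
    """Decorator: register a kernel implementation for (op, backend)."""
    def wrap(fn):
        _REGISTRY.setdefault((op, backend), []).append((priority, fn))
        return fn
    return wrap

def select_kernel(op, backend):
    """Pick the highest-priority kernel registered for (op, backend)."""
    candidates = _REGISTRY.get((op, backend), [])
    if not candidates:
        raise LookupError(f"no kernel for {op} on {backend}")
    return max(candidates, key=lambda pair: pair[0])[1]

@register_kernel("mla_decode", "cuda", priority=10)
def mla_decode_fast(q, kv):
    return "fast-cuda-mla"

@register_kernel("mla_decode", "cuda")
def mla_decode_fallback(q, kv):
    return "fallback"

kernel = select_kernel("mla_decode", "cuda")
print(kernel(None, None))   # fast-cuda-mla
```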


Finally, TokenSpeed integrates SMG, a PyTorch-native component, for a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.
Benchmark Results Against TensorRT-LLM on NVIDIA B200
It’s worth noting upfront that these benchmarks cover single-instance (non-disaggregated) deployment only. PD disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.
Together with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror production coding-agent traffic, and benchmarked against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.
For coding agents operating above 70 TPS/user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS/user. TP4 here refers to tensor parallelism across four GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
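A quick back-of-envelope view of what TP4 sharding buys (the layer dimensions below are illustrative, not Kimi K2.5’s actual shapes):

```python
# Sharding one fp16 weight matrix across 4 GPUs divides per-device
# memory (and the matmul work) by 4, at the cost of one collective
# per layer to reassemble activations.
hidden, ffn = 7168, 18432            # illustrative layer sizes
bytes_full = hidden * ffn * 2        # fp16 = 2 bytes/element
tp = 4
bytes_per_gpu = bytes_full // tp
print(bytes_per_gpu / bytes_full)    # 0.25
```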
On the MLA kernel, the gains are most pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM’s MLA across all five typical prefill workloads for coding agents with a long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with a long prefix KV cache.
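The axis-folding trick can be made concrete with a small arithmetic sketch (the tile size, head count, and draft length below are illustrative assumptions, not TokenSpeed’s actual parameters):

```python
# With speculative decoding, each request decodes several draft tokens
# per step. Using (q_seqlen * num_heads) as the matmul M dimension can
# fill a Tensor Core tile that num_heads alone would leave mostly empty.
M_TILE = 64                   # illustrative Tensor Core M-tile size
num_heads, q_seqlen = 16, 4   # e.g. 4 speculative draft tokens

util_unfolded = num_heads / M_TILE             # 0.25: tile 75% idle
util_folded = (num_heads * q_seqlen) / M_TILE  # 1.0: tile fully used
print(util_unfolded, util_folded)
```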
Key Takeaways
- TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads. (Available in preview.)
- Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
- On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/user on Kimi K2.5.
- The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.
Check out the Technical details and GitHub Repo.








