AimactGrow
LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads

By Admin
May 7, 2026


Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference engines serving these requests are under increasing strain. Researchers at the LightSeek Foundation have released TokenSpeed, an open-source LLM inference engine published under the MIT license and designed specifically for the demands of agentic workloads. TokenSpeed is currently in preview.

Why Agentic Inference Is a Different Problem

To understand why TokenSpeed's design choices matter, it helps to understand what makes agentic inference hard. Coding agents do not behave like a typical chatbot turn. Contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks do not fully capture this behavior.

TokenSpeed is designed to maximize both. The objective is to maximize per-GPU TPM while maintaining a per-user TPS floor: typically 70 TPS, and sometimes 200 TPS or higher.
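To make the relationship between the two metrics concrete, here is a small back-of-the-envelope sketch. The aggregate throughput figure is an assumed round number for illustration, not a TokenSpeed benchmark result: given a GPU's total decode throughput and a per-user TPS floor, the number of users that can share the GPU follows directly.

```python
# Illustrative capacity math for the two serving metrics discussed above.
# The aggregate throughput is a hypothetical figure, not a benchmark result.

def max_concurrent_users(gpu_tokens_per_sec: float, tps_floor: float) -> int:
    """How many users one GPU can serve while each still gets >= tps_floor."""
    return int(gpu_tokens_per_sec // tps_floor)

def per_gpu_tpm(gpu_tokens_per_sec: float) -> float:
    """Aggregate tokens per minute for one GPU."""
    return gpu_tokens_per_sec * 60

# Suppose a GPU sustains 7,000 decode tokens/s in aggregate (assumed).
agg = 7_000.0
print(max_concurrent_users(agg, 70.0))   # 100 users at a 70 TPS floor
print(max_concurrent_users(agg, 200.0))  # 35 users at a 200 TPS floor
print(per_gpu_tpm(agg))                  # 420000.0 tokens per minute
```

Raising the TPS floor from 70 to 200 cuts the number of co-served users by almost 3x, which is exactly the TPM-versus-TPS tension the engine is built around.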

Architecture: Five Interlocking Subsystems

TokenSpeed's architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, safety restrictions on KV-resource reuse, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.

The modeling layer uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program but on different subsets of the data, a standard pattern in distributed deep learning. Rather than requiring developers to hand-write the communication logic between processes, TokenSpeed lets them specify I/O placement annotations at module boundaries; a lightweight static compiler then automatically generates the required collective operations during model construction.
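The names below (`sharded`, `replicated`, the `@placement` decorator) are hypothetical, invented for illustration; TokenSpeed's actual annotation API may differ. The sketch shows the idea: the developer declares how a module's inputs and outputs are laid out across ranks, and a compiler pass inserts the collective (here a simulated all-gather) wherever an output placement requires data the local shard does not hold.

```python
# Hypothetical sketch of SPMD placement annotations at module boundaries.
# "sharded" / "replicated" and @placement are invented names for illustration.

from typing import Callable, List

WORLD_SIZE = 4

def all_gather(shards: List[list]) -> list:
    """Simulated collective: concatenate every rank's shard."""
    return [x for shard in shards for x in shard]

def placement(inputs: str, outputs: str) -> Callable:
    """Record I/O placements; a static compiler pass reads these to decide
    which collectives to emit at the module boundary."""
    def wrap(fn: Callable) -> Callable:
        fn.placements = {"inputs": inputs, "outputs": outputs}
        return fn
    return wrap

@placement(inputs="sharded", outputs="replicated")
def mlp_block(local_shard: list) -> list:
    # Per-rank compute on the local shard of the hidden dimension.
    return [2 * x for x in local_shard]

# The "compiler" sees outputs="replicated" while compute is sharded,
# so it inserts an all-gather after the per-rank calls.
per_rank = [mlp_block([r]) for r in range(WORLD_SIZE)]
full = all_gather(per_rank)
print(full)  # [0, 2, 4, 6]
```

The point of the annotation style is that the module author states intent (placements) while the collective calls themselves are machine-generated, which is what removes the manual communication logic.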

The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. By encoding these constraints into the type system instead of leaving them to runtime convention, errors in KV cache management, one of the most error-prone areas in LLM serving, are caught earlier. The execution plane is implemented in Python to preserve development efficiency, enabling faster feature iteration and lower cognitive load for developers.
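The idea of an FSM with ownership semantics can be sketched in a few lines. State names and transition rules below are hypothetical; TokenSpeed's real control plane is C++ and pushes these checks into the type system, whereas this Python sketch can only enforce them at runtime.

```python
# Sketch of a request-lifecycle FSM with explicit KV-cache ownership.
# State and transition names are hypothetical illustrations.

from enum import Enum, auto

class ReqState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FINISHED = auto()   # KV blocks may only be freed from here

ALLOWED = {
    ReqState.QUEUED:     {ReqState.PREFILLING},
    ReqState.PREFILLING: {ReqState.DECODING},
    ReqState.DECODING:   {ReqState.DECODING, ReqState.FINISHED},
    ReqState.FINISHED:   set(),
}

class Request:
    def __init__(self) -> None:
        self.state = ReqState.QUEUED
        self.kv_blocks: list = ["blk0", "blk1"]

    def transition(self, new: "ReqState") -> None:
        if new not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.state = new

    def free_kv(self) -> None:
        # Freeing the KV cache is only legal once the request is FINISHED,
        # preventing use-after-free of blocks still owned by the decoder.
        if self.state is not ReqState.FINISHED:
            raise ValueError("KV blocks still owned by an active request")
        self.kv_blocks.clear()

r = Request()
r.transition(ReqState.PREFILLING)
r.transition(ReqState.DECODING)
r.transition(ReqState.FINISHED)
r.free_kv()  # legal only after FINISHED
```

In a typed language the same constraint can be expressed so that calling `free_kv` in the wrong state fails to compile at all, which is the "compile time rather than runtime" property the article describes.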

The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, meaning it is not locked to NVIDIA hardware. The dev team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, since num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, the TokenSpeed MLA kernel has been adopted by vLLM.
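A registry-plus-selection kernel subsystem can be sketched as follows. The operation name, architecture strings, and priority scheme are assumptions for illustration, not TokenSpeed's actual API; the sketch just shows how a plugin mechanism can pick the best implementation per accelerator while keeping a generic fallback.

```python
# Sketch of a kernel registry with per-architecture selection.
# Op names, arch strings, and priorities are illustrative only.

from typing import Callable, Dict, List, Tuple

_REGISTRY: Dict[str, List[Tuple[int, str, Callable]]] = {}

def register_kernel(op: str, arch: str, priority: int) -> Callable:
    """Plugin entry point: any backend can register an implementation."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op, []).append((priority, arch, fn))
        return fn
    return wrap

def select_kernel(op: str, arch: str) -> Callable:
    """Pick the highest-priority implementation compatible with `arch`."""
    candidates = [(p, fn) for p, a, fn in _REGISTRY.get(op, [])
                  if a in (arch, "generic")]
    if not candidates:
        raise LookupError(f"no kernel for {op} on {arch}")
    return max(candidates, key=lambda c: c[0])[1]

@register_kernel("mla_decode", "generic", priority=0)
def mla_decode_reference():
    return "reference"

@register_kernel("mla_decode", "sm100", priority=10)  # Blackwell-class GPU
def mla_decode_blackwell():
    return "blackwell-tuned"

print(select_kernel("mla_decode", "sm100")())  # blackwell-tuned
print(select_kernel("mla_decode", "sm90")())   # reference
```

Because the registry is the only coupling point, a non-NVIDIA backend can ship its kernels as a plugin without touching the engine core, which is what "not locked to NVIDIA hardware" implies.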

Source: https://lightseek.org/blog/lightseek-tokenspeed.html

Finally, TokenSpeed integrates SMG, a PyTorch-native component, as a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.

Benchmark Results Against TensorRT-LLM on NVIDIA B200

It is worth noting upfront that these benchmarks cover single (non-disaggregated) deployment only. PD disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.

Together with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror production coding-agent traffic, and benchmarked against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.

For coding agents operating above 70 TPS/user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS/user. TP4 here refers to tensor parallelism across four GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
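As a rough illustration of why TP4 reduces per-device pressure (the parameter count below is an assumed round number, not Kimi K2.5's actual size): tensor parallelism divides each weight matrix across the TP group, so per-device memory for the sharded weights falls by the TP degree.

```python
# Illustrative tensor-parallel memory math. The model size is an assumed
# round number, not the actual Kimi K2.5 parameter count.

def per_device_weight_bytes(total_params: float, bytes_per_param: int,
                            tp_degree: int) -> float:
    """Weights are sharded evenly across the TP group."""
    return total_params * bytes_per_param / tp_degree

total = 100e9          # assume a 100B-parameter model
fp8 = 1                # 1 byte per parameter at FP8
print(per_device_weight_bytes(total, fp8, 1) / 1e9)  # 100.0 GB on one GPU
print(per_device_weight_bytes(total, fp8, 4) / 1e9)  # 25.0 GB per GPU at TP4
```

The freed memory can then hold more KV cache, which is what lets a TP4 configuration batch more concurrent agent sessions at the same TPS floor.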

On the MLA kernel, the gains are more pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM's MLA across all five typical prefill workloads for coding agents with long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with long prefix KV cache.
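The folding trick can be sketched at the level of array shapes (the dimensions and the 64-wide M tile below are assumed for illustration): with speculative decoding, each request decodes several query positions at once, and merging that small q_seqlen axis with a small num_heads axis yields a larger M dimension for the first batched matmul (BMM1), so the Tensor Core tile is filled instead of mostly padded.

```python
# Shape-level sketch of folding q_seqlen into the head axis for MLA decode.
# All dimensions and the M-tile size (64) are assumed for illustration.

import numpy as np

batch, q_seqlen, num_heads, head_dim = 8, 4, 16, 128
q = np.zeros((batch, q_seqlen, num_heads, head_dim), dtype=np.float32)

# Unfolded: BMM1 runs per head with M = q_seqlen = 4, far below a 64-wide
# M tile, so most of the tile is padding.
m_unfolded = q_seqlen

# Folded: merge q_seqlen and num_heads into one axis, M = 4 * 16 = 64,
# exactly filling the tile.
q_folded = q.reshape(batch, q_seqlen * num_heads, head_dim)
m_folded = q_folded.shape[1]

M_TILE = 64
print(m_unfolded / M_TILE)  # 0.0625 -> ~6% of the M tile used unfolded
print(m_folded / M_TILE)    # 1.0    -> full M-tile utilization folded
```

This works for MLA because all heads share the same latent KV, so the fold changes only how the query side is tiled, not which KV data each head attends to.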

Key Takeaways

  • TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads. (Available in preview.)
  • Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
  • On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/user on Kimi K2.5.
  • The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.

Check out the technical details and the GitHub repo.




© 2025 https://blog.aimactgrow.com/ - All Rights Reserved
