Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, such as long-horizon software-engineering tasks.

The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under 5 minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.

TL;DR

prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
GLM-5 trained on SWE at 131k sequence length, sub-5-minute steps, 28 H200 nodes.
Asynchronous RL disaggregates trainer and inference for independent optimization.
Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.

What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.

The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

A full GLM-5.1 run starts with one command on a Slurm cluster.

uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

Role of asynchronous RL

Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.

Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.

There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from multiple policy versions.

New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
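
The staleness rule and the cache salt can be pictured with a small sketch. It is not prime-rl's implementation: the `Rollout` class and the helper functions are hypothetical names, and only `max_off_policy_steps` comes from the text above. The sketch assumes staleness is the gap between the trainer's current step and the oldest policy version that produced any token in the rollout, and that the salt is simply derived from the policy version.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical record: the policy version that generated each token."""
    rollout_id: str
    token_policy_versions: list[int] = field(default_factory=list)


def is_too_stale(rollout: Rollout, trainer_step: int, max_off_policy_steps: int) -> bool:
    # Assumption: staleness is measured from the oldest policy version that
    # contributed tokens. An empty rollout is treated as fresh.
    oldest = min(rollout.token_policy_versions, default=trainer_step)
    return trainer_step - oldest > max_off_policy_steps


def cache_salt(policy_version: int) -> str:
    # Assumption: a version-derived salt. Two requests with identical prefixes
    # but different salts cannot share prefix-cache entries.
    return f"policy-v{policy_version}"


def filter_batch(rollouts: list[Rollout], trainer_step: int, max_off_policy_steps: int) -> list[Rollout]:
    """Keep only rollouts that are within the allowed off-policy window."""
    return [r for r in rollouts if not is_too_stale(r, trainer_step, max_off_policy_steps)]


fresh = Rollout("a", [10, 10, 11])  # tokens from policy versions 10 and 11
mixed = Rollout("b", [7, 8, 11])    # started under a much older policy
kept = filter_batch([fresh, mixed], trainer_step=11, max_off_policy_steps=2)
print([r.rollout_id for r in kept])
```

With a window of two steps, rollout `b` is dropped because its oldest tokens are four steps behind the trainer, while `a` is kept even though it mixes two policy versions.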

Inference optimizations

Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.

FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.

Wide Expert Parallelism: Wide EP spreads experts across ≥32 GPUs. It pairs with a large data-parallel rank, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per layer, through dispatch and combine operations.
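
As a rough picture of the dispatch step, the sketch below groups (token, expert) pairs by the GPU that owns each expert. It is an illustration only, not the DeepEP kernels: the function names, the contiguous expert-to-rank layout, and the figure of 256 experts are all assumptions made for the example.

```python
from collections import defaultdict


def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Rank holding `expert_id`, assuming contiguous, equal-sized shards."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch(token_to_experts: dict[int, list[int]], num_experts: int, ep_size: int) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by the rank that owns the expert."""
    per_rank: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for token, experts in token_to_experts.items():
        for expert in experts:
            per_rank[expert_owner(expert, num_experts, ep_size)].append((token, expert))
    return dict(per_rank)


# Hypothetical setup: 256 experts over 32 GPUs gives 8 experts per GPU.
routing = {0: [0, 9], 1: [255]}  # token index -> selected expert ids
plan = dispatch(routing, num_experts=256, ep_size=32)
print(plan)
```

The combine step is the inverse: each rank returns its expert outputs to the rank that owns the token, where they are weighted and summed. Both steps run once per MoE layer, which is why synchronization is per layer.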

Prefill and Decode Disaggregation: Some model↔env pairs hit a 4:1 prefill:decode token ratio. Shared workers would inflate end-to-end latency. That reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers. Long tool outputs then stop throttling decode workers.

KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.

Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
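
A minimal sketch of such a scoring rule follows. It is not the formula used by vllm-router or Dynamo: the `WorkerState` fields, the linear form, and the weights are hypothetical and chosen only to show how the three signals can trade off against each other.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical snapshot of one inference worker."""
    name: str
    cached_prefix_tokens: int  # prompt tokens already in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    load: float                # live utilization in [0, 1]


def score(worker: WorkerState, prompt_tokens: int,
          w_cache: float = 1.0, w_queue: float = 0.1, w_load: float = 0.5) -> float:
    # Reward KV cache reuse, penalize queued work and current load.
    reuse = worker.cached_prefix_tokens / max(prompt_tokens, 1)
    return w_cache * reuse - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers: list[WorkerState], prompt_tokens: int) -> WorkerState:
    """Return the worker with the highest score for this request."""
    return max(workers, key=lambda w: score(w, prompt_tokens))


warm = WorkerState("warm", cached_prefix_tokens=900, queue_depth=2, load=0.5)
idle = WorkerState("idle", cached_prefix_tokens=0, queue_depth=0, load=0.1)
swamped = WorkerState("swamped", cached_prefix_tokens=900, queue_depth=10, load=0.9)
print(pick_worker([warm, idle, swamped], prompt_tokens=1000).name)
```

Under these assumed weights, a warm cache wins over an idle worker, but a long queue can outweigh the same cache hit.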

Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures inference routing decisions. It replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. Routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Optimized PyTorch operations handle the processing.
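
The core idea of router replay can be shown for a single token in a single layer. This is a plain-Python illustration under stated assumptions, not prime-rl's PyTorch code: `top_k_experts` and `route` are hypothetical helpers, and the logits are made-up numbers standing in for small numerical differences between the two systems.

```python
from typing import Optional


def top_k_experts(router_logits: list[float], k: int) -> list[int]:
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def route(router_logits: list[float], k: int, replayed: Optional[list[int]] = None) -> list[int]:
    """Use recorded expert ids when provided; otherwise recompute top-k."""
    if replayed is not None:
        return list(replayed)
    return top_k_experts(router_logits, k)


# The same token seen by both systems, with a tiny numerical difference.
inference_logits = [0.9, 0.501, 0.500, 0.1]
trainer_logits = [0.9, 0.499, 0.500, 0.1]

recorded = route(inference_logits, k=2)                        # captured during the rollout
recomputed = route(trainer_logits, k=2)                        # trainer without replay
replayed = route(trainer_logits, k=2, replayed=recorded)       # trainer with replay
print(recorded, recomputed, replayed)
```

Without replay, the trainer selects a different second expert for this token, so it computes log-probabilities through a different path than the one that generated the token. With replay, both sides use the same experts.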

Training optimizations

The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

Technique	What it shards	Primary use	Key detail
FSDP (FSDP2)	Parameters, gradients, optimizer states	Baseline memory amortization	Gathers weights on demand per layer via `fully_shard`
Expert Parallelism (EP)	Experts within a layer	Shrinks active layer memory	`all2all` dispatch/combine; torch-native or DeepEP
Context Parallelism (CP)	The sequence dimension	Long-context activation memory	Ulysses (default) or Ring Attention

EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer’s all-gather needs roughly 40GB. Overlapping one layer pushes that near 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
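
The 40GB and 80GB figures follow from simple arithmetic, reproduced below. The only assumptions are that parameters are spread evenly across layers, that embeddings and other non-layer weights are ignored, and that 1 GB means 10^9 bytes.

```python
def per_layer_allgather_gb(total_params: float, num_layers: int, bytes_per_param: int) -> float:
    """Size of one fully gathered layer in GB, assuming an even split across layers."""
    return total_params * bytes_per_param / num_layers / 1e9


TOTAL_PARAMS = 800e9  # 800B parameters, from the text
NUM_LAYERS = 78       # from the text
FLOAT32_BYTES = 4

one_layer_gb = per_layer_allgather_gb(TOTAL_PARAMS, NUM_LAYERS, FLOAT32_BYTES)
# Prefetching the next layer while computing the current one keeps two layers resident.
overlapped_gb = 2 * one_layer_gb

print(f"one layer gathered: {one_layer_gb:.1f} GB")    # about 41 GB
print(f"with one-layer overlap: {overlapped_gb:.1f} GB")  # about 82 GB
```

Most of that per-layer footprint is expert weights, which is why routing tokens to the experts with EP, rather than gathering the experts to every rank, relieves the pressure.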

CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.

FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, due to quantization overhead. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
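
To show what "block-scaled" means, here is a pure-Python simulation that compares one scale per tensor with one scale per block of 128 values. It is a sketch, not the DeepGEMM kernels: `round_to_e4m3` and `quantize_blockwise` are hypothetical helpers, the rounding models the E4M3 format (3 mantissa bits, maximum 448) while ignoring NaN handling, and the input data is invented to contain one large outlier.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value (simulation, no NaN handling)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)        # saturate instead of overflowing
    _, exponent = math.frexp(a)      # a = m * 2**exponent with m in [0.5, 1)
    exponent = max(exponent, -5)     # below the smallest normal, spacing stays at 2**-9
    step = 2.0 ** (exponent - 4)     # 4 significant bits: 1 implicit + 3 stored
    return sign * round(a / step) * step


def quantize_blockwise(values: list[float], block_size: int = 128) -> list[float]:
    """Quantize and dequantize with one scale per block of `block_size` values."""
    out: list[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_to_e4m3(v / scale) * scale for v in block)
    return out


# Invented data: a block of small values followed by a block with a large outlier.
small = [0.001 * (i + 1) for i in range(128)]
outlier_block = [1e5] + [1.0] * 127
values = small + outlier_block

per_tensor = quantize_blockwise(values, block_size=len(values))
per_block = quantize_blockwise(values, block_size=128)
print(per_tensor[:3], per_block[:3])
```

With a single scale, the outlier forces every small value below the FP8 range and the whole first block collapses to zero. With per-block scales, the first block keeps its own range and each value stays within the format's relative precision.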

Use cases with examples

Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means steadier training.
