Nanbeige4-3B-Pondering: How a 23T Token Pipeline Pushes 3B Fashions Previous 30B Class Reasoning

Can a 3B mannequin ship 30B class reasoning by fixing the coaching recipe as an alternative of scaling parameters? Nanbeige LLM Lab at Boss Zhipin has launched Nanbeige4-3B, a 3B parameter small language mannequin household skilled with an unusually heavy emphasis on knowledge high quality, curriculum scheduling, distillation, and reinforcement studying.

The analysis staff ships 2 major checkpoints, Nanbeige4-3B-Base and Nanbeige4-3B-Pondering, and evaluates the reasoning tuned mannequin in opposition to Qwen3 checkpoints from 4B as much as 32B parameters.

Benchmark outcomes

On AIME 2024, Nanbeige4-3B-2511 stories 90.4, whereas Qwen3-32B-2504 stories 81.4. On GPQA-Diamond, Nanbeige4-3B-2511 stories 82.2, whereas Qwen3-14B-2504 stories 64.0 and Qwen3-32B-2504 stories 68.7. These are the two benchmarks the place the analysis’s “3B beats 10× bigger” framing is immediately supported.

The analysis staff additionally showcase sturdy device use positive aspects on BFCL-V4, Nanbeige4-3B stories 53.8 versus 47.9 for Qwen3-32B and 48.6 for Qwen3-30B-A3B. On Area-Exhausting V2, Nanbeige4-3B stories 60.0, matching the best rating listed in that comparability desk contained in the analysis paper. On the similar time, the mannequin will not be finest throughout each class, on Fullstack-Bench it stories 48.0, beneath Qwen3-14B at 55.7 and Qwen3-32B at 58.2, and on SuperGPQA it stories 53.2, barely beneath Qwen3-32B at 54.1.

The coaching recipe, the elements that transfer a 3B mannequin

Hybrid Knowledge Filtering, then resampling at scale

For pretraining, the analysis staff mix multi dimensional tagging with similarity primarily based scoring. They scale back their labeling house to twenty dimensions and report 2 key findings, content material associated labels are extra predictive than format labels, and a tremendous grained 0 to 9 scoring scheme outperforms binary labeling. For similarity primarily based scoring, they construct a retrieval database with tons of of billions of entries supporting hybrid textual content and vector retrieval.

They filter to 12.5T tokens of top of the range knowledge, then choose a 6.5T greater high quality subset and upsample it for two or extra epochs, producing a closing 23T token coaching corpus. That is the primary place the place the report diverges from typical small mannequin coaching, the pipeline isn’t just “clear knowledge”, it’s scored, retrieved, and resampled with express utility assumptions.

FG-WSD, a knowledge utility scheduler as an alternative of uniform sampling

Most related analysis tasks deal with warmup secure decay as a studying fee schedule solely. Nanbeige4-3B provides a knowledge curriculum contained in the secure part by way of FG-WSD, Positive-Grained Warmup-Secure-Decay. As a substitute of sampling a set combination all through secure coaching, they progressively focus greater high quality knowledge later in coaching.

In a 1B ablation skilled on 1T tokens, the above Desk exhibits GSM8K bettering from 27.1 underneath vanilla WSD to 34.3 underneath FG-WSD, with positive aspects throughout CMATH, BBH, MMLU, CMMLU, and MMLU-Professional. Within the full 3B run, the analysis staff splits coaching into Warmup, Variety-Enriched Secure, Excessive-High quality Secure, and Decay, and makes use of ABF within the decay stage to increase context size to 64K.

Multi-stage SFT, then repair the supervision traces

Put up coaching begins with chilly begin SFT, then total SFT. The chilly begin stage makes use of about 30M QA samples centered on math, science, and code, with 32K context size, and a reported mixture of about 50% math reasoning, 30% scientific reasoning, and 20% code duties. The analysis staff additionally declare that scaling chilly begin SFT directions from 0.5M to 35M retains bettering AIME 2025 and GPQA-Diamond, with no early saturation of their experiments.

General SFT shifts to a 64K context size combine together with normal dialog and writing, agent type device use and planning, more durable reasoning that targets weaknesses, and coding duties. This stage introduces Resolution refinement plus Chain-of-Thought reconstruction. The system runs iterative generate, critique, revise cycles guided by a dynamic guidelines, then makes use of a series completion mannequin to reconstruct a coherent CoT that’s in line with the ultimate refined answer. That is meant to keep away from coaching on damaged reasoning traces after heavy modifying.

DPD distillation, then multi stage RL with verifiers

Distillation makes use of Twin-Degree Desire Distillation, DPD. The scholar learns token degree distributions from the trainer mannequin, whereas a sequence degree DPO goal maximizes the margin between optimistic and unfavourable responses. Positives come from sampling the trainer Nanbeige3.5-Professional, negatives are sampled from the 3B scholar, and distillation is utilized on each pattern sorts to cut back assured errors and enhance options.

Reinforcement studying is staged by area, and every stage makes use of on coverage GRPO. The analysis staff describes on coverage knowledge filtering utilizing avg@16 move fee and retaining samples strictly between 10% and 90% to keep away from trivial or unimaginable gadgets. STEM RL makes use of an agentic verifier that calls a Python interpreter to examine equivalence past string matching. Coding RL makes use of artificial check capabilities, validated by way of sandbox execution, and makes use of move fail rewards from these exams. Human desire alignment RL makes use of a pairwise reward mannequin designed to supply preferences in a couple of tokens and scale back reward hacking danger in comparison with normal language mannequin rewarders.

Comparability Desk

Benchmark, metric	Qwen3-14B-2504	Qwen3-32B-2504	Nanbeige4-3B-2511
AIME2024, avg@8	79.3	81.4	90.4
AIME2025, avg@8	70.4	72.9	85.6
GPQA-Diamond, avg@3	64.0	68.7	82.2
SuperGPQA, avg@3	46.8	54.1	53.2
BFCL-V4, avg@3	45.4	47.9	53.8
Fullstack Bench, avg@3	55.7	58.2	48.0
ArenaHard-V2, avg@3	39.9	48.4	60.0

Key Takeaways

3B can lead a lot bigger open fashions on reasoning, underneath the paper’s averaged sampling setup. Nanbeige4-3B-Pondering stories AIME 2024 avg@8 90.4 vs Qwen3-32B 81.4, and GPQA-Diamond avg@3 82.2 vs Qwen3-14B 64.0.
The analysis staff is cautious about analysis, these are avg@ok outcomes with particular decoding, not single shot accuracy. AIME is avg@8, most others are avg@3, with temperature 0.6, prime p 0.95, and lengthy max technology.
Pretraining positive aspects are tied to knowledge curriculum, not simply extra tokens. Positive-Grained WSD schedules greater high quality mixtures later, and the 1B ablation exhibits GSM8K shifting from 27.1 to 34.3 versus vanilla scheduling.
Put up-training focuses on supervision high quality, then desire conscious distillation. The pipeline makes use of deliberative answer refinement plus chain-of-thought reconstruction, then Twin Desire Distillation that mixes token distribution matching with sequence degree desire optimization.

Take a look at the Paper and Mannequin Weights. Be at liberty to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be happy to comply with us on Twitter and don’t neglect to hitch our 100k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be a part of us on telegram as effectively.

Michal Sutter is a knowledge science skilled with a Grasp of Science in Knowledge Science from the College of Padova. With a stable basis in statistical evaluation, machine studying, and knowledge engineering, Michal excels at reworking advanced datasets into actionable insights.