Baidu Releases Limitless OCR, a 3B Mannequin That Retains the KV Cache Flat for Lengthy-Doc Parsing

Most end-to-end OCR fashions decelerate as output grows. Every generated token provides to the KV cache. Reminiscence rises and era drags. Parsing dozens of pages turns into impractical. Baidu’s Limitless OCR addresses this immediately. It swaps the decoder’s consideration for a design that retains reminiscence fixed.

TL;DR

Limitless OCR is a 3B-parameter Combination-of-Specialists mannequin, with solely 500M parameters lively.
It replaces decoder consideration with Reference Sliding Window Consideration (R-SWA), preserving the KV cache fixed.
The mannequin parses dozens of pages in a single ahead move underneath a 32K most size.
It scores 93.23 on OmniDocBench v1.5, beating the DeepSeek OCR baseline by 6.22 factors.
It builds on DeepSeek OCR through continue-training, not a from-scratch run.

What’s Limitless OCR?

Limitless OCR takes DeepSeek OCR as its baseline. It retains the DeepEncoder and the Combination-of-Specialists decoder. The MoE design holds 3B complete parameters however prompts solely 500M at inference.

The DeepEncoder is the compression engine. It cascades a SAM-ViT underneath window consideration with a CLIP-ViT underneath world consideration. On the bridge, it applies 16× token compression. A 1024×1024 PDF picture turns into simply 256 visible tokens. Fewer enter tokens imply a smaller prefill.

DeepEncoder natively helps 5 decision modes, and Limitless OCR retains two. ‘Base’ mode runs at 1024×1024 for multi-page work. ‘Gundam’ mode makes use of dynamic decision for single pages.

How R-SWA Retains the Cache Fixed

The contribution is Reference Sliding Window Consideration. Customary Multi-Head Consideration shops a key and worth for each token. As output size T grows, the cache grows with it. The scale is C_MHA(T) = L_m + T. Reminiscence and latency climb with out certain.

R-SWA breaks that hyperlink. Every generated token attends to all reference tokens, which means the visible tokens and the immediate. It additionally attends to the previous n output tokens, the place n defaults to 128. All the things older is evicted. The cache turns into a set queue of dimension m + n.

The scale is C_R-SWA(T) = L_m + min(n, T) ≤ L_m + n. It’s bounded by a continuing. As T grows far past n, the cache ratio traits towards zero. So reminiscence stays flat and per-step latency stays flat.

The analysis staff evaluate this to delicate forgetting. An individual copying a e-book glances on the supply and the previous few phrases. They don’t re-read every thing transcribed to date. Visible tokens by no means endure state updates. That avoids the progressive blurring seen in linear consideration. The interactive simulator under allows you to differ T and watch each caches reply.

generated-winCells){ t.className=”tok win”; } else { t.className=”tok evicted”; } } s.appendChild(t); } } perform postHeight(){ strive{ dad or mum.postMessage({sort:’uocr-resize’,peak:doc.physique.offsetHeight+40},’*’); }catch(e){} } perform play(){ if(anim){stopAnim();return;} doc.getElementById(‘play’).textContent=”⏸ Pause”; tokens.worth=256; anim=setInterval(perform(){ var v=+tokens.worth+3000; if(v>=120000){v=120000; render(); stopAnim(); return;} tokens.worth=v; render(); },90); } perform stopAnim(){clearInterval(anim);anim=null;doc.getElementById(‘play’).textContent=”▶ Animate decoding”;} pages.addEventListener(‘enter’,render); tokens.addEventListener(‘enter’,perform(){ if(anim) stopAnim(); render();}); win.addEventListener(‘enter’,render); doc.getElementById(‘play’).addEventListener(‘click on’,play); doc.getElementById(‘reset’).addEventListener(‘click on’,perform(){ stopAnim(); pages.worth=8; tokens.worth=8000; win.worth=128; render(); }); window.addEventListener(‘load’,render); window.addEventListener(‘resize’,postHeight); render(); })();