How Baidu's newly released Unlimited-OCR transcribes dozens of pages in one forward pass
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
https://i.redd.it/zjduf8zns79h1.gif

Baidu released Unlimited-OCR 2 days ago, and they claim it can transcribe dozens of pages in one forward pass. I read the research paper and decided to make a post (link if anyone's interested).

Problem they are solving

The problem it targets is basically well known. End-to-end OCR models transcribe a page one token at a time, and each new token attends back over everything generated so far. The accumulated KV cache drives up memory and progressively slows generation as the output grows. In practice that means page 20 costs far more than page 1, which is why most pipelines chunk a PDF page by page and stitch the results.

Their Fix

Their fix is a new attention mechanism, Reference Sliding Window Attention (R-SWA). The framing in the paper is: when a human copies a document, you don't re-scan everything you've already written, you just glance at the surrounding context to stay oriented. R-SWA encodes that directly. The visual tokens (the encoded image) are treated as reference and stay fully visible to every generated token, while the generated text only attends to a sliding window of the previous n tokens, 128 by default.

Based on DeepSeek-OCR

Baidu took DeepSeek-OCR as the baseline and replaced all of the decoder's attention layers with R-SWA. Everything else is inherited: the encoder (which compresses a 1024x1024 page into roughly 256 visual tokens), the 16x image compression, and the MoE setup (3B total params, only 500M active per token) all come straight from DeepSeek.

Note: On benchmarks they report 93.92% on OmniDocBench v1.6 against DeepSeek-OCR's 87.01% on v1.5. Those numbers are vendor-reported and on slightly different benchmark versions, so it's worth waiting for independent evaluation before drawing firm conclusions.

The model is MIT licensed and available on Hugging Face and ModelScope.

Hugging Face: https://huggingface.co/baidu/Unlimited-OCR
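
To make the R-SWA idea concrete, here is a minimal NumPy sketch of the attention pattern as the post describes it: every generated token sees all visual tokens, plus only a short window of recent text. This is not Baidu's code. The function names `rswa_mask` and `cached_keys` are hypothetical, and the sketch assumes the window counts the current token itself and that the visual tokens sit at the front of the sequence. The paper may define both differently.

```python
import numpy as np


def rswa_mask(num_visual: int, num_text: int, window: int = 128) -> np.ndarray:
    """Boolean mask of shape (num_text, num_visual + num_text).

    Row i is the i-th generated text token; True means "may attend to this key".
    Assumption: the window of size `window` includes the current token.
    """
    mask = np.zeros((num_text, num_visual + num_text), dtype=bool)
    # Reference part: visual tokens stay visible to every generated token.
    mask[:, :num_visual] = True
    # Text part: causal, and limited to the most recent `window` positions.
    q = np.arange(num_text)[:, None]  # query positions (text index)
    k = np.arange(num_text)[None, :]  # key positions (text index)
    mask[:, num_visual:] = (k <= q) & (k > q - window)
    return mask


def cached_keys(num_visual: int, step: int, window=None) -> int:
    """Key/value entries one decoder layer must keep after `step` text tokens.

    window=None models full causal attention; otherwise the text part is capped.
    """
    text = step if window is None else min(step, window)
    return num_visual + text


if __name__ == "__main__":
    m = rswa_mask(num_visual=4, num_text=6, window=2)
    print(m.astype(int))
    # 256 visual tokens (one page, per the post) and 10,000 generated tokens:
    print(cached_keys(256, 10_000))              # full attention: 10256
    print(cached_keys(256, 10_000, window=128))  # R-SWA sketch: 384
```

Under these assumptions the per-layer cache stops growing once the output passes the window size, which is the property the post credits for keeping long outputs from getting more expensive as they grow.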