Baidu has unveiled Unlimited-OCR, a large model for intelligent document parsing, together with a technical report. Industry speculation links the project's CTO, 'YY', to Wei Haoran, a former core author of DeepSeek-OCR. According to data compiled by Woofun AI, Unlimited-OCR scored 93.92% on the OmniDocBench v1.6 long-document parsing benchmark, setting a new end-to-end state-of-the-art (SOTA) record.
In traditional models, the key-value (KV) cache grows linearly with output length, which slows decoding and consumes excessive GPU memory. To counter this, Baidu uses the Reference Sliding Window Attention mechanism (R-SWA). During decoding, the model attends only to all image features plus a fixed window of the most recently generated text (128 tokens by default), so the total KV cache size stays constant. Because the image features remain fully visible as the window slides, R-SWA avoids blurring image detail during window updates, and it keeps inference speed and GPU memory usage stable for documents longer than 40 pages, delivering a 12.7% speedup over DeepSeek-OCR. Baidu has released the code and weights under the MIT license, with support for Hugging Face Transformers, vLLM, and SGLang, and plans to extend R-SWA to automatic speech recognition (ASR) and translation tasks.
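The bounded-cache idea can be illustrated with a minimal Python sketch. It is not Baidu's implementation: the function names, the position layout (image features first, generated text after), and the example token counts are all assumptions made for illustration. Only the default window of 128 tokens comes from the description above.

```python
# Illustrative sketch only: names and token layout are hypothetical,
# not taken from the Unlimited-OCR code.

def rswa_visible_keys(num_image_tokens: int, num_generated: int,
                      window: int = 128) -> list[int]:
    """Key positions the next decoded token may attend to.

    Assumed layout: positions 0..num_image_tokens-1 hold image features,
    and generated text tokens follow immediately after.
    """
    # All image features stay visible at every decoding step.
    image_keys = list(range(num_image_tokens))
    # Only the most recent `window` generated tokens stay visible.
    first_text = num_image_tokens + max(0, num_generated - window)
    text_keys = list(range(first_text, num_image_tokens + num_generated))
    return image_keys + text_keys


def rswa_cache_entries(num_image_tokens: int, num_generated: int,
                       window: int = 128) -> int:
    """KV entries kept under the windowed scheme (bounded by the window)."""
    return num_image_tokens + min(num_generated, window)


def full_cache_entries(num_image_tokens: int, num_generated: int) -> int:
    """KV entries kept under ordinary full attention (grows linearly)."""
    return num_image_tokens + num_generated


if __name__ == "__main__":
    image_tokens = 256  # hypothetical number of image feature tokens
    for generated in (64, 128, 1_000, 10_000):
        print(generated,
              rswa_cache_entries(image_tokens, generated),
              full_cache_entries(image_tokens, generated))
```

Under these assumptions the windowed cache stops growing once 128 text tokens have been generated, while the full-attention cache keeps growing with every output token, which is the behavior the article credits for stable speed and memory use on long documents.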