Woofun AI reports that DeepSeek has officially rolled out Vision Mode on its web and app platforms, where it sits alongside the Quick and Expert modes. The new capability goes beyond basic optical character recognition (OCR): it offers deep scene analysis, spatial logical reasoning, and direct conversion of UI screenshots into structured HTML code. For complex geometric deductions or intricate chart analysis, the system automatically engages its deep thinking model to generate a full reasoning chain.
The underlying architecture relies on the 'Thinking with Visual Primitives' research framework, co-authored by multimodal researcher Xiaokang Chen with scholars from Peking University and Tsinghua University. To close the 'Reference Gap' that existing visual language models show in fine-grained positioning, the team integrated spatial primitives such as coordinate points and bounding boxes directly into the Chain of Thought process. Although the foundational academic paper and open-source project were released on April 30, DeepSeek officials abruptly retracted them on May 1, fueling speculation about a technical leak. Currently, Vision Mode accepts only image input; video input, audio input, and image generation are not supported.
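To make the idea of spatial primitives inside a Chain of Thought more concrete, here is a minimal Python sketch. It is an illustration only, not DeepSeek's implementation: the class names (`Point`, `BoundingBox`, `ReasoningStep`), the `render_chain` helper, and the `<point>`/`<box>` tag format are all hypothetical and do not come from the retracted paper or project.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Point:
    """A pixel coordinate that a reasoning step refers to (hypothetical type)."""
    x: int
    y: int

    def render(self) -> str:
        # The <point> tag format is an assumption made for this sketch.
        return f"<point>({self.x}, {self.y})</point>"


@dataclass(frozen=True)
class BoundingBox:
    """An axis-aligned box given by its top-left and bottom-right corners."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def render(self) -> str:
        # The <box> tag format is likewise an assumption.
        return f"<box>({self.x_min}, {self.y_min}, {self.x_max}, {self.y_max})</box>"

    def contains(self, p: Point) -> bool:
        """Return True if the point lies inside or on the edge of the box."""
        return self.x_min <= p.x <= self.x_max and self.y_min <= p.y <= self.y_max


Primitive = Union[Point, BoundingBox]


@dataclass
class ReasoningStep:
    """One step of the chain: free text plus the image regions it is grounded in."""
    text: str
    refs: List[Primitive] = field(default_factory=list)

    def render(self) -> str:
        anchors = " ".join(ref.render() for ref in self.refs)
        return f"{self.text} {anchors}".strip()


def render_chain(steps: List[ReasoningStep]) -> str:
    """Serialize the steps as a numbered chain with inline spatial anchors."""
    return "\n".join(f"{i}. {step.render()}" for i, step in enumerate(steps, start=1))


if __name__ == "__main__":
    button = BoundingBox(40, 300, 200, 340)
    cursor = Point(120, 320)
    chain = [
        ReasoningStep("The submit button is located here", [button]),
        ReasoningStep("The cursor sits inside the button", [cursor]),
    ]
    print(render_chain(chain))
```

The point of the sketch is that each reasoning step carries explicit coordinates rather than a vague verbal reference such as "the button on the left", which is one plausible reading of how grounding a chain of thought in points and boxes could narrow a reference gap.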