DeepSeek Deploys Vision Mode with Visual CoT Reasoning Based on Retracted Grounding Framework

2026-06-18 16:54

Woofun AI reports that DeepSeek has officially deployed Vision Mode across its web and app platforms, where it sits alongside the existing Quick and Expert modes. The new capability goes beyond basic optical character recognition: it offers deep scene analysis, spatial and logical reasoning, and direct conversion of UI screenshots into structured HTML code. For complex geometric deductions or intricate chart analysis, the system automatically engages its deep thinking model to generate a full reasoning chain.

The underlying architecture relies on the 'Thinking with Visual Primitives' research framework, co-authored by multimodal researcher Xiaokang Chen with scholars from Peking University and Tsinghua University. To close the 'Reference Gap' that limits fine-grained positioning in existing vision-language models, the team integrated spatial primitives such as coordinate points and bounding boxes directly into the Chain of Thought process. The foundational academic paper and open-source project were released on April 30, but DeepSeek abruptly withdrew both on May 1, fueling speculation about a technical leak. At present, Vision Mode supports only image input; video, audio, and image generation are not supported.
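To make the idea of spatial primitives inside a reasoning chain more concrete, here is a minimal Python sketch. It is not the withdrawn framework's format, which is not publicly documented here; the class names (`Point`, `Box`, `ReasoningStep`), the normalized coordinate convention, and the `<point>` / `<box>` tag syntax are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Point:
    """Hypothetical point primitive, coordinates normalized to [0, 1]."""
    x: float
    y: float

    def to_pixels(self, width: int, height: int) -> Tuple[int, int]:
        # Scale normalized coordinates to the pixel grid of a given image.
        return (round(self.x * width), round(self.y * height))


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding-box primitive: top-left and bottom-right corners."""
    x1: float
    y1: float
    x2: float
    y2: float

    def to_pixels(self, width: int, height: int) -> Tuple[int, int, int, int]:
        return (
            round(self.x1 * width),
            round(self.y1 * height),
            round(self.x2 * width),
            round(self.y2 * height),
        )


Primitive = Union[Point, Box]


@dataclass
class ReasoningStep:
    """One chain-of-thought step plus the image regions it refers to."""
    text: str
    refs: List[Primitive] = field(default_factory=list)


def render_step(step: ReasoningStep, width: int, height: int) -> str:
    """Serialize a step as text followed by inline primitive tags (assumed syntax)."""
    parts = [step.text]
    for ref in step.refs:
        if isinstance(ref, Point):
            px, py = ref.to_pixels(width, height)
            parts.append(f"<point>({px},{py})</point>")
        else:
            x1, y1, x2, y2 = ref.to_pixels(width, height)
            parts.append(f"<box>({x1},{y1}),({x2},{y2})</box>")
    return " ".join(parts)


if __name__ == "__main__":
    chain = [
        ReasoningStep("The tallest bar is on the left", [Box(0.1, 0.2, 0.5, 0.6)]),
        ReasoningStep("Its peak sits here", [Point(0.25, 0.5)]),
    ]
    for step in chain:
        print(render_step(step, width=1000, height=500))
```

Anchoring each step to explicit coordinates, rather than to a phrase like "the bar on the left", is one way a model could avoid ambiguous references when several similar objects appear in an image.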

