Woofun AI reports that RoboScience, a firm dedicated to embodied intelligence, officially released the Visics general embodied large model on June 24, fully disclosing its proprietary VLOA technical architecture. This launch marks a critical departure from existing industry standards by introducing a unified intermediate representation unit designed to resolve the fragmentation plaguing embodied intelligence. Unlike autonomous driving systems that rely on unified visual or point cloud representations, or large language models dependent on standard text tokens, the field of embodied intelligence has historically lacked a universally recognized basic format. This absence has dictated inefficient data collection methods, restricted model learning sources, and severely limited the transferability of learned knowledge to new operational scenarios. The prevailing approach over the last two years involved training models to directly replicate the joint motion trajectories of specific robots executing particular tasks, effectively hard-coding action coordinates to specific hardware configurations. This methodology creates a fundamental bottleneck: when the robot, the object, or the environment changes, the acquired skills fail to transfer, as the model learns only 'how to use a gripper to pick up a cup' rather than understanding the abstract concept of 'grasping,' including required force vectors and object reaction dynamics.
Tian Ye, founder and CEO of RoboScience, identified three primary bottlenecks hindering current robot operations: poor generalization, difficulty executing precise manipulations, and the accumulation of errors over long-horizon tasks. To address these structural failures, the team rebuilt the foundation by developing a new set of basic representation units centered on the concept of the 'object.' As the core of this technical system, the Visics general embodied large model proposes the Object Trajectory standard, defined as the 3D point cloud trajectory of objects, and uses it to construct a layered, decoupled VLOA architecture. Tian Ye explained that the term 'object' serves a dual purpose, encompassing both the physical entity and the operational goal, which allows robot-object interactions and the desired post-operation motion states to be defined precisely. This architectural shift reconstructs the cognitive and execution logic of robots, moving away from hardware-specific trajectory replication toward a generalized understanding of physical interactions.
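To make the idea of a '3D point cloud trajectory of objects' concrete, the following is a minimal sketch of what such a representation might look like in code. The class name `ObjectTrajectory`, its fields, and its helper methods are illustrative assumptions, not RoboScience's published format; the only detail taken from the report is that the trajectory describes the object's points over time and contains nothing about robot joints.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z); units and frame are assumptions


@dataclass
class ObjectTrajectory:
    """Hypothetical container: one list of tracked 3D points per timestep."""
    frames: List[List[Point]]

    def __post_init__(self) -> None:
        if not self.frames or not self.frames[0]:
            raise ValueError("need at least one frame with at least one point")
        n = len(self.frames[0])
        if any(len(frame) != n for frame in self.frames):
            raise ValueError("every frame must track the same number of points")

    def centroid_path(self) -> List[Point]:
        """Mean position of the tracked points at each timestep."""
        path = []
        for frame in self.frames:
            n = len(frame)
            path.append(tuple(sum(p[i] for p in frame) / n for i in range(3)))
        return path

    def net_displacement(self) -> Point:
        """How far the object's centroid moved from first to last frame."""
        path = self.centroid_path()
        return tuple(end - start for start, end in zip(path[0], path[-1]))


# Two tracked points on a cup, lifted 0.1 along z over three frames.
cup = ObjectTrajectory(frames=[
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
    [(0.0, 0.0, 0.05), (0.1, 0.0, 0.05)],
    [(0.0, 0.0, 0.1), (0.1, 0.0, 0.1)],
])
print(cup.net_displacement())
```

Nothing in this structure refers to a gripper, an arm, or joint angles, which is the property the report attributes to the object-centered representation: the same trajectory could in principle describe a lift performed by any robot.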
The Visics model operates on a dual-engine architecture in which the Embodied World Model and the General Operation Model function independently, each pre-trained and iterated on its own cycle without interfering with the other. The Embodied World Model uses vast quantities of internet video data as its pre-training foundation to model object states, 3D trajectories, contact forces, and physical causal relationships, effectively learning how objects move in the real world. The General Operation Model, by contrast, is tasked with converting these 'object motion trajectories' into executable 'actions that robots should perform.' It is trained on large-scale simulation data generated by a physics engine and is continuously iterated to handle a diverse range of object types, including rigid bodies, articulated (hinged) components, and soft deformable objects. This engine supports cross-robot deployment, closed-loop control, and multi-modal perception inputs such as vision, touch, and force sensing.
According to Woofun AI, this separation lets the upper-layer Embodied World Model predict plausible object motion trajectories while the lower-layer General Operation Model generates hardware-specific control instructions, which is what allows the same system to be deployed across different robot types.
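The layered split described here can be sketched as two interfaces that communicate only through an object trajectory. Everything in the example is a toy assumption made for illustration: the names `WorldModel`, `OperationModel`, `LinearWorldModel`, `GripperArm`, and `SuctionArm` are hypothetical, the 'world model' is plain linear interpolation, and the 'actions' are strings rather than real control commands.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Point = Tuple[float, float, float]
Trajectory = List[Point]  # centroid waypoints standing in for a full point cloud trajectory


class WorldModel(Protocol):
    """Upper layer: predicts how the object should move (robot-agnostic)."""
    def predict(self, start: Point, goal: Point, steps: int) -> Trajectory: ...


class OperationModel(Protocol):
    """Lower layer: turns an object trajectory into robot-specific actions."""
    def to_actions(self, trajectory: Trajectory) -> List[str]: ...


class LinearWorldModel:
    """Toy stand-in that interpolates linearly from start to goal."""
    def predict(self, start: Point, goal: Point, steps: int) -> Trajectory:
        return [
            tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
            for i in range(steps + 1)
        ]


@dataclass
class GripperArm:
    name: str

    def to_actions(self, trajectory: Trajectory) -> List[str]:
        return [f"{self.name}: move_gripper_to({x:.2f}, {y:.2f}, {z:.2f})"
                for x, y, z in trajectory]


@dataclass
class SuctionArm:
    name: str

    def to_actions(self, trajectory: Trajectory) -> List[str]:
        return [f"{self.name}: move_suction_cup_to({x:.2f}, {y:.2f}, {z:.2f})"
                for x, y, z in trajectory]


def run_pipeline(world: WorldModel, robot: OperationModel,
                 start: Point, goal: Point, steps: int = 2) -> List[str]:
    trajectory = world.predict(start, goal, steps)  # object-level plan
    return robot.to_actions(trajectory)             # hardware-level commands


world = LinearWorldModel()
for robot in (GripperArm("arm_a"), SuctionArm("arm_b")):
    for action in run_pipeline(world, robot, (0.0, 0.0, 0.0), (0.0, 0.0, 0.2)):
        print(action)
```

Swapping `GripperArm` for `SuctionArm` changes only the lower layer; the predicted object trajectory is identical for both. That is the sense in which a decoupled design could let each layer be trained and replaced independently.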
The synergy between these two engines through the VLOA architecture achieves full-domain generalization across three dimensions, enabling robots to autonomously complete a variety of tasks regardless of the object type. When applied to grasping actions, the VLOA-based model demonstrates significant improvements over traditional training methods that bind a single robotic arm to a single object. Specifically, the system exhibits enhanced grasping success rates, a richer variety of operating postures, and faster computational response speeds. A robotic arm equipped with the Visics General Embodied Large Model has successfully demonstrated furniture assembly tasks, validating the model's ability to handle complex, multi-step operations that require high precision and adaptability. This capability represents a shift from rigid automation to flexible, intelligent manipulation capable of navigating the unpredictability of real-world environments.
In embodied intelligence, data remains the foundational element determining a model's ultimate capabilities, yet traditional data collection approaches are constrained by high cost and limited capacity. RoboScience has addressed this by establishing a 'simulation + video' dual-data framework, anchored by its self-developed high-precision simulation engine, RoboMirage. This system integrates fully automated video data annotation and cleaning processes, reducing the cost of obtaining an individual data point to between 1/200 and 1/20 of the cost under traditional methods.
Furthermore, the framework allows for continuous expansion at a rate of hundreds of thousands of hours per week, ensuring a steady supply of high-quality training material. By 2026, the company expects to have constructed a dataset exceeding 1 terabyte of high-quality manipulation trajectory data, providing the necessary fuel for further model refinement and scaling.
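As a quick worked example of the reported cost reduction, the snippet below computes the per-data-point cost range implied by the 1/200 to 1/20 figures. The baseline of 100 cost units is a hypothetical value chosen only to make the arithmetic visible; the report does not state an absolute cost.

```python
from fractions import Fraction


def reduced_cost_range(baseline_cost):
    """Return (low, high) per-data-point cost after a 1/200 to 1/20 reduction."""
    baseline = Fraction(baseline_cost)
    return baseline / 200, baseline / 20


# Hypothetical baseline: 100 cost units per data point with traditional collection.
low, high = reduced_cost_range(100)
print(float(low), float(high))  # 0.5 5.0
```

Under that assumed baseline, a data point that cost 100 units to collect traditionally would cost between 0.5 and 5 units in the new framework.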
Since its inception, RoboScience has secured investment and industrial support from a range of corporate venture capital (CVC) investors and financial institutions, including JD.com, SenseTime, Dacheng Capital, China Merchants Venture Capital, Zero One Venture Capital, and Puhua Capital. The company maintains research and development centers in Beijing, Shenzhen, Suzhou, and Hangzhou. With large models as its focus, RoboScience has vertically integrated its self-developed robot bodies, controllers, and RobotOS, while horizontally establishing a framework for model generalization, convenient development, and multi-level ecosystems. This strategy has produced an integrated software-and-hardware, closed-loop business model designed to accelerate the deployment of embodied intelligence solutions. Co-founder Wang Tao noted that true large-scale implementation of embodied intelligence has not yet arrived, which is why the company is prioritizing the object dimension over direct entry into industrial scenarios where it would compete with established automation solutions.
The strategic focus on the object dimension aims to solve generalization across rigid, flexible, and other objects with differing physical properties before tackling specific industrial applications. Scenarios such as supermarkets and e-commerce logistics naturally involve handling a vast number of SKUs across multiple product categories, making them ideal testbeds for verifying generalization at the object level. RoboScience has already begun pilot collaborations with several retail, logistics, and wellness service companies, alongside firms specializing in robot bodies and dexterous hands. The company plans to achieve mass production of standardized robot body products for industrial and commercial applications within the year. This trajectory suggests a deliberate move to validate core technological capabilities in high-variability environments before scaling into rigid industrial automation, a notable departure in how embodied intelligence is commercialized. The success of this approach will likely shape the future viability of general-purpose robotics in unstructured environments.