Woofun AI reports that the term 'world model' still lacks a unified industry definition, leaving a fragmented landscape in which Alibaba, Tencent, Huawei, and automakers pursue the same strategic goal under different names. Some companies label their systems foundational world models or physical AI; others fold these capabilities into autonomous driving architectures, VLA systems, or embodied intelligence frameworks without giving them a specific name. Alibaba's Qwen-AgentWorld, HappyOyster, and Qwen-RobotWorld correspond to the linguistic, virtual, and physical worlds respectively; Tencent's HY-World focuses on editable 3D worlds; automakers prefer terms such as driving world model or world behavior model; Huawei and Baidu do not use the term 'world model' explicitly at all. Behind the naming confusion, every major player is working toward the same objective: building a dynamic environment inside the machine in which it can simulate outcomes and review them before acting for real. That reduces reliance on real-world data and turns the real world into a data engine that can generate endless variations, make mistakes, and start over repeatedly.
Recently, IT Juzi released a report on 33 domestic startups working on 'world models,' which attracted considerable attention in the industry. While startups are still struggling with issues such as data collection rights and compute budgets, Alibaba, Tencent, Huawei, NIO, XPeng, and Li Auto have already quietly made the world model a new competitive arena. The world model represents an ambitious goal: to enable AI to go beyond merely recognizing the world and actually 'visualize' it in its own mind. Autonomous driving developers want to use it to create test scenarios with rain, snow, and complex obstacles; embodied intelligence teams want robots to experience hundreds of thousands of simulated failures before they are deployed in real environments; gaming and social media companies aim to create parallel universes that humans can immerse themselves in. The large companies take different approaches, but their core objective is the same: to turn the real world into a data engine that supports endless simulation and review.
Alibaba's approach to building a world model is akin to 'placing items on a shelf one by one.' In June 2026, it launched three products in just over ten days: the Qwen-Robot series on June 16, HappyOyster 1.0 on June 17, and Qwen-AgentWorld on June 24. Qwen-AgentWorld is a native linguistic world model that generates environments rather than images. Across seven environments (MCP tools, search, terminal, code engineering, Web, operating system, and Android), the model can simulate real interactions, learn autonomously, and improve itself through reinforcement learning. It comes in two versions with total parameter counts of 35B and 397B, of which 3B and 17B parameters, respectively, are active per token. The training data comes from over 10 million real-world interaction trajectories, and both the model and the evaluation benchmark AgentWorldBench have been open-sourced. This approach treats the world model as a 'training ground' for intelligent agents rather than as a showpiece.
Woofun AI data shows that the rapid release of these three distinct models within a single month signals a strategic shift from single-product launches to ecosystem-wide integration.
HappyOyster 1.0, on the other hand, is more like a 'playable film set': given a sentence or an image, it generates an open-world environment that users can interact with in two modes, 'world exploration' and 'real-time director.' The exploration mode supports continuous real-time movement and camera control for up to 1 minute, while the director mode can generate real-time video at 480p/720p resolution for more than 3 minutes. Alibaba positions it as an entry point for industries such as interactive games, virtual companionship, interactive short dramas, and cultural and tourism experiences. Qwen-RobotWorld takes a different direction. It serves as the 'thinking brain' in Alibaba's embodied intelligence ecosystem, working together with the VLA manipulation model Qwen-RobotManip and the VLN navigation model Qwen-RobotNav to give robots an internal mental world in which they can rehearse actions in advance. Together, these three products let Alibaba compete for the right to define the linguistic, virtual, and physical worlds.
Tencent's Hunyuan follows a different path. Its HY-World series is more like an 'automated factory for 3D games.' In July 2025, Tencent open-sourced Hunyuan 3D World Model 1.0 at the WAIC conference; it was upgraded to version 1.5 in December, and HY-World 2.0 was released and open-sourced in April 2026. Inputs can include text, single images, multiple images, videos, or even blank templates, and outputs can be in formats such as 3DGS, mesh, or point clouds. Version 2.0 introduces modules such as HY-Pano 2.0, WorldNav, WorldStereo 2.0, and WorldMirror 2.0, forming a closed-loop system for world generation, reconstruction, panoramic views, and real-time world generation. Tencent's strength lies in gaming and social media scenarios, so the real users of HY-World are not teams developing autonomous driving systems but people building game levels, doing virtual production, and creating digital twins.
ByteDance's world model project is like a 'secret campaign' driven by the needs of its short-video business. In August 2025, ByteDance's Seed team was reported to be developing a world model, led by Zhou Chang, a former key member of Tongyi Qianwen. The project's biggest advantages are the massive volume of video on Douyin and TikTok and the EX-4D framework, which can convert monocular videos into 4D multi-view scenes. ByteDance's goal is not to create an attractive video generator but to build a 'digital twin' that can simulate physical laws. At the Volcano Engine FORCE conference on June 23, 2026, ByteDance did not directly announce a world model but showcased its Seed 2.1 series, the Seedance 2.5 video generation model, the Seedream 5.0 Pro image generation model, and a new audio generation model. ByteDance's AI strategy for 2026 was summarized in four key points: bring the world model to the global state of the art by the end of the year, explore dynamic generation with Seedance, consolidate foundations with Coding, and accelerate commercialization with Doubao.
Huawei's Pangu world model is characterized by a 'low-profile but lethal' approach. At its developer conference in June 2025, Huawei released the Pangu large model, which, building on its multi-modal architecture, can generate high-precision digital physical spaces from a single image. It can predict collisions, train robotic arms to perform tasks accurately, and generate driving videos and lidar point clouds, helping Huawei's ADS end-to-end model ship a new version every two days. Huawei did not use the term 'world model' explicitly but treated it as the 'training foundation' for intelligent vehicles and embodied intelligence. Its collaboration with GAC is a typical example: 2D videos and 3D point clouds are matched at the pixel level, allowing complex scenarios to be reproduced within minutes. At the HDC 2026 conference in June 2026, Huawei upgraded the Pangu large model to version 7.0 and released the Ascend 910C. Yu Chengdong took charge of Pangu again, but there was no news about a new version of the world model itself. This approach, in which the world model does not exist as a separate product but serves an industrial closed loop, is consistent with Huawei's style.
Baidu entered the autonomous driving field earlier. In May 2024, it released Apollo ADFM, which was positioned as 'the world's first autonomous driving large model capable of supporting L4-level autonomous driving.' Although Baidu did not call it a world model, it essentially fulfilled the functions of one by using an end-to-end neural network to understand the physical world and predict the behavior of traffic participants. In November 2025, ERNIE 5.0 (Wenxin 5.0) was unveiled as a natively omni-modal model with 2.4 trillion parameters; the official version was launched in January 2026. Baidu's world model capabilities have been folded into a larger strategic plan: rather than focusing on the world model itself, Baidu aims to make Apollo and ERNIE complement each other.
Xiaomi and SenseTime represent two different technological approaches. On May 13, 2026, Xiaomi open-sourced Xiaomi OneVL, which integrates VLA, a world model, and latent space reasoning into one framework and emphasizes the interpretability of the visual reasoning process. The technology can be used in both autonomous driving and embodied intelligence applications. SenseTime's 'Enlightenment' is more like a 'senior driver' that has already started working. A Frost & Sullivan report in September 2025 described it as the industry's first mass-produced, interactive world model. It can generate 150-second driving videos at 1080p resolution from 11 camera views, and SenseTime has used it to build WorldSim-Drive, the largest generative driving dataset in the industry, as well as a library of millions of generated scenarios. In June 2026, Daxiao Robot, founded by SenseTime co-founder Wang Xiaogang, announced that it had raised hundreds of millions of dollars in financing. Its Enlightenment Kairos world model 3.0 ranks first on four major generative prediction benchmarks for embodied video generation and task instruction following. SenseTime's world model is gradually extending from intelligent vehicles to robots.
Compared with the Internet giants, automakers are clearly the ones 'using the world model' in practical applications. NIO was among the first Chinese automakers to promote the world model as a key concept. At the NIO IN conference in July 2024, Ren Shaoqing introduced NWM (NIO World Model), positioning it as China's first intelligent driving world model. It uses a multivariate autoregressive generative architecture to perform two tasks: 'imaginative reconstruction' in space and 'imaginative deduction' in time. Given a real scenario, it can recreate a 3D world; given a three-second video prompt, it can generate a future video lasting more than two minutes. Every 0.1 seconds, it generates 216 possible trajectories and selects the optimal one. NIO's logic is clear: an end-to-end model alone is not enough; a truly intelligent driving system needs to be able to 'imagine road conditions even with its eyes closed.'
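The "generate 216 candidates every 0.1 seconds, then pick the best" loop described above can be illustrated with a toy sample-and-score planner. This is only a sketch under stated assumptions, not NIO's method: the curvature-based sampler, the cost function, and every name below are hypothetical, and only the two constants come from the article.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple

NUM_CANDIDATES = 216   # candidate count quoted in the article
PLAN_PERIOD_S = 0.1    # planning interval quoted in the article

@dataclass
class Trajectory:
    points: List[Tuple[float, float]]  # (x, y) positions in metres

def sample_candidates(rng, n=NUM_CANDIDATES, steps=20, speed=10.0, dt=PLAN_PERIOD_S):
    """Hypothetical sampler: constant-speed arcs with random curvature."""
    candidates = []
    for _ in range(n):
        curvature = rng.uniform(-0.05, 0.05)  # 1/m, assumed range
        x = y = heading = 0.0
        points = []
        for _ in range(steps):
            heading += curvature * speed * dt
            x += speed * dt * math.cos(heading)
            y += speed * dt * math.sin(heading)
            points.append((x, y))
        candidates.append(Trajectory(points))
    return candidates

def cost(traj, obstacle, safe_radius=2.0):
    """Hypothetical cost: large penalty near the obstacle, small penalty for lateral drift."""
    min_dist = min(math.dist(p, obstacle) for p in traj.points)
    collision_penalty = 1000.0 if min_dist < safe_radius else 0.0
    lateral_drift = abs(traj.points[-1][1])
    return collision_penalty + lateral_drift

def select_best(candidates, obstacle):
    """Return the lowest-cost candidate for this planning cycle."""
    return min(candidates, key=lambda t: cost(t, obstacle))

if __name__ == "__main__":
    rng = random.Random(0)
    obstacle = (15.0, 0.0)  # assumed obstacle straight ahead
    best = select_best(sample_candidates(rng), obstacle)
    print("end point of chosen trajectory:", best.points[-1])
```

In a production system the candidates would come from a learned generative model and the scoring would combine safety, comfort, and progress terms; the sketch only shows the select-the-minimum structure.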
On June 18, 2026, NIO officially released NWM 2.0 and rolled it out to more than 700,000 users across all its models. Even owners who bought their vehicles four years ago could upgrade for free. The Banyan, Cedar, and Coconut+ vehicle systems were updated at the same time. For the first time in China, the new version enabled the intelligent driving model to directly output raw control signals for the steering wheel, accelerator pedal, and brake pedal. The training pipeline was also upgraded from 'world model + closed-loop reinforcement learning' to 'world model + supervised fine-tuning + closed-loop reinforcement learning.' AEB scenario coverage increased 6.7-fold compared with the standard version, and the false-braking rate was reduced to about once every 100,000 kilometers. The Shenji NX9031 chip was even described as 'designed specifically for world models.'
In the second half of 2024, Li Auto proposed a 'reconstruction + generation' approach for its world model and presented DrivingSphere at the CVPR 2025 conference. It combines the OccDreamer diffusion model with the VideoDreamer ST-DiT to create a high-fidelity 4D closed-loop simulation environment. Traditional open-loop simulation can only evaluate what a model 'sees,' while closed-loop simulation can evaluate what the model 'does,' because the model's own actions change the states it encounters next. Li Auto's world model is like an exam room in which difficult scenarios can be practiced repeatedly in advance. At the Livis Day conference in June 2026, Li Auto further enhanced this capability with the 'Mahe VLA' version, which uses a native multi-modal MoE architecture to integrate perception, prediction, and planning. The dual M100 chips on the vehicle provide 2560 TOPS of computing power, giving a response time of 0.28 seconds. According to Li Auto's roadmap, the new Mahe VLA version will be rolled out to AD Max users in the third quarter and fully integrated into the production fleet by the fourth quarter. This marks the world model's transition from a research concept to a core component of mass-market vehicle safety and performance, changing how automotive software is validated and deployed.
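The open-loop versus closed-loop distinction mentioned above can be made concrete with a toy lane-keeping example. This is a minimal sketch under assumed one-dimensional dynamics, not DrivingSphere or any Li Auto component; the controllers, gains, and logged offsets are all hypothetical.

```python
def expert(offset):
    """Hypothetical reference controller: steers firmly back toward the lane centre."""
    return -0.6 * offset

def weak_policy(offset):
    """Hypothetical learned policy that under-corrects."""
    return -0.05 * offset

def step(offset, action):
    """Assumed toy dynamics: lateral offset grows 10% per step unless corrected."""
    return 1.1 * offset + action

def open_loop_error(policy, logged_offsets):
    """Open loop: compare actions on logged states; the policy never changes the next state."""
    errors = [abs(policy(o) - expert(o)) for o in logged_offsets]
    return sum(errors) / len(errors)

def closed_loop_offset(policy, start, steps):
    """Closed loop: the policy's own actions drive the simulated state forward."""
    offset = start
    for _ in range(steps):
        offset = step(offset, policy(offset))
    return offset

if __name__ == "__main__":
    # Logs recorded under the expert stay near the lane centre (assumed values, in metres).
    logged = [0.02, -0.01, 0.03, -0.02, 0.01]
    print("open-loop action error:", open_loop_error(weak_policy, logged))
    print("closed-loop offset, weak policy:", closed_loop_offset(weak_policy, 0.1, 100))
    print("closed-loop offset, expert:", closed_loop_offset(expert, 0.1, 100))
```

On logged states near the lane centre the weak policy's action error is small, so it looks acceptable in open-loop evaluation. Only when its actions feed back into the simulated state does the accumulated drift become visible, which is the gap closed-loop simulation is meant to expose.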