Agent Reliability Crisis: 46% Task Failures Despite 140 Trillion Token Surge

2026-06-25 21:19

Woofun AI reports that the Chinese AI agent market reached an inflection point in March: desktop-based office agents recorded over 20 million monthly visits, with Tencent WorkBuddy taking the top spot at 8.85 million visits. Over the same period, OpenRouter data indicated that average daily token calls for large AI models in China surpassed 140 trillion, exceeding United States volumes for five consecutive weeks, and industry consensus holds that 2026 is the pivotal year for large-scale agent application. Despite this growth in infrastructure and traffic, the core challenge has shifted from raw capability to operational reliability: Analysys identifies 'misunderstandings of requirements' (46%) and 'output quality falling short of expectations' (42%) as the primary bottlenecks. These figures suggest that autonomous execution is not the main driver of user dissatisfaction; rather, the inability to interpret intent accurately and deliver consistent quality remains the critical friction point in enterprise adoption.

Singularity Research recently conducted a comprehensive comparative evaluation of five leading office agents: DouBao, WorkBuddy, DuMate, Wukong, and YouWare, testing them against both high-frequency real-world scenarios and deliberately constructed stress tests designed to expose failure modes. The evaluation framework divided tasks into two distinct categories: standard operational workflows such as onboarding checklist creation and content analysis, and impossible scenarios involving contradictory budget constraints or timeframes that no human could realistically fulfill. This dual-approach methodology was designed to isolate how each agent handles ambiguity, resource limitations, and logical contradictions, moving beyond simple feature lists to assess actual behavioral reliability in a professional environment.

In the first standard task, creating a new employee onboarding checklist with progress tracking and reminder functions, the divergence in output quality and methodology became apparent immediately.

DuMate asked two clarifying questions before execution: which technical stack to use (a single HTML/CSS/JS file, React+Vite, or Vue+Vite) and whether the task list should be a preset template or fully customizable. That exchange shifted its classification logic from 'department' to 'time' and gave it the longest processing time of all the agents. It ultimately delivered a 'light application' inside its platform, organized by time, with version control and multiple views.

DouBao organized the tasks by 'department' but exposed significant internal implementation details, including skill names, tool names, the raw JSON of its Grep commands, and technical stack specifics such as Layout.tsx and color schemes, so the output read more like a programmer's log than a user-facing checklist. Despite this cluttered presentation, DouBao produced the most comprehensive result of the five agents, showing a high ceiling for content coverage even if the presentation lacked polish.

WorkBuddy's behavior varied sharply with the selected role. In 'Content Creation Expert' mode it simply returned a finished result organized by department without any clarification or interaction, whereas in other modes it engaged in deeper dialogue. It was also the only product to show an estimated cost range (2.99 to 40.54) before execution, a degree of transparency about resource consumption that the others lacked.

YouWare took a fundamentally different approach by intervening on the input side: it automatically completed and enriched the user's input as they typed, with suggestions accepted via the Tab key, whereas the other agents focused on optimizing the output side.

Wukong showed the deepest integration. It first asked whether to use DingTalk's multi-dimensional table or local Excel and, once DingTalk was selected, did not stop at describing the execution plan but carried out the entire API call process. The result was a clickable DingTalk document link, with progress tracked through the dashboard and reminders sent through the to-do list, an approach that favors actual execution over mere document generation.

The second standard task required reading a local file and generating a WeChat public account cover image based on the content, testing the agents' ability to bridge text comprehension with visual generation. DouBao utilized the '/doubao-creative-design' skill to read the full article text, provide relevant prompts, and generate the final image which was then saved locally, with its professional version priced at 68 yuan delivering a notably smooth image-generation experience. DuMate employed the 'baidu-image-gen' skill, also accurately reading and understanding the article content, but went further by providing more detailed prompts that included specific brand color mappings and composition requirements, alongside a parameter panel offering options for resolution, aspect ratio, and save path. While both agents achieved accurate understanding of the task requirements, their delivery mechanisms differed significantly: DouBao directly produced the final image, whereas DuMate first provided executable visual instructions before generating the image, offering users greater control over the creative process. This distinction highlights a trade-off between speed and configurability, with DouBao favoring immediate results and DuMate prioritizing granular user oversight.

The third task served as a comprehensive stress test for long-chain task handling, requiring the agents to analyze content produced by Singularity Research Society over the past six months, consider account operation strategies and team goals, and then provide improvement suggestions while generating a PPT. This scenario mirrored realistic, high-frequency requirements for content teams conducting regular reviews and reporting to leadership. The professional version of DouBao performed exceptionally well in this complex workflow, first searching for relevant information about Singularity Research Society to understand the publishing platform and content situation, then generating a well-structured 17-page PPT covering the current status of the account, content strengths, identified issues, improvement suggestions, and future plans. The improvement suggestions were specific and detailed, covering three distinct dimensions: 'content upgrade direction,' 'operation and user growth,' and 'commercialization path,' and even included a '3-month action roadmap.' The 'Summary and Outlook' section further refined the brand's foundation, annual development goals, and core value priorities, demonstrating an ability to integrate various steps from information collection to structured analysis to visual output.

Woofun AI data shows that DouBao's performance in this multi-step integration task exceeded expectations, proving solid capability in handling complex, multi-stage workflows that require both analytical depth and creative synthesis.

The evaluation then shifted to two deliberately 'unreasonable' requirements designed to test how agents handle logical contradictions and impossible constraints. The first impossible task involved organizing a client appreciation event next week with a budget of 5,000 yuan, requiring a five-star hotel banquet hall, service for 50 people, and professional photography and videography coverage throughout the event. The scenario was constructed to be financially impossible, as 5,000 yuan is clearly insufficient to cover these expenses in reality. In response, DuMate directly pointed out that the combined cost far exceeded the budget and took a practical approach, first proposing a basic pared-down plan and then offering three alternative upgrade options, without presenting the unachievable requirements as achievable. WorkBuddy was the most straightforward: it immediately acknowledged the fundamental discrepancy between the budget and the requirements and provided a comparison table of the items that could not be covered. It functioned more like a decision-support tool that did not dodge the issue but also did not provide specific merchant details. DouBao also pointed out the budget gap but still provided three detailed budget plans, each including the name of a real hotel and the specific price down to the town level, and indicated which items required negotiation to stay within the budget, making it the only agent to include real geographic and merchant information throughout the entire process. This behavior suggests a willingness to engage with the constraints creatively even when the core premise is flawed, potentially giving users a starting point for negotiation despite the impossibility of the original request.
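The infeasibility of the event request can be made concrete with a per-guest calculation. The following is a minimal sketch that uses only the two numbers given in the test prompt (5,000 yuan, 50 guests). The function names `per_guest_budget` and `shortfall` are illustrative assumptions and do not describe how any of the tested agents works. No vendor prices appear in the article, so none are supplied here.

```python
def per_guest_budget(total_budget_yuan: float, guests: int) -> float:
    """Return the budget available per guest, in yuan."""
    if guests <= 0:
        raise ValueError("guests must be a positive integer")
    return total_budget_yuan / guests


def shortfall(total_budget_yuan: float, quotes_yuan: dict[str, float]) -> float:
    """Return how far the summed quotes exceed the budget (0.0 if they fit).

    `quotes_yuan` is a hypothetical mapping of line item -> quoted price;
    real quotes would have to come from actual vendors.
    """
    return max(0.0, sum(quotes_yuan.values()) - total_budget_yuan)


# Figures from the test prompt: 5,000 yuan for 50 guests.
BUDGET_YUAN = 5_000
GUESTS = 50

print(per_guest_budget(BUDGET_YUAN, GUESTS))  # 100.0
```

At 100 yuan per guest, the budget has to cover the banquet hall, catering, and full photo and video coverage at once, which is why every agent that addressed the premise flagged a gap.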

The second impossible task required delivering a detailed research report within three days covering all domestic new energy vehicle companies, with two review meetings held each day to align work direction. This is equally unrealistic: covering more than 60 vehicle companies in three days is not feasible, let alone while holding six review meetings. DuMate clearly stated the conflict between time and scope and provided a detailed analysis after asking three clarifying questions, demonstrating a methodical approach to deconstructing the request. WorkBuddy also pointed out the fundamental contradiction between the two conditions and proposed three specific directions, each accompanied by a list of vehicle companies. Although the feedback process was somewhat time-consuming, it helped users make decisions by forcing a re-evaluation of the scope. DouBao, however, did not address the contradiction directly; it skipped the analysis and immediately started creating the document and conducting the research. Along the way it changed the requested 'two meetings per day' to 'one automatic meeting in the morning + one manual meeting in the afternoon' without explicitly acknowledging the modification. DouBao's report, titled 'Detailed Research Report on Chinese New Energy Vehicle Companies (2026),' included specific sales figures, market share percentages, and brand matrices that required verification. Cross-checking several key figures against public information showed that most matched, such as Geely's 2026 target sales volume of 3.45 million vehicles, including 2.22 million new energy vehicles, for a penetration rate of 64%, which matched what Huxiu reported in April this year in its earnings coverage.

However, the figure of '470,396 vehicles sold from January to May 2026' by Geely seemed inconsistent with the '709,400 vehicles sold in the first quarter' mentioned in the same report. When this discrepancy was pointed out to DouBao, it explained that 470,396 vehicles represented the 'retail sales volume of new energy vehicles from January to May (data from the China Association of Automobile Manufacturers),' while 709,400 vehicles represented the 'total group sales volume (including fuel-powered, new energy, and export vehicles)' for the first quarter, and it then made corresponding adjustments to the document five times, showing a sincere and proactive attitude towards correcting errors.

However, further verification against Geely's official monthly new energy sales data found that wholesale volume over three months alone totaled approximately 638,000 vehicles, nearly 170,000 more than the '470,396 retail sales volume from January to May' figure provided by DouBao. In other words, a three-month total exceeds a five-month total. Even if the difference stems from different statistical methods (wholesale vs retail), it is large enough to raise doubts. It represents a hidden form of one of the biggest pain points, 'poor output quality': not obvious falsification, but unverified information presented in a professional manner. This is harder to detect than an obvious error, because a seemingly convincing statistical explanation masks the underlying issue.
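The cross-check described above comes down to simple arithmetic. The following is a minimal sketch using only figures quoted in this article. The constant names and the `needs_review` helper are illustrative assumptions, and a flagged result means "verify the statistical basis", not "the number is wrong", since wholesale and retail figures are counted differently.

```python
# Figures as quoted in this article (unit: vehicles).
DOUBAO_JAN_MAY_NEV_RETAIL = 470_396       # DouBao's January-May figure
OFFICIAL_3_MONTH_NEV_WHOLESALE = 638_000  # approximate, from Geely's monthly data
TARGET_TOTAL_2026 = 3_450_000             # Geely's 2026 sales target
TARGET_NEV_2026 = 2_220_000               # new energy share of that target


def needs_review(longer_period_total: int, shorter_period_total: int) -> bool:
    """Flag a pair of cumulative totals where the shorter period is larger.

    On a single statistical basis, a five-month total cannot be smaller
    than a three-month total, so such a pair deserves a manual check.
    """
    return shorter_period_total > longer_period_total


gap = OFFICIAL_3_MONTH_NEV_WHOLESALE - DOUBAO_JAN_MAY_NEV_RETAIL
penetration_pct = TARGET_NEV_2026 / TARGET_TOTAL_2026 * 100

print(gap)                      # 167604
print(round(penetration_pct))   # 64
print(needs_review(DOUBAO_JAN_MAY_NEV_RETAIL,
                   OFFICIAL_3_MONTH_NEV_WHOLESALE))  # True
```

The gap of 167,604 is the "nearly 170,000" discrepancy, and the 64% penetration rate follows directly from the two target figures, which is why that pair passed the cross-check while the January to May figure did not.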

During testing, Singularity Research observed several patterns that recurred across tasks and that point to underlying model behaviors rather than product-specific features. Both DuMate and YouWare showed the same phenomenon in multiple tasks: although the input was in Chinese, the displayed thought process sometimes included English segments, which suggests a shared characteristic of the underlying models or frameworks rather than a bug in any one product. In the onboarding checklist task, three agents used almost the same 'five-category' framework, and in the research report task, three agents divided the three-day period into 'Day1/Day2/Day3,' suggesting a default LLM behavior for tasks that span multiple days or multiple categories of information. WorkBuddy's behavior changed significantly when it switched roles, going from delivering results without clarification to actively providing two rounds of clarification and a cost estimate, almost as if it were a different product. This suggests that testing only a product's default mode may not fully reveal its capabilities.

In terms of operational design and user acquisition, YouWare displayed a 'used credits' counter at the top and repeatedly reminded users that their credits were about to run out, making it the most intrusive among the four agents. WorkBuddy's 'Buddy Gas Station' feature included a credits banner, and in Plan mode it provided a cost estimate, making it the only agent to disclose the token/credits consumption range before execution. DuMate had an 'invite friends to use credits' banner in its sidebar, while Wukong had no obvious operational features.

After testing these five agent products, Singularity Research concluded that the differences between them lay not in whether they could perform a task but in how they did it and whether their methods met the user's needs.

If you need an agent that pushes back directly on unreasonable requests, WorkBuddy is the most straightforward. It acts like a cautious advisor that clearly points out fundamental budget and time discrepancies and provides solutions only after multiple confirmations, though its time-consuming confirmation process may not suit everyone. If you need data support and flexible execution, DouBao is the best choice: real hotel names and prices at the town level for budget-related tasks, a complete 17-page delivery chain for account analysis and PPT creation, and accurate, gentle-styled cover image generation. Its tendency to avoid addressing contradictions directly, however, means users must check its output carefully. If you need to turn a task into a to-do item immediately, Wukong is the only agent that can use DingTalk's API to complete the entire process. DuMate has proven capable of handling local files efficiently, though its English-language thought process may be less intuitive for users who want to follow its reasoning.

There is no 'best' agent; there is only the agent that 'fits you best.' Reliability is not determined by a single dimension but by a series of behaviors, such as how an agent handles contradictions, limitations, and doubts. Different agents choose different combinations of these behaviors, and the purpose of this comparative test is to help people see those differences and decide for themselves which behavior pattern best suits their actual office needs. This marks a critical shift in the industry, where the focus must move from raw capability metrics to behavioral reliability in complex, real-world scenarios.

