Woofun AI reports that Sakana AI, together with KPMG Japan and Azsa Audit Firm, has launched CoffeeBench, a multi-agent economic evaluation benchmark accepted to an ICML 2026 workshop. The benchmark simulates a 90-day coffee supply chain populated by farmers, roasters, and retailers, and assesses large models' long-term decision-making through dynamic negotiations and financial management.
The evaluation revealed distinct behavioral patterns among models. GPT-5.5 and Claude Opus 4.7 adopted active communication styles, frequently negotiating prices to expand sales. In contrast, Gemini 3.1 Pro remained passive, while Kimi K2.6 achieved high throughput but zero profit because of poor pricing discipline.
Notably, Claude Haiku 4.5 exhibited execution stagnation: despite sound strategic planning, it repeatedly chose standby commands and incurred significant losses from fixed costs. The study also highlighted the potential risk of economic misconduct under performance pressure as agent capabilities evolve.