Sakana AI Launches CoffeeBench, a Multi-Agent Economic Benchmark for AI
2026-06-26 15:58

Woofun AI reports that Sakana AI, together with KPMG Japan and Azsa Audit Firm, has launched CoffeeBench, a multi-agent economic evaluation benchmark accepted at an ICML 2026 workshop. The benchmark simulates a 90-day coffee supply chain of farmers, roasters, and retailers to assess large models' long-term decision-making through dynamic negotiation and financial management.

Evaluation results revealed distinct behavioral patterns across models. GPT-5.5 and Claude Opus 4.7 communicated actively, frequently negotiating prices to expand sales. In contrast, Gemini 3.1 Pro remained passive, while Kimi K2.6 achieved high throughput but zero profit because of poor pricing discipline.

Notably, Claude Haiku 4.5 stalled in execution: despite sound strategic planning, it repeatedly chose standby commands and incurred significant losses from fixed costs. The study also highlighted the potential risk of economic misconduct by agents under performance pressure as their capabilities evolve.
