Sakana Fugu Benchmark Validity Challenged Due to 10-20 Point Scaffold Deviation
2026-06-26 17:33

Woofun AI reports that the community has challenged the validity of Sakana AI's Fugu Ultra benchmark results against Anthropic's Fable 5. Critics argue that the self-reported scores come from non-uniform testing environments, where differences in the execution scaffold alone can shift scores by 10 to 20 points. If so, the reported performance gap may stem from system engineering optimizations rather than from improvements in inherent model capability.

Independent evaluations indicate that the choice of agent scaffold significantly affects final scores. For instance, running the same Claude Opus 4.5 model under three different open-source scaffolds produced SWE-bench Pro resolve rates ranging from 50.2% to 55.4%, a spread of 5.2 percentage points. According to a Scale AI analysis, operational choices such as prompt templates and tool integration are enough to create a 10 to 20 point variance for identical model weights. Because both Sakana AI and Anthropic used proprietary, closed-source scaffolds without standardized third-party verification, critics contend that the published figures cannot be taken as an accurate measure of the underlying capabilities of the respective models.
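
The arithmetic behind this argument can be sketched in a few lines of Python. This is only an illustration: the function names and dictionary keys are hypothetical, and only the two extreme resolve rates reported above are used, since the third scaffold's score is not given. The 3-point and 12-point gaps in the usage lines are made-up inputs, not reported results.

```python
def scaffold_spread(scores):
    """Return the max-min spread, in percentage points, across scaffolds."""
    if not scores:
        raise ValueError("need at least one score")
    values = list(scores.values())
    return round(max(values) - min(values), 1)


def gap_exceeds_scaffold_noise(gap, spread):
    """True when a gap between two models is larger than the scaffold spread."""
    return gap > spread


# Only the lowest and highest reported resolve rates are known from the text.
# The keys are placeholder labels, not real scaffold names.
reported = {"lowest_scaffold": 50.2, "highest_scaffold": 55.4}

spread = scaffold_spread(reported)
print(spread)  # 5.2

# Hypothetical gaps, for illustration only.
print(gap_exceeds_scaffold_noise(3.0, spread))   # False
print(gap_exceeds_scaffold_noise(12.0, spread))  # True
```

Under this simple rule, a gap between two models is treated as meaningful only when it exceeds the spread that scaffold choice alone produces for a single model. A real comparison would also need repeated runs and a shared scaffold.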
