Woofun AI reports that the community has challenged the validity of Sakana AI's benchmark results comparing Fugu Ultra with Anthropic's Fable 5. Critics argue that the self-reported scores come from non-uniform testing environments, where differences in execution scaffolds alone can shift scores by 10 to 20 points. On this view, the performance gap may stem from system engineering optimizations rather than from improvements in inherent model capability.
Independent evaluations indicate that the choice of agent scaffold significantly affects final scores. For instance, running the same Claude Opus 4.5 model with three different open-source scaffolds produced SWE-bench Pro fix rates ranging from 50.2% to 55.4%, a spread of 5.2 percentage points. A Scale AI analysis further confirms that operational choices, including prompt templates and tool integration, are sufficient to create a 10 to 20 point variance for identical model weights. Because both Sakana AI and Anthropic used proprietary, closed-source scaffolds without standardized third-party verification, their published figures cannot be taken as an accurate reflection of the underlying capabilities of the respective models.
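
The arithmetic behind this argument can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the only real inputs are the two fix-rate endpoints (50.2% and 55.4%) and the 10-point lower bound quoted above, and the two model scores used in the comparison are made-up placeholders, not reported results for Fugu Ultra or Fable 5.

```python
def scaffold_spread(scores):
    """Max-minus-min spread, in percentage points, of one model's scores across scaffolds."""
    return round(max(scores) - min(scores), 1)


def gap_exceeds_spread(score_a, score_b, spread):
    """True only if the gap between two models is larger than the scaffold-induced spread."""
    return abs(score_a - score_b) > spread


# Endpoints quoted in the text for a single model under different scaffolds.
# The third scaffold's score is not given, so only the extremes are used.
reported_range = [50.2, 55.4]
spread = scaffold_spread(reported_range)

# HYPOTHETICAL scores for two different models, chosen purely for illustration.
model_a, model_b = 60.0, 57.0

# A 3-point gap sits inside the 5.2-point scaffold spread, so it cannot be
# attributed to the models themselves without a shared scaffold.
small_gap_is_meaningful = gap_exceeds_spread(model_a, model_b, spread)

# Against the 10-point lower bound of the variance range quoted in the text,
# only a gap wider than 10 points would clear the bar.
large_gap_is_meaningful = gap_exceeds_spread(70.0, 57.0, 10.0)
```

Under these assumptions, a gap between two models says something about the models only when it is wider than the spread that scaffolding alone can produce, which is why a shared, third-party scaffold matters for the comparison.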