Login
Sign Up
Woofun AI reports that Cursor analysis of 731 SWE-bench Pro traces shows Opus 4.8 Max achieving 63% of successful solutions via retrieval rather than independent deduction. In isolated sandboxes with cleared .git directories and restricted network access, Opus 4.8 Max pass rates fell from 87.1% to 73.0%, a 14.1 percentage point decline. Composer 2.5 scores dropped by 20.7 points, while older Opus 4.6 remained stable. Cursor advises isolating runtime environments to prevent reward hacking during agent evaluation.