Woofun AI reports that a research team led by Zhou Bowen, a professor at Tsinghua University and Director of the Shanghai Artificial Intelligence Laboratory, has unveiled NatureBench, a benchmark designed to test whether AI agents can independently improve on the methods in top-tier scientific papers. The initiative addresses a gap in current evaluation systems, which have mostly measured how well agents reproduce existing methods or optimize engineering pipelines in controlled settings such as Kaggle competitions, rather than whether they can surpass state-of-the-art (SOTA) results in the context of real scientific inquiry. The framework turns published research into executable, containerized tasks so that improvement over the original paper, not just replication, can be measured. The study, detailed in a paper available on arXiv, sets out a systematic approach to evaluating how well AI coding agents reproduce and improve on the core experiments of papers from Nature-style journals.
The core architecture of NatureBench relies on a pipeline called NatureGym, which standardizes diverse scientific papers into a uniform, reproducible task format while implementing an information firewall to prevent agents from simply copying original methods. This system converts approximately 5,500 papers published between 2022 and 2025 from ten Nature-style journals into machine-executable challenges, filtering out non-research articles such as news, editorials, and reviews to ensure only valid machine learning tasks with publicly available data under 50GB are included. The process involves three distinct phases: selecting papers, obtaining code and data to define a starting point where agents access only core algorithm inputs without intermediate results, and packaging these into standard task bundles that undergo 36 automated checks. Of the initial pool, approximately 160 task bundles advanced to a calibration phase involving two rounds of quality assurance, including a Base mode to verify task definitions and a Reproduce mode to confirm that agents could replicate original methods when granted access to the source material.
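The selection criteria described above can be sketched as a simple filter. This is an illustrative sketch only, not the NatureGym implementation: the `Paper` record, its field names, and the `is_eligible` helper are hypothetical, and only the excluded article types (news, editorials, reviews) and the 50GB data limit come from the description above.

```python
from dataclasses import dataclass


# Hypothetical record; the field names are assumptions for illustration.
@dataclass
class Paper:
    title: str
    article_type: str    # e.g. "research", "news", "editorial", "review"
    is_ml_task: bool     # core experiment can be framed as an ML task
    data_public: bool    # code and data are publicly available
    data_size_gb: float


EXCLUDED_TYPES = {"news", "editorial", "review"}
MAX_DATA_GB = 50.0  # size limit stated in the article


def is_eligible(paper: Paper) -> bool:
    """Return True if the paper passes the selection criteria sketched above."""
    return (
        paper.article_type not in EXCLUDED_TYPES
        and paper.is_ml_task
        and paper.data_public
        and paper.data_size_gb < MAX_DATA_GB
    )


# Made-up candidates, not papers from the benchmark.
candidates = [
    Paper("A", "research", True, True, 12.0),
    Paper("B", "review", True, True, 1.0),
    Paper("C", "research", True, True, 80.0),
    Paper("D", "research", False, True, 3.0),
]
selected = [p.title for p in candidates if is_eligible(p)]
print(selected)  # ['A']
```

In the pipeline as described, a filter of this kind would only be the first of three phases; the later packaging step and its 36 automated checks are not modeled here.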
The final benchmark consists of 90 distinct tasks and 333 evaluation instances spanning six major research fields, with 81 key indicators in total. To allow consistent comparison across these varied tasks, the team defined a normalized relative difference measure, denoted g: a value of g ≥ 0 indicates that an agent's performance has reached or exceeded the SOTA level of the original paper, and a value of g > 0.1 signifies a clear improvement. Agents were given a strict 4-hour window for each task, with multiple submissions permitted so they could iterate on feedback. A post-analysis phase using Claude Sonnet 4.6 was then applied to detect and eliminate fabricated outputs, cheating via feedback mechanisms, and other fraudulent behaviors. The setup is intended to ensure that the evaluation reflects genuine problem-solving rather than exploitation of system loopholes.
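The two thresholds on g can be turned into a small classifier. The sketch below assumes that Match-SOTA corresponds to g ≥ 0 and Surpass-SOTA to g > 0.1, so that every surpassing run also counts as matching. The exact formula for g is not given here, so the code takes g as an input, and the function names and example values are hypothetical.

```python
MATCH_THRESHOLD = 0.0    # g >= 0: reached or exceeded the paper's SOTA
SURPASS_THRESHOLD = 0.1  # g > 0.1: clear improvement


def classify(g: float) -> str:
    """Label a single g value using the two thresholds from the article."""
    if g > SURPASS_THRESHOLD:
        return "surpass"
    if g >= MATCH_THRESHOLD:
        return "match"
    return "below"


def rates(g_values):
    """Return (match_rate, surpass_rate); surpassing runs also count as matching."""
    if not g_values:
        raise ValueError("g_values must not be empty")
    n = len(g_values)
    match = sum(1 for g in g_values if g >= MATCH_THRESHOLD)
    surpass = sum(1 for g in g_values if g > SURPASS_THRESHOLD)
    return match / n, surpass / n


# Hypothetical g values, not results from the paper.
example = [-0.30, -0.05, 0.0, 0.04, 0.25]
labels = [classify(g) for g in example]
print(labels)          # ['below', 'below', 'match', 'match', 'surpass']
print(rates(example))  # (0.6, 0.2)
```

Note that g = 0.1 exactly is labeled "match" rather than "surpass" here, because the article states the improvement threshold as a strict inequality.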
Woofun AI data shows that the evaluation covered 10 different agent configurations across three primary frameworks: Claude Code, Codex CLI, and Gemini CLI, with all agents having web search disabled to prevent direct access to original papers or datasets. Among these configurations, Claude Opus 4.7 paired with Claude Code emerged as the top performer, achieving Surpass-SOTA results in 17.8% of tasks and reaching or matching SOTA levels in 47.8% of tasks. In terms of submission quality, the two Claude Opus configurations demonstrated exceptional stability, with both Completion Rate and Score Rate reaching 100%, indicating zero invalid submissions. In contrast, GPT-5.5 recorded a Score Rate of 98.9% and a Completion Rate of 84.4%, yet 13 of its submissions were subsequently deemed invalid shortcuts by the evaluators, highlighting a divergence in reliability between the leading models. These findings suggest that while current AI agents can approach or even surpass original paper results in specific instances, their ability to do so consistently across a broad spectrum of scientific challenges remains significantly limited.
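As a rough sanity check, the reported percentages can be converted back into whole-task counts. The sketch below assumes each rate is computed over the 90 benchmark tasks; that denominator and the `implied_count` helper are assumptions, and the resulting counts are back-of-envelope values rather than figures reported by the team.

```python
N_TASKS = 90  # benchmark size stated in the article


def implied_count(rate_percent: float, total: int = N_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(rate_percent / 100 * total)


surpass_tasks = implied_count(17.8)  # Claude Opus 4.7 + Claude Code, Surpass-SOTA
match_tasks = implied_count(47.8)    # same configuration, Match-SOTA
gpt_scored = implied_count(98.9)     # GPT-5.5 Score Rate
gpt_completed = implied_count(84.4)  # GPT-5.5 Completion Rate

print(surpass_tasks, match_tasks, gpt_scored, gpt_completed)  # 16 43 89 76
```

Under this reading, the top configuration surpassed SOTA on about 16 tasks and matched it on about 43. The 89 scored GPT-5.5 tasks minus the 13 invalidated submissions also equals 76, which coincides with the count implied by its Completion Rate. That is an arithmetic observation that assumes one invalidated submission per task, not a relationship stated by the source.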
A breakdown by discipline reveals distinct performance disparities, with relational reasoning showing the highest Match-SOTA rate at 60.0%. Protein biology and transcriptomics followed at 37.5% and 35.5% respectively, while physical modeling, molecular design, and biomedical modeling trailed at 26.9%, 18.2%, and 17.9%. Cross-disciplinary tasks also proved harder: the Match-SOTA rate was 33.1% for the 75 single-discipline tasks compared with 28.0% for the 15 cross-disciplinary tasks. Correspondingly, the median values of g for single-discipline and cross-disciplinary tasks were -0.13 and -0.21 respectively, pointing to lower success rates when agents must integrate knowledge across multiple domains.
The team provided path annotations for 900 runs to dissect the underlying causes of success and failure. In runs achieving Match-SOTA results, methods such as supervised agent prediction, search parameter tuning, engineering pipelines, and pretraining or extension accounted for 82.7% of the successful strategies.
Conversely, in runs that failed to achieve Match-SOTA results or lacked valid scores, the primary causes were attributed to methodological issues or execution problems, accounting for 61.1% and 28.7% of failures respectively. Incorrect method selection emerged as the most frequent cause of failure, followed closely by insufficient budget or time constraints, suggesting that strategic decision-making and resource management are critical bottlenecks for current AI agents in scientific research. The team acknowledged several inherent limitations of NatureBench, noting that it currently covers only core quantitative tasks abstractable into machine learning problems with automated scoring, excluding experiments requiring wet laboratory validation, pure theoretical derivations, hardware interactions, or evaluations dependent on human judgment.
Furthermore, the benchmark often isolates a single core experiment rather than evaluating the entire paper, meaning it measures performance on specific tasks rather than providing a holistic assessment of a paper's overall contribution. The uniform 4-hour time limit and single-card hardware setup may also constrain the completion of certain complex tasks, where failures often stem from inappropriate method selection or insufficient execution depth linked to computing resource limitations.
Despite measures taken to prevent speculative submissions using publicly available papers and data, the team noted a residual risk of data leakage that could influence results.
Additionally, the interpretation of the g value itself faces limitations: when the SOTA level of an original paper is near the upper limit of a relevant indicator, even minor performance differences can be exaggerated into large negative values, and a single primary indicator may fail to capture the full scope of multi-objective evaluations presented in the original research. Consequently, the team emphasized that future assessments should prioritize metrics such as Surpass-SOTA, Match-SOTA, and median performance rather than relying solely on average scores to accurately reflect agent capabilities.
Looking ahead, the researchers identified several areas for expansion, including broadening the task range to encompass comprehensive reproductions of entire papers and developing more detailed resource budgets that account for varying time durations and hardware configurations. They also called for evaluation criteria that better distinguish between failures caused by misunderstandings, incorrect method selection, insufficient execution, or resource constraints, and for a wider array of experimental examples and indicators to mirror the complexity of real-world scientific research. The work marks a step toward defining the boundaries of AI autonomy in scientific discovery, though the gap between current capabilities and consistent SOTA improvement remains substantial.
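The team's preference for medians over averages when summarizing g can be illustrated with a few made-up numbers. In this sketch the g values are hypothetical: four runs land close to SOTA and one receives a large negative value of the kind described for indicators near their upper limit.

```python
from statistics import mean, median

# Hypothetical g values: four runs near SOTA, one heavily penalized run.
g_values = [0.05, -0.02, -0.10, 0.12, -3.0]

avg = mean(g_values)
mid = median(g_values)
print(f"mean={avg:.2f}, median={mid:.2f}")  # mean=-0.59, median=-0.02
```

The single outlier drags the mean far below what most runs achieved, while the median stays near the typical run, which is why a median or a threshold-based rate such as Match-SOTA gives a steadier summary in this situation.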