Researchers at the Beijing Academy of Artificial Intelligence introduced AutoResearchBench, a 1,000-instance benchmark that tests AI agents against a controlled corpus of over 3 million full-text scientific papers. The benchmark defines two task types: Deep Research (locating a specific target paper through multi-step investigation) and Wide Research (comprehensively collecting all papers that match given criteria). Although frontier LLMs have largely saturated general web-browsing benchmarks like BrowseComp, they reach only 9.39% accuracy on Deep Research and 9.31% intersection-over-union on Wide Research, with many baselines below 5%. The results quantify a substantial gap between general agentic search capability and the fine-grained comprehension, evidence verification, and search-termination reasoning that genuine scientific literature work requires.
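For readers unfamiliar with the Wide Research metric, intersection-over-union on a collection task is typically computed over the set of papers an agent returns versus the gold set. A minimal sketch, assuming set-level scoring over paper identifiers (the function name and IDs are illustrative, not from the benchmark):

```python
def wide_research_iou(retrieved, gold):
    """Set-level intersection-over-union between retrieved and gold paper IDs.

    1.0 means the agent returned exactly the gold set; extra or missing
    papers both lower the score, so over-collection is penalized too.
    """
    retrieved, gold = set(retrieved), set(gold)
    if not retrieved and not gold:
        return 1.0  # both empty: trivially perfect agreement
    return len(retrieved & gold) / len(retrieved | gold)

# Example: 2 of 3 retrieved papers are correct, 1 gold paper is missed.
print(wide_research_iou(["p1", "p2", "p4"], ["p1", "p2", "p3"]))  # → 0.5
```

Under this scoring, a 9.31% average means agents' collected sets overlap only marginally with the gold sets, which is consistent with the paper's framing of Wide Research as a recall-and-precision problem rather than a single-answer lookup.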