More agents, same results: Stanford, plus an honest scope note

Given the same compute budget, a single AI agent delivers results at least as good as those of a whole team of agents. Often better. A Stanford group demonstrated this in a controlled study in April.

The setup: three models (Qwen3, DeepSeek-R1, Gemini 2.5) on multi-hop reasoning under a unified thinking-token budget. At equal compute, the single agent wins. The reason is unspectacular: every handoff between agents loses context. A solo agent keeps everything; a team has to pass compressed summaries.
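The handoff argument can be made concrete with a toy calculation. This is a minimal sketch, not the study's method: the function names, the idea of a fixed `keep_ratio` per handoff, the even budget split, and all numbers are assumptions chosen purely for illustration.

```python
def retained_context(handoffs: int, keep_ratio: float) -> float:
    """Fraction of the original context that survives a chain of handoffs.

    Assumption (hypothetical): each handoff passes on a compressed summary
    that preserves a fixed fraction `keep_ratio` of what the sender knew.
    """
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be between 0 and 1")
    return keep_ratio ** handoffs


def tokens_per_agent(total_budget: int, agents: int) -> int:
    """Thinking tokens each agent gets when one shared budget is split evenly.

    Assumption (hypothetical): the budget is divided equally among agents.
    """
    if agents < 1:
        raise ValueError("need at least one agent")
    return total_budget // agents


if __name__ == "__main__":
    budget = 12_000  # made-up total thinking-token budget
    for agents in (1, 2, 4):
        handoffs = agents - 1  # a simple chain: each agent hands off once
        share = tokens_per_agent(budget, agents)
        kept = retained_context(handoffs, keep_ratio=0.8)
        print(f"{agents} agent(s): {share} tokens each, {kept:.0%} of context retained")
```

In this toy model the solo agent has zero handoffs and keeps the full budget and the full context, while every additional agent in the chain both shrinks the per-agent share and compounds the summary loss.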

The authors are equally clear about the limits of this claim. The study tests text-based reasoning only. Tool use, browser automation and deep research, which are precisely the workflows multi-agent is most commonly built for today, aren't covered. And the models tested are a mid-tier generation, two of the three open-weight, not Opus 4.7, GPT-5.4 or Gemini 3.

Anthropic, for its part, takes a more nuanced line: there are legitimate multi-agent use cases, such as parallel independent research, weak base models, or noisy input. But the default should be single-agent. It is the move to multi-agent that needs to be justified, not the decision to stay with one agent.

Before picking multi-agent as the solution, the simpler question should be answered first: what can be achieved with a single, well-configured agent? Usually far more than people assume.

Background: https://arxiv.org/abs/2604.02460