# We Went Shopping for AI Research Tools. The Literature Told Us to Keep Our Wallet.

> I sat down to upgrade my research toolkit and the literature told me to keep my wallet. Where AI research agents actually fail, and why the leverage is in the workflow, not the tools.


Last night I sat down with my research agents to do something I'd been putting off: a proper audit of our research toolkit. What search APIs are we using, what's out there now, what should we adopt. Standard tool maintenance, the kind of thing every team that builds with AI should do quarterly and mostly doesn't.

The audit opened with an embarrassing discovery. Our primary search API had run out of credits at some point, and the CLI that wraps it was designed to skip failing sources quietly. So the whole fleet had been searching with one eye closed for who knows how long, and nothing anywhere said a word. Lesson one was free: silent fallbacks are silent outages. If a tool degrades, something should say so out loud.

But the real story is what happened after we fixed the plumbing. Because I went into the night assuming the upgrade we needed was better tools. The research said otherwise.

## What three months did to the tool landscape

If you haven't looked at the AI search and extraction market since the spring, it's worth a fresh look. The whole layer reorganized itself around agents as the customer.

Every serious search provider now sells two modes: a cheap fast one for breadth, and a deep one that does multi-step reasoning and returns structured output with citations. The price spread between modes is 3 to 10x. Extraction got smarter too. Instead of handing your model an entire scraped page, the new pattern is query-conditioned extraction: give me the parts of this page relevant to my question. Exa published numbers claiming 500 characters of query-relevant excerpts match the answer accuracy of 8,000 characters of full text. That's a 16x reduction in tokens your agent has to read.

And the deep research APIs (the ones that take a question and come back minutes later with a cited report) matured into real products. They're impressive. They're also billed like a slot machine. One of them consumes anywhere from 120 thousand to 500 thousand tokens for the same prompt on different runs, which the vendor has acknowledged. If you run agents at any volume, that's not a pricing model, that's weather.

So yes, we picked up real upgrades from the survey. New keys, a deep-search flag, a plan for query-aware fetching. But notice what kind of upgrades those are. They're maintenance. Hygiene. None of them change what the system can actually do.

## Then we read the papers, and the floor shifted

Here's the question I should have asked first: where do research agents actually fail?

A benchmark released in May called DeepWeb-Bench ([arXiv 2605.21482](https://arxiv.org/abs/2605.21482)) measured exactly that, and the answer inverts the intuition I'd been operating on. Retrieval failures (couldn't find the information) account for 12 to 14 percent of errors. Derivation and calibration failures (found the information, then reasoned incompletely over it, or answered confidently and wrong) account for over 70 percent.

A second paper from the same month, with the wonderful title "Cited but Not Verified" ([arXiv 2605.06635](https://arxiv.org/abs/2605.06635)), makes it sting more. Frontier models produce citations that are over 80 percent relevant but only 39 to 77 percent factually accurate. The cited page exists and is on topic; it just doesn't say what the model claims it says. And here's the part that should change how you build: citation accuracy drops about 42 percent as the agent's tool calls scale from 2 to 150. More searching makes the grounding worse, not better.

Think about what that means in human terms. We weren't dealing with a researcher who needed a better library card. We were dealing with a researcher who needed better reading discipline. You can't buy your way out of that with a search subscription.

## The part where the literature graded our homework

Here's the personally satisfying bit. Over the past year we built our research workflow by feel: one orchestrator agent that decomposes the question, parallel worker agents that investigate, an adversarial review pass that tries to break the conclusions before we trust them, and a grounding filter that rejects any claim that can't point to its source.

Every one of those choices now has a number attached.

Anthropic published the architecture evidence from their own production system: an orchestrator with parallel workers beat a single agent by 90.2 percent on their internal research evals ([their engineering writeup](https://www.anthropic.com/engineering/multi-agent-research-system)). A Google and MIT team went further and tested 260 agent-system configurations to extract scaling principles ([arXiv 2512.08296](https://arxiv.org/abs/2512.08296)). Centralized orchestration improved parallelizable tasks by 80.9 percent. And the finding I keep coming back to: independent parallel agents without a validator amplify each other's errors 17.2x, while adding a central validating step cuts that to 4.4x. That adversarial review pass we added because it felt rigorous? It was the error-containment mechanism the whole time.

Even the grounding filter got validated. Given that hallucinated citations are baseline model behavior (not a rare glitch, the default), a pipeline that re-checks every claim against its source isn't paranoia. It's table stakes. Ours once rejected nearly half of a batch of generated claims as ungrounded, and I now read that as the system working exactly as the literature says it must.

The same Google paper carries a warning for the multi-agent true believers, though: on tasks with strong sequential dependencies, where each finding reshapes the next question, every multi-agent configuration they tested degraded performance by 39 to 70 percent. Parallelism is a property of the problem, not a virtue of the architecture. Some research should still be one agent thinking for a long time.

## What we actually changed

So after a night that started as a shopping trip, here's what we adopted. Four changes, all prompt-level, all free.

First, decompose before dispatching. The orchestrator breaks the question into checkable sub-questions and keeps an explicit map of what's found and what's missing. Follow-up agents get dispatched against gaps, never as a blanket re-sweep. A system called Argus showed this evidence-map pattern scaling to 86.2 on BrowseComp, one of the harder research benchmarks ([arXiv 2605.16217](https://arxiv.org/abs/2605.16217)).

Second, aggregate before reasoning. When parallel workers return, run an explicit consolidation step (what do we have, what's still open) before the next move. An ICLR paper ablated exactly this and found the aggregation step, not the parallelism, was where the gains came from ([arXiv 2508.19113](https://arxiv.org/abs/2508.19113)).

Third, fact-check citations separately from logic. Reviewing the argument and verifying that cited sources actually say what's claimed are different jobs. Given the 39-to-77-percent number above, the second job gets its own dedicated pass now.

Fourth, make agents declare what they already know before they search. One line in the prompt: state what you know about this sub-question and why it's insufficient, and stop searching when the sub-questions are resolved, not when the turns run out. A whole lineage of papers trained models to learn this behavior with reinforcement learning ([arXiv 2505.17005](https://arxiv.org/abs/2505.17005), [arXiv 2605.29796](https://arxiv.org/abs/2605.29796)); the behavior itself is promptable.

How do we really get value out of AI research agents, then? Not mainly by buying better search. The tools matter, keep them sharp, top up your credits (ahem). But the leverage lives in the workflow: how the question gets decomposed, when the system stops searching, and who checks the work before you believe it.

Tooling is hygiene. Workflow is leverage. The best upgrade we made all night cost zero dollars, and we already owned it.

---

*Papers worth your time if you build research agents: DeepWeb-Bench ([2605.21482](https://arxiv.org/abs/2605.21482)) on where agents actually fail, "Cited but Not Verified" ([2605.06635](https://arxiv.org/abs/2605.06635)) on the citation problem, the Google/MIT scaling study ([2512.08296](https://arxiv.org/abs/2512.08296)) on when multi-agent helps and when it hurts, and Anthropic's [multi-agent research system writeup](https://www.anthropic.com/engineering/multi-agent-research-system) for the production view.*