The River, and What It Cost

Last post I laid out the river: do the research once, then let cheap little factories downstream render it into many things. Clean on paper. This post is what happened when I built the top of it for real and let it run, because the failures taught me more than the design did.

I have a working version of the headwaters now, and the first stretch of river. It takes a question, or a pile of links I have been meaning to read, researches each one, and produces structured findings that get clustered and cross-linked against everything I already know. It runs unattended.

The pleasant surprise was how much of it already existed. I had a research engine, a way to dispatch parallel research agents, a deck builder, a text-to-speech setup, a way to check what my library already knows. The factories were mostly there, scattered across old experiments. What was missing was not the machines. It was the plumbing between them, the spine that turns “ask a question” into “structured finding in the river” as one clean flow. That is the part I built, and it is the part that matters.

But I want to be honest about the failures, because they taught me more than the successes did.

The cost is real, and I have not solved it. Each finding the factory produces costs somewhere around eighty cents. That sounds tiny until you point it at hundreds of sources. I tried the obvious cheap fixes: use a smaller model for the coordinator, merge two steps into one, trim the tools each agent can see. None of them moved the number. One of them probably made it worse. The honest status is that the per-finding cost is the binding constraint at scale and the easy levers are spent. The real fix is harder, and it is the next thing I owe this project.

The measurement broke before the thing it measured did. I score the quality of what the factory produces with a second model acting as a judge. The first time I ran it, the synthesis scored badly, and I almost wrote down “the synthesis step is weak.” It was not. The scorer had a bug: it was quietly dropping part of each finding before showing it to the judge, so the judge was marking real, well-supported content as invented. The first thing a score catches is usually the scorer itself. I had learned that before; here it was again, in a new place. The number was real. The story I almost attached to it was fiction.

The AI confidently invented connections that did not exist. When the synthesis step links a new finding to things I already know, it would sometimes name a prior piece of work that sounded exactly right and simply was not there. Out of thirteen connections it proposed in one run, six were wrong: a few were inventions, a few pointed at things that were real but had moved. The fix is the most important idea in the whole build. Before I trust any connection the AI claims, I check it against the actual map of what I know. If the thing it named is really there, the link stands. If it is not, the link is thrown out and logged, never silently kept. That one check is the difference between “the model named something from memory” and “this connection actually exists.” It is the same discipline as fact-checking a confident person: do not argue with their certainty, just go look.

The factory cannot touch the world. This was a design decision, not an afterthought. None of the research agents can run a shell command or write a file. They can read, they can think, they can hand back a finding, and that is all. The factory can run all night and the worst it can do is cost me money and produce a bad note. Nothing it does reaches out and changes anything until I say so. When you are going to leave a thing running unattended, the safest design is one where “unattended” and “harmless” are the same sentence by construction, not by hope.

The real point is not research

Step back from the plumbing and look at what the failures have in common. The cost I could not crush. The scorer that lied about its own work. The connections the model invented out of thin air. Three of those four were not the machine failing to produce something. They were the machine producing something confident and wrong, and me very nearly believing it. The fourth, refusing to let it touch anything while I am not watching, is the same worry written into the design ahead of time.

That is the thread running through this entire series, and I did not see it clearly until I wrote this part down. Sending a crowd instead of one agent. Checking the library before researching again. Distrusting a green test suite. Reading the papers instead of buying the tool. Building the river so that cheap, deterministic code does the work a model would have faked. Every one of those is the same move. The question was never “how do I get an answer.” A search box gives you answers. The question was “how do I know which answers to trust,” and the work was building that question into the machine so I do not have to remember to ask it.

So here is the honest version of what a research wing is. It is not a thing that answers questions. It is a thing that is honest with you about which of its answers you can believe, and cheap enough that you keep asking. The knowledge it piles up matters. The doubt it keeps about its own knowledge matters more.

You do not need my exact machine to get there. The next time an AI hands you something clean and confident, the useful question is not whether it sounds right. It is what in your setup would have caught it if it were wrong. If the honest answer is nothing, that is the thing to build next. The rest is just plumbing.