Tests-Green Is Not the Truth

In the first post I said the research wing prototypes ideas and then wraps real engineering standards around the ones that earn it. This is the standard that took me longest to learn, and the one I would now fight hardest to keep.

There is a specific feeling you get the first time an AI agent confidently hands you something broken. It tells you the job is done. It shows you the evidence: a full suite of tests, every one of them green. And the thing still does not work.

That happened to me enough times in the spring of 2026 that I had to sit with an uncomfortable idea. A green checkmark from the thing that did the work is not proof the work is right. It is proof the work passed the checks that same thing thought to run. Those are very different statements, and the gap between them is where most bad output hides.

Think about it the way you would think about a person. If I write an essay and then grade it myself, my A tells you almost nothing. Not because I am dishonest, but because I cannot see the flaw I could not see while writing it. Same blind spot, twice. An agent writing its own tests has exactly this problem. It tests for the cases it already thought of, which are precisely the cases it already handled. The bug lives in the case it never imagined, so it never wrote a test for it.

The fix is old, and it does not even come from software. It is separation of powers. Whoever does the work cannot be the only one who judges it. So I added a second agent whose entire job is adversarial. It does not try to confirm the work is good. It tries to break it. Different prompt, different incentive, fresh eyes, and a standing assumption that the first agent missed something.

The day this earned its keep is the one I still think about. One of our build pipelines produced a piece of work, ran its tests, and came back clean: seventy-six passing tests, a confident green. By the old rules I would have shipped it. The adversarial reviewer went in afterward, assuming guilt, and found a real bug that all seventy-six tests had sailed straight past. Not a style nitpick. A genuine defect the green suite was blind to, because the agent that wrote the tests shared the exact blind spot of the agent that wrote the code.

That is the whole lesson in one story. Tests-green is plausibility, not truth. Passing your own checks means you are probably not obviously wrong, which is worth something, but it is a long way from right. The only thing that reliably closes the gap is a second look from something whose job is to disagree with you.

I do not run the adversarial pass on everything, because it costs real time and money. The rule I use is about consequence. The more directly a piece of work is going to reach a human and be believed, the more it earns the hostile second look before I trust it. A throwaway internal note can stand on its own checks. Anything I am going to put in front of a person as true gets the second pair of eyes first. The stakes set the scrutiny.

And there is a humbling version of this that points right back at me. The most dangerous output is not the answer that is obviously wrong. You catch that one. It is the answer that is confident, well formatted, plausible, and agrees with what you already expected. That one slides right through. Building a step whose only job is to distrust the polished, agreeable answer is the best defense I have found against my own confirmation bias.

Up to here, everything I have described, the missions, the library, the adversarial check, I built by feel. Trial, error, and instinct. Then one night I went looking for better tools and ended up reading the actual research on how AI agents fail. It turned out the academics had been grading my homework the whole time. That is the next post.