# Intent Engineering, Part 3: Grade the Grader

> Can you trust a model to grade a model? I built the judge anyway, and it lied to me three times with total confidence. What catching it taught me about measuring agents.


In the last post I showed that an agent's identity actually sticks when you encode it as structure, a required section in the output, rather than a behavior you hope the model performs. I measured that by having a second model read the agent's transcript and score whether it did the things its identity says it does.

You may have already spotted the problem. Can you trust a model to grade a model? I didn't think so. Then I built it anyway, because it was the only way to get numbers at any useful scale. And it lied to me. Three times, with total confidence, in three different ways. This post is about why I kept it anyway, and what catching it taught me about measuring anything an agent does.

## The twist: the judge was wrong first

Months ago, when I first sketched how I'd measure intent, I wrote down one thing I would not do: use one model to judge another model's alignment. It felt circular. A model grading a model, who grades the grader, turtles all the way down. Human judgment was supposed to be the ground truth.

Then I went and built exactly the thing I'd ruled out. And the very first verdict it handed me was wrong. It told me the agent hadn't challenged its conclusion, score zero, when the agent absolutely had. The prosecution was right there in the transcript. I read it with my own eyes.

The judge wasn't wrong because it was a model judging a model. It was wrong for a far dumber reason. It was only being shown the first chunk of a long transcript, and the part it was grading lived at the end, past the cutoff. It was handed a paper with the last page torn out and asked why the conclusion was missing. Once I showed it the whole thing, its scores weren't just correct, they were sharp: a zero for the bare model with no doctrine, a near-perfect for the well-built one, exactly as it should be.

The lesson isn't "model judges don't work." It's more useful than that. Validate what the judge can see, not just the judge. The model was fine. What it could see was broken, and a confident wrong number hid the problem until I went back and read what the judge had actually read.

## Same disease, two more times

I'm glad I internalized it, because it happened again. Twice.

Once, a synthesis score cratered to zero and I nearly wrote down "the agent stopped synthesizing." The real cause was that my test rig had forgotten to give the agent the search tools its identity assumes it has. So it spent the whole run getting denied, fell back on what it remembered, and turned in a thinner answer. The agent was fine. My harness had tied its hands and then I'd graded it for having tied hands.

Another time, the part of the scorer that checks numbers cheerfully marked a wrong answer correct. A bug stripped the decimal point out of "28.5" and matched it against a gold answer of "28." A wrong number scored right, silently, and a wrong number scoring right is worse than one scoring wrong, because it inflates your confidence instead of denting it.

Three times the instrument lied with total confidence. The transcript got truncated, the tools got withheld, the decimal got eaten. And all three times, the only thing that caught it was reading the actual work instead of the score. Not a better model. Not a cleverer metric. Just going back and looking at what really happened, then at what the grader thought happened, and noticing they didn't match.

So I built two habits in.

The first is a tripwire. I keep a deliberately terrible example in the test set at all times: an agent with no doctrine that should always score zero. If my scorer ever starts handing that one a good grade, I know the scorer is broken before I trust a single other number it produces. A measurement that can't fail loudly is worthless, so I make sure one known-bad case is always there to make it fail loudly when it should.

The second is a reflex. I treat every surprising zero as a suspect, not a finding. Before I write down "the agent can't do X," I go read what the grader actually saw. Two of my three best "findings" evaporated the moment I did that. The number was real; the story I'd attached to it was fiction.

## The other half: did it get the right answer?

Everything so far is the process score, whether the agent worked the right way. There's a second half to the harness that asks whether it got the right answer. I run the agent through a set of hard, multi-step research questions where I already know the correct result, and I check. It scores in the low 80s percent. Respectable. But the interesting part isn't the headline. It's where the misses cluster.

Every single failure was a number. The agent was flawless on questions whose answer is a name or a fact, and it missed specifically on the ones that need you to pull several precise figures together and compute. My first read was the obvious one: it's bad at arithmetic. So I went and read all the failing transcripts in full, expecting to find bad math. (You can see the habit working now.)

There was no bad math. In none of the misses did the model add, divide, or round incorrectly. Every miss was a sourcing failure. It fed correct arithmetic the wrong input. The cleanest example: a question about a city's population in a specific census year, where the agent confidently grabbed a current population estimate instead of the figure from the year actually asked about. The math on the wrong number was perfect. The number was wrong. "Bad at numbers" was a misdiagnosis. The real weakness was which number it chose to trust.

Which set up one more use of the structure trick from the last post. I added a required section: a calculation block where every figure has to carry its value, its source, and the year it comes from, with an instruction to match that year to what the question is actually asking. And I'll be honest about the result in a way that matters. On the specific question I was chasing, I could watch it work. The agent switched from the convenient current estimate to the correct-year source, exactly as designed. But across the whole benchmark, the headline number didn't move. One question flipped to correct, another flipped to wrong on a different run, they cancelled out, and the overall score sat flat inside its own noise.

That's its own lesson, and a quieter one. A small, targeted fix needs a small, targeted measurement. If I'd only looked at the overall score, I'd have concluded the change did nothing, when reading the transcripts question by question showed it doing exactly what I built it for. The effect was just too small to survive being averaged into a noisy total. I kept the change anyway, because it has no downside, it demonstrably works on the case it was built for, and it makes every numeric answer auditable. Now I can see which figure and which year the agent used, instead of trusting a bare number. After three rounds of instruments lying to me, "I can see exactly what it did" is worth as much as the score.

## What I still can't measure yet

I want to be honest about the shape of all this evidence, because it has a ceiling. Everything I've described is a snapshot. One agent, one research question, one session that runs for a few minutes and then ends. That's the cheap thing to measure, and it's real, but it isn't the thing I actually care about. The promise of an identity document was never about a single session. It was about an agent that stays itself over hours and days, that doesn't drift, that makes the tenth decision as much like mine as the first.

So the experiment I haven't run yet is the long one. Put an agent with a rich identity and an agent with none on the same extended piece of work, the kind that fills a whole session and resumes the next day, and watch what happens to the judgment over time, not just at the end. My bet is that identity matters more the longer the horizon: that the no-identity agent holds up fine for ten minutes and slowly comes apart over ten hours. But that's a bet, not a finding, and the only way to settle it is to run it.

And I want the comparison to be richer than a score. Right now I can tell you the bare model scores zero on self-prosecution and the well-built one scores near-perfect. What I can't yet tell you with the same rigor is how the two outputs differ in the ways that don't reduce to a number: whether the identity-driven answer is better organized, more honest about what it doesn't know, more useful on a second read. A number tells you something moved. It doesn't tell you what changed in the work. Putting two agents side by side, identity against no identity, and characterizing the difference instead of just scoring it, is the next thing I'm building toward.

## Grade the grader

Here's the whole arc, three posts in. The first was about writing an agent an identity: the invisible 70% of how you work, the stuff that isn't in the handbook. The second was about making that identity stick, which means encoding the parts you care about as structure the output has to contain, not suggestions the model is free to skip. This one is about the part nobody warns you about: the instrument you build to check all of that needs checking too, and it will fail you confidently and silently if you let it.

The instinct to measure your agents is the right one. The first post admitted there was no formal way to do it; now there is, and it changes intent from something you hope for into something you can watch land. But a measurement is a tool, and tools have bugs, and a buggy measurement is more dangerous than no measurement because it wears the costume of rigor. Check what your judge can see. Keep a known-bad case in the mix so a broken scorer can't quietly start congratulating you. Read the work, not just the score, especially when the score surprises you.

The whole reason to measure intent instead of hoping is to be able to find out you were wrong. That only works if the measurement can be wrong out loud. Writing the identity was the first half. Finding out whether it stuck, and learning to trust the thing that tells you so, turns out to be where the real engineering lives.

---

*Starter files for the layered identity stack are on GitHub: [intent-engineering-starter](https://github.com/byrongeorge/intent-engineering-starter), MIT licensed.*