My AI Grader Scored the Same Conversation Differently Every Time I Ran It

Q: Why not just lower the temperature to 0 and call it done?

Temperature 0 makes a single model *more* likely to repeat itself, but it doesn't make the judgment correct, and it's brittle across model versions or providers — a model upgrade can quietly shift scores for everyone's already-graded history. Removing the need to re-derive the fact at all is more durable than relying on a hyperparameter to suppress variance.

I was building a feature where an LLM grades a multi-turn conversation against a checklist — think a practice-interview bot that scores how well someone covered a set of required topics. The grading worked fine on the first pass: dump the whole transcript into a prompt, ask for a score out of 100 across a few dimensions, parse the JSON back. Then, while writing tests, I re-ran the exact same finished transcript through the grader twice and got two different dimension breakdowns. Same input, same code, no randomness in the business logic — just an LLM re-deriving a judgment from scratch each time and landing somewhere slightly different.

That's fine for a single subjective opinion. It's not fine for a score that ends up on someone's record, where "why did I get marked down for something my transcript clearly covers" is a question someone will eventually ask.

The original approach: one prompt, one score

The first version looked roughly like this — score everything in a single call:

SCORE_PROMPT = """You are grading a practice interview transcript against this checklist:
{checklist}

Conversation:
{conversation}

Score each dimension out of its max points and return JSON:
{{"coverage": 30, "technique": 14, "thoroughness": 18}}"""

raw = await llm.chat([{"role": "user", "content": SCORE_PROMPT.format(
    checklist=checklist, conversation=full_transcript,
)}])

This is the natural first thing to write, and it's why it's worth flagging: it looks correct. It returns well-formed JSON, the numbers look plausible, nothing throws an error. The problem only shows up if you happen to grade the same input twice and compare.

Warning

A single LLM call asked to "re-read this transcript and decide what happened" will not reliably produce the same judgment on a second pass, even at low temperature. Lower temperature reduces variance, it doesn't remove it — and it doesn't fix the actual issue, which is asking the model to re-derive something instead of recording it when it happened.

The fix: stop asking the LLM to re-derive facts it already observed

The checklist-coverage part of the score doesn't need to be decided after the fact at all — it can be tracked turn by turn, while the conversation is happening, with a much narrower, cheaper LLM call per message ("did this specific message touch on topic X, yes/no") instead of one expensive call re-reading everything at the end:

# during the conversation, after each user message:
hits = await classify_topics_mentioned(message, checklist_topics)
covered_topics |= set(hits)

Once the conversation ends, the coverage dimensions become a deterministic tally — no LLM involved, no variance possible:

def coverage_score(checklist: list[dict], covered: set[str]) -> dict[str, int]:
    totals, earned = {}, {}
    for item in checklist:
        dim = item["dimension"]
        totals[dim] = totals.get(dim, 0) + item["weight"]
        if item["id"] in covered:
            earned[dim] = earned.get(dim, 0) + item["weight"]
    return {dim: round(earned.get(dim, 0) / totals[dim] * DIM_MAX[dim]) for dim in totals}

That leaves exactly one dimension that genuinely can't be computed this way: something like "interview technique" — was the conversation well-organized, was anything asked redundantly, was the tone appropriate. There's no checklist item for "didn't repeat yourself." That part stays an LLM judgment call, but now it's scoped to only the one thing that needs judgment, and it's handed the already-known coverage facts as context instead of being asked to re-discover them:

TECHNIQUE_PROMPT = """Judge only the interview technique dimension (max {max} points):
organization, redundancy, tone. You do not need to re-determine topic coverage —
here is what's already confirmed covered:
{coverage_summary}

Full conversation:
{conversation}
"""

Tip

The rule of thumb that fell out of this: if a fact has a clear "true the moment it happened" answer, record it the moment it happens and never re-ask an LLM to re-derive it later. Save the LLM call for the one dimension that has no such moment — judgments about the overall shape of a conversation, which really do require reading the whole thing.

I also kept the old single-prompt grader around as a fallback path for any data graded before this change shipped, rather than trying to retroactively recompute coverage for records that never tracked it. Worth deciding upfront whether a change like this needs a compatibility path, since retrofitting one under pressure later is worse.

Note

This same split applies anywhere you're tempted to ask an LLM to grade a whole artifact in one shot — code review comments, support-ticket quality scoring, essay grading. Anything with an objective, checkable sub-component should be pulled out and computed; only the genuinely subjective remainder should stay an LLM call.

FAQ

Why not just lower the temperature to 0 and call it done?

Temperature 0 makes a single model more likely to repeat itself, but it doesn't make the judgment correct, and it's brittle across model versions or providers — a model upgrade can quietly shift scores for everyone's already-graded history. Removing the need to re-derive the fact at all is more durable than relying on a hyperparameter to suppress variance.

How do I tell which parts of a score should be deterministic vs. LLM-judged?

Ask whether the thing being measured has an instant of objective truth. "Did the user ask about X" is true or false the moment the question is asked — capture it then. "Was this well-organized" has no such instant; it's a property of the whole arc, which is exactly the kind of judgment LLMs are reasonably good at and rule-based code isn't.

Doesn't per-turn tracking risk missing things a full-transcript review would catch?

A little, depending on how narrowly you scope the per-turn check. In practice, keep one LLM call that does see the entire conversation — the technique/quality judgment call — so anything a narrow per-turn check missed is still visible to the one pass that's reading everything.

Bottom Line

When an LLM gives a different answer to the same question twice, the fix usually isn't a better prompt — it's noticing you're asking it to re-derive something you already know. Record the fact when it happens, score from the record, and spend the LLM call on the one dimension that genuinely has no record to score from.