The BYOP diagnostic found the failure. You rebuilt the prompt. Definitions replaced the model’s training prior. The misclassification rate dropped. You deployed it.
Six weeks later, two failures appear in the sprint retrospective. One diff marked "breaking": false shipped without a migration note. The other landed under the wrong type.
The rebuild was correct. The BYOP pass found the Semantic Drift Vector and removed it. What the rebuild did not do: specify, before deployment, what a correct output looks like. The failures were not caused by the prompt drifting back. They were caused by outputs on cases no one verified in advance.
The missing piece is a prior written commitment to what correct means for this specific prompt in this specific deployment context. That commitment has a name: a Ground Truth Contract. It is not a test suite, though a test suite is one implementation of it. It is the specification that turns “the rebuild raised the baseline” into “the deployment has a floor.”
The BYOP rebuild fixes the Semantic Drift Vector: the gap between what a term means in your domain and what the model infers from training. Add explicit definitions and the model stops using its training prior for those categories. That is the correct fix for that failure mode.
But a rebuilt prompt is a probabilistic system. The definitions narrow the model’s distribution over output categories; they do not collapse it to a single correct answer. For inputs close to the boundary of your definitions, the model makes a judgment call. Whether that judgment call is correct depends on whether your definitions were precise enough for that edge case.
Without a prior specification of what correct looks like, you cannot evaluate whether those judgment calls are right. You can observe outputs. You cannot compare them to a standard you committed to before looking. The distinction matters: observing outputs and rationalizing them is not evaluation. It is confirmation bias operating on a sample you did not choose.
A Ground Truth Contract creates the prior commitment that makes evaluation possible. It does not change the prompt. It specifies what the prompt’s outputs must satisfy, in your domain’s language, before you run it.
A Ground Truth Contract has three components. Each addresses a different way a correctly rebuilt prompt can still fail in deployment.
Component 1: Evaluation criteria. What makes an output correct by your domain’s definition, not by the model’s general knowledge? This is the hardest component to write because it requires committing to a definition before seeing outputs. It forces you to articulate the conventions your team has always applied but never written down.
For a changelog classifier: “A diff that removes a public method is breaking regardless of whether the method was documented. A diff that adds a method to an internal module is not feat.” These criteria cannot come from the model. They come from the domain. The model’s prior for what “breaking” means is not wrong in the abstract. It is wrong for your API surface.
Component 2: The reference set. A set of specific inputs with their correct outputs, specified by your criteria, before you run the prompt. Not selected after seeing what the model handles well. The reference set is the ground truth: if the prompt produces the specified output on each input, it passes. If it does not, it fails.
The reference set does not need to be large. It needs to cover the ambiguous edge cases where the model’s prior and your domain’s convention diverge. A small, well-chosen set surfaces more information than a large randomly selected one. Five cases that target the boundaries of your definitions are worth more than fifty cases the prompt already handles correctly.
Component 3: Stop-conditions. The specific output patterns that cause you to stop accepting this prompt’s outputs and trigger a rebuild. Not “we will review it monthly.” Stop-conditions specify what failure looks like in concrete, observable terms: “If the classifier marks any diff that removes a public method as breaking: false, the contract is violated and the prompt must be rebuilt.”
Stop-conditions are what turn a Ground Truth Contract from a document into an operational gate. Without them, the criteria and reference set tell you what correct looks like. They give you no trigger to act when it stops being true.
The classifier from the previous post has been running for six weeks. The BYOP rebuild replaced the model’s Conventional Commits prior with explicit definitions. The misclassification rate dropped substantially. Breaking-change false negatives decreased.
In the sprint retrospective, two failures surface. A diff that changed the return type of an exported function from boolean to Promise<boolean> was classified as fix with "breaking": false. It shipped without a migration note. A diff that removed an exported constant was labelled chore; it was a breaking change that required callers to update import paths.
The failures are not caused by the prompt reverting to its old behavior. The definitions are still present. They reduced the original Semantic Drift Vector’s failure rate. But no one specified, before deploying the rebuilt prompt, what the correct output was for a return-type change or a deprecated constant removal. Those cases were not in a reference set because no one wrote a reference set.
Here is the Ground Truth Contract that would have made both failures detectable at the reference set stage, before the first deployment of the rebuilt prompt:
GROUND TRUTH CONTRACT — Changelog Classifier v2
Scope: TypeScript monorepo, public API surface = src/api/**
Prompt version: v2 (BYOP rebuild)
EVALUATION CRITERIA
1. A diff is "breaking: true" if any exported function, type, or interface
in src/api/** is renamed, removed, or has its signature changed such
that existing callers must update their code.
2. When uncertain whether a change is breaking, the correct output is
"breaking: true". False negatives on breaking changes are costlier
than false positives.
3. A diff is "feat" only if it adds a capability a consumer of the public
API could not previously invoke. Internal changes, performance
improvements, and test additions are not feat regardless of scope.
4. Descriptions must characterize the change from the diff. A description
that reproduces commit message text verbatim fails this contract.
REFERENCE SET (7 cases)
Case 01: diff removes exported parseConfig from src/api/config.ts
Correct: type = fix or refactor; breaking = true
Case 02: diff adds optional third param to exported formatDate
Correct: type = feat; breaking = false
Case 03: diff renames internal helper in src/utils/date.ts
Correct: type = refactor; breaking = false
Case 04: diff updates @types/node from 18 to 20
Correct: type = chore; breaking = false
Case 05: diff adds new exported function to src/api/auth.ts
Correct: type = feat; breaking = false
Case 06: diff changes return type of exported verify() from boolean
to Promise<boolean>
Correct: type = fix or feat; breaking = true
Case 07: diff removes deprecated exported constant MAX_RETRIES
Correct: type = chore; breaking = true
STOP-CONDITIONS
SC-1: Any public API removal or signature change marked "breaking: false":
contract violated; halt and rebuild.
SC-2: Any description that reproduces commit message text verbatim:
flag for review.
SC-3: Failure on cases 01, 06, or 07 (highest-consequence patterns):
rebuild required before next deployment.
Case 06 covers the return-type change. Case 07 covers the removed constant. SC-1 would have flagged either failure immediately. The failures were predictable. They were not predicted because no one wrote down what correct meant before looking at outputs.
The contract does not need to be exhaustive. The seven reference cases cover the boundary conditions where the model’s prior and the domain’s convention are most likely to diverge. The stop-conditions target the highest-consequence failure patterns. The criteria are written in the language the engineering team uses in code review, not in the language of model evaluation.
A Ground Truth Contract establishes a floor at deployment. It does not keep the floor in place.
Two things erode it over time. First, the model changes. A model version update alters behavior on your reference set without altering the contract. If you do not re-run the reference set against the new model version, you are running a contract written for a different system. The criteria and cases are still valid. Whether the prompt satisfies them is no longer verified.
Second, the domain’s conventions change. When your team updates the definition of “breaking” or adds a new change type to the classification scheme, the evaluation criteria become stale. Stale criteria mean your reference set is testing the wrong thing. A prompt can pass the contract and still produce outputs your team considers wrong, because the team’s standard has moved since the contract was written.
Catching both failure modes requires more than a written contract at deployment. It requires a process that re-runs the reference set on a schedule, detects when the prompt’s outputs on reference cases deviate, and alerts when the deviation crosses a threshold. That process is not a Ground Truth Contract. It is the enforcement layer that keeps the contract valid over time. It has a name, and it is the subject of the next post.
A Ground Truth Contract is a written specification, scoped to a specific prompt in a specific deployment context, that defines: (1) the evaluation criteria for correct output by domain standards, (2) a reference set of inputs with known-correct outputs specified before running the prompt, and (3) the stop-conditions that trigger a rebuild.
In use: “Before we redeployed the classifier, we wrote a Ground Truth Contract covering the seven edge cases that had caused misclassifications. Stop-condition SC-1 flagged a breaking-change failure in the first week of deployment.”
Where it does not apply: a Ground Truth Contract specifies correctness for a fixed deployment context. It does not handle model drift over time, distribution shift in inputs, or changes to the domain’s conventions. When the underlying model changes, the reference set must be re-run. When the domain’s conventions change, the criteria must be updated before re-running the reference set. The contract governs the prompt. It does not govern the model or the domain.
The contract is also not a capability test. It does not establish whether the model can perform the task at all. A Ground Truth Contract governs a specific prompt already in a specific deployment context. It answers “is this prompt producing correct outputs by our criteria” at the moment of deployment, and it provides an operational trigger when the answer becomes no.
Take the production prompt most recently rebuilt or modified. Before the next run that matters, write three things: the evaluation criteria in your domain’s language (three to five criteria, one sentence each), five to seven reference cases with specified correct outputs, and two stop-conditions that would trigger a rebuild.
Write the criteria first, before looking at outputs from the rebuilt prompt. The order matters: criteria written after looking at outputs are rationalizations, not specifications. They confirm what the model already produces rather than specify what your domain requires.
If you find that you cannot write the criteria without looking at outputs first, that is the diagnostic result. The prompt is deployed without a prior commitment to what correct means. That is the condition that allows second-generation misclassifications: failures on cases no one specified as correct in advance.
The full vocabulary for this, all 25 precision terms, is on the Method page, free and public.
For engineers who want the deployable version: 12 production-ready system prompts including SP-11 (Ground Truth Contract Builder) and five fully-worked BYOP diagnostic rebuilds are in the Agent Control Architecture Pack.
The Constraint newsletter ships weekly: one issue, one technique, one precise term. Subscribe free.
All 25 precision terms, the Prompt Maturity Model, and the vocabulary that makes AI agent failures diagnosable.
Read the Method → Subscribe to The Constraint