Build the Eval Stack That Survives Production with Claude
Here’s a workflow shift that’s happening quietly across teams shipping Claude-powered features: Claude writes the code, and your job becomes writing the tests. That sounds like a minor reframing. It isn’t. It means your primary lever for quality is no longer code review — it’s the assertion layer you build around Claude’s output.
The problem is most teams haven’t built that layer. A community thread I came across recently described a setup that’s probably familiar: around 30 manually-written test prompts living in a spreadsheet, vibe-check review when someone changes a prompt, some traces in an observability tool that mostly get looked at when something breaks, and zero automated eval gates blocking deployment. Regressions get caught via user complaints. That’s not a testing strategy — it’s a prayer.
Let me show you the eval architecture that actually fixes this, built around three layers: property-based tests Claude generates from your spec, human-written invariant assertions that capture domain-critical behavior, and a CLAUDE.md hook that prevents Claude from marking a task complete until both pass.
What you need
A Claude Code project with at least one real feature to test. A basic pytest setup (or your preferred test runner). Familiarity with CLAUDE.md for project-level instructions. You don’t need an existing eval framework — we’re building the stack from scratch.
The architecture before the code
Before writing anything, sketch the three layers:
flowchart TD
A[Feature Spec] --> B[Layer 1: Claude-Generated Property Tests]
A --> C[Layer 2: Human Invariant Assertions]
B --> D[CI Gate / CLAUDE.md Hook]
C --> D
D -->|All pass| E[Task Complete]
D -->|Any fail| F[Claude Iterates]
F --> B
Layer 1 answers: does the output have the right shape and properties? Claude can generate these automatically from a spec — covering edge cases you’d never write by hand.
Layer 2 answers: does the output actually behave correctly in ways that matter to the business? These require domain judgment. Claude can’t write them reliably. You write them.
Layer 3 is the enforcement mechanism. Without it, both layers are optional suggestions.
Step 1: Get Claude to generate property tests from your spec
The prompt pattern matters here. Don’t ask Claude to “write tests.” Ask it to enumerate failure modes first, then generate tests for each one.
You are writing property-based tests for the following feature spec:
<spec>
A summarization endpoint that accepts article text (100–10,000 chars)
and returns a summary. The summary must be shorter than the input,
must not introduce claims absent from the input, and must complete
in under 3 seconds on inputs under 5,000 chars.
</spec>
Step 1: List every failure mode you can imagine — edge cases, boundary
conditions, adversarial inputs, off-by-one errors, encoding issues.
Step 2: For each failure mode, write a pytest test that would catch it.
Use parametrize and random input generation where appropriate to cover
the property space rather than just specific examples.
Claude will produce something like this:
# claude_generated_tests.py
import pytest
import random
import string
from myapp.summarizer import summarize
# Failure mode: summary longer than input
def test_summary_shorter_than_input():
# Run across a range of random inputs to cover the property space
for _ in range(50):
length = random.randint(100, 10_000)
text = "".join(random.choices(string.ascii_letters + " ", k=length))
result = summarize(text)
assert len(result) < len(text), (
f"Summary ({len(result)} chars) not shorter than input ({len(text)} chars)"
)
# Failure mode: empty string input
def test_empty_input_raises():
with pytest.raises(ValueError, match="Input too short"):
summarize("")
# Failure mode: input at exact boundary (100 chars)
def test_minimum_length_boundary():
text = "a" * 100
result = summarize(text)
assert result is not None
# Failure mode: unicode and encoding edge cases
@pytest.mark.parametrize("text", [
"café " * 25,
"日本語テスト " * 20,
"emoji 🎉🔥💡 content " * 15,
])
def test_unicode_input_handled(text):
try:
result = summarize(text)
assert isinstance(result, str)
except ValueError:
pass # Explicit rejection is acceptable; silent failure is not
Run these immediately. Fix anything that fails. The randomized inputs will surface edge cases that hand-picked examples would never catch.
Step 2: Write the human invariant assertions
Property tests tell you about shape and boundaries. They won’t tell you whether the summary is faithful to the source, or whether it strips out a medical disclaimer your legal team requires. That’s domain knowledge. You own it.
# human_invariants.py
import pytest
import re
from myapp.summarizer import summarize
MEDICAL_DISCLAIMER = "consult a healthcare professional"
# Invariant: disclaimers in source must appear in summary
def test_medical_disclaimer_preserved():
source = (
"Ibuprofen can reduce fever effectively. "
f"Always {MEDICAL_DISCLAIMER} before use. "
"Do not exceed recommended dosage." * 20 # pad to meet minimum length
)
result = summarize(source)
assert MEDICAL_DISCLAIMER in result.lower(), (
"Medical disclaimer was dropped from summary — this is a compliance failure"
)
# Invariant: no hallucinated statistics
def test_no_introduced_percentages():
source = "The product received positive reviews from early adopters. " * 30
result = summarize(source)
assert not re.search(r'\d+%', result), (
"Summary introduced a percentage not present in source"
)
# Invariant: tone must not flip sentiment on financial content
def test_negative_financial_news_stays_negative():
source = "The company reported a significant loss this quarter. " * 30
result = summarize(source)
positive_words = {"growth", "profit", "gain", "surge", "beat"}
result_words = set(result.lower().split())
assert not result_words & positive_words, (
"Summary inverted negative financial sentiment"
)
These tests encode decisions that don’t live anywhere else in your codebase. The medical disclaimer invariant, for instance, probably came from a legal review six months ago — if it’s only in a Google Doc, it will eventually break silently.
Step 3: Wire both layers into a CLAUDE.md completion hook
This is the enforcement layer. Without it, Claude can write code that fails your invariants and still declare the task done.
# CLAUDE.md
## Task Completion Requirements
Before marking ANY task as complete, you must:
1. Run the full test suite: `pytest claude_generated_tests.py human_invariants.py -v`
2. All tests must pass with exit code 0.
3. If any test fails, diagnose the failure, fix your implementation, and re-run.
4. Do NOT mark a task complete while any test is failing.
5. Do NOT delete or modify tests in `human_invariants.py` to make them pass.
These are non-negotiable behavioral requirements, not suggestions.
## Test Authorship Rules
- Tests in `claude_generated_tests.py`: you may add, modify, or remove these.
- Tests in `human_invariants.py`: you may NOT modify these without explicit
human approval in the conversation. If an invariant seems wrong, flag it
and ask — don't quietly remove it.
The key rule is the last one. Claude will sometimes “fix” a failing test by weakening the assertion. The CLAUDE.md instruction creates a clear boundary: the property tests are Claude’s domain, the invariant tests are yours.
Where this breaks
The biggest failure mode is invariant tests that are too brittle — string matching on output that varies legitimately across model versions or prompt tweaks. Write invariants against semantic properties (“disclaimer present”) not exact phrasing (“must contain exactly: ‘consult a healthcare professional’”). The latter will break the moment Claude rephrases naturally.
The second failure mode is letting the CI gate become a rubber stamp. If every invariant passes on every run without ever catching anything, either your invariants aren’t covering real risk or your model outputs are more stable than you think. Periodically audit: take a known bad output and confirm your suite would have caught it.
Next steps
The tools that come up repeatedly in community eval discussions — LangSmith, Braintrust, promptfoo, Phoenix — all slot naturally on top of this architecture as the trace and observability layer. They’re great for understanding why an invariant failed, not for defining what failure means. Get the assertion layer right first, then add observability.
If you’re starting from the “30 spreadsheet prompts” baseline: pick your three highest-risk invariants, write those tests first, add the CLAUDE.md hook, and run one Claude Code session with the gate active. The first time it blocks a task completion and forces a fix, the investment pays off. Start there.
← Back to blog