I was reviewing an implementation where timing instrumentation had been added to three parallel API services. The pattern was identical across all three: capture a timestamp before the external API call, compute the delta after it returns, pass the value through the response, write it to the database. Clean work. Consistent. All 1,297 tests passed.
Something about the third service bothered me. I pulled up the database model definition for its output table. The timing column wasn’t there.
The service computed the timing. The route wrote it to the model object. But the model had no column for it, and no migration existed to add one to the production database. If this had shipped, every request to that service would have crashed on database commit.
All 1,297 tests passed because the test suite creates fresh databases from model definitions on every run. The column didn’t exist in the model, so it didn’t exist in the test database either, and no test exercised the full write path with a real commit. The tests verified a world where the problem literally could not manifest.
This isn’t a testing failure. It’s a structural blind spot in how test suites work with persistent databases — and it shows up across every ORM framework I’ve worked with: Django, SQLAlchemy, ActiveRecord, Prisma. The pattern is always the same: tests get fresh schemas; production doesn’t.
## The structural blind spot in test suites
There’s a quiet assumption in professional software engineering: if the tests pass and the build is clean, the work is probably correct. Most of the time, that assumption holds. But for an important class of changes — schema modifications, environment configuration, deployment artifacts, multi-layer data flows — it doesn’t just fail to hold. It actively misleads.
Tests create their own universe. Fresh databases with schemas generated from current code. Mocked external services that return exactly what you expect. Environment variables that always exist. Single-process execution with no concurrency. That universe is useful for catching logic errors and regressions. It is structurally incapable of catching certain categories of production failures.
A few I’ve encountered repeatedly across projects:
- **Missing database migrations.** Test databases don’t need migrations — they’re created from scratch. Production databases need explicit `ALTER TABLE` statements. A new column can exist in every test run and not exist in production.
- **Mock gaps.** When route tests mock the service layer, they verify that the route handles the service’s return value correctly. They cannot verify that the service and route actually work together. The mock is a promise that the real service will behave a certain way. Nobody checks whether the promise is kept.
- **Cached static assets.** A UI component can compile, pass all frontend tests, and still show the old version to users because the CDN is serving cached JavaScript from before the deploy.
- **Environment parity failures.** Code that works with SQLite in tests breaks with PostgreSQL in production. A feature that relies on an environment variable works on every developer machine and fails in CI because nobody added it to the deployment config.
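The missing-migration gap can be reproduced in a few lines. This is a minimal sketch using plain `sqlite3` rather than an ORM; the table and column names are hypothetical stand-ins for the timing incident:

```python
import sqlite3

# "timing_ms" stands in for the hypothetical new column: it exists in the
# current model definition, but no migration ever added it to production.
CURRENT_MODEL_DDL = "CREATE TABLE calls (id INTEGER PRIMARY KEY, timing_ms REAL)"
PRE_MIGRATION_DDL = "CREATE TABLE calls (id INTEGER PRIMARY KEY)"

def write_timing(conn):
    conn.execute("INSERT INTO calls (timing_ms) VALUES (?)", (123.4,))

# Test world: the schema is generated fresh from the current model definition.
test_db = sqlite3.connect(":memory:")
test_db.execute(CURRENT_MODEL_DDL)
write_timing(test_db)  # succeeds: the column exists here

# Production world: the schema predates the change, and no migration ran.
prod_db = sqlite3.connect(":memory:")
prod_db.execute(PRE_MIGRATION_DDL)
try:
    write_timing(prod_db)
except sqlite3.OperationalError as e:
    print(f"production write failed: {e}")  # table calls has no column named timing_ms
```

The same code path is green in one world and fatal in the other, which is exactly why a passing suite says nothing about the production schema.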
The common thread: these are not bugs in the code. They are gaps between the world the code was tested in and the world the code runs in.
## Why the third one always has the bug
There’s a second pattern I’ve noticed, and this one is cognitive rather than structural.
When you implement the same pattern across multiple components — three services, four endpoints, five database models — the quality of your work follows a predictable curve. The first implementation gets the most attention. You’re figuring out the pattern, thinking carefully, verifying each step. The second one gets solid attention — you’re pattern-matching now, but still checking. The third one gets the least. You’ve done this twice successfully, you know the pattern, it feels routine.
That false confidence is where bugs hide. The third implementation inherits your trust from the first two without inheriting your diligence.
This pattern becomes acute with AI coding agents. The timing instrumentation I described at the top? That was AI-generated code. An AI agent implemented the same pattern across three services in a single session. The first two services were complete across all five layers: service logic, return value, route handler, model column, and migration. The third had layers one through three but was missing four and five.
AI agents are fast, systematic, and consistent at applying patterns — which makes it easy to assume every instance is correct when the first two are. But they exhibit the same diminishing attention curve that humans do, and at larger scale. A human might implement the same pattern across three components in an afternoon. An AI agent might do it across a dozen in twenty minutes, producing code faster than you can review it. The velocity that makes AI coding agents valuable is the same velocity that makes systematic verification essential. If you’re reviewing AI-written code by spot-checking the first implementation and assuming the rest follow suit, you’re trusting pattern consistency that may not hold at the edges.
## Self-review is a distinct discipline
Most teams have code review. Some have comprehensive test suites. A few run periodic codebase audits. These are all valuable. But none of them are designed to answer the question that matters most right after you finish implementing something: is this change genuinely complete?
Code review is external — someone else looking at your work. Test suites are automated — they check what they’re programmed to check. Codebase audits are periodic — they evaluate the system, not a specific change. There’s a gap between “I wrote the code” and “a peer reviews it” where the implementer is the only person with full context on what the change is supposed to do. That’s where self-verification belongs.
I’ve been refining this idea into what I think of as three verification scopes, and I recently codified them as open-source Agent Skills that any AI coding agent (or human) can use:
| Scope | Skill | Core question |
|---|---|---|
| Single change | synthesis-implementation-integrity | Is this change genuinely complete? |
| Proposed merge | synthesis-pr-review | Should this change enter the codebase? |
| Entire system | synthesis-codebase-review | Is this system healthy? |
The implementation integrity skill — the new one — is the self-verification layer. You run it after completing an implementation and before creating a pull request. It exists to catch what tests structurally can’t, and what you structurally won’t notice because you just spent an hour building the thing.
## Seven passes, not a vibe check
The skill isn’t “look at your code and see if it seems right.” That’s what we already do, and it’s how the timing column got missed. The skill is a structured protocol with seven specific passes, each targeting a known failure mode:
- **Chain completeness.** Trace every new data element through its full lifecycle — origin, transport, validation, transformation, storage, retrieval, presentation. Grep for the field name across the entire codebase. Every layer that handles the entity should reference it. If a layer doesn’t, that’s a broken link.
- **Placeholder detection.** Search changed files for `TODO`, `FIXME`, “for now”, “temporary”, hardcoded values, stub implementations. These have an expected lifespan of forever. If it’s not acceptable as permanent code, it’s not acceptable to ship.
- **Test honesty.** Read the tests that ostensibly cover the change and ask: do they exercise the actual code path that could fail in production, or do they mock it away? If you added a database column and zero tests were modified, the tests almost certainly don’t cover the write path.
- **Environment parity.** What assumptions does this code make about its runtime environment? Fresh vs. persistent databases. Local filesystem vs. cloud storage. Single instance vs. multiple. Current code vs. CDN-cached assets.
- **Diminishing attention audit.** If this is the Nth implementation in a series, verify the last one with first-item diligence. Count its layers against the first implementation. Every layer the first one has, the last one should have too.
- **Companion change completeness.** Backend changes usually need frontend updates. New environment variables need deployment config. Schema changes need migrations. For each file you changed, ask: what other files would a complete implementation require?
- **Boundary verification.** Where does the code meet external systems, user input, or other services? What happens when the external service is slow, returns garbage, or is unavailable?
Each pass includes a challenge question designed to force genuine examination rather than casual confirmation. The test honesty pass, for example, asks: “If I deleted the implementation I just wrote but kept the tests, would any test fail?” If the answer is no, the tests don’t actually cover the change.
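Several of these passes are mechanical enough to script. Here is a minimal sketch of the chain-completeness grep, where the layer-to-path map and the field name are hypothetical — adapt both to your own codebase layout:

```python
import pathlib

# Hypothetical mapping from each layer of the stack to where it lives on disk.
LAYERS = {
    "service": "services/external_api.py",
    "route": "routes/api.py",
    "model": "models/response.py",
    "migration": "migrations",
}

def layers_missing_field(root: pathlib.Path, field: str) -> list[str]:
    """Return the layers whose files never mention the new field."""
    missing = []
    for layer, rel in LAYERS.items():
        path = root / rel
        # A layer may be a single file or a directory of files.
        files = path.rglob("*.py") if path.is_dir() else [path]
        if not any(f.is_file() and field in f.read_text() for f in files):
            missing.append(layer)
    return missing

# Any non-empty result is a broken link in the chain,
# e.g. ["model", "migration"] for the timing incident.
```

A substring match is crude — it can’t tell a reference from a comment — but as a completeness tripwire, a layer with zero mentions of the field is exactly the signal you want.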
## The adversarial mindset
The hardest part of self-review isn’t the checklist. It’s the posture.
When you’ve just spent an hour building something, your default mode is confirmation bias. You’re looking for evidence that it works. That’s natural — you want it to work, you think it works, and every signal that confirms “it works” feels good.
The implementation integrity protocol asks you to flip that. You’re not verifying the work is correct. You’re trying to prove it’s wrong. Would you stake your reputation on this implementation? Would you deploy it at 5 PM on a Friday, confident nothing will page you at 2 AM?
If the answer is “yes, but I haven’t actually checked the migration” — that’s not confidence. That’s hope.
Five rules I’ve settled on for honest self-review:
- **Distrust your memory of what you did.** Open the file and check. Your recollection of having added the column is not evidence the column exists.
- **Distrust passing tests.** Ask what the tests actually exercise. A test that mocks the layer where the real risk lives is a test of your assumptions, not your code.
- **Distrust the last implementation in a series.** Give it more scrutiny, not less.
- **Distrust “it compiles.”** Compilation proves syntax. It proves nothing about data flow, schema state, configuration completeness, or environment parity.
- **Distrust your own confidence.** The feeling of “this is obviously fine” is a signal to look harder. Obvious failures don’t survive to production. The non-obvious ones do.
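The second rule is the one the timing incident turned on. Here is a sketch of how a mocked test stays green while the integration is broken — both functions are hypothetical, with the route expecting a `timing_ms` key the real service never produces:

```python
from unittest import mock

def fetch_data():
    # The real service: returns a result but no timing_ms.
    return {"result": "ok"}

def handle_request(service=fetch_data):
    # The route: trusts the service to supply timing_ms.
    data = service()
    return {"result": data["result"], "timing_ms": data.get("timing_ms")}

# The route test mocks the service layer. It verifies that the route handles
# the promised shape -- not that the real service keeps the promise.
fake_service = mock.Mock(return_value={"result": "ok", "timing_ms": 12.3})
assert handle_request(service=fake_service)["timing_ms"] == 12.3  # green

# Integration reality: with the real service, timing_ms is silently None.
assert handle_request()["timing_ms"] is None
```

Both assertions pass. The mocked test is not wrong — it is just testing a promise nobody verified.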
## Where this matters most: AI-written code
If you’re using AI coding agents — Claude Code, Cursor, Codex CLI, Copilot — this verification gap is your biggest practical risk. AI agents write code fast, confidently, and at volume. They implement patterns consistently across many components in a single session. When they miss something, the failure mode is subtle: three of four implementations are correct, which makes the fourth look correct by association. The agent declares the work complete. The tests pass. And the missing link only surfaces in production.
The implementation integrity skill is designed specifically for this workflow. After your AI agent finishes building something, you ask it to verify its own work using a structured adversarial protocol instead of its default instinct to report success. “Run an implementation integrity check” becomes the discipline that separates AI-assisted development from AI-trusted development.
These skills are formatted as Agent Skills — they work with Claude Code, Cursor, Codex CLI, and about forty other AI coding tools. Install them with one command:
```bash
npx skills add rajivpant/synthesis-skills --global --all --copy
```
The three skills cross-reference each other. The implementation integrity skill knows it’s the self-review step before pr-review. The pr-review skill knows it should check whether an integrity pass was run. The codebase-review skill positions itself as the periodic system-wide check that catches what neither change-level skill reveals.
## The broader point
Software engineering has a mature practice around testing. We have unit tests, integration tests, end-to-end tests, property-based tests, mutation tests, fuzz tests. We have coverage tools and CI pipelines and pre-merge gates.
What we don’t have is a mature practice around verifying that a specific implementation is complete. Not “does the code work in the test environment” but “have I built every layer this change requires, for every environment it will run in.” Testing is necessary. It’s not sufficient. The gap between those two words is where production breaks.
The timing column incident wasn’t a close call because the fix was hard. Three lines of code: a column definition, an import, and a migration. It was a close call because everything else signaled “done” — green tests, clean build, consistent implementation across the other two services. Every quality gate we trust said this was ready. And it wasn’t.
If your test suite is green after a schema change and you didn’t modify any tests, treat that as a question, not an answer. The question is: do the tests know about this change? If they don’t, they’re not evidence it works. They’re evidence of a gap.
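One way to turn that question into a check is a schema-drift assertion: compare the columns the current code expects against the columns the live database actually has. This is a minimal sketch with `sqlite3` and hypothetical names; the same idea works with any ORM’s schema inspector:

```python
import sqlite3

# The columns the current code expects, per table (hypothetical).
EXPECTED = {"calls": {"id", "result", "timing_ms"}}

def schema_drift(conn):
    """Map each table to the expected columns missing from the live schema."""
    drift = {}
    for table, expected in EXPECTED.items():
        # PRAGMA table_info rows are (cid, name, type, ...); index 1 is the name.
        live = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        missing = expected - live
        if missing:
            drift[table] = missing
    return drift

# A database that never received the timing_ms migration:
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, result TEXT)")
print(schema_drift(prod))  # {'calls': {'timing_ms'}}
```

Run against the real production connection (read-only), a non-empty result is the missing migration announcing itself before the first crashing commit.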
All three skills are open source at github.com/rajivpant/synthesis-skills under a CC0 license. Use them, adapt them, build on them. If they save you one Friday night page, they’ve earned their keep.