
Trust in AI is at an all-time low and your tests are why

84% of developers use or plan to use AI tools. 46% don't trust the output. The gap isn't a PR problem; it's a testing problem, and I watched it happen.


Everyone adopted AI. Nobody trusts it. That’s not a headline — it’s the actual data.

The Stack Overflow 2025 Developer Survey polled over 49,000 developers and found that 84% now use or plan to use AI tools, up from 76% the year before. In the same survey, favourable sentiment dropped from over 70% in 2023-24 to 60%. 46% of developers don’t trust the accuracy of AI tool outputs — up from around 30% in 2024. Usage is up, trust is down, and the gap is widening.

I don’t think this is a communication problem or a model quality problem. I think it’s a testing problem. Specifically: developers handed AI a metric, trusted it to both hit the metric and validate the result, and then acted surprised when that turned out to be circular.

What “all tests passing” actually means

I was using an agent to generate tests for a large production codebase. The task was well-scoped, the model was capable, and the output looked good. The agent reported back: all tests passing.

I ran the tests myself. Three failures. Ten skipped.

The agent hadn’t lied, exactly. It had reported the state it believed was true. The skipped tests weren’t failures in its output — they were just absent. The failures were in areas where it had quietly decided the existing test wasn’t worth addressing and moved on. It said the task was complete because, by whatever measure it was applying internally, it was.

This is the version of hallucination nobody talks about. Not “AI invented a package that doesn’t exist” (though that happens too: commercial models hallucinate package names 5.2% of the time, and open-source models 21.7% of the time). The quieter version: AI reports success in a voice that sounds exactly like actual success. There’s no hedging, no “I wasn’t sure about these three,” no diff you can eyeball for gaps. Just: done.
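One way to make "run the tests yourself" harder to fudge is to count outcomes from the runner's own report instead of reading anyone's summary of it. The sketch below assumes the runner can emit a JUnit-style XML report (pytest, for example, can write one with `--junitxml`). The names `Summary`, `summarize_junit`, and `SAMPLE_REPORT` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Summary:
    total: int = 0
    failed: int = 0
    skipped: int = 0

    @property
    def passed(self) -> int:
        return self.total - self.failed - self.skipped

    @property
    def all_passing(self) -> bool:
        # A skipped test is not a pass: it is a test that never ran.
        return self.total > 0 and self.failed == 0 and self.skipped == 0


def summarize_junit(xml_text: str) -> Summary:
    """Count outcomes per <testcase> rather than trusting summary attributes."""
    root = ET.fromstring(xml_text.strip())
    summary = Summary()
    for case in root.iter("testcase"):
        summary.total += 1
        if case.find("failure") is not None or case.find("error") is not None:
            summary.failed += 1
        elif case.find("skipped") is not None:
            summary.skipped += 1
    return summary


# Hypothetical report: two passes, one failure, one skip.
SAMPLE_REPORT = """
<testsuite name="pricing" tests="4">
  <testcase name="test_total"/>
  <testcase name="test_tax"/>
  <testcase name="test_rounding"><failure message="expected 9.99"/></testcase>
  <testcase name="test_currency"><skipped/></testcase>
</testsuite>
"""

summary = summarize_junit(SAMPLE_REPORT)
print(f"passed={summary.passed} failed={summary.failed} skipped={summary.skipped}")
print("all passing" if summary.all_passing else "NOT all passing")
```

The design choice that matters here is that skips count against "all passing". A report with zero failures and ten skips is a report about a smaller test suite than the one you thought you had.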

The coverage trap

The second incident was worse, because it looked right for longer.

The agent was generating tests for a module with low coverage. It generated them. Coverage went up. Then someone noticed the module was behaving differently in an integration context — a downstream component that depended on the module’s original behaviour was failing in a way that was hard to trace.

What the agent had done: modified the production code to fit the test, rather than writing a test that fit the production code. The test wasn’t valuable — it was a coverage number — and the agent had optimised perfectly for the metric it was given. It hit the number. The blast radius was somewhere else entirely, in code the agent had no visibility into, and nothing in the test results flagged it.

Tests that an AI writes to satisfy a coverage target are not tests. They’re receipts.

The distinction matters. A test that verifies behaviour will fail when behaviour changes unexpectedly. A test written to hit a number will pass indefinitely, regardless of what’s actually happening, because it was never checking anything meaningful to begin with.
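A toy illustration of that difference, not taken from the incident itself: the function, the "changed" variant, and both test helpers below are hypothetical. The receipt-style test executes the code, which is enough to raise coverage, but asserts nothing about the result. The behaviour test pins a value that a human decided was correct.

```python
def apply_discount(price: float, percent: float) -> float:
    """Original behaviour: reduce price by the given percentage."""
    return price * (1 - percent / 100)


def apply_discount_changed(price: float, percent: float) -> float:
    """A silent behaviour change: percent is now treated as a fraction."""
    return price * (1 - percent)


def receipt_test(fn) -> bool:
    """Runs the code (so coverage rises) but checks nothing about the result."""
    try:
        fn(100.0, 10.0)
    except Exception:
        return False
    return True


def behaviour_test(fn) -> bool:
    """Pins the expected result: 10% off 100 should be 90."""
    return abs(fn(100.0, 10.0) - 90.0) < 1e-9


for fn in (apply_discount, apply_discount_changed):
    print(fn.__name__, "receipt:", receipt_test(fn), "behaviour:", behaviour_test(fn))
```

The receipt test passes for both versions. Only the behaviour test notices that the second version returns a negative price.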

Speculation: the confidence chase

We developed a working hypothesis for why the “all passing” reports kept happening even when they weren’t true: the model is optimising for task completion confidence, not correctness.

The idea is something like this: a model chasing a completed task is aiming for a 1.0 confidence signal. When the task is small and well-scoped, it gets there. But as context grows — as the file gets longer, the session extends, the number of things to track increases — that perfect signal gets harder to reach. At some point, rather than saying “I couldn’t finish this,” the model settles for the highest confidence it can reach — call it 0.85 — and reports done.

The skipped tests, the quietly broken code, the “I did X” when X doesn’t exist — these could all be the same mechanism: a model that found the closest thing to task-complete it could reach and called it close enough.

To be clear: I have no evidence for this beyond pattern-matching on incidents. It might be wrong. But it felt worth naming because it changed how I think about agent output. If you assume the model is trying to complete the task, a false success report is surprising. If you assume the model is optimising for a completion signal, false success is the most predictable thing in the world.

Zero trust, actually

The reason these incidents didn’t reach production was that we had hard review gates in place — no auto-commits, no auto-merges, mandatory human sign-off before anything landed. Those gates existed because of earlier, more painful incidents: developers who trusted model output without running it themselves, who pushed code where the agent had confidently reported success and the code had confidently done nothing of the sort.

The policy wasn’t complicated: treat AI-generated code the way you’d treat code from an untrusted external source. Verify the claims. Run the tests yourself. Check that the thing it said it changed actually changed.
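Part of that verification can be mechanical. As a rough sketch, a review gate for a "write tests only" task could flag any changed file that lives outside the test directories, which is the kind of check that would have surfaced the production edit in the coverage incident. The directory names in `TEST_DIRS` and the function names are assumptions for illustration; the list of changed paths could come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath

# Assumption: tests live under one of these top-level directories.
# Adjust to match the repository's real layout.
TEST_DIRS = ("tests", "test")


def is_test_path(path: str) -> bool:
    """True if the path's first component is a known test directory."""
    parts = PurePosixPath(path).parts
    return len(parts) > 0 and parts[0] in TEST_DIRS


def production_changes(changed_paths: list[str]) -> list[str]:
    """Return the changed files that sit outside the test directories."""
    return [p for p in changed_paths if not is_test_path(p)]


def review_test_only_task(changed_paths: list[str]) -> str:
    """Block a test-only change set if it touched production files."""
    touched = production_changes(changed_paths)
    if touched:
        return "BLOCK: production files changed: " + ", ".join(touched)
    return "OK: only test files changed"


# Hypothetical change set from an agent asked to add tests only.
print(review_test_only_task(["tests/test_pricing.py", "src/pricing.py"]))
```

A check like this does not replace the human sign-off. It only makes sure the reviewer is told, up front, that the change set is bigger than the task description implied.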

This applies to tests especially. If an AI wrote the tests and an AI checked the tests, you don’t have a test suite — you have an AI talking to itself. The test suite is only useful as a safety layer if a human defined what “correct” looks like. Otherwise it’s just another output that needs verifying, all the way down.

It compiles. That’s it.

The adoption numbers aren’t going down. AI coding tools are genuinely useful for boilerplate, for exploration, and for first drafts of things you’d write anyway. The distrust numbers aren’t going down either, and they shouldn’t, because trust without evidence isn’t trust, it’s just hope.

The developers who’ve figured this out treat AI output the way good engineers treat any untested code: as a starting point, not a result. The tests are the layer that converts “the model said it works” into “I have evidence it works.” That layer has to be built by humans, checked by humans, and run before anyone declares done.

If your tests come from the same place as your code, you don’t have a safety net. You have two things that agree with each other, and no way to know if either of them is right.

Jacques Bronkhorst
Principal engineer who ships across the stack — enterprise .NET by day, an over-engineered home lab by night. Writes it all down at jcqb.dev.