
AI model evaluation

Why I shipped plain code, not an AI model

Problem

I built a pipeline to transcribe made-up bedtime stories and label who said what. I imposed one constraint: nothing leaves my local machine, for family privacy. Analyzing the transcriptions, I found the most common error was mistranscribed proper names. That matters because I plan to chart these improvised characters across sessions, which only works if their names stay stable. The pipeline even misspelled my daughter's own name:

Her name: Arti
Transcribed as: Artie, Arthie, Earthy, Arty

So before fixing it, I ran an evaluation: what's the most reliable way to catch proper-name errors against a set roster?

Action

First I went through the session transcripts and flagged every place the tool mis-rendered a name. With the family roster as the correct spellings, those flags became the ground truth I scored against. Then I ran a range of detectors over the same transcripts: plain code that matches words by phonetic sound, a size-ladder of local AI models (each prompted with the roster), and a frontier cloud model as a comparison.

One honest limit on the comparison: every model got the same prompt and I changed only the model. That's a fair head-to-head, not proof a sharper prompt couldn't push one further.
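To make "plain code that matches words by phonetic sound" concrete, here is a minimal sketch. The write-up does not name the phonetic algorithm, so this assumes a Soundex-style consonant code that skips vowels and does not keep the first letter; the function names `phonetic_code` and `find_name_errors` are hypothetical, and the real detector may work differently.

```python
import re

# Soundex consonant groups: letters in the same group sound alike.
_GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
_DIGIT = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def phonetic_code(word: str) -> str:
    """Soundex-style code built from consonants only (assumed scheme)."""
    code = []
    prev = ""
    for ch in word.lower():
        if ch in "hw":  # as in Soundex, h and w do not break a run of duplicates
            continue
        digit = _DIGIT.get(ch, "")  # vowels, y, and punctuation map to ""
        if digit and digit != prev:
            code.append(digit)
        prev = digit
    return "".join(code)


def find_name_errors(text: str, roster: list[str]) -> list[tuple[str, str]]:
    """Flag words that sound like a roster name but are spelled differently."""
    # If two roster names share a code, the later one wins; fine for a sketch.
    by_code = {phonetic_code(name): name for name in roster}
    flags = []
    for word in re.findall(r"[A-Za-z']+", text):
        code = phonetic_code(word)
        name = by_code.get(code) if code else None
        if name and word != name:
            flags.append((word, name))
    return flags


print(find_name_errors("Then Artie and Earthy went home", ["Arti"]))
# [('Artie', 'Arti'), ('Earthy', 'Arti')]
```

Under this assumed code, "Arti", "Artie", "Arthie", "Earthy", and "Arty" all reduce to the same value, while a name with different consonants does not.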

Result

Scored against the roughly 100 errors I'd flagged across 5 sessions (1,382 segments of transcript), the detectors split sharply by what they could actually do on the device.

Detector | Runs on device? | Result
Phonetic matcher (plain code) | yes | Caught all but one name error [1], and 96% of its flags were right (99% recall, 96% precision)
Gemma, ~2B | yes | Crashed on the longer stories
Gemma, ~4B | yes | Stable, but over-flagged ordinary words
Qwen, ~8B | yes | Crashed on the long, name-dense story
Gemma, ~12B | yes | First to reason through a hard case, but buried it in false flags
Gemma, ~26B | no (out of memory) | Won't run on the device
Opus (cloud) | no (privacy) | Reasoned out every case except one [3], but breaks the local-only constraint
Gemma, Qwen, and Opus are AI models; the numbers (~2B to ~26B) are their sizes in billions of parameters. Bigger is more capable but heavier, and the ~26B wouldn't fit on the 16GB M1 MacBook Pro this ran on. Recall is the share of real errors caught; precision is the share of flags that were right.

So the phonetic matcher is what ships. As a final check, I ran it on 2 new sessions it had never seen: 745 segments of fresh transcript. It caught all 10 name errors a careful listen could find, with a single false positive [2] (100% recall, 91% precision). A very different name might not match as cleanly, but for a tool only I use, that's reliable enough.
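The held-out numbers follow directly from the counts: 10 real errors, all caught, plus 1 false positive. A small sketch of that arithmetic, assuming flags and ground truth are compared as sets of identifiers (the `score` function and the placeholder ids are hypothetical, not the project's actual scorer):

```python
def score(flagged: set[str], truth: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for a set of flags against ground truth."""
    true_positives = len(flagged & truth)
    recall = true_positives / len(truth) if truth else 1.0
    precision = true_positives / len(flagged) if flagged else 1.0
    return recall, precision


# Placeholder ids standing in for the 10 real errors and the 1 false alarm.
truth = {f"error-{i}" for i in range(10)}
flagged = truth | {"false-alarm"}

recall, precision = score(flagged, truth)
print(f"{recall:.0%} recall, {precision:.0%} precision")
# 100% recall, 91% precision
```

Precision is 10 correct flags out of 11 total, which rounds to 91%.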

The choice has honest costs, and they live in 3 outliers:

[1] Swapped for another name: Arti became Ricky. No sounds in common, so matching by sound is blind to it. Only the cloud model caught it, by reasoning from context that a parent was praising her.
[2] Sound-alike false alarms: "weird" and "where'd" flagged as her name. Correctly transcribed words that share her name's phonetic code. They're lowercase mid-sentence where real names are capitalized, so a one-line capitalization check, added after this test, now screens them out.
[3] Dissolved into ordinary words: Arti became "are these". Nothing name-shaped is left to flag. Every method missed it, cloud included.

That last case is the one nothing text-based will ever catch: once the name comes out as ordinary words that fit in context, the transcript holds no trace of the error.
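For the sound-alike false alarms, the capitalization screen can be very small. This is a sketch under assumptions: flags are taken to be (word, roster name) pairs, and `screen_flags` is a hypothetical name, not the project's actual function.

```python
def screen_flags(flags: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only flagged words that start with an uppercase letter."""
    return [(word, name) for word, name in flags if word[:1].isupper()]


flags = [("weird", "Arti"), ("where'd", "Arti"), ("Artie", "Arti")]
print(screen_flags(flags))
# [('Artie', 'Arti')]
```

One limit of a check like this: a sound-alike word that opens a sentence is capitalized too, so it would still pass the screen.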

The next step is to give the model the audio and let it listen, not just read.

The real payoff isn't the detector; it's the foundation the evaluation leaves. Once you've found the failure modes, sorted and measured them, and built a way to detect them, you can fix the problem on solid ground and keep catching it as the system grows.


The code behind this (the detector, its scorer, and the audio checks) is on GitHub. Written with AI assistance, reviewed and edited by me.