Tell Me A Story
A local-first system for capturing bedtime stories.
View on GitHub →
Origin
My daughter, Arti, shouts "Tell me a story!" at me 1,000 times a day. Then at bedtime, I'm commanded to tell three more before she'll go to sleep. Folks, it's the end of the night. My brain's melted. I love my four-year-old, but telling "one more story" feels like a tiny kind of torture.
Still, when it's done, I'm always glad we share this ritual. And sometimes, I'm impressed by what we create together.
Pappy, the boy who looks like whatever food he wants to eat, usually a quesadilla. Too good, Arti! A perfectly executed Hero's Journey arc about a talking fork. Who knew it could be done?
Those bedtime stories are told and gone. I decided I wanted to keep them.
Why and for what, I'm not sure yet. One side of my brain complains, "Do we really have to digitize everything?" But a little voice insists, "Build the thing. Do it your way. Keep these stories. See what comes next when we get there." Ok...
The System
Right now, capture is voice memos on a phone. It works, but a phone doesn't belong in that calm, quiet bedtime space.
When I lay my head down on Arti's silly Elmo chair, I want to keep it dark and calm. The plan is an ESP32 device — screenless, dark-operable, tap once and it just works.
Messy audio in, structured transcript out.
Everything runs on-device for family privacy.
Currently running on Apple Silicon. Next phase is Jetson Orin Nano/CUDA architecture for always-on edge processing.
Right now there's a validation player for reviewing transcripts against audio, but that's a dev tool.
I know that I want to be able to "see" the story. What that means, I'll learn after I've processed enough sessions to decide where to go.
The Pipeline
Messy audio in, structured transcript out. Local processing.
Inbox Scan
Finds new audio, deduplicates by content hash
process_inbox.py
Init Session
Creates timestamped directory, moves audio into place
init_session.py
Transcribe
MLX Whisper large-v3 — words, timestamps, per-word probability
transcribe.py
Diarize
pyannote / torch — speaker segments with start, end, label
diarize.py
Speaker Labels
Aligns diarization segments to Whisper words — every word gets a speaker and coverage score
speaker.py
Gap Detection
Injects [unintelligible] markers where a speaker was detected but Whisper produced no words
LLM Norm
Local model corrects phonetic mishearings — "you this there" → Yudhishthira
normalize.py
Dict Norm
56-entry reference library standardizes variant spellings to canonical Sanskrit names
dictionary.py
Hallucination Marking
Write confidence scores into the transcript from diarization coverage and word probability. The filter predicates exist — the pipeline stage doesn't yet.
LLM Speaker Correction
Use a local model to fix speaker misassignments using conversational context. "Papa, who is Arjuna?" is obviously the child.
Story Element Extraction
Pull characters, events, and relationships from finished transcripts. Needs more sessions before patterns emerge.
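The inbox-scan stage at the top of the pipeline deduplicates by content hash, so re-copying the same voice memo never creates a duplicate session. A minimal sketch of that idea — the function names and the `.m4a` glob are illustrative, not the project's actual process_inbox.py:

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file's bytes, read in chunks so large audio is fine."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def new_recordings(inbox: Path, seen: set[str]) -> list[Path]:
    """Return audio files whose content hash hasn't been processed yet."""
    fresh = []
    for path in sorted(inbox.glob("*.m4a")):
        digest = content_hash(path)
        if digest not in seen:
            seen.add(digest)
            fresh.append(path)
    return fresh
```

Hashing content rather than filenames means a renamed copy of an already-processed memo is still recognized as a duplicate.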
The Mahabharata
The test recordings are of me telling Arti stories from the Mahabharata, the ancient Indian epic I loved as a kid. It's crammed with Sanskrit names that make speech recognition systems sweat. Let's see what the models make of the Pandavas and Kauravas.
Whisper runs locally on Apple Silicon. It produces segments with word-level timestamps, which is what makes speaker alignment possible later. It handles Dad fine. But it doesn't know Sanskrit, and a four-year-old's pronunciation doesn't help.
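The word-level timestamps are what everything downstream hangs off of. A sketch of flattening a Whisper-style result into per-word records — the nested dict shape (`segments` → `words` with `word`, `start`, `end`, `probability`) is an assumption about the transcriber's output, so field names may differ in practice:

```python
def flatten_words(result: dict) -> list[dict]:
    """Flatten Whisper segments into one list of word records:
    text, start, end, and the model's per-word probability."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "text": w["word"].strip(),
                "start": w["start"],
                "end": w["end"],
                "probability": w.get("probability", 1.0),
            })
    return words
```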
The full transcript goes to a local language model with a simple instruction: find the Sanskrit names that Whisper mangled. The full transcript is important. "fondos" only maps to "Pandavas" if the model can see the surrounding conversation. Tested segment-by-segment, the model produced false positives like "dad" → "Pandu."
The prompt: This text is from a conversation about the Mahabharata epic...
The LLM catches the wild phonetic misses. The dictionary catches what the LLM gets close but not canonical. Sanskrit has multiple valid transliterations: "Duryodhan" and "Duryodhana" are both real spellings, but the dictionary standardizes to one. It also knows to leave legitimate aliases alone: "Partha" is a real name for Arjuna, not a misspelling.
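The dictionary pass is mechanical: look up a lowercased core word, substitute the canonical form, and leave everything else — including legitimate aliases — alone. A sketch of that idea, with illustrative entries rather than the project's 56-entry library:

```python
import re

# Variant spelling -> canonical form. Legitimate aliases like
# "Partha" simply aren't in the table, so they pass through untouched.
CANONICAL = {
    "duryodhan": "Duryodhana",
    "yudhistira": "Yudhishthira",
}


def normalize_word(word: str) -> str:
    """Swap a variant spelling for its canonical form,
    preserving surrounding punctuation."""
    m = re.match(r"^(\W*)(\w+)(\W*)$", word)
    if not m:
        return word
    pre, core, post = m.groups()
    canon = CANONICAL.get(core.lower())
    return f"{pre}{canon}{post}" if canon else word
```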
Pyannote listens to the audio and maps out who is speaking when — not words, just stretches of time labeled by voice. Then our code takes each timestamped word from Whisper and drops it into the matching speaker block: the word's timestamp lands it inside a block, and that's how it gets labeled. Pyannote doesn't know names, just SPEAKER_00 and SPEAKER_01; name mapping happens later, by a human. Sometimes pyannote hears a voice but Whisper can't decode the words, especially when Arti is getting sleepy and her voice drops to a murmur. The pipeline marks these moments [unintelligible] rather than pretending nobody spoke.
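Both halves of that — labeling words and marking gaps — can be sketched in a few lines. The record shapes and the 0.5-second minimum-gap threshold are assumptions for illustration, not the project's speaker.py:

```python
def label_words(words: list[dict], turns: list[dict]) -> list[dict]:
    """Assign each timestamped word the speaker turn its midpoint falls in.
    words: [{"text", "start", "end"}]; turns: [{"speaker", "start", "end"}]."""
    labeled = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= mid < t["end"]),
            None,
        )
        labeled.append({**w, "speaker": speaker})
    return labeled


def mark_gaps(turns: list[dict], words: list[dict],
              min_gap: float = 0.5) -> list[dict]:
    """Speaker turns containing no Whisper words become
    [unintelligible] markers instead of silently disappearing."""
    markers = []
    for t in turns:
        covered = any(t["start"] <= (w["start"] + w["end"]) / 2 < t["end"]
                      for w in words)
        if not covered and t["end"] - t["start"] >= min_gap:
            markers.append({**t, "text": "[unintelligible]"})
    return markers
```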
Fabricated speech
Whisper sometimes invents words during silence. Two independent systems disagree — that's the signal.
Diarization: no speaker detected, coverage 0.0
→ Flagged as hallucination
"Well." — probability 0.133, fabricated
→ Two consecutive identical words. One real, one not.
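The filter predicates exist in the codebase; this is a sketch of the idea rather than their actual code, with the 0.2 thresholds as assumptions:

```python
def looks_fabricated(word: dict,
                     coverage_threshold: float = 0.2,
                     prob_threshold: float = 0.2) -> bool:
    """Flag a word when two independent systems disagree: diarization
    saw no speaker there (low coverage) and Whisper itself was unsure."""
    return (word["coverage"] < coverage_threshold
            and word["probability"] < prob_threshold)


def duplicate_suspects(words: list[dict]) -> list[bool]:
    """Consecutive identical words where the repeat has zero diarization
    coverage -- a common Whisper repetition artifact."""
    flags = [prev["text"] == cur["text"] and cur["coverage"] == 0.0
             for prev, cur in zip(words, words[1:])]
    return [False] + flags  # the first word can't be a repeat
```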
Inaudible child speech
Diarization detects Arti's voice at three points where Whisper produces nothing. The pipeline marks them honestly rather than dropping them.
Whisper: [no words produced]
→ Marked [unintelligible]
Model size is existential
Whisper's tiny model produces absolute silence where Arti speaks. The large model recovers her voice. This isn't an optimization — it's the difference between capturing her half of the conversation and losing it entirely.
large: "Dad, why do the fondos and the goros want to be king?"
AI & Kids
Once I have transcribed stories, the generative AI applications seem easy and obvious. Extract recurring themes and characters using a sprinkle of local model intelligence? (Seems fine) Generate Nano Banana illustrations of characters in the style of famed Pixar illustrator Sanjay Patel? (um.. not cool) Build an ElevenLabs-powered penguin companion stuffy that tells stories with a voice that sounds exactly like Daddy? (OH GOD. WHAT HAVE I DONE)
Models were trained on creatives' work without permission or compensation, but the tech is here. It's not disappearing. Our kids will grow up with it. What should its place be in their lives? How do I approach building on such a fraught foundation? (Meanwhile, Choksi, you use AI for coding every day... what about that IP? Hypocrite!)
What does it do to a kid when their thoughts skip straight to a generated image? Isn't the whole point to have those ideas live in your head, and then if you decide to put in the effort, pick up a crayon and be delighted with what your hands can make? What happens when a four-year-old forms a relationship with something that talks back to her whenever she wants, optimized to build attachment, before her brain is fully cooked?
So where do I go once the audio → transcripts system is built? Not sure. I know the point isn't to outsource storytelling, creativity, or imagination to machines. It's to understand our voices as storytellers. And have fun.
I do believe this project will help me clarify my own red lines about AI/technology and kids. As I work on this, I'm talking to child development experts, artists, and writers from my networks in children's media, animation, writing, and edtech to help sharpen my thinking.
Guiding Principles
Immediate Connection
Removing friction brings creative choices you could not have imagined before.
Hard Fun
The best learning happens when you're building something you actually want.
Calm Technology
Technology can live on the periphery: there when you need it, invisible otherwise.
The Arti Test
I only work on things for kids that I'd give my child.
Learning
Part of why I am excited about this project: I get to learn things I actually want to learn.
- AI-assisted development (Claude Code)
- Audio ML pipelines (Whisper, pyannote)
- IoT capture devices (ESP32)
- Edge ML deployment (Jetson, CUDA)
- Local-first architecture
When I realized this bedtime story project gave me a reason to explore NVIDIA's CUDA/Jetson stack, I was unreasonably happy. Robot-brain tech! Positronic! That's exciting. And I do want to have a voice in the rooms where these things that will interact with us — our families, our kids — get built.