Saurin Choksi

Tell Me A Story

A local-first system for capturing bedtime stories.

View on GitHub →

Origin

My daughter, Arti, shouts "Tell me a story!" at me a thousand times a day. Then at bedtime, I'm commanded to tell three more before she'll go to sleep. Folks, it's the end of the night. My brain is melted. I love my four-year-old, but telling "one more story" feels like its own tiny torture.

Still, when it's done, I'm always glad we share this ritual. And sometimes, I'm impressed by what we create together.

Pappy, the boy who looks like whatever food he wants to eat, usually a quesadilla. Too good, Arti! A perfectly executed Hero's Journey arc about a talking fork. Who knew it could be done?

Those bedtime stories are told and gone. I decided I wanted to keep them.

Why and for what, I'm not sure yet. One side of my brain complains, "Do we really have to digitize everything?" But a little voice insists, "Build the thing. Do it your way. Keep these stories. See what comes next when we get there." Ok...

The System

🎙 Capture — Voice Memos → ESP32 Device

Right now, capture is voice memos on a phone. It works, but a phone doesn't belong in that calm, quiet bedtime space.

When I lay my head down on Arti's silly Elmo chair, I want to keep it dark and calm. The plan is an ESP32 device — screenless, dark-operable, tap once and it just works.

⚙️ Pipeline — Building

Messy audio in, structured transcript out.

Everything runs on-device for family privacy.

Currently running on Apple Silicon. Next phase is Jetson Orin Nano/CUDA architecture for always-on edge processing.

🖥 UI — Planned

Right now there's a validation player for reviewing transcripts against audio, but that's a dev tool.

I know that I want to be able to "see" the story. What that means, I'll learn after I've processed enough sessions to decide where to go.

The Pipeline

Messy audio in, structured transcript out. Local processing.

Capture — Audio in, session out

Inbox Scan

Finds new audio, deduplicates by content hash

process_inbox.py
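The dedup idea can be sketched in a few lines. This is a minimal sketch, not the project's actual process_inbox.py: `content_hash` and `scan_inbox` are hypothetical names, and the `.m4a` glob and `seen` set are assumptions about how processed sessions are tracked.

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes, so a renamed copy still dedupes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_inbox(inbox: Path, seen: set[str]) -> list[Path]:
    """Return audio files whose content hasn't been processed yet."""
    fresh = []
    for path in sorted(inbox.glob("*.m4a")):
        digest = content_hash(path)
        if digest not in seen:
            seen.add(digest)
            fresh.append(path)
    return fresh
```

Hashing content rather than trusting filenames means the same voice memo exported twice from a phone only ever starts one session.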

Init Session

Creates timestamped directory, moves audio into place

init_session.py
Audio Models — GPU · runs once per session

Transcribe

MLX Whisper large-v3 — words, timestamps, per-word probability

transcribe.py
transcript-raw.json

Diarize

pyannote / torch — speaker segments with start, end, label

diarize.py
diarization.json
Core Enrichment — Content-agnostic · pure data

Speaker Labels

Aligns diarization segments to Whisper words — every word gets a speaker and coverage score

speaker.py
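The alignment step can be sketched as pure interval math. A minimal sketch, assuming Whisper words and pyannote segments arrive as simple dicts with start/end times; `label_words` and its coverage formula are illustrative, not the actual speaker.py.

```python
def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def label_words(words: list[dict], segments: list[dict]) -> list[dict]:
    """Assign each Whisper word the diarization speaker that overlaps
    it most, plus a coverage score: the fraction of the word's duration
    covered by that speaker's segment."""
    out = []
    for w in words:
        best_label, best_ov = None, 0.0
        for seg in segments:
            ov = overlap(w["start"], w["end"], seg["start"], seg["end"])
            if ov > best_ov:
                best_label, best_ov = seg["label"], ov
        dur = max(w["end"] - w["start"], 1e-6)  # guard zero-length words
        out.append({**w, "speaker": best_label, "coverage": best_ov / dur})
    return out
```

A coverage near 1.0 means the word sits squarely inside a speaker block; a coverage near 0.0 is a word nobody seems to have spoken, which matters again later for hallucination marking.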

Gap Detection

Injects [unintelligible] markers where a speaker was detected but Whisper produced no words

speaker.py
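Gap detection is the inverse question: which diarization segments contain no Whisper words at all? A sketch under assumptions — `find_gaps` and the minimum-length threshold are hypothetical, not the project's tuned logic.

```python
def find_gaps(words: list[dict], segments: list[dict],
              min_len: float = 0.5) -> list[dict]:
    """Diarization segments where a speaker was detected but no
    Whisper words landed — candidates for [unintelligible] markers."""
    gaps = []
    for seg in segments:
        # A segment is covered if any word interval intersects it.
        covered = any(
            w["start"] < seg["end"] and w["end"] > seg["start"]
            for w in words
        )
        if not covered and seg["end"] - seg["start"] >= min_len:
            gaps.append({
                "speaker": seg["label"],
                "start": seg["start"],
                "end": seg["end"],
                "text": "[unintelligible]",
            })
    return gaps
```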
Content Layer — Mahabharata today · swappable

LLM Norm

Local model corrects phonetic mishearings — "you this there" → Yudhishthira

normalize.py

Dict Norm

56-entry reference library standardizes variant spellings to canonical Sanskrit names

dictionary.py
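The dictionary pass is a straightforward lookup. A minimal sketch, not the actual dictionary.py: the three entries stand in for the 56-entry library, and `dict_normalize` is a hypothetical name. Legitimate aliases like "Partha" are deliberately absent from the table, so they pass through untouched.

```python
# Illustrative excerpt: variant spellings map to one canonical name.
CANONICAL = {
    "duryodhan": "Duryodhana",
    "yudhisthir": "Yudhishthira",
    "pondavas": "Pandavas",
}

def dict_normalize(text: str) -> tuple[str, int]:
    """Replace variant spellings with canonical Sanskrit names;
    return the normalized text and a correction count."""
    corrections = 0
    out = []
    for token in text.split():
        # Strip trailing punctuation so "Duryodhan," still matches.
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        canon = CANONICAL.get(core.lower())
        if canon and core != canon:
            out.append(canon + tail)
            corrections += 1
        else:
            out.append(token)
    return " ".join(out), corrections
```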
transcript-rich.json — Speaker labels, gaps, corrections, normalized names

On the Horizon

Hallucination Marking

Write confidence scores into the transcript from diarization coverage and word probability. The filter predicates exist — the pipeline stage doesn't yet.

LLM Speaker Correction

Use a local model to fix speaker misassignments using conversational context. "Papa, who is Arjuna?" is obviously the child.

Story Element Extraction

Pull characters, events, and relationships from finished transcripts. Needs more sessions before patterns emerge.

[Audio sample — 0:25]

The Mahabharata

The test recordings are of me telling Arti stories from the Mahabharata, the ancient Indian epic I loved as a kid. It's crammed with Sanskrit names that make speech recognition systems sweat. Let's see what the models make of the Pandavas and Kauravas.

Whisper runs locally on Apple Silicon. It produces segments with word-level timestamps, which is what makes speaker alignment possible later. It handles Dad fine. But it doesn't know Sanskrit, and a four-year-old's pronunciation doesn't help.

seg 0 Okay, what is the question?
seg 1 Dad, why do the fondos and the goros want to be king?
seg 2 Why do the fondos and the goros want to be king?
seg 3 Uh-huh.
seg 4 Well...

The full transcript goes to a local language model with a simple instruction: find the Sanskrit names that Whisper mangled. The full transcript is important. "fondos" only maps to "Pandavas" if the model can see the surrounding conversation. Tested segment-by-segment, the model produced false positives like "dad" → "Pandu."

The prompt: This text is from a conversation about the Mahabharata epic...

What it found:

fondos → Pandavas ×7
goros → Kauravas ×4
Yudister → Yudhishthira ×3
Fondo → Pandu ×3
dhrashtra → Dhritarashtra ×1

18 corrections

The LLM catches the wild phonetic misses. The dictionary catches what the LLM gets close but not canonical. Sanskrit has multiple valid transliterations: "Duryodhan" and "Duryodhana" are both real spellings, but the dictionary standardizes to one. It also knows to leave legitimate aliases alone: "Partha" is a real name for Arjuna, not a misspelling.

Duryodhan → Duryodhana ×8
Yudhisthir → Yudhishthira ×6
Pondavas → Pandavas ×1

15 corrections

Pyannote listens to the audio and maps out who is speaking when — not words, just stretches of time labeled by voice. Then our code takes each timestamped word from Whisper and drops it into the matching speaker block.

Each word's timestamp lands it inside a speaker block — that's how it gets labeled. Pyannote doesn't know names, just SPEAKER_00 and SPEAKER_01. Name mapping happens later, by a human. Sometimes pyannote hears a voice but Whisper can't decode the words, especially when Arti is getting sleepy and her voice drops to a murmur. The pipeline marks these moments [unintelligible] rather than pretending nobody spoke.

Pipeline Output
SPEAKER_00
Okay, what is the question?
SPEAKER_01
Dad, why do the Pandavas and the Kauravas want to be king?
SPEAKER_00
Why do the Pandavas and the Kauravas want to be king?
SPEAKER_01
Uh-huh.
SPEAKER_00
Well...
· · ·

Fabricated speech

Whisper sometimes invents words during silence. Two independent systems disagree — that's the signal.

Segment 13
Whisper: "Right." — probability 0.087
Diarization: no speaker detected, coverage 0.0
→ Flagged as hallucination
Segments 4–5 — the subtle case
"Well." — probability 0.993, real speech
"Well." — probability 0.133, fabricated
→ Two consecutive identical words. One real, one not.
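The page says the filter predicates exist even though the pipeline stage doesn't yet; the disagreement test can be sketched like this. The thresholds are illustrative assumptions, not the project's tuned values, and `flag_hallucinations` is a hypothetical name.

```python
def flag_hallucinations(words: list[dict],
                        prob_floor: float = 0.2,
                        coverage_floor: float = 0.2) -> list[dict]:
    """Mark a word as suspect only when BOTH systems doubt it:
    Whisper's own word probability is low AND diarization found
    no voice at that moment (low coverage)."""
    flagged = []
    for w in words:
        suspect = (w["probability"] < prob_floor
                   and w["coverage"] < coverage_floor)
        flagged.append({**w, "hallucinated": suspect})
    return flagged
```

Requiring both signals is what keeps the subtle case honest: a high-probability "Well." with real diarization coverage survives, while its low-probability twin during silence gets flagged.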

Inaudible child speech

Diarization detects Arti's voice at three points where Whisper produces nothing. The pipeline marks them honestly rather than dropping them.

Gap at 4:01
Diarization: SPEAKER_01 (Arti), 241.68s–242.83s
Whisper: [no words produced]
→ Marked [unintelligible]

Model size is existential

Whisper's tiny model produces absolute silence where Arti speaks. The large model recovers her voice. This isn't an optimization — it's whether we capture half the conversation.

Same audio, different models
tiny: [silence]
large: "Dad, why do the fondos and the goros want to be king?"

AI & Kids

Once I have transcribed stories, the generative AI applications seem easy and obvious. Extract recurring themes and characters using a sprinkle of local model intelligence? (Seems fine.) Generate Nano Banana illustrations of characters in the style of famed Pixar illustrator Sanjay Patel? (Um... not cool.) Build an ElevenLabs-powered penguin companion stuffy that tells stories in a voice that sounds exactly like Daddy's? (OH GOD. WHAT HAVE I DONE.)

Models were trained on creatives' work without permission or payment, but the tech is here. It's not disappearing. Our kids will grow up with it. What should its place be in their lives? How do I approach building on such a fraught foundation? (Meanwhile, Choksi, you use AI for coding every day... what about that IP? Hypocrite!)

What does it do to a kid when their thoughts skip straight to a generated image? Isn't the whole point to have those ideas live in your head, and then if you decide to put in the effort, pick up a crayon and be delighted with what your hands can make? What happens when a four-year-old forms a relationship with something that talks back to her whenever she wants, optimized to build attachment, before her brain is fully cooked?

So where do I go once the audio → transcripts system is built? Not sure. I know the point isn't to outsource storytelling, creativity, or imagination to machines. It's to understand our voices as storytellers. And have fun.

I do believe this project will help me clarify my own red lines around AI, technology, and kids. As I work on it, I'm talking to child development experts, artists, and writers across my networks in children's media, animation, and edtech to sharpen my thinking.

Guiding Principles

Immediate Connection

Removing friction brings creative choices you could not have imagined before.

Hard Fun

The best learning happens when you're building something you actually want.

Calm Technology

Technology can live on the periphery, there when you need it, and otherwise invisible.

The Arti Test

(Me)

I only work on things for kids that I'd give my child.

Learning

Part of why I am excited about this project: I get to learn things I actually want to learn.

  • AI-assisted development (Claude Code)
  • Audio ML pipelines (Whisper, pyannote)
  • IoT capture devices (ESP32)
  • Edge ML deployment (Jetson, CUDA)
  • Local-first architecture

When I realized this bedtime story project gave me a reason to explore NVIDIA's CUDA/Jetson stack, I was unreasonably happy. Robot-brain tech! Positronic! That's exciting. And I do want to have a voice in the rooms where these things that will interact with us — our families, our kids — get built.