Saurin Choksi

Tell Me A Story

A local-first system for capturing bedtime stories.

View on GitHub →

Origin

My daughter, Arti, shouts "Tell me a story!" at me a thousand times a day. Then at bedtime, I'm commanded to tell three more before she'll go to sleep. Folks, it's the end of the night. My brain is melted. I love my four-year-old, but telling "one more story" feels like its own tiny torture.

Still, when it's done, I'm always glad we share this ritual. And sometimes, I'm impressed by what we create together.

Pappy, the boy who looks like whatever food he wants to eat, usually a quesadilla. Too good, Arti! A perfectly executed Hero's Journey arc about a talking fork. Who knew it could be done?

Those bedtime stories are told and gone. I decided I wanted to keep them.

Why and for what, I'm not sure yet. One side of my brain complains, "Do we really have to digitize everything?" But a little voice insists, "Build the thing. Do it your way. Keep these stories. See what comes next when we get there." Ok...

The System

🎙 Capture — Voice Memos → ESP32 Device

Right now, capture is voice memos on a phone. It works, but a phone doesn't belong in that calm, quiet bedtime space.

When I lay my head down on Arti's silly Elmo chair, I want to keep it dark and calm. The plan is an ESP32 device — screenless, dark-operable, tap once and it just works.

⚙️ Pipeline — Building

Messy audio in, structured transcript out.

Everything runs on-device for family privacy.

Currently running on Apple Silicon. Next phase is Jetson Orin Nano/CUDA architecture for always-on edge processing.

🖥 UI — Planned

Right now there's a validation player for reviewing transcripts against audio, but that's a dev tool.

I know that I want to be able to "see" the story. What that means, I'll learn after I've processed enough sessions to decide where to go.

The Pipeline

Messy audio in, structured transcript out. Local processing.

Capture — Audio in, session out

Inbox Scan

Finds new audio, deduplicates by content hash

process_inbox.py
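The dedup idea can be sketched in a few lines. This is a minimal sketch, not the project's actual process_inbox.py: `content_hash` and `scan_inbox` are hypothetical names, and the `.m4a` glob and `seen` set are assumptions about how processed sessions are tracked.

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes, so a renamed copy still dedupes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_inbox(inbox: Path, seen: set[str]) -> list[Path]:
    """Return audio files whose content hasn't been processed yet."""
    fresh = []
    for path in sorted(inbox.glob("*.m4a")):
        digest = content_hash(path)
        if digest not in seen:
            seen.add(digest)
            fresh.append(path)
    return fresh
```

Hashing content rather than trusting filenames means the same voice memo exported twice from a phone only ever starts one session.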

Init Session

Creates timestamped directory, moves audio into place

init_session.py
Audio Models — GPU · runs once per session

Transcribe

MLX Whisper large-v3 — words, timestamps, per-word probability

transcribe.py
transcript-raw.json

Diarize

pyannote / torch — speaker segments with start, end, label

diarize.py
diarization.json
Core Enrichment — Content-agnostic · pure data

Speaker Labels

Aligns diarization segments to Whisper words — every word gets a speaker and coverage score

speaker.py
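The alignment step can be sketched as pure interval math. A minimal sketch, assuming Whisper words and pyannote segments arrive as simple dicts with start/end times; `label_words` and its coverage formula are illustrative, not the actual speaker.py.

```python
def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def label_words(words: list[dict], segments: list[dict]) -> list[dict]:
    """Assign each Whisper word the diarization speaker that overlaps
    it most, plus a coverage score: the fraction of the word's duration
    covered by that speaker's segment."""
    out = []
    for w in words:
        best_label, best_ov = None, 0.0
        for seg in segments:
            ov = overlap(w["start"], w["end"], seg["start"], seg["end"])
            if ov > best_ov:
                best_label, best_ov = seg["label"], ov
        dur = max(w["end"] - w["start"], 1e-6)  # guard zero-length words
        out.append({**w, "speaker": best_label, "coverage": best_ov / dur})
    return out
```

A coverage near 1.0 means the word sits squarely inside a speaker block; a coverage near 0.0 is a word nobody seems to have spoken, which matters again later for hallucination marking.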

Gap Detection

Injects [unintelligible] markers where a speaker was detected but Whisper produced no words

speaker.py
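Gap detection is the inverse question: which diarization segments contain no Whisper words at all? A sketch under assumptions — `find_gaps` and the minimum-length threshold are hypothetical, not the project's tuned logic.

```python
def find_gaps(words: list[dict], segments: list[dict],
              min_len: float = 0.5) -> list[dict]:
    """Diarization segments where a speaker was detected but no
    Whisper words landed — candidates for [unintelligible] markers."""
    gaps = []
    for seg in segments:
        # A segment is covered if any word interval intersects it.
        covered = any(
            w["start"] < seg["end"] and w["end"] > seg["start"]
            for w in words
        )
        if not covered and seg["end"] - seg["start"] >= min_len:
            gaps.append({
                "speaker": seg["label"],
                "start": seg["start"],
                "end": seg["end"],
                "text": "[unintelligible]",
            })
    return gaps
```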
Content Layer — Mahabharata today · swappable

LLM Norm

Local model corrects phonetic mishearings — "you this there" → Yudhishthira

normalize.py

Dict Norm

56-entry reference library standardizes variant spellings to canonical Sanskrit names

dictionary.py
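The dictionary pass is a straightforward lookup. A minimal sketch, not the actual dictionary.py: the three entries stand in for the 56-entry library, and `dict_normalize` is a hypothetical name. Legitimate aliases like "Partha" are deliberately absent from the table, so they pass through untouched.

```python
# Illustrative excerpt: variant spellings map to one canonical name.
CANONICAL = {
    "duryodhan": "Duryodhana",
    "yudhisthir": "Yudhishthira",
    "pondavas": "Pandavas",
}

def dict_normalize(text: str) -> tuple[str, int]:
    """Replace variant spellings with canonical Sanskrit names;
    return the normalized text and a correction count."""
    corrections = 0
    out = []
    for token in text.split():
        # Strip trailing punctuation so "Duryodhan," still matches.
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        canon = CANONICAL.get(core.lower())
        if canon and core != canon:
            out.append(canon + tail)
            corrections += 1
        else:
            out.append(token)
    return " ".join(out), corrections
```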
transcript-rich.json — Speaker labels, gaps, corrections, normalized names

On the Horizon

Hallucination Marking

Write confidence scores into the transcript from diarization coverage and word probability. The filter predicates exist — the pipeline stage doesn't yet.

LLM Speaker Correction

Use a local model to fix speaker misassignments using conversational context. "Papa, who is Arjuna?" is obviously the child.

Story Element Extraction

Pull characters, events, and relationships from finished transcripts. Needs more sessions before patterns emerge.

[Audio sample — 0:25]

The Mahabharata

The test recordings are of me telling Arti stories from the Mahabharata, the ancient Indian epic I loved as a kid. It's crammed with Sanskrit names that make speech recognition systems sweat. Let's see what the models make of the Pandavas and Kauravas.

Whisper runs locally on Apple Silicon. It produces segments with word-level timestamps, which is what makes speaker alignment possible later. It handles Dad fine. But it doesn't know Sanskrit, and a four-year-old's pronunciation doesn't help.

seg 0 Okay, what is the question?
seg 1 Dad, why do the fondos and the goros want to be king?
seg 2 Why do the fondos and the goros want to be king?
seg 3 Uh-huh.
seg 4 Well...

The full transcript goes to a local language model with a simple instruction: find the Sanskrit names that Whisper mangled. The full transcript is important. "fondos" only maps to "Pandavas" if the model can see the surrounding conversation. Tested segment-by-segment, the model produced false positives like "dad" → "Pandu."

The prompt: This text is from a conversation about the Mahabharata epic...

What it found:

fondos → Pandavas ×7
goros → Kauravas ×4
Yudister → Yudhishthira ×3
Fondo → Pandu ×3
dhrashtra → Dhritarashtra ×1

18 corrections

The LLM catches the wild phonetic misses. The dictionary catches what the LLM gets close but not canonical. Sanskrit has multiple valid transliterations: "Duryodhan" and "Duryodhana" are both real spellings, but the dictionary standardizes to one. It also knows to leave legitimate aliases alone: "Partha" is a real name for Arjuna, not a misspelling.

Duryodhan → Duryodhana ×8
Yudhisthir → Yudhishthira ×6
Pondavas → Pandavas ×1

15 corrections

Pyannote listens to the audio and maps out who is speaking when — not words, just stretches of time labeled by voice. Then our code takes each timestamped word from Whisper and drops it into the matching speaker block.

Each word's timestamp lands it inside a speaker block — that's how it gets labeled. Pyannote doesn't know names, just SPEAKER_00 and SPEAKER_01. Name mapping happens later, by a human. Sometimes pyannote hears a voice but Whisper can't decode the words, especially when Arti is getting sleepy and her voice drops to a murmur. The pipeline marks these moments [unintelligible] rather than pretending nobody spoke.

Pipeline Output
SPEAKER_00
Okay, what is the question?
SPEAKER_01
Dad, why do the Pandavas and the Kauravas want to be king?
SPEAKER_00
Why do the Pandavas and the Kauravas want to be king?
SPEAKER_01
Uh-huh.
SPEAKER_00
Well...
· · ·

Fabricated speech

Whisper sometimes invents words during silence. Two independent systems disagree — that's the signal.

Segment 13
Whisper: "Right." — probability 0.087
Diarization: no speaker detected, coverage 0.0
→ Flagged as hallucination
Segments 4–5 — the subtle case
"Well." — probability 0.993, real speech
"Well." — probability 0.133, fabricated
→ Two consecutive identical words. One real, one not.
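The page says the filter predicates exist even though the pipeline stage doesn't yet; the disagreement test can be sketched like this. The thresholds are illustrative assumptions, not the project's tuned values, and `flag_hallucinations` is a hypothetical name.

```python
def flag_hallucinations(words: list[dict],
                        prob_floor: float = 0.2,
                        coverage_floor: float = 0.2) -> list[dict]:
    """Mark a word as suspect only when BOTH systems doubt it:
    Whisper's own word probability is low AND diarization found
    no voice at that moment (low coverage)."""
    flagged = []
    for w in words:
        suspect = (w["probability"] < prob_floor
                   and w["coverage"] < coverage_floor)
        flagged.append({**w, "hallucinated": suspect})
    return flagged
```

Requiring both signals is what keeps the subtle case honest: a high-probability "Well." with real diarization coverage survives, while its low-probability twin during silence gets flagged.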

Inaudible child speech

Diarization detects Arti's voice at three points where Whisper produces nothing. The pipeline marks them honestly rather than dropping them.

Gap at 4:01
Diarization: SPEAKER_01 (Arti), 241.68s–242.83s
Whisper: [no words produced]
→ Marked [unintelligible]

Model size is existential

Whisper's tiny model produces absolute silence where Arti speaks. The large model recovers her voice. This isn't an optimization — it's whether we capture half the conversation.

Same audio, different models
tiny: [silence]
large: "Dad, why do the fondos and the goros want to be king?"

AI & Kids

Once I have transcribed stories, the generative AI applications seem easy and obvious. Extract recurring themes and characters using a sprinkle of local model intelligence? (Seems fine.) Generate Nano Banana illustrations of characters in the style of famed Pixar illustrator Sanjay Patel? (Um... not cool.) Build an ElevenLabs-powered penguin companion stuffy that tells stories in a voice that sounds exactly like Daddy's? (OH GOD. WHAT HAVE I DONE.)

Models were trained on creatives' work without permission or payment, but the tech is here. It's not disappearing. Our kids will grow up with it. What should its place be in their lives? How do I approach building on such a fraught foundation? (Meanwhile, Choksi, you use AI for coding every day... what about that IP? Hypocrite!)

What does it do to a kid when their thoughts skip straight to a generated image? Isn't the whole point to have those ideas live in your head, and then if you decide to put in the effort, pick up a crayon and be delighted with what your hands can make? What happens when a four-year-old forms a relationship with something that talks back to her whenever she wants, optimized to build attachment, before her brain is fully cooked?

So where do I go once the audio → transcripts system is built? Not sure. I know the point isn't to outsource storytelling, creativity, or imagination to machines. It's to understand our voices as storytellers. And have fun.

I do believe this project will help me clarify my own red lines around AI, technology, and kids. As I work on it, I'm talking to child development experts, artists, and writers across my networks in children's media, animation, and edtech to sharpen my thinking.

Guiding Principles

Immediate Connection

Removing friction brings creative choices you could not have imagined before.

Hard Fun

The best learning happens when you're building something you actually want.

Calm Technology

Technology can live on the periphery, there when you need it, and otherwise invisible.

The Arti Test

(Me)

I only work on things for kids that I'd give my child.

Learning

Part of why I am excited about this project: I get to learn things I actually want to learn.

  • AI-assisted development (Claude Code)
  • Audio ML pipelines (Whisper, pyannote)
  • IoT capture devices (ESP32)
  • Edge ML deployment (Jetson, CUDA)
  • Local-first architecture

When I realized this bedtime story project gave me a reason to explore NVIDIA's CUDA/Jetson stack, I was unreasonably happy. Robot-brain tech! Positronic! That's exciting. And I do want to have a voice in the rooms where these things that will interact with us — our families, our kids — get built.