tellmebaby — talk to your PC

act one · talk

Hold the key. Say the thing. Done.

Hold Ctrl + Win, talk, let go. Your words appear at the cursor — Slack, Outlook, VS Code, Notion, that one internal tool nobody else supports. Terminals, vim and remote desktops hate paste, so I type the words in for them instead, key by key. Either way: you talk, text lands.

draft — slack #design

act two · edit

Say the change. Watch it happen.

Select text anywhere in Windows, hold Alt + Win, and say how it should change — "make this sound more confident", "turn this into bullet points", "translate to French". Let go, and the rewrite streams into place, live, right where the selection was.

mail — reply to client

your selection

The report is late again. Send the corrected version today.

"make this friendlier" · Alt + Win · hold · speak · release

rewritten in place — streaming

Hi! Just checking in on the report — could you send the corrected version over today? Thanks so much!

Alt and Win are pure modifiers — the chord physically can't type a character, so it can never type over your selection

act three · read

I'll read it to you.

Select something — or just copy it — and press the hotkey. I read it aloud through your speakers in a voice that doesn't sound robotic. I synthesize sentence by sentence, so I start speaking almost before you finish pressing. Press Esc and I stop mid-word.
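For the curious, here is a minimal Python sketch of the sentence-by-sentence idea. It is an illustration, not tellmebaby's actual code: the splitter regex and the `synthesize`, `play`, and `stopped` callables are assumptions. The point is that playback can begin as soon as the first sentence is ready, instead of waiting for the whole text.

```python
import re
from typing import Callable, Iterator

# Assumed splitting rule: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(text: str) -> Iterator[str]:
    """Yield sentences one at a time so the first can be spoken right away."""
    for part in _SENTENCE_END.split(text.strip()):
        if part:
            yield part


def read_aloud(
    text: str,
    synthesize: Callable[[str], object],
    play: Callable[[object], None],
    stopped: Callable[[], bool],
) -> None:
    """Synthesize and play each sentence in turn, checking the stop flag between them."""
    for sentence in sentences(text):
        if stopped():
            break
        play(synthesize(sentence))
```

This sketch only checks the stop flag between sentences. Stopping mid-word, as described above, would also require interrupting the audio output itself.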

article — reading aloud

The quarterly review went better than anyone expected. Revenue is up, the team is growing, and the roadmap finally fits on one page. The next milestone is the spring release. Everything else is details.

31 languages · 10 voices · starts instantly · selection or clipboard

your words, verbatim

Tidied, never rewritten.

A little local AI sweeps up after you — the "um"s go, the stutters untangle, the punctuation lands. That's everything it's allowed to do. Every cleanup has to pass a gate before it can touch your text: drop a phrase, rewrite your phrasing, soften how hard you said something — rejected, and your raw words ship instead. I keep your words exactly as you said them. Including the spicy ones.

history — this morning, 9:42

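what you said · what changed

raw: so um here's where we landed: the onboarding flow ships Friday. i know that's tight, but the the hard parts are are already done. uh ping me if anything looks off.

tidied: So here's where we landed: the onboarding flow ships Friday. I know that's tight, but the hard parts are already done. ping me if anything looks off.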

gate passed: nothing dropped, nothing softened · raw transcript · kept

every transcript in History keeps both layers, word-level diff included — flip between them any time

× Drops a phrase → rejected
× Rewrites your phrasing → rejected
× Softens what you said → rejected
✓ Your raw words → always kept
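One way to picture the gate is as a comparison of content words. The Python sketch below is an assumption-laden illustration, not the shipped implementation: the filler list, the normalization rule, and the function names are all hypothetical. A cleanup passes only if the raw and tidied text are identical once fillers, immediate repeats, case, and punctuation are ignored. Otherwise the raw words are returned.

```python
import re

# Assumed filler list, for illustration only.
FILLERS = {"um", "uh", "er", "erm"}


def content_words(text: str) -> list[str]:
    """Lowercase words with fillers and immediate repeats (stutters) removed."""
    kept: list[str] = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word in FILLERS:
            continue
        if kept and kept[-1] == word:
            continue  # treat an immediate repeat as a stutter
        kept.append(word)
    return kept


def gate(raw: str, tidied: str) -> str:
    """Return the tidied text only if no content word was dropped, added, or changed."""
    if content_words(raw) == content_words(tidied):
        return tidied
    return raw  # rejected: the raw words ship instead
```

A check this strict rejects "This is bad." as a cleanup of "this is really bad", because a word went missing. The trade-off is that a deliberate repeat is treated like a stutter on both sides of the comparison.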

per-app styles

Same sentence. Dressed for the room.

Teach each app its own voice — buttoned-up in email, loose in chat, tidy markdown in your editor. Give an app a custom style and the AI gets real creative freedom. Anywhere you haven't set one, the default holds: your words, exactly as you said them, just tidied.

you say — "i'll get you the numbers by thursday morning"

Your mail app · professional

I'll have the numbers over to you by Thursday morning.

Your chat app · casual

gonna get you those numbers by thursday morning

Your notes app · tidy bullets

• Numbers: send by Thursday morning

Everywhere else · verbatim · default

I'll get you the numbers by Thursday morning.

your words, exactly, just tidied
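The fallback rule can be sketched as a plain lookup with a default. This is a hypothetical illustration in Python: the executable names, the style labels as dictionary values, and the `style_for` helper are assumptions, not tellmebaby's real settings format.

```python
DEFAULT_STYLE = "verbatim"

# Hypothetical per-app settings, keyed by executable name.
APP_STYLES = {
    "outlook.exe": "professional",
    "slack.exe": "casual",
    "notion.exe": "tidy bullets",
}


def style_for(process_name: str) -> str:
    """Return the custom style for an app, or the verbatim default if none is set."""
    return APP_STYLES.get(process_name.lower(), DEFAULT_STYLE)
```

Any app that is not in the table gets the verbatim default, which matches the "everywhere else" row above.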

the little things

Small features. Big difference daily.

None of these need a settings safari. They're just there, quietly making every dictation a little more yours.

Your personal dictionary

Names, jargon, that one client nobody can spell. Add a word once and I bias recognition toward it on every engine, forever. No retraining, no cloud roundtrip.

Snippets

Say a trigger phrase, get a block of text — your address, your signature, that prompt you paste eleven times a day. Spoken shorthand for anything you retype.

History you can search

Every dictation is kept — audio and text, raw and tidied — on your own disk. Search it, flip the diff, copy yesterday's paragraph again. Nothing auto-deletes.

Works offline

Pull the network cable mid-sentence; I don't notice. Planes, tunnels, cabins, hostile hotel Wi-Fi — every feature keeps working, because nothing needs a server.

Hotkeys that don't fight you

The hold-to-talk chords are pure modifiers — they physically can't type a character, so they can never destroy a selection. Esc stops everything, globally. Bindings survive lock and unlock.

Signed auto-updates

Releases are signed, downloads are SHA256-verified, and the updater refuses anything that doesn't check out. One click to install, and you're back to talking.

the brain is built in

A brain in the box. Yours.

Cleanup, Edit Mode, per-app styles — they all need a language model, so I ship one. It downloads itself once and then lives entirely on your machine. No account to create. No API key to paste. No cloud to trust.

Local, contained, boring on purpose.

The built-in brain listens only on 127.0.0.1 — your own machine talking to itself. It's contained so it can never outlive the app: close tellmebaby and the brain is gone with it. Your text goes in, better text comes out, and the network never hears about either.

// the built-in brain

● downloads itself — once, ~2.4 GB

● speaks only to 127.0.0.1

● contained — dies when the app dies

○ accounts, API keys, clouds — none
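Binding to 127.0.0.1 is what keeps a local server unreachable from other machines. The Python sketch below shows the general technique with the standard `socket` module. It is not the built-in brain's code, and `open_loopback_listener` is a hypothetical helper name.

```python
import socket


def open_loopback_listener(port: int = 0) -> socket.socket:
    """Open a TCP listener that only accepts connections from this machine."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Binding to 127.0.0.1 (not 0.0.0.0) means no other host can connect.
    # Port 0 asks the operating system to pick any free port.
    server.bind(("127.0.0.1", port))
    server.listen(1)
    return server
```

A listener bound to 0.0.0.0 would accept connections on every network interface. The loopback address restricts it to processes on the same machine.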

Your words, your machine.

No cloud accounts. No telemetry. No analytics beacons. The only network traffic is downloading models the first time — open-weight files you can delete and re-fetch any time — plus a small signed check for updates. Your voice never travels.

No microphone uploads · No transcript uploads · No analytics · No accounts · Works offline

The network log is short.

Recognition, cleanup and speech all run on your CPU or GPU from locally-stored model files. No microphone audio is ever sent over the network. We say "100% local" and we mean it — open Wireshark and watch nothing happen.

// network log

● initial model fetch — once

● update check — every 6h

○ microphone audio — never

○ transcript text — never

○ telemetry — never

Your data, in plain sight.

Recordings are ordinary .wav files. Transcripts live in a SQLite database any client can open. Settings are JSON. Everything sits in one folder you can browse, back up, or delete — deleting it is a complete reset. Nothing hidden, nothing "synced for your convenience."

// ~/.tellmebaby

recordings/ — your audio, as .wav

history.db — transcripts, searchable

config/ — settings, as JSON

logs/ — what happened, daily
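Because the transcript store is plain SQLite, any client can inspect it. Here is a small Python sketch using the standard `sqlite3` module. The table layout is not documented on this page, so the sketch only lists whatever tables exist rather than guessing at a schema. `list_tables` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path) -> list[str]:
    """Open a SQLite file read-only and return the names of its tables."""
    # mode=ro guarantees this script cannot modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

To try it on your own history, call `list_tables(Path.home() / ".tellmebaby" / "history.db")`.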

why you can actually trust this

The receipts.

"100% local" is easy to claim. Here's the technical reality so you can verify rather than take our word for it.

The audio path is auditable

The recognizer is sherpa-onnx, an open-source ONNX Runtime wrapper that runs INT8 model weights on your CPU. No code we write touches a network socket during transcription. Open Wireshark; watch nothing happen.

Updates are signed

Every release is signed with an ed25519 key (separate from Authenticode) and the public half is embedded in the app. The auto-updater refuses to install anything that doesn't verify against it. You can't be tricked into installing a malicious "update."

Verifiable downloads

SHA256 of every release is published right next to the download button. certutil -hashfile setup.exe SHA256 confirms what you got matches what we shipped — no MITM, no tampered binaries. Model downloads are pinned and SHA256-checked the same way.
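If you would rather script the check than run certutil, the same comparison can be sketched in Python with the standard `hashlib` module. The helper names below are hypothetical. The published value is whatever hex string appears next to the download button.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large installers need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex: str) -> bool:
    """Compare the file's SHA256 with the published hex digest."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)
```

For example, `matches_published("setup.exe", "<hash from the download page>")` should return `True` for an untampered installer.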

Open model licenses

The default models — Parakeet TDT and Whisper Large v3 Turbo — ship under permissive open licenses. You can fork tellmebaby, swap the models, redistribute. No vendor can pull the rug.

The brain stays home

The built-in language model binds to 127.0.0.1 only and runs inside a Windows Job Object, so it can never outlive the app or accept a connection from outside your machine. Your drafts never become anyone's training data.

Free without a moat

No paid tier planned. No telemetry to monetize. If maintenance ever shifts to paid, current installs keep working at the price they were installed at. The version you have now is the version you have forever.

for the nerds

SOTA models, on your CPU, no asterisks.

tellmebaby ships speech engines that pick themselves automatically based on the languages you speak. The defaults rank near the top of the public Hugging Face Open ASR leaderboard. All of them run in INT8 on a normal laptop CPU at multiples of real-time.

Model	WER ↓	Speed (× real-time) ↑	Languages	Size
Parakeet TDT 0.6B v2	~6.05%	~30×	English	660 MB
Whisper Large v3 Turbo	~7.4%	~4×	99 (multilingual)	1.0 GB
Parakeet TDT 0.6B v3	~6.3%	~30×	25 EU (incl. English)	660 MB
Cohere Transcribe 2B	~5.4%	~12×	14 (EU + zh/ja/ko/vi/ar)	1.7 GB
OmniLingual ASR 300M	varies	~6×	1600+ (low-resource fallback)	700 MB

WER on the Hugging Face Open ASR Leaderboard (avg across English benchmarks, lower is better). Speed is audio duration divided by decode time (higher is better), measured on an 8-core x86 CPU with INT8 quantized weights and the sherpa-onnx runtime. We default to Parakeet v2 + Whisper Turbo because that lineup held up better in real-world dictation (technical vocab, proper nouns, quick utterances); the others stay one click away in Settings → Speech & AI for users who want them.
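To make the speed figures concrete: a value of ~30× is read here as "30 seconds of audio decoded per second of compute". That is the inverse of the conventional real-time factor, which divides processing time by audio duration. The Python sketch below does the arithmetic under that reading. The function name is hypothetical and the inputs are the approximate values from the table.

```python
def decode_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated decode time for a clip, given a speed in multiples of real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# A ten-minute recording (600 s) at the table's approximate speeds:
parakeet_estimate = decode_seconds(600, 30)  # about 20 seconds
whisper_estimate = decode_seconds(600, 4)    # about 150 seconds
```

Actual timings depend on the CPU and the audio, so treat these as rough estimates.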

Nothing is silently lost.

Long recordings decode as silence-bounded windows, so a ten-minute ramble comes out whole. If a decode comes back blank, I retry down a ladder — drop the hotwords, split on silence, switch to a different engine architecture. And if a stretch still couldn't be decoded, the transcript says so. I'd rather admit a gap than invent words.

// decode ladder

1 — decode in silence-bounded windows

2 — blank? retry without hotwords

3 — still blank? split on silence, recurse

4 — last resort: a different engine

! gap remains → the transcript says so
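The ladder above can be sketched as a small recursive function. This Python version illustrates the control flow only and is not the real decoder: the `primary`, `fallback`, and `split_on_silence` callables, the `GAP` marker text, and the recursion limit are all assumptions.

```python
from typing import Callable, List, Sequence

Window = Sequence[float]
Decoder = Callable[[Window, bool], str]  # (samples, use_hotwords) -> text

GAP = "[undecoded audio]"  # hypothetical marker for a stretch that stayed blank


def decode_window(
    window: Window,
    primary: Decoder,
    fallback: Decoder,
    split_on_silence: Callable[[Window], List[Window]],
    depth: int = 0,
    max_depth: int = 3,
) -> str:
    # Rungs 1 and 2: try the primary engine with hotwords, then without.
    for use_hotwords in (True, False):
        text = primary(window, use_hotwords).strip()
        if text:
            return text
    # Rung 3: split on silence and run each piece down the ladder again.
    if depth < max_depth:
        pieces = split_on_silence(window)
        if len(pieces) > 1:
            return " ".join(
                decode_window(p, primary, fallback, split_on_silence, depth + 1, max_depth)
                for p in pieces
            )
    # Rung 4: a different engine. If it is also blank, admit the gap.
    text = fallback(window, False).strip()
    return text or GAP
```

The important property is the last line: a stretch that no rung can decode yields an explicit marker rather than silence or invented words.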

choreography

Five hotkeys. Zero menus.

Picked so they don't fight your browser, your editor, or Windows itself. Ctrl+Shift+Alt + a letter is essentially never claimed by anything else — so the chords stay yours.

Ctrl+Win · Dictate. Hold the chord, talk, release. Words appear at the cursor. Modifier-only so the Windows shell never steals it from us.

Alt+Win · Edit selection. Select text, hold and say what to change ("translate to French", "make this shorter"). Rewritten in place on release. The chord is modifier-only, so it can never type over your selection.

Ctrl+Shift+Alt+R · Read selection. Highlight first, then press. I read it through your speakers.

Ctrl+Shift+Alt+V · Read clipboard. Whatever you copied last gets read aloud, no app switch needed.

Ctrl+Shift+Alt+S · Stop reading. Cuts off any in-progress read-aloud immediately. Useful for long paragraphs you only needed the first sentence of.

all five are remappable in Settings → Activation

Esc is a true global stop — cancel a capture or silence a read-aloud from anywhere. Bindings survive lock and unlock.
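The hold-to-talk behavior can be modeled as a tiny state machine over key events. The Python sketch below is illustrative only: the key names, the `HoldChord` class, and the event strings are assumptions, and a real Windows build would feed it from a keyboard hook instead of direct method calls.

```python
MODIFIERS = frozenset({"ctrl", "shift", "alt", "win"})


class HoldChord:
    """Track a modifier-only chord: start on hold, stop on release, cancel on Esc."""

    def __init__(self, chord):
        self.chord = frozenset(chord)
        if not self.chord or not self.chord <= MODIFIERS:
            raise ValueError("chord must contain only modifier keys")
        self.down = set()
        self.active = False

    def key_down(self, key):
        self.down.add(key)
        if self.active:
            if key == "esc":
                self.active = False
                return "cancel"
            return None  # key repeats and other keys are ignored while capturing
        if self.down == self.chord:
            self.active = True
            return "start"
        return None

    def key_up(self, key):
        self.down.discard(key)
        if self.active and key in self.chord:
            self.active = False
            return "stop"
        return None
```

Because the chord contains only modifiers, starting or stopping a capture never sends a character to the focused app.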

real questions

The stuff people actually ask.

Why does Windows say "Windows protected your PC" when I run it?

That's SmartScreen. It warns about apps it hasn't seen before — code-signing certs cost a few hundred dollars a year, and we haven't bought one yet. Click More info → Run anyway and the install proceeds normally. The download has a SHA256 published right next to the button you can verify against.

Is my voice actually private? For real?

Yes. Speech recognition, AI cleanup, and read-aloud all run entirely on your CPU or GPU using locally-stored model files. Pull your network cable mid-sentence and tellmebaby keeps working. The only network traffic is the initial model downloads (so you can pick languages) plus a signed JSON poll every six hours to see if there's an update. That's the entire network footprint.

How accurate is it, really?

English: very good. Parakeet TDT 0.6B v2 is one of the best open-weight ASR models available right now, INT8-quantized but still benchmark-competitive with closed cloud APIs. Multilingual: Whisper Large v3 Turbo, also INT8. We pick the right model automatically based on the languages you say you speak in onboarding. And your personal dictionary tilts recognition toward your names and jargon on top of that.

Will the AI rewrite what I say?

No — not unless you ask it to. The default cleanup removes fillers and stutters and adds punctuation, and every output has to pass an acceptance gate first: if it dropped a phrase, rewrote your phrasing, or softened how strongly you said something, it's rejected and your raw words ship instead. The two places the AI gets creative freedom are the ones you explicitly opt into: Edit Mode (you spoke the instruction) and per-app custom styles (you wrote the style).

What's the big download after install?

The built-in brain — a local language model (about 2.4 GB) that powers cleanup, Edit Mode, and per-app styles, plus the speech models for the languages you picked. All downloads are pinned and SHA256-verified, resume if interrupted, and live in %USERPROFILE%\.tellmebaby where you can delete and re-fetch them any time. The brain runs only on 127.0.0.1 and shuts down with the app.

Where does it store my recordings and transcripts?

%USERPROFILE%\.tellmebaby — recordings as .wav, transcripts in a SQLite database, settings as JSON. Nothing's compressed, nothing's encrypted, nothing's hidden. You can browse it all directly in File Explorer. Deleting that folder is a complete reset.

Does it auto-update?

Yes. The app polls a signed manifest every six hours. When a new version is out, you get a banner across the top of the main window with a one-click Install + restart. Updates are signed with an ed25519 key, so you can't be tricked into installing a fake one.

What if I want it to ignore my microphone in some apps?

That's what per-app modes are for. Set "passthrough" mode for, say, your password manager — the hotkey doesn't do anything when that app is focused. You can also pause the hotkey globally with one click in the sidebar.

Why such weird hotkey combos?

Because Windows reserves Win + letter at the shell level (so we never see Win+R, etc.), and browsers eat Ctrl+Shift+letter shortcuts (Ctrl+Shift+R reloads, Ctrl+Shift+V pastes plain text). Ctrl+Shift+Alt + letter is essentially never claimed by anything else, so the chords stay yours no matter which app is in front. The two hold-to-talk actions go further: dictation (Ctrl + Win) and Edit Mode (Alt + Win) use modifier-only chords — keys that physically can't type — so nothing can ever land in your document, or over your selection, by accident. Re-bind to whatever you prefer in Settings → Activation.

macOS / Linux?

Not yet. The Windows build uses WASAPI for audio capture and Win32 keyboard hooks for the global hotkey — porting those to macOS/Linux is real engineering work, not a config flip. Windows-only for now; other platforms once the Windows experience is rock-solid.

What does it cost?

Free. Forever, on Windows. There's no paid tier planned, no "pro" features locked behind a paywall, no telemetry to monetize. If that ever changes, anyone who installed before the change keeps the version they had at the price they had.

Is there a hidden catch?

No. tellmebaby is a personal project that exists because the maintainer wanted a local-first dictation tool with a brain and couldn't find one. You're welcome to use it, share it, fork the source if you want to. There's no "actually it phones home for analytics" footnote.

One download. No account, no card.

Run the installer, pick the languages you speak, and start talking in under five minutes. Updates are automatic and signed.

Download v0.3.9 for Windows

tellmebaby_0.3.9_x64-setup.exe · 13.9 MB · Windows 10 / 11 (x64)

sha256 · 66f949ff2568717e3bc04ae9563dec3699cb40e4f517029aede65c6820f793da