About

LiveCaptionIt is a free, browser-only live captions tool. Pick any audio source your browser can hear — a YouTube tab, a Spotify track, a Google Meet call, a podcast — and LiveCaptionIt shows live captions in a floating window that stays visible while you switch between apps.

It runs the Whisper speech recognition model directly in your browser via transformers.js + WebGPU. Audio is processed locally and never uploaded to any server. The only network request is the one-time model download from Hugging Face Hub, which is cached after first use.

Built by Shrestha Tripathi. The product principle: your data should never leave your device.

Questions, bugs, or feature requests? Get in touch.

Honest limitations

No speaker identity. LiveCaptionIt doesn't say "Speaker A vs Speaker B" — and not because we're lazy. When you capture a tab, every speaker (your colleague in Zoom, the host on a podcast, both interviewers) reaches the browser as a single mono audio stream at similar volume. Telling them apart needs a separate voice-fingerprinting model (~300MB) that we haven't shipped yet. We may split paragraphs on long pauses in a future release — that's the honest version of "diarization" we can deliver from one channel.
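
To make the pause-based idea concrete, here is a minimal sketch of splitting captions into paragraphs on long gaps. It is an illustration only, not shipped code: the Segment shape, the splitOnPauses name, and the 1.5-second threshold are all assumptions made for this example.

```typescript
// Hypothetical shape for a timed caption segment (assumed for this sketch).
interface Segment {
  text: string;
  startSec: number;
  endSec: number;
}

// Group segments into paragraphs, starting a new paragraph whenever the gap
// between the end of one segment and the start of the next is at least
// pauseSec. The 1.5 s default is an assumed value, not a product setting.
function splitOnPauses(segments: Segment[], pauseSec: number = 1.5): string[] {
  const paragraphs: string[][] = [];
  let prevEnd: number | null = null;
  for (const seg of segments) {
    // The first segment always opens a paragraph.
    const gap = prevEnd === null ? Infinity : seg.startSec - prevEnd;
    if (gap >= pauseSec) {
      paragraphs.push([]);
    }
    paragraphs[paragraphs.length - 1].push(seg.text.trim());
    prevEnd = seg.endSec;
  }
  return paragraphs.map((words) => words.join(" "));
}
```

This only marks where someone stopped talking for a while. It says nothing about who was talking, which is why it is a fallback for diarization rather than a substitute.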

Accuracy isn't perfect. Whisper's base model is fast and works offline, but it can mishear technical jargon, proper nouns, and heavily accented speech. The model picker lets you try small (244MB) for better accuracy at the cost of a slower first load.

~1-second delay. Real-time speech recognition needs to wait for enough audio context to make a stable guess. The first word usually shows within ~1 second of speech; words refine in place as Whisper hears more.

Music and silence aren't speech. Whisper hallucinates plausible-sounding captions on near-silent or musical audio. LiveCaptionIt drops repeating filler-token chains and resets the buffer after 2.5s of silence, but expect some weirdness if you point it at a song with sparse vocals.
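
The two safeguards above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not LiveCaptionIt's actual code: collapseRepeats, SilenceGate, the RMS threshold of 0.01, and the repeat limits are hypothetical. Only the 2.5-second reset window comes from the description above.

```typescript
// True if the n words starting at index a match the n words starting at b
// (case-insensitive). Returns false if the second phrase runs past the end.
function samePhrase(words: string[], a: number, b: number, n: number): boolean {
  if (b + n > words.length) return false;
  for (let k = 0; k < n; k++) {
    if (words[a + k].toLowerCase() !== words[b + k].toLowerCase()) return false;
  }
  return true;
}

// Collapse any phrase of 1..maxPhraseLen words that occurs more than
// maxRepeats times back to back into a single copy. Limits are assumed values.
function collapseRepeats(text: string, maxRepeats: number = 2, maxPhraseLen: number = 4): string {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const out: string[] = [];
  let i = 0;
  while (i < words.length) {
    let collapsed = false;
    for (let n = 1; n <= maxPhraseLen && !collapsed; n++) {
      let repeats = 1;
      while (samePhrase(words, i, i + repeats * n, n)) repeats++;
      if (repeats > maxRepeats) {
        out.push(...words.slice(i, i + n)); // keep one copy of the phrase
        i += repeats * n;                   // skip the rest of the chain
        collapsed = true;
      }
    }
    if (!collapsed) {
      out.push(words[i]);
      i++;
    }
  }
  return out.join(" ");
}

// Tracks how long the input has been quiet and signals when the caller
// should clear its audio buffer.
class SilenceGate {
  private silentMs = 0;
  private readonly resetAfterMs: number;
  private readonly rmsThreshold: number;

  constructor(resetAfterMs: number = 2500, rmsThreshold: number = 0.01) {
    this.resetAfterMs = resetAfterMs; // 2.5 s, per the description above
    this.rmsThreshold = rmsThreshold; // assumed loudness cutoff
  }

  // Feed one chunk of mono samples; returns true when a reset is due.
  push(samples: Float32Array, sampleRate: number): boolean {
    let sumSq = 0;
    for (let k = 0; k < samples.length; k++) sumSq += samples[k] * samples[k];
    const rms = samples.length > 0 ? Math.sqrt(sumSq / samples.length) : 0;
    if (rms >= this.rmsThreshold) {
      this.silentMs = 0;
      return false;
    }
    this.silentMs += (samples.length / sampleRate) * 1000;
    if (this.silentMs >= this.resetAfterMs) {
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```

A loudness threshold cannot tell music from speech, so a filter like this only helps with near-silence and looping text. Sparse vocals over a backing track will still get through.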