Frequently asked questions
Why use Whisper instead of audio-amplitude detection?
Audio-amplitude tools cut everything below a dB threshold, which means they also cut breaths, soft consonants, and the natural quiet at the end of phrases. The result is choppy. Whisper's word-level timestamps tell you exactly when words start and end, so silence cuts land only in real gaps. The output sounds like the speaker simply paused less, not like the audio was sliced.
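The gap-finding step can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the input shape mirrors what openai-whisper produces with word timestamps enabled, but the field names and sample data here are assumptions.

```python
# Hypothetical sketch: find cuttable gaps from word-level timestamps.
# Each word dict has assumed "start"/"end" fields in seconds.

def find_silence_gaps(words, threshold=0.7):
    """Return (start, end) spans where the pause between consecutive
    words exceeds `threshold` seconds."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] > threshold:
            gaps.append((prev["end"], nxt["start"]))
    return gaps

words = [
    {"word": "Hello", "start": 0.00, "end": 0.40},
    {"word": "there", "start": 0.55, "end": 0.90},  # 0.15 s pause: kept
    {"word": "So",    "start": 2.10, "end": 2.30},  # 1.20 s pause: cut
]
print(find_silence_gaps(words))  # [(0.9, 2.1)]
```

Because the cut spans run from the end of one word to the start of the next, a cut can never clip a soft consonant or a trailing breath that Whisper attributed to a word.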
What's a good silence threshold for talking-head video?
0.7 seconds (the default) is a good middle ground: it keeps natural breathing room without leaving long awkward pauses. 0.4 seconds gives a podcast-snappy delivery (good for vlogs, tutorials, and social cuts). 1.0 second preserves contemplative pacing (good for narrative, meditation, and longer-form work). Most creators land between 0.5 and 0.8 seconds after a couple of test runs.
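A quick way to pick a threshold is to measure the pauses in one recording and see what each setting would remove. This sketch uses made-up pause lengths and assumes each over-threshold pause is trimmed down to the threshold rather than deleted outright; the real tool's trimming behavior may differ.

```python
# Hypothetical sketch: compare thresholds against measured pause lengths.
pauses = [0.3, 0.45, 0.6, 0.75, 0.9, 1.4, 2.2]  # made-up sample data, seconds

for threshold in (0.4, 0.7, 1.0):
    cut = [p for p in pauses if p > threshold]
    # Assumption: each long pause is shortened to the threshold length.
    removed = sum(p - threshold for p in cut)
    print(f"{threshold}s threshold: {len(cut)} cuts, {removed:.2f}s removed")
```

Running this on your own pause data makes the tradeoff concrete before you commit to a full render.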
Will the cuts be visible or audible?
FFmpeg concatenates the kept segments with no crossfade by default, so cuts can have a subtle audible pop when the silence on either side has different background noise. For most talking-head content recorded in one location this is unnoticeable. For mixed-environment recordings, the desktop app's Audio Enhancer tool can normalize background noise across the whole video first, eliminating the pop.
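The concatenation step can be approximated with FFmpeg's trim/concat filters. This is a sketch of one way to build such a command, not the tool's actual invocation; the file names and segment times are made up.

```python
# Hypothetical sketch: build an ffmpeg command that keeps only the given
# (start, end) spans and concatenates them with no crossfade.

def build_ffmpeg_cmd(infile, outfile, keep):
    """keep: ordered list of (start, end) spans in seconds to retain."""
    parts = []
    for i, (s, e) in enumerate(keep):
        parts.append(
            f"[0:v]trim=start={s}:end={e},setpts=PTS-STARTPTS[v{i}];"
            f"[0:a]atrim=start={s}:end={e},asetpts=PTS-STARTPTS[a{i}];"
        )
    pairs = "".join(f"[v{i}][a{i}]" for i in range(len(keep)))
    filt = "".join(parts) + f"{pairs}concat=n={len(keep)}:v=1:a=1[v][a]"
    return ["ffmpeg", "-i", infile, "-filter_complex", filt,
            "-map", "[v]", "-map", "[a]", outfile]

cmd = build_ffmpeg_cmd("talk.mp4", "tight.mp4", [(0.0, 12.4), (14.1, 30.0)])
print(" ".join(cmd))
```

The `setpts`/`asetpts` resets keep each segment's timestamps starting from zero so the concat filter can join them cleanly; the hard joins are exactly where a background-noise mismatch can produce the pop described above.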
Does it work on non-English audio?
Yes. Whisper's silence detection is language-agnostic — it identifies word boundaries regardless of language. The tool has been validated on English, Spanish, French, German, Mandarin, Japanese, and Korean.
Can I review the cuts before rendering?
The desktop app shows a preview list of every segment that would be removed (with timestamps) before you click Render. If you spot one you want to keep — say, a pregnant pause for emphasis — you can exclude it from the cut list with one click.
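Conceptually, the review step is just a cut list with a per-entry veto. This sketch shows one plausible shape for that data; the field names are assumptions about the app's internals, not its actual API.

```python
# Hypothetical sketch: a reviewable cut list where individual cuts
# can be excluded before rendering.
from dataclasses import dataclass

@dataclass
class Cut:
    start: float          # seconds
    end: float            # seconds
    keep: bool = False    # True = exclude this cut from removal

cuts = [Cut(12.4, 14.1), Cut(33.0, 34.6), Cut(58.2, 59.1)]
cuts[1].keep = True       # the pause the creator wants to preserve

to_remove = [c for c in cuts if not c.keep]
print(len(to_remove))  # 2
```

Only the entries still flagged for removal are handed to the render step, so excluding a pause is a pure metadata change with no re-analysis needed.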