I built this pipeline end-to-end inside ViralMint and the desktop app ships it as a four-stage flow: extract → transcribe → assemble → composite. On an Apple M2 Max it turns a 10-minute source video into a 60-second captioned short in ~90 seconds wall-clock — 22s of Whisper transcription with the medium model + ~25s of FFmpeg clip extraction and concat + ~35s for the ASS-format word-by-word caption burn-in. No GPU required; no cloud round-trip past the optional AI script generation.

This post walks through the exact FFmpeg invocations and faster-whisper Python calls we run in production, with numbers from internal benchmarks. The full implementation is open source under AGPL-3.0 at github.com/openclaw-easy/ViralMint — every command below is copied verbatim from the shipped codebase. If you’ve seen MoneyPrinterTurbo, this is the same script-to-video idea taken further: trend scouting and Whisper-based competitor analysis decide what to build before the pipeline assembles it.

The Core Architecture

A production-grade pipeline follows four distinct stages:

Extraction: Isolating audio for analysis.
Intelligence: Transcription and alignment via Whisper.
Assembly: Mapping visuals to the timeline.
Compositing: The final FFmpeg “burn” for captions and overlays.
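The four stages above can be wired together as a simple ordered list of functions. This is a minimal sketch, not the ViralMint implementation: `run_pipeline`, the stage functions, and the context keys are all hypothetical names, and the stage bodies are placeholders where the FFmpeg and faster-whisper calls would go.

```python
import time
from typing import Callable

# Each stage receives the shared context dict and returns an updated copy.
Stage = Callable[[dict], dict]


def run_pipeline(stages: list[tuple[str, Stage]], context: dict) -> tuple[dict, dict[str, float]]:
    """Run stages in order and record wall-clock seconds per stage."""
    timings: dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        context = stage(context)
        timings[name] = time.perf_counter() - started
    return context, timings


# Placeholder stages (hypothetical): real ones would shell out to FFmpeg
# or call faster-whisper instead of just recording file names.
def extract(ctx: dict) -> dict:
    return {**ctx, "audio": "audio_for_whisper.wav"}


def transcribe(ctx: dict) -> dict:
    return {**ctx, "words": []}


def assemble(ctx: dict) -> dict:
    return {**ctx, "background": "background.mp4"}


def composite(ctx: dict) -> dict:
    return {**ctx, "output": "final_video.mp4"}


STAGES = [
    ("extract", extract),
    ("transcribe", transcribe),
    ("assemble", assemble),
    ("composite", composite),
]

result, timings = run_pipeline(STAGES, {"source": "input.mp4"})
print(result["output"], list(timings))
```

Keeping the stages as plain functions over one context dict makes it easy to time each stage separately, which is how per-stage numbers like the ones in this post can be collected.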

1. High-Speed Audio Extraction

Whisper doesn’t need video pixels. Whisper models consume 16 kHz mono audio, so extracting the track to a mono 16 kHz WAV file up front means nothing has to be resampled or decoded from a compressed format at transcription time.

ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio_for_whisper.wav
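If the pipeline is driven from Python, the same command can be run through `subprocess`. The sketch below is an assumption about how that wrapper might look: `build_extract_cmd` and `extract_audio` are hypothetical helper names, and the `-y` flag (overwrite without prompting) is an addition that is not in the command above.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(src: str, dst: str) -> list[str]:
    """Build the FFmpeg argument list for a mono 16 kHz PCM WAV."""
    return [
        "ffmpeg",
        "-y",                    # assumption: overwrite dst without prompting
        "-i", src,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        dst,
    ]


def extract_audio(src: str, dst: str) -> Path:
    """Run FFmpeg; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_extract_cmd(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.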

2. Intelligence: Local faster-whisper

We use faster-whisper because it reimplements Whisper on top of CTranslate2, an inference engine that supports int8 quantization. Quantized int8 inference on the CPU is how ViralMint transcribes 10 minutes of audio in 22 seconds on an Apple M2 Max.

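from faster_whisper import WhisperModel

# Load the medium model with int8 quantization on the CPU (no GPU required).
model = WhisperModel("medium", device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus an info object;
# the actual work happens as the generator is consumed below.
segments, _ = model.transcribe("audio_for_whisper.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        # Each word carries its own start and end time in seconds.
        print(f"[{word.start:.2f}s -> {word.end:.2f}s]: {word.word}")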
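Word-level timestamps are usually grouped into short caption lines before rendering. The following is a minimal sketch of one way to do that, assuming a limit of three words per line and a new line after any pause longer than 0.6 seconds. The `Word` dataclass, `group_words`, and both thresholds are illustrative choices, not values from the ViralMint codebase.

```python
from dataclasses import dataclass


@dataclass
class Word:
    start: float  # seconds
    end: float    # seconds
    text: str


def group_words(words: list[Word], max_words: int = 3, max_gap: float = 0.6) -> list[list[Word]]:
    """Split words into caption lines by word count and by pauses in speech."""
    groups: list[list[Word]] = []
    current: list[Word] = []
    for w in words:
        gap_too_long = bool(current) and (w.start - current[-1].end) > max_gap
        if current and (len(current) >= max_words or gap_too_long):
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return groups


sample = [
    Word(1.0, 1.5, "Look"),
    Word(1.5, 2.0, "at"),
    Word(2.0, 2.5, "this!"),
    Word(4.0, 4.4, "Wow"),
]
for line in group_words(sample):
    print(line[0].start, line[-1].end, " ".join(w.text for w in line))
```

With faster-whisper output, each `Word` could be built as `Word(w.start, w.end, w.word.strip())`, since the transcribed word text typically includes a leading space.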

3. Dynamic Visual Assembly

Once you have word-level timestamps, you can calculate exactly where to place stock footage or AI-generated b-roll. For ViralMint’s Smart Video feature, we generate an FFmpeg filter script that handles transitions and the “Ken Burns” effect programmatically.
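As a rough illustration of the placement calculation, the sketch below turns a list of segment start times into back-to-back shots, so each visual stays on screen until the next segment begins. `Shot`, `plan_shots`, and the round-robin asset assignment are assumptions made for the example; the post does not describe how Smart Video actually picks or orders its visuals.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    asset: str       # path to a still image or clip
    start: float     # seconds on the output timeline
    duration: float  # seconds


def plan_shots(segment_starts: list[float], total_duration: float, assets: list[str]) -> list[Shot]:
    """One shot per segment, each lasting until the next segment starts."""
    shots: list[Shot] = []
    for i, start in enumerate(segment_starts):
        is_last = i + 1 == len(segment_starts)
        end = total_duration if is_last else segment_starts[i + 1]
        shots.append(
            Shot(
                asset=assets[i % len(assets)],  # assumption: cycle through assets
                start=start,
                duration=round(end - start, 3),
            )
        )
    return shots


shots = plan_shots([0.0, 4.2, 9.0], total_duration=15.0, assets=["a.png", "b.png"])
for shot in shots:
    print(shot)
```

Each shot's duration can then be converted to a frame count for the zoom filter or to a trim length for a stock clip.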

The “Ken Burns” Zoom Filter

ffmpeg -i image.png -vf "zoompan=z='zoom+0.001':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=125:s=1080x1920" output.mp4
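In the command above, `d=125` is a frame count: at zoompan's default of 25 fps that is a 5-second shot, and adding 0.001 per frame ends at a zoom of about 1.125. To generate the filter for an arbitrary shot length, the string can be built programmatically. This is a sketch under stated assumptions: `build_zoompan` is a hypothetical helper, and the `min(...)` cap and the explicit `fps` option are additions that do not appear in the original command.

```python
def build_zoompan(
    duration_s: float,
    fps: int = 25,
    end_zoom: float = 1.125,
    width: int = 1080,
    height: int = 1920,
) -> str:
    """Return a zoompan filter that zooms from 1.0 to end_zoom over duration_s."""
    frames = max(1, round(duration_s * fps))
    step = (end_zoom - 1.0) / frames  # zoom added per output frame
    return (
        f"zoompan=z='min(zoom+{step:.6f},{end_zoom})'"
        ":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"  # keep the zoom centered
        f":d={frames}:s={width}x{height}:fps={fps}"
    )


print(build_zoompan(5.0))
```

The result can be passed directly as the `-vf` argument. The source image should already have a 9:16 aspect ratio, because the `s=` option scales the output to that size regardless of the input shape.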

4. Compositing word-by-word captions

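Don’t use standard SRT files: SRT has no reliable way to style individual words. Instead, generate Advanced SubStation Alpha (.ass) files, which let you highlight individual words with the {\1c&H00FFFF&} primary-color override tag. ASS colors are written in blue-green-red order, so &H00FFFF& is yellow and &HFFFFFF& switches back to white.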

Example ASS logic:

Dialogue: 0,0:00:01.00,0:00:01.50,Viral,,0,0,0,,{\1c&H00FFFF&}Look{\1c&HFFFFFF&} at this!
Dialogue: 0,0:00:01.50,0:00:02.00,Viral,,0,0,0,,Look {\1c&H00FFFF&}at{\1c&HFFFFFF&} this!
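Lines like these are tedious to write by hand, so they are normally generated from the word timestamps. The sketch below emits one Dialogue event per word, with that word highlighted, and prepends a minimal header that defines the `Viral` style the events refer to. The helper names and every value in the header (font, size, margins, resolution) are assumptions chosen for a 1080x1920 frame, not ViralMint's actual style.

```python
HIGHLIGHT = r"{\1c&H00FFFF&}"  # yellow (ASS colors are in BGR order)
RESET = r"{\1c&HFFFFFF&}"      # back to white

# Assumed minimal header; the style values are illustrative only.
ASS_HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Viral,Arial,72,&H00FFFFFF,&H000000FF,&H00000000,&H80000000,-1,0,0,0,100,100,0,0,1,4,0,2,40,40,300,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""


def ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc (centisecond precision)."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def dialogue_lines(words: list[tuple[float, float, str]], style: str = "Viral") -> list[str]:
    """words is one caption line as (start, end, text) tuples."""
    lines = []
    for i, (start, end, _) in enumerate(words):
        parts = [
            f"{HIGHLIGHT}{text}{RESET}" if j == i else text
            for j, (_, _, text) in enumerate(words)
        ]
        body = " ".join(parts)
        lines.append(f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{body}")
    return lines


def build_ass(words: list[tuple[float, float, str]]) -> str:
    """Full .ass file content for a single caption line."""
    return ASS_HEADER + "\n".join(dialogue_lines(words)) + "\n"


caption = [(1.0, 1.5, "Look"), (1.5, 2.0, "at"), (2.0, 2.5, "this!")]
print(build_ass(caption))
```

Each event shows the whole caption line for the duration of one word, so the highlight appears to move across the line as the events play back to back.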

Then, burn them in with a single pass:

ffmpeg -i background.mp4 -vf "ass=captions.ass" final_video.mp4

Performance Benchmarks

Here is the ViralMint performance baseline for this specific pipeline:

Hardware	Stage	Workload	Time
Apple M2 Max	Intelligence	10m Audio	22s
NVIDIA RTX 3060	Compositing	60s 9:16 Render	14s
Intel i7 (12th Gen)	Intelligence	10m Audio	55s
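To put the transcription rows in perspective, the times can be converted into a speed-over-real-time factor. The snippet below uses only numbers quoted in this post; the variable names are illustrative.

```python
AUDIO_SECONDS = 10 * 60  # the 10-minute benchmark clip

# Transcription times from the table above, in seconds.
transcription_seconds = {
    "Apple M2 Max": 22,
    "Intel i7 (12th Gen)": 55,
}

# How many seconds of audio are transcribed per second of wall-clock time.
speedups = {
    hardware: round(AUDIO_SECONDS / seconds, 1)
    for hardware, seconds in transcription_seconds.items()
}

# Per-stage times on the M2 Max, as quoted in the introduction.
stage_seconds = {"transcription": 22, "extraction + concat": 25, "caption burn-in": 35}
stage_total = sum(stage_seconds.values())

for hardware, factor in speedups.items():
    print(f"{hardware}: {factor}x real time")
print(f"Sum of listed stages: {stage_total}s")
```

By this measure the M2 Max transcribes at roughly 27 times real time and the 12th-gen i7 at roughly 11 times, and the three listed stages add up to 82 seconds of the approximately 90-second run.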

Automation with Model Context Protocol (MCP)

By exposing these FFmpeg and Whisper steps as MCP Tools, you can let AI agents like Claude Code drive the entire pipeline. You simply say: “Extract audio, find the best hook, and generate 3 variations of the captions,” and the agent orchestrates the Python services for you — see the ViralMint MCP video server for the full tool surface.
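An MCP tool is essentially a name, a description, and a JSON Schema for its input, plus a handler on the server side. The sketch below shows that shape in plain Python without any MCP SDK, so it illustrates the idea rather than a working server. The tool names, argument names, and helper functions are hypothetical and are not taken from the ViralMint MCP video server.

```python
from typing import Any, Callable


def _string_schema(*fields: str) -> dict[str, Any]:
    """JSON Schema for an object whose listed fields are required strings."""
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in fields},
        "required": list(fields),
    }


def extract_audio_cmd(args: dict[str, str]) -> list[str]:
    return ["ffmpeg", "-i", args["input_path"], "-vn", "-acodec", "pcm_s16le",
            "-ar", "16000", "-ac", "1", args["output_path"]]


def burn_captions_cmd(args: dict[str, str]) -> list[str]:
    return ["ffmpeg", "-i", args["video_path"], "-vf", f"ass={args['ass_path']}",
            args["output_path"]]


# Hypothetical tool registry: descriptor fields plus a local handler.
TOOLS: dict[str, dict[str, Any]] = {
    "extract_audio": {
        "description": "Extract mono 16 kHz WAV audio from a video file.",
        "inputSchema": _string_schema("input_path", "output_path"),
        "handler": extract_audio_cmd,
    },
    "burn_captions": {
        "description": "Burn an ASS caption file into a video.",
        "inputSchema": _string_schema("video_path", "ass_path", "output_path"),
        "handler": burn_captions_cmd,
    },
}


def list_tools() -> list[dict[str, Any]]:
    """What an agent would see: descriptors without the handlers."""
    return [
        {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
        for name, t in TOOLS.items()
    ]


def call_tool(name: str, arguments: dict[str, str]) -> list[str]:
    """Validate required arguments, then return the FFmpeg command to run."""
    tool = TOOLS[name]
    missing = [k for k in tool["inputSchema"]["required"] if k not in arguments]
    if missing:
        raise ValueError(f"missing arguments for {name}: {missing}")
    handler: Callable[[dict[str, str]], list[str]] = tool["handler"]
    return handler(arguments)


print(call_tool("extract_audio", {"input_path": "input.mp4", "output_path": "audio.wav"}))
```

In a real server the handlers would execute the commands and return results to the agent, and registration would go through an MCP SDK rather than a dict.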

Conclusion

Building an AI video pipeline is about balancing the speed of local processing (Whisper/FFmpeg) with the creativity of cloud-based models (for scripts and imagery).

If you want to skip the engineering and start generating videos today, download the ViralMint Desktop App. It bundles this entire pipeline into a one-click creator toolkit.

How to Build an AI Video Pipeline with FFmpeg and Whisper (2026)
