The open-source yt-dlp + FFmpeg + Whisper wrapper

ViralMint wires the canonical creator-video OSS stack (yt-dlp, FFmpeg, faster-whisper) into a single Python pipeline with a REST API at :16888/api/* and an MCP server at :16888/mcp. Stop reimplementing the orchestration in every project — drive the whole stack from curl, Python, or Claude Code instead. AGPL-3.0, no subscription.

1,000+ yt-dlp sites supported
100+ Whisper languages
REST + MCP API surface
AGPL-3.0 License

The stack

yt-dlp

Download

1,000+ sites — YouTube, TikTok, Bilibili, Twitter, Reddit, Twitch, Vimeo, Instagram, etc. ViralMint wraps it with curl-cffi browser impersonation, cookie-auth extraction, PoT fallbacks, and a Playwright-driven manifest-capture path for hard-to-extract sites.

faster-whisper

Transcribe

Local int8 transcription with word-level timestamps. Bundled small.en + small (multilingual) models; medium / large-v3 selectable in Settings. ~30s to transcribe a 5-minute video on a mid-range laptop CPU.
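
As a rough illustration of what int8 transcription with word-level timestamps looks like when calling faster-whisper directly, here is a minimal sketch. The `WhisperModel(..., compute_type="int8")` and `transcribe(..., word_timestamps=True)` calls are faster-whisper's public API; the helper names `flatten_words` and `transcribe_words` are hypothetical and are not taken from ViralMint's code.

```python
def flatten_words(segments):
    """Collect (text, start, end) tuples from segments exposing a .words list."""
    words = []
    for seg in segments:
        # seg.words is None when word timestamps were not requested.
        for w in (seg.words or []):
            words.append((w.word.strip(), w.start, w.end))
    return words


def transcribe_words(path, model_name="small.en"):
    """Transcribe a media file on CPU with int8 weights; return (language, words)."""
    # Imported lazily so flatten_words stays usable without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, word_timestamps=True)
    # segments is a lazy generator; flatten_words consumes it and runs the decode.
    return info.language, flatten_words(segments)
```

Calling `transcribe_words("clip.mp4")` would return the detected language plus a flat list of timed words, which is the shape a caption step needs as input.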

FFmpeg

Composite

Clip stitching, ASS subtitle burn-in (word-by-word from Whisper timestamps), audio mixing with ducking, watermark overlay, reframe (MediaPipe face-tracking), EBU R128 normalization, silence removal, multi-aspect export. All composable.

FastAPI

REST API

Every operation above is exposed as a /api/* endpoint. SQLite job tracking, async dispatch via asyncio (no Redis / Celery), WebSocket for real-time progress events.

FastMCP

MCP Server

86 MCP tools mounted at /mcp. Claude Code / Claude Desktop / Cursor can drive the entire pipeline from natural-language chat. wait_for_job helper for multi-step orchestration.

Example: download → transcribe → reframe via REST

# Download → Transcribe → Reframe a video via the REST API
# (No CLI flags, no FFmpeg argv construction, no Whisper config.)

curl -X POST http://127.0.0.1:16888/api/download \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
# → {"job_id": "j_abc123"}

# Poll until terminal (the MCP wait_for_job helper does this for you):
curl http://127.0.0.1:16888/api/jobs/j_abc123
# → {"status": "success", "output": {"downloaded_video_id": 42}}

# Auto-transcribe is chained — Whisper ran in the same job.
# Now reframe to 9:16 with face-tracking + burn captions:
curl -X POST http://127.0.0.1:16888/api/tools/reframe \
  -H "Content-Type: application/json" \
  -d '{"video_id": 42, "target_aspect": "9:16", "burn_captions": true}'
# → {"job_id": "j_xyz789"}
# → once the job succeeds: output mp4 at downloaded_video.video_path_9x16

The same operations are also exposed as MCP tools, so Claude Code can chain them with natural-language prompts. See /developers/mcp-video-server/.
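
For scripting from Python, the curl flow above can be sketched with the standard library alone. The endpoints, request fields, and response fields are the ones shown in the curl example. The helper names (`post_json`, `get_json`, `wait_for_job`, `download_and_reframe`) are hypothetical, and the failure status names in `TERMINAL` are assumptions, since only `"success"` appears in the example.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:16888"  # local API address from the curl example

# Assumption: only "success" is confirmed above; the other names are guesses.
TERMINAL = {"success", "failed", "error", "cancelled"}


def post_json(path, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_json(path):
    """GET a path and return the decoded JSON response."""
    with urllib.request.urlopen(BASE_URL + path) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=get_json, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll /api/jobs/{id} until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(f"/api/jobs/{job_id}")
        if job.get("status") in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        sleep(interval)


def download_and_reframe(url, post=post_json, fetch=get_json, sleep=time.sleep):
    """Download a video, then reframe it to 9:16 with burned-in captions."""
    download = post("/api/download", {"url": url})
    job = wait_for_job(download["job_id"], fetch=fetch, sleep=sleep)
    if job["status"] != "success":
        raise RuntimeError(f"download failed: {job}")
    video_id = job["output"]["downloaded_video_id"]
    reframe = post(
        "/api/tools/reframe",
        {"video_id": video_id, "target_aspect": "9:16", "burn_captions": True},
    )
    return wait_for_job(reframe["job_id"], fetch=fetch, sleep=sleep)
```

With the desktop app running, `download_and_reframe("https://www.youtube.com/watch?v=dQw4w9WgXcQ")` would return the final reframe job record. The `post`, `fetch`, and `sleep` parameters exist only so the flow can be exercised against stubs without a live server.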

What ViralMint solves over raw yt-dlp + FFmpeg

  • Format selection + remux defaults. yt-dlp's -f bestvideo+bestaudio is right most of the time but breaks on sites that serve segmented streams. The download wrapper picks the right format string per platform (different defaults for YouTube vs Bilibili vs Douyin) and remuxes to mp4 automatically.
  • Cookie auth without per-call setup. Export browser cookies once via Settings → Cookies; yt-dlp uses them on every subsequent download for age-gated, private, or paid content. No CLI flags per call.
  • Browser impersonation + Playwright fallback. When yt-dlp's direct extractor hits bot detection, the wrapper falls back to curl-cffi browser impersonation; when that fails, it spins up bundled Chromium via Playwright to capture the HLS manifest. Three tiers, automatic.
  • Whisper word-timestamp → ASS subtitle alignment. Whisper outputs word-level timestamps; mapping them onto ASS-format subtitle timing with proper line breaks, max-2-words-per-frame styling, and language-aware spacing is non-trivial. ViralMint owns the alignment.
  • FFmpeg argv discipline. Every FFmpeg call is built from an argv list (never shell=True string concatenation), so user input cannot be interpreted as shell syntax. Reframe + caption burn + audio mix run as single-pass filter graphs, not chained subprocess calls.
  • Async-everywhere job model. Long-running operations (download, transcription, video generation) return a job_id immediately; poll GET /api/jobs/{id} for status / progress / output. No blocking subprocess waits in your code.
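
To make the word-timestamp to ASS mapping from the list above concrete, here is a minimal sketch that groups timed words into caption events of at most two words each. It is an illustration under simplifying assumptions, not ViralMint's alignment code: the function names are hypothetical, words are joined with plain spaces (so it ignores the language-aware spacing mentioned above), and it does no line breaking or styling.

```python
def ass_time(seconds):
    """Format seconds as H:MM:SS.cc (ASS timestamps use centiseconds)."""
    total_cs = max(0, round(seconds * 100))
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def words_to_ass_events(words, max_words=2, style="Default"):
    """Turn [(text, start, end), ...] into ASS Dialogue lines, max_words per line."""
    events = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        nxt = i + max_words
        if nxt < len(words):
            # Never let a caption overlap the first word of the next group.
            end = min(end, words[nxt][1])
        end = max(end, start)  # guard against inverted timestamps
        text = " ".join(w[0].strip() for w in group)
        # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
        events.append(
            f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"
        )
    return events
```

The returned lines belong under the `[Events]` section of an ASS file whose `Format:` line lists the same ten fields; the script header and style definitions are left out of the sketch.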

Where the code lives

  • Download: backend/services/ytdlp_service.py — yt-dlp wrapper with the multi-tier fallback chain.
  • Transcribe: backend/services/whisper_service.py — faster-whisper wrapper with model loading + word-timestamp extraction.
  • FFmpeg: backend/services/ffmpeg_service.py — argv-list-based wrapper for stitch / caption / mix / reframe / normalize.
  • REST API: backend/api/ — FastAPI routers for every operation, mounted on /api/*.
  • MCP server: backend/mcp/ — FastMCP tools that wrap the REST routes for MCP-aware clients.
  • Job runner: backend/core/task_runner.py — asyncio in-process dispatcher (no Redis / Celery).
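
The in-process job model (submit, get a job_id back immediately, poll for status) can be illustrated with a small asyncio sketch. This is not the contents of task_runner.py: the class and method names are hypothetical, a plain dict stands in for the SQLite job table, and the `"pending"`, `"running"`, and `"failed"` status names are assumptions.

```python
import asyncio
import uuid


class TaskRunner:
    """Minimal in-process dispatcher: no broker, just asyncio tasks."""

    def __init__(self):
        self.jobs = {}       # job_id -> record; stand-in for a SQLite table
        self._tasks = set()  # strong references so tasks are not garbage-collected

    def submit(self, coro_fn, *args):
        """Schedule coro_fn(*args) and return a job_id without waiting for it."""
        job_id = "j_" + uuid.uuid4().hex[:8]
        self.jobs[job_id] = {"status": "pending", "output": None, "error": None}
        task = asyncio.create_task(self._run(job_id, coro_fn, *args))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, coro_fn, *args):
        job = self.jobs[job_id]
        job["status"] = "running"
        try:
            job["output"] = await coro_fn(*args)
            job["status"] = "success"
        except Exception as exc:  # record the failure instead of crashing the loop
            job["error"] = str(exc)
            job["status"] = "failed"

    async def wait(self, job_id, interval=0.01):
        """Poll the job record until it leaves the pending/running states."""
        while self.jobs[job_id]["status"] in ("pending", "running"):
            await asyncio.sleep(interval)
        return self.jobs[job_id]
```

`submit` has to be called from inside a running event loop, which is the situation a FastAPI route handler is already in. A status endpoint would then only need to read the stored job record.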

Frequently asked

What does ViralMint wrap that yt-dlp + FFmpeg + Whisper don't do alone?

Each underlying tool is best-in-class but requires careful orchestration: yt-dlp's format selection / cookie handling / browser impersonation; FFmpeg's argv construction for video stitching + caption burn-in + audio mixing; Whisper's word-timestamp JSON alignment with subtitle timing. ViralMint wires them together as a single Python pipeline with sane defaults, a desktop UI, a REST API at /api/*, and an MCP server at /mcp — so you don't reimplement the orchestration in every project.

Is ViralMint open source?

Yes — AGPL-3.0. Source at github.com/openclaw-easy/ViralMint. The license requires that derivative network services release their source; for desktop forks the obligation is the standard GPL-style copyleft. Commercial use is allowed under AGPL terms.

Can I script ViralMint from Python or shell?

Yes, via two paths. (1) The desktop app ships a REST API at http://127.0.0.1:16888/api/* that you can call with curl / requests / fetch. (2) The MCP server at /mcp exposes 86 tools to MCP-aware clients (Claude Code, Claude Desktop, Cursor) for natural-language driving. For shell scripts, the REST path is usually cleanest.

Does ViralMint handle yt-dlp's edge cases (PoT, sig fixes, bot detection)?

Yes. The download wrapper handles YouTube's PO Token / missing-PoT fallbacks, curl-cffi browser-impersonation headers for sites with bot detection, cookie extraction from your browser for age-gated / private content, and Playwright-driven manifest capture as a last-resort fallback when yt-dlp's direct extractor fails. Updates from yt-dlp's master branch ship with ViralMint's release cycle.
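
The tiered fallback described above follows a simple pattern: try each strategy in order, and keep every failure for diagnostics. A minimal sketch of that pattern follows. The tier callables are placeholders for the real extractors, and the names `download_with_fallbacks` and `AllTiersFailed` are hypothetical rather than taken from ytdlp_service.py.

```python
class AllTiersFailed(Exception):
    """Raised when every download tier has failed."""


def download_with_fallbacks(url, tiers):
    """Try each (name, callable) tier in order; return (name, result) from the first success."""
    errors = []
    for name, fetch in tiers:
        try:
            return name, fetch(url)
        except Exception as exc:
            # Remember why this tier failed, then fall through to the next one.
            errors.append(f"{name}: {exc}")
    raise AllTiersFailed("all tiers failed: " + "; ".join(errors))
```

In this shape the caller would pass something like `[("direct", ...), ("impersonate", ...), ("playwright", ...)]`, cheapest tier first, so the browser-based path only runs when the lighter ones have already failed.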

What FFmpeg operations are bundled?

Clip stitching, audio mixing (amix with ducking), ASS subtitle burn-in (word-by-word from Whisper word-timestamps), watermark overlay, 16:9 → 9:16 reframe with face-tracking via MediaPipe, EBU R128 audio loudness normalization, silence removal (Whisper-detected + FFmpeg cut), speed change with pitch preservation, multi-aspect export (9:16 + 1:1 + 16:9 zip bundle). All accessible via REST or MCP.
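
As an illustration of building such a call as an argv list, here is a minimal sketch that combines a 9:16 crop, optional ASS caption burn-in, and EBU R128 loudness normalization in one FFmpeg pass. It is a simplified stand-in, not ViralMint's ffmpeg_service.py: the function names are hypothetical, a fixed center crop replaces the MediaPipe face-tracking, the 1080x1920 output size and loudnorm targets are example values, and the subtitle path is restricted to plain characters instead of implementing FFmpeg's filter-graph escaping.

```python
import re
import subprocess

# Conservative allow-list: avoids filter-graph metacharacters such as ':' ',' and quotes.
SAFE_FILTER_PATH = re.compile(r"[A-Za-z0-9_./-]+")


def build_reframe_argv(src, dst, ass_path=None, normalize=True):
    """Build an ffmpeg argv list: center-crop to 9:16, optionally burn ASS subs and normalize."""
    filters = ["crop=ih*9/16:ih", "scale=1080:1920"]
    if ass_path is not None:
        if not SAFE_FILTER_PATH.fullmatch(ass_path):
            raise ValueError(f"unsupported characters in subtitle path: {ass_path!r}")
        filters.append(f"ass={ass_path}")
    argv = ["ffmpeg", "-y", "-i", src, "-vf", ",".join(filters)]
    if normalize:
        argv += ["-af", "loudnorm=I=-16:TP=-1.5:LRA=11"]
    argv += ["-c:v", "libx264", "-c:a", "aac", dst]
    return argv


def run_ffmpeg(argv):
    """Run ffmpeg without a shell, so each argv element reaches it verbatim."""
    subprocess.run(argv, check=True)
```

Because the input path is a single list element, a hostile filename such as `in; rm -rf ~.mp4` reaches FFmpeg as one literal argument and is never parsed by a shell.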

Can I run just the API without the desktop UI?

Technically yes — the FastAPI backend at backend/main.py runs standalone via python run.py. The packaged desktop bundle includes a pystray tray launcher; in dev mode you can skip the launcher and run the FastAPI server directly. The frontend (Vite/React UI) is optional; the API is fully usable without it.

Stop reimplementing the wrapper

The desktop app ships everything wired together: REST API, MCP server, job tracking, and WebSocket progress events. The source is AGPL-3.0, so you can run it locally and fork it if you need to.