
Open-source Faceless Video Generator

Type a topic, ViralMint writes the script, picks the voice, generates captions and assembles stock footage or AI b-roll into a finished mp4. No camera, no editing, no subscription.


Faceless pipeline at a glance

{
  "tool": "ViralMint Faceless Video Generator",
  "input": "topic prompt OR custom script (≤10000 chars)",
  "output": "1080p mp4 (9:16, 16:9 or 1:1)",
  "voice_options": "13 Gemini voices + 400+ Edge TTS voices",
  "stock_b_roll_source": "Pexels keyword-matched (free)",
  "ai_b_roll_source": "OpenRouter video models (paid)",
  "caption_style_presets": ["viral", "classic", "bold"],
  "music_mix_db": -20,
  "platform_metadata_generated": ["youtube_title", "youtube_description", "youtube_tags", "tiktok_caption"],
  "subscription": false,
  "watermark": false
}

Pipeline details are in backend/agents/generator_video.py. See also the glossary entries for faceless channel and Smart Video.
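As a rough illustration of the limits in the spec above, a request could be checked before the pipeline starts. This is a minimal sketch, not the code in generator_video.py: `VideoRequest` and `validate` are hypothetical names, and only the 10,000-character limit, the three aspect ratios and the three caption presets come from the spec.

```python
from dataclasses import dataclass

MAX_SCRIPT_CHARS = 10_000                       # "input" field in the spec
ASPECT_RATIOS = {"9:16", "16:9", "1:1"}         # "output" field in the spec
CAPTION_PRESETS = {"viral", "classic", "bold"}  # "caption_style_presets" field


@dataclass
class VideoRequest:
    """Hypothetical request container, not a ViralMint class."""
    text: str                      # topic prompt or custom script
    aspect_ratio: str = "9:16"
    caption_preset: str = "viral"


def validate(req: VideoRequest) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not req.text.strip():
        problems.append("topic or script is empty")
    if len(req.text) > MAX_SCRIPT_CHARS:
        problems.append(
            f"script is {len(req.text)} chars, limit is {MAX_SCRIPT_CHARS}"
        )
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {req.aspect_ratio}")
    if req.caption_preset not in CAPTION_PRESETS:
        problems.append(f"unknown caption preset: {req.caption_preset}")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a form show all fixes in one pass.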

Figure: Smart Video, the page that turns a topic into a finished faceless mp4. Type the topic on the left; ViralMint handles script, voice, captions, stock footage and music on the right. (Screenshot of the ViralMint Smart Video page showing the topic input, aspect ratio picker, voice selector, music dropdown and a script panel mid-generation.)

What's a faceless video?

A faceless video is a short-form or long-form video where no person appears on camera. Instead of filming yourself, the video is built from a voiceover (AI or human) over stock footage, AI-generated b-roll, screen recordings or animated text. Faceless channels dominate niches like personal finance, news commentary, top-10 lists, productivity, and educational content on YouTube, TikTok and Instagram — they let a single creator publish daily without showing their face, and they scale to multiple channels at once.

A complete faceless workflow involves 10+ steps: research, scripting, voiceover, caption generation, clip sourcing, editing, music mixing, thumbnail design, title optimization, and posting. ViralMint runs everything up to posting in one pipeline; you upload the finished video yourself. The stock-footage tier covers Pexels stock with AI voiceover and captions at no per-video cloud cost. The paid tier swaps stock for AI-generated video clips (Sora 2 Pro, Veo 3.1, Seedance) and adds AI-generated b-roll imagery (Nano Banana) and AI music (Lyria 3 Pro); typical per-video cost is under $1.

How to make a faceless video with AI in 5 steps

1. Pick a topic. Type a niche, paste a competitor URL, or scout trends with ViralMint's trending video finder. The AI script writer uses the input plus search-demand data to shape the hook.
2. Pick a voice. Choose from 13 paid Gemini voices (Gemini 3.1 Flash TTS, $0.12 per 1,000 characters) or 400+ included Edge TTS voices covering 100+ languages. Voiceover is generated at 192 kbps, and Whisper word-level timestamps make each caption land on the exact word.
3. Pick a tier. Stock tier uses Pexels stock footage keyword-matched to your script (no per-video cloud cost). Budget tier uses Wan 2.7 or Hailuo 2.3 for AI-generated clips (~$0.25 per 5-second clip). Premium adds Sora 2 Pro or Veo 3.1 (~$1.50 per clip) with synced audio.
4. Generate. The 11-step pipeline runs in the background: script, voice, transcription, clip generation, stitching, music mix, audio merge, caption burn, thumbnail extraction, AI-drafted YouTube and TikTok metadata. Expect about 3-6 minutes on the stock-footage tier and 8-15 on the AI-video tier.
5. Download and post. You get the finished mp4 plus an AI-drafted title, description, tags and TikTok caption. You post manually; auto-upload was prototyped and removed because it kept breaking on platform OAuth changes.
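The prices quoted in the steps above can be combined into a back-of-envelope estimate. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, the number of AI clips per video is an input because this page does not state it, and real billing may differ.

```python
# Prices as quoted on this page; treat them as approximate.
GEMINI_TTS_USD_PER_1K_CHARS = 0.12
CLIP_USD = {"stock": 0.0, "budget": 0.25, "premium": 1.50}  # per AI clip


def estimate_cost(script_chars: int, ai_clips: int, tier: str = "stock",
                  gemini_voice: bool = False) -> float:
    """Rough per-video cost in USD (hypothetical helper, not ViralMint code).

    Edge TTS voices and Pexels stock footage are counted as zero cost.
    """
    voice = script_chars / 1000 * GEMINI_TTS_USD_PER_1K_CHARS if gemini_voice else 0.0
    clips = ai_clips * CLIP_USD[tier]
    return round(voice + clips, 2)


# A 900-character script on the stock tier with an Edge TTS voice:
print(estimate_cost(900, ai_clips=0))
# The same script with a Gemini voice and three budget-tier clips:
print(estimate_cost(900, ai_clips=3, tier="budget", gemini_voice=True))
```

Under these assumptions the first call comes to 0.0 and the second to 0.86.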

What's in the box

AI script writer

Takes a niche or competitor URL and produces a 60-second to 5-minute script tuned for retention. Uses Whisper transcripts of competitor videos when you give it a reference URL.

13 paid + 400+ included voices

Google Gemini 3.1 Flash TTS (Kore, Puck, Charon, etc.) for premium voice quality, or Edge TTS included across 100+ languages — Spanish, Portuguese, Mandarin, Japanese, German, French, and more.

Word-by-word captions

Whisper word-level timestamps + ASS subtitle rendering. Three presets — viral (yellow highlight, Montserrat Bold 56pt), classic (full sentence, Arial 42pt), bold (green highlight, Impact 64pt). Customizable per channel.
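To make the word-by-word idea concrete, here is a minimal sketch that turns word-level timestamps into ASS Dialogue events, one event per spoken word with that word highlighted. The function names, the exact highlight colours and the one-event-per-word layout are assumptions for illustration; ViralMint's own renderer may work differently. ASS colours are written in blue-green-red order, so yellow is `00FFFF`.

```python
# Highlight colours in ASS BGR hex; values are assumed, not taken from ViralMint.
HIGHLIGHT_BGR = {"viral": "00FFFF", "bold": "00FF00"}  # yellow, green


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360_000)
    m, rem = divmod(rem, 6_000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def caption_events(words, preset="viral", style="Default"):
    """words: list of (text, start_s, end_s) tuples, e.g. from Whisper.

    Returns ASS Dialogue lines. Presets with a highlight colour get one event
    per word; any other preset (e.g. "classic") gets one full-sentence event.
    """
    colour = HIGHLIGHT_BGR.get(preset)
    if colour is None:
        text = " ".join(w for w, _, _ in words)
        start, end = words[0][1], words[-1][2]
        return [f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"]

    events = []
    for i, (_, start, end) in enumerate(words):
        # \c sets the primary colour for the active word; \r resets to the style.
        parts = [
            f"{{\\c&H{colour}&}}{w}{{\\r}}" if j == i else w
            for j, (w, _, _) in enumerate(words)
        ]
        events.append(
            f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{' '.join(parts)}"
        )
    return events
```

The font and size for each preset (for example Montserrat Bold 56pt for viral) would live in the matching Style line of the ASS file rather than in the events.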

Pexels stock or AI b-roll

Stock tier: Pexels stock footage matched to your script's keywords clip by clip, with no per-clip cloud cost. Paid tier: Nano Banana generates custom b-roll images from script prompts (text-to-image), or Sora 2 / Veo 3.1 generate the moving clips themselves.
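Keyword matching against Pexels could look roughly like the sketch below, which builds a video search request for one script sentence. The keyword picker is a deliberately naive stand-in (the page does not describe ViralMint's matching logic), `keywords` and `search_request` are hypothetical names, and the API key is a placeholder. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

PEXELS_VIDEO_SEARCH = "https://api.pexels.com/videos/search"
ORIENTATION = {"9:16": "portrait", "16:9": "landscape", "1:1": "square"}
# Tiny illustrative stopword list; a real matcher would need far more.
STOPWORDS = {"most", "they", "that", "this", "with", "your", "from", "have"}


def keywords(sentence: str, limit: int = 3) -> list[str]:
    """Pick up to `limit` distinct longer words from a script sentence."""
    picked = []
    for raw in sentence.split():
        word = raw.strip(".,!?;:\"'").lower()
        if len(word) > 3 and word not in STOPWORDS and word not in picked:
            picked.append(word)
    return picked[:limit]


def search_request(sentence: str, aspect: str = "9:16",
                   api_key: str = "YOUR_PEXELS_KEY") -> Request:
    """Build (but do not send) a Pexels video search request."""
    params = {
        "query": " ".join(keywords(sentence)),
        "orientation": ORIENTATION[aspect],
        "per_page": 5,
    }
    return Request(
        f"{PEXELS_VIDEO_SEARCH}?{urlencode(params)}",
        headers={"Authorization": api_key},
    )
```

Passing the returned object to `urllib.request.urlopen` with a real key would fetch candidate clips; choosing among them is left out of this sketch.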

Background music

Bring your own royalty-free tracks, or generate AI tracks via Lyria 3 Pro (~$0.05 / 30s). Auto-mixed at -20dB with fade-in/out so the voiceover stays the focus.
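For a sense of what a -20 dB mix with fades means in practice, the sketch below converts decibels to a linear gain and builds an FFmpeg filter graph that ducks and fades the music under the voiceover. The function names, the 2-second fade length and the input order (voice first, music second) are assumptions; ViralMint's actual FFmpeg invocation is not shown on this page.

```python
def db_to_gain(db: float) -> float:
    """Convert decibels to a linear amplitude multiplier (-20 dB -> 0.1)."""
    return 10 ** (db / 20)


def music_mix_filter(voice_seconds: float, music_db: float = -20.0,
                     fade_seconds: float = 2.0) -> str:
    """Build an FFmpeg -filter_complex string (hypothetical helper).

    Assumes input 0 is the voiceover and input 1 is the music track.
    """
    fade_out_start = max(voice_seconds - fade_seconds, 0.0)
    return (
        f"[1:a]volume={music_db}dB,"
        f"afade=t=in:st=0:d={fade_seconds},"
        f"afade=t=out:st={fade_out_start}:d={fade_seconds}[bg];"
        # normalize=0 (FFmpeg 4.4+) keeps amix from rescaling the voice.
        f"[0:a][bg]amix=inputs=2:duration=first:normalize=0[mix]"
    )


print(db_to_gain(-20))
print(music_mix_filter(60))
```

The resulting string could be used as `ffmpeg -i voice.mp3 -i music.mp3 -filter_complex "<filter>" -map "[mix]" mixed.m4a`, where the file names are placeholders. At -20 dB the music plays at one tenth of its original amplitude.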

Multi-aspect export

9:16 for TikTok and YouTube Shorts, 16:9 for YouTube long-form, 1:1 for Instagram feed. A single generation run exports all three aspect ratios in one ZIP, ready to post on every platform without re-rendering.
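The three aspect ratios map to concrete 1080p frame sizes. The sketch below computes them and builds a common FFmpeg scale-then-crop filter that fills the frame from a source of any shape. `frame_size` and `fit_filter` are hypothetical helpers, and fill-and-centre-crop is an assumed strategy rather than a documented ViralMint behaviour.

```python
def frame_size(aspect: str, short_side: int = 1080) -> tuple[int, int]:
    """Return (width, height) for an aspect string such as "9:16"."""
    w, h = (int(part) for part in aspect.split(":"))
    if w >= h:
        width, height = short_side * w / h, short_side
    else:
        width, height = short_side, short_side * h / w
    # Round to even numbers, which common H.264 pixel formats require.
    return int(round(width / 2) * 2), int(round(height / 2) * 2)


def fit_filter(aspect: str) -> str:
    """FFmpeg -vf string: scale up to cover the frame, then centre-crop."""
    width, height = frame_size(aspect)
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height}"
    )


for ratio in ("9:16", "16:9", "1:1"):
    print(ratio, frame_size(ratio), fit_filter(ratio))
```

This gives 1080x1920 for 9:16, 1920x1080 for 16:9 and 1080x1080 for 1:1.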

Frequently asked

What is a faceless video?

A faceless video is a short-form or long-form video that doesn't show a person on camera. Instead, an AI or human-written voiceover narrates over stock footage, AI-generated b-roll, screen recordings, or animated text. Faceless channels are popular on YouTube, TikTok and Instagram because they let creators scale without filming themselves — a single person can run multiple niches without showing their face.

How much does ViralMint's faceless video generator cost?

The stock-footage tier runs entirely on your machine with no per-video cloud cost: Pexels stock footage, Edge TTS voiceover (400+ included voices, 100+ languages), word-by-word captions and FFmpeg merging. New accounts get a daily starter allowance to evaluate the cloud AI features. Premium AI features (Nano Banana b-roll images, Gemini 3.1 Flash TTS voices, Lyria 3 Pro AI music, Sora 2 / Veo 3.1 / Seedance AI video clips) cost a few cents per video, billed per use via prepaid USD top-ups — no subscription.

How long does it take to make a faceless video?

Typing the prompt and clicking generate takes about 30 seconds. ViralMint then runs the 11-step pipeline (script, voice, transcription for caption timing, clip generation or stock search, stitching, music mix, caption burn, thumbnail extraction, AI-drafted title and tags) in the background: about 3-6 minutes for a 60-second short on Pexels stock footage, 8-15 minutes on the AI-video tier with Sora 2 Pro or Veo 3.1 clips. You get a notification when it's ready and a finished mp4 to download.

What niches work well for faceless YouTube videos?

Personal finance, news commentary, top-10 lists, productivity tips, history explainers, science breakdowns, motivational content and language learning all work well for faceless formats. Niches that need a personal presenter (vlogs, lifestyle, reaction content) don't translate as well. ViralMint's trending video finder can scout any niche across YouTube, TikTok, Douyin and Reddit to confirm demand before you commit to a channel.

Can I make faceless videos in languages other than English?

Yes. Edge TTS covers 100+ languages — Spanish, Portuguese, Mandarin, Japanese, German, French, Korean, Arabic and more — with multiple voices per language. The AI script writer accepts any language as the target. Word-by-word captions render Unicode cleanly (Chinese, Japanese, Arabic, Hebrew). The Whisper transcription used for caption timing supports 99 languages.

Start your first faceless video

Signing up takes 30 seconds. The browser version covers AI chat, AI image, AI voice and AI music, using the same account balance as the desktop app. The full Smart Video pipeline ships in the open-source desktop app.
