Pixelle-Video on Ubuntu: Headless CLI, Prompt-to-Video, and Character Persistence — A Deep Dive

By John | July 4, 2026

You've seen the headline: "Type one sentence, and AI makes the whole short video for you, fully automatically." No editing. No timeline. No Premiere Pro-induced existential crisis.

That's Pixelle-Video, an Apache 2.0-licensed open-source engine from AIDC-AI (now under ATH-MaaS) that's racked up 24,000+ GitHub stars and 3,500+ forks in its first few months. But here's the thing — almost every tutorial assumes you're clicking buttons in a browser. What if you want to run it headless on Ubuntu, trigger generation with a single CLI command, and keep your AI character looking consistent across every video?

I spent the weekend spelunking through the codebase. Here's everything I found.

🎬 What Pixelle-Video Actually Does

The pipeline is genuinely impressive:

Topic/Text → LLM Scripting → Image Prompt Gen → Media Generation (ComfyUI/API)
                                                      ↓
              Final MP4 ← BGM Mixing ← Frame Composition ← TTS Voiceover

In plain English: you feed it a topic like "How do black holes evaporate?" and it:

Writes a multi-scene narration script
Generates image prompts for each scene
Renders images via ComfyUI (local) or RunningHub (cloud)
Synthesizes voiceover with TTS
Composes each frame using HTML templates
Concatenates everything with background music

All of this happens automatically. The default output is vertical 1080×1920 — perfect for TikTok and YouTube Shorts.

🏗️ Architecture: What's Under the Hood

Here's the tech stack breakdown:

Layer	Technology	Role
Web UI	Streamlit (`web/app.py`)	Browser-based control panel
API	FastAPI (`api/app.py`)	REST endpoints on port 8000
Core Engine	`PixelleVideoCore` in `pixelle_video/service.py`	Orchestrates everything
LLM	OpenAI-compatible SDK	Scriptwriting, prompt generation
Media	ComfyUI (local) or RunningHub (cloud)	Image/video generation
TTS	Edge TTS + ComfyUI workflows	Voice synthesis with cloning
Templates	HTML/CSS (1080×1920, 1920×1080)	Frame layout and rendering
Video	FFmpeg + Playwright	Composition, rendering, concatenation

The critical insight: PixelleVideoCore is completely decoupled from the web UI. It's a standalone Python class with an async generate_video() method. The Streamlit UI and FastAPI are just wrappers around it.
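
To make the decoupling concrete, here is a minimal sketch of the wrapper pattern. StubCore is a hypothetical stand-in I wrote for illustration, not the real PixelleVideoCore. It only mirrors the initialize() / generate_video() / cleanup() shape and the video_path / duration result fields used in the CLI script in this post. The 6-seconds-per-scene figure is arbitrary.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class VideoResult:
    video_path: str
    duration: float


class StubCore:
    """Hypothetical stand-in for PixelleVideoCore; it renders nothing."""

    def __init__(self) -> None:
        self.ready = False

    async def initialize(self) -> None:
        self.ready = True

    async def generate_video(self, text: str, n_scenes: int = 5) -> VideoResult:
        if not self.ready:
            raise RuntimeError("call initialize() first")
        # Pretend each scene lasts 6 seconds (an arbitrary illustrative number).
        return VideoResult(video_path="output/demo.mp4", duration=6.0 * n_scenes)

    async def cleanup(self) -> None:
        self.ready = False


async def cli_frontend(core: StubCore, argv: list[str]) -> str:
    """CLI-style wrapper: positional topic in, printable line out."""
    result = await core.generate_video(text=argv[0])
    return f"Video generated: {result.video_path}"


async def api_frontend(core: StubCore, payload: dict) -> dict:
    """API-style wrapper: JSON-like dict in, JSON-like dict out."""
    result = await core.generate_video(
        text=payload["text"], n_scenes=payload.get("n_scenes", 5)
    )
    return {"video_url": result.video_path, "duration": result.duration}


async def demo() -> tuple[str, dict]:
    core = StubCore()
    await core.initialize()
    try:
        line = await cli_frontend(core, ["How do black holes evaporate?"])
        body = await api_frontend(core, {"text": "Same topic", "n_scenes": 3})
    finally:
        await core.cleanup()
    return line, body
```

Both front ends call the same method on the same object, which is why dropping the browser UI costs nothing: each wrapper only translates input and output.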

🖥️ The CLI Solution: 3 Approaches for Headless Ubuntu

The good news: Pixelle-Video already ships with everything you need for headless operation. Here are three approaches, from a single Python script to a containerized API.

Approach 1: Direct Python Script (Zero Web UI)

This is the cleanest. Write a small Python script that imports PixelleVideoCore directly:

#!/usr/bin/env python3
"""pixelle-cli.py: headless Pixelle-Video CLI for Ubuntu."""
import argparse
import asyncio

from pixelle_video.service import PixelleVideoCore


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("text", help="Topic or script for video generation")
    parser.add_argument("--mode", default="generate",
                        choices=["generate", "fixed"])
    parser.add_argument("--scenes", type=int, default=5)
    parser.add_argument("--template", default="1080x1920/image_default.html")
    parser.add_argument("--prompt-prefix",
                        default="consistent character: a wise old sage in a library, "
                                "warm lighting, detailed illustration style")
    parser.add_argument("--ref-audio", help="Reference audio for voice cloning")
    parser.add_argument("--output", help="Output path for video")
    return parser.parse_args()


async def main() -> None:
    args = parse_args()

    core = PixelleVideoCore()
    await core.initialize()
    try:
        result = await core.generate_video(
            text=args.text,
            mode=args.mode,
            n_scenes=args.scenes,
            frame_template=args.template,
            prompt_prefix=args.prompt_prefix,
            ref_audio=args.ref_audio,
            output_path=args.output,
        )
        print(f"✅ Video generated: {result.video_path}")
        print(f"   Duration: {result.duration:.1f}s")
    finally:
        # Release resources even if generation fails partway through.
        await core.cleanup()


if __name__ == "__main__":
    asyncio.run(main())

Usage:

uv run python pixelle-cli.py "How do quantum computers work?" \
  --scenes 5 \
  --prompt-prefix "consistent character: professor with glasses, lab coat, clean illustration style"
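
If you want a whole series with the same character, the natural next step is a batch loop that reuses one prompt_prefix. Below is a rough sketch of that idea. run_batch and fake_generate are names I made up, and the generate callable is injected so the example runs without Pixelle-Video installed. I have not established whether the real core tolerates concurrent generate_video() calls, so max_parallel defaults to 1.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_batch(
    topics: Sequence[str],
    generate: Callable[..., Awaitable[str]],
    prompt_prefix: str,
    max_parallel: int = 1,
) -> list[str]:
    """Generate one video per topic, reusing the same prompt_prefix."""
    gate = asyncio.Semaphore(max_parallel)

    async def one(topic: str) -> str:
        async with gate:  # at most max_parallel generations in flight
            return await generate(text=topic, prompt_prefix=prompt_prefix)

    # gather() returns results in the order the topics were given.
    return await asyncio.gather(*(one(topic) for topic in topics))


async def fake_generate(text: str, prompt_prefix: str) -> str:
    """Hypothetical stand-in for a real generation call; renders nothing."""
    await asyncio.sleep(0)
    return f"{prompt_prefix} :: {text}"
```

With the real engine, generate would be a small adapter that awaits core.generate_video(...) and returns result.video_path.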

Approach 2: API Server + curl (Headless but Still Running)

Start the API in the background, then hit it with curl:

# Terminal 1: Start headless API
uv run python api/app.py --host 127.0.0.1 --port 8000

# Terminal 2: Generate video via API
curl -X POST http://localhost:8000/api/video/generate/sync \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The history of sushi in 60 seconds",
    "mode": "generate",
    "n_scenes": 5,
    "frame_template": "1080x1920/image_default.html",
    "prompt_prefix": "consistent character: Japanese chef, warm kitchen, anime art style"
  }'

This returns a JSON response whose video_url field points to the generated MP4.
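
If you would rather call the endpoint from Python than from curl, the standard library is enough. This is a sketch under assumptions: the URL and request fields are copied from the curl call above, and I am assuming video_url sits at the top level of the response body. The helper names (build_request, extract_video_url, generate_sync) and the 600-second timeout are mine.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/video/generate/sync"


def build_request(
    text: str,
    prompt_prefix: str,
    n_scenes: int = 5,
    frame_template: str = "1080x1920/image_default.html",
    url: str = API_URL,
) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "text": text,
        "mode": "generate",
        "n_scenes": n_scenes,
        "frame_template": frame_template,
        "prompt_prefix": prompt_prefix,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_video_url(body: bytes) -> str:
    """Pull video_url out of the response (assumed to be a top-level key)."""
    data = json.loads(body)
    if "video_url" not in data:
        raise ValueError(f"no video_url in response: {data!r}")
    return data["video_url"]


def generate_sync(text: str, prompt_prefix: str, timeout: float = 600.0) -> str:
    """Send the request and block until the server answers."""
    request = build_request(text, prompt_prefix)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_video_url(response.read())
```

The generous timeout matters because the sync endpoint holds the connection open for the entire render.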

Approach 3: Docker with the API-Only Container

For production use, start only the API service defined in docker-compose.yml and skip the Streamlit web container:

# Clone and configure
git clone https://github.com/ATH-MaaS/Pixelle-Video.git
cd Pixelle-Video
cp config.example.yaml config.yaml
# Edit config.yaml with your API keys

# Start only the API container
docker compose up api -d

# Generate video via curl
curl -X POST http://localhost:8000/api/video/generate/async \
  -H "Content-Type: application/json" \
  -d '{"text": "5 life lessons from stoicism", "mode": "generate", "n_scenes": 6}'
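
The async endpoint implies you come back later for the result. I have not documented the response format here, so treat the following as a generic polling sketch rather than the project's actual protocol. The "status", "completed", "failed", and "error" names are hypothetical, and fetch_status is an injected callable you would implement against whatever task-status route the API exposes.

```python
import time
from typing import Callable


def wait_for_video(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the task reports completion, then return its video_url."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        state = status.get("status")  # hypothetical field name
        if state == "completed":
            return status["video_url"]
        if state == "failed":
            raise RuntimeError(status.get("error", "generation failed"))
        if clock() >= deadline:
            raise TimeoutError(f"task still {state!r} after {timeout:.0f}s")
        sleep(interval)
```

Injecting sleep and clock keeps the loop testable without real waiting.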

👤 Character Persistence: The Secret Sauce

This is where it gets interesting. Pixelle-Video doesn't have a built-in "character memory" system, but it exposes four levers for maintaining visual and vocal consistency:

Lever 1: `prompt_prefix` — Your Character's DNA

The prompt_prefix gets prepended to every image prompt. If your prefix is consistent and descriptive, your character stays consistent:

# Good — specific character description
prompt_prefix: "A young woman with short silver hair, round glasses, wearing a 
  navy blue lab coat, cartoon illustration style, Pixar-inspired, same character 
  in every image"

# Not so good — too vague
prompt_prefix: "Minimalist black-and-white matchstick figure style illustration"

The key: describe your character like you're filling out a police sketch form. Hair color, eye shape, signature accessory, art style — lock it all down.
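
One way to keep that "police sketch" from drifting between runs is to store the character as structured data and render the prefix from it. The sketch below is my own convention, not something Pixelle-Video provides. CharacterSheet and apply_prefix are hypothetical names, and joining prefix and scene prompt with a comma is an assumption about how prepending behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterSheet:
    subject: str             # who the character is, including hair
    traits: tuple[str, ...]  # signature accessories and outfit
    art_style: str

    def to_prefix(self) -> str:
        """Render the sheet as one comma-separated prompt prefix."""
        parts = [self.subject, *self.traits, self.art_style,
                 "same character in every image"]
        return ", ".join(parts)


def apply_prefix(prefix: str, scene_prompts: list[str]) -> list[str]:
    """Prepend the same prefix to every scene prompt."""
    return [f"{prefix}, {prompt}" for prompt in scene_prompts]


sheet = CharacterSheet(
    subject="A young woman with short silver hair",
    traits=("round glasses", "wearing a navy blue lab coat"),
    art_style="cartoon illustration style",
)
prompts = apply_prefix(sheet.to_prefix(),
                       ["reading in a library", "pointing at a chalkboard"])
```

Because the dataclass is frozen, the description cannot be mutated halfway through a batch, which is the property you actually want.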

Lever 2: Digital Human Extension (Lip-Sync Avatars)

Pixelle-Video's Digital Human extension module takes an image (your character) and generates a talking-head video synced to your TTS audio. This is the closest thing to "persistent character" in the current release:

Upload a reference image of your character
The system generates lip-synced video segments
Same character across all scenes

The workflows live in workflows/runninghub/ and workflows/selfhost/; look for the digital human pipelines.

Lever 3: Reference Audio for Voice Identity

For voice consistency, pass a ref_audio clip:

uv run python pixelle-cli.py "Today's tech news..." \
  --ref-audio "/home/john/voice-samples/my-narrator.wav"

The TTS engine will clone that voice across all scenes. Combined with a locked prompt_prefix, you get the same face and the same voice — your AI persona is born.
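
Before handing a clip to the cloner, it is worth checking that the file is what you think it is. Here is a small sketch using Python's built-in wave module. It only handles WAV input, and the 3 to 30 second window is a placeholder of mine, not a documented Pixelle-Video requirement. make_silence exists only so the check can be tried without a real recording.

```python
import io
import wave


def make_silence(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence in memory (handy for dry runs)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def describe_wav(source) -> dict:
    """Read basic facts from a WAV; source is a path string or binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "seconds": wav.getnframes() / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def check_reference_clip(source, min_seconds: float = 3.0,
                         max_seconds: float = 30.0) -> list[str]:
    """Return a list of problems; an empty list means the clip looks usable."""
    info = describe_wav(source)
    problems = []
    if info["seconds"] < min_seconds:
        problems.append(f"clip is {info['seconds']:.1f}s, shorter than {min_seconds}s")
    if info["seconds"] > max_seconds:
        problems.append(f"clip is {info['seconds']:.1f}s, longer than {max_seconds}s")
    return problems
```

Run check_reference_clip("/path/to/clip.wav") before passing the same path to --ref-audio.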

Lever 4: Image-to-Video + Motion Transfer

Two more extension modules that help with character consistency:

Image-to-Video: Feed a character portrait → get animated clips of that same character
Motion Transfer: Take a reference dance/motion video + your character image → your character performs the motion

For a persistent character workflow: generate one high-quality character image → use it as seed → run Image-to-Video for each scene → same face, different actions.

📦 Full Ubuntu Installation (Step by Step)

# Prerequisites
sudo apt update && sudo apt install -y ffmpeg curl fonts-noto-cjk

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# Clone Pixelle-Video
git clone https://github.com/ATH-MaaS/Pixelle-Video.git
cd Pixelle-Video

# Configure
cp config.example.yaml config.yaml
# Edit config.yaml:
#   - Set LLM provider (OpenAI, Qwen, DeepSeek, or local Ollama)
#   - Choose ComfyUI URL (local) or RunningHub API key (cloud)
#   - Set image style prompt_prefix

# Install dependencies
uv sync
uv run playwright install --with-deps chromium

# Test your setup
uv run python -c "
import asyncio
from pixelle_video.service import PixelleVideoCore
async def test():
    core = PixelleVideoCore()
    await core.initialize()
    print('✅ Pixelle-Video core initialized successfully')
    await core.cleanup()
asyncio.run(test())
"

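If the smoke test above fails, the cause is often something mundane, such as a missing binary or a config file that was never copied. The following preflight sketch checks for exactly the things the steps above create. The preflight function and its messages are my own, and the which parameter is injectable purely so the logic can be exercised without touching the real PATH.

```python
import shutil
from pathlib import Path

# Executables the installation steps above rely on.
REQUIRED_TOOLS = ("ffmpeg", "uv", "git")


def preflight(project_dir: str, tools=REQUIRED_TOOLS, which=shutil.which) -> list[str]:
    """Return a list of setup problems; an empty list means nothing obvious is missing."""
    problems = [f"missing executable: {tool}" for tool in tools if which(tool) is None]
    if not (Path(project_dir) / "config.yaml").is_file():
        problems.append("config.yaml not found; copy config.example.yaml first")
    return problems
```

Call preflight(".") from inside the Pixelle-Video checkout and print whatever comes back.
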
☁️ ComfyUI vs RunningHub: Choosing Your Media Backend

Pixelle-Video supports two media generation paths:

	Local ComfyUI	RunningHub Cloud
Cost	Free (your electricity)	Pay-per-generation (~$0.01-0.10/video)
GPU	Required (6GB+ VRAM recommended)	None needed
Speed	Depends on your hardware	Consistent cloud performance
Privacy	Everything stays local	Images processed in cloud
Setup	Install ComfyUI + workflows + models	Just an API key

My recommendation for integrated GPUs like the Radeon 800M: an iGPU shares system memory instead of having dedicated VRAM, so the 6GB+ that Flux workflows want in ComfyUI is a tight fit. RunningHub is the pragmatic choice. Set runninghub_api_key in config.yaml and use the runninghub/ workflows.
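
The decision rule in the table boils down to a few lines. This is only a sketch of my rule of thumb: pick_workflow_dir is a hypothetical helper, the 6GB threshold is the figure from the table above, and the return values are the two workflow directory names mentioned earlier in this post.

```python
def pick_workflow_dir(dedicated_vram_gb: float, has_runninghub_key: bool,
                      min_vram_gb: float = 6.0) -> str:
    """Choose between the selfhost/ and runninghub/ workflow directories."""
    if dedicated_vram_gb >= min_vram_gb:
        return "selfhost"  # enough local VRAM for ComfyUI
    if has_runninghub_key:
        return "runninghub"  # fall back to the cloud backend
    raise ValueError("not enough VRAM for local ComfyUI and no RunningHub API key set")
```

For an iGPU with no dedicated VRAM, pick_workflow_dir(0.0, True) lands on the cloud path.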

🧩 Bonus: Pixelle-MCP — The Agent-Native Approach

The same team also built Pixelle-MCP, which exposes ComfyUI workflows as MCP (Model Context Protocol) tools. This means AI agents can directly call video generation workflows without touching the Pixelle-Video web UI at all.

# Install and run Pixelle-MCP CLI
uvx pixelle@latest
# Or: pip install pixelle && pixelle

This is arguably the most "headless" approach: a unified CLI that bridges LLMs and ComfyUI, with an MCP endpoint at http://localhost:9004/pixelle/mcp. You get:

Web interface at http://localhost:9004
MCP server for Claude Desktop, Cursor, etc.
Zero-code workflow → MCP tool conversion
Both ComfyUI and RunningHub backends

⚡ Real-World Performance Notes

A 5-scene video on RunningHub takes about 2-4 minutes end-to-end
The LLM scripting phase is fast — under 10 seconds with most providers
The bottleneck is always media generation (image/video models)
Static templates (no image generation) complete in under 30 seconds
TTS with voice cloning adds about 5-10 seconds per scene

🎯 The TL;DR

Pixelle-Video is ready for headless Ubuntu CLI today — the core engine is decoupled from the UI
Three CLI approaches: Direct Python script (cleanest), curl + API server (simplest), Docker API-only (production)
Character persistence is achievable via prompt_prefix, the Digital Human extension, reference audio cloning, and the Image-to-Video and Motion Transfer modules
On integrated GPUs, use RunningHub cloud — iGPUs won't comfortably run Flux image models
24K stars for a reason — this is the most polished open-source short-video pipeline right now

Questions? Found a better CLI approach? Drop a comment below — or fork the repo and ship a PR. The Pixelle-Video team is actively accepting contributions.

Built and tested on Ubuntu 26.04 (Resolute Raccoon) with kernel 7.0.0-22-generic.