ReelMistri, then OpenReels
Two weeks ago I built ReelMistri, a Python CLI that generates Bangla-language YouTube Shorts. One command, full video. It worked. The pipeline ran, the TTS spoke Bengali, the captions rendered, the video assembled.
And the output looked like every other AI video on the internet. Stock image, fade, stock image, fade, stock image. Some ambient track that had nothing to do with the content. The pieces were all technically fine in isolation, but the video felt like it was made by five people who never talked to each other.
Because it was. Each model generated its part without knowing what the others were doing.
I started OpenReels two days later. Same week. The Python prototype proved the pipeline concept; the TypeScript rewrite was about fixing the coordination problem.
DirectorScore
The first thing I built in OpenReels wasn’t a pipeline stage. It was a JSON schema.
Think about how film production works. The director doesn’t operate the camera and mix the audio and score the music. The director writes a plan. Every department executes from the same plan. I’d never been on a film set, but the analogy was obvious once I saw the problem.
So the creative director LLM generates a structured production plan before any assets exist. I call it the DirectorScore. Every downstream stage reads from it:
```typescript
// Simplified - real schema has ~40 fields per scene
interface DirectorScore {
  scenes: Array<{
    sceneNumber: number;
    visual: {
      type: 'ai_image' | 'ai_video' | 'stock_video' | 'text_card';
      description: string;
      motion: { type: string; intensity: string };
    };
    voiceover: string;
    transition: { type: string; durationMs: number };
    emotionalIntensity: number; // 1-10, drives music
  }>;
  musicDirection: {
    genre: string;
    mood: string;
    instruments: string[];
  };
}
```
The image prompter reads `visual.description` plus the archetype's style bible to write generation prompts. The music prompter reads `emotionalIntensity` per scene and writes timestamp-synced Lyria prompts. Captions read `voiceover` and word timestamps. Assembly reads `transition` and `motion` for Remotion composition props.
Same concept as an API contract. One model defines the interface, everything else implements against it.
Once this existed, the output actually started to feel directed. Scene 4 is a tense close-up, scene 5 is a wide establishing shot with a musical swell, and every model knows that because they’re all reading from the same plan.
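To make the contract concrete, here's a sketch of how the image prompter might combine a scene's `visual.description` with an archetype's style bible. The helper and field names are hypothetical, not the actual OpenReels code:

```typescript
// Hypothetical shapes -- the real style bible has more fields.
interface SceneVisual {
  type: string;
  description: string;
}

interface StyleBible {
  artDirection: string;
  lighting: string;
  palette: string;
  antiArtifacts: string[];
}

// The prompter never invents content: scene meaning comes from the
// DirectorScore, look-and-feel comes from the archetype's style bible.
function buildImagePrompt(visual: SceneVisual, bible: StyleBible): string {
  return [
    visual.description,
    bible.artDirection,
    bible.lighting,
    bible.palette,
    ...bible.antiArtifacts.map((a) => `avoid: ${a}`),
  ].join(', ');
}
```

The point is the direction of the dependency: the prompter is a pure function of the score plus static style data, so it can't drift from the plan.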
The pipeline
Six stages, orchestrated by Mastra:
| Stage | What happens |
|---|---|
| Research | Web search grounds the script in facts, not hallucinations |
| Creative direction | Generates the DirectorScore: script, scene plan, visual types, transitions, emotional arc |
| Voiceover | TTS with word-level timestamps for karaoke-style captions |
| Visuals | AI images, AI video clips, and vision-verified stock footage |
| Music | AI-generated score via Lyria 3 Pro, synced to the video’s emotional arc |
| Assembly | Remotion composites everything into a vertical MP4 |
A critic agent at the end scores the output against a rubric. Below threshold, the pipeline re-runs creative direction with revision notes attached. Capped at one retry so it doesn’t loop forever.
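The retry gate is a small loop. A minimal sketch, assuming hypothetical `runCreativeDirection` and `critique` functions (the real pipeline runs these as Mastra steps, and the threshold value here is illustrative):

```typescript
interface CriticVerdict {
  score: number;
  notes: string[];
}

const THRESHOLD = 7;
const MAX_RETRIES = 1; // capped so it doesn't loop forever

async function directWithCritic(
  topic: string,
  runCreativeDirection: (topic: string, notes: string[]) => Promise<string>,
  critique: (plan: string) => Promise<CriticVerdict>,
): Promise<string> {
  let plan = await runCreativeDirection(topic, []);
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    const verdict = await critique(plan);
    if (verdict.score >= THRESHOLD) break;
    // Re-run creative direction with the critic's revision notes attached.
    plan = await runCreativeDirection(topic, verdict.notes);
  }
  return plan;
}
```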
I went with Mastra over LangChain because LangChain is too opinionated about how you talk to LLMs. I had my own provider layer already and didn’t want two abstractions fighting each other. Mastra gives you typed step composition, retry logic, and event emission without touching the model calls. It stays out of the way.
What broke (and what I did about it)
Stock footage is mostly garbage
Search for “Apollo 13 launch” on any stock API and you’ll get toy rockets, the moon from a different mission, and space shuttle footage from the wrong decade.
So every stock result now gets verified by a VLM before it enters the video. The model looks at the image and decides if it actually matches the scene description. If not, the pipeline rewrites the search query in terms that stock APIs understand (concrete visual nouns, no proper nouns) and tries again. After 3 misses, AI image generation takes over. The VLM’s rejection reasons get fed into the image prompt as negative examples, so the generated image avoids the same mistakes.
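The verify-or-fallback loop looks roughly like this. All four helpers (`searchStock`, `verifyWithVLM`, `rewriteQuery`, `generateImage`) are hypothetical stand-ins for the real pipeline stages:

```typescript
interface VlmVerdict {
  matches: boolean;
  reason: string;
}

async function resolveVisual(
  sceneDescription: string,
  deps: {
    searchStock: (query: string) => Promise<string>; // returns a clip URL
    verifyWithVLM: (clipUrl: string, desc: string) => Promise<VlmVerdict>;
    rewriteQuery: (query: string, reason: string) => Promise<string>;
    generateImage: (desc: string, negatives: string[]) => Promise<string>;
  },
): Promise<string> {
  const rejections: string[] = [];
  let query = sceneDescription;
  for (let miss = 0; miss < 3; miss++) {
    const clip = await deps.searchStock(query);
    const verdict = await deps.verifyWithVLM(clip, sceneDescription);
    if (verdict.matches) return clip;
    rejections.push(verdict.reason);
    // Rewrite in concrete visual nouns the stock API understands.
    query = await deps.rewriteQuery(query, verdict.reason);
  }
  // After 3 misses, AI generation takes over, with the VLM's
  // rejection reasons fed in as negative examples.
  return deps.generateImage(sceneDescription, rejections);
}
```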
Before this, maybe 40% of stock footage actually matched the scene. That number is way higher now.
Characters don’t stay consistent
Generate 8-12 AI images for a video and you want the Roman senator in scene 1 to look like the same person in scene 8. He won’t.
I built a style bible system. Each archetype defines art direction, lighting, composition, palette, and anti-artifact instructions. The image prompter injects all of this into every generation prompt. It constrains the output enough that scenes feel related, but it can’t guarantee character identity across independent generation calls.
I don’t have a fix for this. Nobody does, really, not with independent image generation. Reference images and IP adapters help. I haven’t integrated them yet. The architecture has a slot for it when I do.
Background music was filler
Before AI music, every video got a random royalty-free track picked by mood tag. It sat underneath the video without following it. Fine for a first version. Clearly filler.
Now the music prompter reads each scene’s emotionalIntensity (1-10) and writes timestamp-synced instructions for Lyria 3 Pro:
```
[0:00 - 0:12] sparse piano, building tension. Intensity: 3/10
[0:12 - 0:25] strings enter, rising urgency. Intensity: 6/10
[0:25 - 0:38] full orchestra, climax. Intensity: 9/10
[0:38 - 0:45] resolve, single piano note. Intensity: 2/10
```
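Producing those lines from the score is mechanical. A sketch, assuming a hypothetical `SceneBeat` shape derived from the DirectorScore's per-scene timing and `emotionalIntensity`:

```typescript
interface SceneBeat {
  startMs: number;
  endMs: number;
  direction: string;          // e.g. 'sparse piano, building tension'
  emotionalIntensity: number; // 1-10, straight from the DirectorScore
}

// m:ss formatting for the prompt timestamps.
function formatTime(ms: number): string {
  const totalSec = Math.floor(ms / 1000);
  const m = Math.floor(totalSec / 60);
  const s = totalSec % 60;
  return `${m}:${String(s).padStart(2, '0')}`;
}

function musicPromptLines(beats: SceneBeat[]): string[] {
  return beats.map(
    (b) =>
      `[${formatTime(b.startMs)} - ${formatTime(b.endMs)}] ${b.direction}. Intensity: ${b.emotionalIntensity}/10`,
  );
}
```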
Every video gets a unique score that actually follows what’s happening on screen. Night and day compared to a random lofi track.
Architecture
Remotion (and why I rewrote from Python)
ReelMistri used ffmpeg for video assembly. It works, but your video logic ends up living in shell commands and filter graphs. Debugging a 200-character ffmpeg invocation at 2am is not fun.
With Remotion, the entire video is a React component tree:
```
OpenReelsVideo
├── Sequence (per scene)
│   ├── AIImageBeat / StockVideoBeat / TextCardBeat
│   │   └── Ken Burns zoom/pan
│   └── Transition (crossfade / slide / wipe / flip)
├── CaptionOverlay (word-level timing)
├── VoiceoverTrack
└── MusicTrack (automatic ducking under voiceover)
```
Adding a new visual type is writing a React component, not parsing ffmpeg docs. That alone justified the rewrite.
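For a flavor of what "video logic as code" buys you: a Ken Burns zoom is just interpolation over frames. Remotion ships its own `interpolate` helper; this standalone version is a simplified sketch for illustration, not the actual OpenReels component code:

```typescript
// Clamped linear interpolation over a frame range, in the spirit of
// Remotion's interpolate().
function interpolateLinear(
  frame: number,
  [f0, f1]: [number, number],
  [v0, v1]: [number, number],
): number {
  const t = Math.min(1, Math.max(0, (frame - f0) / (f1 - f0)));
  return v0 + t * (v1 - v0);
}

// Slow zoom from 1.0x to 1.1x over a scene's duration; inside a Remotion
// <Sequence>, this value would drive a CSS transform on the image.
function kenBurnsScale(frame: number, durationInFrames: number): number {
  return interpolateLinear(frame, [0, durationInFrames], [1.0, 1.1]);
}
```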
Provider layer
Every AI capability has a base interface (`BaseLLM`, `BaseTTS`, etc.), concrete implementations for each provider, and a factory that instantiates the right one from job config. API keys come in per-job, never stored server-side. Adding a new provider is one file and one line in the factory.
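A minimal sketch of the pattern, using TTS as the example. The interface shape and the `cloud_tts` provider name are illustrative, not the real OpenReels code:

```typescript
interface BaseTTS {
  synthesize(text: string): Promise<{ audioPath: string }>;
}

// Illustrative implementations -- real ones call provider APIs
// and return word-level timestamps alongside the audio.
class KokoroTTS implements BaseTTS {
  async synthesize(text: string) {
    return { audioPath: `/tmp/kokoro-${text.length}.wav` };
  }
}

class CloudTTS implements BaseTTS {
  constructor(private apiKey: string) {}
  async synthesize(text: string) {
    return { audioPath: `/tmp/cloud-${text.length}.wav` };
  }
}

// Factory: one case per provider. API keys arrive per-job.
function createTTS(job: { ttsProvider: string; apiKey?: string }): BaseTTS {
  switch (job.ttsProvider) {
    case 'kokoro':
      return new KokoroTTS();
    case 'cloud_tts':
      return new CloudTTS(job.apiKey ?? '');
    default:
      throw new Error(`Unknown TTS provider: ${job.ttsProvider}`);
  }
}
```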
I built 15+ integrations across LLM, TTS, image, video, music, and stock. Too many for launch, honestly. The abstraction layer is clean, but 3-4 providers would have been enough to ship and I could have added the rest after.
Job queue and streaming
The API server and worker are separate processes because Remotion rendering (Chromium under the hood) is CPU-heavy and would block the API. BullMQ + Redis handles the queue. The worker emits progress through Redis pub/sub, the API server picks it up and forwards it as SSE to the browser.
SSE over WebSockets because the data only flows one direction. Simpler reconnection, works through proxies, less code. The useSSE hook on the frontend manages the EventSource lifecycle.
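The wire format is simple enough to show: each progress message becomes an `event:`/`data:` pair terminated by a blank line, which is exactly what the browser's `EventSource` parses. A sketch with a hypothetical event shape:

```typescript
interface ProgressEvent {
  stage: string;   // e.g. 'visuals'
  percent: number; // 0-100
}

// One SSE frame: field lines, then a blank line to delimit the event.
function formatSSE(eventName: string, data: ProgressEvent): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The API server writes frames like this to the response stream as they arrive from Redis pub/sub; reconnection is handled by `EventSource` itself.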
Archetypes
14 visual styles. Each one controls color palette, caption style, transitions, lighting, pacing tier, and a style bible for AI image generation. Same topic through two different archetypes looks like two different channels made it.
Pacing tiers control scene count: fast archetypes (infographic, bold_illustration, comic_book) produce 8-12 quick cuts. Cinematic ones (cinematic_documentary, moody_cinematic, studio_realism, warm_narrative, pastoral_watercolor) produce 5-8 deliberate scenes. The rest fall in between.
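As code, the tier lookup is a pair of sets and a switch. The tier membership mirrors the archetypes above; the scene-count range for the middle tier is my assumption, since the post only says "in between":

```typescript
type PacingTier = 'fast' | 'standard' | 'cinematic';

const FAST = new Set(['infographic', 'bold_illustration', 'comic_book']);
const CINEMATIC = new Set([
  'cinematic_documentary',
  'moody_cinematic',
  'studio_realism',
  'warm_narrative',
  'pastoral_watercolor',
]);

function pacingTier(archetype: string): PacingTier {
  if (FAST.has(archetype)) return 'fast';
  if (CINEMATIC.has(archetype)) return 'cinematic';
  return 'standard';
}

function sceneCountRange(tier: PacingTier): [number, number] {
  switch (tier) {
    case 'fast':
      return [8, 12]; // quick cuts
    case 'cinematic':
      return [5, 8]; // deliberate scenes
    case 'standard':
      return [6, 10]; // hypothetical middle ground
  }
}
```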
How long it took
First commit on OpenReels was March 28. Scaffolding and project setup took a couple days. v0.1.0 shipped March 31 with the full core pipeline: research, script, voiceover, visuals, captions, assembly, 14 archetypes, 6 caption styles, 5 agents, dual LLM support.
From there it was roughly one major feature per day. Transitions and Docker. Web UI with SSE streaming. Vision-verified stock footage. Mastra migration. Then April 4 happened and I somehow shipped four things: Gemini as third LLM, AI video via Veo and Kling, three new TTS providers with a unified alignment layer, and AI music via Lyria 3 Pro. I don’t know how that day worked either. Pacing tiers and a 47-page docs site on April 5, then bug fixes through April 7.
v0.13.2, 288 tests, 10 days from first commit.
The DirectorScore is the reason it moved this fast. Every new feature had a clear integration point: you read from the score or write to it. And the provider abstraction made adding a new provider mechanical work: one file, one factory entry, done.
What I’d do differently
I’d build the web UI first. The CLI came first because it was faster, but the web UI is what actually makes the pipeline comprehensible. Six stages streaming live in the browser, you can watch the AI think. Should have been day 1, not day 3.
Visuals still render sequentially even though scenes are independent. Parallelizing them would cut generation time roughly in half. I keep meaning to do it.
Numbers
A typical video costs about $0.70 in API calls: $0.003 for LLM, $0.017 for TTS, $0.30 for images, $0.30 for a video clip, $0.08 for music. You can go cheaper with Kokoro (free local TTS), bundled music tracks, and stock-only visuals. Generation takes 3-8 minutes depending on scene count.
288 tests, 15+ provider integrations, 14 archetypes, 6 caption styles.
Try it
```shell
git clone https://github.com/tsensei/OpenReels.git
cd OpenReels
cp .env.example .env   # add your API keys
docker compose up      # Redis + API + Worker
# http://localhost:3000
```
Or:
```shell
docker run --env-file .env --shm-size=2gb \
  -v ./output:/output ghcr.io/tsensei/openreels \
  "the apollo 13 disaster"
```