TL;DR
AutoShorts.ai and Canva will get you to "good enough" in an afternoon. This post is for people who need something different: full control over TTS voice, custom Remotion compositions, deterministic artifact storage, and a publish layer you own end-to-end. The tradeoff is infrastructure you have to build and operate. Here is exactly what that looks like.
1 Why build your own pipeline instead of using a SaaS tool
The honest answer is that most of the time you should not build your own pipeline. Canva, AutoShorts.ai, OpusClip, and Pictory solve the 80% case: you give them a long video or a topic, they give you clips, you publish.
Build your own if you need one or more of these things:
Voice fidelity at the character level. If you have a fine-tuned checkpoint for a specific speaker (e.g. FEMALE_01 from the IMDA NSC corpus) and you need that exact voice, no SaaS tool gives you that. You need to own the TTS call.
Composition programmability. Remotion lets you express video structure as React components with TypeScript props. KaTeX equations, animated code blocks, data-driven charts - these are impossible in template-based editors.
Artifact ownership. Your rendered video, the captions JSON, the TTS audio file, the QA report - they all need to be in your storage, addressable, versioned, and deletable on your schedule.
Multi-platform publish logic you control. If you need the same render to hit LinkedIn, YouTube, TikTok, and Instagram with platform-specific caption formatting and retry handling, you need code you wrote.
Near-zero marginal render cost. On a self-hosted GPU, a 60-second Short costs roughly the same whether you render it once or fifty times. That economic profile makes 136-render iteration cycles viable. SaaS pricing models punish that pattern.
If none of those apply to you, stop reading and open AutoShorts.ai. This guide is not for "faceless YouTube channel" automation - those tools optimise for volume at minimal cost. This is for builders who need compositional control and iteration depth at production quality.
One important risk to acknowledge upfront: YouTube's July 2025 policy update renamed "repetitious content" to "inauthentic content" and explicitly targets mass-produced, template-based AI videos lacking originality. A custom pipeline that produces differentiated compositions is better positioned than a SaaS tool generating near-identical outputs - but you still need to understand the policy boundary. See the AI content rules guide for the full breakdown.
2 What 136 render cycles taught us about pipeline architecture
Across 38 Remotion sessions, one number stands out: 136 renders, 533 TTS cycles, 419 user turns, 34 calendar days - all for a single video composition.
That is not a pathological case. It is the gold standard for a complex composition (TTS + music + SFX + animated math). The iteration statistics by composition type:
| Complexity tier | Renders | TTS cycles | Calendar time |
| --- | --- | --- | --- |
| Simple (1 composition, pre-recorded audio) | 2–5 | 0 | 1 day |
| Medium (custom TTS, 5–10 composition steps) | 10–20 | 20–60 | 3–5 days |
| Complex (TTS + music + SFX + B-roll) | 50–136 | 100–533 | 2–5 weeks |
The implication for architecture: every decision you make needs to be evaluated against the 136-render case, not the 5-render case. Authentication overhead, artifact versioning, billing gates, concurrency caps - they all compound over 136 cycles. A design that is "fine for testing" can become a GPU cost leak or a race condition in production.
Three specific lessons from those 38 sessions:
Phases 2–6 form a tight loop, not a waterfall. The first pass through scripting, TTS, composition, and render is a draft. You will re-enter phase 2 (scripting) after watching a render many times. Design the system to make that re-entry cheap.
TTS text is the highest-churn artifact. Script text changes on almost every iteration cycle. Caching TTS output is only useful if you hash the exact input text. A single word change should invalidate the cache.
The human QA step is not optional. Models cannot watch video. Temporal quality (pacing, visual rhythm, whether the equation appears at the right moment relative to narration) requires a human to watch the output. Budget this time.
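The "hash the exact input text" lesson can be made concrete. Here is a minimal sketch of a TTS cache key; the function name and the idea of including voice and checkpoint identifiers in the key are illustrative assumptions, not the production code:

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key helper. The key covers everything that changes
// the audio output: the exact narration text plus the voice and checkpoint
// identifiers. Any single-word edit yields a new key and a cache miss,
// which is exactly the invalidation behaviour described above.
export function ttsCacheKey(
  text: string,
  voiceId: string,
  checkpoint: string,
): string {
  return createHash("sha256")
    .update(`${voiceId}\n${checkpoint}\n${text}`)
    .digest("hex");
}
```

Because the key is a digest of the full text, there is no risk of serving stale audio after a "small" edit - there is no notion of a small edit at the cache layer.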
3 The 7-phase workflow
Every session, regardless of complexity, moves through the same seven phases. The iteration count determines how many times phases 2–6 repeat.
Phase 1: Analysis → read MDX / script source, extract key claims
Phase 2: Scripting → refine narration text, adjust pacing, timing cues
Phase 3: TTS → generate audio from text
Phase 4: Composition → wire audio + visual into Remotion composition props
Phase 5: Rendering → `npx remotion render` → mp4 artifact
Phase 6: QA / Review → human watches rendered video
Phase 7: Iterate → return to phase 2, 3, or 4 based on what broke
Phases 2–6 account for roughly 80% of all turns across every session. Phase 1 happens once per video. Phase 7 is not a phase at all - it is the decision to loop again.
The practical consequence: your orchestration layer needs a fast path for partial re-renders. If only the TTS changed, you should not need to re-run the full composition pipeline. If only a text overlay changed, you should not need to regenerate TTS. The Inngest job/render-requested event exists exactly for this - it re-enters the pipeline at the right phase given what changed.
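The re-entry decision can be sketched as a pure function. The input names below are illustrative assumptions (the real event payload shape is not shown in this post); the phase numbers follow the 7-phase workflow above:

```typescript
// Hypothetical sketch: given which inputs changed since the last render,
// return the earliest phase the pipeline must re-enter. Earlier phases
// win because everything downstream of them is invalidated.
type ChangedInput = "script" | "ttsText" | "compositionProps" | "overlayOnly";

export function reentryPhase(changed: ChangedInput[]): number {
  if (changed.includes("script")) return 2;           // re-script, then TTS, then compose
  if (changed.includes("ttsText")) return 3;          // regenerate audio only
  if (changed.includes("compositionProps")) return 4; // rewire props, reuse audio
  return 5;                                           // overlay-only or no upstream change: render
}
```

The point of encoding this as data rather than ad-hoc conditionals is that phase skipping is the single biggest lever on iteration cost over 136 cycles.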
4 TTS integration: IndexTTS2 on RunPod serverless
The TTS layer is the most production-sensitive part of the pipeline. A bad TTS generation blocks every downstream phase.
We use IndexTTS2 deployed on RunPod serverless with a fine-tuned checkpoint: FEMALE_01, checkpoint model_step14000.pth, 7.2 GB. The fine-tuning corpus is the IMDA NSC dataset. Full benchmark and finetuning methodology: IndexTTS2 Finetuning on IMDA NSC FEMALE_01.
The TTS extraction priority when assembling the text payload:
1. normalizedBriefJson.script ← AI-generated brief (highest fidelity)
2. normalizedBriefJson.caption ← fallback if script absent
3. jobInput.rawText ← user-provided raw text
4. job.inputSummary ← last resort (compressed, may lose nuance)
This priority chain matters because the brief generation and TTS generation are separate pipeline steps. A render-requested event may arrive before the brief is finalized. The fallback chain ensures TTS always has something to work with, but the logs should flag which source was used - degraded TTS quality is diagnosable from this field.
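A minimal sketch of that fallback chain, assuming the field names from the priority list (the surrounding types and the function name are illustrative, not the production schema):

```typescript
// Resolve TTS input text by priority, and report which source was used
// so degraded quality is diagnosable from the logs, as described above.
interface TtsSource {
  normalizedBriefJson?: { script?: string; caption?: string };
  jobInput?: { rawText?: string };
  inputSummary?: string;
}

export function resolveTtsText(
  job: TtsSource,
): { text: string; source: string } | null {
  const brief = job.normalizedBriefJson;
  if (brief?.script) return { text: brief.script, source: "brief.script" };
  if (brief?.caption) return { text: brief.caption, source: "brief.caption" };
  if (job.jobInput?.rawText) return { text: job.jobInput.rawText, source: "rawText" };
  if (job.inputSummary) return { text: job.inputSummary, source: "inputSummary" };
  return null; // caller should fail the TTS step loudly, not synthesize silence
}
```

Logging the `source` field on every TTS call costs nothing and saves real debugging time when a render sounds wrong.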
RunPod serverless gives cold-start latency of 15–30 seconds for the 7.2 GB checkpoint. For production throughput, keep a warm instance if you are generating more than 10 videos per day. The cost of a warm pod ($0.44/hr for an A40) is usually cheaper than cold-start latency on a busy day.
5 Inngest orchestration: the job → render → publish flow
The full job lifecycle runs through five Inngest event handlers:
job/created
→ load job + resolve org membership
→ verify billing (credits or active subscription)
→ dispatch render payload to worker
→ set job.status = "rendering"
render_completed (worker callback)
→ validate HMAC signature
→ create artifact rows in DB (video, caption, thumbnail, transcript, qa_report)
→ set job.status = "draft_ready"
→ notify user
job/render-requested (re-render from editor)
→ same flow as job/created dispatch path
→ concurrency cap: 5 active renders per org
→ previous render artifacts preserved until new render succeeds
job/publish-requested
→ load publish attempt row
→ fetch video + caption artifacts (latest isCurrent versions)
→ call provider API with platform-specific payload
→ record providerPostId + providerPostUrl
→ set publish_attempt.status = "published"
data/cleanup (scheduled, 30-day grace)
→ soft-delete expired artifacts
→ no hard R2 deletion without explicit org request
The key design decision: billing is verified at three points rather than one - job creation, render dispatch (re-renders also cost credits), and publish (some platforms may charge per-post under future billing models). Checking only at creation means a user who re-renders 50 times can exhaust credits on compute without any gate catching it until the next job creation attempt.
6 Worker boundary design: why the render worker is external
The Remotion render process is not suitable for a serverless Next.js environment. A 60-second Short at 1080p takes 3–8 minutes of CPU/GPU time and produces a 50–200 MB artifact. Vercel function timeouts (max 300 seconds on Pro) and memory limits (3 GB) make this impossible to run inline.
The worker is a separate service - either RunPod serverless or a self-hosted GPU node - that accepts a render payload and returns results via a callback webhook:
Authentication between the Next.js app and the worker uses HMAC with timing-safe comparison on the X-Worker-Secret header. A simple string equality check is not acceptable - timing attacks can leak the secret character by character against a predictable-latency endpoint.
The worker dispatches to RunPod primary and falls back to a self-hosted GPU if RunPod returns a cold-start timeout. The fallback is not a nice-to-have - production render jobs have SLA expectations and RunPod has occasional availability issues during peak GPU demand.
7 Concurrency and billing gates
Two things that will hurt you in production if you do not design for them upfront:
GPU cost abuse. A user who triggers re-renders in a loop (either maliciously or via a UI bug) can run up GPU costs faster than any per-seat pricing model anticipates. The concurrency cap of 5 active renders per org is enforced at the Inngest function level using concurrencyKey: org.id. When the 6th render is requested, Inngest queues it rather than dispatching - this prevents runaway parallelism without blocking the user entirely.
Credit accounting on re-renders. Every render dispatch should deduct credits, not just job creation. The job/render-requested handler must check balance before dispatching. If balance is zero, the handler should reject with a descriptive error that the UI can surface - not silently fail and leave the job in a stale state.
Billing gate at publish time. This is the gate most builders forget. If your business model involves per-publish charges or platform-specific quotas, the job/publish-requested handler needs to verify entitlements before calling any provider API. A publish that succeeds on LinkedIn but fails on YouTube because of a quota check mid-flight creates an inconsistent state that is hard to explain to users.
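The dispatch-time rules above can be condensed into one gate function. This is an illustrative sketch, not the production schema - the names and the credit model are assumptions; the two behaviours it encodes come from this section (re-renders cost credits, and the 6th concurrent render queues rather than dispatching):

```typescript
// Hypothetical pre-dispatch gate, evaluated on every render request.
type GateResult =
  | { action: "dispatch" }
  | { action: "queue"; reason: string }
  | { action: "reject"; reason: string };

const MAX_ACTIVE_RENDERS_PER_ORG = 5;

export function renderDispatchGate(
  creditBalance: number,
  creditsPerRender: number,
  activeRenders: number,
): GateResult {
  if (creditBalance < creditsPerRender) {
    // Descriptive error the UI can surface - never fail silently and
    // leave the job in a stale state.
    return { action: "reject", reason: "Insufficient credits for this render" };
  }
  if (activeRenders >= MAX_ACTIVE_RENDERS_PER_ORG) {
    // Queue rather than reject: runaway parallelism is prevented
    // without blocking the user entirely.
    return { action: "queue", reason: "Org already has 5 active renders" };
  }
  return { action: "dispatch" };
}
```

In the real system the concurrency half is delegated to Inngest's per-key concurrency control rather than checked by hand; the credit check still belongs in the handler.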
8 Multi-platform publishing: one render, eight destinations
The render produces a single MP4 artifact. The publish layer handles per-platform adaptation:
| Platform | Status | Notes |
| --- | --- | --- |
| LinkedIn | Fully implemented | Video post + caption |
| YouTube | Implemented | Shorts endpoint, title + description |
| TikTok | Implemented | Direct post API |
| Instagram | Implemented | Reels endpoint |
| X (Twitter) | Implemented | Media upload + tweet |
| Threads | Implemented | Meta Graph API |
| Facebook | Implemented | Page video post |
| RedNote (小红书) | Implemented | Notes post format |
| Lemon8 | Implemented | Lifestyle post format |
Each platform gets a separate publish_attempt row. A failed publish on one platform does not block others - the Inngest function fans out platform publishes independently and records per-platform outcomes.
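The fan-out-with-independent-outcomes pattern is what `Promise.allSettled` exists for. A sketch, with a stand-in publish function (the real system records outcomes to `publish_attempt` rows rather than returning them):

```typescript
// One platform's failure never blocks the others; each outcome is
// captured separately, mirroring the per-platform publish_attempt rows.
type PublishOutcome = {
  platform: string;
  status: "published" | "failed";
  error?: string;
};

export async function fanOutPublish(
  platforms: string[],
  publishOne: (platform: string) => Promise<string>, // resolves to a provider post id
): Promise<PublishOutcome[]> {
  const settled = await Promise.allSettled(platforms.map((p) => publishOne(p)));
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { platform: platforms[i], status: "published" }
      : { platform: platforms[i], status: "failed", error: String(result.reason) },
  );
}
```

In production each platform publish is its own Inngest step so retries are also per-platform, but the failure-isolation shape is the same.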
Caption handling is worth calling out explicitly. The caption artifact from the render pipeline contains raw text. Before publishing, the platform handler reformats it: LinkedIn gets a 3000-character limit with hashtag injection, TikTok gets a 2200-character limit with different hashtag rules, YouTube gets the description field (no hard limit, but algorithm-relevant length is 200–500 words). The reformatting logic lives in per-platform formatter functions, not in the render worker - the worker should not know about destination platform constraints.
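A minimal sketch of such a formatter. The character limits are the ones quoted above; the truncation strategy (trim the body, always keep the hashtags) is an illustrative assumption:

```typescript
// Per-platform caption limits from this section. Platforms without a
// hard limit (e.g. YouTube descriptions) are left unconstrained here.
const CAPTION_LIMITS: Record<string, number> = {
  linkedin: 3000,
  tiktok: 2200,
};

export function formatCaption(
  platform: string,
  raw: string,
  hashtags: string[] = [],
): string {
  const suffix = hashtags.length ? "\n\n" + hashtags.join(" ") : "";
  const limit = CAPTION_LIMITS[platform] ?? Infinity;
  // Reserve room for the hashtags, truncating the body if needed.
  const budget = limit - suffix.length;
  const body =
    raw.length > budget ? raw.slice(0, Math.max(0, budget - 1)) + "…" : raw;
  return body + suffix;
}
```

Keeping this table in the app (not the worker) means adding a platform or changing a limit never requires touching render infrastructure.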
9 Artifact versioning and the re-render loop
Every artifact type tracks a version integer and an isCurrent boolean. When a re-render completes:
Existing artifacts of the same type have isCurrent set to false
New artifacts are created with isCurrent = true and version = previous + 1
The old artifact rows are retained until the soft-delete grace period
This matters for two reasons. First, it makes rollback possible: if a re-render produces a worse result (it happens - TTS models are not deterministic), you can revert to a previous artifact version without triggering another render. Second, it makes the publish audit trail clean: you can see exactly which artifact version was published to which platform, which is important for compliance and for debugging post-publish discrepancies.
The practical implementation, shown here as the underlying SQL (the production code uses the equivalent Drizzle queries):
-- When inserting a new artifact, mark all previous versions of the same type as not current
UPDATE job_artifacts
SET is_current = false
WHERE job_id = $jobId AND artifact_type = $type;
-- Then insert the new one with is_current = true
INSERT INTO job_artifacts (job_id, run_id, artifact_type, storage_key, public_url, version, is_current)
VALUES ($jobId, $runId, $type, $storageKey, $publicUrl, $nextVersion, true);
Never hard-delete artifact rows on re-render. The cost of storage (a few MB per artifact per version) is negligible compared to the cost of losing the ability to investigate a publish failure after the fact.
10 What this costs to run
Honest numbers based on the production system as of Q1 2026:
Compute (RunPod A40, 48 GB VRAM):
On-demand render: ~$0.39/hour; at 3–8 min per Short, that works out to $0.02–$0.05 per render
Warm pod (always-on): $0.44/hr ≈ $316/month
IndexTTS2 inference: included in the render worker, no separate cost
Storage (Cloudflare R2):
$0.015/GB/month storage
$0.36/million Class B operations (reads)
A typical Short render produces ~150 MB of artifacts (video + captions + thumbnail)
100 renders/month ≈ 15 GB ≈ $0.23/month storage
Orchestration (Inngest):
Free tier covers 50k function runs/month
At 136 renders per complex composition (5–6 Inngest steps per render): ~800 function runs
A moderate production load of 500 renders/month fits within the free tier
Total for 500 renders/month (warm pod + R2 + Inngest free tier):
~$320–380/month infrastructure cost
Per-render cost: $0.64–0.76
At SaaS pricing, 500 video exports from tools like Pictory or AutoShorts run $99–$299/month - but you get no voice customization, no Remotion programmability, and no artifact ownership. The self-hosted pipeline costs roughly 1–4x the SaaS price in raw dollars but delivers capabilities those tools cannot.
11 When NOT to build your own
Being honest about the failure modes:
Do not build your own if you have less than 3 months of runway to invest in infrastructure. The initial build - worker service, Inngest handlers, artifact storage, publish adapters - takes 4–8 weeks of focused engineering time. The ongoing maintenance (RunPod API changes, platform API deprecations, certificate renewals, monitoring) is another 2–4 hours per week indefinitely.
Do not build your own if you are pre-product-market-fit. If you are still testing whether video content works for your audience at all, the iteration speed of a SaaS tool is more valuable than infrastructure control. Get to 50 videos published and watch your retention curves before you invest in a custom pipeline.
Do not build your own if you need 50+ languages. Custom TTS fine-tuning at scale across many languages is a research problem, not an engineering problem. Commercial TTS providers (ElevenLabs, Azure, OpenAI) have already solved multi-language at production quality. Use them.
Do not build your own if you cannot staff the QA phase. The human watch-the-video step is irreplaceable. If you are too resource-constrained to watch every output before it publishes, the automation provides no safety net. A broken render will publish. Consider whether YouTube retention curve signals can serve as a post-publish quality signal instead of a pre-publish gate - but this is a different risk posture.
Q: Can I swap Remotion for FFmpeg-based rendering?
Yes. The worker boundary is abstracted - the orchestration layer does not care how the worker produces the MP4. FFmpeg pipelines are simpler for linear compositions (clip + voiceover + captions). Remotion earns its complexity when you need programmatic composition: conditional elements, data-driven layouts, React components with state. If your Shorts are simple talking-head clips with captions, FFmpeg is the right tool.
Q: Does the pipeline support live captions (burned-in subtitles)?
Yes, but this is handled in the Remotion composition, not in a post-processing step. The caption artifact from TTS (word-level timestamps from WhisperX or equivalent) is passed as a prop to the composition, which renders captions as React components synchronized to the audio timeline. This gives you full styling control. The alternative - burning captions in post with FFmpeg - produces the same output but loses the ability to adjust caption styling without a full re-render.
Q: How do you handle TTS failures mid-pipeline?
The Inngest step that calls the TTS endpoint is wrapped with retry logic (3 attempts, exponential backoff). If all three fail, the job status moves to tts_failed and a separate alert fires. The job can be manually re-triggered from the same text without losing any prior work - the failed TTS step did not write any artifacts, so the artifact table is clean.
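The retry policy described (3 attempts, exponential backoff) is handled by Inngest's step retries in production; a standalone sketch of the same shape, for context:

```typescript
// Generic retry wrapper: 3 attempts by default, with delays of
// base, 2*base, 4*base, ... between attempts. On final failure the
// error propagates so the caller can mark the job tts_failed and alert.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

The important property for the artifact table: the wrapped step writes nothing until it succeeds, so a failed run leaves no partial artifacts behind.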
Q: What happens if a publish fails on one platform but succeeds on others?
Each publish attempt is independent. A LinkedIn success does not block a TikTok retry. Failed publish attempts are retried up to 3 times with exponential backoff. After 3 failures, the attempt status is failed and the user is notified with the platform-specific error message. The video remains in draft_ready state - the user can manually re-trigger the publish for the failed platform.
Q: Do you use streaming TTS or batch?
Batch. Streaming TTS is useful for conversational interfaces where you need to start speaking before the full sentence is generated. For video production, you need the complete audio file before you can set composition timing. Streaming adds complexity without benefit in this context.
Q: What monitoring do you run on the pipeline?
Sentry for application errors (worker callbacks, Inngest step failures). Inngest's built-in dashboard for function run history and retry visibility. Custom alerts on job status transitions that take more than 15 minutes (likely a worker hang). R2 storage cost alerts via Cloudflare dashboard. No custom metrics infra - the Inngest dashboard plus Sentry covers the production failure modes we have actually encountered.