// TECHNICAL DOCUMENTATION v1.0

How Bob Works

A fully automated AI pipeline running on self-hosted hardware in South Australia. Here's everything under the hood.

The Pipeline

01. Audience Votes (TikTok + agent-browser)
Viewers comment OPTION words on TikTok. OpenClaw runs a headless Chrome browser every 5 minutes to scrape comments and tally votes. First option to 100 votes wins — or the leader at 3am AEST.
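The tally step above can be sketched as a small pure function. This is an illustrative reconstruction, not OpenClaw's actual code; the function name and the one-vote-per-comment rule are assumptions based on the description.

```python
from collections import Counter

def tally_votes(comments, options, threshold=100):
    """Count OPTION-word mentions in scraped comments.

    A comment votes for the first option word it contains
    (case-insensitive). Returns (winner, counts); winner is None until
    an option reaches `threshold` (the 3am-leader rule is handled
    elsewhere in this sketch).
    """
    counts = Counter({opt: 0 for opt in options})
    for text in comments:
        upper = text.upper()
        for opt in options:
            if opt.upper() in upper:
                counts[opt] += 1
                break  # assumed: one vote per comment
    leader, best = max(counts.items(), key=lambda kv: kv[1])
    winner = leader if best >= threshold else None
    return winner, dict(counts)
```

Running this every 5 minutes against freshly scraped comments gives both the live leaderboard and the early-finish trigger.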
02. AI Writes the Episode (Claude Sonnet via Anthropic API)
OpenClaw calls the Claude API with the full series bible, the complete story so far from a running tracker, and the winning vote. Claude returns a JSON script with 10–15 lines of dialogue, character expressions, backgrounds and 3 new vote options.
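Since the whole downstream pipeline depends on Claude returning well-formed JSON, a validation pass on the response is worth sketching. The field names (`dialogue`, `vote_options`) are assumptions; the real schema lives in the series-bible prompt.

```python
import json

REQUIRED_KEYS = {"dialogue", "vote_options"}  # assumed schema fields

def parse_episode_script(raw):
    """Parse and sanity-check the JSON script Claude returns.

    Enforces the constraints stated in the docs: 10-15 dialogue lines
    and exactly 3 new vote options. Raises ValueError on violation so
    the orchestrator can retry the API call.
    """
    script = json.loads(raw)
    missing = REQUIRED_KEYS - script.keys()
    if missing:
        raise ValueError(f"script missing keys: {missing}")
    if not 10 <= len(script["dialogue"]) <= 15:
        raise ValueError("expected 10-15 dialogue lines")
    if len(script["vote_options"]) != 3:
        raise ValueError("expected exactly 3 vote options")
    return script
```

A failed parse is cheap to catch here; a malformed script that reaches the render stage wastes half an hour of GPU time.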
03. Character Generation (ComfyUI + DreamShaper 8 + ControlNet)
Any new character referenced in the script is automatically generated using ComfyUI with the DreamShaper model and OpenPose ControlNet. Bob's real photo is used as the pose reference so all characters maintain consistent facial positioning. Images are saved for reuse in future episodes.
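The "generate only if missing" check that gates this step might look like the following. The `<name>.png` naming convention is an assumption for illustration; any cache-by-filename scheme works the same way.

```python
from pathlib import Path

def characters_needing_generation(script_characters, image_dir):
    """Return script characters with no saved portrait yet.

    Portraits are assumed to be cached as <name>.png (lowercase) in
    `image_dir`; any miss would trigger a ComfyUI txt2img + ControlNet
    job with Bob's photo as the pose reference.
    """
    image_dir = Path(image_dir)
    return [
        name for name in script_characters
        if not (image_dir / f"{name.lower()}.png").exists()
    ]
```

Because generated portraits are saved back into the same directory, each character costs one GPU render ever, not one per episode.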
04. Voice Synthesis (Microsoft Edge TTS)
Each line of dialogue is converted to audio using Edge TTS Australian voices — en-AU-WilliamNeural for Bob and male characters, en-AU-NatashaNeural for female characters, and American voices for the aliens. Audio is resampled to 16kHz mono WAV for SadTalker compatibility.
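The voice routing and the resample step can be sketched together. The character `kind` field and the specific American voice are assumptions (the docs only say "American voices for the aliens"); the two Australian voice names and the ffmpeg flags are as documented above.

```python
# Assumed routing table; en-US-GuyNeural is an illustrative stand-in
# for whichever American voice the aliens actually use.
VOICES = {
    "male":   "en-AU-WilliamNeural",   # Bob and other male characters
    "female": "en-AU-NatashaNeural",
    "alien":  "en-US-GuyNeural",
}

def voice_for(character):
    """Pick an Edge TTS voice from a character dict like
    {"name": "Bob", "kind": "male"}; defaults to Bob's voice."""
    return VOICES.get(character.get("kind"), VOICES["male"])

def resample_cmd(src, dst):
    """ffmpeg arguments to get 16 kHz mono WAV, as SadTalker expects."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]
```

The synthesis itself would go through the `edge-tts` CLI with the chosen voice, then the resample command runs on its output.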
05. Lip Sync Animation (SadTalker + GTX 1060 6GB)
SadTalker animates each character image to match the audio using a 3D morphable face model. It extracts facial landmarks, generates expression coefficients from the audio mel spectrogram, renders a talking head video and composites it back onto the original face using seamless cloning.
06. Background Removal (rembg + onnxruntime-gpu + CUDA)
The SadTalker output has a plain background that needs removing before compositing. rembg runs the U2-Net neural network on each frame to generate an alpha mask, producing transparent PNG frames. This runs on the GTX 1060 via CUDA for acceleration.
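The per-frame driver loop is simple enough to sketch. Passing the removal function in as a parameter keeps the sketch testable without the GPU stack; in the real pipeline it would be rembg's `remove` (PNG bytes in, RGBA PNG bytes out).

```python
from pathlib import Path

def strip_backgrounds(frame_dir, out_dir, remove_fn):
    """Run a background-removal function over every extracted frame.

    `remove_fn` stands in for rembg's remove(); this loop just handles
    file plumbing: read each PNG frame, write the alpha-masked result
    under the same name, return the output paths in frame order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for frame in sorted(Path(frame_dir).glob("*.png")):
        rgba = remove_fn(frame.read_bytes())
        target = out_dir / frame.name
        target.write_bytes(rgba)
        outputs.append(target)
    return outputs
```

At ~60 seconds per scene this loop is the pipeline's slowest stage, which is why it runs on CUDA rather than CPU.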
07. Scene Compositing (FFmpeg)
FFmpeg overlays the transparent character frames over the background image, scales to 1080×1920 (TikTok portrait format), adds subtitle text using drawtext, mixes in the audio and encodes to H.264/AAC MP4. The final episode is assembled by concatenating all scenes with a dynamic intro card and endcard.
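A single-scene FFmpeg invocation along the lines described might be built like this. The filtergraph is a simplified sketch (overlay position, font sizing, and the real drawtext styling are assumptions); the codecs and 1080×1920 target are as documented.

```python
def compose_scene_cmd(char_frames, background, audio, subtitle, out_path):
    """Build the ffmpeg argument list for one scene.

    Inputs: a transparent PNG sequence pattern (e.g. 'frames/%04d.png'),
    a background image, and a WAV. Scales the background to portrait,
    overlays the character frames, burns in a subtitle, and encodes
    H.264/AAC.
    """
    filtergraph = (
        "[1:v]scale=1080:1920[bg];"
        "[bg][0:v]overlay=(W-w)/2:H-h[comp];"
        f"[comp]drawtext=text='{subtitle}':fontsize=48:"
        "x=(w-text_w)/2:y=h-220[v]"
    )
    return [
        "ffmpeg", "-y",
        "-framerate", "25", "-i", char_frames,  # transparent PNGs
        "-i", background,
        "-i", audio,
        "-filter_complex", filtergraph,
        "-map", "[v]", "-map", "2:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        out_path,
    ]
```

The final episode concat (intro card, scenes, endcard) would be a second ffmpeg pass over the per-scene MP4s.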

Hardware

Render Node: Super (Ubuntu 24.04, custom build)
GPU: NVIDIA GTX 1060 (6GB VRAM, Pascal architecture)
RAM: 8GB system RAM (4GB swap configured)
Orchestration: OpenClaw (Docker on Unraid NAS)
Web Server: Allium (Plesk, 1500+ days uptime)
Internet: Starlink (CGNAT, self-hosted via NPM)

Render Times Per Scene

Voice Synthesis (Edge TTS): ~5 sec
Lip Sync (SadTalker): ~45 sec
Background Removal (rembg + CUDA): ~60 sec
Scene Compositing (FFmpeg): ~10 sec
Total per episode (~13 scenes): ~25 min
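The total line follows directly from the per-scene figures, as a quick back-of-envelope check:

```python
# Per-scene render times from the table above, in seconds.
PER_SCENE = {"tts": 5, "sadtalker": 45, "rembg": 60, "ffmpeg": 10}

def episode_minutes(scenes=13):
    """Estimated wall-clock minutes for an episode of `scenes` scenes."""
    return scenes * sum(PER_SCENE.values()) / 60

# 13 scenes * 120 s/scene = 1560 s = 26 min, consistent with ~25 min
```

Overheads like character generation and the final concat push the real range toward the 25-40 minutes quoted in the FAQ.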

The Full Stack

[AI]
Claude Sonnet
Story Writer
Anthropic's Claude writes every episode from scratch using the series bible and story tracker. It invents characters, dialogue, plot twists and vote options.
[IMG]
ComfyUI + DreamShaper 8
Character & Background Generator
Stable Diffusion workflow for generating new character portraits and outback backgrounds on demand. ControlNet with OpenPose keeps face positions consistent.
[LIP]
SadTalker
Lip Sync Engine
3D face model animation that drives character portraits to speak in sync with generated audio. Runs entirely on local GPU hardware.
[TTS]
Microsoft Edge TTS
Voice Synthesis
High-quality Australian neural voices. en-AU-WilliamNeural for Bob. Runs via the edge-tts Python CLI — free, no API key required.
[BG]
rembg
Background Removal
U2-Net neural network running via onnxruntime-gpu to remove SadTalker backgrounds frame by frame, enabling character compositing over scene backgrounds.
[VID]
FFmpeg
Video Assembly
Handles all compositing, scaling, subtitle rendering, audio mixing, scene concatenation and final encoding to H.264/AAC for TikTok.
[AGT]
OpenClaw
AI Agent Orchestrator
Self-hosted AI agent platform running in Docker on an Unraid NAS. Orchestrates the entire pipeline via SSH — from vote polling to episode rendering to notifications.
[WEB]
agent-browser
TikTok Vote Scraper
Headless Chrome automation that visits the TikTok video page every 5 minutes to read comments and tally votes without requiring API access.

Geek FAQ

Does Bob actually know what's happening to him?
No. Claude writes each episode fresh from the series bible and a story tracker JSON file. It has no persistent memory — just the text summary of what's happened so far. Every decision is made in a single API call.
How are new characters generated?
When Claude writes a new character into an episode, it includes a description field. The pipeline detects the missing image, starts ComfyUI, submits a txt2img workflow with ControlNet using Bob's real photo as a pose reference, and saves the output PNG for future episodes.
What stops people from voting multiple times?
Nothing — it's TikTok comments. Each comment counts as one vote regardless of who posted it. If someone really wants BIRDSVILLE to win, they can comment 100 times. That's kind of the point.
How long does a full episode take to render?
On the GTX 1060, approximately 25–40 minutes for a 13-scene episode. rembg background removal is the slowest step at ~60 seconds per scene even with CUDA, followed by SadTalker lip sync at ~45 seconds per scene.
Is this running in the cloud?
No. Everything runs on physical hardware in South Australia — an Unraid NAS for orchestration and a Linux box with a GTX 1060 for rendering. The only cloud calls are to the Anthropic API for writing and Microsoft Edge TTS for voices.
What happens if the render fails mid-episode?
The pipeline has a resume feature — it checks for existing scene_N.mp4 files and skips already-completed scenes. A failed run can be restarted and will pick up from where it left off.
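The resume check described above amounts to a glob over the episode directory. The function name and directory layout are assumptions; the `scene_N.mp4` convention is from the docs.

```python
import re
from pathlib import Path

def scenes_to_render(episode_dir, total_scenes):
    """Resume-logic sketch: list the 1-based scene numbers that still
    need rendering, skipping any scene_N.mp4 already on disk."""
    done = {
        int(m.group(1))
        for p in Path(episode_dir).glob("scene_*.mp4")
        if (m := re.fullmatch(r"scene_(\d+)\.mp4", p.name))
    }
    return [n for n in range(1, total_scenes + 1) if n not in done]
```

A restarted run simply iterates this list instead of `range(1, total + 1)`, so a crash at scene 9 of 13 costs four scenes of rework, not thirteen.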
Will Bob ever find out about the $5 million?
Yes — when he finds an ATM and checks his balance. Tap-and-go doesn't show the balance, so it has to be a proper ATM. When that happens is entirely up to the audience votes.