// TECHNICAL DOCUMENTATION v1.0

How Bob Works

A fully automated AI pipeline running on self-hosted hardware in South Australia. Here's everything under the hood.

The Pipeline

01. Audience Votes (TikTok + agent-browser)
Viewers comment OPTION words on TikTok. OpenClaw runs a headless Chrome browser every 5 minutes to scrape comments and tally votes. First option to 100 votes wins — or the leader at 3am AEST.
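The tally step above can be sketched as a small pure function. This is an illustrative reconstruction, not OpenClaw's actual code; the function name and the one-vote-per-comment rule are assumptions based on the description.

```python
from collections import Counter

def tally_votes(comments, options, threshold=100):
    """Count OPTION-word mentions in scraped comments.

    A comment votes for the first option word it contains
    (case-insensitive). Returns (winner, counts); winner is None until
    an option reaches `threshold` (the 3am-leader rule is handled
    elsewhere in this sketch).
    """
    counts = Counter({opt: 0 for opt in options})
    for text in comments:
        upper = text.upper()
        for opt in options:
            if opt.upper() in upper:
                counts[opt] += 1
                break  # assumed: one vote per comment
    leader, best = max(counts.items(), key=lambda kv: kv[1])
    winner = leader if best >= threshold else None
    return winner, dict(counts)
```

Running this every 5 minutes against freshly scraped comments gives both the live leaderboard and the early-finish trigger.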
02. AI Writes the Episode (Claude Sonnet via Anthropic API)
OpenClaw calls the Claude API with the full series bible, the complete story so far from a running tracker, and the winning vote. Claude returns a JSON script with 10–15 lines of dialogue, character expressions, backgrounds and 3 new vote options.
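Since the whole downstream pipeline depends on Claude returning well-formed JSON, a validation pass on the response is worth sketching. The field names (`dialogue`, `vote_options`) are assumptions; the real schema lives in the series-bible prompt.

```python
import json

REQUIRED_KEYS = {"dialogue", "vote_options"}  # assumed schema fields

def parse_episode_script(raw):
    """Parse and sanity-check the JSON script Claude returns.

    Enforces the constraints stated in the docs: 10-15 dialogue lines
    and exactly 3 new vote options. Raises ValueError on violation so
    the orchestrator can retry the API call.
    """
    script = json.loads(raw)
    missing = REQUIRED_KEYS - script.keys()
    if missing:
        raise ValueError(f"script missing keys: {missing}")
    if not 10 <= len(script["dialogue"]) <= 15:
        raise ValueError("expected 10-15 dialogue lines")
    if len(script["vote_options"]) != 3:
        raise ValueError("expected exactly 3 vote options")
    return script
```

A failed parse is cheap to catch here; a malformed script that reaches the render stage wastes half an hour of GPU time.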
03. Character Generation (ComfyUI + DreamShaper 8 + ControlNet)
Any new character referenced in the script is automatically generated using ComfyUI with the DreamShaper model and OpenPose ControlNet. Bob's real photo is used as the pose reference so all characters maintain consistent facial positioning. Images are saved for reuse in future episodes.
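The "generate only if missing" check that gates this step might look like the following. The `<name>.png` naming convention is an assumption for illustration; any cache-by-filename scheme works the same way.

```python
from pathlib import Path

def characters_needing_generation(script_characters, image_dir):
    """Return script characters with no saved portrait yet.

    Portraits are assumed to be cached as <name>.png (lowercase) in
    `image_dir`; any miss would trigger a ComfyUI txt2img + ControlNet
    job with Bob's photo as the pose reference.
    """
    image_dir = Path(image_dir)
    return [
        name for name in script_characters
        if not (image_dir / f"{name.lower()}.png").exists()
    ]
```

Because generated portraits are saved back into the same directory, each character costs one GPU render ever, not one per episode.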
04. Voice Synthesis (Microsoft Edge TTS)
Each line of dialogue is converted to audio using Edge TTS Australian voices — en-AU-WilliamNeural for Bob and male characters, en-AU-NatashaNeural for female characters, and American voices for the aliens. Audio is resampled to 16kHz mono WAV for SadTalker compatibility.
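The voice routing and the resample step can be sketched together. The character `kind` field and the specific American voice are assumptions (the docs only say "American voices for the aliens"); the two Australian voice names and the ffmpeg flags are as documented above.

```python
# Assumed routing table; en-US-GuyNeural is an illustrative stand-in
# for whichever American voice the aliens actually use.
VOICES = {
    "male":   "en-AU-WilliamNeural",   # Bob and other male characters
    "female": "en-AU-NatashaNeural",
    "alien":  "en-US-GuyNeural",
}

def voice_for(character):
    """Pick an Edge TTS voice from a character dict like
    {"name": "Bob", "kind": "male"}; defaults to Bob's voice."""
    return VOICES.get(character.get("kind"), VOICES["male"])

def resample_cmd(src, dst):
    """ffmpeg arguments to get 16 kHz mono WAV, as SadTalker expects."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]
```

The synthesis itself would go through the `edge-tts` CLI with the chosen voice, then the resample command runs on its output.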
05. Lip Sync Animation (SadTalker + GTX 1060 6GB)
SadTalker animates each character image to match the audio using a 3D morphable face model. It extracts facial landmarks, generates expression coefficients from the audio mel spectrogram, renders a talking head video and composites it back onto the original face using seamless cloning.
06. Background Removal (rembg + onnxruntime-gpu + CUDA)
The SadTalker output has a plain background that needs removing before compositing. rembg runs the U2-Net neural network on each frame to generate an alpha mask, producing transparent PNG frames. This runs on the GTX 1060 via CUDA for acceleration.
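The per-frame driver loop is simple enough to sketch. Passing the removal function in as a parameter keeps the sketch testable without the GPU stack; in the real pipeline it would be rembg's `remove` (PNG bytes in, RGBA PNG bytes out).

```python
from pathlib import Path

def strip_backgrounds(frame_dir, out_dir, remove_fn):
    """Run a background-removal function over every extracted frame.

    `remove_fn` stands in for rembg's remove(); this loop just handles
    file plumbing: read each PNG frame, write the alpha-masked result
    under the same name, return the output paths in frame order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for frame in sorted(Path(frame_dir).glob("*.png")):
        rgba = remove_fn(frame.read_bytes())
        target = out_dir / frame.name
        target.write_bytes(rgba)
        outputs.append(target)
    return outputs
```

At ~60 seconds per scene this loop is the pipeline's slowest stage, which is why it runs on CUDA rather than CPU.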
07. Scene Compositing (FFmpeg)
FFmpeg overlays the transparent character frames over the background image, scales to 1080×1920 (TikTok portrait format), adds subtitle text using drawtext, mixes in the audio and encodes to H.264/AAC MP4. The final episode is assembled by concatenating all scenes with a dynamic intro card and endcard.
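A single-scene FFmpeg invocation along the lines described might be built like this. The filtergraph is a simplified sketch (overlay position, font sizing, and the real drawtext styling are assumptions); the codecs and 1080×1920 target are as documented.

```python
def compose_scene_cmd(char_frames, background, audio, subtitle, out_path):
    """Build the ffmpeg argument list for one scene.

    Inputs: a transparent PNG sequence pattern (e.g. 'frames/%04d.png'),
    a background image, and a WAV. Scales the background to portrait,
    overlays the character frames, burns in a subtitle, and encodes
    H.264/AAC.
    """
    filtergraph = (
        "[1:v]scale=1080:1920[bg];"
        "[bg][0:v]overlay=(W-w)/2:H-h[comp];"
        f"[comp]drawtext=text='{subtitle}':fontsize=48:"
        "x=(w-text_w)/2:y=h-220[v]"
    )
    return [
        "ffmpeg", "-y",
        "-framerate", "25", "-i", char_frames,  # transparent PNGs
        "-i", background,
        "-i", audio,
        "-filter_complex", filtergraph,
        "-map", "[v]", "-map", "2:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        out_path,
    ]
```

The final episode concat (intro card, scenes, endcard) would be a second ffmpeg pass over the per-scene MP4s.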

Hardware

Render Node: Super (Ubuntu 24.04, custom build)
GPU: NVIDIA GTX 1060 (6GB VRAM, Pascal architecture)
RAM: 8GB system RAM (4GB swap configured)
Orchestration: OpenClaw (Docker on Unraid NAS)
Web Server: Allium (Plesk, 1500+ days uptime)
Internet: Starlink (CGNAT, self-hosted via NPM)

Render Times Per Scene

Voice Synthesis (Edge TTS): ~5 sec
Lip Sync (SadTalker): ~45 sec
Background Removal (rembg + CUDA): ~60 sec
Scene Compositing (FFmpeg): ~10 sec
Total per episode (~13 scenes): ~25 min
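The total line follows directly from the per-scene figures, as a quick back-of-envelope check:

```python
# Per-scene render times from the table above, in seconds.
PER_SCENE = {"tts": 5, "sadtalker": 45, "rembg": 60, "ffmpeg": 10}

def episode_minutes(scenes=13):
    """Estimated wall-clock minutes for an episode of `scenes` scenes."""
    return scenes * sum(PER_SCENE.values()) / 60

# 13 scenes * 120 s/scene = 1560 s = 26 min, consistent with ~25 min
```

Overheads like character generation and the final concat push the real range toward the 25-40 minutes quoted in the FAQ.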

The Full Stack

[AI]
Claude Sonnet
Story Writer
Anthropic's Claude writes every episode from scratch using the series bible and story tracker. It invents characters, dialogue, plot twists and vote options.
[IMG]
ComfyUI + DreamShaper 8
Character & Background Generator
Stable Diffusion workflow for generating new character portraits and outback backgrounds on demand. ControlNet with OpenPose keeps face positions consistent.
[LIP]
SadTalker
Lip Sync Engine
3D face model animation that drives character portraits to speak in sync with generated audio. Runs entirely on local GPU hardware.
[TTS]
Microsoft Edge TTS
Voice Synthesis
High-quality Australian neural voices. en-AU-WilliamNeural for Bob. Runs via the edge-tts Python CLI — free, no API key required.
[BG]
rembg
Background Removal
U2-Net neural network running via onnxruntime-gpu to remove SadTalker backgrounds frame by frame, enabling character compositing over scene backgrounds.
[VID]
FFmpeg
Video Assembly
Handles all compositing, scaling, subtitle rendering, audio mixing, scene concatenation and final encoding to H.264/AAC for TikTok.
[AGT]
OpenClaw
AI Agent Orchestrator
Self-hosted AI agent platform running in Docker on an Unraid NAS. Orchestrates the entire pipeline via SSH — from vote polling to episode rendering to notifications.
[WEB]
agent-browser
TikTok Vote Scraper
Headless Chrome automation that visits the TikTok video page every 5 minutes to read comments and tally votes without requiring API access.

Geek FAQ

Does Bob actually know what's happening to him?
No. Claude writes each episode fresh from the series bible and a story tracker JSON file. It has no persistent memory — just the text summary of what's happened so far. Every decision is made in a single API call.
How are new characters generated?
When Claude writes a new character into an episode, it includes a description field. The pipeline detects the missing image, starts ComfyUI, submits a txt2img workflow with ControlNet using Bob's real photo as a pose reference, and saves the output PNG for future episodes.
What stops people from voting multiple times?
Nothing — it's TikTok comments. Each comment counts as one vote regardless of who posted it. If someone really wants BIRDSVILLE to win, they can comment 100 times. That's kind of the point.
How long does a full episode take to render?
On the GTX 1060, approximately 25–40 minutes for a 13-scene episode. rembg background removal is the slowest step at ~60 seconds per scene even with CUDA, followed by SadTalker lip sync at ~45 seconds per scene.
Is this running in the cloud?
No. Everything runs on physical hardware in South Australia — an Unraid NAS for orchestration and a Linux box with a GTX 1060 for rendering. The only cloud calls are to the Anthropic API for writing and Microsoft Edge TTS for voices.
What happens if the render fails mid-episode?
The pipeline has a resume feature — it checks for existing scene_N.mp4 files and skips already-completed scenes. A failed run can be restarted and will pick up from where it left off.
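The resume check described above amounts to a glob over the episode directory. The function name and directory layout are assumptions; the `scene_N.mp4` convention is from the docs.

```python
import re
from pathlib import Path

def scenes_to_render(episode_dir, total_scenes):
    """Resume-logic sketch: list the 1-based scene numbers that still
    need rendering, skipping any scene_N.mp4 already on disk."""
    done = {
        int(m.group(1))
        for p in Path(episode_dir).glob("scene_*.mp4")
        if (m := re.fullmatch(r"scene_(\d+)\.mp4", p.name))
    }
    return [n for n in range(1, total_scenes + 1) if n not in done]
```

A restarted run simply iterates this list instead of `range(1, total + 1)`, so a crash at scene 9 of 13 costs four scenes of rework, not thirteen.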
Will Bob ever find out about the $5 million?
Yes — when he finds an ATM and checks his balance. Tap-and-go doesn't show the balance, so it has to be a proper ATM. When that happens is entirely up to the audience votes.