// ROADMAP v1.0

The Future of Bob

Bob is just getting started. Here's where we're hoping to take him — better animation, faster production, more interactivity, and eventually a proper studio setup running in a shed in South Australia.

Now

Where We Are

The pipeline is running. Episodes are being generated end-to-end with no human involvement after the vote is counted. It works — but it's rough around the edges.

  • SadTalker lip sync — functional but limited to portrait talking heads
  • rembg background removal — frame by frame, CPU-heavy
  • DreamShaper characters — consistent but cartoon quality
  • Edge TTS voices — good Australian accents, slightly robotic
  • 30-45 minute render time per episode on a GTX 1060
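For reference, the whole chain can be sketched as an ordered list of stages. This is a minimal illustration assuming the pipeline simply runs each tool in sequence; the function and stage names are hypothetical, not the actual module API:

```python
# Illustrative sketch of the current sequential pipeline; the stage names
# mirror the tools listed above, but the function itself is hypothetical.

def render_episode(script_path: str) -> list[str]:
    stages = [
        "generate_character_frames",  # DreamShaper portraits
        "synthesise_voices",          # Edge TTS
        "lip_sync_faces",             # SadTalker talking heads
        "remove_backgrounds",         # rembg, frame by frame
        "composite_and_encode",       # assemble the final video
    ]
    for stage in stages:
        print(f"[{script_path}] running {stage}")
    return stages

completed = render_episode("episode_script.txt")
```

Every stage runs one after the other, which is why a single episode ties up the GPU for the full render.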

Soon

Animation Upgrade

SadTalker is good for what it is, but it only animates the face. The next step is full-body animation — characters that move, gesture, and react physically to the dialogue.

  • Replace SadTalker with HeyGen or similar photorealistic talking head API
  • Full-body character animation using pose estimation and motion transfer
  • Animated backgrounds — subtle parallax, weather, time of day changes
  • Better lip sync accuracy using wav2lip or similar dedicated model
  • Scene transitions between dialogue lines
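One way to make the lip-sync swap low-risk is to hide each model behind a common interface, so SadTalker, wav2lip, or a hosted API can be exchanged per episode without touching the rest of the pipeline. A hedged sketch: the backend classes are placeholders, and real SadTalker or wav2lip invocations take many more parameters.

```python
from typing import Protocol

class LipSyncBackend(Protocol):
    """Anything that turns a face image plus audio into a video clip."""
    def animate(self, face_image: str, audio_path: str) -> str: ...

class SadTalkerBackend:
    def animate(self, face_image: str, audio_path: str) -> str:
        # Placeholder: the real SadTalker call renders a talking-head video.
        return f"sadtalker({face_image}, {audio_path})"

class Wav2LipBackend:
    def animate(self, face_image: str, audio_path: str) -> str:
        # Placeholder: wav2lip focuses purely on mouth-region accuracy.
        return f"wav2lip({face_image}, {audio_path})"

def render_dialogue_line(backend: LipSyncBackend, face: str, audio: str) -> str:
    return backend.animate(face, audio)
```

With this shape, trialling a new model is a one-line change at the call site rather than a pipeline rewrite.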

Soon

Sound Design

Currently Bob's world is mostly silent except for voices. Real storytelling needs ambient sound — the creak of a pub, the wind across the Birdsville Track, the clunk of a blown tyre.

  • AI-generated ambient soundscapes per scene (outback wind, pub noise, car interior)
  • Sound effects triggered by script keywords
  • Improved voice synthesis — ElevenLabs or Cartesia for more natural delivery
  • Dynamic music scoring — different themes per location and emotional tone
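Keyword-triggered sound effects could start as something as simple as a regex table scanned once per script line. A sketch under assumptions: the mapping and file paths here are invented for illustration.

```python
import re

# Hypothetical keyword -> sound-effect table; file paths are placeholders.
SFX_CUES = {
    r"\btyre\b": "sfx/tyre_blowout.wav",
    r"\bpub\b":  "sfx/pub_ambience.wav",
    r"\bwind\b": "sfx/outback_wind.wav",
}

def cue_effects(script_line: str) -> list[str]:
    """Return the sound files triggered by keywords in one line of script."""
    return [
        wav for pattern, wav in SFX_CUES.items()
        if re.search(pattern, script_line, flags=re.IGNORECASE)
    ]
```

So `cue_effects("Wind rattled the pub door")` returns the pub and wind cues, which the compositor would then mix under the dialogue.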

Later

Production Pipeline

Right now each episode takes 30-45 minutes to render sequentially. With better hardware and parallelisation, that should drop to under 10 minutes — meaning same-hour episode release after voting closes.

  • A dedicated GPU server for serious parallel rendering
  • Parallel scene rendering — multiple scenes processing simultaneously
  • Episode archive page on this website with full back catalogue
  • Automated episode summary posted to website after each render
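Parallel scene rendering can be prototyped with the standard library before any new hardware arrives. A sketch assuming scenes are independent: `render_scene` is a stand-in for the real per-scene work, and in practice each worker would drive a separate GPU process rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor

def render_scene(scene_id: int) -> str:
    # Stand-in for the real work: lip sync, background removal, compositing.
    return f"scene_{scene_id:03d}.mp4"

def render_scenes_parallel(scene_ids, workers: int = 4) -> list[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so the clips concatenate correctly later
        return list(pool.map(render_scene, scene_ids))
```

Keeping output order identical to input order matters here, because the final episode is just the scene clips concatenated in script order.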

Dream

Bob's World

The long-term vision is a fully interactive AI story universe. Bob is just the start.

  • Multi-camera angles per scene — cutaways, reaction shots, wide establishing shots
  • 3D environments — Bob's world rendered in real-time 3D with consistent locations

Soon

Growing the Audience

Bob's existence depends on people watching. We're building in mechanics that make that explicit and turn it into part of the story.

  • "Bob Knows He Might Die" — a short where Bob breaks the fourth wall about his AI existence and asks viewers to follow to keep him alive
  • End card CTA — "Follow or Bob dies" alongside the vote options
  • Behind the scenes shorts — showing the pipeline rendering, the GPU working, the AI writing
  • Bob reacts to real TikTok comments in standalone shorts
  • Bob's survival tied to follower count in the narrative — the aliens return and reveal the nuroliser only stays stable while enough humans are watching
  • A dedicated follow-drive pinned video at the top of the profile

The Hardware Problem

Every AI task in the pipeline — lip sync, background removal, image generation, voice synthesis — runs on a single NVIDIA GTX 1060 6GB from 2016. It's a remarkable machine that punches well above its weight, but it's showing its limits.

The GTX 1060 has 1,280 CUDA cores and 6GB of VRAM. Newer cards have tensor cores specifically designed for the matrix operations that power these AI models — meaning the same task can run 5-10x faster on modern hardware.

                GTX 1060 6GB     RTX 3060 12GB
                (current)        (target)
CUDA Cores      1,280            3,584
VRAM            6 GB             12 GB
Render Time     ~35 min/ep       ~8 min/ep
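As a sanity check on these numbers: a 5-10x per-task speedup doesn't map one-to-one onto episode time, because encoding and disk I/O stay roughly constant. Assuming an illustrative split of ~30 minutes of GPU-bound work and ~5 minutes of fixed overhead per episode:

```python
gpu_minutes, fixed_minutes = 30, 5  # assumed split, not measured

for speedup in (5, 10):
    total = gpu_minutes / speedup + fixed_minutes
    print(f"{speedup}x GPU speedup -> ~{total:.0f} min/episode")
```

That puts episodes between roughly 8 and 11 minutes, so the ~8 min/ep target sits at the optimistic end of the range.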

Animation: Now vs Future

Current Pipeline

  • Static character portrait images
  • SadTalker face-only animation
  • Plain colour scene backgrounds
  • No body movement or gestures
  • Dialogue subtitles only
  • No ambient sound
  • Edge TTS — good but synthetic

Target Pipeline

  • Full-body animated characters
  • Photorealistic lip sync
  • Animated scene environments
  • Gesture and expression matching
  • Contextual sound effects
  • Ambient soundscapes per location
  • Natural voice synthesis