@zeroasterisk
Last active April 5, 2026 03:48
Video Tour Skill for OpenClaw — Ken Burns narrated tours from screenshots + TTS

Animated Explainer (Component)

Generate short animated motion-graphic style videos from a text description.

Status: Research / Future

This is a future component inspired by Replit's animated video feature. The idea: describe a concept in natural language, get a polished animated explainer video.

Current Alternatives

Until we build or adopt a proper motion graphics pipeline:

  1. GIF from screenshots — stitch Playwright screenshots into a GIF (works now via cuj-screenshots)
  2. Asciinema cast — terminal recordings, editable as JSON (we've done this for ADK homepage)
  3. Mermaid diagrams — render architecture/flow diagrams as images
  4. Canvas HTML — render an HTML page with animations, screenshot/record it

Future Architecture

If we build this:

Text prompt → LLM generates React/HTML animation code → 
  Headless browser renders frames → ffmpeg stitches to MP4/GIF

Similar to Replit's approach: programmatic animations (not AI-generated video like Runway/Sora), built with web animation libraries, rendered headlessly.

When This Would Trigger

  • "Make an explainer video for X"
  • "Create an animated tour of this feature"
  • "Generate a demo video for the README"

Dependencies We'd Need

  • ffmpeg (for video encoding)
  • A motion graphics template library (Remotion, Motion Canvas, or custom)
  • Headless Chrome rendering (already have)
# /// script
# requires-python = ">=3.11"
# dependencies = ["playwright", "google-cloud-texttospeech"]
# ///
"""
Render a narrated tour video from a JSON spec.
Usage:
python render_tour.py '<json_data>' [output.mp4]
JSON spec:
{
"title": "Feature Name",
"bullets": ["Point 1", "Point 2"],
"metric": "42%",
"metricLabel": "improvement",
"narration": "Text to speak as voiceover.",
"voice": "en-US-Journey-D" // optional, defaults to en-US-Journey-D
}
Requires:
- Google Cloud TTS credentials (GOOGLE_APPLICATION_CREDENTIALS)
- ffmpeg + ffprobe on PATH (or set FFMPEG_PATH / FFPROBE_PATH)
- Playwright with Chromium installed
"""
import asyncio
import json
import os
import subprocess
import sys
from playwright.async_api import async_playwright
FFMPEG = os.environ.get("FFMPEG_PATH", "ffmpeg")
FFPROBE = os.environ.get("FFPROBE_PATH", "ffprobe")
TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
width: 1280px; height: 720px;
background: linear-gradient(135deg, #0a0a0a 0%, #1a1a2e 50%, #16213e 100%);
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif;
color: #fff;
overflow: hidden;
display: flex;
align-items: center;
}
.container { padding: 60px 80px; width: 100%; }
.badge {
display: inline-block;
background: #22c55e;
color: #000;
padding: 6px 16px;
border-radius: 20px;
font-size: 14px;
font-weight: 700;
text-transform: uppercase;
letter-spacing: 1px;
margin-bottom: 24px;
}
h1 {
font-size: 52px;
font-weight: 800;
line-height: 1.1;
margin-bottom: 40px;
}
.bullet {
font-size: 26px;
color: #b0b0b0;
margin-bottom: 18px;
padding-left: 24px;
position: relative;
}
.bullet::before {
content: '→';
position: absolute;
left: 0;
color: #22c55e;
}
.metric {
position: absolute;
bottom: 60px;
right: 80px;
text-align: right;
}
.metric .value {
font-size: 56px;
font-weight: 800;
color: #22c55e;
}
.metric .label {
font-size: 16px;
color: #888;
text-transform: uppercase;
letter-spacing: 2px;
}
</style>
</head>
<body>
<div class="container">
<div class="badge" id="badge">COMPLETED</div>
<h1 id="title"></h1>
<div id="bullets"></div>
<div class="metric" id="metric">
<div class="value" id="metricValue"></div>
<div class="label" id="metricLabel"></div>
</div>
</div>
<script>
const data = __DATA__;
document.getElementById('title').textContent = data.title;
document.getElementById('metricValue').textContent = data.metric || '';
document.getElementById('metricLabel').textContent = data.metricLabel || '';
const bulletsEl = document.getElementById('bullets');
(data.bullets || []).forEach(b => {
const div = document.createElement('div');
div.className = 'bullet';
div.textContent = b;
bulletsEl.appendChild(div);
});
</script>
</body>
</html>
"""


def generate_narration(text: str, output_path: str, voice_name: str = "en-US-Journey-D") -> float:
    """Generate TTS narration via Google Cloud TTS. Returns duration in seconds."""
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name=voice_name,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.05,
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open(output_path, "wb") as f:
        f.write(response.audio_content)

    # Measure the real audio length. check=True surfaces an ffprobe failure
    # directly instead of a confusing float("") error on the next line.
    result = subprocess.run(
        [FFPROBE, "-v", "quiet", "-show_entries", "format=duration", "-of", "csv=p=0", output_path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


async def render_frames(data: dict, duration_secs: float, fps: int = 30) -> str:
    """Render animated HTML frames via Playwright. Returns frames directory path."""
    total_frames = int(fps * duration_secs)
    frames_dir = "/tmp/tour-frames"
    os.makedirs(frames_dir, exist_ok=True)
    # Clear frames left over from a previous run
    for f in os.listdir(frames_dir):
        os.remove(os.path.join(frames_dir, f))

    html = TEMPLATE.replace("__DATA__", json.dumps(data))
    html_path = "/tmp/tour-template.html"
    # Explicit UTF-8: the template contains non-ASCII characters (the bullet arrow)
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(html)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1280, "height": 720})
        await page.goto(f"file://{html_path}")
        await page.wait_for_timeout(200)
        num_bullets = len(data.get("bullets", []))

        for frame_num in range(total_frames):
            t = frame_num / total_frames  # normalized time in [0, 1)
            # Staggered fade-ins: each element ramps from 0 to 1 after its start time
            badge_opacity = min(1, max(0, (t - 0.0) * 10))
            title_opacity = min(1, max(0, (t - 0.05) * 8))
            title_offset = max(0, 20 * (1 - min(1, (t - 0.05) * 8)))
            metric_opacity = min(1, max(0, (t - 0.7) * 5))

            # Bullets start at t=0.15 and are spread over the next half of the clip
            bullet_js = ""
            for i in range(num_bullets):
                bt_start = 0.15 + (i * 0.5 / max(1, num_bullets))
                bt = max(0, (t - bt_start) * 8)
                bullet_js += f"""
                bullets[{i}].style.opacity = Math.min(1, {bt});
                bullets[{i}].style.transform = 'translateX(' + Math.max(0, 15 * (1 - Math.min(1, {bt}))) + 'px)';
                """

            await page.evaluate(f"""
                document.getElementById('badge').style.opacity = {badge_opacity};
                document.getElementById('title').style.opacity = {title_opacity};
                document.getElementById('title').style.transform = 'translateY({title_offset}px)';
                document.getElementById('metric').style.opacity = {metric_opacity};
                const bullets = document.querySelectorAll('.bullet');
                {bullet_js}
            """)
            await page.screenshot(path=f"{frames_dir}/frame_{frame_num:04d}.png")

        await browser.close()
    return frames_dir


def stitch_video(frames_dir: str, audio_path: str | None, output_path: str, fps: int = 30):
    """Stitch frames into video, optionally with audio."""
    if audio_path and os.path.exists(audio_path):
        cmd = [
            FFMPEG, "-y",
            "-framerate", str(fps),
            "-i", f"{frames_dir}/frame_%04d.png",
            "-i", audio_path,
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-preset", "fast", "-crf", "23",
            "-c:a", "aac", "-b:a", "128k",
            # No -shortest: the frames deliberately run past the narration
            # (tail padding), and -shortest would cut the output at the audio end.
            output_path,
        ]
    else:
        cmd = [
            FFMPEG, "-y",
            "-framerate", str(fps),
            "-i", f"{frames_dir}/frame_%04d.png",
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-preset", "fast", "-crf", "23",
            output_path,
        ]
    subprocess.run(cmd, check=True, capture_output=True)
    size = os.path.getsize(output_path)
    print(f"Output: {output_path} ({size / 1024:.0f}KB)")


async def main():
    data = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {
        "title": "Example Tour",
        "bullets": [
            "First key point",
            "Second key point",
            "Third key point",
        ],
        "metric": "100%",
        "metricLabel": "complete",
        "narration": "This is an example tour video. It demonstrates animated title cards with bullet points and a highlighted metric.",
    }
    output = sys.argv[2] if len(sys.argv) > 2 else "/tmp/tour-output.mp4"

    audio_path = None
    duration = 5.0
    narration = data.get("narration")
    if narration:
        print("Generating narration...")
        audio_path = "/tmp/tour-narration.mp3"
        duration = generate_narration(narration, audio_path,
                                      voice_name=data.get("voice", "en-US-Journey-D"))
        print(f"Narration: {duration:.1f}s")
        duration += 1.0

    print(f"Rendering {int(duration * 30)} frames...")
    frames_dir = await render_frames(data, duration)

    print("Stitching video...")
    stitch_video(frames_dir, audio_path, output)
    print("Done!")


if __name__ == "__main__":
    asyncio.run(main())
name: tour
description: Generate narrated video tours of completed work. Ken Burns over real captures, TTS narration, posted to Discord.

Tour Skill

Show what was built. Don't just describe it.


Pipeline: How a Tour Video Gets Made

1. Write script (terse narration per segment)
2. Capture visuals (Playwright screenshots + live video)
3. Generate TTS audio (Google Cloud TTS)
4. Render segments (ffmpeg — crop/zoompan over images, mux audio)
5. QA each segment (extract frames, verify readability)
6. Stitch all segments (ffmpeg concat)
7. QA the stitched output (verify transitions, no silence gaps)
8. Compress for Discord (<8MB)
9. Post

Every step has a QA gate. Never skip to step 9.


Cheat Sheet: What Tool For What

| Visual type | Capture method | ffmpeg filter | Notes |
| --- | --- | --- | --- |
| Tall page (GitHub README, PR, docs) | Playwright full-page screenshot (1920px wide) | crop=1920:1080:0:'min(t*(ih-1080)/DUR,ih-1080)' | ⚠️ NEVER use zoompan s=WxH here: it shrinks the whole page into one frame |
| Normal screenshot (1920x1080 or smaller) | Playwright viewport screenshot | scale=3840:-1,zoompan=z='min(zoom+0.0015,1.12)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=FRAMES:s=1920x1080:fps=30 | Upscale first for smooth sub-pixel motion |
| Non-16:9 screenshot (e.g. Discord 1520x1458) | Playwright screenshot | Pad to 16:9 first: pad=W:H:(ow-iw)/2:(oh-ih)/2:black, then zoompan | Don't let zoompan distort aspect ratio |
| Live UI demo (Dojo, openclaw-live) | Playwright recordVideo | Use directly, add 0.3-0.5s fades in/out | Don't add zoom on top of video |
| Terminal output | Run command, render as HTML, screenshot | zoompan center zoom or pan-down | Syntax-highlight if possible |
| Architecture diagram | Render as HTML, screenshot at 1920x1080 | zoompan center zoom 1→1.08 | Dark theme, clean labels |
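
The "pad to 16:9 first" row leaves W and H to be worked out per image. A minimal sketch of that arithmetic follows; the function names `pad_to_16x9` and `pad_filter` are hypothetical, and rounding up to even dimensions is an assumption made so the result stays compatible with yuv420p.

```python
def pad_to_16x9(width: int, height: int) -> tuple[int, int]:
    """Smallest even-sized canvas of (roughly) 16:9 that contains width x height."""
    if width * 9 >= height * 16:
        # Already 16:9 or wider: keep the width, grow the height (integer ceil).
        out_w, out_h = width, -(-width * 9 // 16)
    else:
        # Taller than 16:9: keep the height, grow the width (integer ceil).
        out_w, out_h = -(-height * 16 // 9), height
    # Round odd sizes up by one pixel (assumption: output is encoded as yuv420p).
    return out_w + out_w % 2, out_h + out_h % 2


def pad_filter(width: int, height: int) -> str:
    """Build the pad= filter string from the table for a given source size."""
    out_w, out_h = pad_to_16x9(width, height)
    return f"pad={out_w}:{out_h}:(ow-iw)/2:(oh-ih)/2:black"


if __name__ == "__main__":
    print(pad_filter(1520, 1458))
```

For the Discord example in the table (1520x1458) this yields a 2592x1458 canvas, which can then be scaled and passed to zoompan.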

Ken Burns Standards

  • Rate: zoom+0.0015 per frame at 30fps → ~4.5%/sec
  • Cap: min(zoom+0.0015, 1.15) for segments ≤15s
  • Vary direction: alternate zoom-in, pan-down, pan-right, zoom-out across segments
  • Source resolution: upscale to 3840+ wide before zoompan
  • Frame rate: 30fps minimum
  • 3-second rule: No shot sits still >3s without a pan, zoom, cut, or new element appearing
  • Vary shots per topic: Don't show one static view. For a GitHub repo: PR title → file tree → README scroll. For a UI: overview → detail → interaction. Minimum 2 distinct views per segment.
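
The rate and cap figures above interact: because `zoom+0.0015` is additive, the cap is reached after a fixed time regardless of segment length. The sketch below just does that arithmetic; `step_for_duration` is a hypothetical helper (not part of the current pipeline) that picks a smaller step so the zoom keeps moving for the whole segment.

```python
def zoom_per_second(step: float = 0.0015, fps: int = 30) -> float:
    """Zoom added per second when zoompan adds `step` each frame."""
    return step * fps


def seconds_to_cap(cap: float = 1.15, step: float = 0.0015, fps: int = 30) -> float:
    """Time until min(zoom+step, cap) stops changing, starting from zoom=1."""
    return (cap - 1.0) / (step * fps)


def step_for_duration(cap: float, duration_s: float, fps: int = 30) -> float:
    """Per-frame step that reaches `cap` exactly at the end of the segment."""
    return (cap - 1.0) / (duration_s * fps)


if __name__ == "__main__":
    print(f"rate: {zoom_per_second():.3f}/s")
    print(f"cap 1.15 reached after {seconds_to_cap(1.15):.1f}s")
    print(f"cap 1.12 reached after {seconds_to_cap(1.12):.1f}s")
    print(f"step for a 10s segment: {step_for_duration(1.15, 10):.5f}")
```

With the default step, a 1.15 cap is hit after roughly 3.3 seconds, so a longer segment holds still after that unless the step is lowered or another move follows.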

The #1 Gotcha: Tall Image Squish

This caused 80% of our bugs. If an image is taller than 1080px:

# ✅ CORRECT: crop+pan (scrolls a 1080px window down the page)
ffmpeg -loop 1 -i tall.png -vf "crop=1920:1080:0:'min(t*(ih-1080)/DUR,ih-1080)'" -t DUR -c:v libx264 -pix_fmt yuv420p -r 30 out.mp4

# ❌ WRONG: zoompan with s= (shrinks entire page into one frame)
ffmpeg -loop 1 -i tall.png -vf "zoompan=z=1.001:s=1920x1080:d=300:fps=30" -t 10 out.mp4

Rule: Check image height BEFORE choosing filter. If height > 1080, use crop.
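
One way to enforce this rule is to pick the filter in code instead of by eye. This is a sketch, not the pipeline's actual implementation: `image_size` and `pick_filter` are hypothetical names, `ffprobe` is assumed to be on PATH, and the filter strings are copied from the cheat sheet above.

```python
import subprocess


def image_size(path: str, ffprobe: str = "ffprobe") -> tuple[int, int]:
    """Read width and height of an image with ffprobe (assumed to be on PATH)."""
    out = subprocess.run(
        [ffprobe, "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    width, height = (int(v) for v in out.split(",")[:2])
    return width, height


def pick_filter(width: int, height: int, duration_s: float, fps: int = 30) -> str:
    """Return a crop+pan filter for tall images, a zoompan filter otherwise."""
    if height > 1080:
        if width < 1920:
            raise ValueError("crop=1920:1080 needs a source at least 1920px wide")
        # Scroll a 1080px window down the page over the whole duration.
        return f"crop=1920:1080:0:'min(t*(ih-1080)/{duration_s:g},ih-1080)'"
    frames = round(duration_s * fps)
    # Upscale first, then a slow centered zoom (values from the cheat sheet).
    return (
        "scale=3840:-1,zoompan=z='min(zoom+0.0015,1.12)'"
        ":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"
        f":d={frames}:s=1920x1080:fps={fps}"
    )
```

Typical use would be `pick_filter(*image_size("tall.png"), duration_s=10)`. Non-16:9 sources still need padding first, which this sketch does not handle.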


Audio

  • Voice: en-US-Studio-O (supports full SSML — IPA phonemes, <say-as>, breaks)
  • Credentials: load them first with . ~/.openclaw/credentials/secrets.sh
  • Pronunciation: Use SSML for technical terms: <say-as interpret-as="characters">A2UI</say-as>
  • Duration drives video: Calculate audio duration first, match video segment to it
  • Never use -shortest: It truncates audio mid-word
  • Check for silence: run ffmpeg -i seg.mp4 -af silencedetect=noise=-30dB:d=0.5 -f null - and trim dead air at segment ends

Stitching

# Normalize ALL chapters to same format before concat
for f in *.mp4; do
  ffmpeg -y -i "$f" -c:v libx264 -c:a aac -ar 44100 -ac 2 -r 30 -s 1920x1080 -pix_fmt yuv420p "norm_$f"
done

# Handle non-1080p sources (e.g. 1280x720)
ffmpeg -y -i source.mp4 -vf "scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:(ow-iw)/2:(oh-ih)/2:black" ...

# Concat
ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4

# Compress for Discord (<8MB)
ffmpeg -i full.mp4 -c:v libx264 -crf 35 -preset fast -c:a aac -b:a 64k -s 960x540 discord.mp4
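
A fixed -crf 35 does not guarantee the result fits under the limit; longer tours can still overshoot. As an alternative sketch, the video bitrate can be derived from the duration and the size budget. The 8MB figure comes from this document; the 90% headroom, the decimal megabyte, and the function names are assumptions, and single-pass bitrate targeting is only approximate.

```python
import math


def video_bitrate_kbps(duration_s: float, limit_bytes: int = 8_000_000,
                       audio_kbps: int = 64, headroom_pct: int = 90) -> int:
    """Video bitrate (kbit/s) that keeps video + audio under the size limit."""
    budget_bits = limit_bytes * 8 * headroom_pct // 100  # leave room for container overhead
    total_kbps = budget_bits / duration_s / 1000
    video_kbps = math.floor(total_kbps - audio_kbps)
    if video_kbps <= 0:
        raise ValueError("clip is too long to fit in the size budget")
    return video_kbps


def discord_cmd(src: str, dst: str, duration_s: float, ffmpeg: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg command that targets the computed bitrate at 960x540."""
    v = video_bitrate_kbps(duration_s)
    return [
        ffmpeg, "-y", "-i", src,
        "-c:v", "libx264", "-preset", "fast",
        "-b:v", f"{v}k", "-maxrate", f"{v}k", "-bufsize", f"{2 * v}k",
        "-c:a", "aac", "-b:a", "64k",
        "-s", "960x540",
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(discord_cmd("full.mp4", "discord.mp4", duration_s=60)))
```

For a 60 second tour this works out to 896 kbit/s of video. Checking the output size afterwards is still worthwhile.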

QA Checklist (MANDATORY before posting)

Per-segment QA

  • Extract frame at 2-3 timestamps: ffmpeg -ss T -i seg.mp4 -frames:v 1 check.png
  • Text readable? (not squished, not tiny)
  • Correct aspect ratio? (no stretching)
  • No error pages? (GitHub rate limit, 404)
  • No black frames? (unless intentional fade)
  • Audio plays to natural end? (no mid-word cutoff)

Stitched tour QA

  • Check every chapter boundary (extract frame at each transition timestamp)
  • No gray rectangles or compositing artifacts? (resolution mismatch = artifact)
  • No dead silence >0.5s between chapters?
  • Total duration matches expected sum?
  • Watch at least the problem areas at 960x540 (Discord resolution)
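
The boundary and duration checks above follow directly from the per-chapter durations. A small sketch, assuming the durations are already known (for example from ffprobe); the function names, output file names, and the 0.1s offset and 0.25s tolerance are illustrative choices.

```python
from itertools import accumulate


def chapter_boundaries(durations: list[float]) -> list[float]:
    """Timestamps in the stitched video where one chapter hands over to the next."""
    return list(accumulate(durations))[:-1]


def boundary_checks(durations: list[float], stitched: str = "output.mp4",
                    offset: float = 0.1) -> list[list[str]]:
    """ffmpeg commands that grab one frame just before and after each boundary."""
    cmds = []
    for i, t in enumerate(chapter_boundaries(durations)):
        for label, ts in (("before", max(0.0, t - offset)), ("after", t + offset)):
            cmds.append([
                "ffmpeg", "-y", "-ss", f"{ts:.3f}", "-i", stitched,
                "-frames:v", "1", f"boundary_{i}_{label}.png",
            ])
    return cmds


def duration_matches(durations: list[float], actual_s: float,
                     tolerance: float = 0.25) -> bool:
    """True if the stitched duration is within tolerance of the expected sum."""
    return abs(sum(durations) - actual_s) <= tolerance
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; the extracted frames still need to be looked at.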

Before posting to Alan

  • I verified this myself. Sub-agent reports are insufficient.
  • I am NOT using Alan as QA.

Tone

Terse, fast, information-dense.

  • One sentence per idea. Move on.
  • Stats are content: "938 tests, 0 failures"
  • Cut any sentence that doesn't add new information
  • Never repeat the same point in different words
  • ~2/3 architecture + implementation, ~1/3 capabilities/evidence
  • Err on the side of too short

Content Philosophy

Show the thing working, not the PR about the thing. Developers want to see what they'd experience — terminal commands running, UI rendering, output appearing. They don't want a narrated summary of a GitHub diff.

Cut aggressively. If a segment doesn't teach the viewer something they'd need, kill it. Two great segments beat six mediocre ones.

Never compare old vs new. Just show the new thing working. The viewer doesn't have context on what was broken before.


Animated Terminals (preferred over Ken Burns on terminal screenshots)

Static terminal screenshots + Ken Burns = text gets cropped, zoomed off-screen, or too small. Instead:

  1. Build an HTML page with CSS animation-delay on each line (simulates typing)
  2. Record with Playwright recordVideo (not screenshot)
  3. Use the recorded .webm directly as a video segment
<div class="line" style="opacity:0; animation: fadeIn 0.1s forwards; animation-delay: 2.3s;">
  <span class="prompt"></span><span class="cmd">uv run .</span>
</div>
const ctx = await browser.newContext({
  viewport: { width: 1920, height: 1080 },
  recordVideo: { dir: '/tmp/out/', size: { width: 1920, height: 1080 } }
});
const p = await ctx.newPage();
await p.goto('file:///tmp/animated_term.html');
await new Promise(r => setTimeout(r, 14000)); // wait for animation
await ctx.close();

Rule: Ken Burns is for visual content (UI mockups, diagrams, photos). Never for text-heavy images.
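
Writing one div per line by hand gets tedious, so the animated page can be generated. The following is a rough sketch under several assumptions: the page styling, the `out` class for output lines, the timing defaults, and the function name `build_terminal_page` are all made up for illustration. Only the per-line markup mirrors the snippet above.

```python
import html

PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><style>
body { margin: 0; padding: 60px; background: #0d1117; color: #e6edf3;
       font: 28px/1.5 monospace; width: 1920px; height: 1080px; box-sizing: border-box; }
.prompt::before { content: '$ '; color: #22c55e; }
@keyframes fadeIn { to { opacity: 1; } }
</style></head><body>
__LINES__
</body></html>
"""


def build_terminal_page(lines: list[tuple[str, str]], start_s: float = 0.5,
                        gap_s: float = 0.6) -> tuple[str, float]:
    """lines is a list of (kind, text) with kind 'cmd' or 'out'.

    Returns the page HTML and how many seconds to record it.
    """
    rows = []
    delay = start_s
    for i, (kind, text) in enumerate(lines):
        if kind not in ("cmd", "out"):
            raise ValueError(f"unknown line kind: {kind!r}")
        delay = start_s + i * gap_s
        prompt = '<span class="prompt"></span>' if kind == "cmd" else ""
        rows.append(
            '<div class="line" style="opacity:0; animation: fadeIn 0.1s forwards; '
            f'animation-delay: {delay:.1f}s;">'
            f'{prompt}<span class="{kind}">{html.escape(text)}</span></div>'
        )
    record_seconds = delay + 0.1 + 1.0  # last fade plus a one-second hold
    return PAGE.replace("__LINES__", "\n".join(rows)), record_seconds
```

The returned duration can replace a hard-coded wait when recording, so the video ends shortly after the last line appears.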


HTML Mockups for UI Demos

Render mockup UIs as HTML instead of screenshotting live apps:

  • Faster to iterate, always pixel-perfect, loads instantly
  • No need to boot a full client app
  • Can show JSON → rendered UI side-by-side in one page
  • Great for "money shot" frames that communicate a protocol's value prop

GitHub Page Captures

Use domcontentloaded instead of networkidle for GitHub URLs. GitHub's background analytics never reach idle — you'll hit timeouts every time.

await p.goto(githubUrl, { waitUntil: 'domcontentloaded', timeout: 60000 });
await new Promise(r => setTimeout(r, 5000)); // extra wait for rendering

Capture Tooling

Playwright (Node.js — for screenshots and video)

cd /tmp && npm init -y && npm install playwright && npx playwright install chromium --with-deps
const { chromium } = require('playwright');

(async () => {
  const url = process.argv[2]; // page to capture
  const b = await chromium.launch({ headless: true, args: ['--no-sandbox'] });
  const p = await b.newPage();
  await p.setViewportSize({ width: 1920, height: 1080 });
  // For GitHub URLs use 'domcontentloaded' instead (see GitHub Page Captures)
  await p.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
  await new Promise(r => setTimeout(r, 3000)); // extra wait
  await p.screenshot({ path: 'out-full.png', fullPage: true }); // tall page
  await p.screenshot({ path: 'out.png' }); // viewport only
  await b.close();
})();

GitHub rate limiting: If you get a "too many requests" page, wait 30s and retry (up to 3 times). Always verify the screenshot shows real content.
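
The retry rule can be wrapped in a small helper so every capture follows it. This is a sketch with hypothetical names: `capture` is whatever function takes the screenshot or reads the page, and `looks_ok` is a caller-supplied check (for example, that the page text does not contain the rate-limit message).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def capture_with_retry(capture: Callable[[], T], looks_ok: Callable[[T], bool],
                       retries: int = 3, wait_s: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call capture(); on a bad result wait and retry, up to `retries` times."""
    for attempt in range(retries + 1):
        result = capture()
        if looks_ok(result):
            return result
        if attempt < retries:
            sleep(wait_s)  # back off before the next attempt
    raise RuntimeError(f"capture still failing after {retries} retries")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Verifying that the final screenshot shows real content remains a manual QA step.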

Google Cloud TTS

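from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
# ssml_text holds the <speak>...</speak> narration for one segment
synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)  # avoid shadowing built-in input()
voice = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Studio-O")
config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)
response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=config)
with open("segment.mp3", "wb") as f:  # output file name is illustrative
    f.write(response.audio_content)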

ffmpeg

Binary: ~/.openclaw/bin/ffmpeg and ~/.openclaw/bin/ffprobe


Simple Tour (Discord message, no video)

For quick tours that don't need narration:

## ✅ [Feature Name]

[One sentence: what + why]

**What changed:**
- Point 1
- Point 2
- Point 3

**Try it:** [URL or command]

[screenshot.png attached]

Post to the project's Discord channel. Use #zaf for internal/meta work.


Future: Motion Canvas / Manim Integration

Alan's feedback (Mar 17): "I'm surprised the images and screenshots and websites aren't imported into the motion suite of tools."

Current pipeline is pure ffmpeg — functional but limited. The upgrade path:

  • Motion Canvas (TypeScript): Programmatic animations, can import images/screenshots as scene elements with transitions, text overlays, layout control. Best for polished explainer-style videos.
  • Manim (Python): Mathematical/technical animations. Good for architecture diagrams, data flow visualizations.

When to upgrade: When we need text overlays on screenshots, split-screen layouts, animated callouts/highlights, or smooth scene transitions beyond crossfades. The current ffmpeg pipeline can't do these well.

What this would replace: The zoompan/crop Ken Burns approach for screenshots. Images would become scene elements in a Motion Canvas project, with proper entrance animations, annotations, and transitions.

Status: Not yet implemented. Current ffmpeg pipeline works for simple tours. Evaluate Motion Canvas for the next major video project.


Workspace & Assets

Tour scripts, captures, and rendered segments should live in a dedicated directory:

  • Working dir: /tmp/tour-<name>/ (ephemeral, per-tour)
  • Scripts: memory/tour-script/ (persist narration scripts for reference)
  • Future: Consider a dedicated repo (zeroasterisk/zaf-videos) for durability — Alan suggested this Mar 15.

Cost

< $0.01 per tour video. TTS ≈ $0.005 per segment. ffmpeg is free.
