Why
ROADMAP Phase 4 lists "compose-scenes-with-skills narration awareness — pass word-level transcript so Claude can word-sync animations to audio" as an open item.
The interesting part: the data already exists but is dropped on the floor when building the LLM prompt.
State of code (2026-04-29)
Word-level transcript is generated upstream in vibe scene build:
- packages/cli/src/commands/scene.ts:854-878 runs Whisper with granularity: "word", writes assets/transcript-<id>.json, and assembles transcriptWords: { text, start, end }[]
- scene.ts:988 passes these words into the beat result for runtime use (Hyperframes __hf.media consumers)
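For orientation, here is a minimal sketch of the data shape described above. The field names come from this issue. The interface name, the sample values, the wordAt helper, and the assumption that start and end are in seconds are illustrative only and not taken from scene.ts.

```typescript
// Shape as described in this issue: one entry per spoken word.
// Assumption: start/end are offsets in seconds from the start of the narration audio.
interface TranscriptWord {
  text: string;
  start: number;
  end: number;
}

// Hypothetical sample data, not real Whisper output.
const transcriptWords: TranscriptWord[] = [
  { text: "Hello", start: 0.0, end: 0.42 },
  { text: "world", start: 0.48, end: 0.9 },
];

// Hypothetical helper: returns the word being spoken at time t,
// or undefined when t falls in a gap between words.
function wordAt(words: TranscriptWord[], t: number): TranscriptWord | undefined {
  return words.find((w) => t >= w.start && t < w.end);
}
```

A runtime consumer could use a lookup like this to highlight the current word; the compose step would need the same start/end values to author animations ahead of time.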
But the per-beat compose prompt doesn't see them:
- In packages/cli/src/commands/_shared/compose-prompts.ts:218, the instructions reference only cues.narration (raw text)
- In packages/cli/src/commands/_shared/compose-scenes-skills.ts:178-179, cue rendering emits only the narration string, not timings
So the LLM composing scene HTML can't author word-synced animations because it doesn't know when each word is spoken.
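To make the gap concrete, the following is a rough sketch of how a cue renderer might append word timings to the narration line, with a size cap so long transcripts do not bloat the prompt. It is not the existing code in compose-scenes-skills.ts: the function name renderNarrationCue, the MAX_WORDS value, and the output layout (a YAML-like list of inline JSON objects) are all assumptions for illustration.

```typescript
interface TranscriptWord {
  text: string;
  start: number;
  end: number;
}

// Hypothetical cap; a real threshold would be tuned against the prompt token budget.
const MAX_WORDS = 200;

// Hypothetical renderer: always emits the narration text, and adds a
// word_timings block only when timings exist and fit under the cap.
function renderNarrationCue(
  narration: string,
  words?: TranscriptWord[],
  maxWords: number = MAX_WORDS,
): string {
  const lines = [`narration: ${JSON.stringify(narration)}`];

  // No transcript: keep today's behaviour (narration text only).
  if (!words || words.length === 0) return lines.join("\n");

  // Oversized transcript: skip the timings but say so explicitly.
  if (words.length > maxWords) {
    lines.push(`word_timings: omitted (${words.length} words exceeds limit of ${maxWords})`);
    return lines.join("\n");
  }

  lines.push("word_timings:");
  for (const w of words) {
    lines.push(`  - ${JSON.stringify({ text: w.text, start: w.start, end: w.end })}`);
  }
  return lines.join("\n");
}

// Usage with hypothetical data.
console.log(
  renderNarrationCue("Hello world", [
    { text: "Hello", start: 0, end: 0.42 },
    { text: "world", start: 0.48, end: 0.9 },
  ]),
);
```

This sketch skips oversized transcripts entirely rather than truncating them, on the assumption that partial timings could lead the model to sync only the first part of a beat; truncation is an equally valid choice.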
Scope
- Pass transcriptWords through to composeScenesWithSkills() per beat
- In compose-scenes-skills.ts:178-193 cue rendering, when transcriptWords exists, emit a structured block (probably YAML or JSON inline) listing { text, start, end } per word
- Update compose-prompts.ts instructions (line 218 area) to mention word-level timings as available, and what the LLM is expected to do with them (data-attributes on spans for GSAP timing? CSS keyframes? leave that to the Hyperframes skill)
- Check transcriptWords.length and either truncate or skip if over a threshold
- Add tests in compose-prompts.test.ts / compose-scenes-skills.test.ts covering: no transcript, short transcript, oversized transcript
Reference
- ROADMAP.md Phase 4 "Open items in Phase 4 (v0.61+ candidates)"
- Word-sync comment already in code:
  scene.ts:850 "GSAP word-sync from it. Failure is non-fatal — narration still plays..."