Yuxin Guo1,2,
Teng Wang2,
Yuying Ge2,
Shijie Ma1,2,
Yixiao Ge2,
Wei Zou1,
Ying Shan2
1Institute of Automation, CAS
2ARC Lab, Tencent PCG
✨ TL;DR: We propose a model for long-form narrative audio generation built upon a unified understanding-generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
[2025/09/02] 🔥🔥 Text-to-long audio checkpoint released!
[2025/08/28] 🔥🔥 We release the inference code!
[2025/08/28] 🔥🔥 We release our demo videos!
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following generation capabilities: it employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and consistent emotional tone. AudioStory has two appealing features:
- Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components: a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation.
- End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish AudioStory-10K, a benchmark encompassing diverse domains such as animated soundscapes and natural sound narratives.
Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
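Conceptually, the decoupled bridging mechanism can be pictured as two learnable query sets that read the LLM's hidden states and produce the conditioning signal for the diffusion-based generator. The PyTorch sketch below is only a minimal illustration of that idea under our own assumptions; the module structure, dimensions, and use of `nn.MultiheadAttention` are hypothetical and not taken from the AudioStory implementation.

```python
import torch
import torch.nn as nn

class DecoupledBridge(nn.Module):
    """Minimal sketch of the decoupled bridging idea: bridging queries summarize
    the current sub-event (intra-event semantics), while consistency queries
    attend to earlier sub-events (cross-event coherence)."""

    def __init__(self, dim=1024, n_bridge=32, n_consist=8, n_heads=8):
        super().__init__()
        self.bridge_q = nn.Parameter(torch.randn(n_bridge, dim) * 0.02)
        self.consist_q = nn.Parameter(torch.randn(n_consist, dim) * 0.02)
        self.intra_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, event_hidden, history_hidden):
        # event_hidden:   (B, T_event, dim) LLM hidden states for the current sub-event
        # history_hidden: (B, T_hist,  dim) LLM hidden states for earlier sub-events
        b = event_hidden.size(0)
        bq = self.bridge_q.unsqueeze(0).expand(b, -1, -1)
        cq = self.consist_q.unsqueeze(0).expand(b, -1, -1)
        intra, _ = self.intra_attn(bq, event_hidden, event_hidden)      # intra-event alignment
        cross, _ = self.cross_attn(cq, history_hidden, history_hidden)  # cross-event coherence
        # The concatenated queries would then condition the diffusion audio generator.
        return torch.cat([intra, cross], dim=1)  # (B, n_bridge + n_consist, dim)
```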
Dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from videos.
- 159.mp4
- 226.mp4
- 231.mp4
- cute_dogs.mp4
- fight_cat_and_dog.mp4
- cute_cats.mp4
- snoopy.mp4
- Donald_Duck2.mp4
- nezha2.mp4
- We_Bare_Bears.mp4
- Donald_Duck.mp4
- sora1.mp4
- sora2.mp4
- mifei.mp4
- nezha1.mp4
| Instruction | Demo |
| --- | --- |
| "Develop a comprehensive audio that fully represents Jake Shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds." | video_demos_ttla_1.mp4 |
| "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds." | video_demos_ttla_2.mp4 |
| "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds." | video_demos_ttla_3.mp4 |
To achieve effective instruction-following audio generation, the model must understand the input instruction or audio stream and reason about the relevant audio sub-events. To this end, AudioStory adopts a unified understanding-generation framework (see the framework figure). Specifically, given a textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs interleaved reasoning generation, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. The two token types are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
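The interleaved reasoning-generation loop described above can be summarized by the pseudocode-style sketch below. Every name in it (`llm.decompose`, `llm.generate_tokens`, `fuse`, `dit.synthesize`, `concatenate`) is a hypothetical placeholder used only to show the data flow and does not correspond to the actual AudioStory API.

```python
# Structural sketch of the interleaved reasoning-generation loop.
# All method and helper names are hypothetical placeholders.

def generate_narrative_audio(llm, dit, instruction, total_duration):
    # 1) The LLM decomposes the instruction (or input audio) into temporally
    #    ordered sub-events, each with a caption, contextual cues, and a duration.
    sub_events = llm.decompose(instruction, total_duration)

    clips = []
    for event in sub_events:
        # 2) Interleaved reasoning generation: for each sub-event the LLM emits a
        #    refined caption plus semantic tokens and residual tokens.
        caption, semantic_tokens, residual_tokens = llm.generate_tokens(
            event, history=clips
        )
        # 3) The two token types are fused into one conditioning signal that
        #    bridges the LLM with the DiT-based audio generator.
        condition = fuse(semantic_tokens, residual_tokens)
        clips.append(dit.synthesize(condition, duration=event.duration))

    # 4) The sub-event clips are concatenated into the final long-form audio.
    return concatenate(clips)
```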
- Python >= 3.10 (Anaconda is recommended)
- PyTorch >= 2.1.0
- NVIDIA GPU + CUDA
git clone https://github.com/TencentARC/AudioStory.git
cd AudioStory
conda create -n audiostory python=3.10 -y
conda activate audiostory
bash install_audiostory.sh
Download the model checkpoint from Hugging Face Models.
python evaluate/inference.py \
--model_path ckpt/audiostory-3B \
--guidance 4.0 \
--save_folder_name audiostory \
    --total_duration 50

When building the codebase of continuous denoisers, we refer to SEED-X and TangoFlux. Thanks for their wonderful projects.
- Release our Gradio demo.
- Release AudioStory model checkpoints.
- Release AudioStory-10K dataset.
- Release training code of all three stages.
This repository is released under the Apache 2.0 License.
@misc{guo2025audiostory,
title={AudioStory: Generating Long-Form Narrative Audio with Large Language Models},
author={Yuxin Guo and Teng Wang and Yuying Ge and Shijie Ma and Yixiao Ge and Wei Zou and Ying Shan},
year={2025},
eprint={2508.20088},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.20088},
}
If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
Discussions and potential collaborations are also welcome.

