Yuxin Guo1,2,
Teng Wang2,
Yuying Ge2,
Shijie Ma1,2,
Yixiao Ge2,
Wei Zou1,
Ying Shan2
1Institute of Automation, CAS
2ARC Lab, Tencent PCG
✨ TL;DR: We propose a model for long-form narrative audio generation built upon a unified understanding-generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
[2025/09/02] 🔥🔥 Text-to-long audio checkpoint released!
[2025/08/28] 🔥🔥 We release the inference code!
[2025/08/28] 🔥🔥 We release our demo videos!
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following generation capabilities: it employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and consistent emotional tone. AudioStory has two appealing features:
- Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components: a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation.
- End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish AudioStory-10K, a benchmark encompassing diverse domains such as animated soundscapes and natural sound narratives.
Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
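Conceptually, the decoupled bridging mechanism can be pictured as two learnable query sets that read the LLM's hidden states and produce the conditioning signal for the diffusion-based generator. The PyTorch sketch below is only a minimal illustration of that idea under our own assumptions; the module structure, dimensions, and use of `nn.MultiheadAttention` are hypothetical and not taken from the AudioStory implementation.

```python
import torch
import torch.nn as nn

class DecoupledBridge(nn.Module):
    """Minimal sketch of the decoupled bridging idea: bridging queries summarize
    the current sub-event (intra-event semantics), while consistency queries
    attend to earlier sub-events (cross-event coherence)."""

    def __init__(self, dim=1024, n_bridge=32, n_consist=8, n_heads=8):
        super().__init__()
        self.bridge_q = nn.Parameter(torch.randn(n_bridge, dim) * 0.02)
        self.consist_q = nn.Parameter(torch.randn(n_consist, dim) * 0.02)
        self.intra_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, event_hidden, history_hidden):
        # event_hidden:   (B, T_event, dim) LLM hidden states for the current sub-event
        # history_hidden: (B, T_hist,  dim) LLM hidden states for earlier sub-events
        b = event_hidden.size(0)
        bq = self.bridge_q.unsqueeze(0).expand(b, -1, -1)
        cq = self.consist_q.unsqueeze(0).expand(b, -1, -1)
        intra, _ = self.intra_attn(bq, event_hidden, event_hidden)      # intra-event alignment
        cross, _ = self.cross_attn(cq, history_hidden, history_hidden)  # cross-event coherence
        # The concatenated queries would then condition the diffusion audio generator.
        return torch.cat([intra, cross], dim=1)  # (B, n_bridge + n_consist, dim)
```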
Dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from videos.
- 159.mp4
- 226.mp4
- 231.mp4
- cute_dogs.mp4
- fight_cat_and_dog.mp4
- cute_cats.mp4
- snoopy.mp4
- Donald_Duck2.mp4
- nezha2.mp4
- We_Bare_Bears.mp4
- Donald_Duck.mp4
- sora1.mp4
- sora2.mp4
- mifei.mp4
- nezha1.mp4
| Instruction | Demo |
| --- | --- |
| "Develop a comprehensive audio that fully represents Jake Shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds." | video_demos_ttla_1.mp4 |
| "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds." | video_demos_ttla_2.mp4 |
| "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds." | video_demos_ttla_3.mp4 |
To achieve effective instruction-following audio generation, the model must understand the input instruction or audio stream and reason about the relevant audio sub-events. To this end, AudioStory adopts a unified understanding-generation framework (see the framework figure). Specifically, given a textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs interleaved reasoning generation, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. The two token types are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
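The interleaved reasoning-generation loop described above can be summarized by the pseudocode-style sketch below. Every name in it (`llm.decompose`, `llm.generate_tokens`, `fuse`, `dit.synthesize`, `concatenate`) is a hypothetical placeholder used only to show the data flow and does not correspond to the actual AudioStory API.

```python
# Structural sketch of the interleaved reasoning-generation loop.
# All method and helper names are hypothetical placeholders.

def generate_narrative_audio(llm, dit, instruction, total_duration):
    # 1) The LLM decomposes the instruction (or input audio) into temporally
    #    ordered sub-events, each with a caption, contextual cues, and a duration.
    sub_events = llm.decompose(instruction, total_duration)

    clips = []
    for event in sub_events:
        # 2) Interleaved reasoning generation: for each sub-event the LLM emits a
        #    refined caption plus semantic tokens and residual tokens.
        caption, semantic_tokens, residual_tokens = llm.generate_tokens(
            event, history=clips
        )
        # 3) The two token types are fused into one conditioning signal that
        #    bridges the LLM with the DiT-based audio generator.
        condition = fuse(semantic_tokens, residual_tokens)
        clips.append(dit.synthesize(condition, duration=event.duration))

    # 4) The sub-event clips are concatenated into the final long-form audio.
    return concatenate(clips)
```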
- Python >= 3.10 (Anaconda is recommended)
- PyTorch >= 2.1.0
- NVIDIA GPU + CUDA
git clone https://github.com/TencentARC/AudioStory.git
cd AudioStory
conda create -n audiostory python=3.10 -y
conda activate audiostory
bash install_audiostory.sh
Download the model checkpoint from Hugging Face Models.
python evaluate/inference.py \
--model_path ckpt/audiostory-3B \
--guidance 4.0 \
--save_folder_name audiostory \
    --total_duration 50

When building the codebase of continuous denoisers, we refer to SEED-X and TangoFlux. Thanks for their wonderful projects.
- Release our Gradio demo.
- Release AudioStory model checkpoints.
- Release AudioStory-10K dataset.
- Release training code of all three stages.
This repository is released under the Apache 2.0 License.
@misc{guo2025audiostory,
title={AudioStory: Generating Long-Form Narrative Audio with Large Language Models},
author={Yuxin Guo and Teng Wang and Yuying Ge and Shijie Ma and Yixiao Ge and Wei Zou and Ying Shan},
year={2025},
eprint={2508.20088},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.20088},
}
If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
Discussions and potential collaborations are also welcome.

