🤖✨ Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen1,2 ·
Fei Wang1,2 ·
Zhihao Huang5,6 ·
Qing Zhou8 ·
Kun Li7
Dan Guo1 ·
Linfeng Zhang4 ·
Xun Yang3
1 Hefei University of Technology ·
2 IAI, Hefei Comprehensive National Science Center
3 USTC ·
4 SJTU ·
5 TeleAI, China Telecom ·
6 Northwestern Polytechnical University
7 United Arab Emirates University ·
8 Anhui Polytechnic University
- Core source code release
- Pretrained checkpoints (CKPT)
- Usage documentation and tutorials
- Rendering tools
- ...
- 🧠 Causal turn-level formulation for streaming conversational generation
- 🔄 Unified talking & listening modeling within a single framework
- 🎧🗣️ Interleaved multimodal tokens from both interlocutors
- 🌊 Diffusion-based 3D head decoding for expressive and stochastic motion
- 📉 15–30% error reduction over strong baselines (e.g., DualTalk)
Human conversation is a continuous exchange of speech and nonverbal cues—including head nods, gaze shifts, and subtle expressions.
Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.
We propose a causal, turn-level framework for interactive 3D conversational head generation.
Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.
TIMAR (Turn-level Interleaved Masked AutoRegression) is the core method proposed in this work.
- Represent conversation as interleaved audio–visual tokens:
- 👤 User speech + user head motion
- 🤖 Agent speech + agent head motion
- Perform:
- 🔁 Bidirectional fusion within each turn (intra-turn alignment)
- ⏱️ Strictly causal reasoning across turns (inter-turn dependency)
This design mirrors how humans coordinate speaking and listening over time.
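The intra-turn/inter-turn attention pattern above can be sketched as a boolean mask: a token attends bidirectionally to all tokens of its own turn and causally to tokens of earlier turns, never to future turns. This is a minimal illustration assuming integer turn indices per token; the function name and shapes are ours, not from the released code.

```python
import numpy as np

def turn_level_causal_mask(turn_ids):
    """Boolean attention mask: True where attention is allowed.

    Token i may attend to token j iff j belongs to the same turn as i
    (bidirectional intra-turn) or to an earlier turn (causal inter-turn).
    """
    t = np.asarray(turn_ids)
    # allowed[i, j] is True iff token j's turn is not later than token i's
    return t[None, :] <= t[:, None]

# Example: two turns, each with three interleaved audio/motion tokens
mask = turn_level_causal_mask([0, 0, 0, 1, 1, 1])
```

Tokens of turn 0 see each other fully but never the turn-1 tokens, while turn-1 tokens see everything, which is the "no future leakage" property.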
Core components:
- 🧠 Turn-Level Causal Attention (TLCA)
- Bidirectional attention inside a turn
- Causal masking across turns (no future leakage)
- 🌊 Lightweight Diffusion Head
- Predicts continuous 3D head motion
- Captures expressive stochasticity beyond deterministic regression
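To make the diffusion-head idea concrete, here is a toy sketch of the forward-noising step such a head is trained against: clean motion codes are corrupted at a sampled noise level, and a predictor is fit to recover the noise. The linear schedule, dimensions, and placeholder predictor are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (assumed here; the paper's schedule is unspecified)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def noise_motion(x0, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

def noise_mse(eps_pred, eps):
    """Standard denoising objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

# Toy per-frame 3D head-motion codes: 8 frames, 64-dim each
x0 = rng.standard_normal((8, 64))
xt, eps = noise_motion(x0, t=50)
loss = noise_mse(np.zeros_like(eps), eps)  # placeholder zero predictor
```

Sampling from such a head (iteratively denoising from pure noise, conditioned on the turn-level context) is what gives stochastic, expressive motion instead of the single averaged output of a deterministic MLP regressor.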
We evaluate our framework on the interactive 3D conversational head benchmark, following the DualTalk protocol.
Results at a glance:
- ⬇️ 15–30% reduction in Fréchet Distance (FD) and MSE
- 📈 Improved expressiveness and synchronization (SID ↑)
- 🌍 Strong generalization on out-of-distribution conversations
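For readers unfamiliar with the FD metric above, the sketch below computes a Fréchet Distance between real and generated motion features under a diagonal-Gaussian simplification; the full metric uses complete covariance matrices and a matrix square root, and this helper is our illustration, not the benchmark's evaluation code.

```python
import numpy as np

def frechet_distance_diag(real, gen):
    """Fréchet Distance between two feature sets, diagonal-Gaussian case.

    real, gen: arrays of shape (num_samples, feature_dim). With diagonal
    covariances the metric reduces to ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    """
    mu_r, mu_g = real.mean(0), gen.mean(0)
    sd_r, sd_g = real.std(0), gen.std(0)
    return float(np.sum((mu_r - mu_g) ** 2) + np.sum((sd_r - sd_g) ** 2))

rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 16))
fd_self = frechet_distance_diag(feats, feats)         # identical sets
fd_shift = frechet_distance_diag(feats, feats + 1.0)  # mean shifted by 1
```

Identical distributions score zero; a constant shift of 1 in every one of the 16 dimensions contributes roughly 16 to the distance, so lower FD means generated motion statistics sit closer to the ground truth.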
| Demo | Preview |
|---|---|
| Demo 1 | demo_1.mp4 |
| Demo 2 | demo_2.mp4 |
| Demo 3 | demo_3.mp4 |
Notation
- Agent GT denotes the ground-truth 3D head motion.
- TIMAR Agent denotes our generated results.
- DualTalk Agent denotes the outputs from the DualTalk baseline.
TIMAR produces:
- Natural listening behavior when the agent is silent
- Context-aware reactions with longer conversational history
- Smoother and more stable 3D head motion
We analyze the contribution of each design choice:
- ❌ MLP head vs 🌊 diffusion-based head
- ❌ Full bidirectional attention vs ✅ turn-level causal attention
- ❌ Encoder–decoder vs ✅ encoder-only backbone
Each component is critical for causal coherence and generalization.
If you find this work useful, please consider citing:
@article{chen2025timar,
title={Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics},
author={Chen, Junjie and Wang, Fei and Huang, Zhihao and Zhou, Qing and Li, Kun and Guo, Dan and Zhang, Linfeng and Yang, Xun},
journal={arXiv preprint arXiv:2512.15340},
year={2025}
}

Our implementation benefits from the publicly available codebases of MAR and DualTalk. We thank the authors for releasing their repositories, which provided valuable references for implementation details and experimental protocols.
We further acknowledge the broader research community for foundational advances in autoregressive diffusion modeling, multimodal representation learning, and interactive 3D conversational head generation. While our framework introduces a distinct turn-level causal formulation and interleaved modeling strategy, it builds upon and extends these prior technical developments.



