🤖✨ Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen1,2 ·
Fei Wang1,2 ·
Zhihao Huang5,6 ·
Qing Zhou8 ·
Kun Li7
Dan Guo1 ·
Linfeng Zhang4 ·
Xun Yang3
1 Hefei University of Technology ·
2 IAI, Hefei Comprehensive National Science Center
3 USTC ·
4 SJTU ·
5 TeleAI, China Telecom ·
6 Northwestern Polytechnical University
7 United Arab Emirates University ·
8 Anhui Polytechnic University
- Core source code release
- Pretrained checkpoints (CKPT)
- Usage documentation and tutorials
- Rendering tools
- ...
- 🧠 Causal turn-level formulation for streaming conversational generation
- 🔄 Unified talking & listening modeling within a single framework
- 🎧🗣️ Interleaved multimodal tokens from both interlocutors
- 🌊 Diffusion-based 3D head decoding for expressive and stochastic motion
- 📉 15–30% error reduction over strong baselines (e.g., DualTalk)
Human conversation is a continuous exchange of speech and nonverbal cues—including head nods, gaze shifts, and subtle expressions.
Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.
We propose a causal, turn-level framework for interactive 3D conversational head generation.
Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.
TIMAR (Turn-level Interleaved Masked AutoRegression) is the core method proposed in this work.
- Represent conversation as interleaved audio–visual tokens:
- 👤 User speech + user head motion
- 🤖 Agent speech + agent head motion
- Perform:
- 🔁 Bidirectional fusion within each turn (intra-turn alignment)
- ⏱️ Strictly causal reasoning across turns (inter-turn dependency)
This design mirrors how humans coordinate speaking and listening over time.
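The intra-turn/inter-turn attention pattern above can be sketched as a boolean mask: a token attends bidirectionally to all tokens of its own turn and causally to tokens of earlier turns, never to future turns. This is a minimal illustration assuming integer turn indices per token; the function name and shapes are ours, not from the released code.

```python
import numpy as np

def turn_level_causal_mask(turn_ids):
    """Boolean attention mask: True where attention is allowed.

    Token i may attend to token j iff j belongs to the same turn as i
    (bidirectional intra-turn) or to an earlier turn (causal inter-turn).
    """
    t = np.asarray(turn_ids)
    # allowed[i, j] is True iff token j's turn is not later than token i's
    return t[None, :] <= t[:, None]

# Example: two turns, each with three interleaved audio/motion tokens
mask = turn_level_causal_mask([0, 0, 0, 1, 1, 1])
```

Tokens of turn 0 see each other fully but never the turn-1 tokens, while turn-1 tokens see everything, which is the "no future leakage" property.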
Core components:
- 🧠 Turn-Level Causal Attention (TLCA)
- Bidirectional attention inside a turn
- Causal masking across turns (no future leakage)
- 🌊 Lightweight Diffusion Head
- Predicts continuous 3D head motion
- Captures expressive stochasticity beyond deterministic regression
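To make the diffusion-head idea concrete, here is a toy sketch of the forward-noising step such a head is trained against: clean motion codes are corrupted at a sampled noise level, and a predictor is fit to recover the noise. The linear schedule, dimensions, and placeholder predictor are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (assumed here; the paper's schedule is unspecified)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def noise_motion(x0, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

def noise_mse(eps_pred, eps):
    """Standard denoising objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

# Toy per-frame 3D head-motion codes: 8 frames, 64-dim each
x0 = rng.standard_normal((8, 64))
xt, eps = noise_motion(x0, t=50)
loss = noise_mse(np.zeros_like(eps), eps)  # placeholder zero predictor
```

Sampling from such a head (iteratively denoising from pure noise, conditioned on the turn-level context) is what gives stochastic, expressive motion instead of the single averaged output of a deterministic MLP regressor.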
We evaluate our framework on the interactive 3D conversational head benchmark, following the DualTalk protocol.
Results at a glance:
- ⬇️ 15–30% reduction in Fréchet Distance (FD) and MSE
- 📈 Improved expressiveness and synchronization (SID ↑)
- 🌍 Strong generalization on out-of-distribution conversations
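For readers unfamiliar with the FD metric above, the sketch below computes a Fréchet Distance between real and generated motion features under a diagonal-Gaussian simplification; the full metric uses complete covariance matrices and a matrix square root, and this helper is our illustration, not the benchmark's evaluation code.

```python
import numpy as np

def frechet_distance_diag(real, gen):
    """Fréchet Distance between two feature sets, diagonal-Gaussian case.

    real, gen: arrays of shape (num_samples, feature_dim). With diagonal
    covariances the metric reduces to ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    """
    mu_r, mu_g = real.mean(0), gen.mean(0)
    sd_r, sd_g = real.std(0), gen.std(0)
    return float(np.sum((mu_r - mu_g) ** 2) + np.sum((sd_r - sd_g) ** 2))

rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 16))
fd_self = frechet_distance_diag(feats, feats)         # identical sets
fd_shift = frechet_distance_diag(feats, feats + 1.0)  # mean shifted by 1
```

Identical distributions score zero; a constant shift of 1 in every one of the 16 dimensions contributes roughly 16 to the distance, so lower FD means generated motion statistics sit closer to the ground truth.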
| Demo | Preview |
|---|---|
| Demo 1 | demo_1.mp4 |
| Demo 2 | demo_2.mp4 |
| Demo 3 | demo_3.mp4 |
Notation
- Agent GT denotes the ground-truth 3D head motion.
- TIMAR Agent denotes our generated results.
- DualTalk Agent denotes the outputs from the DualTalk baseline.
TIMAR produces:
- Natural listening behavior when the agent is silent
- Context-aware reactions with longer conversational history
- Smoother and more stable 3D head motion
We analyze the contribution of each design choice:
- ❌ MLP head vs 🌊 diffusion-based head
- ❌ Full bidirectional attention vs ✅ turn-level causal attention
- ❌ Encoder–decoder vs ✅ encoder-only backbone
Each component is critical for causal coherence and generalization.
If you find this work useful, please consider citing:
@article{chen2025timar,
title={Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics},
author={Chen, Junjie and Wang, Fei and Huang, Zhihao and Zhou, Qing and Li, Kun and Guo, Dan and Zhang, Linfeng and Yang, Xun},
journal={arXiv preprint arXiv:2512.15340},
year={2025}
}

Our implementation benefits from the publicly available codebases of MAR and DualTalk. We thank the authors for releasing their repositories, which provided valuable references for implementation details and experimental protocols.
We further acknowledge the broader research community for foundational advances in autoregressive diffusion modeling, multimodal representation learning, and interactive 3D conversational head generation. While our framework introduces a distinct turn-level causal formulation and interleaved modeling strategy, it builds upon and extends these prior technical developments.



