
Official repository of the paper "Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics"


CoderChen01/towards-seamless-interaction



🤖✨ Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen1,2 · Fei Wang1,2 · Zhihao Huang5,6 · Qing Zhou8 · Kun Li7
Dan Guo1 · Linfeng Zhang4 · Xun Yang3

1 Hefei University of Technology   ·   2 IAI, Hefei Comprehensive National Science Center
3 USTC   ·   4 SJTU   ·   5 TeleAI, China Telecom   ·   6 Northwestern Polytechnical University
7 United Arab Emirates University   ·   8 Anhui Polytechnic University


📌 Open-Source Roadmap

  • Core source code release
  • Pretrained checkpoints (CKPT)
  • Usage documentation and tutorials
  • Rendering tools
  • ...

🔥 Highlights

  • 🧠 Causal turn-level formulation for streaming conversational generation
  • 🔄 Unified talking & listening modeling within a single framework
  • 🎧🗣️ Interleaved multimodal tokens from both interlocutors
  • 🌊 Diffusion-based 3D head decoding for expressive and stochastic motion
  • 📉 15–30% error reduction over strong baselines (e.g., DualTalk)

🚀 Overview

Human conversation is a continuous exchange of speech and nonverbal cues—including head nods, gaze shifts, and subtle expressions.
Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.

We propose a causal, turn-level framework for interactive 3D conversational head generation.
Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.

[Figure: Framework overview]

🧩 Method: TIMAR

TIMAR (Turn-level Interleaved Masked AutoRegression) is the core method proposed in this work.

🧱 Key Idea

  • Represent conversation as interleaved audio–visual tokens:
    • 👤 User speech + user head motion
    • 🤖 Agent speech + agent head motion
  • Perform:
    • 🔁 Bidirectional fusion within each turn (intra-turn alignment)
    • ⏱️ Strictly causal reasoning across turns (inter-turn dependency)

This design mirrors how humans coordinate speaking and listening over time.
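
As a rough illustration of the interleaved layout described above, the sketch below flattens a conversation into one token sequence and records, for every token, which turn and which stream it belongs to. The modality order inside a turn, the fixed number of tokens per modality, and the names `MODALITIES` and `interleave_turns` are assumptions made for this example; they are not taken from the released code.

```python
import numpy as np

# Hypothetical stream labels; the order inside a turn is an assumption.
MODALITIES = ("user_speech", "user_motion", "agent_speech", "agent_motion")


def interleave_turns(num_turns, tokens_per_modality):
    """Return per-token turn ids and modality ids for a flattened conversation.

    Each turn contributes len(MODALITIES) * tokens_per_modality tokens,
    laid out stream by stream in the order given by MODALITIES.
    """
    turn_ids, modality_ids = [], []
    for t in range(num_turns):
        for m in range(len(MODALITIES)):
            turn_ids.extend([t] * tokens_per_modality)
            modality_ids.extend([m] * tokens_per_modality)
    return np.array(turn_ids), np.array(modality_ids)


turn_ids, modality_ids = interleave_turns(num_turns=2, tokens_per_modality=2)
print(turn_ids)      # [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
print(modality_ids)  # [0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3]
```

The `turn_ids` array is the only information needed to decide which tokens may see each other: tokens sharing a turn id are fused bidirectionally, while tokens with a larger turn id are hidden from earlier turns.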

⚙️ Architecture

[Figure: TIMAR architecture]

Core components:

  • 🧠 Turn-Level Causal Attention (TLCA)
    • Bidirectional attention inside a turn
    • Causal masking across turns (no future leakage)
  • 🌊 Lightweight Diffusion Head
    • Predicts continuous 3D head motion
    • Captures expressive stochasticity beyond deterministic regression
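
The TLCA masking rule listed above can be illustrated with a small NumPy sketch. This is not the repository's implementation: the function names `turn_level_causal_mask` and `masked_attention`, the single-head attention, and the per-token `turn_ids` input are assumptions made for illustration only.

```python
import numpy as np


def turn_level_causal_mask(turn_ids):
    """Boolean mask where allowed[i, j] is True if query i may attend to key j.

    Tokens in the same turn see each other (bidirectional), tokens in later
    turns see all earlier turns, and no token sees a future turn.
    """
    turn_ids = np.asarray(turn_ids)
    return turn_ids[:, None] >= turn_ids[None, :]


def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention restricted by `allowed`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    # Every row keeps at least its own token, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Two tokens in turn 0 followed by three tokens in turn 1.
mask = turn_level_causal_mask([0, 0, 1, 1, 1])
print(mask.astype(int))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 1 1]
#  [1 1 1 1 1]
#  [1 1 1 1 1]]
```

With this mask, changing any token of turn 1 leaves the attention output for turn 0 untouched, which is the "no future leakage" property.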

🧪 Experiments

We evaluate our framework on the interactive 3D conversational head benchmark, following the DualTalk protocol.

📊 Quantitative Results

[Figure: Quantitative results]

Results at a glance:

  • ⬇️ 15–30% reduction in Fréchet Distance (FD) and MSE
  • 📈 Improved expressiveness and synchronization (SID ↑)
  • 🌍 Strong generalization on out-of-distribution conversations
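
For readers unfamiliar with the two error metrics named above, the sketch below shows one common way to compute them. It assumes that FD is the usual Fréchet distance between Gaussians fitted to two sets of feature vectors. The feature space and exact evaluation code used by the DualTalk protocol are not specified here, so the helper names `mse` and `frechet_distance` and the toy data are hypothetical.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to the rows of x and y.

    x and y have shape (num_samples, feature_dim).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # For positive semi-definite covariances, the trace of sqrt(cov_x cov_y)
    # is the sum of square roots of the eigenvalues of cov_x @ cov_y.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))           # toy "ground-truth" features
shifted = real + np.array([1.0, 2.0, 0.0])  # same spread, shifted mean

print(mse([1.0, 2.0], [1.0, 4.0]))        # 2.0
print(frechet_distance(real, real))        # close to 0
print(frechet_distance(real, shifted))     # close to 5 (squared mean shift)
```

Lower is better for both metrics: MSE compares predictions frame by frame against the ground truth, while FD compares the overall distributions of generated and real motion.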

🎭 Qualitative Results

[Figure: Qualitative results]

Demo previews:

  • Demo 1: demo_1.mp4
  • Demo 2: demo_2.mp4
  • Demo 3: demo_3.mp4

Notation

  • Agent GT denotes the ground-truth 3D head motion.
  • TIMAR Agent denotes our generated results.
  • DualTalk Agent denotes the outputs from the DualTalk baseline.

TIMAR produces:

  • Natural listening behavior when the agent is silent
  • Context-aware reactions with longer conversational history
  • Smoother and more stable 3D head motion

🧩 Ablation Studies

[Figure: Ablation studies]

We analyze the contribution of each design choice:

  • ❌ MLP head vs 🌊 diffusion-based head
  • ❌ Full bidirectional attention vs ✅ turn-level causal attention
  • ❌ Encoder–decoder vs ✅ encoder-only backbone

Each component is critical for causal coherence and generalization.

📚 Citation

If you find this work useful, please consider citing:

@article{chen2025timar,
  title={Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics},
  author={Chen, Junjie and Wang, Fei and Huang, Zhihao and Zhou, Qing and Li, Kun and Guo, Dan and Zhang, Linfeng and Yang, Xun},
  journal={arXiv preprint arXiv:2512.15340},
  year={2025}
}

🙏 Acknowledgements

Our implementation benefits from the publicly available codebases of MAR and DualTalk. We thank the authors for releasing their repositories, which provided valuable references for implementation details and experimental protocols.

We further acknowledge the broader research community for foundational advances in autoregressive diffusion modeling, multimodal representation learning, and interactive 3D conversational head generation. While our framework introduces a distinct turn-level causal formulation and interleaved modeling strategy, it builds upon and extends these prior technical developments.

