Official implementation of Synch-Transformer for synchronous motion captioning.
This work presents a Transformer-based design for motion-to-text synchronization, a task introduced in our previous project, m2t-segmentation.
Synchronous captioning aims to generate text aligned with the temporal evolution of 3D human motion. Implicitly, this mapping yields fine-grained action recognition and unsupervised event localization through temporal phrase grounding, obtained via unsupervised motion-language segmentation.
The animations below show synchronized outputs for several motions, mostly compositional samples containing two or more actions:
Our method introduces mechanisms to control the self- and cross-attention distributions of the Transformer, enabling interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to concentrate attention on the frames that contribute most to the generation of each motion word. These constraints prevent undesired mixing of information in the attention maps and encourage a monotonic attention distribution across tokens. The token cross-attentions are then used for progressive text generation in synchronization with the human motion sequence (an illustrative sketch of such losses follows the list below).
- Phrase-level
- Word-level
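
To make the idea of attention-structuring losses concrete, here is a minimal, illustrative PyTorch sketch. It is not the repository's implementation: `sparsity_loss` and `monotonicity_loss` are hypothetical helpers, and the assumed input is a cross-attention tensor of shape (batch, words, frames) whose rows sum to 1 over frames. The first term penalizes spread-out attention per word; the second penalizes backward jumps of the expected attended frame across consecutive words.

```python
# Illustrative sketch only (not the official implementation).
import torch
import torch.nn.functional as F

def sparsity_loss(cross_attn: torch.Tensor) -> torch.Tensor:
    """Encourage each word's cross-attention to concentrate on a few frames
    by minimizing the entropy of its distribution over frames.
    cross_attn: (batch, num_words, num_frames), rows sum to 1 over frames."""
    entropy = -(cross_attn * torch.log(cross_attn + 1e-8)).sum(dim=-1)
    return entropy.mean()

def monotonicity_loss(cross_attn: torch.Tensor) -> torch.Tensor:
    """Encourage attention centers to move forward in time as words are generated:
    penalize cases where the expected attended frame of word t+1 precedes that of word t."""
    num_frames = cross_attn.size(-1)
    positions = torch.arange(num_frames, dtype=cross_attn.dtype, device=cross_attn.device)
    centers = (cross_attn * positions).sum(dim=-1)             # (batch, num_words)
    backward_steps = F.relu(centers[:, :-1] - centers[:, 1:])  # > 0 when attention moves backward
    return backward_steps.mean()

# Example usage with a hypothetical captioning loss and illustrative weights:
# total_loss = caption_loss + 0.1 * sparsity_loss(attn) + 0.1 * monotonicity_loss(attn)
```
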
If you find this work useful in your research, please cite:
@article{radouane2024ControlledTransformer,
  title   = {Transformer with Controlled Attention for Synchronous Motion Captioning},
  author  = {Karim Radouane and Sylvie Ranwez and Julien Lagarde and Andon Tchechmedjiev},
  journal = {arXiv},
  year    = {2024}
}