Text-to-motion generation has advanced considerably, but current methods still struggle to produce realistic 3D human motions that capture both body movement and emotional expression. Most approaches concentrate on gross body motion and neglect body language and pose estimation. We propose DT3DPE (Dual-Transformer for 3D Pose Estimation), which incorporates pose estimation and body language to generate more realistic, text-aligned motions. The model combines a hierarchical quantization scheme with a dual-transformer architecture for motion prediction and refinement. Experiments show that DT3DPE outperforms existing methods on the HumanML3D and KIT-ML datasets.
The DT3DPE framework works as follows: (a) The input text describes an action, such as a person walking forward. (b) A movement residual is generated based on the input text. (c) The masked transformer and residual transformer process this residual to produce motion tokens and refine the motion details. (d) The final output is a detailed and coherent animation that accurately reflects the described action.
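As a rough illustration of steps (b)–(d), the sketch below traces the two-stage token generation in code. All module names (`TextEncoder`, `MaskedTransformer`, `ResidualTransformer`, `MotionDecoder`), call signatures, and tensor shapes are illustrative assumptions, not the actual API of this repository.

```python
# Minimal sketch of the DT3DPE generation flow described above.
# Module names, methods, and shapes are assumptions for illustration only.
import torch

def generate_motion(text, text_encoder, masked_transformer,
                    residual_transformer, motion_decoder, num_frames=196):
    # (a) Encode the textual action description into a conditioning vector.
    cond = text_encoder(text)                                          # [1, d_text]

    # (b)+(c) The masked transformer predicts the base (first-layer) motion
    # tokens for every frame, conditioned on the text.
    base_tokens = masked_transformer.generate(cond, length=num_frames)  # [1, T]

    # (c) The residual transformer predicts the remaining quantization layers,
    # refining the coarse motion with finer detail.
    residual_tokens = residual_transformer.generate(base_tokens, cond)  # [1, T, Q-1]

    # (d) Decode the full token stack back into a 3D joint sequence.
    tokens = torch.cat([base_tokens.unsqueeze(-1), residual_tokens], dim=-1)
    motion = motion_decoder(tokens)                                      # [1, T, J, 3]
    return motion
```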
conda create python=3.9 --name DT3DPE
conda activate DT3DPE
Install the requirements
pip install -r requirements.txt
Download Pre-trained Models, Evaluation Models, and GloVe Word Embeddings
bash prepare/download_models.sh
bash prepare/download_evaluator.sh
bash prepare/download_glove.sh
Output from a single prompt
python t2m_animation_generator.py --gpu_id 1 --output output1 --text "A person performs jumping jacks."
Output from a text file
python t2m_animation_generator.py --gpu_id 1 --output output2 --text_path ./assets/textfile.txt
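The prompt file is assumed to contain one textual description per line (this format is an assumption; check `./assets/textfile.txt` for the actual layout), for example:

```text
A person performs jumping jacks.
A person walks forward and then turns around.
A person sits down on a chair.
```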
Visualization
blender --background --python render.py -- --cfg=./configs/render.yaml --dir=/home/abbas/motiontext/outcomes/motiontext/HumanML3D/samples_2024-11-10-18-50-15/ --mode=video --joint_type=HumanML3D
python -m fit --dir /home/abbas/motiontext/outcomes/motiontext/HumanML3D/samples_2024-11-10-18-50-15/ --save_folder /home/abbas/motiontext/outcomes/motiontext/HumanML3D/samples_2024-11-10-18-50-15/tamp --cuda True
blender --background --python render.py -- --cfg=./configs/render.yaml --dir=/home/abbas/motiontext/results/motiontext/1222_PELearn_Diff_Latent1_MEncDec49_MdiffEnc49_bs64_clip_uncond75_01/samples_2024-10-18-22-15-14/ --mode=video --joint_type=HumanML3D
Qualitative results demonstrating DT3DPE's capability to synthesize human movement for pose estimation from textual descriptions.
We evaluated DT3DPE on the following benchmark datasets for text-driven human movement synthesis:
- HumanML3D: Combines HumanAct12 and AMASS datasets, featuring 14,616 movements and 44,970 text descriptions. It spans diverse actions like daily tasks, athletics, and performances, with clips totaling 28.59 hours. Each movement has 3-4 descriptive sentences. Dataset Link
- KIT-ML: Includes 3,911 movements with 6,278 text descriptions, linking human actions to natural language. It advances research on movement-language correlations with a focus on accessibility and clarity. Dataset Link
- **Train**
python vq_trainer.py --name rvq_name --gpu_id 1 --dataset_name t2m --batch_size 256 --num_quantizers 6 --max_epoch 50 --quantize_dropout_prob 0.2 --gamma 0.05
python train_t2m_mask.py --name mtrans_name --gpu_id 2 --dataset_name t2m --batch_size 64 --vq_name rvq_name
python train_t2m_res.py --name rtrans_name --gpu_id 2 --dataset_name t2m --batch_size 64 --vq_name rvq_name --cond_drop_prob 0.2 --share_weight
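The three commands train, in order, the residual VQ tokenizer, the masked transformer, and the residual transformer. To make the quantizer flags concrete, the sketch below shows residual (hierarchical) quantization with quantize dropout, corresponding to `--num_quantizers` and `--quantize_dropout_prob`; the function, names, and tensor shapes are assumptions for illustration, not the repository's implementation.

```python
# Illustrative sketch of residual (hierarchical) quantization with quantize
# dropout. Names and shapes are assumptions, not the actual training code.
import torch

def residual_quantize(latent, codebooks, dropout_prob=0.2, training=True):
    """latent: [B, T, D]; codebooks: list of [K, D] tensors (one per layer)."""
    residual = latent
    quantized = torch.zeros_like(latent)
    codes = []

    # With probability dropout_prob, keep only a random prefix of the layers
    # during training, so the decoder learns to work from coarse codes alone.
    num_layers = len(codebooks)
    if training and torch.rand(()) < dropout_prob:
        num_layers = torch.randint(1, len(codebooks) + 1, ()).item()

    for codebook in codebooks[:num_layers]:
        # Nearest codebook entry for the current residual.
        dists = torch.cdist(residual.flatten(0, 1), codebook)   # [B*T, K]
        idx = dists.argmin(dim=-1)                               # [B*T]
        layer_q = codebook[idx].view_as(residual)
        quantized = quantized + layer_q
        residual = residual - layer_q                            # pass the remainder down
        codes.append(idx.view(latent.shape[:2]))                 # [B, T]

    return quantized, codes
```

Each layer quantizes what the previous layers failed to reconstruct, so early layers carry coarse motion and later layers add detail; randomly dropping the tail layers during training keeps the coarse codes usable on their own.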
- **Evaluation**
python vq_evaluator.py --gpu_id 1 --name rvq_nq6_dc512_nc512_noshare_qdp0.2 --dataset_name t2m --ext rvq_nq6
python vq_evaluator.py --gpu_id 1 --name rvq_nq6_dc512_nc512_noshare_qdp0.2_k --dataset_name kit --ext rvq_nq6