A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control (ICLR 2025, Spotlight)
teaser.mp4
This repository is under construction; documentation for the following will be updated. If you encounter any problems, please do not hesitate to contact us.
- Setup, generation demos, and visualization
- Data preparation and training
- Evaluation
Set up the conda environment:
conda env create -f environment.yml
conda activate DART
Tested system:
Our experiments and performance profiling were conducted on a workstation with a single RTX 4090 GPU, an Intel i7-13700K CPU, and 64 GiB of memory, running Ubuntu 22.04.4 LTS.
- Please download this Google Drive link containing model checkpoints and necessary data, then extract and merge it into the project folder.
- Please download the following data from the respective websites and organize it as shown below:
- AMASS (only required for training; please download the gender-specific data for SMPL-H and SMPL-X)
- BABEL (only required for training)
- HumanML3D (only required for training)
- Project folder structure of separately downloaded data:
./
├── data
│   ├── smplx_lockedhead_20230207
│   │   └── models_lockedhead
│   │       ├── smplh
│   │       │   ├── SMPLH_FEMALE.pkl
│   │       │   └── SMPLH_MALE.pkl
│   │       └── smplx
│   │           ├── SMPLX_FEMALE.npz
│   │           ├── SMPLX_MALE.npz
│   │           └── SMPLX_NEUTRAL.npz
│   ├── amass
│   │   ├── babel-teach
│   │   │   ├── train.json
│   │   │   └── val.json
│   │   ├── smplh_g
│   │   │   ├── ACCAD
│   │   │   ├── BioMotionLab_NTroje
│   │   │   ├── BMLhandball
│   │   │   ├── BMLmovi
│   │   │   ├── CMU
│   │   │   ├── CNRS
│   │   │   ├── DanceDB
│   │   │   ├── DFaust_67
│   │   │   ├── EKUT
│   │   │   ├── Eyes_Japan_Dataset
│   │   │   ├── GRAB
│   │   │   ├── HUMAN4D
│   │   │   ├── HumanEva
│   │   │   ├── KIT
│   │   │   ├── MPI_HDM05
│   │   │   ├── MPI_Limits
│   │   │   ├── MPI_mosh
│   │   │   ├── SFU
│   │   │   ├── SOMA
│   │   │   ├── SSM_synced
│   │   │   ├── TCD_handMocap
│   │   │   ├── TotalCapture
│   │   │   ├── Transitions_mocap
│   │   │   └── WEIZMANN
│   │   └── smplx_g
│   │       ├── ACCAD
│   │       ├── BMLmovi
│   │       ├── BMLrub
│   │       ├── CMU
│   │       ├── CNRS
│   │       ├── DanceDB
│   │       ├── DFaust
│   │       ├── EKUT
│   │       ├── EyesJapanDataset
│   │       ├── GRAB
│   │       ├── HDM05
│   │       ├── HUMAN4D
│   │       ├── HumanEva
│   │       ├── KIT
│   │       ├── MoSh
│   │       ├── PosePrior
│   │       ├── SFU
│   │       ├── SOMA
│   │       ├── SSM
│   │       ├── TCDHands
│   │       ├── TotalCapture
│   │       ├── Transitions
│   │       └── WEIZMANN
│   ├── HumanML3D
│   │   ├── HumanML3D
│   │   │   ├── ...
│   │   └── index.csv
- We use `pyrender` for interactive visualization of generated motions by default. Please refer to the pyrender viewer for usage of the interactive viewer, such as rotating, panning, and zooming.
- The visualization script can render a generated sequence by specifying the `seq_path` argument. It also supports several optional features, such as multi-sequence visualization, interactive playback with keyboard frame forward/backward control, and an automatic body-following camera. More details of the configurable arguments can be found in the vis script.
- The script can be slow when visualizing multiple humans together. You can visualize only one human at a time by setting `--max_seq 1` on the command line, or use the Blender visualization described below, which is several times more efficient.
- We also support exporting the generated motions as `npz` files and visualizing them in Blender for advanced rendering. To import a motion sequence into Blender, please first install the SMPL-X Blender Add-on and use the "add animation" feature as shown in this video. You can use the space key to start/stop playing the animation in Blender.
Demonstration of importing motion into Blender:
import.mp4
We offer a range of motion generation demos, including online text-conditioned motion generation and applications with spatial constraints and goals. These applications include motion in-betweening, waypoint goal reaching, and human-scene interaction generation.
source ./demos/run_demo.sh
This will open an interactive viewer and a command-line interface for text input. You can input text prompts and the model will generate the corresponding motion sequence on the fly. The model is trained on the BABEL dataset, which describes motions using verbs or phrases. The action coverage in the dataset can be found here. A demonstration video is shown below:
0927.mp4
We offer a headless script for text-conditioned motion composition, enabling users to generate motions from a timeline of actions defined via text prompts.
The text prompt follows the format:
action_1*num_1,action_2*num_2,...,action_n*num_n
where:
- `action_x`: A text description of the action (e.g., "walk forward", "turn left").
- `num_x`: The duration of the action, measured in motion primitives (each primitive corresponds to 8 frames).
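For reference, a minimal sketch (not part of the repository) of how such a prompt string can be assembled and how primitive counts translate into duration, assuming the 8-frames-per-primitive convention above and the 30 fps BABEL frame rate noted later in this README:

```python
# Hypothetical helper, not from the repository: builds a prompt string in the
# action_1*num_1,action_2*num_2,... format and reports the resulting duration.
def build_prompt(timeline, fps=30, frames_per_primitive=8):
    prompt = ','.join(f'{action}*{num}' for action, num in timeline)
    total_frames = sum(num for _, num in timeline) * frames_per_primitive
    return prompt, total_frames / fps

prompt, seconds = build_prompt([('walk forward', 4), ('turn left', 2), ('sit down', 3)])
print(prompt)   # walk forward*4,turn left*2,sit down*3
print(seconds)  # 2.4 -> (4 + 2 + 3) primitives * 8 frames / 30 fps
```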
You can run the following command to generate example motions of walking in circles:
source ./demos/rollout.sh
We also provide additional example text prompts, which are commented out in this file. The output directory of the generated motions will be displayed in the command line. The generated motions can be visualized using the pyrender viewer as follows:
python -m visualize.vis_seq --add_floor 1 --translate_body 1 --seq_path './mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/rollout/walk_in_circles*20_guidance5.0_seed0/*.pkl'
We refer to the vis script for detailed visualization configuration. The output directory also contains the exported motion sequences as `npz` files for Blender visualization.
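If you want to inspect a generated sequence programmatically before visualizing it, the minimal sketch below may help. It assumes each `.pkl` file holds a dictionary and makes no assumption about the stored keys, simply printing whatever it finds:

```python
import glob
import pickle

# Path pattern copied from the visualization command above; adjust it to your own run.
pattern = ('./mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/rollout/'
           'walk_in_circles*20_guidance5.0_seed0/*.pkl')
for path in glob.glob(pattern):
    with open(path, 'rb') as f:
        seq = pickle.load(f)  # assumed to be a dict of arrays and metadata
    print(path)
    for key, value in seq.items():
        shape = getattr(value, 'shape', None)
        print(f'  {key}: {type(value).__name__} {shape if shape is not None else value}')
```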
We provide a script to generate motions between two keyframes conditioned on text prompts. The keyframes and the duration of the in-betweening are specified using a SMPL parameter sequence via `--optim_input`, while the text prompt is specified using `--text_prompt`.
The script offers two modes, selectable via the `--seed_type` argument: `repeat` and `history`. These modes are designed to handle scenarios where either a single start keyframe or multiple start keyframes are provided. When multiple start keyframes are available, we aim to ensure velocity consistency in addition to maintaining initial location consistency.
- Repeat mode: The first frame of the input sequence is the start keyframe and the last frame is the goal keyframe; the remaining frames are repeat padding of the first frame. The output sequence length equals the input sequence length.
- History mode: The first three frames of the input sequence serve as start keyframes to provide velocity context, and the last frame is the goal keyframe. The remaining frames can be filled using zero-padding or repeat-padding.
We show an example of in-betweening "pace in circles" between two keyframes:
source ./demos/inbetween_babel.sh
The generated sequences can be visualized using the commands below. The white bodies represent the keyframes for reference, while the colored bodies depict the generated results. To better assess goal keyframe reaching accuracy, you can enable interactive play mode by adding `--interactive 1` and pressing `a` to display only the last frame.
- Repeat mode:
python -m visualize.vis_seq --add_floor 1 --body_type smplx --seq_path './mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/optim/inbetween/repeatseed/scale0.1_floor0.0_jerk0.0_use_pred_joints_ddim10_pace_in_circles*15_guidance5.0_seed0/*.pkl'
- History mode:
python -m visualize.vis_seq --add_floor 1 --body_type smplx --seq_path './mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/optim/inbetween/historyseed/scale0.1_floor0.0_jerk0.0_use_pred_joints_ddim10_pace_in_circles*15_guidance5.0_seed0/*.pkl'
You can easily test custom in-betweening by customizing `--optim_input` and `--text_prompt`. The input SMPL sequence should include the attributes `gender`, `betas`, `transl`, `global_orient`, and `body_pose`. Example sequences can be found here.
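As a reference for building a custom `--optim_input`, here is a minimal sketch. It assumes a pickled dictionary of numpy arrays with the attributes listed above; the key names come from this README, but the file format, array shapes, and values are assumptions, so please consult the linked example sequences for the exact conventions:

```python
import pickle

import numpy as np

# Hypothetical repeat-mode input: first frame = start keyframe, last frame = goal
# keyframe, everything in between = repeat padding of the first frame.
num_frames = 15 * 8            # e.g. 15 motion primitives of 8 frames each
start_pose = np.zeros(63)      # axis-angle body pose of the start keyframe (21 joints x 3, assumed)
goal_pose = np.zeros(63)       # axis-angle body pose of the goal keyframe

seq = {
    'gender': 'male',
    'betas': np.zeros((num_frames, 10)),
    'transl': np.zeros((num_frames, 3)),
    'global_orient': np.zeros((num_frames, 3)),
    'body_pose': np.tile(start_pose, (num_frames, 1)),
}
seq['transl'][-1] = np.array([2.0, 0.0, 0.0])  # goal keyframe location
seq['body_pose'][-1] = goal_pose               # goal keyframe pose
# For history mode, fill the first three frames with consecutive start keyframes instead.

with open('./data/custom_inbetween_input.pkl', 'wb') as f:
    pickle.dump(seq, f)
```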
Using a model trained on the HML3D dataset:
In addition to in-betweening with the model trained on the BABEL dataset (as demonstrated above), we also provide a script for in-betweening using a model trained on the HML3D dataset [here](./demos/inbetween_hml.sh). While you can generally use the HML3D-trained model for **all optimization-based demos** below, please note the following:
- The text prompt style in HML3D differs from BABEL.
- HML3D assumes 20 fps motions, whereas BABEL uses 30 fps.
- When visualizing HML3D results with the visualization script, please add `--body_type smplh` to specify the body type, as HML3D utilizes SMPL-H bodies.
We provide a script to generate human-scene interaction motions. Given an input 3D scene and text prompts specifying the actions and durations, we control the human to reach the goal joint location starting from an initial pose while adhering to the scene contact and collision constraints. We show two examples, climbing down stairs and sitting on a chair, in the demo below:
source ./demos/scene.sh
The generated sequences can be visualized using:
python -m visualize.vis_seq --add_floor 0 --seq_path './mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/optim/sit_use_pred_joints_ddim10_guidance5.0_seed0_contact0.1_thresh0.0_collision0.1_jerk0.1/sample_*.pkl'
python -m visualize.vis_seq --add_floor 0 --seq_path './mld_denoiser/mld_fps_clip_repeat_euler/checkpoint_300000/optim/climb_down_use_pred_joints_ddim10_guidance5.0_seed0_contact0.1_thresh0.0_collision0.1_jerk0.1/sample_*.pkl'
To use a custom 3D scene, you need to first calculate the scene SDF for evaluating human-scene collision and contact constraints.
Please ensure the 3D scene is z-up and the floor plane has zero height.
We use mesh2sdf for SDF calculation, as shown in this script.
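As an illustrative stand-in for the linked script, the sketch below shows how such an SDF grid could be computed with the `mesh2sdf` package; the scene path, grid resolution, normalization, and output format are assumptions, so please follow the linked script for the exact conventions used by the demos:

```python
import mesh2sdf
import numpy as np
import trimesh

# Load a z-up scene mesh whose floor plane sits at zero height (see the note above).
mesh = trimesh.load('./data/my_scene.ply', force='mesh')

# mesh2sdf expects geometry roughly inside [-1, 1]^3, so normalize the vertices and
# keep the transform to map SDF grid coordinates back to the world frame later.
bounds = mesh.bounds                 # (2, 3) array of min/max corners
center = bounds.mean(axis=0)
scale = 2.0 / ((bounds[1] - bounds[0]).max() * 1.1)
vertices = (mesh.vertices - center) * scale

size = 128                           # grid resolution (an assumption, not the repository's setting)
sdf = mesh2sdf.compute(vertices, mesh.faces, size, fix=True, level=2.0 / size)

np.savez('./data/my_scene_sdf.npz', sdf=sdf, center=center, scale=scale, size=size)
```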
Example configuration files for an interaction sequence can be found here. We currently initialize the human with a standing pose, with its location and orientation determined by the pelvis, left hip, and right hip locations specified via `init_joints`. The goal joint locations are specified using `goal_joints`. The current script only uses the pelvis as the goal joint; you can modify the goal joints to be another joint or multiple joints.
You may also tune the optimization parameters to modulate the generation, such as increasing the learning rate to obtain more diverse results, adjusting the number of optimization steps to balance quality and speed, and adjusting the loss weights.
We train a motion control policy capable of reaching dynamic goal locations by leveraging locomotion skills specified through text prompts. The motion control policy is trained for three kinds of locomotion: walking, running, and hopping on the left leg. The control policy can generate more than 300 frames per second. We demonstrate how to define a sequence of waypoints to be reached in the cfg files. You can run the following command to generate example motions of walking to a sequence of goals:
source ./demos/goal_reach.sh
The results can be visualized as follows:
python -m visualize.vis_seq --add_floor 1 --seq_path './policy_train/reach_location_mld/fixtext_repeat_floor100_hop10_skate100/env_test/demo_walk_path0/0.pkl'
We provide a script to generate motions with sparse/dense joint trajectory control. Below we demonstrate some examples of controlling hand wrists and 2D pelvis trajectories. This script assumes the motion starts from a standing pose, so the specified joint trajectory needs to be feasible given that starting pose. To accommodate this, we set a tolerance period (1.5 seconds in the script) at the start of the sequence. During this period, no trajectory constraints are enforced, allowing sufficient time for the human to transition smoothly and feasibly from the standing pose to the controlled trajectory. You can run the following command to generate example motions:
source ./demos/traj.sh
The generated sequences can be visualized using the four commands below. The trajectories are visualized as a sequence of spheres, with colors transitioning from dark to red to represent relative time.
- In the punch example, there is a single trajectory point at 1.5 seconds.
- In the other three examples, trajectory points are distributed across a range from 1.5 to 6 seconds.
You can find the utility script for creating the example control trajectories here. This script includes definitions for the frame index and location of each control trajectory point, and the index of the joint to be controlled.
python -m visualize.vis_seq --add_floor 1 --translate_body 1 --vis_joint 1 --seq_path './data/traj_test/dense_frame180_walk_circle/mld_optim_global/floor1.0_skate1.0_jerk0.0_use_pred_joints_init1.0_ddim10_guidance5.0_seed0_lr0.05_steps100/*.pkl'
python -m visualize.vis_seq --add_floor 1 --translate_body 1 --vis_joint 1 --seq_path './data/traj_test/sparse_frame180_walk_square/mld_optim_global/floor1.0_skate1.0_jerk0.0_use_pred_joints_init1.0_ddim10_guidance5.0_seed0_lr0.05_steps100/*.pkl'
python -m visualize.vis_seq --add_floor 1 --translate_body 1 --vis_joint 1 --seq_path './data/traj_test/dense_frame180_wave_right_hand_circle/mld_optim_global/floor1.0_skate1.0_jerk0.0_use_pred_joints_init1.0_ddim10_guidance5.0_seed0_lr0.05_steps100/*.pkl'
python -m visualize.vis_seq --add_floor 1 --translate_body 1 --vis_joint 1 --seq_path './data/traj_test/sparse_punch/mld_optim_global/floor1.0_skate1.0_jerk0.0_use_pred_joints_init1.0_ddim10_guidance5.0_seed0_lr0.05_steps100/*.pkl'
You can test with custom trajectories by setting `--input_path` to your custom control trajectories.
If you have ground-truth initial bodies and joint trajectories from a dataset, you can modify the script to use initial bodies from the dataset instead of the rest standing pose, similar to the in-betweening script.
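Purely as an illustration of the information a control trajectory needs to carry (frame indices, target locations, and the controlled joint index), here is a hypothetical sketch; the actual key names and file format expected by `--input_path` are defined in the utility script linked above:

```python
import pickle

import numpy as np

fps = 30  # BABEL-trained models assume 30 fps motions
control = {
    'joint_idx': 0,                                          # e.g. pelvis (placeholder name and value)
    'frame_idx': np.array([int(t * fps) for t in (1.5, 3.0, 4.5, 6.0)]),
    'locations': np.array([[0.0, 0.0, 0.9],                  # xyz target per control frame
                           [1.0, 0.0, 0.9],
                           [1.0, 1.0, 0.9],
                           [0.0, 1.0, 0.9]]),
}
with open('./data/traj_test/custom_traj.pkl', 'wb') as f:
    pickle.dump(control, f)
```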
- Below we provide the documentation for data processing and model training using different data sources. By default, we provide commands for training on the BABEL dataset. Instructions for training on the HML3D dataset are available in the collapsible section. Additionally, guidance is provided for training on a custom motion dataset with text annotations.
- We use wandb for training logging. You may need to set up your own wandb account and log in before running the training scripts.
- Our training process includes stochastic factors such as random data sampling, scheduled training, and reinforcement-learning-based policy training. As a result, different behaviors may occur when training in different environments.
- You can test the trained models by changing the model path in the demo scripts.
- Please first download the BABEL and AMASS SMPL-X gendered datasets and structure the folder as in the data setup section.
- Please execute the following command to preprocess the BABEL dataset and extract the motion-text data.
- For details of the data preprocessing, you can check the collapsed section on training with a custom dataset below.
python -m data_scripts.extract_dataset
python -m mld.train_mvae --track 1 --exp_name 'mvae_babel_smplx' --data_args.dataset 'mp_seq_v2' --data_args.data_dir './data/seq_data_zero_male' --data_args.cfg_path './config_files/config_hydra/motion_primitive/mp_h2_f8_r8.yaml' --data_args.weight_scheme 'text_samp:0.' --train_args.batch_size 128 --train_args.weight_kl 1e-6 --train_args.stage1_steps 100000 --train_args.stage2_steps 50000 --train_args.stage3_steps 50000 --train_args.save_interval 50000 --train_args.weight_smpl_joints_rec 10.0 --train_args.weight_joints_consistency 10.0 --train_args.weight_transl_delta 100 --train_args.weight_joints_delta 100 --train_args.weight_orient_delta 100 --model_args.arch 'all_encoder' --train_args.ema_decay 0.999 --model_args.num_layers 7 --model_args.latent_dim 1 256
python -m mld.train_mld --track 1 --exp_name 'mld_babel_smplx' --train_args.batch_size 1024 --train_args.use_amp 1 --data_args.dataset 'mp_seq_v2' --data_args.data_dir './data/seq_data_zero_male' --data_args.cfg_path './config_files/config_hydra/motion_primitive/mp_h2_f8_r4.yaml' --denoiser_args.mvae_path './mvae/mvae_babel_smplx/checkpoint_200000.pt' --denoiser_args.train_rollout_type 'full' --denoiser_args.train_rollout_history 'rollout' --train_args.stage1_steps 100000 --train_args.stage2_steps 100000 --train_args.stage3_steps 100000 --train_args.save_interval 100000 --train_args.weight_latent_rec 1.0 --train_args.weight_feature_rec 1.0 --train_args.weight_smpl_joints_rec 0 --train_args.weight_joints_consistency 0 --train_args.weight_transl_delta 1e4 --train_args.weight_joints_delta 1e4 --train_args.weight_orient_delta 1e4 --data_args.weight_scheme 'text_samp:0.' denoiser-args.model-args:denoiser-transformer-args
python -m control.train_reach_location_mld --track 1 --exp_name 'control_policy' --denoiser_checkpoint './mld_denoiser/mld_fps_clip_euler/checkpoint_300000.pt' --total_timesteps 200000000 --env_args.export_interval 1000 --env_args.num_envs 256 --env_args.num_steps 32 --minibatch_size 1024 --update_epochs 10 --learning_rate 3e-4 --max_grad_norm 0.1 --env_args.texts 'walk' 'run' 'hop on left leg' --env_args.success_threshold 0.3 --env_args.weight_success 1.0 --env_args.weight_dist 1.0 --env_args.weight_foot_floor 100.0 --env_args.weight_skate 100.0 --env_args.weight_orient 0.1 --policy_args.min_log_std -1.0 --policy_args.max_log_std 1.0 --policy_args.latent_dim 512 --env_args.goal_dist_max_init 5.0 --env_args.goal_schedule_interval 50000 --policy_args.use_lora 0 --policy_args.lora_rank 16 --policy_args.n_blocks 2 --policy_args.use_tanh_scale 1 --policy_args.use_zero_init 1 --init_data_path './data/stand.pkl' --env_args.weight_rotation 10.0 --env_args.weight_delta 0.0 --env_args.obs_goal_angle_clip 60.0 --env_args.obs_goal_dist_clip 5.0 --env_args.use_predicted_joints 1 --env_args.goal_angle_init 120.0 --env_args.goal_angle_delta 0.0
Train with the HML3D dataset:
Please first download the HML3D and AMASS SMPL-H gendered datasets and structure the folder as in the data setup section.
python -m data_scripts.extract_dataset_hml3d_smplh
python -m mld.train_mvae --track 1 --exp_name 'mvae_hml3d_smplh' --data_args.body_type 'smplh' --data_args.dataset 'hml3d' --data_args.data_dir './data/hml3d_smplh/seq_data_zero_male/' --data_args.cfg_path './config_files/config_hydra/motion_primitive/hml_mp_h2_f8_r4.yaml' --data_args.weight_scheme 'uniform' --train_args.batch_size 128 --train_args.weight_kl 1e-6 --train_args.stage1_steps 100000 --train_args.stage2_steps 50000 --train_args.stage3_steps 50000 --train_args.save_interval 50000 --train_args.weight_smpl_joints_rec 10.0 --train_args.weight_joints_consistency 10.0 --train_args.weight_transl_delta 100 --train_args.weight_joints_delta 100 --train_args.weight_orient_delta 100 --model_args.arch 'all_encoder' --train_args.ema_decay 0.999 --model_args.num_layers 7 --model_args.latent_dim 1 256
python -m mld.train_mld --track 1 --exp_name 'mld_hml3d_smplh' --train_args.batch_size 1024 --train_args.use_amp 1 --data_args.body_type 'smplh' --data_args.dataset 'hml3d' --data_args.data_dir './data/hml3d_smplh/seq_data_zero_male/' --data_args.cfg_path './config_files/config_hydra/motion_primitive/hml_mp_h2_f8_r4.yaml' --denoiser_args.mvae_path './mvae/mvae_hml3d_smplh/checkpoint_200000.pt' --denoiser_args.train_rollout_type 'full' --denoiser_args.train_rollout_history 'rollout' --train_args.stage1_steps 100000 --train_args.stage2_steps 100000 --train_args.stage3_steps 100000 --train_args.save_interval 100000 --train_args.weight_latent_rec 1.0 --train_args.weight_feature_rec 1.0 --train_args.weight_smpl_joints_rec 0 --train_args.weight_joints_consistency 0 --train_args.weight_transl_delta 1e4 --train_args.weight_joints_delta 1e4 --train_args.weight_orient_delta 1e4 --data_args.weight_scheme 'uniform' denoiser-args.model-args:denoiser-transformer-args
Train using a custom dataset:
- Our model can be trained on a custom motion dataset with text annotations. We expect the motion data to be sequences of SMPL-X/H parameters. We structure the text annotations of each sequence according to the BABEL annotation format. In this format, a sequence can have an arbitrary number of segment text labels. Each segment is defined by a start time (`start_t`) and an end time (`end_t`), both measured in seconds. The text annotation of each segment is stored under the key `proc_label`. Segments can overlap, and a segment can also span the whole sequence, as in the HML3D dataset. Please check the data preprocessing scripts for BABEL and HML3D for details; a hypothetical annotation example is sketched after this list.
- Please export the dataset to a separate folder and recalculate the mean and std statistics of the motion features using the custom dataset.
- You can specify the data source when training the motion primitive VAE or the latent diffusion model using `--data_args.data_dir`, and the body type using `--data_args.body_type`. The configurations of the motion primitive length, the maximum rollout number in scheduled training, and the FPS of the motion data are set in the cfg files.
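To make the expected annotation layout concrete, here is a hypothetical BABEL-style annotation entry for one custom sequence. Only `start_t`, `end_t`, and `proc_label` are taken from the description above; the remaining field names are illustrative, so please follow the linked preprocessing scripts for the exact structure:

```python
# Hypothetical annotation for one custom sequence (illustrative field names except
# start_t, end_t, and proc_label).
annotation = {
    'seq_name': 'my_capture_001',   # illustrative identifier
    'labels': [
        {'start_t': 0.0, 'end_t': 4.0, 'proc_label': 'walk forward'},
        {'start_t': 3.0, 'end_t': 7.5, 'proc_label': 'wave right hand'},            # segments may overlap
        {'start_t': 0.0, 'end_t': 12.0, 'proc_label': 'a person walks and waves'},  # whole-sequence label, HML3D style
    ],
}
```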
The evaluation for text-conditioned temporal motion composition is based on the FlowMDM code. Please first set up the FlowMDM dependencies as follows:
- Set up the required dependencies:
source ./FlowMDM/setup.sh
- Download the processed BABEL dataset for evaluation:
After setting up the dependencies, you can run the evaluation using the following command. The FlowMDM generation part may take around 1 day.
source ./evaluation/eval_gen_composition.sh
- The evaluation results of FlowMDM will be saved at `./FlowMDM/results/babel/FlowMDM/evaluations_summary/001300000_fast_10_transLen30babel_random_seed0.json`.
- The evaluation results of DART will be saved at `./FlowMDM/results/babel/Motion_FlowMDM_001300000_gscale1.5_fastbabel_random_seed0_s10/mld_fps_clip_repeat_euler_checkpoint_300000_guidance5.0_seed0/evaluations_summary/fast_10_transLen30babel_random_seed0.json`.
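If you want to compare the two result files side by side, the small sketch below loads both summaries; it makes no assumption about the metric names inside the JSON files and simply prints whatever keys are stored:

```python
import json

# Paths copied from the notes above.
paths = [
    './FlowMDM/results/babel/FlowMDM/evaluations_summary/001300000_fast_10_transLen30babel_random_seed0.json',
    './FlowMDM/results/babel/Motion_FlowMDM_001300000_gscale1.5_fastbabel_random_seed0_s10/'
    'mld_fps_clip_repeat_euler_checkpoint_300000_guidance5.0_seed0/evaluations_summary/'
    'fast_10_transLen30babel_random_seed0.json',
]
for path in paths:
    with open(path) as f:
        summary = json.load(f)
    print(path)
    for key, value in summary.items():
        print(f'  {key}: {value}')
```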
The generation and evaluation can be executed with the command below. The results will be displayed in the command line, and the file save path will also be indicated there.
source ./evaluation/eval_gen_inbetween.sh
The generation and evaluation can be executed with the command below. The results will be displayed in the command line, and the file save path will also be indicated there.
source ./evaluation/eval_gen_goal_reach.sh
Our code is built upon many prior projects, including but not limited to:
DNO, MDM, MLD, FlowMDM, text-to-motion, guided-diffusion, ACTOR, DIMOS
@inproceedings{Zhao:DartControl:2025,
title = {{DartControl}: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control},
author = {Zhao, Kaifeng and Li, Gen and Tang, Siyu},
booktitle = {The Thirteenth International Conference on Learning Representations (ICLR)},
year = {2025}
}
If you run into any problems or have any questions, feel free to contact Kaifeng Zhao or create an issue.