Add support for data processing and gesture editing
mpikiran committed Jul 25, 2024
1 parent 3057d80 commit 5dfadf2
Showing 11 changed files with 175 additions and 21 deletions.
50 changes: 34 additions & 16 deletions README.md
@@ -51,7 +51,8 @@ This is a repository for **AMUSE**: Emotional Speech-driven 3D Body Animation vi

## News :triangular_flag_on_post:

- [2024/07/25] Data processing and gesture editing scripts are available.
- [2024/06/12] Code is available.
- [2024/02/27] AMUSE has been accepted for CVPR 2024! Working on code release.
- [2023/12/08] <a href="https://arxiv.org/abs/2312.04466">ArXiv</a> is available.

@@ -61,6 +62,19 @@ This is a repository for **AMUSE**: Emotional Speech-driven 3D Body Animation vi

### Main Repo Setup

The project has been tested with the following configuration:

- **Operating System**: Linux 5.14.0-1051-oem x86_64
- **GCC Version**: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
- **CUDA Version**: CUDA 11.3
- **Python Version**: Python 3.8.15
- **GPU Configuration**:
- **Audio Model**: NVIDIA A100-SXM4-80GB
- **Motion Model**: NVIDIA A100-SXM4-40GB, Tesla V100-32GB

**Note**: The audio model requires the larger GPU (the A100-SXM4-80GB listed above). Multi-GPU support is implemented for the audio model; however, it was not used in the final version.
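A quick environment check before running the setup below can save time; the following is a minimal sketch (it assumes PyTorch is already installed in the active environment):

```python
# Minimal environment check (optional); assumes PyTorch is installed.
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # tested with 3.8.15
print(f"PyTorch: {torch.__version__}")
print(f"CUDA   : {torch.version.cuda}")       # tested with 11.3
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device visible; the audio and motion models both expect a GPU.")
```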


```bash
git clone https://github.com/kiranchhatre/amuse.git
cd amuse/dm/utils/
@@ -147,17 +161,22 @@ Once the above setup is correctly done, you can execute the following:
python main.py --fn infer_gesture
```

- [x] **edit_gesture**
```bash
cd $AMUSEPATH/scripts
python main.py --fn edit_gesture
```
For extensive editing options, refer to the `process_loader` function in `infer_ldm.py` and experiment with the `emotion_control`, `style_transfer`, and `style_Xemo_transfer` configurations; a minimal configuration sketch follows this task list. Editing gestures directly from speech is challenging but offers intriguing possibilities; the task involves many combinations, and not all of them yield optimal results. Figures A.11 and A.12 in the supplementary material illustrate the inherent complexities and variations in this process.
Click the image below to watch the video on YouTube:
<div align="center">
<a href="https://youtu.be/48vw2NfWkJg" target="_blank">
<img src="https://img.youtube.com/vi/48vw2NfWkJg/maxresdefault.jpg" alt="Video Thumbnail">
</a>
</div>


- [x] **bvh2smplx_**
Convert BVH to SMPL-X using the provided BMAP presets from the AMUSE website download page. Note that this feature is highly experimental and not officially supported. Place the BVH file inside `$AMUSEPATH/data/beat-rawdata-eng/beat_rawdata_english/<<actor_id>>`, where `actor_id` is a number between 1 and 30. The converted file will be located in `$AMUSEPATH/viz_dump/smplx_conversions`.
```bash
cd $AMUSEPATH/scripts
python main.py --fn bvh2smplx_
@@ -168,20 +187,13 @@
<img width="50%" src="docs/static/BVH2SMPLX.gif">
</p>

- [x] **prepare_data**
Prepare data and create an LMDB file for training AMUSE. We provide the AMUSE-BEAT version on the project webpage. To train AMUSE on a custom dataset, you will need aligned motion and speech files; the motion data should be an animation NPZ file compatible with the SMPL-X format (a minimal format-check sketch follows this task list).
```bash
cd $AMUSEPATH/scripts
python main.py --fn prepare_data
```

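As noted under `prepare_data`, training on a custom dataset needs time-aligned motion and speech. The exact NPZ layout is defined by the dataloader, but a rough alignment check might look like the sketch below; the file names and the `poses`/`trans`/`mocap_frame_rate` keys are illustrative, not guaranteed by the codebase:

```python
# Hypothetical sanity check for a custom motion/speech pair; keys and paths are illustrative.
import numpy as np
import torchaudio

motion = np.load("custom_take.npz")              # SMPL-X style animation NPZ
poses, trans = motion["poses"], motion["trans"]  # per-frame pose and root translation
fps = float(motion["mocap_frame_rate"]) if "mocap_frame_rate" in motion else 30.0

wav, sr = torchaudio.load("custom_take.wav")     # paired speech recording
motion_sec = poses.shape[0] / fps
audio_sec = wav.shape[-1] / sr

print(f"motion: {poses.shape[0]} frames ({motion_sec:.2f}s), trans shape {trans.shape}")
print(f"speech: {audio_sec:.2f}s at {sr} Hz")
assert abs(motion_sec - audio_sec) < 0.5, "motion and speech should be time-aligned"
```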


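For the `edit_gesture` options referenced above, the test modes are toggled under `TRAIN_PARAM.test` in `scripts/overrides/edit_gesture.yaml` (added in this commit). Below is a minimal sketch of switching from the default two-audio demo to style transfer; it assumes PyYAML is available, and how `main.py` merges the override into the full config is not shown here:

```python
# Sketch: flip test modes in the edit_gesture override (assumes PyYAML is installed).
import yaml

cfg_path = "scripts/overrides/edit_gesture.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

test = cfg["TRAIN_PARAM"]["test"]
test["emotion_control_list"]["use"] = False   # disable the two-audio demo edit
test["style_transfer"]["use"] = True          # enable style transfer instead
test["style_transfer"]["actors"] = "[lu-lawrence]"
test["style_transfer"]["emotion"] = "[angry]"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

# Then run from $AMUSEPATH/scripts: python main.py --fn edit_gesture
```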
---

## Citation
@@ -200,6 +212,12 @@ Once the above setup is correctly done, you can execute the following:

<br/>

## Acknowledgments

We would like to extend our gratitude to the authors and contributors of the following open-source projects, whose work has significantly influenced and supported our implementation: [EVP](https://github.com/jixinya/EVP), [Motion Diffusion Model](https://github.com/GuyTevet/motion-diffusion-model), [Motion Latent Diffusion](https://github.com/ChenFengYe/motion-latent-diffusion), [AST](https://github.com/YuanGongND/ast), [ACTOR](https://github.com/Mathux/ACTOR), and [SMPL-X](https://github.com/vchoutas/smplx). We also wish to thank [SlimeVRX](https://github.com/SlimeVRX) for their collaboration on the development of the `bvh2smplx_` task. For a more detailed list of acknowledgments, please refer to our paper.

<br/>

## Contact

For any inquiries, please contact [amuse@tue.mpg.de](mailto:amuse@tue.mpg.de). Feel free to use this project and contribute to its improvement.
11 changes: 7 additions & 4 deletions scripts/main.py
@@ -113,14 +113,17 @@ def main(args):
trainer = trainer(config, device, train_loader, val_loader, model_path, tag, logger_cfg, model, debug=debug)
trainer.train_dtw_ast()

elif args.fn[0] in ["train_gesture", "infer_gesture", "prepare_data", "edit_gesture"]:

if args.fn[0] == "prepare_data": # Prepare LMDB dataloader
pass

if "ablation" in config["TRAIN_PARAM"]["wav_dtw_mfcc"]: audio_ablation = config['TRAIN_PARAM']['wav_dtw_mfcc']['ablation']
else: audio_ablation = None
latent_diffusion_dm = full_data.latent_diffusion_dm_v2(device, verbose=True, audio_ablation=audio_ablation)
import sys; sys.exit("AMUSE: LMDB data prepared!")

else: # Train gesture generation model

assert (args.fn[0] == "train_gesture" and not pretrained_infer) or (args.fn[0] in ["infer_gesture", "edit_gesture"] and pretrained_infer), f"Arg: {args.fn[0]} and pretrained_infer: {pretrained_infer} mismatch!"
smplx_data_training = config["TRAIN_PARAM"]["latent_diffusion"]["smplx_data"]
if not pretrained_infer:
assert smplx_data_training, "smplx_data must be True!"
69 changes: 69 additions & 0 deletions scripts/overrides/edit_gesture.yaml
@@ -0,0 +1,69 @@
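# Override configuration for the edit_gesture task (scripts/overrides/edit_gesture.yaml):
# pretrained_infer is enabled and `emotion_control_list` drives the two-audio demo edit;
# the other test modes (emotion_control, style_transfer, style_Xemo_transfer, audio_list) stay off by default.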
DATA_PARAM:
Bvh:
bvh2smplbvh: False
TRAIN_PARAM:
tag: latent_diffusion
motion_extractor:
use: False
tag:
task: train
metrics_only: False # False True
pretrained_infer: True
wav_dtw_mfcc:
noise: True
ablation: full
ablation_version: v1
frame_based_feats: True
diffusion:
lmdb_cache: BEAT-cache/2023-10-28_30F_fing_smplx_MOSH_full_v1_feat_based_300
latent_diffusion:
smplx_rep: 6D
pretrained_ast: wav_dtw_mfcc_20231022-044436_actors
pretrained_lpdm: LPDM_20231028-210758_actors_smplx
pretrained_prior_lpdm_e: best
pretrained_ldm_lpdm_e: best
smpl_viz_mode: BLENDER_EEVEE # BLENDER_EEVEE, CYCLES
half_body: True
train_upper_body: False
skip_trans: False
vtex_displacement: False
zero_trans: False
freeze_init_LoBody: False
motion_feature_extractor:
model:
test:
emotion_control_list:
use: True # False True
overwrite: "amusepp_all_tests/SUPMATMETRIC-LPDM_20231028-210758_actors_smplx"
actor: "miranda"
audios: "/home/kchhatre/Work1/code/amuse/viz_dump/test/e_speech"
renders: "/home/kchhatre/Work1/code/amuse/viz_dump/test/e_gesture"
emotion_control:
use: False # False True
overwrite: "amusepp_all_tests/SUPMATMETRIC-LPDM_20231028-210758_actors_smplx"
actor: "[wayne]" # wayne, scott, solomon, lawrence, stewart, sophie, miranda, kieks, zhao, lu, jorge, daiki, ayana, katya
content_emotion: "[neutral]"
take_element: "first" # first last random
style_transfer:
use: False # False True
overwrite:
actors: "[lu-lawrence]"
emotion: "[angry]"
style_Xemo_transfer:
use: False # False True
overwrite: "amusepp_all_tests/SUPMAT-styleXem-LPDM_20231028-210758_actors_smplx"
actors: "[scott-lu]"
emotion: "[happy-angry]"
audio_list:
use: False # False True
short_audio_list: False # False True
overwrite:
processed_audios: 15
vidlist: "vidlist.csv"
diff_only: False # False True
baselines:
run: False # False True
prepare_dm: False # False True
renders:
task: custom_renders # custom_renders YT_monologues_renders
subtask:
66 changes: 65 additions & 1 deletion scripts/trainer.py
@@ -27,7 +27,7 @@
from torch.autograd import Variable
from einops import rearrange, repeat
from pytorch3d import transforms as p3d_tfs
from moviepy.editor import VideoFileClip, concatenate_videoclips, clips_array

from dm.utils.bvh_utils import *
from dm.utils.wav_utils import *
@@ -167,6 +167,7 @@ def __init__(self, config, device, train_loader, val_loader=None, model_path=Non
if self.style_Xemo_transfer:
self.style_Xemo_transfer_actors = self.config["TRAIN_PARAM"]["test"]["style_Xemo_transfer"]["actors"]
self.style_Xemo_transfer_emotion = self.config["TRAIN_PARAM"]["test"]["style_Xemo_transfer"]["emotion"]
self.demo_emotion_control = self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["use"] if "emotion_control_list" in self.config["TRAIN_PARAM"]["test"].keys() else False
else:
# Latent Diffusion
cfg_name = self.config["TRAIN_PARAM"]["latent_diffusion"]["arch"]
@@ -1032,6 +1033,69 @@ def eval_prior_latdiff_forward_backward_v1(self, baseline, ldm_epoch, audio_list

print(f"END VISUALIZATION: EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']} =====>")
else: print(f"END EVALUATION METRICS ONLY: EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']} =====>")

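# Demo emotion-control edit: generate one gesture clip conditioned on the source audio's own
# emotion and one conditioned on the emotion taken from the target audio, render both clips,
# then stack them side by side into a single comparison video with ffmpeg.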
if self.demo_emotion_control:

actor = self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["actor"]
print(f"DEMO EMOTION CONTROL EDITS for {actor} =====>")
audios = list(Path(self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["audios"]).glob("*.wav"))
src_a, tgt_a = [x for x in audios if "_source" in x.stem][0], [x for x in audios if "_target" in x.stem][0]
target_path = Path(self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["renders"])

rst = []
src_a_pydub = AudioSegment.from_wav(str(src_a))
src_a_arr, _ = torchaudio.load(str(src_a))
src_a_arr = src_a_arr - src_a_arr.mean()
src_a_con, src_a_emo, src_a_sty = self.model.process_single_seq(src_a_arr, framerate=16000, baseline=baseline)

tgt_a_arr, _ = torchaudio.load(str(tgt_a))
tgt_a_arr = tgt_a_arr - tgt_a_arr.mean()
_, tgt_a_emo, _ = self.model.process_single_seq(tgt_a_arr, framerate=16000, baseline=baseline)

feats_rst = self.model.diffusion_backward(1, src_a_con, src_a_emo, src_a_sty) # Gesture generation
poses, trans = feats_rst["poses"], feats_rst["trans"]
poses = rearrange(poses, "b t j d -> b t (j d)")
feats_rst = torch.cat((poses,trans), dim=-1)
rst.append(
{
"feats": feats_rst,
"audio": src_a_pydub,
"info": f"Original {actor}"
})

feats_rst = self.model.diffusion_backward(1, src_a_con, tgt_a_emo, src_a_sty) # Gesture editing
poses, trans = feats_rst["poses"], feats_rst["trans"]
poses = rearrange(poses, "b t j d -> b t (j d)")
feats_rst = torch.cat((poses,trans), dim=-1)
rst.append(
{
"feats": feats_rst,
"audio": src_a_pydub,
"info": f"Emotion edited {actor}"
})

video_dump_r = target_path / f"Custom_audios_{self.stamp}_E{ldm_epoch}" / f"rep{rep_i}"
assert self.viz_type in ["CaMN"], "[LDM EVAL] Invalid viz type: [%s]" % self.viz_type
for i, sample_dict in enumerate(rst):
print(f"VISUALIZATION: LIST AUDIOS {i} =====>")
video_dump = video_dump_r / f"rst_{i}"
self.visualizer.animate_ldm_sample_v1(sample_dict, video_dump, self.smplx_data, self.skip_trans, without_txt=False)

videos = []
for i in range(2):
video_dump = video_dump_r / f"rst_{i}"
for video_file in video_dump.rglob("*_single_subject_video.mp4"):
videos.append(str(video_file))
combined_video_file = video_dump_r / "combined.mp4"
subprocess.call([
"ffmpeg", "-i", videos[0], "-i", videos[1],
"-filter_complex", "[0:v][1:v]hstack=inputs=2[v];[0:a]aresample=async=1[a]",
"-map", "[v]", "-map", "[a]",
"-c:v", "libx264", "-c:a", "aac",
str(combined_video_file)
])

print(f"END VISUALIZATION: DEMO EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']}, see {combined_video_file} =====>")

def _dump_args(self):
if not self.debug:
Binary file added viz_dump/test/e_speech/9_miranda_source.wav
Binary file added viz_dump/test/e_speech/9_miranda_target.wav
