Add support for data processing and gesture editing
mpikiran committed Jul 25, 2024
1 parent 3057d80 commit 5dfadf2
Showing 11 changed files with 175 additions and 21 deletions.
50 changes: 34 additions & 16 deletions README.md
@@ -51,7 +51,8 @@ This is a repository for **AMUSE**: Emotional Speech-driven 3D Body Animation vi

## News :triangular_flag_on_post:

- [2024/07/25] Data processing and gesture editing scripts are available.
- [2024/06/12] Code is available.
- [2024/02/27] AMUSE has been accepted for CVPR 2024! Working on code release.
- [2023/12/08] <a href="https://arxiv.org/abs/2312.04466">ArXiv</a> is available.

@@ -61,6 +62,19 @@ This is a repository for **AMUSE**: Emotional Speech-driven 3D Body Animation vi

### Main Repo Setup

The project has been tested with the following configuration:

- **Operating System**: Linux 5.14.0-1051-oem x86_64
- **GCC Version**: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
- **CUDA Version**: CUDA 11.3
- **Python Version**: Python 3.8.15
- **GPU Configuration**:
- **Audio Model**: NVIDIA A100-SXM4-80GB
- **Motion Model**: NVIDIA A100-SXM4-40GB, Tesla V100-32GB

**Note**: The audio model requires the larger GPU (the A100-SXM4-80GB listed above). Multi-GPU support is implemented for the audio model; however, it was not used in the final version.
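A quick environment check before running the setup below can save time; the following is a minimal sketch (it assumes PyTorch is already installed in the active environment):

```python
# Minimal environment check (optional); assumes PyTorch is installed.
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # tested with 3.8.15
print(f"PyTorch: {torch.__version__}")
print(f"CUDA   : {torch.version.cuda}")       # tested with 11.3
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device visible; the audio and motion models both expect a GPU.")
```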


```bash
git clone https://github.com/kiranchhatre/amuse.git
cd amuse/dm/utils/
@@ -147,17 +161,22 @@ Once the above setup is correctly done, you can execute the following:
python main.py --fn infer_gesture
```

- [x] **edit_gesture**
```bash
cd $AMUSEPATH/scripts
python main.py --fn edit_gesture
```
For extensive editing options, refer to the `process_loader` function in `infer_ldm.py` and experiment with the `emotion_control`, `style_transfer`, and `style_Xemo_transfer` configurations; a minimal configuration sketch follows this task list. Editing gestures directly from speech is challenging but offers intriguing possibilities; the task involves many combinations, and not all of them yield optimal results. Figures A.11 and A.12 in the supplementary material illustrate the inherent complexities and variations in this process.
Click the image below to watch the video on YouTube:
<div align="center">
<a href="https://youtu.be/48vw2NfWkJg" target="_blank">
<img src="https://img.youtube.com/vi/48vw2NfWkJg/maxresdefault.jpg" alt="Video Thumbnail">
</a>
</div>


- [x] **bvh2smplx_**
Convert BVH to SMPL-X using the provided BMAP presets from the AMUSE website download page. Note that this feature is highly experimental and not officially supported. Place the BVH file inside `$AMUSEPATH/data/beat-rawdata-eng/beat_rawdata_english/<<actor_id>>`, where `actor_id` is a number between 1 and 30. The converted file will be located in `$AMUSEPATH/viz_dump/smplx_conversions`.
```bash
cd $AMUSEPATH/scripts
python main.py --fn bvh2smplx_
@@ -168,20 +187,13 @@
<img width="50%" src="docs/static/BVH2SMPLX.gif">
</p>

- [x] **prepare_data**
Prepare data and create an LMDB file for training AMUSE. We provide the AMUSE-BEAT version on the project webpage. To train AMUSE on a custom dataset, you will need aligned motion and speech files; the motion data should be an animation NPZ file compatible with the SMPL-X format (a minimal format-check sketch follows this task list).
```bash
cd $AMUSEPATH/scripts
python main.py --fn prepare_data
```

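As noted under `prepare_data`, training on a custom dataset needs time-aligned motion and speech. The exact NPZ layout is defined by the dataloader, but a rough alignment check might look like the sketch below; the file names and the `poses`/`trans`/`mocap_frame_rate` keys are illustrative, not guaranteed by the codebase:

```python
# Hypothetical sanity check for a custom motion/speech pair; keys and paths are illustrative.
import numpy as np
import torchaudio

motion = np.load("custom_take.npz")              # SMPL-X style animation NPZ
poses, trans = motion["poses"], motion["trans"]  # per-frame pose and root translation
fps = float(motion["mocap_frame_rate"]) if "mocap_frame_rate" in motion else 30.0

wav, sr = torchaudio.load("custom_take.wav")     # paired speech recording
motion_sec = poses.shape[0] / fps
audio_sec = wav.shape[-1] / sr

print(f"motion: {poses.shape[0]} frames ({motion_sec:.2f}s), trans shape {trans.shape}")
print(f"speech: {audio_sec:.2f}s at {sr} Hz")
assert abs(motion_sec - audio_sec) < 0.5, "motion and speech should be time-aligned"
```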


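For the `edit_gesture` options referenced above, the test modes are toggled under `TRAIN_PARAM.test` in `scripts/overrides/edit_gesture.yaml` (added in this commit). Below is a minimal sketch of switching from the default two-audio demo to style transfer; it assumes PyYAML is available, and how `main.py` merges the override into the full config is not shown here:

```python
# Sketch: flip test modes in the edit_gesture override (assumes PyYAML is installed).
import yaml

cfg_path = "scripts/overrides/edit_gesture.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

test = cfg["TRAIN_PARAM"]["test"]
test["emotion_control_list"]["use"] = False   # disable the two-audio demo edit
test["style_transfer"]["use"] = True          # enable style transfer instead
test["style_transfer"]["actors"] = "[lu-lawrence]"
test["style_transfer"]["emotion"] = "[angry]"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

# Then run from $AMUSEPATH/scripts: python main.py --fn edit_gesture
```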
---

## Citation
@@ -200,6 +212,12 @@ Once the above setup is correctly done, you can execute the following:

<br/>

## Acknowledgments

We would like to extend our gratitude to the authors and contributors of the following open-source projects, whose work has significantly influenced and supported our implementation: [EVP](https://github.com/jixinya/EVP), [Motion Diffusion Model](https://github.com/GuyTevet/motion-diffusion-model), [Motion Latent Diffusion](https://github.com/ChenFengYe/motion-latent-diffusion), [AST](https://github.com/YuanGongND/ast), [ACTOR](https://github.com/Mathux/ACTOR), and [SMPL-X](https://github.com/vchoutas/smplx). We also wish to thank [SlimeVRX](https://github.com/SlimeVRX) for their collaboration on the development of the `bvh2smplx_` task. For a more detailed list of acknowledgments, please refer to our paper.

<br/>

## Contact

For any inquiries, please contact [amuse@tue.mpg.de](mailto:amuse@tue.mpg.de). Feel free to use this project and contribute to its improvement.
11 changes: 7 additions & 4 deletions scripts/main.py
@@ -113,14 +113,17 @@ def main(args):
trainer = trainer(config, device, train_loader, val_loader, model_path, tag, logger_cfg, model, debug=debug)
trainer.train_dtw_ast()

elif args.fn[0] in ["train_gesture", "infer_gesture", "prepare_data", "edit_gesture"]:

if args.fn[0] == "prepare_data": # Prepare LMDB dataloader
pass

if "ablation" in config["TRAIN_PARAM"]["wav_dtw_mfcc"]: audio_ablation = config['TRAIN_PARAM']['wav_dtw_mfcc']['ablation']
else: audio_ablation = None
latent_diffusion_dm = full_data.latent_diffusion_dm_v2(device, verbose=True, audio_ablation=audio_ablation)
import sys; sys.exit("AMUSE: LMDB data prepared!")

else: # Train gesture generation model

assert (args.fn[0] == "train_gesture" and not pretrained_infer) or (args.fn[0] in ["infer_gesture", "edit_gesture"] and pretrained_infer), f"Arg: {args.fn[0]} and pretrained_infer: {pretrained_infer} mismatch!"
smplx_data_training = config["TRAIN_PARAM"]["latent_diffusion"]["smplx_data"]
if not pretrained_infer:
assert smplx_data_training, "smplx_data must be True!"
69 changes: 69 additions & 0 deletions scripts/overrides/edit_gesture.yaml
@@ -0,0 +1,69 @@
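# Override configuration for the edit_gesture task (scripts/overrides/edit_gesture.yaml):
# pretrained_infer is enabled and `emotion_control_list` drives the two-audio demo edit;
# the other test modes (emotion_control, style_transfer, style_Xemo_transfer, audio_list) stay off by default.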
DATA_PARAM:
Bvh:
bvh2smplbvh: False
TRAIN_PARAM:
tag: latent_diffusion
motion_extractor:
use: False
tag:
task: train
metrics_only: False # False True
pretrained_infer: True
wav_dtw_mfcc:
noise: True
ablation: full
ablation_version: v1
frame_based_feats: True
diffusion:
lmdb_cache: BEAT-cache/2023-10-28_30F_fing_smplx_MOSH_full_v1_feat_based_300
latent_diffusion:
smplx_rep: 6D
pretrained_ast: wav_dtw_mfcc_20231022-044436_actors
pretrained_lpdm: LPDM_20231028-210758_actors_smplx
pretrained_prior_lpdm_e: best
pretrained_ldm_lpdm_e: best
smpl_viz_mode: BLENDER_EEVEE # BLENDER_EEVEE, CYCLES
half_body: True
train_upper_body: False
skip_trans: False
vtex_displacement: False
zero_trans: False
freeze_init_LoBody: False
motion_feature_extractor:
model:
test:
emotion_control_list:
use: True # False True
overwrite: "amusepp_all_tests/SUPMATMETRIC-LPDM_20231028-210758_actors_smplx"
actor: "miranda"
audios: "/home/kchhatre/Work1/code/amuse/viz_dump/test/e_speech"
renders: "/home/kchhatre/Work1/code/amuse/viz_dump/test/e_gesture"
emotion_control:
use: False # False True
overwrite: "amusepp_all_tests/SUPMATMETRIC-LPDM_20231028-210758_actors_smplx"
actor: "[wayne]" # wayne, scott, solomon, lawrence, stewart, sophie, miranda, kieks, zhao, lu, jorge, daiki, ayana, katya
content_emotion: "[neutral]"
take_element: "first" # first last random
style_transfer:
use: False # False True
overwrite:
actors: "[lu-lawrence]"
emotion: "[angry]"
style_Xemo_transfer:
use: False # False True
overwrite: "amusepp_all_tests/SUPMAT-styleXem-LPDM_20231028-210758_actors_smplx"
actors: "[scott-lu]"
emotion: "[happy-angry]"
audio_list:
use: False # False True
short_audio_list: False # False True
overwrite:
processed_audios: 15
vidlist: "vidlist.csv"
diff_only: False # False True
baselines:
run: False # False True
prepare_dm: False # False True
renders:
task: custom_renders # custom_renders YT_monologues_renders
subtask:
66 changes: 65 additions & 1 deletion scripts/trainer.py
@@ -27,7 +27,7 @@
from torch.autograd import Variable
from einops import rearrange, repeat
from pytorch3d import transforms as p3d_tfs
from moviepy.editor import VideoFileClip, concatenate_videoclips, clips_array

from dm.utils.bvh_utils import *
from dm.utils.wav_utils import *
@@ -167,6 +167,7 @@ def __init__(self, config, device, train_loader, val_loader=None, model_path=Non
if self.style_Xemo_transfer:
self.style_Xemo_transfer_actors = self.config["TRAIN_PARAM"]["test"]["style_Xemo_transfer"]["actors"]
self.style_Xemo_transfer_emotion = self.config["TRAIN_PARAM"]["test"]["style_Xemo_transfer"]["emotion"]
self.demo_emotion_control = self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["use"] if "emotion_control_list" in self.config["TRAIN_PARAM"]["test"].keys() else False
else:
# Latent Diffusion
cfg_name = self.config["TRAIN_PARAM"]["latent_diffusion"]["arch"]
@@ -1032,6 +1033,69 @@ def eval_prior_latdiff_forward_backward_v1(self, baseline, ldm_epoch, audio_list

print(f"END VISUALIZATION: EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']} =====>")
else: print(f"END EVALUATION METRICS ONLY: EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']} =====>")

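# Demo emotion-control edit: generate one gesture clip conditioned on the source audio's own
# emotion and one conditioned on the emotion taken from the target audio, render both clips,
# then stack them side by side into a single comparison video with ffmpeg.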
if self.demo_emotion_control:

actor = self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["actor"]
print(f"DEMO EMOTION CONTROL EDITS for {actor} =====>")
audios = list(Path(self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["audios"]).glob("*.wav"))
src_a, tgt_a = [x for x in audios if "_source" in x.stem][0], [x for x in audios if "_target" in x.stem][0]
target_path = Path(self.config["TRAIN_PARAM"]["test"]["emotion_control_list"]["renders"])

rst = []
src_a_pydub = AudioSegment.from_wav(str(src_a))
src_a_arr, _ = torchaudio.load(str(src_a))
src_a_arr = src_a_arr - src_a_arr.mean()
src_a_con, src_a_emo, src_a_sty = self.model.process_single_seq(src_a_arr, framerate=16000, baseline=baseline)

tgt_a_arr, _ = torchaudio.load(str(tgt_a))
tgt_a_arr = tgt_a_arr - tgt_a_arr.mean()
_, tgt_a_emo, _ = self.model.process_single_seq(tgt_a_arr, framerate=16000, baseline=baseline)

feats_rst = self.model.diffusion_backward(1, src_a_con, src_a_emo, src_a_sty) # Gesture generation
poses, trans = feats_rst["poses"], feats_rst["trans"]
poses = rearrange(poses, "b t j d -> b t (j d)")
feats_rst = torch.cat((poses,trans), dim=-1)
rst.append(
{
"feats": feats_rst,
"audio": src_a_pydub,
"info": f"Original {actor}"
})

feats_rst = self.model.diffusion_backward(1, src_a_con, tgt_a_emo, src_a_sty) # Gesture editing
poses, trans = feats_rst["poses"], feats_rst["trans"]
poses = rearrange(poses, "b t j d -> b t (j d)")
feats_rst = torch.cat((poses,trans), dim=-1)
rst.append(
{
"feats": feats_rst,
"audio": src_a_pydub,
"info": f"Emotion edited {actor}"
})

video_dump_r = target_path / f"Custom_audios_{self.stamp}_E{ldm_epoch}" / f"rep{rep_i}"
assert self.viz_type in ["CaMN"], "[LDM EVAL] Invalid viz type: [%s]" % self.viz_type
for i, sample_dict in enumerate(rst):
print(f"VISUALIZATION: LIST AUDIOS {i} =====>")
video_dump = video_dump_r / f"rst_{i}"
self.visualizer.animate_ldm_sample_v1(sample_dict, video_dump, self.smplx_data, self.skip_trans, without_txt=False)

videos = []
for i in range(2):
video_dump = video_dump_r / f"rst_{i}"
for video_file in video_dump.rglob("*_single_subject_video.mp4"):
videos.append(str(video_file))
combined_video_file = video_dump_r / "combined.mp4"
subprocess.call([
"ffmpeg", "-i", videos[0], "-i", videos[1],
"-filter_complex", "[0:v][1:v]hstack=inputs=2[v];[0:a]aresample=async=1[a]",
"-map", "[v]", "-map", "[a]",
"-c:v", "libx264", "-c:a", "aac",
str(combined_video_file)
])

print(f"END VISUALIZATION: DEMO EMOTION CONTROL {rep_i+1}/{self.config['TRAIN_PARAM']['test']['replication_times']}, see {combined_video_file} =====>")

def _dump_args(self):
if not self.debug:
Binary file added viz_dump/test/e_speech/9_miranda_source.wav
Binary file added viz_dump/test/e_speech/9_miranda_target.wav
