This is a talking-face project. We train on 576×576 facial images, and the resulting model can generate 2K, 4K, 6K, and 8K digital human videos.
We have made optimizations in the following areas:
- We use HuBERT for audio processing, which gives a significant improvement over wav2lip-96 and wav2lip-288.
- We optimized dataset processing, so videos no longer need to be manually cut into short, seconds-long clips.
- We optimized the network structure to extract features better. Our idea is to train the generator directly rather than train the discriminator separately.
- We trained the base model on a high-definition dataset of hundreds of people. Its generalization ability is not strong, but the results are very good after single-person or multi-person fine-tuning.
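
As a rough illustration of the HuBERT audio step, the sketch below computes how many feature frames a HuBERT base encoder emits for a waveform and which feature indices surround a given video frame. It assumes 16 kHz audio, the standard HuBERT base convolutional front end (total stride of 320 samples, about 50 features per second), and 25 fps video. The function names, the frame rate, and the context size are hypothetical and are not taken from this project's code.

```python
# (kernel, stride) of each conv layer in the HuBERT base feature extractor.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def hubert_num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


def audio_window(frame_idx: int, num_feats: int, fps: float = 25.0,
                 feat_rate: float = 50.0, context: int = 8) -> tuple:
    """Feature index range [start, end) centred on a video frame.

    fps, feat_rate and context are assumed values; the range is clamped
    to the available features, so it is shorter near the clip edges.
    """
    center = int(round(frame_idx * feat_rate / fps))
    start = max(0, center - context)
    end = min(num_feats, center + context)
    return start, end


if __name__ == "__main__":
    n = hubert_num_frames(16000)      # one second of 16 kHz audio
    print(n, audio_window(10, n))
```

Near the start and end of a clip the window is truncated; a real pipeline would probably pad it to a fixed length before feeding it to the generator.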
Video | Project Page | Code
This project is not yet mature. We will release the code gradually: first the data processing code, then the inference code, and, when the time is right, the training code.
The code is mainly borrowed from wav2lip, wav2lip-288, wav2lip-384, ER-NeRF, and others. We thank their authors for their wonderful work.
This project was made by Lu Rui of Langzizhixin Technology in Chengdu, China, 2024.
At present, the code for video preprocessing, facial cropping, and HuBERT audio processing is complete. Contributions of code related to the network structure, training, and inference are welcome.
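
To give contributors a feel for the facial cropping step, here is a minimal sketch of one way to turn a detected face box into a square crop region that can then be resized to 576×576. It is not the project's released code: the function name `square_crop_box`, the margin value, and the clamping strategy are all assumptions made for illustration.

```python
def square_crop_box(box, img_w, img_h, margin=0.25):
    """Expand a face box (x1, y1, x2, y2) into a square inside the image.

    margin is the assumed fractional enlargement of the longer box side.
    Returns integer (left, top, right, bottom) pixel coordinates.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    side = max(x2 - x1, y2 - y1) * (1.0 + margin)
    side = min(side, img_w, img_h)  # the square cannot exceed the image
    half = side / 2.0
    # Shift the centre so the whole square stays inside the image.
    cx = min(max(cx, half), img_w - half)
    cy = min(max(cy, half), img_h - half)
    left, top = int(round(cx - half)), int(round(cy - half))
    side_px = int(side)
    return left, top, left + side_px, top + side_px


if __name__ == "__main__":
    # Hypothetical detector output on a 1920x1080 frame.
    print(square_crop_box((100, 100, 300, 300), 1920, 1080, margin=0.5))
```

Keeping the crop square before resizing avoids distorting the face's aspect ratio. The resize to 576×576 itself would be done with an image library and is not shown here.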