PyTorch implementations of selected VGG models, useful for feature extraction from face images and speech waveforms.
| Model | Modality | Checkpoint | Checkpoint notes | Other comments |
|---|---|---|---|---|
| VGG-M [1] | Face | Download | Face recognition model trained on VGG-Face [1]. | The model's code is adapted from the official implementation. I have added some functions to facilitate pre-processing and feature extraction. |
| VGG-M [2] | Voice | Download | Speaker recognition model trained on VoxCeleb1 [2]. | The original code is in Matlab; I re-implemented it here in PyTorch. |
| VGGVoxResNet [3] | Voice | Download | Speaker verification model trained on VoxCeleb2 [3]. | The original code is in Matlab; I re-implemented it here in PyTorch. |
| SVHF [4] | Face + Voice | Download | Static binary voice-to-face matching model trained on VoxCeleb2 [3] [4] | The original code, for static binary voice-to-face matching, is in Matlab. I re-implemented it here in PyTorch and added functionality for face-to-voice matching, as well as dynamic images. |
Prerequisites:
- NumPy https://numpy.org/
- PyTorch https://pytorch.org/
- OpenCV https://pypi.org/project/opencv-python/
- SciPy https://scipy.org/
- Librosa https://librosa.org/doc/latest/index.html
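The prerequisites can be installed with pip. A minimal sketch, with no version pins; the package names follow the project pages linked above:

```bash
pip install numpy torch opencv-python scipy librosa
```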
To install from source, run the following:

```bash
pip install -e .
```
Face feature extraction:

```python
import torch

from vgg_pytorch.preprocessing import preprocess_image
from vgg_pytorch.models import VGG_M_face_bn_dag

device = "cuda:0"
path_to_checkpoint = "<insert path to pre-trained model checkpoint>"
path_to_image = "<insert path to image>"

# Load model
model = VGG_M_face_bn_dag()
model.load_state_dict(torch.load(path_to_checkpoint, map_location=device))
model.to(device)
model.eval()

# Pre-process input image
x = preprocess_image(path_to_image)

# Extract features
with torch.no_grad():
    z = model.extract_features(x.unsqueeze(0).to(device))
```
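The embeddings returned by `extract_features` can be compared directly, e.g. for face verification. Below is a minimal sketch, assuming the setup above; the cosine-similarity scoring is an illustration, not part of this repository's API:

```python
import torch
import torch.nn.functional as F

# Hypothetical example: score two face images by cosine similarity of their
# embeddings. `model`, `preprocess_image`, and `device` are assumed to be set
# up as in the snippet above.
x1 = preprocess_image("<insert path to first image>")
x2 = preprocess_image("<insert path to second image>")

with torch.no_grad():
    z1 = model.extract_features(x1.unsqueeze(0).to(device))
    z2 = model.extract_features(x2.unsqueeze(0).to(device))

# Flatten in case the features are spatial maps rather than vectors.
score = F.cosine_similarity(z1.flatten(1), z2.flatten(1)).item()
print(f"cosine similarity: {score:.3f}")  # higher suggests the same identity
```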
Voice feature extraction:

```python
import torch

from vgg_pytorch.preprocessing import preprocess_audio
from vgg_pytorch.models import VGGMVox, VGGVoxResNet

device = "cuda:0"
path_to_checkpoint = "<insert path to pre-trained model checkpoint>"
path_to_audio = "<insert path to an audio file>"

# Load model (VGGMVox is loaded the same way)
model = VGGVoxResNet()
model.load_state_dict(torch.load(path_to_checkpoint, map_location=device))
model.to(device)
model.eval()

# Pre-process input audio
x = preprocess_audio(path_to_audio)

# Extract features
with torch.no_grad():
    z = model.extract_features(x.unsqueeze(0).to(device))
```
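The same pattern extends to batches of utterances. A minimal sketch, assuming a flat directory of `.wav` files; the directory layout and the dictionary of embeddings are illustrative assumptions, not part of the repository:

```python
import glob
import torch

# Hypothetical example: embed every .wav file in a directory.
# `model`, `preprocess_audio`, and `device` are assumed to be set up as above.
embeddings = {}
for path in sorted(glob.glob("<insert path to audio directory>/*.wav")):
    x = preprocess_audio(path)
    with torch.no_grad():
        embeddings[path] = model.extract_features(x.unsqueeze(0).to(device)).cpu()
```

Pairs of speaker embeddings can then be compared with the same cosine-similarity scoring sketched in the face example above.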
Cross-modal face and voice feature extraction:

```python
import torch

from vgg_pytorch.preprocessing import preprocess_image, preprocess_audio
from vgg_pytorch.models import SVHF

device = "cuda:0"
path_to_checkpoint = "<insert path to pre-trained model checkpoint>"
path_to_face = "<insert path to image>"
path_to_audio = "<insert path to an audio file>"

# Load model
model = SVHF()
model.load_state_dict(torch.load(path_to_checkpoint, map_location=device))
model.to(device)
model.eval()

# Pre-process inputs
x_f = preprocess_image(path_to_face)
x_a = preprocess_audio(path_to_audio)

# Extract features with the modality-specific sub-networks
with torch.no_grad():
    z_f = model.face_net(x_f.unsqueeze(0).to(device))
    z_a = model.voice_net(x_a.unsqueeze(0).to(device))
```
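SVHF was trained for static binary matching: given one voice and two candidate faces, decide which face belongs to the speaker. The sketch below only batches the sub-network embeddings shown above over several candidate faces; the candidate paths are placeholders, and how the embeddings are combined into a final decision depends on the model's matching head, which is not reproduced here:

```python
import torch

# Hypothetical example: embed one voice and several candidate faces.
# `model`, `preprocess_image`, `preprocess_audio`, `path_to_audio`, and
# `device` are assumed to be set up as in the snippet above.
candidate_faces = ["<insert path to image 1>", "<insert path to image 2>"]

with torch.no_grad():
    z_a = model.voice_net(preprocess_audio(path_to_audio).unsqueeze(0).to(device))
    z_f = torch.cat([
        model.face_net(preprocess_image(p).unsqueeze(0).to(device))
        for p in candidate_faces
    ])
# z_f now holds one embedding per candidate face, ready for whatever
# matching or ranking step is applied downstream.
```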
If you use this code, please cite the original publications of the authors:

```bibtex
@InProceedings{Parkhi15,
  author    = "Omkar M. Parkhi and Andrea Vedaldi and Andrew Zisserman",
  title     = "Deep Face Recognition",
  booktitle = "British Machine Vision Conference",
  year      = "2015",
}

@InProceedings{Nagrani17,
  author    = "Nagrani, A. and Chung, J.~S. and Zisserman, A.",
  title     = "VoxCeleb: a large-scale speaker identification dataset",
  booktitle = "INTERSPEECH",
  year      = "2017",
}

@InProceedings{Chung18,
  author    = "Chung, J.~S. and Nagrani, A. and Zisserman, A.",
  title     = "VoxCeleb2: Deep Speaker Recognition",
  booktitle = "INTERSPEECH",
  year      = "2018",
}

@InProceedings{Nagrani18a,
  author    = "Nagrani, A. and Albanie, S. and Zisserman, A.",
  title     = "Seeing Voices and Hearing Faces: Cross-modal biometric matching",
  booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
  year      = "2018",
}
```

Please also refer to the following official implementations:
- VGG-M face model: http://www.robots.ox.ac.uk/~albanie/models/pytorch-mcn/vgg_m_face_bn_dag.py
- VGGVox models in Matlab: https://github.com/a-nagrani/VGGVox/tree/master
- SVHF model in Matlab: https://github.com/a-nagrani/SVHF-Net
- Converting VGG Matlab models to PyTorch: https://github.com/albanie/pytorch-mcn
References:

[1] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, "Deep Face Recognition", British Machine Vision Conference, 2015.
[2] Arsha Nagrani, Joon S. Chung, Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset", Interspeech, 2017.
[3] Joon S. Chung, Arsha Nagrani, Andrew Zisserman, "VoxCeleb2: Deep Speaker Recognition", Interspeech, 2018.
[4] Arsha Nagrani, Samuel Albanie, Andrew Zisserman, "Seeing Voices and Hearing Faces: Cross-modal biometric matching", IEEE Conference on Computer Vision and Pattern Recognition, 2018.