The aim of this project is to train a state-of-the-art face recognizer using TensorFlow 2.0. The architecture is a modified version of ResNet50 and the loss function is ArcFace, both originally developed by deepinsight in mxnet.
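For reference, below is a minimal sketch of the ArcFace logit computation, not the exact code in this repo: embeddings and class-center weights are L2-normalized, an additive angular margin m is applied to each sample's target class, and the result is scaled by s before the softmax cross-entropy. Function and argument names are illustrative.

```python
import tensorflow as tf

def arcface_logits(embeddings, weights, labels, n_classes, s=64.0, m=0.5):
    # cosine similarity between embeddings and per-class weight vectors
    x = tf.nn.l2_normalize(embeddings, axis=1)
    w = tf.nn.l2_normalize(weights, axis=0)
    cos_t = tf.matmul(x, w)  # shape [batch, n_classes]
    # recover the angle, clipping to keep acos numerically safe
    theta = tf.acos(tf.clip_by_value(cos_t, -1.0 + 1e-7, 1.0 - 1e-7))
    # add the angular margin m only on each sample's ground-truth class
    one_hot = tf.one_hot(labels, depth=n_classes)
    logits = tf.where(one_hot > 0.0, tf.cos(theta + m), cos_t)
    # scale so the softmax gradients are well-behaved
    return logits * s
```

The returned logits then feed a standard softmax cross-entropy loss.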
The datasets used for training are the CASIA-Webface and MS1M-ArcFace datasets used in insightface, which are available in their dataset zoo. The images are aligned with mtcnn and cropped to 112x112.
The results of the training are evaluated on lfw, cfp_ff, cfp_fp and age_db30, using the same metrics as deepinsight.
The full training and evaluation code is provided, as well as some trained weights.
A Dockerfile is also provided with all prerequisites installed.
UPDATE Increased the single-GPU batch size by updating the gradients after several inference passes (96 on a TESLA P100); see the sketch after this list.
UPDATE Added Batch Renormalization and Group Normalization for training with smaller batches.
UPDATE Added a regularization loss decay coefficient to reduce the impact of the regularization loss as the inference loss becomes smaller.
UPDATE Added multi-GPU training code. It uses the experimental central storage strategy, which stores all the variables on the CPU and allows a larger batch size on each GPU (128 per TESLA P100).
UPDATE Added model C, trained with 3 TESLA P100 GPUs.
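The gradient-accumulation trick mentioned above can be sketched as follows. This is a simplified, self-contained illustration, not the repo's actual train loop; the stand-in model, loss and optimizer are placeholders for the real ResNet50 + ArcFace setup.

```python
import tensorflow as tf

ACCUM_STEPS = 6  # e.g. 6 sub-batches of 16 images -> effective batch of 96

# stand-in model/loss/optimizer so the sketch runs on its own
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(112, 112, 3)),
    tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(0.01, momentum=0.9)

# one accumulator variable per trainable weight
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

@tf.function
def accumulate(images, labels):
    with tf.GradientTape() as tape:
        # divide so the applied update is the mean over the sub-batches
        loss = loss_fn(labels, model(images, training=True)) / ACCUM_STEPS
    for acc, grad in zip(accum, tape.gradient(loss, model.trainable_variables)):
        acc.assign_add(grad)

@tf.function
def apply_accumulated():
    optimizer.apply_gradients(zip(accum, model.trainable_variables))
    for acc in accum:
        acc.assign(tf.zeros_like(acc))

# training loop: several forward/backward passes, then a single update
# for step, (images, labels) in enumerate(dataset):
#     accumulate(images, labels)
#     if (step + 1) % ACCUM_STEPS == 0:
#         apply_accumulated()
```

As for the batch renormalization update, Keras exposes it directly via `tf.keras.layers.BatchNormalization(renorm=True)`.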
If you are not using the provided Dockerfile, you will need to install the following packages:
pip3 install tensorflow-gpu==2.0.0b1 pillow mxnet matplotlib==3.0.3 opencv-python==3.4.1.15 scikit-learn
Download the CASIA-Webface or MS1M-ArcFace dataset from the insightface dataset zoo and unzip it to the dataset folder.
Convert the dataset to the TensorFlow format:
cd dataset
mkdir converted_dataset
python3 convert_dataset.py
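Conceptually, convert_dataset.py reads the mxnet record files shipped by insightface and re-serializes them as TFRecord. Below is a simplified sketch of that conversion; the paths, feature names and record-layout details are assumptions based on the insightface format, not the script's exact code.

```python
import mxnet as mx
import tensorflow as tf

reader = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'r')
writer = tf.io.TFRecordWriter('converted_dataset/train.tfrecord')

# record 0 is a header: label[0] holds the index past the last image record
header0, _ = mx.recordio.unpack(reader.read_idx(0))
last_idx = int(header0.label[0])

for i in range(1, last_idx):
    header, jpeg_bytes = mx.recordio.unpack(reader.read_idx(i))
    label = header.label
    if not isinstance(label, float):  # labels may be stored as an array
        label = label[0]
    example = tf.train.Example(features=tf.train.Features(feature={
        'image_raw': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }))
    writer.write(example.SerializeToString())
writer.close()
```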
For training using 1 GPU:
python3 train.py
For training using multiple GPUs:
python3 train_multigpu.py
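The multi-GPU script relies on `tf.distribute.experimental.CentralStorageStrategy`. A minimal sketch of a custom training step under that strategy in TF 2.0 follows; the stand-in model and the dummy dataset are placeholders, not this repo's actual code.

```python
import tensorflow as tf

strategy = tf.distribute.experimental.CentralStorageStrategy()
GLOBAL_BATCH = 128 * strategy.num_replicas_in_sync

with strategy.scope():
    # stand-in model; the real script builds the modified ResNet50 here
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(112, 112, 3)),
        tf.keras.layers.Dense(10)])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    optimizer = tf.keras.optimizers.SGD(0.01, momentum=0.9)

# dummy data so the sketch runs; the real input is the converted TFRecord
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([GLOBAL_BATCH, 112, 112, 3]),
     tf.zeros([GLOBAL_BATCH], tf.int64))).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(images, labels):
    def step_fn(images, labels):
        with tf.GradientTape() as tape:
            per_example = loss_fn(labels, model(images, training=True))
            # average over the *global* batch, not the per-replica one
            loss = tf.nn.compute_average_loss(
                per_example, global_batch_size=GLOBAL_BATCH)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    # TF 2.0 name; renamed to strategy.run in later releases
    losses = strategy.experimental_run_v2(step_fn, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, losses, axis=None)

for images, labels in dist_dataset:
    train_step(images, labels)
```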
The training process can be followed by loading the generated log files (in output/logs) with TensorBoard.
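For example:

tensorboard --logdir output/logs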
The model can be evaluated on the lfw, cfp_ff, cfp_fp and age_db30 databases. The metrics are the same as those used in insightface.
Before launching the test, you may need to change the checkpoint path in the evaluation.py script.
python3 evaluation.py
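The verification metric insightface uses is 10-fold cross-validated accuracy: the distance threshold is selected on nine folds and accuracy is measured on the held-out fold. Below is a self-contained sketch of that computation; the function and argument names are illustrative, not the evaluation script's actual API.

```python
import numpy as np
from sklearn.model_selection import KFold

def verification_accuracy(dist, is_same, n_thresholds=400):
    """10-fold accuracy with the threshold tuned on the other 9 folds.

    dist: squared L2 distances between pairs of normalized embeddings.
    is_same: boolean array, True when a pair shows the same identity.
    """
    thresholds = np.linspace(0.0, 4.0, n_thresholds)
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=10).split(dist):
        # pick the threshold that maximizes accuracy on the train folds
        train_accs = [np.mean((dist[train_idx] < t) == is_same[train_idx])
                      for t in thresholds]
        best = thresholds[int(np.argmax(train_accs))]
        accuracies.append(
            np.mean((dist[test_idx] < best) == is_same[test_idx]))
    return float(np.mean(accuracies))
```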
model name | train db | normalization layer | reg loss | batch size | gpus | total_steps | download |
---|---|---|---|---|---|---|---|
model A | casia | batch normalization | uncontrolled | 16*8 | 1 | 150k | model a |
dbname | accuracy |
---|---|
lfw | 0.9772 |
cfp_ff | 0.9793 |
cfp_fp | 0.8786 |
age_db30 | 0.8752 |
model name | train db | normalization layer | reg loss | batch size | gpus | total_steps | download |
---|---|---|---|---|---|---|---|
model B | ms1m | batch renormalization | uncontrolled | 16*8 | 1 | 768k | model b |
dbname | accuracy |
---|---|
lfw | 0.9962 |
cfp_ff | 0.9964 |
cfp_fp | 0.9329 |
age_db30 | 0.9547 |
model name | train db | normalization layer | reg loss | batch size | gpus | download |
---|---|---|---|---|---|---|
model C | ms1m | batch renormalization | uncontrolled | 384 | 3 | model c |
dbname | accuracy |
---|---|
lfw | 0.9967 |
cfp_ff | 0.9970 |
cfp_fp | 0.9403 |
age_db30 | 0.9652 |
- The batch size must be bigger but the GPU is exhausted. -> Now using a batch size of 96 by updating the gradients after several inferences. -> Now using 2 GPUs with batch size 128 on each GPU with the central storage strategy.
- Further training of the net to improve accuracy.
- Add batch renormalization for training using small batches. (link)
- Add group normalization for training using small batches. (link)
- Train the model with a bigger dataset.
- Add quantization awareness to training. This is not yet possible in TensorFlow 2.0 because it was part of the contrib module, which has been removed in the new version, as commented in this issue.
- Test other network architectures.