
MobileFaceNet training pipeline #214

Closed
nttstar opened this issue May 15, 2018 · 44 comments

@nttstar
Collaborator

nttstar commented May 15, 2018

No description provided.

@nttstar
Collaborator Author

nttstar commented May 16, 2018

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91
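
(Note: each value in --lr-steps marks the iteration at which the learning rate is divided by 10. A minimal sketch of the stage-2 schedule this implies, assuming the usual 0.1 drop factor used by train_softmax.py:)

```python
# Piecewise-constant schedule implied by --lr-steps (assumes a 0.1
# drop factor at each listed global iteration).
def lr_at(step, base_lr=0.1, steps=(100000, 140000, 160000)):
    lr = base_lr
    for s in steps:
        if step >= s:
            lr *= 0.1
    return lr

print(lr_at(50000))   # 0.1
print(lr_at(120000))  # 0.01
print(lr_at(150000))  # 0.001
print(lr_at(170000))  # 0.0001
```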

@tianxingyzxq

Can you share the MobileNet v2 training pipeline?

@AllenMas

What is the accuracy on LFW and AgeDB after training with softmax? Can you share the training log?

@AleximusOrloff

Hi, can I ask in this thread?
Which type of weight initialization (filler) did you use during network creation? Xavier or something else?
I'm a newbie to MXNet, trying to reproduce your result in Torch7.

@youyicloud

I used MXNet to calculate the cosine distance between fc1 outputs, but the result looks wrong. The model was downloaded from the Baidu cloud link above, and the two test pictures are of two different people (a man and a woman), already aligned with MTCNN in the same way as the LFW pictures.

```python
# coding=utf-8
import mxnet as mx
import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])

image_size = (112, 112)
batch_size = 2

def load_model(model_prefix):
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_prefix, 0)
    all_layers = sym.get_internals()
    sym = all_layers['fc1_output']  # take the embedding output
    model = mx.mod.Module(symbol=sym, label_names=None)
    model.bind(data_shapes=[('data', (batch_size, 3, image_size[0], image_size[1]))])
    model.set_params(arg_params, aux_params)
    return model

def dis(x, y):
    # cosine similarity between two embedding vectors
    return np.dot(x, y) / np.linalg.norm(x) / np.linalg.norm(y)

def test(model_prefix):
    img_path_1 = "./img_test/41.jpg"
    img_path_2 = "./img_test/31.jpg"
    model = load_model(model_prefix)
    img1 = cv2.cvtColor(cv2.imread(img_path_1), cv2.COLOR_BGR2RGB)
    img1 = cv2.resize(img1, (112, 112), interpolation=cv2.INTER_CUBIC)
    img2 = cv2.cvtColor(cv2.imread(img_path_2), cv2.COLOR_BGR2RGB)
    img2 = cv2.resize(img2, (112, 112), interpolation=cv2.INTER_CUBIC)
    img1 = np.transpose(img1, axes=(2, 0, 1))  # HWC -> CHW
    img2 = np.transpose(img2, axes=(2, 0, 1))
    data_batch = np.array([img1, img2])
    print(data_batch.shape)
    print(img2.shape)
    model.forward(Batch([mx.nd.array(data_batch)]))
    prob = model.get_outputs()[0].asnumpy()
    print(dis(prob[0], prob[1]))

model_prefix = "../../models/model"
test(model_prefix)
```

Here is the output:

[00:19:53] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[00:19:53] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
(2, 3, 112, 112)
(3, 112, 112)
-0.9996472

could you tell me what have I missed? @nttstar

@nttstar
Collaborator Author

nttstar commented May 26, 2018

Why did you think the result was wrong?

@youyicloud

youyicloud commented May 26, 2018 via email

@nttstar
Collaborator Author

nttstar commented May 27, 2018

If the images were already aligned, why did you resize them again in your code?

@youyicloud

I had just cropped the images by the bounding boxes, so I needed to resize them to the input shape. I have found your code in the deploy dir and am analyzing my mistakes by comparing my code with yours. Thank you a lot!

@BUAA-21Li

The model I got is too big
I used the command:
CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network y1 --ckpt 2 --loss-type 0 --lr-steps 120000,140000 --wd 0.00004 --fc7-wd-mult 10 --per-batch-size 512 --emb-size 128 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../models/MobileFaceNet/model-y1-softmax
to get my model, but the model I got is almost 40M. I have no idea why my model is so much bigger than yours. PLEASE HELP ME

@AleximusOrloff

@BUAA-21Li
Your model is too big because of the last FC layer before the softmax layer.

@wayen820

wayen820 commented Jun 2, 2018

@BUAA-21Li use deploy/model_slim.py to delete the last layer.
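
Conceptually the script keeps only the graph up to the embedding output and drops the large fc7 classification weights. A minimal sketch of that idea (not the actual model_slim.py; prefix and epoch are placeholders):

```python
import mxnet as mx

prefix, epoch = '../models/MobileFaceNet/model-y1-softmax', 0
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
sym = sym.get_internals()['fc1_output']  # embedding output only
# keep only the parameters the slimmed graph still references
needed = set(sym.list_arguments()) | set(sym.list_auxiliary_states())
arg_params = {k: v for k, v in arg_params.items() if k in needed}
aux_params = {k: v for k, v in aux_params.items() if k in needed}
mx.model.save_checkpoint(prefix + '-slim', epoch, sym, arg_params, aux_params)
```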

@Audi16

Audi16 commented Jun 6, 2018

Why did you pre-train a model with softmax loss before training MobileFaceNet with ArcFace loss, but train the other networks from scratch?

@BUAA-21Li

@wayen820 THANKS! I have solved it!

@qidiso

qidiso commented Jun 10, 2018

Now we get higher accuracy using my modified MobileNet network:

[lfw][12000]Accuracy-Flip: 0.99617+-0.00358
[agedb_30][12000]Accuracy-Flip: 0.96017+-0.00893

@BUAA-21Li

@youyicloud Is your problem solved? My code is similar to yours, and the cosine distances between samples are all around -0.99, no matter whether the pairs are positive or negative.

@youyicloud

@BUAA-21Li You can use deploy/test.py and load the MobileFaceNet model; then you can use the cosine distance or the Euclidean distance. It outputs the right answer~
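
For L2-normalized embeddings the two metrics are interchangeable, since ||a - b||^2 = 2 - 2*cos(a, b). A minimal sketch of the comparison, with emb1 and emb2 standing in for the fc1 outputs of two images:

```python
import numpy as np

def compare(emb1, emb2):
    # L2-normalize both embeddings first
    a = emb1 / np.linalg.norm(emb1)
    b = emb2 / np.linalg.norm(emb2)
    cos_sim = float(np.dot(a, b))           # higher = more similar
    l2_dist = float(np.linalg.norm(a - b))  # lower  = more similar
    return cos_sim, l2_dist
```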

@BUAA-21Li

@youyicloud Thank you for your reply. Have you analyzed why your code failed to get the correct result?

@rmaria

rmaria commented Jun 28, 2018

In the article, you reported the following results for LResNet100E-IR (for m=0.5):
LFW: 99.83, CFP-FP: 94.04, AgeDB-30: 98.08

With the MobileFaceNet (m=?) you report the accuracies:
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

What is the expected accuracy drop of this model on MegaFace Challenge 1 (Table 9 from the article)?

@EdwardChou

Thanks for your code. Recently I was trying to reproduce the MobileFaceNet model following your instructions, yet I encountered the following problem; would you please give me some hints? (P.S. the training dataset combined faces_ms1m_112x112 with my private dataset, using scripts like im2rec.py, face2rec2.py and dataset_merge.py.)


root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/xl_marked --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/xl_marked', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='E', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'E')
Traceback (most recent call last):
  File "train_softmax.py", line 488, in <module>
    main()
  File "train_softmax.py", line 485, in main
    train_net(args)
  File "train_softmax.py", line 334, in train_net
    sym, arg_params, aux_params = get_symbol(args, arg_params, aux_params)
  File "train_softmax.py", line 170, in get_symbol
    embedding = fmobilefacenet.get_symbol(args.emb_size, bn_mom = args.bn_mom, version_output=args.version_output)
  File "symbols/fmobilefacenet.py", line 51, in get_symbol
    assert version_output=='GDC' or version_output=='GNAP'
AssertionError


@shangleyi

@EdwardChou Add "--version-output GNAP" to the arguments.

@EdwardChou

@shangleyi Thanks for the reply. After appending "--version-output GNAP" to the arguments and running again, another error popped up, even though I am using the correct input size, namely 112*112 input images. This is pretty weird.

expected [3,160,160], got [3,112,112]

The complete log is as following:

root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/marked_face_crop --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002 --version-output GNAP
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/marked_face_crop', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'GNAP')
INFO:root:loading recordio ../datasets/marked_face_crop/train.rec...
header0 label [  9369.  18696.]
id2range 9327
9368
rand_mirror 1
lr_steps [240000, 360000, 440000]
call reset()
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mxnet/python/mxnet/io.py", line 396, in prefetch_func
    self.next_batch[i] = self.iters[i].next()
  File "/proj/insightface/src/image_iter.py", line 215, in next
    batch_data[i][:] = self.postprocess_data(datum)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 437, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 691, in _set_nd_basic_indexing
    value.copyto(self)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 1876, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/mxnet/python/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [13:43:04] src/operator/nn/./../tensor/../elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [3,160,160], got [3,112,112]

Stack trace returned 10 entries:
[bt] (0) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f5416c1559a]
[bt] (1) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5416c16138]
[bt] (2) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseAttr<nnvm::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, nnvm::TShape const&)::{lambda(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*)#1}::operator()(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*) const+0xbf1) [0x7f5416e6da61]
[bt] (3) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<1, 1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x24a) [0x7f5416e6ff7a]
[bt] (4) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0xb4d) [0x7f54191c0e1d]
[bt] (5) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x35f) [0x7f5419198d8f]
[bt] (6) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0xe7b) [0x7f541968d4eb]
[bt] (7) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x3ff) [0x7f541968ecaf]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5494337e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f54943378ab]



[13:43:06] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
  optimizer_params=optimizer_params)
Killed

@shangleyi
Copy link

@EdwardChou How did you prepare train.rec?

@EdwardChou

Hi, @shangleyi
This is the way I generate train.rec:

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the open line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py .

# Move the generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

@shangleyi

@EdwardChou I used face2rec2.py directly without using im2rec.py and it worked. Maybe you should write a script that resizes the images and then use face2rec2.py directly; I'm not so sure about im2rec.py.
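
A minimal sketch of such a resize pass (SRC and DST are placeholder paths; assumes OpenCV):

```python
import os
import cv2

SRC, DST = './my_data', './my_data_112'

for root, _, files in os.walk(SRC):
    for name in files:
        if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        out_dir = os.path.join(DST, os.path.relpath(root, SRC))
        os.makedirs(out_dir, exist_ok=True)
        img = cv2.imread(os.path.join(root, name))
        if img is None:
            continue  # skip unreadable files
        img = cv2.resize(img, (112, 112), interpolation=cv2.INTER_CUBIC)
        cv2.imwrite(os.path.join(out_dir, name), img)
```

Then run face2rec2.py on the resized tree.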

@shangleyi

training dataset: ms1m, ms1m-v2, private dataset
lfw: 99.583, cfp_fp: 95.357, agedb_30: 96.533
training process: https://github.com/shangleyi/insightface-training-note/blob/master/README.md

@EdwardChou

@shangleyi Thank you so much. My problem was exactly that the resize function in im2rec.py doesn't work, so I resized the images with another script. The training procedure following the instructions above now looks good. You saved my day!

@sunjunlishi

Is there any training file corresponding to Caffe? I want to train with Caffe.

@erichouyi

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%

@EdwardChou

EdwardChou commented Oct 10, 2018

Hi @nttstar, I encountered something strange when finetuning the MobileFaceNet model (the 2nd step of the 2-step pipeline) and would like to ask for your help. My training acc got stuck at 0.51~0.53 while the accuracy on lfw and agedb-30 reached 95%. Similar to #187

My finetune params are:

Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/x', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=4, lr=0.1, lr_steps='100000,140000,160000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=94491, num_layers=1, per_batch_size=128, power=1.0, prefix='../xz/xz_mobile_facenet', pretrained='../xz_mobile_facenet,70', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)

and the result is like:

 INFO:root:Epoch[145] Batch [1780]   Speed: 851.07 samples/sec   acc=0.529687
 INFO:root:Epoch[145] Batch [1800]   Speed: 866.48 samples/sec   acc=0.529980
 INFO:root:Epoch[145] Batch [1820]   Speed: 725.38 samples/sec   acc=0.519043
 INFO:root:Epoch[145] Batch [1840]   Speed: 919.19 samples/sec   acc=0.527051
 INFO:root:Epoch[145] Batch [1860]   Speed: 996.87 samples/sec   acc=0.525586
 INFO:root:Epoch[145] Batch [1880]   Speed: 1021.45 samples/sec  acc=0.521094
 lr-batch-epoch: 0.0001 1894 145
 testing verification..
 (12000, 128)
 infer time 39.693939
 [lfw][1082000]XNorm: 11.132285
 [lfw][1082000]Accuracy-Flip: 0.99517+-0.00398
 testing verification..
 (14000, 128)
 infer time 42.053231
 [cfp_fp][1082000]XNorm: 9.771846
 [cfp_fp][1082000]Accuracy-Flip: 0.88900+-0.02205
 testing verification..
 (12000, 128)
 infer time 34.666512
 [agedb_30][1082000]XNorm: 11.260081
 [agedb_30][1082000]Accuracy-Flip: 0.95383+-0.00796
 saving 541

I have seen the training log you attached on baiduyun; it shows your model's acc reaching 0.5 after 15 epochs, which matches my experiment. Yet your log stops at epoch 24, when the highest acc reached 0.55. Did you conduct further experiments to reach higher accuracy? Or is there something wrong with the calculation of the training acc? Looking forward to your help, thanks.

@jiankang1991

Hi guys,
for the first step in the training pipeline, how many epochs do you usually need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

@clhne

clhne commented Feb 18, 2019

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
With your configuration, how long did the training take to reach LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91?

@clhne

clhne commented Feb 18, 2019

Hi guys,
for the first step in the training pipeline, how many epochs do you usually need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

@karlTUM
CPU: E5-2650 v4
GPU: 2x RTX 2080 Ti
Epoch 15, batch_size 32, lr 0.001

INFO:root:Epoch[15] Batch [32040-32060] Speed: 274.12 samples/sec acc=0.865625
INFO:root:Epoch[15] Batch [32060-32080] Speed: 272.38 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32080-32100] Speed: 272.94 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32100-32120] Speed: 272.41 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32120-32140] Speed: 272.01 samples/sec acc=0.852344
INFO:root:Epoch[15] Batch [32140-32160] Speed: 267.44 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32160-32180] Speed: 273.78 samples/sec acc=0.853125
INFO:root:Epoch[15] Batch [32180-32200] Speed: 274.96 samples/sec acc=0.851562
INFO:root:Epoch[15] Batch [32200-32220] Speed: 273.08 samples/sec acc=0.842187
INFO:root:Epoch[15] Batch [32220-32240] Speed: 273.76 samples/sec acc=0.849219
lr-batch-epoch: 0.0001 32249 15
testing verification..
(12000, 512)
infer time 25.010638999999994
[lfw][924000]XNorm: 23.051082
[lfw][924000]Accuracy-Flip: 0.99700+-0.00296
testing verification..
(14000, 512)
infer time 29.09600100000001
[cfp_fp][924000]XNorm: 23.878208
[cfp_fp][924000]Accuracy-Flip: 0.92786+-0.01553
testing verification..
(12000, 512)
infer time 24.954134000000025
[agedb_30][924000]XNorm: 23.627240
[agedb_30][924000]Accuracy-Flip: 0.97650+-0.01031
saving 462

@clhne

clhne commented Feb 18, 2019

Similar issue. On my side the acc already reached 0.9 at epoch 17, but after that it improves very slowly.
May I ask:

  1. Which CPU model, and how many?
  2. Which GPU model, and how many cards?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
How is the log file generated automatically? Thanks!

@Talgin

Talgin commented Jul 4, 2019

Hi, @shangleyi
This is the way I generate train.rec:

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the open line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py .

# Move the generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

Hi, have you managed to get a correct merged dataset?
We also tried to merge two datasets, faces_emore and faces_glint, with dataset_merge.py using the following command:
python dataset_merge.py --include /home/ti/Downloads/DATASETS/faces_emore,/home/ti/Downloads/DATASETS/faces_glint --output /home/ti/Downloads/DATASETS/merge --model /home/ti/Downloads/insightface/models/model-r100-ii/model,0
But after the merge completed, the resulting dataset had the same property file and .rec/.idx sizes as the faces_emore dataset.
What is wrong with our parameters?

Thank you!

@shangleyi

It has been a year and I can hardly remember what I did, but did you try adding the quotation marks?

@jinwu07

jinwu07 commented Jul 4, 2019

Trained MobileFaceNet on emore; here is the result:

Called with argument: Namespace(batch_size=224, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ce_loss=False, ckpt=1, color=0, ctx_num=1, cutoff=0, data_dir='../datasets/faces_emore', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_size='112,112', image_w=112, images_filter=0, loss_type=4, lr=0.1, lr_steps='200000,280000,320000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=85742, num_layers=1, per_batch_size=224, power=1.0, prefix='../models/y1-arcface-emore/model', pretrained='../models/y1-softmax-emore/model,234', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_multiplier=1.0, version_output='E', version_se=0, version_unit=3, wd=4e-05)

testing verification..
(12000, 128)
infer time 5.607243
[lfw][346000]XNorm: 11.406996
[lfw][346000]Accuracy-Flip: 0.99600+-0.00442
testing verification..
(14000, 128)
infer time 6.47071
[cfp_fp][346000]XNorm: 9.418514
[cfp_fp][346000]Accuracy-Flip: 0.94729+-0.01445
testing verification..
(12000, 128)
infer time 5.542683
[agedb_30][346000]XNorm: 11.237676
[agedb_30][346000]Accuracy-Flip: 0.96300+-0.00942

@capilano

capilano commented Jul 10, 2019

What does Accuracy-Flip mean? Does it have to do with using features of flipped images during training (as described in one of the MobileFaceNet papers)?
Or with flipping during post-processing while calculating embedding distances?
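
If the evaluation follows the common flip-test scheme (an assumption about the verification code, not a quote from it), each image's embedding is summed with the embedding of its horizontal flip and then L2-normalized before computing distances. A minimal sketch, with embed() as a hypothetical embedding function:

```python
import numpy as np

def flip_embedding(img, embed):
    # sum the embeddings of the image and its horizontal flip (axis=1
    # flips width for an HxWxC image), then L2-normalize
    e = embed(img) + embed(np.flip(img, axis=1))
    return e / np.linalg.norm(e)
```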

@NOON47

NOON47 commented Jul 23, 2019

Hello @nttstar, how can I finetune on my own data? Could you provide a pretrained model that keeps the fc7 layer?

@EdwardVincentMa

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Your max-steps is 140002 (140K), but you said 120K, and your lr-steps are 240000 (240K), 360000 (360K), ...; which is right?

@yichaojin

yichaojin commented May 18, 2020

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%
@erichouyi

What is your acc on the training data?

@bahar3474
Contributor

bahar3474 commented Nov 4, 2020

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Which version of ms1m did you use? I trained MobileFaceNet with the ms1m-refine-v1 dataset and the same config (except that I used 2 GPUs with per_batch_size=256), but the maximum accuracy on LFW in 180K iterations was 0.99400.

@CasonTsai

@bahar3474 Hello, excuse me, where is the train_softmax file? There is no such file in the new version of the repo.

@bahar3474
Contributor

@CasonTsai
Hi. I used this version of the code:
https://github.com/deepinsight/insightface/blob/08265c749a7af6f1d7e9057df55a3eb2b171ddcb/src/train_softmax.py
Two months ago they refactored the repo structure, and I don't know where you can find it in the new version.
