
Asian training dataset(from glint) discussion. #256

Closed
nttstar opened this issue Jun 14, 2018 · 160 comments
@nttstar
Collaborator

nttstar commented Jun 14, 2018

  1. Download the dataset from http://trillionpairs.deepglint.com/data (after signing up). msra is a cleaned subset of MS1M from glint, while celebrity is the Asian dataset.
  2. Generate the .lst file by calling src/data/glint2lst.py. For example:
python glint2lst.py /data/glint_data msra,celebrity > glint.lst

or generate the Asian dataset only:

python glint2lst.py /data/glint_data celebrity > glint_cn.lst
  3. Call face2rec2.py to generate the .rec file.
  4. Merge the dataset with an existing one by calling src/data/dataset_merge.py without setting the param model, which will combine all IDs from the two datasets.

Finally you will get a dataset containing about 180K IDs.

Use src/eval/gen_glint.py to prepare the test feature file using a pretrained insightface model.

You can also post your private testing results here.
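After step 4, a quick sanity check of the merged list is easy to script. A minimal sketch, assuming the usual insightface .lst layout of aligned_flag&lt;TAB&gt;image_path&lt;TAB&gt;label per line and assuming the merge relabels IDs contiguously from 0 (both are assumptions about glint2lst.py/dataset_merge.py output, not verified against the repo):

```python
def dataset_summary(lst_lines):
    """Return (num_ids, num_images) and check that labels are contiguous
    from 0, which a merge-and-relabel step should produce."""
    rows = [l for l in lst_lines if l.strip()]
    labels = {int(l.split("\t")[2]) for l in rows}   # assumed: label is field 3
    assert labels == set(range(len(labels))), "labels are not contiguous"
    return len(labels), len(rows)

# Tiny demo in place of the real glint.lst
demo = ["1\ta/0.jpg\t0", "1\ta/1.jpg\t0", "1\tb/0.jpg\t1"]
print(dataset_summary(demo))   # (2, 3)
```

On the real merged list the first number should come out around 180K.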

@aaaaaaaak

Thanks for sharing!

@cysin

cysin commented Jun 17, 2018

@nttstar will you train new models with these data?

@406747925

A real service to the community!

@meanmee
meanmee commented Jun 19, 2018

Is anyone from Glint here? I'm from Bitmain downstairs. The download is too slow; can I just come upstairs and copy it directly?

@lmmcc
lmmcc commented Jun 19, 2018

Do the msra and celebrity datasets contain any of the same people?

@aa12356jm

I attended DeepGlint's talk a few days ago where they announced this dataset. I had just finished downloading it (several hundred GB); I didn't expect it to show up here so quickly. Thanks!

@JianbangZ

After testing, this dataset is quite clean, but it still contains roughly 0.3%-0.8% noise.
Also, we found their ms1m and Asian parts still have about 15-30 overlapping identities, though I guess that hardly matters at this scale.
Another finding is that this dataset suffers badly from a long tail. Taking the Asian part as an example, only about 18K of the ~94K identities have over 25 images per class, and only a few thousand identities have over 60 images.
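The long-tail observation above is easy to reproduce once you have a .lst file. A minimal sketch, assuming the label is the third tab-separated field (as in typical insightface .lst files):

```python
from collections import Counter

def long_tail_stats(lst_lines, thresholds=(25, 60)):
    """Count images per identity, then report how many identities
    have strictly more images than each threshold."""
    counts = Counter()
    for line in lst_lines:
        if line.strip():
            counts[line.rstrip("\n").split("\t")[2]] += 1  # assumed: flag, path, label
    return {t: sum(1 for c in counts.values() if c > t) for t in thresholds}

# Tiny demo: identity "0" has 3 images, identity "1" has 1
demo = ["1\ta/%d.jpg\t0" % i for i in range(3)] + ["1\tb/0.jpg\t1"]
print(long_tail_stats(demo, thresholds=(2,)))   # {2: 1}
```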

@meanmee

meanmee commented Jun 20, 2018

@aa12356jm could you share it on BaiduYun?

@zhenglaizhang

@nttstar I downloaded the dataset from glint; the faces look similarity-transformed and resized to 400x400. For ArcFace, how should I crop/resize them to 112x112?

@nttstar
Collaborator Author

nttstar commented Jun 20, 2018

@zhenglaizhang I already provided the scripts.
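For readers curious what those scripts do: insightface's preprocessing estimates a similarity transform from the five detected landmarks to a fixed 112x112 template and warps the image with it. A minimal numpy sketch of that estimation (the template values below are the widely circulated ArcFace reference landmarks; the helper itself is an illustration, not the repo's exact code):

```python
import numpy as np

# Commonly cited ArcFace 5-point template for a 112x112 crop
ARCFACE_DST = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float64)

def similarity_transform(src, dst):
    """Umeyama estimate of the 2x3 matrix M = (scale*R | t) mapping src -> dst."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1                      # avoid reflections
    R = U @ np.diag(d) @ Vt
    scale = (S * d).sum() / src_c.var(0).sum()
    t = mu_d - scale * R @ mu_s
    return np.hstack([scale * R, t[:, None]])   # 2x3, usable with cv2.warpAffine

# With the five landmarks of a 400x400 glint image as src:
# M = similarity_transform(src, ARCFACE_DST)
# aligned = cv2.warpAffine(img, M, (112, 112))
```

The 2x3 matrix is exactly the shape cv2.warpAffine expects, which is how the 400x400 images end up as aligned 112x112 crops.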

@HaoLiuHust

@JianbangZ do you have some idea to solve these problems?

@starimpact

awesome !

@devymex

devymex commented Jun 21, 2018

Thanks DeepGlint!

@xxllp
xxllp commented Jun 22, 2018

This is meaningful work.

@vzvzx
vzvzx commented Jun 23, 2018

@nttstar the download link is broken.

@libohit
libohit commented Jun 24, 2018

@nttstar @JianbangZ How did you download the glint Asian face dataset? I can't find where to register or sign up.

@aaaaaaaak
aaaaaaaak commented Jun 25, 2018

@nttstar @JianbangZ Why can I only extract 1.7G with 2000+ IDs from the Asian face dataset I downloaded? How should I handle the 90+G .tar.gz file? Could you give some guidance? Thanks.

@anguoyang

There are no lmk files in the dataset:
"lmk_file = os.path.join(input_dir, "%s_lmk.txt"%(ds))"
Is that correct?

@Wisgon
Wisgon commented Jun 25, 2018

I have the same problem as @libohit: I can't sign in at http://trillionpairs.deepglint.com/data; the "sign in" button is grayed out!

@anguoyang

@Wisgon maybe you need to use another browser

@wangchust
wangchust commented Jun 26, 2018

Can anyone share a copy of the lmk files? Their official site seems to be under maintenance; I couldn't download anything.

@goodpp
goodpp commented Jun 26, 2018

The download isn't working now. What's going on?

@Wisgon
Wisgon commented Jun 26, 2018

I can't download the dataset. When I click the Download button, an error appears:
This XML file does not appear to have any style information associated with it. The document tree is shown below. <Error><Code>InvalidAccessKeyId</Code><Message>The OSS Access Key Id you provided is disabled.</Message><RequestId>5B31AE6FF68A5D785875635D</RequestId><HostId>dgplaygroundopen.oss-cn-qingdao.aliyuncs.com</HostId><OSSAccessKeyId>LTAIKdTReMdV71Zi</OSSAccessKeyId></Error>

@shineway14

I can't sign in at http://trillionpairs.deepglint.com/data; the "sign in" button is grayed out!

@Wisgon
Wisgon commented Jun 26, 2018

@shineway14 You can use http://trillionpairs.deepglint.com/login to sign in; when you finish filling in the fields, press Enter instead of clicking the 'log in' button.
BTW, use http://trillionpairs.deepglint.com/register to register.

@meanmee
meanmee commented Jun 28, 2018

@nttstar what is the exact script to merge msra and celeb?

@goodpp
goodpp commented Jun 28, 2018

@aaaaaaaak The Asian face dataset I recently downloaded via BitTorrent is fine and matches the official numbers. For reference, mine shows:
directory size: 98G ./asian-celeb
identity count (ls -lR | grep "^d" | wc -l): 93979
image count (ls -lR | grep "^-" | grep ".jpg" | wc -l): 2830146

@jackytu256
jackytu256 commented Jul 2, 2018

Hi all,
I have already done step 1 and obtained the glint_cn list file; however, I got an error while trying to do step 2. The error is below; please help me fix this issue. Thanks.

OpenCV Error: Assertion failed (src.cols > 0 && src.rows > 0) in warpAffine, file /build/buildd/opencv-2.4.8+dfsg1/modules/imgproc/src/imgwarp.cpp, line 3445
Traceback (most recent call last):
  File "face2rec2.py", line 256, in <module>
    image_encode(args, i, item, q_out)
  File "face2rec2.py", line 99, in image_encode
    img = face_preprocess.preprocess(img, bbox = item.bbox, landmark=item.landmark, image_size='%d,%d'%(args.image_h, args.image_w))
  File "../common/face_preprocess.py", line 107, in preprocess
    warped = cv2.warpAffine(img,M,(image_size[1],image_size[0]), borderValue = 0.0)
cv2.error: /build/buildd/opencv-2.4.8+dfsg1/modules/imgproc/src/imgwarp.cpp:3445: error: (-215) src.cols > 0 && src.rows > 0 in function warpAffine
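That assertion fires when cv2.imread returns an empty image (a missing or corrupt file), so warpAffine receives a 0x0 source. A hedged pre-flight check, assuming the .lst layout flag&lt;TAB&gt;path&lt;TAB&gt;label with paths relative to a root directory, that flags entries the conversion would choke on:

```python
import os

def find_bad_entries(lst_path, root):
    """Return .lst lines whose image file is missing, empty, or not a JPEG."""
    bad = []
    with open(lst_path) as f:
        for line in f:
            img = os.path.join(root, line.rstrip("\n").split("\t")[1])
            try:
                with open(img, "rb") as im:
                    magic = im.read(2)
                if magic != b"\xff\xd8":   # JPEG files start with the SOI marker
                    bad.append(line)
            except OSError:                 # missing or unreadable file
                bad.append(line)
    return bad
```

Remove (or re-download) the flagged entries before regenerating the .rec file.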

@anguoyang

@jackytu256 Does the package you downloaded contain the lmk file?
lmk_file = os.path.join(input_dir, "%s_lmk"%(ds))

@mlourencoeb

Hello @Talgin,

emore is based on MS-Celeb-1M, just like the non-Asian component of faces_glint. I would merge emore with the Asian part only, but I could be wrong.

@Talgin
Talgin commented Jul 5, 2019

@mlourencoeb,
Thank you!
I'm not sure, but maybe faces_glint is a combination of emore and the Asian dataset? :) But I'll try to merge them :)

@HuanJiML
HuanJiML commented Jul 9, 2019

@zhouwei5113 have you solved your problem? I also got a really low score on trillionpairs.

@shiyuanyin

@nttstar
Hi, I want to modify the architecture at the SE block, but I'm a bit confused: with MXNet's symbol API I can't directly get the (b, c, h, w) values. I'm porting the PyTorch SGE module, and to adapt it at the SE position of your model I need to implement b, c, h, w = x.size(); x = x.reshape(b * self.groups, -1, h, w) after the bn3 layer, but I can't obtain b, c, h, w from the symbol. I'm not very familiar with MXNet; do you have a good way to implement this reshape?
The place I'm modifying in fresnet.py:
bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn3')
#if use_se:
if use_sge:
    # get bn3's (b, c, h, w)
    # then reshape

Below is the corresponding PyTorch implementation:

class SpatialGroupEnhance(nn.Module):
    # note: conv(3, 2, 1) halves h and w; conv(3, 1, 1) keeps the size
    def __init__(self, groups=64):
        super(SpatialGroupEnhance, self).__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.weight = Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = Parameter(torch.ones(1, groups, 1, 1))
        self.sig = nn.Sigmoid()

    def forward(self, x):  # (b, c, h, w)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w)   # split channels into groups
        xn = x * self.avg_pool(x)               # x * global pooling over (h, w)
        xn = xn.sum(dim=1, keepdim=True)        # (b*groups, 1, h, w)
        t = xn.view(b * self.groups, -1)
        t = t - t.mean(dim=1, keepdim=True)
        std = t.std(dim=1, keepdim=True) + 1e-5
        t = t / std                             # normalize: (t - mean) / std
        t = t.view(b, self.groups, h, w)
        t = t * self.weight + self.bias
        t = t.view(b * self.groups, 1, h, w)
        x = x * self.sig(t)                     # per-group attention factor in (0, 1)
        x = x.view(b, c, h, w)                  # restore the original shape
        return x
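For the MXNet port: the symbol API does not expose b at graph-construction time, but mx.sym.reshape supports the special dims 0 (copy the corresponding input dim) and -1 (infer), and the channel count is known from the network config. So the view above can be written as mx.sym.reshape(bn3, shape=(-1, channels // groups, 0, 0)), and the restore as shape=(-1, channels, 0, 0) (my suggestion, not code from the repo). The numpy sketch below just demonstrates that this shape spec matches the PyTorch view/restore pair:

```python
import numpy as np

b, c, h, w, groups = 2, 8, 5, 5, 4
x = np.arange(b * c * h * w, dtype=np.float32).reshape(b, c, h, w)

# PyTorch: x.view(b * groups, -1, h, w)
# MXNet symbol equivalent (suggested): mx.sym.reshape(x, shape=(-1, c // groups, 0, 0))
xg = x.reshape(-1, c // groups, h, w)
assert xg.shape == (b * groups, c // groups, h, w)

# PyTorch: x.view(b, c, h, w)
# MXNet symbol equivalent (suggested): mx.sym.reshape(xg, shape=(-1, c, 0, 0))
xr = xg.reshape(-1, c, h, w)
assert (xr == x).all()
```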

@shiyuanyin

@nttstar
I added the SGE module to the ResNet-50 IR architecture, starting from your downloaded pretrained ResNet-50 on the glint data. The verification results after training are below; not much change:

testing verification..
(12000, 512)
infer time 7.123213
[lfw][8000]XNorm: 22.401950
[lfw][8000]Accuracy-Flip: 0.99800+-0.00287
testing verification..
(14000, 512)
infer time 8.335358
[cfp_fp][8000]XNorm: 21.203882
[cfp_fp][8000]Accuracy-Flip: 0.95300+-0.01448
testing verification..
(12000, 512)
infer time 7.040614
[agedb_30][8000]XNorm: 23.488769
[agedb_30][8000]Accuracy-Flip: 0.98000+-0.00749

@SueeH
SueeH commented Jul 31, 2019

> (quoting @Talgin's reply above)

Any conclusion about the dataset? Is faces_glint = emore + asian_celeb?
I have the same issue in #789.

@Talgin
Talgin commented Aug 5, 2019

Hi @nttstar,
We are training on faces_glint + our_custom_dataset. It has now been almost 10 days, and what I want to ask is why our accuracy is not changing: it stays at acc=~0.30-0.31. At the beginning the loss started from ~46.6-46.9 and after 2 days decreased to ~7.2-7.5; acc was 0.0000 and began to rise, but after the 20th epoch it stopped, as the picture below shows. It is now the 45th epoch, but nothing has changed.
Our parameters are:
Loss: arcface
default.end_epoch = 1000
default.lr = 0.001
default.wd = 0.0005
default.mom = 0.9
default.per_batch_size: 64
default.ckpt = 3

network = r100

We are using 4 Tesla P100 GPUs.
You can see the progress in the screenshot below:
(screenshot: training progress, 2019-08-02)

@nttstar could you tell us what the problem is? We merged the datasets according to your instructions with dataset_merge.py and no error occurred :)

@Talgin
Talgin commented Aug 5, 2019

Hi @SueeH,
Sorry for the late reply. I think this info is noted in their paper:
(screenshot: dataset table from the paper, 2019-08-05)

They say that faces_glint (DeepGlint-Face) includes MS1M-DeepGlint and Asian-DeepGlint. As far as I know, from reading this (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8698884), MS1M-DeepGlint is a refined version of MS1M (provided by DeepGlint Corp.), and on http://trillionpairs.deepglint.com/overview they say:

  • MS-Celeb-1M-v1c with 86,876 ids/3,923,399 aligned images cleaned from the MS-Celeb-1M dataset. This dataset has been excluded from both LFW and Asian-Celeb.
  • Asian-Celeb with 93,979 ids/2,830,146 aligned images. This dataset has been excluded from both LFW and MS-Celeb-1M-v1c.

So I think that emore (MS1MV2) is a different refined version of MS1M from the one included in faces_glint (because MS1M-DeepGlint has 2K more ids than MS1MV2, but fewer images: 3.9M vs 5.8M).

@EdwardVincentMa

> (quoting @nttstar's instructions from the original post)

Brother, I'm also in Shanghai. Training MobileFaceNet + ArcLoss on the webface dataset or face-ms1m always hits NaN. I wonder whether you've run into this: even with lr set to 0.0001, it goes NaN after twenty-odd epochs (around epoch 24).

@pake2070

Anyone can share configure training Asian Faces ? thanks

@pake2070

I followed the steps but got an error about the key 'image':
my configuration: CUDA_VISIBLE_DEVICES='0,1' python3 -u src/train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 32 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002

but I get a KeyError for the asian dataset:

(screenshot: the KeyError traceback)

@maywander

@Edwardmark I have the same problem as you. Did you get good results on deepglint in the end?

@Edwardmark

@maywander No, I didn't. In the end I used the emore data instead.

@maywander

So the models trained on emore perform better on the trillionpairs test platform? @Edwardmark

@Edwardmark

@maywander yes, and I don't know why.

@anguoyang

I can generate the glint.lst file fine, but calling face2rec2.py always fails. Does anyone know how to set the parameters? Thanks.

@anguoyang

I suspect there is a problem with the code.

@anguoyang

No such file or directory: '..../insightface/src/data/property'
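That error usually means the property file is missing from the data directory. In insightface's data layout it is a one-line text file, num_classes,img_h,img_w, sitting next to the .lst/.rec files. A minimal sketch (the 180000 here is illustrative; use your real merged ID count):

```python
# Write the 'property' file that insightface's data scripts expect.
# Assumed format: <num_classes>,<img_h>,<img_w> on a single line.
num_ids = 180000   # illustrative; replace with your merged ID count
with open("property", "w") as f:
    f.write("%d,112,112\n" % num_ids)

print(open("property").read().strip())   # 180000,112,112
```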

@zhouyongxiu

@nttstar I used the glint dataset to train the model but only got 77% acc on the glint test. Could you share the training log that reaches 86% acc?

@cocoza4
cocoza4 commented Apr 12, 2020

How many iterations does it take to train on this combined dataset from scratch, using any of the provided models, until it converges?

@John1231983

Thanks for the valuable discussion. Has anyone seen improvements on MegaFace and IJB-C when training on the merged dataset? Thanks.

@aravinthmuthu

@nttstar Thanks for the great work.
Could you please share train.lst for ms1mv2?

@mlourencoeb
Could you please share the intersection list between emore and asian glint?

Thanks in advance.

@ngocson1804
ngocson1804 commented Jun 15, 2021

> (quoting @nttstar's instructions from the original post)

Can you share the src/ source code folder with me? I cannot find it. Thank you.

@ghost
ghost commented Jul 13, 2021

> (quoting @Talgin's reply to @SueeH above)

Hi @Talgin, I trained a model with Asian faces but I got the same error. How did you solve it?

@huynhtruc0309

> (quoting @nttstar's instructions from the original post)

> Can you share the src/ source code folder with me? I cannot find it. Thank you.

I have the same problem. Have you found it?

@AmesianX


[Asian-celeb dataset]

  • Training data (Asian-celeb)

The dataset consists of crawled images of celebrities on the web. The images are covered under a Creative Commons Attribution-NonCommercial 4.0 International license (please read the license terms at http://creativecommons.org/licenses/by-nc/4.0/).


[train_msra.tar.gz]

MD5:c5b668f2204c400099b14f367069aef5

Content: Train dataset called MS-Celeb-1M-v1c with 86,876 ids/3,923,399 aligned images cleaned from MS-Celeb-1M dataset.

This dataset has been excluded from both LFW and Asian-Celeb.

Format: *.jpg

Google: https://drive.google.com/file/d/1aaPdI0PkmQzRbWErazOgYtbLA1mwJIfK/view?usp=sharing

[msra_lmk.tar.gz]

MD5:7c053dd0462b4af243bb95b7b31da6e6

Content: A list of five-point landmarks for the 3,923,399 images in MS-Celeb-1M-v1c.

Format: <image_path> <label> <x1> <y1> <x2> <y2> ... <x5> <y5>

where <image_path> is the path of the image in the tar file train_msceleb.tar.gz.

Label is an integer ranging from 0 to 86,875.

(x,y) is the coordinate of a key point on the aligned images.

left eye
right eye
nose tip
mouth left
mouth right

Google: https://drive.google.com/file/d/1FQ7P4ItyKCneNEvYfJhW2Kff7cOAFpgk/view?usp=sharing
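Assuming each lmk line is whitespace-separated as an image path, an optional label, and five (x, y) pairs in the order listed above (my reading of the description, not a verified spec), a small parser sketch:

```python
def parse_lmk_line(line, has_label=True):
    """Parse one landmark-list line into (path, label, points).
    points is a list of five (x, y) float pairs ordered: left eye,
    right eye, nose tip, mouth left, mouth right.
    The testdata lmk list has no label column (has_label=False)."""
    parts = line.split()
    path = parts[0]
    label = int(parts[1]) if has_label else None
    start = 2 if has_label else 1
    coords = [float(v) for v in parts[start:]]
    points = list(zip(coords[0::2], coords[1::2]))
    assert len(points) == 5, "expected five landmarks"
    return path, label, points

# demo on a made-up line
p, lab, pts = parse_lmk_line("id0/001.jpg 0 1 2 3 4 5 6 7 8 9 10")
```

The returned points can be fed directly into a similarity-transform-based alignment step.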

[train_celebrity.tar.gz]

MD5:9f2e9858afb6c1032c4f9d7332a92064

Content: Train dataset called Asian-Celeb with 93,979 ids/2,830,146 aligned images.

This dataset has been excluded from both LFW and MS-Celeb-1M-v1c.

Format: *.jpg

Google: https://drive.google.com/file/d/1-p2UKlcX06MhRDJxJukSZKTz986Brk8N/view?usp=sharing

[celebrity_lmk.tar.gz]

MD5:9c0260c77c13fbb32692fc06a5dbfaf0

Content: A list of five-point landmarks for the 2,830,146 images in Asian-Celeb.

Format: <image_path> <label> <x1> <y1> <x2> <y2> ... <x5> <y5>

where <image_path> is the path of the image in the tar file train_celebrity.tar.gz.

Label is an integer ranging from 86,876 to 196,319.

(x,y) is the coordinate of a key point on the aligned images.

left eye
right eye
nose tip
mouth left
mouth right

Google: https://drive.google.com/file/d/1sQVV9epoF_8jS3ge6DqbilpWk3UNE8U7/view?usp=sharing

[testdata.tar.gz]

MD5:f17c4712f7562ea6d45f0a158e59b792

Content: Test dataset with 1,862,120 aligned images.

Format: *.jpg

Google: https://drive.google.com/file/d/1ghzuEQqmUFN3nVujfrZfBx_CeGUpWzuw/view?usp=sharing

[testdata_lmk.tar]

MD5:7e4995eb9976a2cfd2b23db05d76572c

Content: A list of five-point landmarks for the 1,862,120 images in testdata.tar.gz.

Features should be extracted in the same order, and in the same quantity, as this list.

Format: <image_path> <x1> <y1> <x2> <y2> ... <x5> <y5>

where <image_path> is the path of the image in the tar file testdata.tar.gz.

(x,y) is the coordinate of a key point on the aligned images.

left eye
right eye
nose tip
mouth left
mouth right

Google: https://drive.google.com/file/d/1lYzqnPyHXRVgXJYbEVh6zTXn3Wq4JO-I/view?usp=sharing

[feature_tools.tar.gz]

MD5:227b069d7a83aa43b0cb738c2252dbc4

Content: Feature format transform tool and a sample feature file.

Format: We use the same format as Megaface(http://megaface.cs.washington.edu/) except that we merge all files into a single binary file.

Google: https://drive.google.com/file/d/1bjZwOonyZ9KnxecuuTPVdY95mTIXMeuP/view?usp=sharing

@phu-minh

> (quoting @nttstar's instructions from the original post)

> Can you share the src/ source code folder with me? I cannot find it. Thank you.

> I have the same problem. Have you found it?

I have the same issue :( I cannot find the .py files for working with the dataset given on Baidu cloud.

@nttstar nttstar closed this as completed Jun 1, 2023