
Reproducing the training results on a megadepth dataset #253

Open
FlyFish-space opened this issue Mar 20, 2023 · 23 comments
@FlyFish-space

Thank you very much for your excellent work.
I recently reproduced the training on 4 RTX 3090 GPUs for 30 epochs following the README, with a batch size of 2 per GPU. I trained and tested on the D2-Net-undistorted MegaDepth dataset, and the results are as follows:
auc@5: 44.1, auc@10: 60.28, auc@20: 72.93
I also saw that a previous issue recommended setting the image sizes of both val and test to 640, but the results did not improve.
What could be the reason for this drop in accuracy?
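
For anyone double-checking the same thing, here is a minimal sketch of how the val/test resize is usually overridden in the repo's yacs-style data configs. The key name MGDPT_IMG_RESIZE and the base-config import are assumptions taken from the megadepth_trainval_* config pattern, so verify them against src/config/default.py in your checkout.

```python
# Hypothetical override file in the style of configs/data/megadepth_trainval_640.py.
# Assumption: the base config exposes DATASET.MGDPT_IMG_RESIZE (resize of the longer
# image side), as the stock MegaDepth configs do.
from configs.data.base import cfg

cfg.DATASET.MGDPT_IMG_RESIZE = 640  # try 840 here to match the recommended evaluation setting
```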

@benjaminkelenyi

Hello, I'm facing the same issue.
[results screenshot]
The loss fluctuates a lot...
[loss curve screenshot]

@chicleee

chicleee commented Jun 1, 2023

Hi, have you made any progress on this issue?

@benjaminkelenyi

benjaminkelenyi commented Jun 5, 2023 via email

Hello, thanks for your reply. Yes, I fixed the issue! Thank you!

@chen9run

chen9run commented Jun 7, 2023

Hi, have you found the reason?

@Mysophobias

Hello, my results are similar to yours. Have you tried changing TRAIN_IMG_SIZE to 840?

@Master-cai

Master-cai commented Jul 28, 2023

I'm training outdoor_ds with the default settings (image size 640), also on 4 RTX 3090 GPUs. I use the original MegaDepth data for training, since the undistorted images are not accessible now.

After 11 epochs of training, I got the following validation results:
auc@5: 45.6, auc@10: 62.4, auc@20: 75.1

They do not seem to improve anymore. I will train for the full 30 epochs and test the model on the test set (it may take another two days).

Has anyone else already reproduced the results using a similar setting? Would setting TRAIN_IMG_SIZE to 840 help?

@Master-cai

After 30 epochs of training, I reproduced the test on MegaDepth and got:
'auc@5': 0.4983204021567033,
'auc@10': 0.6676607412455137,
'auc@20': 0.7952598445093988,
'prec@5e-04': 0.9549532078302655

This is about 3 points lower than the reported accuracy.
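
In case it is useful for comparing numbers: auc@5/10/20 is the area under the cumulative pose-error curve up to 5/10/20 degrees, normalized by the threshold. Below is a minimal sketch following the SuperGlue-style evaluation convention, not necessarily byte-for-byte what the repo's test script runs.

```python
import numpy as np


def pose_error_auc(errors, thresholds=(5, 10, 20)):
    """AUC of the cumulative pose-error curve, one value per threshold (degrees)."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = {}
    for t in thresholds:
        idx = np.searchsorted(errors, t)                   # pairs with error below t
        e = np.concatenate((errors[:idx], [t]))            # clip the curve at the threshold
        r = np.concatenate((recall[:idx], [recall[idx - 1]]))
        area = np.sum(np.diff(e) * (r[:-1] + r[1:]) / 2)   # trapezoidal rule
        aucs[f"auc@{t}"] = area / t
    return aucs


# Toy example with random pose errors in [0, 30) degrees.
print(pose_error_auc(np.random.rand(1500) * 30))
```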

@Mysophobias

[validation results screenshot]
@Master-cai Hello, this is the training result I obtained when I set 'TRAIN_IMG_SIZE' to 640. Your training result is much better than mine. Have you tried setting 'TRAIN_IMG_SIZE' to 840?

@Master-cai

[validation results screenshot] @Master-cai Hello, this is the training result I obtained when I set 'TRAIN_IMG_SIZE' to 640. Your training result is much better than mine. Have you tried setting 'TRAIN_IMG_SIZE' to 840?

No, I use the default settings. Your results are very similar to mine after 11 epochs of training. What device do you use, and how long did you train?

@Mysophobias

[validation results screenshot] @Master-cai Hello, this is the training result I obtained when I set 'TRAIN_IMG_SIZE' to 640. Your training result is much better than mine. Have you tried setting 'TRAIN_IMG_SIZE' to 840?

No, I use the default settings. Your results are very similar to mine after 11 epochs of training. What device do you use, and how long did you train?

I also used 4 Nvidia RTX 3090 GPUs and trained for approximately 100 hours. I have tried using D2-Net to process the dataset, and these are the validation results I saved during training. I am really eager to know whether setting 'TRAIN_IMG_SIZE' to 840 would improve the accuracy after training.
[validation checkpoints screenshot]

@Master-cai

@Mysophobias I didn't process MegaDepth with D2-Net, and your checkpoints seem similar to mine, so I have no idea why your test results are worse. I just used the default reproduce_test/outdoor_ds.sh script to test.

As for image size 840, I think it might help, since it is the officially recommended setting after all. A 3090 has enough memory to train with 840, so you can try it.

@Mysophobias

@Mysophobias I didn't process MegaDepth with D2-Net, and your checkpoints seem similar to mine, so I have no idea why your test results are worse. I just used the default reproduce_test/outdoor_ds.sh script to test.

As for image size 840, I think it might help, since it is the officially recommended setting after all. A 3090 has enough memory to train with 840, so you can try it.

The code comments in configs/data/megadepth_trainval_840.py indicate that 32GB of GPU memory is required for training. I have also attempted training on four 24GB 3090 GPUs, but it was not successful. I will try again later. Anyway, thank you.

@Master-cai

@Mysophobias A 3090 can train it with a physical batch size of 1. I use gradient accumulation of 2, which gives an effective batch size of 1 × 2 × 4 = 8, as suggested by the author. I have trained it for one epoch, but I don't have GPUs available right now 💔. I hope my experience helps you, and it would be nice if you could share your final results. A sketch of the accumulation setup is below.
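
For anyone wiring this up, here is a minimal PyTorch Lightning sketch of the effective-batch-size arithmetic. TinyModule is a stand-in rather than the LoFTR module, and the Trainer flag names follow the older Lightning 1.x API; adapt them to the version you have installed.

```python
import torch
import pytorch_lightning as pl


class TinyModule(pl.LightningModule):
    """Stand-in for the repo's LightningModule, just to show the trainer flags."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Physical batch size 1 per GPU, as on a 24 GB 3090 at image size 840.
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1)),
    batch_size=1,
)

# Effective batch size = 1 (per GPU) x 2 (accumulated steps) x 4 (GPUs) = 8.
# On Lightning >= 1.5 use devices=4, accelerator="gpu", strategy="ddp" instead.
trainer = pl.Trainer(
    gpus=4,
    accelerator="ddp",
    accumulate_grad_batches=2,
    max_epochs=1,
)
trainer.fit(TinyModule(), loader)
```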

@xmlyqing00

May I ask how you trained the model on MegaDepth? I got stuck getting the training images from D2-Net. I noticed the LoFTR authors say the differences are subtle, but I don't know how to create the symbolic link. Do I need to download the MegaDepth SfM dataset?

Best,
yq
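
In case it helps later readers, here is a minimal sketch of creating the symlinks with Python's standard library. The source paths are placeholders, and the data/megadepth/{train,test,index} layout is an assumption based on the training docs, so check it against your configs before relying on it.

```python
import os

# Placeholder paths -- point these at your actual MegaDepth download.
MEGADEPTH_IMAGES = "/path/to/MegaDepth_v1"        # raw (or undistorted) scene images
MEGADEPTH_INDICES = "/path/to/megadepth_indices"  # scene-info .npz index files
REPO_DATA_DIR = "data/megadepth"                  # layout assumed from the training docs

os.makedirs(REPO_DATA_DIR, exist_ok=True)
for link_name, target in [
    ("train", MEGADEPTH_IMAGES),
    ("test", MEGADEPTH_IMAGES),
    ("index", MEGADEPTH_INDICES),
]:
    link = os.path.join(REPO_DATA_DIR, link_name)
    if not os.path.lexists(link):  # skip if the symlink already exists
        os.symlink(target, link)
```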

@Master-cai

@xmlyqing00 I think this issue can help.

@xmlyqing00

@xmlyqing00 I think this issue can help.

Thanks, I just fixed the training on MegaDepth.

@RunyuZhu

RunyuZhu commented Jan 15, 2024

I'm training outdoor_ds with the default settings (image size 640), also on 4 RTX 3090 GPUs. I use the original MegaDepth data for training, since the undistorted images are not accessible now.

After 11 epochs of training, I got the following validation results: auc@5: 45.6, auc@10: 62.4, auc@20: 75.1

They do not seem to improve anymore. I will train for the full 30 epochs and test the model on the test set (it may take another two days).

Has anyone else already reproduced the results using a similar setting? Would setting TRAIN_IMG_SIZE to 840 help?

Can I ask about your machine's memory capacity?
I train LoFTR on a single 3090 Ti (24 GB) with a 13th-gen i7 and 128 GB of RAM, with batch size 1, n_gpus_per_node=1, and num_workers=0, but the process got killed while training at epoch 2. I found that LoFTR nearly ran out of host memory (swap full and 125/126 GB of main memory used).
So may I ask for your hardware details, and have you ever run into this issue? It would be very kind of you to give me some tips.
Thanks.
zhu

@Master-cai

@RunyuZhu That's weird. I use 4 3090 Ti GPUs and 128 GB of memory (8 GB swap) to get those results, with num_workers set to 4. Memory consumption does indeed increase over time, but I never hit this bug, so I'm sorry I can't help you directly. I suggest you check the system log to confirm that the process was killed due to OOM, and whether some other processes are occupying a large amount of memory.
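
If it helps with debugging, a small helper along these lines (my own sketch; psutil is an extra dependency, not something the repo requires) can be called at the end of every epoch to confirm whether host RAM and swap really grow over time:

```python
import os

import psutil  # extra dependency: pip install psutil


def log_host_memory(tag=""):
    """Print this process's resident memory plus overall system RAM and swap usage."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(
        f"[{tag}] RSS={rss_gb:.1f} GiB | "
        f"RAM {vm.used / 1024**3:.0f}/{vm.total / 1024**3:.0f} GiB | "
        f"swap {swap.used / 1024**3:.0f}/{swap.total / 1024**3:.0f} GiB"
    )


# For example, call log_host_memory(f"epoch {epoch}") from an epoch-end hook.
log_host_memory("startup")
```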

@RunyuZhu

@RunyuZhu That's weird. I use 4 3090 Ti GPUs and 128 GB of memory (8 GB swap) to get those results, with num_workers set to 4. Memory consumption does indeed increase over time, but I never hit this bug, so I'm sorry I can't help you directly. I suggest you check the system log to confirm that the process was killed due to OOM, and whether some other processes are occupying a large amount of memory.

Thanks for your reply and valuable suggestions!
I will run it again with a larger num_workers or batch size, and log the info to locate the issue.
Thanks again!
zhu

@WJJLBJ

WJJLBJ commented Mar 16, 2024

@xmlyqing00 I think this issue can help.

Hello, how did you fix the problem at line 47 of LoFTR/src/datasets/megadepth.py? Line 47 in the official code is self.scene_info = np.load(npz_path, allow_pickle=True), which is different from what that issue shows.
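
For context, loading one of those scene-info files looks roughly like the snippet below; the path is a placeholder, and wrapping the result in dict() is only to show what the .npz contains, not the fix described in that issue.

```python
import numpy as np

# Placeholder path -- use one of your own MegaDepth index .npz files.
npz_path = "data/megadepth/index/scene_info/your_scene.npz"

# np.load on an .npz archive returns a lazy NpzFile; dict() materializes the arrays.
scene_info = dict(np.load(npz_path, allow_pickle=True))
print(list(scene_info.keys()))
```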

@WJJLBJ

WJJLBJ commented Mar 16, 2024

@xmlyqing00 I think this issue can help.

May I ask how you solved the problem of the D2-Net-preprocessed data for LoFTR no longer being downloadable? Did the approach in that issue help? I see that line 47 of LoFTR/src/datasets/megadepth.py is not the line given there, but rather self.scene_info = np.load(npz_path, allow_pickle=True). Could you tell me how you modified this file? Many thanks!

@Master-cai

@WJJLBJ Just use the original images and process them following the approach given in that issue.

@WJJLBJ

WJJLBJ commented Mar 17, 2024

@WJJLBJ Just use the original images and process them following the approach given in that issue.

Many thanks, the problem is solved.
