Reproducing results with multiple GPUs #14

Closed
RuijieZhu94 opened this issue Apr 3, 2024 · 5 comments
@RuijieZhu94

Hi Yuedong, thank you for open-sourcing your great work!

When I trained the model using 3 Nvidia RTX 3090s (batch size 4 per GPU), I got significantly worse results on re10k.

psnr 22.12379274863242
ssim 0.7298626045353773
lpips 0.22073094525619313

Will a smaller batch size or multi-GPU training significantly affect the performance of the model?
By the way, I use the official weights and can get results consistent with the paper.

psnr 26.386906073201686
ssim 0.8690403559103327
lpips 0.12837660807718004
@donydchen
Owner

donydchen commented Apr 3, 2024

Hi @RuijieZhu94, thanks for your interest in our work.

Yes, there was a small bug in the feature extraction introduced during code cleaning. It mainly concerns the (batch, view) dimension conversion; it does not affect testing, since testing keeps batch_size=1. We have already corrected it in our last commit (297338f). We have re-trained the model (after fixing the aforementioned bug) using both single-GPU and multi-GPU configurations, and both reproduced the results of the released model.
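
For reference, here is a minimal sketch (not our actual code; the backbone and shapes are placeholders) of the (batch, view) flattening pattern involved. The point is that folding views into the batch and unfolding them back must use matching axis orders, a mistake that goes unnoticed while batch_size=1:

```python
import torch
from einops import rearrange

def extract_features(images: torch.Tensor, backbone: torch.nn.Module) -> torch.Tensor:
    """images: (batch, view, channel, height, width)."""
    b, v = images.shape[:2]
    # Fold the view dimension into the batch dimension for the 2D backbone.
    flat = rearrange(images, "b v c h w -> (b v) c h w")
    feats = backbone(flat)
    # Unfold back. Writing "(v b)" here instead of "(b v)" yields the same
    # shape when batch_size == 1, but mixes samples across the batch otherwise.
    return rearrange(feats, "(b v) c h w -> b v c h w", b=b, v=v)
```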

Would you mind updating the code following our last commit (297338f) and re-training the model? Let us keep this issue open for you to update the results. For quicker debugging, your model should reach around PSNR=23 at step 10K with the updated code, whereas it stays around PSNR=20 at step 10K if the code still contains the aforementioned feature extraction bug.

By the way, we use batch_size=14 by default (a smaller batch_size might slightly harm performance, but not by much). The LPIPS weight is 0.05, and the lr scheduler is one-cycle with lr=2.e-4, as updated in commit 660f49c. Make sure you have also synchronised your code base (if you have made any changes) with the aforementioned commits.
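
For illustration only, a minimal PyTorch sketch of these settings (the model, optimizer choice, and total_steps below are placeholders, not our exact configuration):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# One-cycle schedule peaking at lr=2e-4; total_steps is a placeholder value.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-4, total_steps=300_000
)

batch_size = 14      # default batch size; smaller values may slightly hurt quality
lpips_weight = 0.05  # weight of the LPIPS term in the training loss
```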

@donydchen donydchen self-assigned this Apr 3, 2024
@RuijieZhu94
Author

Hi Yuedong, thanks for your prompt reply, I will retrain this model in the next few days.

@RuijieZhu94
Author

Hi Yuedong, I retrained the model with bs=12 and got the following results:

psnr 26.31555430801481
ssim 0.8676635705885196
lpips 0.12932708359573464

Thank you for your help.

@boxuLibrary

@RuijieZhu94 Could you share the link to the training dataset? I reached out to the author of pixelSplat for the link, but I cannot open it.

@RuijieZhu94
Author

@boxuLibrary Please contact me by email: ruijiezhu@mail.ustc.edu.cn.
