train_loss and val_loss nan #2629

Open
lsw2022 opened this issue Dec 2, 2024 · 6 comments

@lsw2022

lsw2022 commented Dec 2, 2024

Hi, thank you for this wonderful model.
I'd like to ask about a training issue I encountered. I just finished my training on nnunetv2. However, when I checked the training progress, I saw several NaN values in my training. Can you tell me what caused the problem? I look forward to a reply, thanks!
[Attachments: 1, 2]

@seziegler
Member

Hi @lsw2022,
Can you please share your dataset.json?
When you ran the preprocessing, did you use the --verify_dataset_integrity flag?
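
As a complementary check to the integrity flag mentioned above, here is a minimal sketch for counting non-finite voxels in an image once it has been loaded as a NumPy array. The function name `summarize_nonfinite` and the synthetic array are hypothetical and only for illustration; this is not part of nnU-Net, and how you load your own images is left open.

```python
import numpy as np


def summarize_nonfinite(volume: np.ndarray) -> dict:
    """Count NaN and infinite entries in an image array (hypothetical helper)."""
    data = np.asarray(volume, dtype=np.float64)
    return {
        "nan": int(np.isnan(data).sum()),
        "inf": int(np.isinf(data).sum()),
        "total": int(data.size),
    }


# Synthetic stand-in for a loaded image volume (not real data).
example = np.ones((4, 4, 4), dtype=np.float32)
example[0, 0, 0] = np.nan
example[1, 2, 3] = np.inf

report = summarize_nonfinite(example)
print(report)
```

If every case reports zero NaN and zero infinite values, the input images are less likely to be the source of the NaN losses, although this check says nothing about labels or the dataset.json itself.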

@ankushjindal5

In my case, this issue stems from mixed precision training; --verify_dataset_integrity gave no errors. You can try the --fp32 flag to disable mixed precision. However, this will increase GPU memory usage, so you might need to adjust your batch size accordingly.
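
For context on why reduced precision can produce NaN at all, here is a minimal sketch of general float16 behavior in NumPy. It is not nnU-Net code and does not show what happened in any particular run; the helper `to_fp16` and the value 70000.0 are chosen only for illustration.

```python
import numpy as np

# Largest finite value representable in half precision.
fp16_max = float(np.finfo(np.float16).max)


def to_fp16(value: float) -> np.float16:
    """Cast a Python float to float16, silencing the overflow warning (hypothetical helper)."""
    with np.errstate(over="ignore"):
        return np.float64(value).astype(np.float16)


# A value above the float16 range overflows to infinity on the cast.
large = to_fp16(70000.0)

# Arithmetic on infinities can then yield NaN, e.g. inf - inf.
with np.errstate(invalid="ignore"):
    diff = large - large

# The same value is perfectly representable in float32.
same_in_fp32 = np.float32(70000.0)

print(fp16_max, large, diff, same_in_fp32)
```

Once an intermediate value becomes infinite, later operations can turn it into NaN, and a NaN tends to propagate through everything computed from it. Full precision avoids this particular overflow at the cost of more memory per value.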

@laure0406

> In my case, this issue stems from mixed precision training; --verify_dataset_integrity gave no errors. You can try the --fp32 flag to disable mixed precision. However, this will increase GPU memory usage, so you might need to adjust your batch size accordingly.

Hi!
I am encountering a NaN problem as well. I just saw your reply and was wondering: did you use this flag with the preprocessing? (I am a beginner...)
Thank you so much for your answer!

@ankushjindal5

No, use it with the training command. Your preprocessed data is fine.

@laure0406

> No, use it with the training command. Your preprocessed data is fine.

Thank you so much. I am gonna try this right away.

@laure0406

> No, use it with the training command. Your preprocessed data is fine.

nnUNetv2_train 100 3d_fullres 0 -tr nnUNetTrainer_250epochs --fp32
This doesn't work: I get "unrecognized argument --fp32".
Did I misunderstand what you meant?
Thanks for your help.

4 participants