FIX: print logs when training ends early due to an exception when training with nnUNetTrainerBenchmark_5epochs and nnUNetTrainerBenchmark_5epochs_noDataLoading #2926
Open
Leirbag-gabrieL wants to merge 1 commit into MIC-DKFZ:master from
Conversation
Author
Found the solution to my issue here: pytorch/pytorch#119054. Apparently something is wrong with the ptxas bundled by default with triton. Using my own, by defining the environment variable
Hi,
First time using nnUNet, and I encountered unexpected behavior while trying to do a dry run of my code.
Training ended early while using the benchmarking trainers and reported
fastest_epoch = 'Not enough VRAM!' :/ while something else was actually going on (I have a 40 GB GPU, but an 8 GB config doesn't fit?).
The lack of error logs made me waste some time, because I did not know what was actually wrong.
The logging implementation was hiding this error:
For now, I have not fixed my underlying issue, but it is now correctly reported!
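For illustration, here is a minimal sketch of the kind of change this PR describes. All names here are hypothetical and do not come from the actual nnU-Net source: a benchmark loop that swallows every exception and assumes it was an out-of-memory error hides the real traceback, whereas printing the traceback before recording the result makes the true cause visible.

```python
import traceback


def run_benchmark(train_one_epoch, n_epochs=5):
    """Hypothetical benchmark loop (not the real nnU-Net code).

    Any exception during training is recorded as 'Not enough VRAM!',
    but the actual traceback is printed first instead of being hidden,
    so unrelated errors (e.g. a broken ptxas) remain diagnosable.
    """
    fastest_epoch = None
    try:
        for epoch in range(n_epochs):
            duration = train_one_epoch(epoch)
            if fastest_epoch is None or duration < fastest_epoch:
                fastest_epoch = duration
    except Exception:
        # FIX: log the real error instead of silently assuming OOM.
        traceback.print_exc()
        fastest_epoch = 'Not enough VRAM!'
    return fastest_epoch
```

With this pattern, the benchmark still reports the same sentinel value on failure, but the log now contains the underlying exception, so a user can tell a genuine VRAM shortage apart from an unrelated crash.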