FIX: print logs when training ends early due to an exception when training with nnUNetTrainerBenchmark_5epochs and nnUNetTrainerBenchmark_5epochs_noDataLoading #2926

Open
Leirbag-gabrieL wants to merge 1 commit into MIC-DKFZ:master from Leirbag-gabrieL:master

Conversation

Leirbag-gabrieL commented Oct 24, 2025

Hi,

This is my first time using nnUNet, and I ran into unexpected behavior while doing a dry run of my code.
Training ended early when using the benchmarking trainers and reported fastest_epoch = 'Not enough VRAM!' :/

In reality, something else was going wrong (I have a 40 GB GPU, yet an 8 GB configuration supposedly didn't fit?).
The lack of error logs cost me some time because I didn't know what was actually wrong.
The logging implementation was hiding this error:

backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

For now, I did not fix my issue, but now it is correctly reported!
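The gist of the change is to print the caught exception instead of mapping every failure to "Not enough VRAM!". A minimal sketch of the pattern (function and variable names here are illustrative, not nnUNet's actual trainer code):

```python
import traceback


def run_benchmark(train_step):
    """Run one benchmark step; report OOM separately from other errors.

    Illustrative only: the real nnUNet trainers are structured differently.
    """
    try:
        return train_step()
    except RuntimeError as e:
        # Print the full traceback so the real cause (e.g. a Triton /
        # torch.compile failure) is visible instead of being swallowed.
        traceback.print_exc()
        if "out of memory" in str(e):
            return "Not enough VRAM!"
        raise  # anything that is not an OOM should surface, not be masked
```

With this, an inductor/Triton compile failure re-raises with its traceback printed, while a genuine CUDA OOM is still reported as before.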

@FabianIsensee FabianIsensee self-assigned this Oct 24, 2025
Leirbag-gabrieL (Author)

Found the solution to my issue here: pytorch/pytorch#119054

Apparently something is wrong with the ptxas bundled with Triton by default. Using my own, by setting the environment variable
TRITON_PTXAS_PATH="/usr/local/cuda-11.6/bin/ptxas", fixed my bug :)
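For anyone hitting the same error, the workaround is a single environment variable (the CUDA path below matches my setup; adjust the version and path to your installation):

```shell
# Point Triton at the system CUDA toolkit's ptxas instead of its bundled copy.
export TRITON_PTXAS_PATH="/usr/local/cuda-11.6/bin/ptxas"
```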
