FIX: print logs when training ends early due to an exception when training with nnUNetTrainerBenchmark_5epochs and nnUNetTrainerBenchmark_5epochs_noDataLoading #2926

Open
Leirbag-gabrieL wants to merge 1 commit into MIC-DKFZ:master from Leirbag-gabrieL:master

Conversation

Leirbag-gabrieL commented Oct 24, 2025

Hi,

This is my first time using nnUNet, and I ran into unexpected behavior while doing a dry run of my code.
Training ended early when using the benchmarking trainers and reported fastest_epoch = 'Not enough VRAM!' :/

In reality, something else was going wrong (I have a 40 GB GPU, yet an 8 GB configuration supposedly didn't fit?).
The lack of error logs cost me some time because I didn't know what was actually wrong.
The logging implementation was hiding this error:

backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

For now, I did not fix my issue, but now it is correctly reported!
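The gist of the change is to print the caught exception instead of mapping every failure to "Not enough VRAM!". A minimal sketch of the pattern (function and variable names here are illustrative, not nnUNet's actual trainer code):

```python
import traceback


def run_benchmark(train_step):
    """Run one benchmark step; report OOM separately from other errors.

    Illustrative only: the real nnUNet trainers are structured differently.
    """
    try:
        return train_step()
    except RuntimeError as e:
        # Print the full traceback so the real cause (e.g. a Triton /
        # torch.compile failure) is visible instead of being swallowed.
        traceback.print_exc()
        if "out of memory" in str(e):
            return "Not enough VRAM!"
        raise  # anything that is not an OOM should surface, not be masked
```

With this, an inductor/Triton compile failure re-raises with its traceback printed, while a genuine CUDA OOM is still reported as before.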

@FabianIsensee FabianIsensee self-assigned this Oct 24, 2025
Leirbag-gabrieL (Author)

Found the solution to my issue here: pytorch/pytorch#119054

Apparently something is wrong with the ptxas bundled with Triton by default. Using my own, by setting the environment variable
TRITON_PTXAS_PATH="/usr/local/cuda-11.6/bin/ptxas", fixed my bug :)
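For anyone hitting the same error, the workaround is a single environment variable (the CUDA path below matches my setup; adjust the version and path to your installation):

```shell
# Point Triton at the system CUDA toolkit's ptxas instead of its bundled copy.
export TRITON_PTXAS_PATH="/usr/local/cuda-11.6/bin/ptxas"
```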
