Add env TORCH_AMD_CUDNN_ENABLED (third try) #10678
Conversation
Related to this: the cuDNN issue may be close to being solved, at least in torch nightly builds, once they are compiled against cuDNN 9.15. Following some discussions in the PyTorch repository, I concluded that, beginning with stable 2.9.1, there may even be a way to use a separately installed cuDNN 9.15 instead of the one bundled inside torch. Exactly how that will work is above my level of knowledge, so I opened an issue at the PyTorch repository to ask for clarification. I too would like to see this solved as soon as possible, so I am somewhat anxious.
Why make it an environment variable when you can just make it a launch arg like --force-cudnn-enabled? You'd add the flag definition in cli_args.py; model_management.py already imports the args.
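A minimal sketch of what that flag definition might look like. The flag name comes from the suggestion above, but the parser variable and help text are assumptions; ComfyUI's actual cli_args.py may be structured differently:

```python
import argparse

parser = argparse.ArgumentParser()

# Hypothetical flag, off by default so the PR's AMD behaviour stays the default.
parser.add_argument(
    "--force-cudnn-enabled",
    action="store_true",
    help="Keep torch.backends.cudnn enabled even on AMD GPUs where "
         "ComfyUI would otherwise disable it.",
)

args = parser.parse_args(["--force-cudnn-enabled"])
print(args.force_cudnn_enabled)  # → True
```

Since model_management.py already imports the parsed args, the guard there could simply check `args.force_cudnn_enabled` before disabling cuDNN.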
Update: today I managed to run ComfyUI with the latest cuDNN, version 9.16. I got a performance leap with one of my usual WAN 2.2 I2V workflows, from about 45-50 s/it to about 35 s/it. For details on what I did, see the link to the PyTorch repository in my previous post here.
45-50 it/s down to 35 it/s seems like a significant slowdown to me, not a speedup. But maybe my math is just wrong and works differently these days.
@VladanZ Any idea if there could be such a workaround for Linux?
If you look at the PyTorch issue linked above (and at the torch 2.9.1 release notes, https://github.com/pytorch/pytorch/releases), beginning with PyTorch 2.9.1 the "official" workaround was intended to be simply installing the latest with pip.

Edit: note that if you manage to make PyTorch use the latest cuDNN, there is no longer any need to edit ComfyUI files on the current git version; it will not use the conv3d workaround on cuDNN >= 9.15.
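That cutoff amounts to a simple version gate. A minimal sketch, assuming the integer encoding that torch.backends.cudnn.version() uses for cuDNN 9.x (MAJOR * 10000 + MINOR * 100 + PATCH, so 9.15.0 reports as 91500); the function name is hypothetical:

```python
def needs_conv3d_workaround(cudnn_version: int) -> bool:
    """Return True if the conv3d workaround should stay active.

    `cudnn_version` is the integer reported by
    torch.backends.cudnn.version(); for cuDNN 9.x this is assumed
    to be MAJOR * 10000 + MINOR * 100 + PATCH (9.15.0 -> 91500).
    """
    return cudnn_version < 91500

print(needs_conv3d_workaround(91600))  # cuDNN 9.16 → False, no workaround
```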
This is about an AMD workaround. I don't think NVIDIA cuDNN or CUDA library versions are relevant here. The root cause could be a regression in MIOpen in the latest ROCm releases. I have confirmed at least some operations have regressed since 6.4, though I don't think this particular operation is what motivated disabling cudnn/miopen.
To offset the substantial effects of #10302, this PR provides (and informs the user of) an environment variable that can be set to nullify the unilateral decision made in #10302 to disable cuDNN for all AMD users.
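A sketch of how such an override might look. The variable name follows this PR's title, but the guard function and its placement in model_management.py are assumptions, not the PR's actual diff:

```python
import os

def cudnn_should_be_enabled(is_amd: bool) -> bool:
    """Decide whether to leave torch.backends.cudnn enabled.

    The default preserves the behaviour introduced by #10302
    (cuDNN disabled on AMD), but the TORCH_AMD_CUDNN_ENABLED
    environment variable lets users opt back in.
    """
    if not is_amd:
        return True
    # Treat "1"/"true"/"yes" (any case) as opting back in.
    value = os.environ.get("TORCH_AMD_CUDNN_ENABLED", "0").strip().lower()
    return value in ("1", "true", "yes")

# Usage at startup (sketch):
# torch.backends.cudnn.enabled = cudnn_should_be_enabled(is_amd_gpu)
```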
@comfyanonymous
You keep insisting that because cudnn=False runs faster for you, it must therefore be forced on everyone. That is not engineering. That is theology.
Let us review what you have done. Your pull request simply hard-codes torch.backends.cudnn.enabled = False for all RDNA3 and RDNA4 users. You wrote that you "have no idea why it helps but it does" on your system. That may be true for your test box, your driver, your kernel. But issue #10447 shows another user whose performance collapsed the moment cudnn was disabled. Issue #10460 shows the same pattern. For them, your patch breaks what once worked. That alone should end this argument: if a change helps some and harms others, the correct path is configurability, not decree.
Then you said "nothing else in Comfy uses env vars to enable or disable stuff." False. Comfy already reads them: COMFYUI_DIR, proxies, path expansions, HTTP settings. Users have repeatedly asked for .env support and config overrides. Pretending this tool never touches environment variables is historical revisionism, not justification. The absence of a precedent is not a reason to block a useful one.
ComfyUI runs on wildly different hardware and software combinations: Linux, Windows, ROCm 6.x, Torch 2.x, FlashAttention, tuned kernels, patched builds. The very nature of this ecosystem demands flexibility. A developer who locks a single behavior across such diversity is courting regression. Hardware changes. Drivers update. A fix today becomes a bottleneck tomorrow. Your forced flag will age like milk.
The purpose of an env var is precisely this: to give users an escape hatch when automatic detection fails or when blanket assumptions crumble. A flag such as COMFYUI_CUDNN_ENABLED=1 or 0 would let everyone test, measure, and choose without touching the source. It adds no maintenance cost. It adds resilience. It adds honesty.
If you truly believe in your optimization, you can keep it as the default, as this PR allows. It simply adds the required functionality and a message to support it:
"cudnn disabled for AMD; set COMFYUI_CUDNN_ENABLED=1 to re-enable."
That informs without coercing. That is how grown-up software behaves.
Right now, your stance is that your environment defines truth. It does not. It defines your truth. And until you allow others to define theirs, what you are enforcing is not a performance improvement, but a constraint masquerading as genius.