Conversation
|
Oh, AMD support has entered the chat 🚀 |
|
Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built in the console. So we can get past these warnings:
I got curious and checked what Dependencies reports. Out of the three
My custom-built
|
Now that it loads, I'm curious whether it actually works as intended or just errors out.
|
I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows. Summary:
Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs. |
|
If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us. |
Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have |
|
@asagi4 Just wanted to check in - is there any update or further progress on this PR? |
Force-pushed from eb2e747 to e95bb5c
|
@tvukovic-amd Well I can't do much beyond run hipify and make it compile. I don't know enough about ROCM to debug any issues. I rebased against master to get it to compile again, but it's untested. |
|
With the latest master it seems to be completely broken. All VRAM allocations fail with |
|
Hold up a minute. After @asagi4 confirmed that the latest updates break I'm sure
|
After further benchmarking, some workloads still trigger GPU hangs, while others run fine. Previously, neither of them ran successfully. It seems that the new |
|
@0xDELUXA you mean you can run hipify without changes to master? How did you manage that? |
Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat |
|
Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all |
ROCm: |
|
I managed to locally fix things so that aimdo works for me again. Apparently vrambuf_create somehow works on CUDA without aligning to the chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows. |
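The suspected alignment fix can be sketched roughly like this (the function name and the 2 MiB chunk size are placeholders for illustration, not aimdo's actual values):

```python
# Sketch of the suspected fix: round every VRAM buffer size up to the
# mapping chunk size before reserving/mapping it. HIP appears to reject
# unaligned ranges ("invalid argument") where CUDA happens to tolerate them.
# CHUNK_SIZE here is an assumed value; aimdo's real chunk size may differ.
CHUNK_SIZE = 2 * 1024 * 1024  # assumed 2 MiB granularity

def align_up(size: int, chunk: int = CHUNK_SIZE) -> int:
    """Round size up to the next multiple of chunk."""
    return ((size + chunk - 1) // chunk) * chunk
```

With this, e.g. a 3 MiB request would be padded to 4 MiB before the mapping call, so the mapped range always starts and ends on a chunk boundary.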
|
I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar. |
Force-pushed from e95bb5c to 9c4c215
|
With these changes, things work for me again on Linux; at least one workflow ran successfully. Previously, pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size. |
|
Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions, and hipify/clang doesn't like that. For example, they're declared to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresults... I'll see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour. @rattus128 what do you think? |
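The shape of the mismatch can be illustrated with mock enums (the function names and enum values below are illustrative stand-ins, not the real CUDA ABI or aimdo's actual code):

```python
# Mock illustration: three different "types" end up describing one return
# value (int in the header, cudaError_t in the prototype, CUresult in the
# implementation). Real CUresult and cudaError_t enumerations differ beyond
# the success case, so an implicit integer cast only happens to work.
from enum import IntEnum

class CUresult(IntEnum):         # driver-API result (what the implementation returns)
    CUDA_SUCCESS = 0
    CUDA_ERROR_INVALID_VALUE = 1

class cudaError_t(IntEnum):      # runtime-API result (what the prototype declares)
    cudaSuccess = 0
    cudaErrorInvalidValue = 1

def vrambuf_map_impl() -> CUresult:      # hypothetical implementation name
    return CUresult.CUDA_SUCCESS

def vrambuf_map() -> cudaError_t:
    # The proposed fix: convert at the boundary explicitly, instead of
    # relying on an implicit int cast that hipify/clang rejects.
    return cudaError_t(int(vrambuf_map_impl()))
```

A by-value conversion like this is only safe for codes the two enums happen to share; a robust fix would map each CUresult to its runtime-API counterpart, which is exactly why leaving the cast implicit is fragile.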
Force-pushed from 9c4c215 to 51d4d2f
|
Now it compiles, loads and appears to work again. Haven't stress-tested though. |
|
Have you run any workload that exceeds VRAM and would OOM without
|
Does the original example.py work on your system?
|
Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows. |
|
@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know what exactly it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument". I wonder if it's because the pointer it's working with is
I can't help with Windows at all, unfortunately. It's been a long time since I last used it for anything. |
I see. I don’t really think the
Not a problem - the build script from my fork, on Windows, as you said, "at least it compiles and runs, so it's a start." |
|
I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py pls? My output on Windows is this. |
|
@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere. |
|
I needed to modify it as well, and this one works for me. Commented out |
|
I fixed the script and it gives me this: |
|
I see. I've also added some debug output, but shouldn't the script also print |
|
It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically. |
Force-pushed from e367b71 to c52958f
I wouldn't feel discouraged with what we have so far here. What you have working so far is way beyond my expectations for this stage of the effort. We will get there.
Thanks! I think @asagi4 and I are motivated to keep pushing forward and see this through with your support. |
|
@asagi4 @0xDELUXA thanks for sharing your findings! Given that the findings are scattered across multiple comments, it would be great to summarize them and provide a single minimal reproducer with/without |
I don’t have issues with
My biggest concern is |
|
Is there a minimal reproducer for that triton error? |
Both work via
Here's my node for FA-2, which gives the same error but with a more detailed traceback.
This node might be helpful as well for monitoring
Oh, and we also need to manually edit line
When started simply with |
Here's a gist of the reproducer for the memory problem I'm seeing. Just compile and run without parameters to test the behaviour with hipMemAddressFree, and then run with any parameter to test the behaviour without. I currently have ROCm 7.11 preview installed from the RPM packages on RHEL 9. |
|
Yeah, it would be great to see whether these are only local issues or OS-wide ones. |
I can't really say anything about Linux ROCm, so this might be a silly suggestion, but could you try installing the very latest wheels from TheRock? Theoretically, these should be more up-to-date than the ROCm 7.11 preview packages. At least, that’s the situation on Windows right now. |
I just tried, it doesn't make a difference. |
I see. Nvm then |
|
@asagi4 are you on Discord? If so, you should join the |
|
Hey, tested this out on Ubuntu with a 7900 XTX, ROCm 7.2, with Anima and LTX2. It appears to work with the following config.
Start parameters: TRITON_CACHE_AUTOTUNING=1 MIOPEN_FIND_MODE=2 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" TORCH_BLAS_PREFER_HIPBLASLT=1 python main.py --disable-api-nodes --reserve-vram 1 --disable-pinned-memory --use-quad-cross-attention
Torch compile is likely not fully compatible; it causes a slowdown. (I believe there are issues with dynamic_vram + compile on nvidia as well.)
Observed issues:
Also tried LTX2 GGUF, model and clip, which doesn't work, but I think that's intended.
Requested to load LTXAVTEModel_
File "/home/adminl/anaconda3/envs/C_312_rs/lib/python3.12/site-packages/comfy_aimdo/model_vbar.py", line 78, in fault
|
@sleppyrobot I haven't tested with ROCm 7.2, but I have an initialization script like this before I run ComfyUI:
Maybe that'll give you some clues. Flash attention at least appears to work (with --use-flash-attention). Models that don't fit in your VRAM won't work on Linux until we figure out how to get aimdo to actually free VRAM chunks.
@sleppyrobot can you try https://gist.github.com/asagi4/bd6a1fb2a37601a19271749772393534? |
I think we should use the latest available ROCm version, as some issues might have already been resolved. How come Linux ROCm has an issue that Windows doesn’t? It’s like, I don’t know, 10 years older XD |
|
I don't really want to change my pytorch-rocm version for stability reasons. Given that the black output images work for you, my previous experience makes me think it's very likely a pytorch version issue on my end. At least you're aware of the ongoing issue that occurs with "models that don't fit"; this is really the main benefit of aimdo.
Contribution Agreement
This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP code pretty easily with a few fixes, and it actually allows you to run aimdo on ROCm.
You might have to make sure your Python venv is using your system ROCm libraries for this to work.
It does not work perfectly (I'm still getting pytorch OOMs when it should be freeing memory), but workflows can run and produce good output.
I am not able to test, but the HIP code should be compilable as is on NVIDIA platforms too. If you run build-rocm on an NVIDIA platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be basically identical to the CUDA implementation.