
[DRAFT] Barebones ROCM support #2

Open
asagi4 wants to merge 15 commits into Comfy-Org:master from asagi4:hack/rocm-support

Conversation

@asagi4

@asagi4 asagi4 commented Feb 5, 2026

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP code pretty easily with a few fixes, and it actually allows you to run aimdo on ROCM.

You might have to make sure your Python venv is using your system ROCM libraries for this to work.

It does not work perfectly (I'm still getting pytorch OOMs when it should be freeing memory) but workflows can run and produce good output.

I am not able to test, but the HIP code should be compilable as is on nvidia platforms too. If you run build-rocm on an nvidia platform, hipcc and hipconfig should set it up to link against cuda instead of ROCM and the result should be basically identical to the CUDA implementation.

@0xDELUXA

0xDELUXA commented Feb 6, 2026

Oh, AMD support has entered the chat 🚀

@0xDELUXA

0xDELUXA commented Feb 7, 2026

Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built aimdo.dll locally, installed this custom wheel, and got:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
DynamicVRAM support detected and enabled

in the console.

So we can get past these warnings:
No working comfy-aimdo install detected. DynamicVRAM support disabled. Falling back to legacy ModelPatcher. VRAM estimates may be unreliable especially on Windows
NOTE: comfy-aimdo is currently only support for Nvidia GPUs

pip install comfy-aimdo automatically installs the Windows (Nvidia-only) package. It does include an aimdo.dll, but AMD gets the following error:

comfy-aimdo failed to load: E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll: Could not find module 'E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll' (or one of its dependencies). Try using the full path with constructor syntax.

I got curious and checked what Dependencies reports. Out of the three .dlls it requires, we AMD users are missing nvcuda.dll.

My custom-built aimdo.dll, which actually loads on AMD, replaces the nvcuda.dll dependency with amdhip6_7.dll.

Now that it loads, I'm curious whether it actually works as intended or just errors out.


I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows.

Summary:
VMM allocation APIs report success, but the GPU cannot reliably access the allocated memory.

  1. All hipMemCreate, hipMemMap, and hipMemSetAccess calls return success.
  2. hipMemsetD8 also returns success (the async operation is queued).
  3. hipDeviceSynchronize completes without errors.
  4. PyTorch kernel hangs when attempting to use the memory.

Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs.
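To make this easier to verify against the driver, here is a minimal standalone reproducer for the sequence above. This is a sketch only (untested as written; it assumes a ROCm toolchain with hipcc, and cleanup via hipMemUnmap/hipMemRelease/hipMemAddressFree is omitted for brevity):

```cpp
// repro_vmm.hip -- build with: hipcc repro_vmm.hip -o repro_vmm
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { hipError_t e_ = (call); \
    if (e_ != hipSuccess) { \
        fprintf(stderr, "%s -> %s\n", #call, hipGetErrorString(e_)); \
        exit(1); } } while (0)

__global__ void touch(unsigned char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;   // step 4: a plain kernel access to the VMM range
}

int main(void) {
    hipMemAllocationProp prop = {};
    prop.type = hipMemAllocationTypePinned;
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = 0;

    size_t gran = 0;
    CHECK(hipMemGetAllocationGranularity(&gran, &prop,
          hipMemAllocationGranularityMinimum));
    size_t size = gran;                                  // one granule suffices

    hipMemGenericAllocationHandle_t handle;
    void *ptr = NULL;
    CHECK(hipMemCreate(&handle, size, &prop, 0));        // step 1: these all
    CHECK(hipMemAddressReserve(&ptr, size, 0, NULL, 0)); //         report success
    CHECK(hipMemMap(ptr, size, 0, handle, 0));

    hipMemAccessDesc access = {};
    access.location = prop.location;
    access.flags = hipMemAccessFlagsProtReadWrite;
    CHECK(hipMemSetAccess(ptr, size, &access, 1));

    CHECK(hipMemsetD8((hipDeviceptr_t)ptr, 0xab, size)); // step 2: queued OK
    CHECK(hipDeviceSynchronize());                       // step 3: no error

    hipLaunchKernelGGL(touch, dim3((unsigned)((size + 255) / 256)), dim3(256),
                       0, 0, (unsigned char *)ptr, size);
    CHECK(hipDeviceSynchronize()); // on Windows/WDDM, the hang is observed here
    printf("kernel access OK\n");
    return 0;
}
```

On a working driver this should print "kernel access OK"; the reported behavior is that the kernel access (or the synchronize after it) hangs on Windows while every preceding call succeeds.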

@tvukovic-amd

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

@0xDELUXA

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have comfy-aimdo running on ROCm Windows too. Theoretically, what is preventing it from working? I've tried many things, but it seems there’s something I haven’t been able to figure out.

@tvukovic-amd

@asagi4 Just wanted to check in - is there any update or further progress on this PR?

@asagi4
Author

asagi4 commented Feb 19, 2026

@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCM to debug any issues.

I rebased against master to get it to compile again, but it's untested.

@asagi4
Author

asagi4 commented Feb 19, 2026

With latest master it seems to be completely broken: all VRAM allocations fail with aimdo: hip_src/vrambuf.c:56:ERROR:VRAM Allocation failed (non OOM) and torch throws an OOM exception immediately.

@0xDELUXA

0xDELUXA commented Feb 20, 2026

Hold up a minute.

After @asagi4 confirmed that the latest updates break comfy-aimdo on AMD (Linux), I decided to try building the version checked out from the master branch. I have a very long, workaround-upon-workaround (mainly for hipify, else it just doesn't work) build script that I use on Windows. And somehow it magically avoids the GPU hang issue I was getting when comfy-aimdo was enabled.

I'm sure comfy-aimdo is actually in use here, based on the console output (filtered):

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB) DynamicVRAM support detected and enabled
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 0 patches attached.
Model Initializing ...
Model Initialization complete!
Prompt executed in X seconds


After further benchmarking, some workloads still trigger GPU hangs, while others run fine. Previously, neither of them ran successfully. It seems that the new Model Initializing... phase is quite heavy on AMD, which is where it occasionally hangs.

@asagi4
Author

asagi4 commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

@0xDELUXA

0xDELUXA commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat

@asagi4
Author

asagi4 commented Feb 20, 2026

Which version of ROCM do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

@0xDELUXA

0xDELUXA commented Feb 20, 2026

Which version of ROCM do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

ROCm: 7.12.0a20260218
PyTorch: 2.12.0a0+rocm7.12.0a20260218
OS: Windows 11

@asagi4
Author

asagi4 commented Feb 20, 2026

I managed to locally fix things so that aimdo works for me again.
I think vrambuf_create has some alignment issue that appears with HIP.
Diff for the hipified source here:

diff -ru hip_src/vrambuf.c hip_src_fixed2/vrambuf.c
--- hip_src/vrambuf.c   2026-02-20 20:34:56.698464966 +0200
+++ hip_src_fixed2/vrambuf.c    2026-02-20 20:32:52.685112770 +0200
@@ -7,8 +7,16 @@
 SHARED_EXPORT
 void *vrambuf_create(int device, size_t max_size) {
     VramBuffer *buf;
+    if ((max_size / VRAM_CHUNK_SIZE) * VRAM_CHUNK_SIZE < max_size) {
+        log(ERROR, "??? alignment %zu\n", max_size);
+        max_size = ((max_size / VRAM_CHUNK_SIZE) + 1) * VRAM_CHUNK_SIZE;
+        log(ERROR, "??? fixed alignment %zu\n", max_size);
+    }
 
-    buf = (VramBuffer *)calloc(1, sizeof(*buf) + sizeof(hipMemGenericAllocationHandle_t) * max_size / VRAM_CHUNK_SIZE);
+    size_t size = 0;
+    size = sizeof(*buf) + (sizeof(hipMemGenericAllocationHandle_t) * (max_size / VRAM_CHUNK_SIZE));
+    log(ERROR, "vrambuf_create calloc %zu\n", size);
+    buf = (VramBuffer *)calloc(1, size);
     if (!buf) {
         return NULL;
     }
@@ -53,7 +61,7 @@
         }
         if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != hipSuccess) {
             if (err != hipErrorOutOfMemory) {
-                log(ERROR, "VRAM Allocation failed (non OOM): %d\n", err);
+                log(ERROR, "VRAM Allocation failed (non OOM): %s\n", hipGetErrorString(err));
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");

Apparently vrambuf_create somehow works on CUDA without aligning to the chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows.

@0xDELUXA

0xDELUXA commented Feb 20, 2026

I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar.

@asagi4
Author

asagi4 commented Feb 20, 2026

with these changes things work for me again on Linux. Or at least one workflow ran successfully. Previously pretty much all allocations failed with "invalid argument" when mapping new vram allocations, presumably because the vram buffers weren't aligned to the defined chunk size.

@asagi4
Author

asagi4 commented Feb 22, 2026

Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions they override, and hipify/clang doesn't like that.

For example, they're declared to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresult values...

I'll try to see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since relying on implicit conversions between these integer types isn't very good behaviour.

@rattus128 what do you think?

@asagi4
Author

asagi4 commented Feb 22, 2026

Now it compiles, loads and appears to work again.

Haven't stress-tested though.

@0xDELUXA

0xDELUXA commented Feb 22, 2026

Have you run any workload that exceeds VRAM and would OOM without comfy-aimdo?

Does the original example.py work on your system?

Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows.

@asagi4
Author

asagi4 commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know what exactly it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder if since the pointer it's working with is vrambuf->base_addr+vrambuf->allocated, that it gives an invalid pointer with some allocation patterns.

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

@0xDELUXA

0xDELUXA commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know what exactly it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder if since the pointer it's working with is vrambuf->base_addr+vrambuf->allocated, that it gives an invalid pointer with some allocation patterns.

I see. I don’t really think the comfy-aimdo dev has much insight into the AMD side, so it’s just us. I assume there will still be things that work reliably on Nvidia but not as well on AMD.

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

Not a problem - the build script from my fork, on Windows, as you said, "at least it compiles and runs, so it's a start."

@0xDELUXA

0xDELUXA commented Feb 23, 2026

I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py pls? My output on Windows is this.

@asagi4
Copy link
Author

asagi4 commented Feb 23, 2026

@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere.

@0xDELUXA

0xDELUXA commented Feb 23, 2026

I needed to modify it as well, and this one works for me. Commented out vbars_analyze, etc.

@asagi4
Author

asagi4 commented Feb 23, 2026

I fixed the script and it gives me this:

Init complete
aimdo: hip_src/control.c:67:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 7900 XTX (VRAM: 24560 MB)
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=131072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xabacef0
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xabacef0
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[First Load] Populated weight at offset: 400.0M
[First Load] Populated weight at offset: 800.0M
[First Load] Populated weight at offset: 1200.0M
[First Load] Populated weight at offset: 1600.0M
[First Load] Populated weight at offset: 2000.0M
[First Load] Populated weight at offset: 2400.0M
[First Load] Populated weight at offset: 2800.0M
[First Load] Populated weight at offset: 3200.0M
[First Load] Populated weight at offset: 3600.0M
[First Load] Populated weight at offset: 4000.0M
[First Load] Populated weight at offset: 4400.0M
[First Load] Populated weight at offset: 4800.0M
[First Load] Populated weight at offset: 5200.0M
[First Load] Populated weight at offset: 5600.0M
[First Load] Populated weight at offset: 6000.0M
[First Load] Populated weight at offset: 6400.0M
[First Load] Populated weight at offset: 6800.0M
[First Load] Populated weight at offset: 7200.0M
[First Load] Populated weight at offset: 7600.0M
[First Load] Populated weight at offset: 8000.0M
[First Load] Populated weight at offset: 8400.0M
[First Load] Populated weight at offset: 8800.0M
[First Load] Populated weight at offset: 9200.0M
[First Load] Populated weight at offset: 9600.0M
[First Load] Populated weight at offset: 10000.0M
[First Load] Populated weight at offset: 10400.0M
[First Load] Populated weight at offset: 10800.0M
[First Load] Populated weight at offset: 11200.0M
[First Load] Populated weight at offset: 11600.0M
[First Load] Populated weight at offset: 12000.0M
[First Load] Populated weight at offset: 12400.0M
[First Load] Populated weight at offset: 12800.0M
[First Load] Populated weight at offset: 13200.0M
[First Load] Populated weight at offset: 13600.0M
[First Load] Populated weight at offset: 14000.0M
[First Load] Populated weight at offset: 14400.0M
[First Load] Populated weight at offset: 14800.0M
[First Load] Populated weight at offset: 15200.0M
[First Load] Populated weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    16400 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     7820 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 16000 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=3072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xb135160
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xb135160
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!
[First Load] Populated weight at offset: 603.2421875M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17824 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6396 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 3544] Ptr: 0x7fa5bb000000 | Size:  622592k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     608 MB
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again

aimdo: hip_src/model-vbar.c:234:DEBUG:vbar_prioritize vbar=0xabacef0
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17616 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6604 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Some of the ERROR logs from aimdo aren't actually errors; they're just things I added that I wanted to log without enabling debug logging.

@0xDELUXA

0xDELUXA commented Feb 23, 2026

I see. I've also added some debug output. But shouldn't the script also print [Offloaded] alongside [First Load] and [No Load Needed], given the "Some weights will be offloaded" and "The rest will be offloaded again" comments rattus128 included in the script?
Based on the outputs, this is the main difference between comfy-aimdo on AMD Linux and Windows at present.
Which AMD GPU do you have, btw? Mine has 16 GB of VRAM; if yours has more, that could explain the offload difference.

@asagi4
Author

asagi4 commented Feb 23, 2026

It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically.

@asagi4 asagi4 force-pushed the hack/rocm-support branch from e367b71 to c52958f on March 2, 2026 09:40
@rattus128
Collaborator

It would be great to have more AMD users testing on both Windows and Linux. I’ve brought up comfy-aimdo support and this PR in the AMD Dev Community Discord, but there hasn’t been much engagement so far. I believe that the more Nvidia-only features AMD can port, the stronger the overall ecosystem becomes. Unfortunately, not everyone seems to share this perspective.

I wouldn't feel discouraged by what we have so far here. What you have working so far is way beyond my expectations for this stage of the effort. We will get there.

@0xDELUXA

0xDELUXA commented Mar 2, 2026

I wouldn't feel discouraged by what we have so far here. What you have working so far is way beyond my expectations for this stage of the effort. We will get there.

Thanks! I think @asagi4 and I are motivated to keep pushing forward and see this through with your support.

@jammm

jammm commented Mar 2, 2026

@asagi4 @0xDELUXA thanks for sharing your findings! Given that the findings are scattered across multiple comments, it would be great to summarize them and provide a single minimal reproducer with/without hipMemAddressFree and the expected result on Windows/Linux, so it'll be easier for the HIP runtime teams to work on it and fix.

@0xDELUXA

0xDELUXA commented Mar 2, 2026

@asagi4 @0xDELUXA thanks for sharing your findings! It would be great to summarize them and provide a minimal reproducer with/without hipMemAddressFree and the expected result on Windows/Linux, so it'll be easier for the HIP runtime teams to work on it and fix.

I don’t have issues with hipMemAddressFree on Windows.

My biggest concern is comfy-aimdo breaking triton-windows with:
ValueError: Pointer argument (at 0) cannot be accessed from Triton (CPU tensor?).
This prevents us from using SageAttention or FlashAttention with comfy-aimdo. Only SDPA works, and its performance is much worse under heavy workloads compared to when Sage and Flash are available.
This doesn’t happen with the upstream Triton on Linux, as @asagi4 hasn’t mentioned this issue.
I assume triton-windows doesn’t have this same issue on Nvidia either.

@jammm

jammm commented Mar 2, 2026

Is there a minimal reproducer for that triton error?

@0xDELUXA

0xDELUXA commented Mar 2, 2026

Is there a minimal reproducer for that triton error?

# Open PowerShell

# Cd to ComfyUI, activate venv
venv\Scripts\activate

# Clone repo, checkout PR 2, build locally, and install
git clone https://github.com/Comfy-Org/comfy-aimdo.git
cd comfy-aimdo
git fetch origin pull/2/head:pr-2
git checkout pr-2
./build-rocm-local_win.ps1
pip install .

# Download a `post26` triton-windows wheel and install it. For ex. from this run:
# https://github.com/triton-lang/triton-windows/actions/runs/22558044670
# `post25` and earlier versions break SageAttention
pip install triton_windows-3.6.0+gitdc332243.post26-cp312-cp312-win_amd64.whl

# Install SageAttention V1
pip install "sageattention<2"

# Cd back to ComfyUI root, and start it w/ Sage
cd ..
python main.py --use-sage-attention

# Or build and install FlashAttention-2
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/
pip install einops packaging psutil ninja wheel setuptools
$env:FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
python setup.py install

# Cd back to ComfyUI root, and start it w/ Flash
cd ..
python main.py --use-flash-attention

Both work via triton-windows, so both give the following error:
ValueError: Pointer argument (at 0) cannot be accessed from Triton (CPU tensor?).

Here's my node for FA-2, which gives the same error but with more detailed traceback.

This node might be helpful as well for monitoring aimdo. Or not.

Oh, and we also need to manually edit line 195 of main.py (because of this recent change) from:
if enables_dynamic_vram() and comfy.model_management.is_nvidia():
to, for example:
if enables_dynamic_vram() and (comfy.model_management.is_nvidia() or comfy.model_management.is_amd()):
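That one-line gate boils down to a vendor allow-list. A hypothetical stand-in, with the helper name and vendor strings invented purely for illustration:

```python
# Hypothetical stand-in for the comfy.model_management vendor checks;
# upstream currently gates on Nvidia only, and the edit above adds AMD.
def dynamic_vram_allowed(vendor: str) -> bool:
    return vendor in ("nvidia", "amd")

print(dynamic_vram_allowed("amd"))    # → True
print(dynamic_vram_allowed("intel"))  # → False
```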

When started simply with python main.py on gfx1200, the startup console prints Using PyTorch attention, and it doesn’t error out with comfy-aimdo enabled, since SDPA doesn’t require triton-windows at all.

@asagi4
Author

asagi4 commented Mar 2, 2026

@asagi4 @0xDELUXA thanks for sharing your findings! Given that the findings are scattered across multiple comments, it would be great to summarize them and provide a single minimal reproducer with/without hipMemAddressFree and the expected result on Windows/Linux, so it'll be easier for the HIP runtime teams to work on it and fix.

Here's a gist of the reproducer for the memory problem I'm seeing. Just compile and run without parameters to test the behaviour with hipMemAddressFree, and then run with any parameter to test the behaviour without.
https://gist.github.com/asagi4/bd6a1fb2a37601a19271749772393534
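For readers who don't open the gist: the pattern it exercises looks roughly like the following. This is a pseudocode-level sketch of the HIP virtual-memory-management call sequence as I understand it (exact flags, access setup, and error handling omitted), not the gist's actual code:

```c
/* Sketch only: allocate a physical chunk, map it into reserved VA space,
 * then tear it down. The question is whether skipping hipMemAddressFree
 * leaves the memory counted as in_use by the driver. */
hipMemAllocationProp prop = {0};
prop.type = hipMemAllocationTypePinned;
prop.location.type = hipMemLocationTypeDevice;
prop.location.id = 0;

hipMemGenericAllocationHandle_t handle;
hipMemCreate(&handle, chunk_size, &prop, 0);

hipDeviceptr_t va;
hipMemAddressReserve(&va, chunk_size, 0, 0, 0);
hipMemMap(va, chunk_size, 0, handle, 0);
/* ... hipMemSetAccess(), use the memory ... */

hipMemUnmap(va, chunk_size);
hipMemRelease(handle);
hipMemAddressFree(va, chunk_size);  /* in the logs below, omitting this
                                       keeps VRAM "in_use" on Linux */
```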

I currently have ROCm 7.11 preview installed from the RPM packages on RHEL 9.
Compare the last lines:

Running with hipMemAddressFree
Memory free: 24524MiB in_use: 36MiB
Alloc count: 1024
Finished allocating 1024 chunks
Memory free: 22330MiB in_use: 2230MiB
Freed half, alloc count: 512
Memory free: 23354MiB in_use: 1206MiB
Finished freeing 1024 chunks
Alloc count: 0
Memory free: 24378MiB in_use: 182MiB
Memory free: 24378MiB in_use: 182MiB
Alloc count: 1024
Finished allocating 1024 chunks
Memory free: 22330MiB in_use: 2230MiB
Freed half, alloc count: 512
Memory free: 23354MiB in_use: 1206MiB
Finished freeing 1024 chunks
Alloc count: 0
Memory free: 24378MiB in_use: 182MiB
Running without hipMemAddressFree
Memory free: 24524MiB in_use: 36MiB
Alloc count: 1024
Finished allocating 1024 chunks
Memory free: 22330MiB in_use: 2230MiB
Freed half, alloc count: 512
Memory free: 22330MiB in_use: 2230MiB
Finished freeing 1024 chunks
Alloc count: 0
Memory free: 22330MiB in_use: 2230MiB
Memory free: 22330MiB in_use: 2230MiB
Alloc count: 1024
Finished allocating 1024 chunks
Memory free: 20282MiB in_use: 4278MiB
Freed half, alloc count: 512
Memory free: 20282MiB in_use: 4278MiB
Finished freeing 1024 chunks
Alloc count: 0
Memory free: 20282MiB in_use: 4278MiB

@0xDELUXA

0xDELUXA commented Mar 2, 2026

Yeah, it would be great to see whether these are only local issues or OS-wide ones.

@0xDELUXA

0xDELUXA commented Mar 2, 2026

I currently have ROCm 7.11 preview installed from the RPM packages on RHEL 9.

I can't really say anything about Linux ROCm, so this might be a silly suggestion, but could you try installing the very latest wheels from TheRock? Theoretically, these should be more up-to-date than the ROCm 7.11 preview packages. At least, that’s the situation on Windows right now.
For example, this page doesn’t even list Windows ROCm for my gfx1200, but TheRock has been providing wheels for this GPU for about six months now.
On the other hand, the RX 7900 XTX does have Windows ROCm support, according to the same page. I have no idea why this is the case.

@asagi4
Author

asagi4 commented Mar 2, 2026

I currently have ROCm 7.11 preview installed from the RPM packages on RHEL 9.

I can't really say anything about Linux ROCm, so this might be a silly suggestion, but could you try installing the very latest wheels from TheRock?

I just tried, it doesn't make a difference.

@0xDELUXA

0xDELUXA commented Mar 2, 2026

I just tried, it doesn't make a difference.

I see. Nvm then

@0xDELUXA

0xDELUXA commented Mar 3, 2026

@asagi4 are you on Discord? If so, you should join the AMD Developer Community - there are people there who have given feedback on aimdo / Linux.

@sleppyrobot

Hey, tested this out on Ubuntu with a 7900 XTX on ROCm 7.2, with Anima and LTX2.

It appears to work with the following config.

Start parameters : TRITON_CACHE_AUTOTUNING=1 MIOPEN_FIND_MODE=2 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" TORCH_BLAS_PREFER_HIPBLASLT=1 python main.py --disable-api-nodes --reserve-vram 1 --disable-pinned-memory --use-quad-cross-attention

Torch compile is likely not fully compatible; it causes a slowdown. (I believe there are issues with dynamic_vram + compile on Nvidia as well.)

Observed issues:
Image output alternates between black and distorted images.
When using FA2, I got screen artifacts during the sampling phase.

Also tried LTX2 GGUF (model and clip), which doesn't work, but I think that's intended.
The FP8 with the core node loaders also failed to load, with a different error. Below are the top and end of the error.

Requested to load LTXAVTEModel_
Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25965MB Staged. 0 patches attached.
aimdo: src/vrambuf.c:68:INFO:VRAM Allocation failed (OOM)
aimdo: src/vrambuf.c:68:INFO:VRAM Allocation failed (OOM)
aimdo: src/model-vbar.c:315:ERROR:VRAM Allocation failed
!!! Exception during processing !!! Fault failed: 2
Traceback (most recent call last):

File "/home/adminl/anaconda3/envs/C_312_rs/lib/python3.12/site-packages/comfy_aimdo/model_vbar.py", line 78, in fault
raise RuntimeError(f"Fault failed: {res}")
RuntimeError: Fault failed: 2

@asagi4
Author

asagi4 commented Mar 3, 2026

@sleppyrobot I haven't tested with ROCm 7.2, but I run an initialization script like this before starting ComfyUI:

TORCHVER=$(uv pip show torch | grep Version | awk '{print $2}')
TRITONVER=$(uv pip show triton | grep Version | awk '{print $2}')
if [ -z "$TORCHVER" ]; then
  echo "Torch version not found?"
  exit 1
fi
if [ -z "$TRITONVER" ]; then
  TRITONVER=$(uv pip show triton-rocm | grep Version | awk '{print $2}')
  TRITONVER="rocm-${TRITONVER}"
fi
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
#export FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE
# supposedly best for gfx1100?
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
export TORCHINDUCTOR_CACHE_DIR="${HOME}/.torchinductor_cache/torch${TORCHVER}"
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
export TRITON_CACHE_DIR="${HOME}/.triton/cache_${TORCHVER}_${TRITONVER}"
mkdir -p "$TORCHINDUCTOR_CACHE_DIR"
export PYTORCH_ROCM_ARCH="gfx1100"
export TRITON_USE_ROCM="ON"
export PYTORCH_TUNABLEOP_TUNING=${TUNING:-0}
export PYTORCH_TUNABLEOP_ENABLED=${TUNING:-1}
export PYTORCH_TUNABLEOP_VERBOSE=${VERBOSE:-1}
export PYTORCH_TUNABLEOP_FILENAME="/home/sd/.config/rocm_tunables_${TORCHVER}.csv"
echo "Saving tunables to $PYTORCH_TUNABLEOP_FILENAME"
export PYTORCH_TUNABLEOP_RECORD_UNTUNED=0
export PYTORCH_TUNABLEOP_UNTUNED_FILENAME=/home/sd/.config/rocm_untuned.csv
export MIOPEN_FIND_MODE=${MIMODE:-"FAST"}
echo "Set MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
export TORCHINDUCTOR_SEARCH_AUTOTUNE_CACHE=1
export PYTORCH_MIOPEN_SUGGEST_NHWC=0
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_BLAS_PREFER_HIPBLASLT=1

export ROCM_ENABLE_FP16_EX=1
export ROCM_ENABLE_BF16_EX=1

Maybe that'll give you some clues. Flash attention at least appears to work (with --use-flash-attention).

Models that don't fit in your VRAM won't work on Linux until we figure out how to get aimdo to actually free VRAM chunks.

@0xDELUXA

0xDELUXA commented Mar 3, 2026

Models that don't fit in your VRAM won't work on Linux until we figure out how to get aimdo to actually free VRAM chunks.

@sleppyrobot can you try https://gist.github.com/asagi4/bd6a1fb2a37601a19271749772393534?

@0xDELUXA

0xDELUXA commented Mar 3, 2026

@sleppyrobot I haven't tested with ROCm 7.2

I think we should use the latest available ROCm version, as some issues might have already been resolved.

How come Linux ROCm has an issue that Windows doesn’t? It’s like, I don’t know, 10 years older XD

@sleppyrobot

I don't really want to change my pytorch-rocm version for stability reasons. Given that it works for you, my previous experience makes me think the black output images are very likely a PyTorch version issue on my end.

At least you're aware of the ongoing issue with "Models that don't fit"; that is really the main benefit of aimdo.
