GPU Benchmark! 4070 4080 4090 3080 3090 #2970
Replies: 53 comments 38 replies
-
RTX 2070 Mobile
-
Nvidia RTX 4090 FE
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00, 6.17it/s]
(Edit: used the default resolution the first time around.)
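Throughput figures like the `6.17it/s` above come from the sampler's tqdm progress bar; to compare cards, it helps to convert them to seconds for a fixed step count. A minimal sketch, using figures quoted in this thread (the function and dictionary names are my own):

```python
# Convert tqdm-style sampler throughput (iterations per second) into
# total sampling time for a fixed number of steps.

def sampling_seconds(steps: int, its_per_sec: float) -> float:
    """Wall-clock seconds to run `steps` iterations at `its_per_sec`."""
    return steps / its_per_sec

# Throughput figures quoted in this thread (20-step runs):
reports = {
    "RTX 4090 FE": 6.17,
    "WSL2 Ubuntu (unnamed GPU)": 4.03,
    "RTX 3070 OC": 2.26,
}

for gpu, its in reports.items():
    print(f"{gpu}: {sampling_seconds(20, its):.2f} s for 20 steps")
```

Note this covers only the sampling loop; model loading, the text encoder, and VAE decode are outside the progress bar.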
-
CPU vs. GPU, 1024x1024, 2 steps.
Intel 13900KS (CPU): 24.25 sec.
https://files.catbox.moe/f4wb53.mp4 (CPU.vs.GPU.test_2.mp4)
-
RTX 3090
-
Just for laughs...
-
RTX 3070 OC, drivers ~528.xx
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00, 2.26it/s]
-
My poor old GPU 😢
-
RTX 4080
-
Would love to get some 3060 / 4060 data to better complete the regression... Then I'll post proposed figures for every 3000/4000-series card.
-
RTX 4070 12GB
got prompt
-
WSL2 (6.1.x kernel): 3060 Ti
Tesla P40
No xformers: 3060 Ti
Tesla P40
-
4090 FE, Ryzen 7900X, Samsung 980 2TB M.2
got prompt
-
RTX 3070 Laptop GPU
-
Win11, WSL2 Ubuntu
100%|███████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00, 4.03it/s]
-
RTX 4070 Ti Super: 4.8 seconds
-
RTX 3060 12GB
-
Made new benchmarks on a more recent version of ComfyUI with PyTorch 2.6 (for AMD, the 2.7 nightly was used). This is the second series of benchmarks; we tried to address the shortcomings of the first round, and each task now gets a new prompt so that the text encoder is also tested. Redundant benchmark suites were removed, and we used 30 steps for generation everywhere. Full interactive results link for those who are interested:
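The "new prompt per task" point matters because ComfyUI caches node outputs between runs: if the prompt text is identical, the text encoder is skipped and its cost disappears from the measurement. A hedged sketch of one way to vary prompts deterministically (the prompt list and seeding scheme are illustrative, not the suite's actual code):

```python
import random

# Illustrative base prompts -- not the benchmark suite's actual prompts.
BASE_PROMPTS = [
    "a photo of a mountain lake at sunrise",
    "an oil painting of a city street in the rain",
    "a macro shot of a dragonfly on a leaf",
]

STYLES = ["highly detailed", "soft lighting", "shot on 35mm film"]

def prompt_for_task(task_index: int, seed: int = 1) -> str:
    """Deterministically build a distinct prompt per task, so every run
    re-executes the text encoder instead of hitting the node cache."""
    rng = random.Random(seed + task_index)  # reproducible across machines
    base = BASE_PROMPTS[task_index % len(BASE_PROMPTS)]
    return f"{base}, {rng.choice(STYLES)}"

print(prompt_for_task(0))
print(prompt_for_task(1))
```

Seeding the generator keeps the benchmark reproducible while still forcing a fresh encode per task.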
-
RTX 5070 Ti, overclocked (roughly +7%).
100%|█████████████████████████| 20/20 [00:04<00:00, 4.46it/s]
Update, 8 March 2025: Torch has been updated to version 2.7; testing on the new ComfyUI build:
-
AMD Radeon 6800 XT (2400 MHz)
Platform: Linux
got prompt
-
7900 XTX with the torch.compile node, ROCm FlashAttention, tuning, and some unpublished ComfyUI fixes (coming soon):
I didn't know that was even possible on this card.
-
7950X3D, 48GB DDR5-6200 / RTX 5090D 32GB
model weight dtype torch.float16, manual cast: None
-
Total VRAM 24090 MB, total RAM 128714 MB
[Crystools INFO] Crystools version: 1.22.1
-
This is mine...
-
Total VRAM 24576 MB, total RAM 81787 MB
Requested to load SDXLClipModel
512x512: got prompt
1024x1024: got prompt
-
My new PC has Windows 11, an Nvidia GeForce RTX 3060 Ti, and an AMD Ryzen 5 3600 6-core processor. Is it okay, or meh, for generating AI art at large image sizes?
-
Could we do a benchmark for GPUs?
I need to get new hardware.
To keep it simple, we could use the default load workflow at 1024x1024 with SDXL 1.0.
Seed 1.
I know that RAM comes into play when workflows get more complex, but for a start, simple is best.
Please post the second run; the first run is slightly skewed by loading data.
Theory, Puget:

Tom's Hardware, A1111:

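The second-run rule generalizes into a simple harness: discard warm-up runs that pay one-time costs, then average the rest. A minimal sketch, assuming a `generate` callable standing in for whatever workflow is benchmarked (SDXL 1.0 at 1024x1024, seed 1, in the proposal above):

```python
import time

def benchmark(generate, warmup_runs: int = 1, timed_runs: int = 3) -> float:
    """Mean wall-clock seconds per run, excluding warm-up runs that are
    skewed by model loading and cache population."""
    for _ in range(warmup_runs):
        generate()  # pays one-time costs: weight loading, kernel compilation
    times = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        generate()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Dummy workload standing in for an actual SDXL generation:
mean_s = benchmark(lambda: sum(i * i for i in range(200_000)))
print(f"{mean_s:.4f} s per run")
```

Averaging several timed runs also smooths out clock-boost and thermal variance between individual generations.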