
Big performance regression in llama-bench with the Vulkan backend when forcing the integer dot product code path (at least on an NVIDIA RTX 4070 with the latest driver), relative to initial support in b5010 #13063

Closed
@oscarbg

Description


Hi,
this is easy to reproduce by forcing the integer dot product path, i.e. disabling the coopmat1 and coopmat2 code paths:

set GGML_VK_DISABLE_COOPMAT2=1
set GGML_VK_DISABLE_COOPMAT=1

pp512 throughput drops from 2899 t/s in build 5010 (the first build with integer dot product support?) to 1935.29 t/s in build 5145.

EDIT: I haven't bisected which build/commit introduced the regression.
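If someone wants to narrow it down, here is a rough git bisect sketch. It assumes a local llama.cpp checkout, that the release tags `b5010`/`b5145` are reachable, and a generic CMake Vulkan build; the exact build command and model path (`model.gguf` is a placeholder) will differ per setup:

```shell
# Hypothetical bisect between the two release tags (untested sketch)
git bisect start b5145 b5010     # bad revision first, then good

# At each bisect step: rebuild the Vulkan backend and re-run the benchmark
cmake -B build -DGGML_VULKAN=ON && cmake --build build -j
GGML_VK_DISABLE_COOPMAT=1 GGML_VK_DISABLE_COOPMAT2=1 \
    ./build/bin/llama-bench -m model.gguf

# Then mark the commit based on the pp512 number:
git bisect good    # or: git bisect bad
```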

Tested with the latest NVIDIA drivers, both the 575.xx branch and the NVIDIA Vulkan developer driver.

llama-b5010-bin-win-vulkan-x64>llama-bench
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         pp512 |      2899.14 ± 48.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |        100.46 ± 0.49 |

build: a8a1f335 (5010)
llama-b5145-bin-win-vulkan-x64>llama-bench
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         pp512 |       1935.29 ± 4.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |        101.61 ± 0.21 |

build: 12b17501 (5145)

Also tested on Linux; the results are equally bad:

export GGML_VK_DISABLE_COOPMAT=1
export GGML_VK_DISABLE_COOPMAT2=1
~/llamavk/lin5010$ ./llama-bench
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         pp512 |       2953.53 ± 8.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         98.85 ± 0.27 |

build: a8a1f335 (5010)

~/llamavk/lin5145$ ./llama-bench
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         pp512 |       1926.43 ± 5.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         99.65 ± 1.92 |

build: 12b17501 (5145)
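For reference, the pp512 drop works out to roughly 33% on both platforms; a quick sketch of the arithmetic using the Windows numbers from the tables above:

```shell
# pp512 throughput (t/s) from the Windows runs above
awk 'BEGIN {
    old = 2899.14   # build 5010
    new = 1935.29   # build 5145
    printf "pp512 drop: %.1f%%\n", (old - new) / old * 100
}'
```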
