I'm having mixed results with my 24 GB P40 running DeepSeek R1 2.71-bit (from Unsloth).
llama-cli starts at 4.5 tokens/s, but it suddenly drops to 2 tokens/s before it even finishes the answer when using flash attention and q4_0 for both the K and V cache.
On the other hand, when NOT using flash attention or q4_0 for the V cache, the prompt completes without issues and finishes at 3 tokens/s.
No flash attention, finishes correctly at ~2300 tokens:
llama_perf_sampler_print: sampling time = 575.53 ms / 2344 runs ( 0.25 ms per token, 4072.77 tokens per second)
llama_perf_context_print: load time = 738356.48 ms
llama_perf_context_print: prompt eval time = 1298.99 ms / 12 tokens ( 108.25 ms per token, 9.24 tokens per second)
llama_perf_context_print: eval time = 698707.43 ms / 2331 runs ( 299.75 ms per token, 3.34 tokens per second)
llama_perf_context_print: total time = 702025.70 ms / 2343 tokens
Flash attention; I have to stop it manually because it would take hours and it drops below 1 t/s:
llama_perf_sampler_print: sampling time = 551.06 ms / 2387 runs ( 0.23 ms per token, 4331.63 tokens per second)
llama_perf_context_print: load time = 143539.30 ms
llama_perf_context_print: prompt eval time = 959.07 ms / 12 tokens ( 79.92 ms per token, 12.51 tokens per second)
llama_perf_context_print: eval time = 1142179.89 ms / 2374 runs ( 481.12 ms per token, 2.08 tokens per second)
llama_perf_context_print: total time = 1145100.79 ms / 2386 tokens
Interrupted by user
llama-bench doesn't show anything like that. Here is the comparison:
no flash attention - 42 layers on GPU
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | type_k | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | --------------------- | --------------: | -------------------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 42 | q4_0 | exps=CPU | pp512 | 8.63 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 42 | q4_0 | exps=CPU | tg128 | 4.35 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 42 | q4_0 | exps=CPU | pp512+tg128 | 6.90 ± 0.01 |
build: 7c07ac24 (5403)
flash attention - 62 layers on GPU
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | type_k | type_v | fa | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------------- | --------------: | -------------------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 62 | q4_0 | q4_0 | 1 | exps=CPU | pp512 | 7.93 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 62 | q4_0 | q4_0 | 1 | exps=CPU | tg128 | 4.56 ± 0.00 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CUDA | 62 | q4_0 | q4_0 | 1 | exps=CPU | pp512+tg128 | 6.10 ± 0.01 |
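For reference, the two llama-bench runs above correspond to flags roughly like these. This is only a sketch: the model filename is a placeholder, and I'm assuming -pg 512,128 for the pp512+tg128 row.

```sh
# no flash attention: 42 GPU layers, q4_0 K cache, experts kept on the CPU
./llama-bench -m DeepSeek-R1-UD-Q2_K_XL.gguf \
    -ngl 42 -ctk q4_0 -ot "exps=CPU" -pg 512,128

# flash attention: 62 GPU layers, q4_0 K and V cache
./llama-bench -m DeepSeek-R1-UD-Q2_K_XL.gguf \
    -ngl 62 -fa 1 -ctk q4_0 -ctv q4_0 -ot "exps=CPU" -pg 512,128
```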
Any ideas? This is the command I use to test the prompt:
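Roughly the same flags as in the bench sketch above; the model path and prompt below are placeholders rather than the exact values I use:

```sh
# flash-attention run: 62 GPU layers, q4_0 K and V cache, experts kept on the CPU
./llama-cli -m DeepSeek-R1-UD-Q2_K_XL.gguf \
    -ngl 62 -fa --cache-type-k q4_0 --cache-type-v q4_0 \
    -ot "exps=CPU" \
    -p "your prompt here"
```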
I remove the --cache-type-v and -fa parameters to test without flash attention. I also have to reduce from 62 GPU layers to 42 to make it fit in the 24 GB of VRAM.
The specs:
Looks similar to this one, but I'm using CUDA on an NVIDIA card: #12629
Thanks in advance