Description
Since b2475, row split (`-sm row`) performs the same as layer split, i.e. the row-split speed advantage is gone. llama-bench is not affected, but main and server have this regression.
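For scale, the eval tokens/sec in the logs below work out to roughly a one-third slowdown. A quick sketch (the helper name is mine; the tokens/sec values are copied verbatim from the main and server timings):

```python
# Hypothetical helper: percent slowdown between two eval tokens/sec readings
def slowdown_pct(before_tps: float, after_tps: float) -> float:
    return (before_tps - after_tps) / before_tps * 100.0

# eval tokens/sec taken from the b2474 vs b2475 timing logs in this report
print(f"main:   {slowdown_pct(13.34, 8.54):.1f}% slower")   # ~36% slower
print(f"server: {slowdown_pct(12.12, 8.19):.1f}% slower")   # ~32% slower
```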
b2474
main
llama_print_timings: load time = 9945.29 ms
llama_print_timings: sample time = 4.05 ms / 128 runs ( 0.03 ms per token, 31565.97 tokens per second)
llama_print_timings: prompt eval time = 1712.75 ms / 15 tokens ( 114.18 ms per token, 8.76 tokens per second)
llama_print_timings: eval time = 9521.36 ms / 127 runs ( 74.97 ms per token, 13.34 tokens per second)
llama_print_timings: total time = 11268.98 ms / 142 tokens
server
{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 23176.51 ms / 281 runs ( 82.48 ms per token, 12.12 tokens per second)","n_decoded":281,"n_tokens_second":12.124345910954315,"t_token":82.47867615658363,"t_token_generation":23176.508,"tid":"139827722453120","timestamp":1712220482}
b2475
main
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40933.80 tokens per second)
llama_print_timings: prompt eval time = 3413.32 ms / 15 tokens ( 227.55 ms per token, 4.39 tokens per second)
llama_print_timings: eval time = 14874.55 ms / 127 runs ( 117.12 ms per token, 8.54 tokens per second)
llama_print_timings: total time = 18340.76 ms / 142 tokens
server
{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 38207.86 ms / 313 runs ( 122.07 ms per token, 8.19 tokens per second)","n_decoded":313,"n_tokens_second":8.192032335129394,"t_token":122.06983067092654,"t_token_generation":38207.857,"tid":"139892693971072","timestamp":1712220597}
llama-bench
| model                   | size      | params  | backend | ngl | sm  | test   | t/s          |
| ----------------------- | --------- | ------- | ------- | --- | --- | ------ | ------------ |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm    | 99  | row | pp 512 | 21.78 ± 0.03 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm    | 99  | row | tg 128 | 12.96 ± 0.02 |

build: ccf58aa (1)
b2600
main
llama_print_timings: load time = 9996.37 ms
llama_print_timings: sample time = 3.06 ms / 128 runs ( 0.02 ms per token, 41871.12 tokens per second)
llama_print_timings: prompt eval time = 3380.90 ms / 15 tokens ( 225.39 ms per token, 4.44 tokens per second)
llama_print_timings: eval time = 14900.67 ms / 127 runs ( 117.33 ms per token, 8.52 tokens per second)
llama_print_timings: total time = 18311.99 ms / 142 tokens
server
{"tid":"139684675611648","timestamp":1712223015,"level":"INFO","function":"print_timings","line":332,"msg":"generation eval time = 35799.18 ms / 295 runs ( 121.35 ms per token, 8.24 tokens per second)","id_slot":0,"id_task":0,"t_token_generation":35799.182,"n_decoded":295,"t_token":121.3531593220339,"n_tokens_second":8.240411750190269}
llama-bench
| model                   | size      | params  | backend | ngl | sm  | test   | t/s          |
| ----------------------- | --------- | ------- | ------- | --- | --- | ------ | ------------ |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm    | 99  | row | pp 512 | 22.02 ± 0.04 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm    | 99  | row | tg 128 | 12.93 ± 0.01 |

build: 4399f13 (2600)
Commands used:
HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf
HIP_VISIBLE_DEVICES=0,1,2 ./main -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -t 4 -ngl 99 --seed 1234 -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: " --split-mode row
HIP_VISIBLE_DEVICES=0,1,2 ./server -t 4 -ngl 99 -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -c 4096 -ts 8,10,10 -b 512 --port 8080 --host 192.168.0.87
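Since llama.cpp tags one build per commit on master, the offending change can likely be pinned down directly from the two tags; a possible procedure (assumes the release tags b2474/b2475 are fetched locally, and that you rebuild with your usual ROCm flags at each step):

```
# list the candidate commit(s) between the last good and first bad build
git log --oneline b2474..b2475

# if several commits are involved, bisect, rebuilding and re-running the
# `main` command above at each step
git bisect start
git bisect bad b2475       # first build with the slowdown
git bisect good b2474      # last build with full row-split speed
```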