Description
I encountered unexpected behavior when running the following command, following the instructions for "Serving multiple clients with parallel decoding and continuous batching" (#3749 (comment)):
./parallel -m ./models/llama_7b/llama-2-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb
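For context, here is my reading of the relevant flags, based on the parallel example's --help output (hedged; exact semantics may differ across builds):
# -np 8    : number of parallel client slots
# -ns 128  : total number of sequences (simulated client requests)
# -n 100   : maximum new tokens generated per sequence
# -cb      : enable continuous batching
# -ngl 100 : number of model layers offloaded to the GPU
# -c 4096  : context size; in the parallel example the KV cache appears to be shared across all slots
# -b 512   : logical batch size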
The model truncates the output for some prompts. For instance, the model stopped generating after "... for gettting started:", as shown in the image below:
Similar behavior is observed with mixtral-8x7b-instruct using the following command:
./parallel -m ./models--mistralai--Mixtral-8x7B-Instruct-v0.1/ggml-model-Q4_K_M.gguf -c 8192 -ngl 100 -f ~/data1_10.txt -n 2000 --temp 0.5 --top-p 0.9 --color --in-prefix "[INST]" --in-suffix "[/INST]" -b 8192 -np 2 -ns 10 -cb -t 1
As shown in the image below, the model truncates the output after "... The transcript '".

This behavior disappears when I provide the same prompt to ./main.
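For reference, the single-client check was a run of the following form (an illustrative reconstruction, not the exact command; the prompt is a placeholder standing in for the same prompt text used in the parallel run):
./main -m ./models/llama_7b/llama-2-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -n 100 -p "<same prompt as in the parallel run>"
With an invocation like this, the full output is generated without truncation.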
I am using 4 A100s.