Description
I encountered unexpected behavior when running the following command, following the instructions for "Serving multiple clients with parallel decoding and continuous batching" (#3749 (comment)):
./parallel -m ./models/llama_7b/llama-2-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb
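For context, here is my reading of the relevant flags, based on the parallel example's --help output (hedged; exact semantics may differ across builds):
# -np 8    : number of parallel client slots
# -ns 128  : total number of sequences (simulated client requests)
# -n 100   : maximum new tokens generated per sequence
# -cb      : enable continuous batching
# -ngl 100 : number of model layers offloaded to the GPU
# -c 4096  : context size; in the parallel example the KV cache appears to be shared across all slots
# -b 512   : logical batch size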
The model truncates the output for some prompts. For instance, the model stopped generating after "... for gettting started:", as shown in the image below:
Similar behavior is observed with mixtral-8x7b-instruct using the following command:
./parallel -m ./models--mistralai--Mixtral-8x7B-Instruct-v0.1/ggml-model-Q4_K_M.gguf -c 8192 -ngl 100 -f ~/data1_10.txt -n 2000 --temp 0.5 --top-p 0.9 --color --in-prefix "[INST]" --in-suffix "[/INST]" -b 8192 -np 2 -ns 10 -cb -t 1
As shown in the image below, the model truncates the output after "... The transcript '".

This behavior disappears when I provide the same prompt to ./main.
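For reference, the single-client check was a run of the following form (an illustrative reconstruction, not the exact command; the prompt is a placeholder standing in for the same prompt text used in the parallel run):
./main -m ./models/llama_7b/llama-2-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -n 100 -p "<same prompt as in the parallel run>"
With an invocation like this, the full output is generated without truncation.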
I am using 4 A100s.