What happened?
The max_tokens limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens indefinitely (until the context length is reached). With non-streaming requests there is no way to stop generation. The current workaround is to stream and close the connection once the desired number of tokens has been received (see the sketch after the example below).
from openai import AsyncOpenAI

# llama.cpp OpenAI-compatible server; adjust base_url to your setup
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# run inside an async function / event loop
response = await client.completions.create(
    model="[A llama3 8B gguf]",
    prompt="Write me a funny story.",
    max_tokens=200,  # ignored for non-chat completions: generation continues past the limit
    stream=True,
)
async for chunk in response:
    print(chunk)
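For reference, a minimal sketch of the workaround mentioned above, assuming the openai Python client v1.x (whose AsyncStream exposes a close() method); the function name, base_url, and the approximation that each streamed chunk is roughly one token are illustrative and not part of the original report:

import asyncio
from openai import AsyncOpenAI

# Adjust base_url to your llama.cpp server; the API key is a placeholder
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

async def limited_completion(prompt: str, limit: int) -> str:
    response = await client.completions.create(
        model="[A llama3 8B gguf]",
        prompt=prompt,
        stream=True,
    )
    pieces = []
    received = 0
    async for chunk in response:
        pieces.append(chunk.choices[0].text)
        received += 1  # llama.cpp streams roughly one token per chunk
        if received >= limit:
            break
    await response.close()  # drop the connection so the server stops generating
    return "".join(pieces)

print(asyncio.run(limited_completion("Write me a funny story.", 200)))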
Name and Version
version: 3432 (45f2c19)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response