Description
What happened?
In commit 6e7d133a5f9409dd257fad90d7f320721b07a1b2, changes were made to how the /v1/chat/completions endpoint is handled.

Earlier, the call sequence was:
```cpp
const int id_task = ctx_server.queue_tasks.get_new_id();
ctx_server.queue_results.add_waiting_task_id(id_task);
ctx_server.request_completion(id_task, -1, data, false, false);
// ...
ctx_server.queue_results.remove_waiting_task_id(id_task);
```
After the changes, the `ctx_server.queue_results.remove_waiting_task_id(id_task);` call is missing, which causes `server_response.waiting_task_ids` to grow with every request served by this endpoint. On a long-running llama-server instance in production, this will steadily consume memory, since the ids are never cleared for a few of the refactored server handlers.
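For what it's worth, one way to make this cleanup hard to forget across the refactored handlers would be a small RAII guard around the registration. This is only a minimal sketch of the idea, not code from the repo: `waiting_task_guard` is a hypothetical name, and it only assumes the `add_waiting_task_id`/`remove_waiting_task_id` interface shown above.

```cpp
// Hypothetical RAII guard (illustrative sketch): registers a waiting task id
// on construction and always removes it on destruction, so early returns in
// a handler cannot leak entries in server_response::waiting_task_ids.
template <typename Queue>
struct waiting_task_guard {
    Queue & queue;
    int     id_task;

    waiting_task_guard(Queue & q, int id) : queue(q), id_task(id) {
        queue.add_waiting_task_id(id_task);
    }
    ~waiting_task_guard() {
        queue.remove_waiting_task_id(id_task);
    }

    // non-copyable, so the id is removed exactly once
    waiting_task_guard(const waiting_task_guard &) = delete;
    waiting_task_guard & operator=(const waiting_task_guard &) = delete;
};

// Sketch of usage inside a handler:
//   const int id_task = ctx_server.queue_tasks.get_new_id();
//   waiting_task_guard guard(ctx_server.queue_results, id_task);
//   ctx_server.request_completion(id_task, -1, data, false, false);
//   // ... wait for and stream results ...
//   // the guard's destructor removes id_task on every exit path
```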
@ngxson, kindly provide your input. Thanks.
Name and Version
```
$ ./bin/llama-cli --version
version: 3609 (2f3c1466)
built with Homebrew clang version 18.1.5 for arm64-apple-darwin23.3.0
```
What operating system are you seeing the problem on?
Mac
Relevant log output
No response