Describe the issue as clearly as possible:
Hi and thank you for your great work!
I found a problem when using Outlines with a local LLM served by a llama.cpp backend on my Mac M3. When I reuse the same model and generator for several consecutive constrained generation calls, the AsyncOpenAI event loop closes unexpectedly, which triggers a retry.
The problem is that the original request is still processed by my llama.cpp backend, so every prompt is processed twice when the same generator is reused. There is no outright error, but the traceback and the retry show up when logging at debug level. I did not observe this behavior with a plain text generator.
For now I work around it by creating a new model and generator instance every time I want to process a request, which avoids the duplicate query, but it doesn't feel right and may be hiding something else (a sketch of this workaround is shown below). Thank you!
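For reference, the workaround looks roughly like the sketch below. It is only an approximation of my setup: the base URL, model name and Pydantic schema are placeholders, and the exact Outlines constructors (`models.openai`, `generate.json`, `OpenAIConfig`) may differ depending on the Outlines version installed.

```python
# Rough sketch of the current workaround (not a recommendation): rebuild the
# model and the generator for every request so the AsyncOpenAI client and its
# event loop are never reused. Base URL, model name and schema are
# placeholders; the Outlines constructors may differ between versions.
from openai import AsyncOpenAI
from pydantic import BaseModel

from outlines import generate, models
from outlines.models.openai import OpenAIConfig


class Answer(BaseModel):
    city: str


def ask(prompt: str) -> Answer:
    client = AsyncOpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
    model = models.openai(client, OpenAIConfig(model="local-model"))
    generator = generate.json(model, Answer)  # fresh generator on every call
    return generator(prompt)
```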
Llama.cpp output
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
##########################
First Query
##########################
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot release: id 0 | task 0 | stop processing: n_past = 25, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2335.83 ms / 13 tokens ( 179.68 ms per token, 5.57 tokens per second)
eval time = 857.46 ms / 13 tokens ( 65.96 ms per token, 15.16 tokens per second)
total time = 3193.29 ms / 26 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query
##########################
slot launch_slot_: id 0 | task 14 | processing task
slot update_slots: id 0 | task 14 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 14 | kv cache rm [0, end)
slot update_slots: id 0 | task 14 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 14 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 14 | stop processing: n_past = 24, truncated = 0
slot print_timing: id 0 | task 14 |
prompt eval time = 239.87 ms / 14 tokens ( 17.13 ms per token, 58.36 tokens per second)
eval time = 740.46 ms / 11 tokens ( 67.31 ms per token, 14.86 tokens per second)
total time = 980.33 ms / 25 tokens
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query (wrong retry)
##########################
slot launch_slot_: id 0 | task 19 | processing task
slot update_slots: id 0 | task 19 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 19 | kv cache rm [0, end)
slot update_slots: id 0 | task 19 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 19 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 19 | stop processing: n_past = 23, truncated = 0
slot print_timing: id 0 | task 19 |
prompt eval time = 240.52 ms / 14 tokens ( 17.18 ms per token, 58.21 tokens per second)
eval time = 665.85 ms / 10 tokens ( 66.58 ms per token, 15.02 tokens per second)
total time = 906.37 ms / 24 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
Steps/code to reproduce the bug:
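A minimal sketch of the pattern that triggers the duplicate processing is below. It assumes a llama.cpp server exposing its OpenAI-compatible API at http://127.0.0.1:8080/v1; the model name, schema and the Outlines constructors are illustrative and may need adjusting to the installed Outlines version.

```python
# Minimal sketch: build the model and a JSON-constrained generator once, then
# reuse the generator for several prompts. Base URL, model name and schema
# are placeholders; Outlines constructors may vary between versions.
from openai import AsyncOpenAI
from pydantic import BaseModel

from outlines import generate, models
from outlines.models.openai import OpenAIConfig


class Answer(BaseModel):
    city: str


client = AsyncOpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
model = models.openai(client, OpenAIConfig(model="local-model"))

# Constrained (JSON) generator created once and reused across prompts.
generator = generate.json(model, Answer)

for prompt in ["What is the capital of France?", "What is the capital of Spain?"]:
    print(generator(prompt))
    # When the generator is reused, the AsyncOpenAI event loop closes, the
    # request is retried, and llama.cpp processes the same prompt twice
    # (see the "Second Query (wrong retry)" block in the log above).
```

The log above shows the resulting effect: three POST /v1/chat/completions requests are served for only two distinct prompts.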
Expected result:
I would like to be able to reuse the same model and generator without each query being processed twice because of the retry.
Error message:
Outlines/Python version information:
Context for the issue:
This doesn't block my work since I found a workaround, but I think the problem is worth looking into.