Describe the issue as clearly as possible:
Hi and thank you for your great work!
I found a problem when using Outlines with a local LLM served by a llama.cpp backend on my Mac M3. When I reuse the same model and generator for several consecutive constrained generation calls, the AsyncOpenAI event loop closes unexpectedly, which triggers a retry.
The problem is that the original request is still processed by my llama.cpp backend, so every prompt is processed twice when the same generator is reused. There is no outright error, but the traceback and the retry show up when logging at debug level. I did not observe this behavior with a plain text generator.
For now I work around it by creating a new model and generator instance every time I want to process a request, which avoids the duplicate query, but it doesn't feel right and may be hiding something else (a sketch of this workaround is shown below). Thank you!
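For reference, the workaround looks roughly like the sketch below. It is only an approximation of my setup: the base URL, model name and Pydantic schema are placeholders, and the exact Outlines constructors (`models.openai`, `generate.json`, `OpenAIConfig`) may differ depending on the Outlines version installed.

```python
# Rough sketch of the current workaround (not a recommendation): rebuild the
# model and the generator for every request so the AsyncOpenAI client and its
# event loop are never reused. Base URL, model name and schema are
# placeholders; the Outlines constructors may differ between versions.
from openai import AsyncOpenAI
from pydantic import BaseModel

from outlines import generate, models
from outlines.models.openai import OpenAIConfig


class Answer(BaseModel):
    city: str


def ask(prompt: str) -> Answer:
    client = AsyncOpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
    model = models.openai(client, OpenAIConfig(model="local-model"))
    generator = generate.json(model, Answer)  # fresh generator on every call
    return generator(prompt)
```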
Llama.cpp output
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
##########################
First Query
##########################
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot release: id 0 | task 0 | stop processing: n_past = 25, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2335.83 ms / 13 tokens ( 179.68 ms per token, 5.57 tokens per second)
eval time = 857.46 ms / 13 tokens ( 65.96 ms per token, 15.16 tokens per second)
total time = 3193.29 ms / 26 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query
##########################
slot launch_slot_: id 0 | task 14 | processing task
slot update_slots: id 0 | task 14 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 14 | kv cache rm [0, end)
slot update_slots: id 0 | task 14 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 14 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 14 | stop processing: n_past = 24, truncated = 0
slot print_timing: id 0 | task 14 |
prompt eval time = 239.87 ms / 14 tokens ( 17.13 ms per token, 58.36 tokens per second)
eval time = 740.46 ms / 11 tokens ( 67.31 ms per token, 14.86 tokens per second)
total time = 980.33 ms / 25 tokens
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query (wrong retry)
##########################
slot launch_slot_: id 0 | task 19 | processing task
slot update_slots: id 0 | task 19 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 19 | kv cache rm [0, end)
slot update_slots: id 0 | task 19 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 19 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 19 | stop processing: n_past = 23, truncated = 0
slot print_timing: id 0 | task 19 |
prompt eval time = 240.52 ms / 14 tokens ( 17.18 ms per token, 58.21 tokens per second)
eval time = 665.85 ms / 10 tokens ( 66.58 ms per token, 15.02 tokens per second)
total time = 906.37 ms / 24 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
Steps/code to reproduce the bug:
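A minimal sketch of the pattern that triggers the duplicate processing is below. It assumes a llama.cpp server exposing its OpenAI-compatible API at http://127.0.0.1:8080/v1; the model name, schema and the Outlines constructors are illustrative and may need adjusting to the installed Outlines version.

```python
# Minimal sketch: build the model and a JSON-constrained generator once, then
# reuse the generator for several prompts. Base URL, model name and schema
# are placeholders; Outlines constructors may vary between versions.
from openai import AsyncOpenAI
from pydantic import BaseModel

from outlines import generate, models
from outlines.models.openai import OpenAIConfig


class Answer(BaseModel):
    city: str


client = AsyncOpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
model = models.openai(client, OpenAIConfig(model="local-model"))

# Constrained (JSON) generator created once and reused across prompts.
generator = generate.json(model, Answer)

for prompt in ["What is the capital of France?", "What is the capital of Spain?"]:
    print(generator(prompt))
    # When the generator is reused, the AsyncOpenAI event loop closes, the
    # request is retried, and llama.cpp processes the same prompt twice
    # (see the "Second Query (wrong retry)" block in the log above).
```

The log above shows the resulting effect: three POST /v1/chat/completions requests are served for only two distinct prompts.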
Expected result:
I would like to be able to reuse the same model and generator without each query being processed twice because of the retry.
Error message:
Outlines/Python version information:
Context for the issue:
This doesn't block my work since I found a workaround, but I think the problem is worth looking into.