Name and Version
./bin/llama-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Max)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Max)
version: 5615 (f470bc36)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.5.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-server
Command line
cd tools/server/tests
./tests.sh
Problem description & steps to reproduce
Description
When running the server tests locally on my M3 Max 64GB, any test that makes a request after server.start() fails unless I inject a manual sleep after server.start() to make it wait a bit longer. It seems that despite the retry loop on /health in server.start() (here), the server does not have the model fully loaded and ready to query when /health returns 200: in the log below, the follow-up GET /props also answers 200, but its body is the web UI's HTML rather than JSON. This seems similar to the liveness vs readiness probe problem for kubernetes pods, where /health currently indicates that the server is "live" but not "ready," whereas the tests assume it means ready.
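To illustrate the distinction, the readiness-style wait the tests effectively assume would look something like the sketch below. This is a minimal illustration only: wait_for_ready is a hypothetical helper (not part of utils.py), and probing /props for a parseable JSON body is just one way to confirm the model routes are actually serving.

```python
import time

import requests


def wait_for_ready(base_url: str, timeout: float = 30.0) -> None:
    # Liveness: a 200 from /health only proves the HTTP listener is up.
    # Readiness: additionally require a JSON endpoint such as /props to
    # return a parseable body, which fails while the server is still
    # answering with the web UI's HTML (as in the failing test below).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=1).status_code == 200:
                requests.get(f"{base_url}/props", timeout=1).json()
                return  # live and ready
        except requests.RequestException:  # includes requests' JSONDecodeError
            pass  # not ready yet; keep polling
        time.sleep(0.1)
    raise TimeoutError(f"server at {base_url} never became ready")
```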
Related Discussion
- This came up due to failing tests on my Hybrid Recurrent Cache branch: Hybrid recurrent cache #13979 (comment)
- I was able to make this "work" locally by inserting retry logic into all server.make_request calls on gabe-l-hart@39a93b3 (roughly the shape sketched below), but this fix is likely just masking the issue.
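The retry I patched in was roughly the following. This is a hedged sketch rather than the exact diff: make_request_with_retry and its parameters are made up for illustration, and the real utils.py make_request takes additional arguments.

```python
import time

import requests


def make_request_with_retry(server, method: str, path: str,
                            retries: int = 10, delay: float = 0.5):
    # Hypothetical stand-in for the change on gabe-l-hart@39a93b3: retry
    # whenever the response body is not yet valid JSON, i.e. the server
    # answered the request but the model routes are not serving yet.
    for attempt in range(retries):
        try:
            return server.make_request(method, path)
        except requests.exceptions.JSONDecodeError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```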
Python env
python --version
Python 3.11.8
pip freeze
aiohttp==3.9.5
aiosignal==1.3.2
annotated-types==0.7.0
anyio==4.9.0
attrs==25.3.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.8
contourpy==1.3.2
cycler==0.12.1
distro==1.9.0
docstring_parser==0.16
einops==0.7.0
filelock==3.13.3
flake8==7.2.0
fonttools==4.58.2
frozenlist==1.6.2
fsspec==2024.3.1
gguf==0.17.0
gitdb==4.0.12
GitPython==3.1.44
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.23.5
idna==3.6
iniconfig==2.1.0
Jinja2==3.1.3
jiter==0.10.0
joblib==1.4.2
kiwisolver==1.4.8
# Editable install with no version control (llama-cpp-scripts==0.0.0)
-e /Users/ghart/Projects/github/ggerganov/llama.cpp
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.10.3
mccabe==0.7.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.4.4
networkx==3.3
numpy==1.26.4
openai==1.55.3
packaging==24.0
pandas==2.2.3
pillow==10.2.0
pluggy==1.6.0
prometheus_client==0.20.0
propcache==0.3.1
protobuf==4.25.3
pycodestyle==2.13.0
pycparser==2.22
pydantic==2.6.4
pydantic_core==2.16.3
pyflakes==3.3.2
Pygments==2.19.1
pyparsing==3.2.3
PySide6==6.9.1
PySide6_Addons==6.9.1
PySide6_Essentials==6.9.1
pytest==8.3.5
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.1
regex==2023.12.25
requests==2.32.3
rich==14.0.0
safetensors==0.5.3
scikit-learn==1.6.0
scipy==1.14.1
seaborn==0.13.2
sentence-transformers==3.3.1
sentencepiece==0.2.0
shellingham==1.5.4
shiboken6==6.9.1
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
sympy==1.12
tabulate==0.9.0
threadpoolctl==3.5.0
tokenizers==0.20.3
torch==2.2.2
torchvision==0.17.2
tqdm==4.66.2
transformers==4.46.3
typer==0.15.4
typing_extensions==4.11.0
tzdata==2025.2
urllib3==2.2.1
wget==3.2
yarl==1.20.0
First Bad Commit
No response
Relevant log output
./tests.sh
========================================================= test session starts ==========================================================
platform darwin -- Python 3.11.8, pytest-8.3.5, pluggy-1.6.0 -- /Users/ghart/mambaforge/envs/llama.cpp/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/ghart/Projects/github/ggml-org/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.9.0
collected 427 items / 229 deselected / 198 selected
unit/test_basic.py::test_server_start_simple PASSED [ 0%]
unit/test_basic.py::test_server_props FAILED [ 1%]
=============================================================== FAILURES ===============================================================
__________________________________________________________ test_server_props ___________________________________________________________
self = <Response [200]>, kwargs = {}
def json(self, **kwargs):
r"""Returns the json-encoded content of a response, if any.
:param \*\*kwargs: Optional arguments that ``json.loads`` takes.
:raises requests.exceptions.JSONDecodeError: If the response body does not
contain valid json.
"""
if not self.encoding and self.content and len(self.content) > 3:
# No encoding set. JSON RFC 4627 section 3 states we should expect
# UTF-8, -16 or -32. Detect which one to use; If the detection or
# decoding fails, fall back to `self.text` (using charset_normalizer to make
# a best guess).
encoding = guess_json_utf(self.content)
if encoding is not None:
try:
return complexjson.loads(self.content.decode(encoding), **kwargs)
except UnicodeDecodeError:
# Wrong UTF codec detected; usually because it's not UTF-8
# but some other 8-bit codec. This is an RFC violation,
# and the server didn't bother to tell us what codec *was*
# used.
pass
except JSONDecodeError as e:
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
try:
> return complexjson.loads(self.text, **kwargs)
/Users/ghart/mambaforge/envs/llama.cpp/lib/python3.11/site-packages/requests/models.py:974:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/ghart/mambaforge/envs/llama.cpp/lib/python3.11/json/__init__.py:346: in loads
return _default_decoder.decode(s)
/Users/ghart/mambaforge/envs/llama.cpp/lib/python3.11/json/decoder.py:337: in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <json.decoder.JSONDecoder object at 0x103240dd0>
s = '<!doctype html>\n<html lang="en">\n\t<head>\n\t\t<meta charset="utf-8" />\n\t\t<link rel="icon" type="image/png" href...\t}\n\t}\n\n\t.animate-pulse-fast {\n\t\tanimation: pulse 1.5s cubic-bezier(0.4, 0, 0.6, 1) infinite;\n\t}\n</style>\n'
idx = 0
def raw_decode(self, s, idx=0):
"""Decode a JSON document from ``s`` (a ``str`` beginning with
a JSON document) and return a 2-tuple of the Python
representation and the index in ``s`` where the document ended.
This can be used to decode a JSON document from a string that may
have extraneous data at the end.
"""
try:
obj, end = self.scan_once(s, idx)
except StopIteration as err:
> raise JSONDecodeError("Expecting value", s, err.value) from None
E json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
/Users/ghart/mambaforge/envs/llama.cpp/lib/python3.11/json/decoder.py:355: JSONDecodeError
During handling of the above exception, another exception occurred:
def test_server_props():
global server
server.start()
> res = server.make_request("GET", "/props")
unit/test_basic.py:24:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
utils.py:275: in make_request
result.body = response.json() if parse_body else None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Response [200]>, kwargs = {}
def json(self, **kwargs):
r"""Returns the json-encoded content of a response, if any.
:param \*\*kwargs: Optional arguments that ``json.loads`` takes.
:raises requests.exceptions.JSONDecodeError: If the response body does not
contain valid json.
"""
if not self.encoding and self.content and len(self.content) > 3:
# No encoding set. JSON RFC 4627 section 3 states we should expect
# UTF-8, -16 or -32. Detect which one to use; If the detection or
# decoding fails, fall back to `self.text` (using charset_normalizer to make
# a best guess).
encoding = guess_json_utf(self.content)
if encoding is not None:
try:
return complexjson.loads(self.content.decode(encoding), **kwargs)
except UnicodeDecodeError:
# Wrong UTF codec detected; usually because it's not UTF-8
# but some other 8-bit codec. This is an RFC violation,
# and the server didn't bother to tell us what codec *was*
# used.
pass
except JSONDecodeError as e:
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
try:
return complexjson.loads(self.text, **kwargs)
except JSONDecodeError as e:
# Catch JSON-related errors and raise as requests.JSONDecodeError
# This aliases json.JSONDecodeError and simplejson.JSONDecodeError
> raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
E requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
/Users/ghart/mambaforge/envs/llama.cpp/lib/python3.11/site-packages/requests/models.py:978: JSONDecodeError
--------------------------------------------------------- Captured stdout call ---------------------------------------------------------
tests: starting server with: ../../../build/bin/llama-server --host 127.0.0.1 --port 8080 --temp 0.8 --seed 42 --hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf --batch-size 32 --alias tinyllama-2 --ctx-size 512 --parallel 2 --n-predict 64
server pid=66945, pytest pid=66943
Response from server {
"status": true
}
------------------------------------------------------- Captured stdout teardown -------------------------------------------------------
Stopping server with pid=66945
======================================================= short test summary info ========================================================
FAILED unit/test_basic.py::test_server_props - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================= 1 failed, 1 passed, 229 deselected in 0.87s ==============================================