Custom seed values ignored by llama.cpp HTTP server #7381
Comments
There are currently issues with nondeterminism in the server, especially when using >1 slots, see e.g. #7347. However, I think that in this case the seed being reported back is simply incorrect. When I run `curl --request POST --url http://localhost:8080/completion --data '{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}' | python3 -m json.tool` multiple times I get the exact same output, but I get different outputs when I don't set the seed.
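A quick way to eyeball this is a sketch along the following lines, assuming the server is running on localhost:8080 and that the completion endpoint returns the generated text in a `content` field: repeat the same fixed-seed request a few times and hash only the generated text, so that run-to-run differences in timing fields don't get in the way.

```sh
# Sketch: repeat the fixed-seed request and hash only the generated text.
# Assumes the server runs on localhost:8080 and the /completion response
# carries the generated text in a "content" field.
for i in 1 2 3; do
  curl -s --request POST --url http://localhost:8080/completion \
    --data '{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}' \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["content"])' \
    | md5sum
done
```

With a correctly applied seed, all three hashes should be identical.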
Presumably never trying my test (prompt), or restarting the server after trying it and before trying yours, right? :) Because if you ran your prompt immediately after my prompt (without restarting the server), emptying the prompt and setting the seed would not help in achieving deterministic responses for empty prompts: you would get a different response every time despite the fixed seed. It must therefore have to do with the prompt history not being reset after each inference (as would be intuitively expected), and I may also add that from my experience even the perfectly deterministic results achievable thanks to fixing the seed in …
As I said, there are issues with nondeterminism that I am already aware of. These especially affect long generations like yours, where small differences in rounding error will cause the sequences to diverge at points where the distribution of possible continuations is flat, such as at the beginnings of sentences. However, if this were an issue with the seed not being set correctly, the sequences would diverge right from the beginning. I can also confirm just by looking at the code that the wrong value is simply being reported back. So there is more than one issue at play here causing nondeterministic results.
Okay, seems like you are right after all. The first few tokens (but not necessarily all 20 - I managed to get several different versions even with that limit :) are always the same if: …

I think this issue can be safely closed in favor of yours, but first let's link here the issue(s) where the non-deterministic responses due to the "butterfly effect" are investigated, shall we? I'd like to chip in my tests there.
I don't have a comprehensive list, but #7347, ggerganov/whisper.cpp#1941 (comment), and #6122 (comment) are related.
Thank you for the links! And it turned out that this: "I'm trying to figure out how to enable deterministic results for >1 slots." (from #7347) could fully explain all the reproducibility problems, even for my tweet-based tests (I was using multi-processing in the server from my previous scaling tests, whereas multiprocessing was not turned on / not available in the Python package, where reproducibility with fixed seeds was never an issue). So the current workaround that will give reproducible results for any combination of inference parameters (not restricted to near-zero …
Yes, to my knowledge using a single slot should make the results reproducible.
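For reference, a hedged sketch of what that single-slot setup could look like when launching the server; the binary name and flags are as I understand them for builds from around this time, and the model path is a placeholder:

```sh
# Single processing slot (-np 1) to avoid the cross-slot nondeterminism
# discussed in #7347; the model path is a placeholder.
./server -m ./Meta-Llama-3-8B-Instruct-Q8_0.gguf --port 8080 -np 1
```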
Problem. The custom `seed` value is not passed to the inference engine when using the `llama.cpp` HTTP server (even though it works as expected in the `llama_cpp_python` package).

How to reproduce: in the latest Linux version of `llama.cpp`, repeat several times exactly the same cURL request to the completion API endpoint of the `llama.cpp` HTTP server, with the prompt containing an open question and with a high value of `temperature` and `top_p` (to maximize the variability of model output), while fixing the `seed`, e.g. like this one to infer from the 8-bit quant of the `bartowski/Meta-Llama-3-8B-Instruct-GGUF` (Meta-Llama-3-8B-Instruct-Q8_0.gguf) model:
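For illustration, a request along these lines; the prompt, temperature/top_p values and n_predict are placeholders chosen to match the description above, with only the seed of 42 taken from the report:

```sh
# Illustrative reproduction request; prompt and sampling values are placeholders.
curl --request POST --url http://localhost:8080/completion \
  --data '{"prompt": "Why is the sky blue?", "temperature": 0.9, "top_p": 0.95, "repeat_penalty": 1.1, "seed": 42, "n_predict": 128}' \
  | python3 -m json.tool
```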
We can see that regardless of the value passed to `seed` in the HTTP request (e.g. 42 in the example above), the `seed` values reported back to the HTTP client are invariably the default one (4294967295, i.e. -1 cast to unsigned int).

The fact that the default -1 (i.e. a random, unobservable and non-repeatable seed) is used as the seed, while the custom client-supplied values are being ignored, is corroborated by the fact that the model-generated output is always different, rather than always the same as expected (and as attainable with the above settings when repeating this test against the non-server `llama.cpp` backend using its Python package - a local binding, without client-server communication).
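To see the mismatch directly, the seed echoed back by the server can be inspected with something like the following sketch; the exact name and nesting of the seed field in the response JSON may differ between server versions, and the request reuses the placeholder values from the example above:

```sh
# Inspect the seed reported back by the server; per this report it comes
# back as 4294967295 instead of the requested 42.
curl -s --request POST --url http://localhost:8080/completion \
  --data '{"prompt": "Why is the sky blue?", "temperature": 0.9, "top_p": 0.95, "repeat_penalty": 1.1, "seed": 42, "n_predict": 128}' \
  | python3 -m json.tool | grep -i '"seed"'
```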