Description
LocalAI version:
3.6.0
Environment, CPU architecture, OS, and Version:
N/A
Describe the bug
Concurrent async requests all read and write the same global PRNG seed in memory, so one request can change the seed state that another request in flight depends on.
To Reproduce
- submit a completion request via API
- cancel the request in the client before it completes
- send another request immediately
Expected behavior
The output of each request should be independent of the others, and seed-related operations should not affect other requests in flight.
Additional context
This is largely inherited from upstream mlx_lm. mlx_lm.generate sets a global stream, which it then uses internally:
https://github.com/ml-explore/mlx-lm/blob/367d6d76860499767f62b0bc34408b51c9ed916b/mlx_lm/generate.py#L215-L216
While mlx.random.key(seed) does implement a way to extract and reuse a PRNG key from a seed, mlx_lm.generate provides no way to pass seed= or key=, let alone stream=. It takes a bit of code squinting to follow this because generate() and stream_generate() are a nesting doll of kwargs, but once we follow the call stack all the way down to generate_step(), we can confirm that no such parameters are accepted.
This is a long way of saying that the only way to interface with the seeds used by mlx_lm.generate is to interact with the global PRNG, which is not safe to share across concurrent async requests. The reference server implementation in mlx_lm.server does not disagree; it interacts with the global PRNG the same way we do, but it can get away with it because its API is not asynchronous and blocks for the duration of the call.
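To make the failure mode concrete, here is a minimal sketch of the same pattern using Python's standard-library `random` module as a stand-in for the global MLX PRNG. Nothing here calls mlx or mlx_lm; `generate_global` and `generate_locked` are hypothetical names used only for illustration, and the `await asyncio.sleep(0)` is an assumed placeholder for wherever an async backend yields between tokens.

```python
import asyncio
import random


async def generate_global(seed: int, n: int) -> list[float]:
    """Hypothetical stand-in for a generate call that seeds a global PRNG."""
    random.seed(seed)  # global state, shared by every request in the process
    out = []
    for _ in range(n):
        await asyncio.sleep(0)  # yield to other requests between "tokens"
        out.append(random.random())
    return out


async def generate_locked(lock: asyncio.Lock, seed: int, n: int) -> list[float]:
    """Same call, but seeding and sampling happen under one lock."""
    async with lock:
        return await generate_global(seed, n)


async def main():
    # Baseline: a request with seed=1 running alone.
    solo = await generate_global(1, 5)

    # Two overlapping requests: the second reseeds the global PRNG while the
    # first is still sampling, so the first no longer reproduces its baseline.
    racy, _ = await asyncio.gather(generate_global(1, 5), generate_global(2, 5))

    # Serialized requests: each one seeds and samples without interleaving.
    lock = asyncio.Lock()
    safe, _ = await asyncio.gather(
        generate_locked(lock, 1, 5), generate_locked(lock, 2, 5)
    )
    return solo, racy, safe


solo, racy, safe = asyncio.run(main())
print("overlapping request matches baseline:", racy == solo)
print("serialized request matches baseline: ", safe == solo)
```

The lock restores reproducibility only by giving up concurrency, which is essentially the trade-off the blocking mlx_lm.server makes. A per-request key or stream would avoid that cost, but as noted above, mlx_lm.generate does not currently accept one.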
Looking into this on my own time, but logging the bug to document the research so far.