Plan copypasta from ggml-org/llama.cpp#6853 (reply in thread):
We might not even need to write too much new code for this, I suppose. Given that the models are separate, we can start (main_A + speculative) on instance_A and (main_B + speculative) on instance_B. Then we need to orchestrate the data/logic passing during the transition phase (rough loop sketched below):
- In the 'middle' of main model processing (A is done with its first half), we need to pass the activations to B and pass whatever B has speculated so far back to A
- At the end of main model processing (B is done and has the logits), we need to get whatever the latest speculation on B is, consolidate it with what we have currently produced on A, pass the 'current approved tokens' to A, and start speculating on B again
- repeat
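
To make the transition phases a bit more concrete, here is a very rough sketch of what the orchestration loop could look like, assuming instance_A runs (main_A + speculative) and instance_B runs (main_B + speculative) as above. None of the types or `remote_*` helpers below exist in llama.cpp - they are hypothetical placeholders for the RPC calls we would need to implement, and `accept_draft` stands in for the usual speculative-decoding acceptance check:

```cpp
// Rough orchestration sketch - NOT real llama.cpp API. All types and the
// remote_* helpers are hypothetical placeholders for calls between the two
// instances (e.g. on top of the RPC backend).
#include <cstdint>
#include <vector>

using Token = int32_t;

struct Activations { std::vector<float> data; };   // hidden state at the A/B split point
struct Logits      { std::vector<float> data; };   // output logits produced on instance_B

// Hypothetical RPC wrappers (to be implemented):
Activations        remote_run_first_half_A (const std::vector<Token> & tokens);
void               remote_send_activations_B(const Activations & act);
std::vector<Token> remote_fetch_draft_B     ();  // whatever B has speculated so far
Logits             remote_run_second_half_B (const Activations & act);
void               remote_push_approved_A   (const std::vector<Token> & approved);
void               remote_start_drafting_B  (const std::vector<Token> & approved);

// Standard speculative-decoding acceptance: keep the longest draft prefix
// that agrees with the main model's predictions.
std::vector<Token> accept_draft(const Logits & logits, const std::vector<Token> & draft);

void orchestrate(std::vector<Token> & approved, std::vector<Token> & draft, const bool & done) {
    while (!done) {
        // A evaluates the approved context plus the current draft through
        // the first half of the main model.
        std::vector<Token> input = approved;
        input.insert(input.end(), draft.begin(), draft.end());
        const Activations act = remote_run_first_half_A(input);

        // 'middle' of main model processing: pass A's activations to B and
        // pull whatever B has speculated so far back to A (this becomes the
        // draft for the next round).
        remote_send_activations_B(act);
        const std::vector<Token> next_draft = remote_fetch_draft_B();

        // end of main model processing: B produces the logits; verify the
        // current draft against them and consolidate the approved tokens.
        const Logits logits = remote_run_second_half_B(act);
        const std::vector<Token> accepted = accept_draft(logits, draft);
        approved.insert(approved.end(), accepted.begin(), accepted.end());

        // pass the 'current approved tokens' back to A, restart speculation
        // on B, and repeat.
        remote_push_approved_A(approved);
        remote_start_drafting_B(approved);

        draft = next_draft;
    }
}
```

Whether the `remote_*` calls would go over the RPC backend from ggml-org/llama.cpp#6829 or something simpler is an open question.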
Relevant links:
- async/parallel speculative execution ggml-org/llama.cpp#6853
- speculative : add tree-based sampling example ggml-org/llama.cpp#3624
- ggml : add RPC backend ggml-org/llama.cpp#6829
Devices I can test this on:
- M2 Ultra 192GB
- M2 24GB
- M1 16GB