
distributed RPC-based speculative evaluation #1

Open
@okuvshynov

Description

Plan copied over from ggml-org/llama.cpp#6853 (reply in thread):

We might not even need to write much new code for this. Since the models are separate, we can start (main_A + speculative) on instance_A and (main_B + speculative) on instance_B. Then we need to orchestrate the data/logic hand-off during the transition phase:

  • In the middle of main-model processing (A is done with its half), pass the activations from A to B, and pass whatever B has speculated so far back to A.
  • At the end of main-model processing (B has produced the logits), fetch B's latest speculation, consolidate it with what has been produced on A so far, pass the current set of approved tokens to A, and restart speculation on B.
  • Repeat.
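The speculate/consolidate loop above can be sketched as a toy, single-process simulation. This is only an illustration of the control flow, not of the llama.cpp code: `target_next` and `draft_next` are hypothetical stand-ins for the main and speculative models, the activation hand-off and RPC hops between instance_A and instance_B are elided, and acceptance is the usual greedy rule (accept draft tokens until the first mismatch, then take the main model's token).

```python
def target_next(token):
    # Stand-in for the main model: deterministic next-token rule.
    return token + 1

def draft_next(token):
    # Stand-in for the speculative model: agrees with the main model
    # except after every 4th token, where it guesses wrong.
    return token + 1 if token % 4 != 0 else token + 2

def speculate(last_token, n_draft):
    # instance_B drafts n_draft tokens ahead while the main model runs.
    out, t = [], last_token
    for _ in range(n_draft):
        t = draft_next(t)
        out.append(t)
    return out

def consolidate(last_token, draft):
    # Verify the draft against the main model's logits: accept matching
    # tokens, and on the first mismatch take the main model's token instead.
    accepted, t = [], last_token
    for d in draft:
        expect = target_next(t)
        accepted.append(expect if d != expect else d)
        t = accepted[-1]
        if d != expect:
            break
    return accepted

def generate(prompt_token, n_tokens, n_draft=4):
    # One iteration = one pass of the orchestration loop in the plan:
    # B speculates, the main model finishes, results are consolidated,
    # the approved tokens go back to A, repeat.
    tokens = [prompt_token]
    while len(tokens) - 1 < n_tokens:
        draft = speculate(tokens[-1], n_draft)
        tokens.extend(consolidate(tokens[-1], draft))
    return tokens[1 : n_tokens + 1]

print(generate(0, 8))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

With these toy models the output matches plain greedy decoding from the main model alone, which is the correctness property speculative decoding must preserve; the win in the real system is that accepted draft tokens cost one batched verification pass instead of one main-model pass each.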

Relevant links:

Devices I can test it on are:

  • M2 Ultra 192GB
  • M2 24GB
  • M1 16GB

