
Combine large LLM with small LLM for faster inference #630

Closed
@ggerganov

Description

So I was thinking about the following idea.
It is probably completely bogus, but I would definitely investigate it if and when I had the time, so maybe someone else will be interested as well.


A large LLM takes a lot of time to infer each token. Let's say it takes 500 ms per token.

A small LLM (or some other approach) can infer a token very fast. Let's say < 5 ms.

Let's assume that the small LLM is correct 80-90% of the time.

The idea is the following:

  1. Before I run the large LLM inference for the next token, I infer it using the small LLM.
  2. I now want to somehow partially evaluate the large LLM (let's say the first 10% of the layers) and get an approximate estimate for the next token.
  3. If this estimate indicates a high probability for the drafted token (i.e. above some threshold), we stop and directly emit it as the new token. At this point we would have spent only ~55 ms (5 ms for the small LLM + ~50 ms for the first 10% of the large LLM).
  4. Otherwise, we proceed to evaluate the rest of the layers of the large LLM.

In the described process, I would reach step 4 for only 10-20% of the tokens; for the rest, I would take the shortcut in step 3.
Hence, I would get efficient inference with the large LLM.
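
To make the loop concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `small_llm_next_token`, `large_llm_partial_probs`, and `large_llm_full_probs` are hypothetical stand-ins (stubbed out so the snippet runs), not real APIs, and the 0.9 threshold and 10% layer fraction just mirror the numbers above.

```python
import random

VOCAB = 32000  # assumed vocabulary size, for illustration only

# Hypothetical stand-ins for the real model calls; stubbed so the sketch runs.
def small_llm_next_token(tokens):
    # Step 1: cheap draft token from the small LLM (~5 ms assumed).
    return random.randrange(VOCAB)

def large_llm_partial_probs(tokens, layer_fraction):
    # Step 2: evaluate only the first ~10% of the large LLM's layers and read
    # off an approximate next-token distribution (~50 ms assumed).
    return [1.0 / VOCAB] * VOCAB

def large_llm_full_probs(tokens):
    # Step 4: full forward pass through the large LLM (~500 ms assumed).
    return [1.0 / VOCAB] * VOCAB

def generate(prompt_tokens, n_new, threshold=0.9):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        draft = small_llm_next_token(tokens)                          # step 1
        approx = large_llm_partial_probs(tokens, layer_fraction=0.1)  # step 2
        if approx[draft] >= threshold:
            tokens.append(draft)                                      # step 3: accept draft
            continue
        full = large_llm_full_probs(tokens)                           # step 4: full pass
        tokens.append(max(range(VOCAB), key=full.__getitem__))        # greedy argmax
    return tokens
```

If the shortcut fires for ~85% of tokens, the expected cost per token under the assumed timings would be roughly 0.85 × 55 ms + 0.15 × 505 ms ≈ 120 ms instead of 500 ms, i.e. about a 4x speedup — assuming step 2 works at all.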

Obviously, the biggest question is whether step 2 is possible at all.
I suppose the answer is "no", but who knows.
