Description
So I was thinking about the following idea.
It is probably completely bogus, but I would definitely investigate it if and when I had the time, so maybe someone else will be interested as well.
A large LLM takes a lot of time to perform token inference. Let's say it takes 500 ms per token.
A small LLM (or some other approach) can infer a token very fast. Let's say < 5 ms.
Let's assume that the small LLM is correct 80-90% of the time.
The idea is the following (a code sketch follows the list):

1. Before I run the large LLM inference for the next token, I infer it using the small LLM.
2. I partially evaluate the large LLM (let's say the first 10% of the layers) to get an approximate estimate for the next token.
3. If this estimate indicates a high probability for that token (i.e. above some threshold), we stop and directly emit it as the new token. At this point we would have consumed only 5 ms for the small LLM + ~50 ms for the large LLM.
4. Otherwise, we proceed to evaluate the rest of the layers of the large LLM.
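
Here is a minimal sketch of that loop in Python. Everything in it is hypothetical: the three model callables are random stubs standing in for the small draft model and for the first/remaining layers of the large model, and `THRESHOLD` is an assumed acceptance cutoff.

```python
import numpy as np

VOCAB = 32000      # vocabulary size (placeholder)
THRESHOLD = 0.9    # acceptance threshold for the partial estimate (assumption)

def small_model(context):
    """Stub for the small draft model (~5 ms per token in the idea)."""
    return int(np.random.randint(VOCAB))

def large_model_partial(context):
    """Stub for the first ~10% of the large model's layers (~50 ms).
    Returns an approximate next-token distribution and the hidden state."""
    probs = np.random.dirichlet(np.ones(VOCAB))
    hidden = np.random.randn(4096)
    return probs, hidden

def large_model_rest(hidden):
    """Stub for the remaining ~90% of the layers (~450 ms).
    Returns the exact next-token distribution."""
    return np.random.dirichlet(np.ones(VOCAB))

def decode_next_token(context):
    draft = small_model(context)                         # step 1: cheap draft
    approx_probs, hidden = large_model_partial(context)  # step 2: partial eval
    if approx_probs[draft] >= THRESHOLD:                 # step 3: confident -> shortcut
        return draft
    full_probs = large_model_rest(hidden)                # step 4: finish the layers
    return int(full_probs.argmax())
```

With random stubs the shortcut essentially never fires, of course; the point is only the control flow.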
In the described process, I would reach step 4 for only 10-20% of the tokens; for the rest, I would take the shortcut in step 3.
Hence, I will have efficient inference with the large LLM; a back-of-the-envelope estimate follows.
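
A rough sanity check of the claimed efficiency, assuming the shortcut fires for ~85% of tokens and that step 4 reuses the computation already done in step 2 (so the slow path costs roughly 5 + 500 ms in total):

```python
# Expected per-token latency with the numbers from above (all assumptions).
p_shortcut = 0.85
fast_path = 5 + 50    # small model + first 10% of the large model, in ms
slow_path = 5 + 500   # small model + the full large model, in ms
expected = p_shortcut * fast_path + (1 - p_shortcut) * slow_path
print(expected)       # 122.5 ms per token vs. 500 ms baseline, roughly 4x
```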
Obviously, the biggest question is whether step 2 is possible at all.
I suppose the answer is "no", but who knows.
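
One cheap way to get a feel for step 2 is a "logit lens" style readout: take the hidden state after an early layer, pass it through the model's final layer norm and unembedding, and see how much probability that assigns to the drafted token. A hedged sketch using GPT-2 from HuggingFace transformers purely as a stand-in (the layer index, the prompt, and using the full model's argmax as the "draft" are all assumptions):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

k = 2  # read out after an early block (gpt2 has 12) -- an arbitrary choice
h = model.transformer.ln_f(out.hidden_states[k][:, -1])  # final layer norm
early_probs = torch.softmax(model.lm_head(h), dim=-1)    # unembed early state

full_probs = torch.softmax(out.logits[:, -1], dim=-1)
draft = int(full_probs.argmax())  # stand-in for the small model's draft token
print(f"early-layer probability of the drafted token: "
      f"{early_probs[0, draft].item():.3f}")
```

If that probability is reliably high for easy tokens, the threshold test in step 3 has a chance of working; if it is near-uniform noise at 10% depth, the shortcut would rarely trigger.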