Description
So I was thinking about the following idea.
It is probably completely bogus, but I would definitely investigate it if and when I had the time, so maybe someone else will be interested as well.
A large LLM takes a lot of time to perform token inference. Let's say it takes 500 ms per token.
A small LLM (or some other approach) can infer a token very fast. Let's say < 5 ms.
Let's assume that the small LLM is correct 80-90% of the time.
The idea is the following (a code sketch follows the list):

1. Before I run the large LLM inference for the next token, I infer it using the small LLM.
2. I partially evaluate the large LLM (let's say the first 10% of the layers) to get an approximate estimate for the next token.
3. If this estimate indicates a high probability for that token (i.e. above some threshold), we stop and directly emit it as the new token. At this point we would have consumed only 5 ms for the small LLM + ~50 ms for the large LLM.
4. Otherwise, we proceed to evaluate the rest of the layers of the large LLM.
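
Here is a minimal sketch of that loop in Python. Everything in it is hypothetical: the three model callables are random stubs standing in for the small draft model and for the first/remaining layers of the large model, and `THRESHOLD` is an assumed acceptance cutoff.

```python
import numpy as np

VOCAB = 32000      # vocabulary size (placeholder)
THRESHOLD = 0.9    # acceptance threshold for the partial estimate (assumption)

def small_model(context):
    """Stub for the small draft model (~5 ms per token in the idea)."""
    return int(np.random.randint(VOCAB))

def large_model_partial(context):
    """Stub for the first ~10% of the large model's layers (~50 ms).
    Returns an approximate next-token distribution and the hidden state."""
    probs = np.random.dirichlet(np.ones(VOCAB))
    hidden = np.random.randn(4096)
    return probs, hidden

def large_model_rest(hidden):
    """Stub for the remaining ~90% of the layers (~450 ms).
    Returns the exact next-token distribution."""
    return np.random.dirichlet(np.ones(VOCAB))

def decode_next_token(context):
    draft = small_model(context)                         # step 1: cheap draft
    approx_probs, hidden = large_model_partial(context)  # step 2: partial eval
    if approx_probs[draft] >= THRESHOLD:                 # step 3: confident -> shortcut
        return draft
    full_probs = large_model_rest(hidden)                # step 4: finish the layers
    return int(full_probs.argmax())
```

With random stubs the shortcut essentially never fires, of course; the point is only the control flow.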
In the described process, I would reach step 4 for only 10-20% of the tokens; for the rest, I would take the shortcut in step 3.
Hence, I will have efficient inference with the large LLM; a back-of-the-envelope estimate follows.
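
A rough sanity check of the claimed efficiency, assuming the shortcut fires for ~85% of tokens and that step 4 reuses the computation already done in step 2 (so the slow path costs roughly 5 + 500 ms in total):

```python
# Expected per-token latency with the numbers from above (all assumptions).
p_shortcut = 0.85
fast_path = 5 + 50    # small model + first 10% of the large model, in ms
slow_path = 5 + 500   # small model + the full large model, in ms
expected = p_shortcut * fast_path + (1 - p_shortcut) * slow_path
print(expected)       # 122.5 ms per token vs. 500 ms baseline, roughly 4x
```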
Obviously, the biggest question is whether step 2 is possible at all.
I suppose the answer is "no", but who knows.
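
One cheap way to get a feel for step 2 is a "logit lens" style readout: take the hidden state after an early layer, pass it through the model's final layer norm and unembedding, and see how much probability that assigns to the drafted token. A hedged sketch using GPT-2 from HuggingFace transformers purely as a stand-in (the layer index, the prompt, and using the full model's argmax as the "draft" are all assumptions):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

k = 2  # read out after an early block (gpt2 has 12) -- an arbitrary choice
h = model.transformer.ln_f(out.hidden_states[k][:, -1])  # final layer norm
early_probs = torch.softmax(model.lm_head(h), dim=-1)    # unembed early state

full_probs = torch.softmax(out.logits[:, -1], dim=-1)
draft = int(full_probs.argmax())  # stand-in for the small model's draft token
print(f"early-layer probability of the drafted token: "
      f"{early_probs[0, draft].item():.3f}")
```

If that probability is reliably high for easy tokens, the threshold test in step 3 has a chance of working; if it is near-uniform noise at 10% depth, the shortcut would rarely trigger.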