GitHub repo: https://github.com/dilab-zju/self-speculative-decoding
It drafts with only a partial set of the model's own layers and achieves about a 1.78x speedup. No separate draft model is required; the only thing that needs care is the KV cache.
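To make "partial layers" concrete, here is a minimal, hypothetical PyTorch sketch (not the repo's actual code): the draft pass reuses the target model but runs only the first few transformer blocks, while verification runs the full stack through the same shared output head.

```python
import torch
import torch.nn as nn

class SelfDraftDecoder(nn.Module):
    """Toy decoder whose draft pass skips the tail layers. All names and
    sizes are illustrative; causal masking is omitted for brevity."""

    def __init__(self, vocab=100, dim=32, num_layers=8, draft_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        self.head = nn.Linear(dim, vocab)
        self.draft_layers = draft_layers

    def forward(self, ids: torch.Tensor, draft: bool = False) -> torch.Tensor:
        # Drafting runs a shallow prefix of the same network, so no
        # separate draft model (or second set of weights) is needed.
        n = self.draft_layers if draft else len(self.blocks)
        h = self.embed(ids)
        for block in self.blocks[:n]:
            h = block(h)
        return self.head(h)  # shared output head for draft and verify

model = SelfDraftDecoder()
ids = torch.randint(0, 100, (1, 5))
draft_logits = model(ids, draft=True)    # cheap guess
verify_logits = model(ids, draft=False)  # full-depth check
```

Because the draft and verify passes share the same weights and activation shapes, there is no extra memory cost for a second model.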
It also seems to support sampling-based decoding. Because the draft and target passes share the same hidden size and intermediate size, the KV cache is reusable; the only extra work is reclaiming KV cache memory when tokens are rejected.
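The rejection path then reduces to truncating each layer's cache back to the accepted prefix. Below is a minimal sketch of that reclaim step; `SequenceCache` and `rollback` are hypothetical names, not vLLM's actual API, and token ids stand in for real key/value tensors.

```python
class SequenceCache:
    """Per-sequence KV cache: one list of cached entries per layer."""

    def __init__(self, num_layers: int):
        self.layers = [[] for _ in range(num_layers)]

    def append(self, layer: int, entry: int) -> None:
        self.layers[layer].append(entry)

    def rollback(self, accepted_len: int) -> None:
        # After verification, every entry past the accepted prefix belongs
        # to a rejected draft token: free those slots in every layer.
        for entries in self.layers:
            del entries[accepted_len:]


cache = SequenceCache(num_layers=2)
for layer in range(2):
    for tok in [11, 22, 33, 44, 55]:    # 2 accepted + 3 drafted positions
        cache.append(layer, tok)

cache.rollback(accepted_len=3)          # verifier accepted one draft token
assert cache.layers == [[11, 22, 33], [11, 22, 33]]
```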
What will the future of vLLM speculative sampling look like? Is there a rough plan?
@cadedaniel @LiuXiaoxuanPKU
Hi @MeJerry215. Once #2188 is merged, self-speculative decoding can be added easily as a replacement for the draft model. Follow along in that PR for more details.