In order to enable Llama3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2. This new version of transformers refactored the KV cache into a more efficient `Cache`-based implementation, and adopting it would have required refactoring `forward_early(...)` and `forward_remainder(...)` in `self_speculation/llama_model_utils.py`. Instead, we opted to keep using the less efficient legacy KV cache (a tuple of per-layer key/value tensors).
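For reference, a minimal sketch of the bridge transformers provides at the legacy boundary, which is roughly what sticking with the legacy format amounts to (tensor shapes are illustrative assumptions):

```python
import torch
from transformers import DynamicCache

# Legacy format: a tuple over layers of (key, value) tensors,
# each shaped (batch, num_heads, seq_len, head_dim).
legacy_past = tuple(
    (torch.zeros(1, 8, 16, 64), torch.zeros(1, 8, 16, 64))
    for _ in range(2)
)

# Bridge into the new Cache API and back out again.
cache = DynamicCache.from_legacy_cache(legacy_past)
print(cache.get_seq_length())   # 16
legacy_again = cache.to_legacy_cache()
```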
To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well. Ideally, we should update `forward_early(...)` and `forward_remainder(...)` to use transformers' new, more efficient KV cache implementation, as sketched below.
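As a rough sketch of what that refactor could look like (hypothetical helper, not the actual `forward_early(...)` / `forward_remainder(...)` signatures): with the new API, each decoder layer appends its keys/values via `Cache.update(...)` and attends over the returned full sequence, instead of threading legacy tuples through the forward passes. Since both functions would share one cache object indexed by `layer_idx`, the early-exit pass and the remainder pass could reuse the same entries.

```python
import torch
from transformers import DynamicCache

def attend_with_cache(key_states, value_states, layer_idx, cache):
    # Hypothetical helper: append this layer's new keys/values to the
    # cache and get back the full key/value sequence to attend over.
    keys, values = cache.update(key_states, value_states, layer_idx)
    return keys, values

cache = DynamicCache()
k = torch.randn(1, 8, 4, 64)  # (batch, num_heads, new_tokens, head_dim)
v = torch.randn(1, 8, 4, 64)
keys, values = attend_with_cache(k, v, layer_idx=0, cache=cache)
print(keys.shape)  # torch.Size([1, 8, 4, 64]); grows as tokens are appended
```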