In order to enable Llama3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2. This new version of transformers refactored the KV cache into a more efficient `Cache`-based implementation, and adopting it would have required refactoring `forward_early(...)` and `forward_remainder(...)` in `self_speculation/llama_model_utils.py`. Instead, we opted to keep using the less efficient legacy KV cache (a tuple of per-layer key/value tensors).
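For reference, a minimal sketch of the bridge transformers provides at the legacy boundary, which is roughly what sticking with the legacy format amounts to (tensor shapes are illustrative assumptions):

```python
import torch
from transformers import DynamicCache

# Legacy format: a tuple over layers of (key, value) tensors,
# each shaped (batch, num_heads, seq_len, head_dim).
legacy_past = tuple(
    (torch.zeros(1, 8, 16, 64), torch.zeros(1, 8, 16, 64))
    for _ in range(2)
)

# Bridge into the new Cache API and back out again.
cache = DynamicCache.from_legacy_cache(legacy_past)
print(cache.get_seq_length())   # 16
legacy_again = cache.to_legacy_cache()
```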
To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well. Ideally, we should update `forward_early(...)` and `forward_remainder(...)` to use transformers' new, more efficient KV cache implementation, as sketched below.
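As a rough sketch of what that refactor could look like (hypothetical helper, not the actual `forward_early(...)` / `forward_remainder(...)` signatures): with the new API, each decoder layer appends its keys/values via `Cache.update(...)` and attends over the returned full sequence, instead of threading legacy tuples through the forward passes. Since both functions would share one cache object indexed by `layer_idx`, the early-exit pass and the remainder pass could reuse the same entries.

```python
import torch
from transformers import DynamicCache

def attend_with_cache(key_states, value_states, layer_idx, cache):
    # Hypothetical helper: append this layer's new keys/values to the
    # cache and get back the full key/value sequence to attend over.
    keys, values = cache.update(key_states, value_states, layer_idx)
    return keys, values

cache = DynamicCache()
k = torch.randn(1, 8, 4, 64)  # (batch, num_heads, new_tokens, head_dim)
v = torch.randn(1, 8, 4, 64)
keys, values = attend_with_cache(k, v, layer_idx=0, cache=cache)
print(keys.shape)  # torch.Size([1, 8, 4, 64]); grows as tokens are appended
```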