Support e4m3fn KV cache #2655

Merged
merged 2 commits, Oct 17, 2024
Changes from 1 commit
Make check more obvious
danieldk committed Oct 16, 2024
commit 751f1bb8154fd4fe3a36a4b128c5f830dd0effa2
6 changes: 2 additions & 4 deletions server/text_generation_server/layers/attention/kv_cache.py
@@ -24,10 +24,8 @@ def __init__(
     ):
         """Construct the key-value cache for a layer."""

-        if (
-            dtype.itemsize == 1
-            and dtype.is_floating_point
-            and (ATTENTION != "flashinfer" or SYSTEM != "cuda")
+        if dtype in {torch.float8_e5m2, torch.float8_e4m3fn} and (
+            ATTENTION != "flashinfer" or SYSTEM != "cuda"
         ):
             raise ValueError(
                 "FP8 KV cache is currently only supported for flashinfer on CUDA"
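For readers skimming the diff, below is a self-contained sketch of the check this commit introduces. It is an illustration, not the actual server module: the ATTENTION and SYSTEM constants and the check_fp8_kv_cache_support helper are stand-ins assumed for this example; only the condition inside mirrors the new code.

# Standalone sketch of the FP8 KV-cache dtype check (illustrative only).
# ATTENTION and SYSTEM are assumed placeholders for the module-level
# constants used in text_generation_server.
import torch

ATTENTION = "flashinfer"  # attention backend in use (placeholder value)
SYSTEM = "cuda"           # system the server runs on (placeholder value)


def check_fp8_kv_cache_support(dtype: torch.dtype) -> None:
    # FP8 KV caches are only supported when flashinfer runs on CUDA.
    if dtype in {torch.float8_e5m2, torch.float8_e4m3fn} and (
        ATTENTION != "flashinfer" or SYSTEM != "cuda"
    ):
        raise ValueError(
            "FP8 KV cache is currently only supported for flashinfer on CUDA"
        )


# With the placeholder values above, both FP8 dtypes are accepted;
# non-FP8 dtypes such as torch.float16 pass the check regardless of backend.
check_fp8_kv_cache_support(torch.float8_e4m3fn)
check_fp8_kv_cache_support(torch.float16)

The replaced condition (dtype.itemsize == 1 and dtype.is_floating_point) matched any 1-byte floating-point dtype; enumerating torch.float8_e5m2 and torch.float8_e4m3fn spells out exactly which dtypes the KV cache supports, which is what the commit message "Make check more obvious" refers to.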