Add basic FP8 KV cache support #2603

danieldk · 2024-10-02T15:14:42Z

What does this PR do?

This change adds rudimentary FP8 KV cache support. The support is enabled by passing --kv-cache-dtype fp8_e5m2 to the launcher, which makes the KV cache use this type. However, support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.
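
To give a rough sense of the precision the `fp8_e5m2` type offers, here is a minimal standard-library sketch that rounds a value to E5M2 and back. It is an illustration only, not code from this PR: the helper names `to_fp8_e5m2` and `from_fp8_e5m2` are hypothetical, the conversion goes through float16 (so it rounds twice and rejects values outside the float16 range), and it assumes a plain cast with no scaling factor.

```python
import struct


def to_fp8_e5m2(x: float) -> int:
    """Return the FP8 E5M2 bit pattern (0..255) for x, via float16."""
    # E5M2 shares float16's sign bit and 5 exponent bits and keeps only the
    # top 2 of its 10 mantissa bits, so it is the upper byte of a float16.
    (half_bits,) = struct.unpack("<H", struct.pack("<e", x))
    upper, dropped = half_bits >> 8, half_bits & 0xFF
    # Round to nearest, ties to even, on the 8 dropped mantissa bits.
    if dropped > 0x80 or (dropped == 0x80 and (upper & 1)):
        upper += 1
    return upper & 0xFF


def from_fp8_e5m2(bits: int) -> float:
    """Decode an FP8 E5M2 bit pattern back into a Python float."""
    (value,) = struct.unpack("<e", struct.pack("<H", bits << 8))
    return value


# Print each input next to the value that survives the round trip.
for x in (1.0, 1.2, 3.3, -0.07):
    print(x, "->", from_fp8_e5m2(to_fp8_e5m2(x)))
```

With only two mantissa bits, the representable values between 1 and 2 are 1.0, 1.25, 1.5 and 1.75, which is why scale loading matters for follow-up work.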

This PR is intentionally small to keep things reviewable. I'll follow it up with PRs that add more functionality.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-10-02T15:16:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Narsil

Looks great, and actually a relatively simple change!

Narsil · 2024-10-04T09:17:30Z

server/text_generation_server/layers/attention/kv_cache.py

+            key_cache.view(-1, shape[-2], shape[-1])[slots] = key
+            value_cache.view(-1, shape[-2], shape[-1])[slots] = value
+        else:
+            reshape_and_cache(key, value, key_cache, value_cache, slots, "auto", 1.0)


"auto", 1.0? What are those flags? They didn't seem to be used before. Aren't they defaulted in paged?

Specifying them here breaks IPEX, no?

Ah, yes, the others have these arguments, but not IPEX, so I reverted this part of the PR (the KV cache now uses the existing reshape_and_cache wrappers).
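
For readers unfamiliar with the slot-indexed write in the first branch of the diff above, here is a minimal pure-Python sketch of the idea. It assumes the cache's leading dimensions are (number of blocks, block size), so flattening them gives one entry per slot; the function name `store_by_slots` and the nested-list cache are hypothetical stand-ins for the tensor operations, not the PR's implementation.

```python
def store_by_slots(cache, slots, values):
    """Write values[i] into the cache entry addressed by the flat index slots[i]."""
    block_size = len(cache[0])
    for slot, value in zip(slots, values):
        # A flat slot index maps to (block, offset within that block).
        block, offset = divmod(slot, block_size)
        cache[block][offset] = value


# 2 blocks of 4 slots each. In the real cache every entry would hold one
# token's [heads, head_size] data; strings stand in for that here.
cache = [[None] * 4 for _ in range(2)]
store_by_slots(cache, slots=[1, 6], values=["k_tok0", "k_tok1"])
print(cache)
```

The tensor version does the same thing in one step by viewing the cache with its leading dimensions flattened and assigning at the `slots` indices.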

drbh · 2024-10-04T14:56:43Z

server/text_generation_server/models/custom_modeling/flash_rw_modeling.py

+        kv_cache.store(
+            key=kv[:, :, 0].contiguous(), value=kv[:, :, 1].contiguous(), slots=slots


Not important, but it's strange that we need the .contiguous() calls here.

Yeah, I wasn't sure if it was needed. There seems to be non-contiguous key/value striding in other places, but I also didn't want to touch it.
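
As background on why a slice such as kv[:, :, 0] is not contiguous, here is a small stride calculation in plain Python. It assumes, purely for illustration, that kv has shape (tokens, heads, 2, head_size) in row-major order; the helper names are hypothetical and the contiguity check is simplified (it ignores size-1 dimensions).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed array of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def select(shape, strides, dim):
    """Shape and strides after picking one index along dim, like t[:, :, 0]."""
    return shape[:dim] + shape[dim + 1:], strides[:dim] + strides[dim + 1:]


def is_contiguous(shape, strides):
    """Simplified check: strides must match the dense row-major strides."""
    return strides == contiguous_strides(shape)


kv_shape = (5, 8, 2, 64)  # assumed (tokens, heads, key/value, head_size)
kv_strides = contiguous_strides(kv_shape)
key_shape, key_strides = select(kv_shape, kv_strides, dim=2)
print(key_shape, key_strides, is_contiguous(key_shape, key_strides))
```

Under this assumed layout the key slice still steps over the interleaved value data, so any code that expects densely packed keys needs a copy first.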

drbh

LGTM! Great addition

danieldk force-pushed the feature/fp8-kv-cache branch 2 times, most recently from 2628268 to 37df2ff Compare October 3, 2024 11:12

Narsil previously approved these changes Oct 4, 2024

danieldk dismissed Narsil’s stale review via 4cc5405 October 4, 2024 13:20

danieldk force-pushed the feature/fp8-kv-cache branch from 37df2ff to 4cc5405 Compare October 4, 2024 13:20

danieldk added 2 commits October 4, 2024 13:24

Fix Cargo.toml

ed5c2fb

danieldk force-pushed the feature/fp8-kv-cache branch from 4cc5405 to ed5c2fb Compare October 4, 2024 13:25

danieldk mentioned this pull request Oct 4, 2024

Simplify the attention function #2609

Merged

5 tasks

drbh reviewed Oct 4, 2024

drbh approved these changes Oct 4, 2024

danieldk merged commit 2358c2b into main Oct 4, 2024
12 of 13 checks passed

danieldk deleted the feature/fp8-kv-cache branch October 4, 2024 15:51

Narsil mentioned this pull request Oct 8, 2024

Add FP8 KVCache support #2028

Closed

4 tasks
