
[GPU] KV-cache compression support #27114

Draft · wants to merge 1 commit into base: master

Conversation

sshlyapn
Contributor

Details:

This PR adds KV-cache compression support.
Currently, only combinations of the following configuration options are supported:

  • Data types: INT8_SYM / INT8_ASYM
  • Modes: per-token (all num_heads * head_size values of a token are quantized as a single group) / per-token-per-head (each head's head_size values are quantized as a separate group for each token)
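To make the two grouping modes concrete, here is a minimal, hypothetical sketch of symmetric INT8 quantization of one token's KV values (this helper is not part of the PR; the group size alone distinguishes per-token from per-token-per-head):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): INT8_SYM quantization of one token.
// group_size selects the granularity:
//   per-token:          group_size = num_heads * head_size (one scale per token)
//   per-token-per-head: group_size = head_size             (one scale per head)
struct QuantResult {
    std::vector<int8_t> data;   // quantized values
    std::vector<float> scales;  // one scale per group
};

QuantResult quantize_int8_sym(const std::vector<float>& token, std::size_t group_size) {
    QuantResult out;
    out.data.resize(token.size());
    // token.size() is assumed to be divisible by group_size
    for (std::size_t g = 0; g < token.size(); g += group_size) {
        float max_abs = 0.f;
        for (std::size_t i = g; i < g + group_size; ++i)
            max_abs = std::max(max_abs, std::fabs(token[i]));
        // Symmetric: no zero point, values map to [-127, 127]
        float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
        out.scales.push_back(scale);
        for (std::size_t i = g; i < g + group_size; ++i)
            out.data[i] = static_cast<int8_t>(std::lround(token[i] / scale));
    }
    return out;
}
```

With num_heads = 2 and head_size = 4, per-token mode produces one scale for all 8 values, while per-token-per-head produces one scale per group of 4.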

Tickets:

  • ticket-id

@sshlyapn sshlyapn added this to the 2024.5 milestone Oct 18, 2024
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 18, 2024
DynamicQuantize(const Output<Node>& data,
                const QuantizationConfig& config,
                const std::vector<uint64_t>& scales_zp_output_order = {},
                const bool combine_scales_and_zp = false);
Contributor:
I'd suggest adding this interleaved/planar mode for scales and zero points as part of the common operation
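For context on the suggestion, here is one plausible reading of what the combine_scales_and_zp flag could control (this interpretation and the helper below are assumptions for illustration, not the PR's implementation): planar keeps scales and zero points as two separate buffers, while interleaved combines them into one buffer of (scale, zero point) pairs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the two layouts for N per-group scales (s)
// and zero points (z):
//   planar:      [s0, s1, ..., sN-1] and [z0, z1, ..., zN-1]  (two outputs)
//   interleaved: [s0, z0, s1, z1, ..., sN-1, zN-1]            (one output)
std::vector<float> interleave(const std::vector<float>& scales,
                              const std::vector<float>& zero_points) {
    std::vector<float> combined;
    combined.reserve(scales.size() * 2);
    for (std::size_t i = 0; i < scales.size(); ++i) {
        combined.push_back(scales[i]);
        combined.push_back(zero_points[i]);
    }
    return combined;
}
```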

Labels
category: GPU (OpenVINO GPU plugin), category: transformations (OpenVINO Runtime library - Transformations), under_perf_check
2 participants