
[GPU] KV-cache compression support #27114

Draft · wants to merge 1 commit into base: master

Conversation

sshlyapn
Contributor

Details:

This PR adds KV-cache compression support.
Currently, only combinations of the following configuration options are supported:

  • Data types: INT8_SYM / INT8_ASYM
  • Modes: per-token (all num_heads * head_size values of a token are quantized as a single group) / per-token-per-head (each head's head_size values are quantized as a separate group for each token)
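To make the two grouping modes concrete, here is a minimal, hypothetical sketch of symmetric INT8 quantization of one token's KV values (this helper is not part of the PR; the group size alone distinguishes per-token from per-token-per-head):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): INT8_SYM quantization of one token.
// group_size selects the granularity:
//   per-token:          group_size = num_heads * head_size (one scale per token)
//   per-token-per-head: group_size = head_size             (one scale per head)
struct QuantResult {
    std::vector<int8_t> data;   // quantized values
    std::vector<float> scales;  // one scale per group
};

QuantResult quantize_int8_sym(const std::vector<float>& token, std::size_t group_size) {
    QuantResult out;
    out.data.resize(token.size());
    // token.size() is assumed to be divisible by group_size
    for (std::size_t g = 0; g < token.size(); g += group_size) {
        float max_abs = 0.f;
        for (std::size_t i = g; i < g + group_size; ++i)
            max_abs = std::max(max_abs, std::fabs(token[i]));
        // Symmetric: no zero point, values map to [-127, 127]
        float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
        out.scales.push_back(scale);
        for (std::size_t i = g; i < g + group_size; ++i)
            out.data[i] = static_cast<int8_t>(std::lround(token[i] / scale));
    }
    return out;
}
```

With num_heads = 2 and head_size = 4, per-token mode produces one scale for all 8 values, while per-token-per-head produces one scale per group of 4.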

Tickets:

  • ticket-id

@sshlyapn sshlyapn added this to the 2024.5 milestone Oct 18, 2024
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 18, 2024
DynamicQuantize(const Output<Node>& data,
                const QuantizationConfig& config,
                const std::vector<uint64_t>& scales_zp_output_order = {},
                const bool combine_scales_and_zp = false);
Contributor:
I'd suggest adding this interleaved/planar mode for scales and zero points as part of the common operation
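For context on the suggestion, here is one plausible reading of what the combine_scales_and_zp flag could control (this interpretation and the helper below are assumptions for illustration, not the PR's implementation): planar keeps scales and zero points as two separate buffers, while interleaved combines them into one buffer of (scale, zero point) pairs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the two layouts for N per-group scales (s)
// and zero points (z):
//   planar:      [s0, s1, ..., sN-1] and [z0, z1, ..., zN-1]  (two outputs)
//   interleaved: [s0, z0, s1, z1, ..., sN-1, zN-1]            (one output)
std::vector<float> interleave(const std::vector<float>& scales,
                              const std::vector<float>& zero_points) {
    std::vector<float> combined;
    combined.reserve(scales.size() * 2);
    for (std::size_t i = 0; i < scales.size(); ++i) {
        combined.push_back(scales[i]);
        combined.push_back(zero_points[i]);
    }
    return combined;
}
```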

Labels
category: GPU (OpenVINO GPU plugin), category: transformations (OpenVINO Runtime library - Transformations), under_perf_check
2 participants