[CPU] Add per-token asym INT8 dynamic quantization support to QKV/MLP node #27001

usstq · 2024-10-11T02:09:19Z

Details:

To use AMX-INT8 to boost the performance of QKV/MLP layers in LLMs, we need dynamic per-token INT8 quantization, as described in many research papers (SmoothQuant, for example) and used in reference implementations (xFT, for example). This PR:
- adds that support, which applies when both of the following conditions are met:
  - the platform supports AMX-INT8
  - the QKV and MLP weights are symmetrically quantized to INT8 per output channel (per-OC)
- adds AMX-FP16 support to QKV/MLP layers on the GNR platform
- optimizes the single-batch second-token special case
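
For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of per-token asymmetric INT8 activation quantization combined with symmetric per-OC INT8 weights. It is an illustration only, not the kernel code in this PR: the function names are hypothetical, the rounding and range conventions (extending each token's range to include zero, clamping weights to [-127, 127]) are assumptions, and the real implementation runs the integer dot products on AMX tiles.

```python
def quantize_per_token_asym(x):
    """Quantize each row (token) of x to uint8 with its own scale and zero point."""
    q, scales, zero_points = [], [], []
    for row in x:
        # Assumption: extend the range to include 0 so the zero point fits in [0, 255].
        lo, hi = min(min(row), 0.0), max(max(row), 0.0)
        scale = (hi - lo) / 255.0
        if scale == 0.0:
            scale = 1.0  # all-zero row: any scale works
        zp = round(-lo / scale)
        q.append([max(0, min(255, round(v / scale) + zp)) for v in row])
        scales.append(scale)
        zero_points.append(zp)
    return q, scales, zero_points


def quantize_per_oc_sym(w):
    """Quantize each row (output channel) of w to int8 with a symmetric scale."""
    q, scales = [], []
    for row in w:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0.0 else 1.0
        q.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q, scales


def int8_linear(x, w):
    """Compute x @ w.T using integer accumulation plus zero-point compensation."""
    qx, sx, zp = quantize_per_token_asym(x)
    qw, sw = quantize_per_oc_sym(w)
    w_sums = [sum(row) for row in qw]  # can be precomputed: weights are static
    out = []
    for m, x_row in enumerate(qx):
        out_row = []
        for n, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer dot product
            # sum((qx - zp) * qw) == acc - zp * sum(qw)
            out_row.append(sx[m] * sw[n] * (acc - zp[m] * w_sums[n]))
        out.append(out_row)
    return out


x = [[0.5, -1.0, 2.0, 0.25], [3.0, 1.0, 0.0, -0.5]]  # 2 tokens, K = 4
w = [[0.1, -0.2, 0.3, 0.4], [-0.5, 0.25, 0.0, 1.0]]  # 2 output channels
reference = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
approx = int8_linear(x, w)
```

Because the weights are symmetric, the only correction the asymmetric activations require is the `zp * sum(qw)` term, and the per-channel weight sums can be computed once ahead of time. The per-token scale and zero point are what make the quantization "dynamic": they are derived from each row of activations at run time.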

Tickets:

ticket-id

usstq · 2024-10-11T02:17:13Z

Hi @dmitry-gorokhov, this is only a draft because some design decisions are still unresolved, such as how to conditionally enable dynamic quantization, possibly on a per-layer basis. Please take a look and comment. Thanks!

luo-cheng2021 and others added 8 commits September 30, 2024 10:10

small batch opt

48ced7f

fix that k is not a multiple of 256

39b8410

add qkv/mlp quantized mode

ecb1c31

fix quantize on nan float number

18665ab

fix batch consistency bug in qkvproj

a10b313

MLP&QKV support amx-f16

84a41fd

remove linux_perf

a313502

add QKV/MLP dynamic quantization tests

dbf235f

github-actions bot added the "category: CPU" (OpenVINO CPU plugin) and "category: build" (OpenVINO cmake script / infra) labels Oct 11, 2024

dmitry-gorokhov self-assigned this Oct 11, 2024

usstq added 6 commits October 12, 2024 10:40

fix coding style

c5f9d5b

Support LLM with qkv already fused like phi3

16472f6

Support LLM with gate-up already fused like phi3

36e1691

Support dynamic-quant for combined qkv/mlp in LLM like phi3

eaec512

Update

b59ae1e

using DYNAMIC_QUANTIZATION_GROUP_SIZE to enable per-token quant

3c2dde9

usstq marked this pull request as ready for review October 18, 2024 08:17

usstq requested review from a team as code owners October 18, 2024 08:17

fix CI error

e09bf2d
