[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

zhyncs · 2024-04-19T05:20:42Z

Motivation

We plan to add support for W8A8 SmoothQuant or FP8 KV Cache on TurboMind. There is currently no clear decision on which one to prioritize first. We would like to understand how the community judges the priority of these two options and if any advice can be provided. Thanks. @lvhan028 @lzhangzz @grimoire @irexyc cc @ispobock

Related resources

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

zhyncs · 2024-04-19T05:23:02Z

We can currently do development and testing FP8 on L40.

zhyncs · 2024-04-19T06:36:42Z

Considering that FP8 has a significant advantage in precision compared to Int8, this means that we are more likely to use it in actual online serving compared to Int8.

Refer to the following blog posts:
https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
https://blog.fireworks.ai/fireattention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs-a29a85ad28d0

Currently, we are inclined to conduct research on FP8 internally first, and then decide which feature to work on. Do you have any suggestions?

lzhangzz · 2024-04-19T12:46:29Z

FP8 KV cache will be a lot more easier. You will need to add some template specialization for type conversion and some code for dispatching the kernels.

zhyncs · 2024-04-19T12:53:11Z

FP8 KV cache will be a lot more easier. You will need is to add some template specialization for type conversion and some code for dispatching the kernels.

Exactly we plan to reference the implementation of kv cache online Int8.

zhyncs · 2024-04-23T10:01:32Z

Hi all. After internal discussions, we plan to start the development work related to FP8 in May. Please stay tuned. Cheers.

zhyncs · 2024-04-24T03:09:06Z

The following blogs and documents are not directly related to FP8 KV Cache, they are mainly related to FP8 Attention, but also bring us some inspiration. The format used by FriendliAI for implementing FP8 Attention is E4M3. And ColfaxResearch provided an implementation reference for type conversion.

https://friendli.ai/blog/weight-activation-quantization-fp8/
https://docs.friendli.ai/guides/container/efficient_inference_with_fp8/
https://research.colfax-intl.com/adding-fp8-to-flashattention/

zhyncs · 2024-04-24T03:21:03Z

We plan to add support for W8A8 SmoothQuant or FP8 KV Cache on TurboMind.

From https://friendli.ai/blog/quantization-reduce-llm-size/, it can be seen that currently, in terms of speed, SmoothQuant > AWQ > GPTQ and in terms of accuracy, AWQ > GPTQ > SmoothQuant. Among them, AWQ has achieved a good balance and has been efficiently implemented in LMDeploy. GPTQ, as well as AWQ, is W4A16 without any advantages. Our team achieved W8A8 on vLLM in the second half of last year vllm-project/vllm#1508. Due to precision issues, it is difficult for SmoothQuant to be used realistically in online environments. From this perspective, trying FP8 KV Cache on L40 can make it easier to use for online serving.

ispobock · 2024-05-10T01:40:07Z

I evaluated the KV Cache INT8 in llama2 and llama3 models and get the following results:

dataset	metrics	llama2-13b-chat	llama2-13b-chat-kvint8	llama3-8b	llama3-8b-kvint8	llama3-80b	llama3-80b-kvint8
ceval	naive_average	35.44	35.18	48.29	48.99	67.26	67.08
mmlu	naive_average	48.61	48.64	62.68	62.74	79.68	79.53
WiC	accuracy	35.58	35.42	0	0	29.78	28.06
WSC	accuracy	34.62	33.65	6.73	6.73	34.62	33.65
triviaqa	score	60.12	60.27	60.11	60.26	76.89	76.77
gsm8k	accuracy	42.61	41.77	56.71	56.63	90.07	90.67
race-middle	accuracy	46.38	46.17	33.84	32.87	93.11	93.04
race-high	accuracy	34.59	34.33	26.07	25.76	89.25	89.22

It seems that KV Cache INT8 can keep most of the accuracy, do we still need FP8 quantization for KV cache?
It's abnormal that the accuracy is 0 for llama3 on WiC dataset. Could you help check if I did the evaluation correctly?

Here is my evaluation steps:

# start server
lmdeploy serve api_server /workdir/llm_models/Meta-Llama-3-8B --server-name 0.0.0.0 --server-port 23333 --tp 1 --quant-policy 8

# start opencompass evaluation
python run.py configs/eval_internlm_chat_lmdeploy_apiserver.py -w outputs

@lvhan028 @lzhangzz Do you have any suggestions? cc: @zhyncs

lvhan028 · 2024-05-10T03:35:21Z

OpenCompass team said WiC and WSC can be neglected

ispobock · 2024-05-10T03:44:06Z

OpenCompass team said WiC and WSC can be neglected

OK, got it.

zhyncs · 2024-05-10T05:23:45Z

It seems that KV Cache INT8 can keep most of the accuracy, do we still need FP8 quantization for KV cache?

Due to the excellent performance improvement and negligible accuracy loss of the Online KV Cache Int8 currently implemented in LMDeploy, we are inclined not to proceed with FP8 KV Cache for now. The ROI is not very high for us. Is there any plan in the community to work on this? Looking forward to your reply. Thanks. @lvhan028 @lzhangzz

lzhangzz · 2024-05-11T07:36:35Z

We don't have plan to support FP8 KV cache, as the current INT8 implementation works just fine and it also works on pre sm_89 devices. (well the fact is that I don't even have a sm_89+ device to start with)

We seek to improve the accuracy of current INT8/INT4 implementations by more advanced quantization methods.

zhyncs closed this as completed May 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

zhyncs commented Apr 19, 2024 •

edited

Loading

zhyncs commented Apr 19, 2024

zhyncs commented Apr 19, 2024

lzhangzz commented Apr 19, 2024 •

edited

Loading

zhyncs commented Apr 19, 2024

zhyncs commented Apr 23, 2024

zhyncs commented Apr 24, 2024

zhyncs commented Apr 24, 2024

ispobock commented May 10, 2024

lvhan028 commented May 10, 2024

ispobock commented May 10, 2024

zhyncs commented May 10, 2024

lzhangzz commented May 11, 2024

[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

Comments

zhyncs commented Apr 19, 2024 • edited Loading

Motivation

Related resources

Additional context

zhyncs commented Apr 19, 2024

zhyncs commented Apr 19, 2024

lzhangzz commented Apr 19, 2024 • edited Loading

zhyncs commented Apr 19, 2024

zhyncs commented Apr 23, 2024

zhyncs commented Apr 24, 2024

zhyncs commented Apr 24, 2024

ispobock commented May 10, 2024

lvhan028 commented May 10, 2024

ispobock commented May 10, 2024

zhyncs commented May 10, 2024

lzhangzz commented May 11, 2024

zhyncs commented Apr 19, 2024 •

edited

Loading

lzhangzz commented Apr 19, 2024 •

edited

Loading