
[LLM] Support block_attention/cachekv quant for llama #7649

Merged
merged 15 commits into PaddlePaddle:develop
Jan 10, 2024

Conversation

RichardWooSJTU
Contributor

@RichardWooSJTU commented Dec 14, 2023

PR types

New features

PR changes

Others

Description

  1. Support block attention. Disabled by default; enable it by setting --block_attn.
  2. Support cache KV quantization. Disabled by default; set --use_cachekv_int8 static for static quantization or --use_cachekv_int8 dynamic for dynamic quantization. A usage sketch follows this list.
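For reference, a minimal usage sketch assembled from the flags quoted elsewhere in this PR; the model path and dtype are the illustrative values from the quoted benchmark.sh, and whether every flag combination is valid is not confirmed here:

python predictor.py \
    --model_name_or_path ./llama7b-inference_model_fp16 \
    --dtype float16 \
    --inference_model \
    --block_attn \
    --use_cachekv_int8 static

Replacing static with dynamic would select dynamic cache KV quantization instead.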


paddle-bot bot commented Dec 14, 2023

Thanks for your contribution!


CLAassistant commented Dec 14, 2023

CLA assistant check
All committers have signed the CLA.


codecov bot commented Dec 14, 2023

Codecov Report

Attention: 422 lines in your changes are missing coverage. Please review.

Comparison is base (dab175b) 57.12% compared to head (8b91dc8) 56.95%.
Report is 5 commits behind head on develop.

Files Patch % Lines
...dlenlp/experimental/transformers/llama/modeling.py 0.00% 209 Missing ⚠️
...enlp/experimental/transformers/generation_utils.py 0.00% 100 Missing ⚠️
...erimental/transformers/fused_transformer_layers.py 0.00% 99 Missing ⚠️
paddlenlp/experimental/model_utils.py 12.50% 14 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7649      +/-   ##
===========================================
- Coverage    57.12%   56.95%   -0.17%     
===========================================
  Files          587      587              
  Lines        88190    88626     +436     
===========================================
+ Hits         50376    50479     +103     
- Misses       37814    38147     +333     


llm/export.sh Outdated



python -m paddle.distributed.launch \
Collaborator

This script can go under the llama model directory for now.

@@ -26,6 +26,7 @@
import paddle
Collaborator

As discussed, move it to the llama directory first; once all models are covered, migrate it later.

llm/read_res.py Outdated
@@ -0,0 +1,49 @@
import paddle
Collaborator

Same as above.

# --batch_size 2 \
# --inference_model \
# --quant_type ${quant_type} \
# --block_attn \
Collaborator

Same as above.

Contributor

This part of the script can be moved into the inference.md documentation.

--block_attn \
--inference_model \
--use_cachekv_int8 static

Collaborator

Same as above.

temperature,
model_kwargs,
)

Collaborator

Please also add PaddleNLP CI and Paddle CI coverage.

}
}

// Restore state for the recoverable query_id computed in the previous step
Contributor

Please switch this to an English comment.

llm/.gitignore Outdated
@@ -1,3 +1,6 @@

max_len.txt
Contributor

This is a configuration file from your local test environment; please revert this file.

llm/benchmark.sh Outdated
@@ -27,10 +27,10 @@ export FLAGS_cache_inference_while_scope=1
python predictor.py \
--model_name_or_path ./llama7b-inference_model_fp16 \
--dtype float16 \
--src_length 300 \
--max_length 100 \
--src_length ${total_len} \
Contributor

  1. Semantically, total_len should be src_length + max_length, so shouldn't the variable names be kept consistent with the corresponding --name arguments?
  2. Please give src_length and max_length default values, e.g. src_length=${src_length:-300}; see the sketch below.
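A sketch of the suggested defaults, assuming the variable names from this comment and the quoted benchmark.sh (final values are up to the author):

src_length=${src_length:-300}
max_length=${max_length:-100}
# total_len is the prompt length plus the generation length
total_len=$((src_length + max_length))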

llm/export.sh Outdated
--src_length ${total_len} \
--block_attn \
--quant_type ${quant_type} \
--use_cachekv_int8 static
Contributor

My suggestion is to move this script into the ./inference.md documentation and add a new cachekv section there to describe it.

llm/predictor.py Outdated
@@ -703,6 +723,526 @@ def _infer(self, inputs: dict[str, paddle.Tensor]):
return None


class DygraphBlockInferencePredictor(BasePredictor):
Contributor

This class should probably inherit from InferencePredictorMixin, which is the mixin used for handling inference models.

Comment on lines +760 to +766
self.pre_caches = [
paddle.zeros(
[config.batch_size, self.num_attention_heads, self.pre_cache_length, self.head_dim],
dtype=self.dtype,
)
for _ in range(2 * self.num_layers)
]
Contributor

Has the pre_cache + cache_kv-int8 combination been tested here? If so, could you add a corresponding unit test?

# --batch_size 2 \
# --inference_model \
# --quant_type ${quant_type} \
# --block_attn \
Contributor

This part of the script can be moved into the inference.md documentation.

Comment on lines 410 to 411
print("scale_type: ", scale_type)
print("key_template: ", key_template)
Contributor

This is a base class, so the print statements here should be removed.

Comment on lines +746 to +751
def post_process(self, **kwargs):
time_step = kwargs.get("time_step", None)
multi_block_output = kwargs.get("multi_block_output", None)
cum_offsets = kwargs.get("cum_offsets", None)
seq_lens = kwargs.get("seq_lens", None)
input_ids = kwargs.get("input_ids", None)
Contributor

All of this code needs unit tests.

Comment on lines 839 to 840
# print("out_linear_out", out_linear_out)
# exit(0)
Contributor

Remove these.

else:
precache_kv_spec = None
use_cachekv_int8 = config.get("use_cachekv_int8", "None")
print("use_cachekv_int8", use_cachekv_int8)
Contributor

I won't point out the other print statements one by one; please clean those up as well.

@wawltor wawltor merged commit c5d8d5b into PaddlePaddle:develop Jan 10, 2024
7 of 9 checks passed