Add PP-MiniLM #1403
Conversation
PP-MiniLM combines distillation, pruning, quantization and high-performance inference techniques, offering high accuracy, fast inference and a small number of parameters:

- High accuracy: the 6-layer, 768-hidden-size model is more accurate than same-size models from Huawei and Tencent;
Don't use company names here; use model names.
# PP-MiniLM: a small Chinese model

PP-MiniLM is a small Chinese model whose architecture follows ERNIE. This example currently covers task-agnostic distillation of a six-transformer-layer model, plus pruning and quantization with PaddleSlim to further speed up inference.
Should the model introduction highlight our improvements over the MiniLMv2 strategy? @tianxin1860
As discussed offline, present it in the order 1. fast inference, 2. good model quality, 3. small parameter count, and cover our improvements over MiniLMv2 under the "good model quality" item.
| -------------------- | ------------- | ------ | ------- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| bert-base-chinese | 102.27M | | TODO | | | | | | | | |
| TinyBERT(6l-768d) | 59.7M | | 1.00x | 72.22 | 55.82 | 58.10 | 79.53 | 74.00 | 75.99 | 80.57 | 70.89 |
| Tencent RoBERTa 6l-768d | 59.7M | | 1.00x | 69.74 | 66.36 | 59.95 | 77.00 | 71.39 | 71.05 | 82.83 | 71.19 |
UER-py RoBERTa xxxx; drop the company name.
### Data

Internal Baidu business data, split into 64 files under the dataset directory.
Remove the mention of internal business data. Could the CLUESmall dataset be used as the data example instead?
Or just briefly describe how the data is organized.
The distillation method of PP-MiniLM:

Use the 20th layer of the large-size teacher model to distill the sample-wise relations between q and q, k and k, and v and v of the 6th layer of the 6-layer student model. That is, after unifying head_num, q, k and v are re-arranged,
What is the large-size teacher model? Should a concrete model be named as an example?
This paragraph needs to be rewritten and summarized.
Not especially clear.
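The q/k/v relation idea discussed above can be sketched as follows. This is a minimal NumPy illustration, not the repository's Paddle code; the batch size, sequence length, hidden sizes and the number of relation heads are all assumed values chosen only to show why unifying head_num makes teacher and student relation matrices directly comparable:

```python
import numpy as np

def self_relation(x, num_relation_heads):
    """Re-split hidden states into num_relation_heads and compute the
    scaled dot-product self-relation matrix [batch, heads, seq, seq]."""
    batch, seq_len, hidden = x.shape
    head_dim = hidden // num_relation_heads
    x = x.reshape(batch, seq_len, num_relation_heads, head_dim)
    x = x.transpose(0, 2, 1, 3)  # [batch, heads, seq, head_dim]
    return x @ x.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

# Illustrative sizes: teacher hidden=1024, student hidden=768, seq_len=4.
teacher_q = np.random.rand(1, 4, 1024)
student_q = np.random.rand(1, 4, 768)

# Re-splitting into the same number of relation heads (64 here) makes both
# relation matrices [1, 64, 4, 4], so teacher and student relations can be
# compared directly even though their per-head dims differ (16 vs 12).
r_teacher = self_relation(teacher_q, 64)
r_student = self_relation(student_q, 64)
```

In the actual method a distillation loss (e.g. KL divergence) is then computed between the two relation matrices; the sketch stops at the shape-compatibility step.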
### Performance testing

We run the inference/infer.py script on a single NVIDIA 16G T4 GPU to benchmark the quantized model.
NVIDIA Tesla T4 (the T4 only comes in 16G, so there is no need to call that out).
```shell
cd inference

python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt --collect_shape # generate the shape range info file
```
The collect shape step could use a stronger explanation: running once with --collect_shape records the tensor shape ranges that TensorRT dynamic-shape inference needs into a shape range info file, and a second run without the flag performs the actual inference.
### Requirements

This step depends on paddle 2.2.1. To see a more pronounced speedup, test on a T-series GPU (this example uses a T4). On V-series GPUs, which do not support the int8 tensor core, the speedup will fall short of the numbers in the tables of this document.
Pay attention to correct spelling and capitalization of English terms: Int8 Tensor Core.
For a more pronounced speedup, we recommend testing on an NVIDIA Tensor Core GPU (e.g. T4, A10, A100).
float32 inference script:

```shell
python infer.py --task_name ${task} --model_path $MODEL_PATH --use_trt --collect_shape
```
The collect shape step needs a standalone explanation; otherwise it will confuse users.
config.tensorrt_engine_enabled()))
if args.collect_shape:
    config.collect_shape_range_info(
        os.path.dirname(args.model_path) + "/" + args.task_name +
Paths should be assembled with the os.path.join API; hard-coding "/" breaks Windows compatibility.
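A hedged sketch of the suggested fix; the path and file-name suffix below are illustrative stand-ins, not the script's actual arguments:

```python
import os

# Hypothetical values mirroring infer.py's arguments (not the actual script).
model_path = os.path.join("quantization", "afqmc_quant_models", "int8")
task_name = "afqmc"

# os.path.join picks the platform's separator, so the same code works on
# Windows and Linux, unlike hard-coding "/" via string concatenation.
shape_info_file = os.path.join(
    os.path.dirname(model_path), task_name + "_shape_range_info.pbtxt")
```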
### Data

This experiment uses the classification datasets in CLUE; on Linux the dataset is downloaded automatically to `~/.paddlenlp/datasets/Clue/` when the script is launched.
Suggest: "This experiment is based on the CLUE dataset; running the Fine-tune script downloads it automatically to the *** directory."
Fine-tune the general model `GENERAL_MODEL_DIR` obtained from the first-stage general distillation over the following hyper-parameter ranges
Suggest: "Run a Grid Search over the following hyper-parameter ranges on the small model GENERAL_MODEL_DIR produced by the first-stage distillation."
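Such a Grid Search can be sketched as follows; the ranges below are placeholders (the actual ranges live in the fine-tuning script and are not reproduced here):

```python
import itertools

# Placeholder hyper-parameter ranges; the real ones are defined in the
# fine-tuning script for GENERAL_MODEL_DIR.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32, 64]

# Each (learning_rate, batch_size) pair corresponds to one fine-tuning run;
# the checkpoint with the best dev-set accuracy is kept.
grid = list(itertools.product(learning_rates, batch_sizes))
```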
cd ofa
```

In our experiments, with the model width compressed to 3/4 of the original, accuracy is essentially unchanged (-0.15).
Watch the wording: under the 6L768H setting, compressing the model width to 3/4 of the original is almost lossless in accuracy.
### Launch scripts for compression and distillation
Should we clarify the relationship between compression, pruning and distillation, and when each applies? Using "compression and distillation" in the title may mislead.
"cmnli": Accuracy,
"cluewsc2020": Accuracy,
"csl": Accuracy,
"xnli": Accuracy,
Would it be better to drop the datasets that are not part of CLUE?
#print(origin_model_new.state_dict().keys())
#print("=====================")
#for name, params in origin_model_new.named_parameters():
#    print(name, params.name)
Same as above.
export CUDA_VISIBLE_DEVICES=$6
export TASK_NAME=$1
export BATCH_SIZE=$3
export SEQ_LEN=$5
export PRE_EPOCHS=$4
export LR=$2
export STUDENT_DIR=$7
Could the variables be parsed in $1, $2, ... order?
do

python quant_post.py --task_name ${task} --input_dir ${MODEL_DIR}/${task}/0.75/sub_static
Is 0.75 used directly as a directory name here?
'target']['span1_text'], example['target']['span2_text'], example[
    'target']['span1_index'], example['target']['span2_index']
text_list = list(text)
# print(text)
Redundant comment.
s_head_dim, t_head_dim = s.shape[3], t.shape[3]

if alpha + beta == 1.0:
    loss1 = 0.0
Could loss1, loss2 and loss3 be renamed to something meaningful?
| ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| 74.28 | 57.33 | 61.72 | 81.06 | 76.20 | 86.51 | 78.77 | 73.70 |

### You can export the fine-tuned model like this for direct deployment
This doesn't work as a title; titles should be as concise as possible.
# PP-MiniLM: a small Chinese model

PP-MiniLM is a small Chinese model whose architecture follows ERNIE. This example currently covers task-agnostic distillation of a six-transformer-layer model, plus pruning and quantization with PaddleSlim to further speed up inference.
Please reconsider whether it is appropriate for us to call the model "distinctive" ourselves. Also, "this example" may no longer fit; now that it is our own model, "this model" or "this solution" reads better.
### How it works

The distillation method of PP-MiniLM:
Could we also mention the original MiniLM here? That would also show where our name comes from.
After the run finishes, the model is saved under `ofa_models/CLUEWSC2020/0.75/best_model/`

### Export the pruned model:
A colon in a heading doesn't look right either.
### Requirements

This experiment, run on 8 NVIDIA V100 32G GPUs, takes about 2-3 days of training. With limited resources, you can skip this step and directly download the model it produces.
This doesn't really count as an environment requirement.
PP-MiniLM is a small Chinese model whose architecture follows ERNIE. This example currently covers task-agnostic distillation of a six-transformer-layer model, plus pruning and quantization with PaddleSlim to further speed up inference.

PP-MiniLM combines distillation, pruning, quantization and high-performance inference techniques, offering high accuracy, fast inference and a small number of parameters:
This sentence and the one above it can be merged.
| Tencent RoBERTa 6l-768d | 59.7M | | 1.00x | 69.74 | 66.36 | 59.95 | 77.00 | 71.39 | 71.05 | 82.83 | 71.19 |
| PP-MiniLM 6l-768d | 59.7M | | 1.00x | 74.28 | 57.33 | 61.72 | 81.06 | 76.2 | 86.51 | 78.77 | 73.70 |
| PP-MiniLM, pruned | 49.1M (after pruning) | | 1.15x | 73.82 | 57.33 | 61.60 | 81.38 | 76.20 | 85.52 | 79.00 | 73.55 |
| PP-MiniLM, quantized | 49.2M (after quantization) | | 2.18x | 73.61 | 57.18 | 61.49 | 81.26 | 76.31 | 84.54 | 77.67 | 73.15 |
Is the quantized model larger than the output of the previous step?
- `num_relation_heads`: the number of relation heads, typically 64 for a large-size teacher model and 48 for a base-size teacher model.
- `teacher_model_type`: the teacher model type; currently only 'ernie' and 'roberta' are supported.
- `teacher_layer_index`: the teacher model layer used during distillation
- `student_layer_index`: the student model layer used during distillation
You mean which layer is selected, right? "number of layers" could be misread.
@@ -0,0 +1,305 @@
# PP-MiniLM Chinese small model

The PP-MiniLM extra-small Chinese model example aims to provide a high-accuracy, high-performance small-model solution unifying training and inference.
Suggest: "The PP-MiniLM extra-small Chinese model example aims to provide high-accuracy, high-performance small models and a training/inference solution."
The current solution draws on industry-leading Task Agnostic model distillation, pruning and quantization techniques, so the small model combines three strengths: fast inference, good model quality and a small number of parameters.

- Fast inference: we integrate PaddleSlim's pruning and quantization techniques to further compress the small model, raising inference speed to 2.18x the original;
Suggest: "Fast inference: relying on PaddleSlim's pruning and quantization to further compress the small model, the quantized PP-MiniLM achieves a GPU inference speedup of up to 3.56x over Bert-base;"
- High accuracy: building on the Multi-Head Self-Attention Relation Distillation technique proposed by MiniLMv2, we further optimized the algorithm by introducing sample-wise relation distillation. Our 6-layer, 768-hidden-size model beats same-size TinyBERT and UER-py RoBERTa models by 2.66% and 1.51% average accuracy on CLUE.
Suggest: "High accuracy: building on the Multi-Head Self-Attention Relation Distillation technique proposed by MiniLMv2, we further optimized the algorithm by introducing sample-wise relation distillation; on CLUE, the 6-layer PP-MiniLM scores 0.23% above the 12-layer Bert-base-chinese, and 2.66% and 1.51% above the same-size TinyBERT and UER-py RoBERTa;"
- Small parameter count: with PaddleSlim pruning, the model width is compressed by 1/4 at almost no accuracy cost (-0.15).
Suggest: "Small parameter count: with PaddleSlim pruning, the hidden width is compressed by 1/4 at almost no accuracy cost (-0.15%), reducing the parameter count by 28%;"
Fixed; thanks a lot!
| Model | #Params | #FLOPs | Speedup | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | WSC | CSL | CLUE avg. |
| ----------------------- | ------- | ------ | ------- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| Bert<sub>base</sub> | 102.3M | 10.87B | 1.00x | 74.17 | 57.17 | 61.14 | 81.14 | 75.08 | 80.26 | 81.47 | 72.92 |
BERT: capitalize BERT consistently when it is used as the model name.
Thanks, fixed.
### Environment

This experiment runs on 8 NVIDIA Tesla V100 32G GPUs, with a training period of about 2-3 days. With limited resources, you can directly [download PP-MiniLM (6L768H)](https://bj.bcebos.com/paddlenlp/models/transformers/ppminilm/6l-768h) for fine-tuning on downstream tasks.
Does this require a manual download? Could we tell users to use the from_pretrained API, which downloads automatically?
No manual download needed; I've added an example of loading via from_pretrained, and added the ppminilm configurations to modeling.py and tokenizer.py.
@@ -0,0 +1,12 @@
for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl
Should a copyright notice be added?
Thanks for the reminder; I've added the copyright header to all the shell scripts.
LGTM for inference api
#### Requirements

This step depends on PaddlePaddle 2.2.1 built with the inference library. You can choose a suitable Python inference library for your machine from the [PaddlePaddle website](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html).
Thanks for the reminder; updated.
## Importing PP-MiniLM

PP-MiniLM is a small Chinese pretrained model with 6 Transformer encoder layers and a hidden size of 768, produced by task-agnostic distillation with `roberta-wwm-ext-large` as the teacher. On seven classification tasks of the [CLUE benchmark](https://github.com/CLUEbenchmark/CLUE), its accuracy exceeds BERT<sub>base</sub>, TinyBERT<sub>6</sub>, UER-py RoBERTa L6-H768 and RBT6.
roberta-wwm-ext-large is the teacher and a 6-layer ERNIE is the student, right? Mentioning ERNIE explicitly would make this clearer.
Thanks for the suggestion; the 6-layer ERNIE is now mentioned.
LGTM
LGTM
Thank you all :)🙏
PR types
New features
PR changes
Models & Docs
Description
TODO:
1. Update the README with the QA test results for UER-py; test CSL under CUDA 10.2 and paddle 2.2.1.