[Pre-Training] Add tutorial for clue small 14g dataset #1555
@@ -131,7 +131,7 @@ chinese words:
Optional. Whether to apply the WWM (whole word masking) strategy. In general, BERT/ERNIE models need it and GPT does not.
--cn_seg_func {lac,seg,jieba}
Words segment function for chinese words.
Default is lac; jieba is faster.
Default is jieba; jieba is faster, and the lac model is more complex.
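The adjective "complex" is imprecise here.
It should say that the lac segmentation model is more accurate but more computationally expensive.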
done
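As a rough illustration of the option discussed in this thread, the following sketch shows how a flag with the three listed choices and a `jieba` default could be declared with `argparse`. Only the flag name, its choices, the help string, and the new default come from the diff; the `build_parser` helper and the parser description are hypothetical and are not taken from the PaddleNLP script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: only --cn_seg_func, its choices, and the default
    # are taken from the diff; everything else is illustrative.
    parser = argparse.ArgumentParser(description="Sketch of the segmentation option")
    parser.add_argument(
        "--cn_seg_func",
        type=str,
        default="jieba",  # new default proposed in this PR
        choices=["lac", "seg", "jieba"],
        help="Words segment function for chinese words.",
    )
    return parser


# Parsing an empty argument list falls back to the default.
args = build_parser().parse_args([])
print(args.cn_seg_func)
```

Passing `--cn_seg_func lac` would select the more accurate but more expensive segmenter described in the review comment.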
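It contains the following sub-corpora (14 GB in total): news corpus news2016zh_corpus, community interaction corpus webText2019zh_corpus, Wikipedia corpus wiki2019zh_corpus, and comment corpus comments2019zh_corpus.

**Dataset download**:
Users can download it from the official githu page, https://github.com/CLUEbenchmark/CLUE . For convenience, we also provide aistudio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). If you use the aistudio version of the data, you can verify the md5 values after downloading: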
github: the "b" is missing.
done
Please confirm the official names throughout. Also, should this pre-training workflow be merged with the ernie-1.0 training script?
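The download note in this thread mentions checking md5 values after downloading. A minimal sketch of such a check with Python's standard `hashlib` is shown below; the helper names `md5_of_file` and `matches_md5` are hypothetical, and the expected digests are not reproduced here because they do not appear in this excerpt.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so multi-gigabyte archives do not need
        # to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's md5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_md5(downloaded_file, published_md5)` would return `True` only when the download is intact, where both arguments are placeholders for the local file path and the value published with the dataset.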
@@ -0,0 +1,48 @@
# Details
These weights were obtained by following the ernie pre-training tutorial provided by PaddleNLP, trained on the clue corpus small 14g dataset.
ernie -> ERNIE. Documentation should distinguish a model's official name from API parameter names; the official name is ERNIE/ERNIE-1.0.
clue corpus small 14g: use the official name.
done
```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer
tokenizer = ErnieTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
```
Rename it to ernie-1.0-cluecorpus2020?
Please double-check whether the official name of the corpus used is CLUECorpus2020.
https://github.com/CLUEbenchmark/CLUE
https://github.com/CLUEbenchmark/CLUECorpus2020
CLUECorpus2020 is 100 GB of data and requires an application; we are using CLUECorpusSmall, which is only 14 GB. They are two different datasets.
I changed the name to ernie-1.0-cluecorpussmall.
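@@ -82,6 +82,32 @@ python -u -m paddle.distributed.launch \
- In general, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`. Gradient accumulation can be used to increase `global_batch_size`: when `global_batch_size` is set to an integer multiple of this theoretical value, gradient accumulation is enabled by default.
- To resume training from a checkpoint, simply launch the job again; the program finds the latest checkpoint and resumes training from it.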
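
The batch-size relation in the first bullet can be worked through with a short sketch. The function name `grad_accumulation_steps` and the numbers in the usage lines are hypothetical and chosen only for illustration; the formula itself is the one quoted in the bullet, and the treatment of non-multiples as an error is an assumption rather than documented behavior of the training script.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size,
                            sharding_degree, dp_degree):
    """Return how many accumulation steps a global batch size implies."""
    # Theoretical value from the bullet above.
    base = micro_batch_size * sharding_degree * dp_degree
    if global_batch_size % base != 0:
        # Assumption: a non-multiple is treated as a configuration error here.
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of {base}"
        )
    return global_batch_size // base


# Hypothetical settings: 64 samples per device step, no sharding, 8-way data parallel.
print(grad_accumulation_steps(512, 64, 1, 8))   # theoretical value, no accumulation
print(grad_accumulation_steps(1024, 64, 1, 8))  # twice the theoretical value
```

With these illustrative numbers the theoretical value is 64 * 1 * 8 = 512, so a `global_batch_size` of 1024 corresponds to accumulating gradients over two micro-steps.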
### Clue corpus small dataset training results
CLUECorpus2020 Small?
done
The dataset should be [...], which refers to [...]
Yes. Should this become the default data and training workflow for ERNIE-1.0?
LGTM
PR types
Others
PR changes
Docs
Description
Add tutorial for clue small 14g dataset