[Pre-Training] Add tutorial for clue small 14g dataset #1555
Merged
9 commits
d82e63d add tutorial for clue small 14g. (ZHUI)
bcd7e42 add pre-train weight to community. (ZHUI)
99220e2 fix typos. (ZHUI)
17ec4c9 fix typo. (ZHUI)
5419975 Merge branch 'develop' into add_clue_corpus (ZHUI)
605ed30 add dataset link. (ZHUI)
1bb43cd Merge branch 'add_clue_corpus' of github.com:ZHUI/PaddleNLP into add_… (ZHUI)
89cf276 change name to ernie-1.0-cluecorpussmall (ZHUI)
0ec4a71 Merge branch 'develop' into add_clue_corpus (ZeyuChen)
@@ -131,7 +131,7 @@ chinese words:
                        Optional. Whether to apply the whole word masking (WWM) strategy. In general, BERT/ERNIE models need it; GPT does not.
  --cn_seg_func {lac,seg,jieba}
                        Words segment function for chinese words.
-                       Defaults to lac; jieba is faster.
+                       Defaults to jieba; jieba is faster, while the lac model is more complex.
  --cn_splited          Is chinese corpus is splited in to words.
                        Pre-segmented text, optional. When this option is set, cn_seg_func has no effect.
                        Example of pre-segmented text: "百度 手机助手 是 Android 手机 的 权威 资源平台"
@@ -148,7 +148,7 @@ common config:
  --workers WORKERS     Number of worker processes to launch
                        Number of processes used to convert text to token ids.
```
-Through [sic: 同过] the script below, we obtain the processed pre-training data: the token ids `baike_sample_ids.npy` and the document index `baike_sample_idx.npz`.
+Through the script below, we obtain the processed pre-training data: the token ids `baike_sample_ids.npy` and the document index `baike_sample_idx.npz`.
```
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
@@ -190,3 +190,51 @@ sh run_static.sh
## References

Note: Most of the data pipeline is adapted from [Megatron](https://github.com/NVIDIA/Megatron-LM); many thanks to its authors.

# Appendix

## Clue corpus small dataset processing tutorial
ZeyuChen marked this conversation as resolved.

**Dataset overview**: Suitable for language modeling, pre-training, generative tasks, and similar uses. The corpus is over 14 GB, with nearly 4,000 well-defined txt files and 5 billion characters. Most of it comes from the nlp_chinese_corpus project.
It contains the following sub-corpora (14 GB in total): the news corpus news2016zh_corpus, the community interaction corpus webText2019zh_corpus, the Wikipedia corpus wiki2019zh_corpus, and the comment corpus comments2019zh_corpus.

**Dataset download**:
The dataset can be downloaded from the official GitHub page, https://github.com/CLUEbenchmark/CLUE . For convenience, we also provide aistudio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). If you use the aistudio version, you can check the md5 values after downloading:

Review comment: "github" is missing the "b".
Reply: done

```shell
> md5sum ./*
8a8be341ebce39cfe9524fb0b46b08c5  ./comment2019zh_corpus.zip
4bdc2c941a7adb4a061caf273fea42b8  ./news2016zh_corpus.zip
fc582409f078b10d717caf233cc58ddd  ./webText2019zh_corpus.zip
157dacde91dcbd2e52a60af49f710fa5  ./wiki2019zh_corpus.zip
```
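The same check can be scripted. The following is a minimal sketch in Python, assuming the four zip files sit in one directory; the helper names `file_md5` and `verify_downloads` are hypothetical and not part of PaddleNLP. The expected digests are copied from the listing above.

```python
import hashlib
from pathlib import Path

# Expected checksums, copied from the md5sum listing above.
EXPECTED_MD5 = {
    "comment2019zh_corpus.zip": "8a8be341ebce39cfe9524fb0b46b08c5",
    "news2016zh_corpus.zip": "4bdc2c941a7adb4a061caf273fea42b8",
    "webText2019zh_corpus.zip": "fc582409f078b10d717caf233cc58ddd",
    "wiki2019zh_corpus.zip": "157dacde91dcbd2e52a60af49f710fa5",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif file_md5(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Hypothetical usage: run from the directory holding the downloads.
    for name, status in verify_downloads(".").items():
        print(f"{status:8s} {name}")
```

Reading in chunks keeps memory use flat even for multi-gigabyte archives.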
Unzip the files:
```shell
unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus
unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus
unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus
unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus
```
Convert the txt files to jsonl format:
```
python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
```
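To make the target format concrete, here is a rough sketch of what such a conversion might do. It is not the actual `trans_to_json.py`. It assumes that documents inside each txt file are separated by blank lines and that each output line is a JSON object with a `text` key; both are assumptions, so check the real script for the delimiter and key it uses. The function names are hypothetical.

```python
import json
from pathlib import Path


def iter_documents(txt_path):
    """Yield documents from one txt file, assuming blank lines separate them."""
    lines = []
    with open(txt_path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if line:
                lines.append(line)
            elif lines:
                yield "\n".join(lines)
                lines = []
    if lines:
        yield "\n".join(lines)


def txt_dir_to_jsonl(input_dir, output_path, json_key="text"):
    """Write one JSON object per document; return the number written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for txt_path in sorted(Path(input_dir).rglob("*.txt")):
            for doc in iter_documents(txt_path):
                # ensure_ascii=False keeps Chinese text readable in the output.
                out.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Each output line is a complete JSON object, so downstream tools can stream the file line by line without loading the whole corpus.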
We now have the dataset in jsonl format. The next step prepares it for a training task, using ernie as the example.
```
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    --tokenizer_name ErnieTokenizer \
    --input_path clue_corpus_small_14g.jsonl \
    --split_sentences \
    --chinese \
    --cn_whole_word_segment \
    --cn_seg_func jieba \
    --output_prefix clue_corpus_small_14g_20220104 \
    --workers 48 \
    --log_interval 10000
```
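The `--cn_whole_word_segment` flag needs word boundaries so that whole words, rather than single characters, can be masked together. How `create_pretraining_data.py` records those boundaries is not shown in this diff. The sketch below illustrates one common convention, which is an assumption here: non-initial characters of a Chinese word get a `##` prefix. It takes already segmented, space-separated text as input; the function names are hypothetical.

```python
def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def mark_word_boundaries(segmented_text):
    """Turn space-separated words into character tokens with '##' continuation marks."""
    tokens = []
    for word in segmented_text.split():
        if all(is_cjk(ch) for ch in word):
            tokens.append(word[0])
            tokens.extend("##" + ch for ch in word[1:])
        else:
            # Non-Chinese words (e.g. "Android") are left whole for the tokenizer.
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(mark_word_boundaries("百度 手机助手 是 Android 手机 的 权威 资源平台"))
```

With marks like these, a masking routine can treat a token and the `##` tokens that follow it as one unit.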
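The dataset contains about `15702702` documents. Because word segmentation is time-consuming, processing takes roughly one hour. The data needed for training is written to the current directory:
```
clue_corpus_small_14g_20220104_ids.npy
clue_corpus_small_14g_20220104_idx.npz
```
Users can use this data for pre-training tasks.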
Review comment: The adjective "complex" is imprecise. It should say that the lac segmentation model is more accurate but more computationally expensive.
Reply: done