Support megatron dataset for T5 #6659

Merged
merged 1 commit on Sep 13, 2023
18 changes: 11 additions & 7 deletions examples/language_model/t5/README.md
Contributor

In the data preparation part, the sample preset token ids baike_sample_ids.npy and the document index information baike_sample_idx.npz should be changed to the bin and idx formats. For building the data you can refer to here; pay attention to the parameter configuration, see here.

@@ -20,17 +20,19 @@

The data pipeline is a crucial part of pretraining. The [preprocessing documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md) gives an overview of how the data is transformed; users can consult it for the details of data preparation.

In the data ID conversion step, we need to set tokenizer_name and choose the tokenizer that matches the t5 model; by running the script below we obtain the processed pretraining data: token ids [`baike_sample_ids.npy`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_ids.npy) and the document index information [`baike_sample_idx.npz`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_idx.npz). (A processed pretraining dataset is provided here; click the links to download it.)
In the data ID conversion step, we need to set tokenizer_name and choose the tokenizer that matches the t5 model; by running the script below we obtain the processed pretraining data: token ids [`t5_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.bin) and the document index information [`t5_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.idx). (A processed pretraining dataset is provided here; click the links to download it.)

```shell
python -u create_pretraining_data.py \
Collaborator
In the data ID conversion step, we need to set tokenizer_name and choose the tokenizer that matches the t5 model; by running the script below we obtain the processed pretraining data: token ids baike_sample_ids.npy and the document index information baike_sample_idx.npz. (A processed pretraining dataset is provided here; click the link to download it.)

We need to put together a sample dataset for this part.

--model_name t5-small \
--tokenizer_name T5Tokenizer \
--input_path baike_sample.jsonl \
--split_sentences\
--output_prefix baike_sample \
--data_format JSON \
--input_path openwebtext/2020-04.jsonl.zst \
--split_sentences \
--output_prefix t5_openwebtext \
--workers 1 \
--log_interval 5
--log_interval 5 \
--data_impl mmap
```
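With `--output_prefix t5_openwebtext` and `--data_impl mmap`, the conversion is expected to write the `t5_openwebtext.bin` token file and the matching `t5_openwebtext.idx` index file referenced above.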

#### 2. Start training
@@ -73,8 +75,9 @@ python -u -m paddle.distributed.launch \
--disable_tqdm true \
--do_train \
--do_eval \
--seed 1234\
--device "gpu"
--seed 1234 \
--device "gpu" \
--data_impl "mmap"
```

The parameters are explained as follows:
@@ -95,6 +98,7 @@ python -u -m paddle.distributed.launch \
- `dataloader_num_workers` Number of DataLoader worker processes; when data loading is the bottleneck, try increasing it.
- `eval_steps` Interval between model evaluations.
- `device` Training device; defaults to GPU.
- `data_impl` Format of the preprocessed input data; defaults to mmap, and can be either mmap or lazy. The mmap format memory-maps the data file when reading, while the lazy format reads directly from the file on each access.
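
As a rough illustration of the difference (a minimal sketch, not PaddleNLP's actual dataset code; the file layout, offsets, and dtype here are assumptions), reading one sample from the preprocessed `.bin` token file under the two modes looks roughly like this:

```python
import numpy as np

def read_sample_mmap(bin_path, offset, length, dtype=np.uint16):
    # mmap: map the file once; slices are then served through the OS page cache
    data = np.memmap(bin_path, dtype=dtype, mode="r")
    return np.array(data[offset:offset + length])

def read_sample_lazy(bin_path, offset, length, dtype=np.uint16):
    # lazy: seek and read only the requested bytes on every access
    itemsize = np.dtype(dtype).itemsize
    with open(bin_path, "rb") as f:
        f.seek(offset * itemsize)
        return np.frombuffer(f.read(length * itemsize), dtype=dtype)
```

mmap generally favors repeated random access, while lazy avoids memory-mapping at the cost of a file read per sample.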

### GLUE Tasks

20 changes: 13 additions & 7 deletions examples/language_model/t5/t5_run_pretrain_trainer.py
100644 → 100755
@@ -120,6 +120,10 @@ class DataArguments:
default=3,
metadata={"help": "Max N Grams"},
)
data_impl: str = field(
default="mmap",
metadata={"help": "mmap/lazy format converted from preprocessed data."},
)


@dataclass
@@ -183,12 +187,13 @@ def create_pretrained_dataset(

def print_dataset(data, mode="train"):
logger.info(f"Sample data for {mode} mode")
# text_enc, text_dec, labels, loss_mask, truncated, enc_mask, dec_mask, enc_dec_mask = data
# print("line 195 t5 run pretain trainer", text_enc)
print(data)
print(tokenizer.convert_ids_to_tokens(token for token in list(data["text_enc"])))
print(tokenizer.convert_ids_to_tokens(token for token in list(data["text_dec"])))
# print(tokenizer.convert_ids_to_tokens(token for token in list(data["labels"])))
text_enc, text_dec = data["text_enc"], data["text_dec"]
if tokenizer.pad_token_id in text_enc:
text_enc = text_enc[0 : list(text_enc).index(tokenizer.pad_token_id)]
logger.info(tokenizer._decode(text_enc))
if tokenizer.pad_token_id in text_dec:
text_dec = text_dec[0 : list(text_dec).index(tokenizer.pad_token_id)]
logger.info(tokenizer._decode(text_dec))

print_dataset(train_ds[0], "train")
print_dataset(valid_ds[0], "valid")
@@ -224,9 +229,10 @@ def get_train_data_file(args):
files = [
os.path.join(args.input_dir, f)
for f in os.listdir(args.input_dir)
if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f))
if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f)))
]
files = [x.replace("_idx.npz", "") for x in files]
files = [x.replace(".idx", "") for x in files]

if len(files) > 1:
ret = []
1 change: 0 additions & 1 deletion model_zoo/ernie-1.0/args.py
@@ -33,7 +33,6 @@ def parse_args(MODEL_CLASSES):
parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.")
parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.")
parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from preprocessed data.")

parser.add_argument("--binary_head", type=strtobool, default=True, help="True for NSP task.")
parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.")
parser.add_argument("--micro_batch_size", default=8, type=int, help="Batch size per device for one step training.", )