fix T5 readme (#6659)
Support megatron dataset for T5
LaiXinyi823 authored Sep 13, 2023
1 parent d9f1b76 commit a43138b
Showing 3 changed files with 24 additions and 15 deletions.
18 changes: 11 additions & 7 deletions examples/language_model/t5/README.md
@@ -20,17 +20,19 @@

The data pipeline is a crucial part of pretraining. The [preprocessing documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md) gives an overview of the end-to-end data transformation flow; consult it for the details of data preparation.

- In the token-ID conversion step, we need to set `tokenizer_name` to select the tokenizer matching the T5 model. Running the script below yields the processed pretraining data: token ids [`baike_sample_ids.npy`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_ids.npy) and article index information [`baike_sample_idx.npz`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_idx.npz). (A copy of the processed pretraining data is provided at these links.)
+ In the token-ID conversion step, we need to set `tokenizer_name` to select the tokenizer matching the T5 model. Running the script below yields the processed pretraining data: token ids [`t5_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.bin) and article index information [`t5_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.idx). (A copy of the processed pretraining data is provided at these links.)

```diff
python -u create_pretraining_data.py \
    --model_name t5-small \
    --tokenizer_name T5Tokenizer \
-    --input_path baike_sample.jsonl \
-    --split_sentences\
-    --output_prefix baike_sample \
+    --data_format JSON \
+    --input_path openwebtext/2020-04.jsonl.zst \
+    --split_sentences \
+    --output_prefix t5_openwebtext \
    --workers 1 \
-    --log_interval 5
+    --log_interval 5 \
+    --data_impl mmap
```

#### 2. Start training
@@ -73,8 +75,9 @@ python -u -m paddle.distributed.launch \
--disable_tqdm true \
--do_train \
--do_eval \
-    --seed 1234\
-    --device "gpu"
+    --seed 1234 \
+    --device "gpu" \
+    --data_impl "mmap"
```

The parameters are described as follows:
@@ -95,6 +98,7 @@ python -u -m paddle.distributed.launch \
- `dataloader_num_workers` Number of DataLoader worker processes; when data input is the bottleneck, try increasing this number.
- `eval_steps` Interval between model evaluations.
- `device` Training device; defaults to GPU.
+ - `data_impl` Format of the preprocessed input data; defaults to `mmap`, and can be `mmap` or `lazy`. The `mmap` format builds a memory map over the data when reading it, while the `lazy` format reads directly from the file.
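
To make the distinction concrete, here is a minimal sketch of the two access patterns (not PaddleNLP's actual loader; the `t5_openwebtext.bin` file name and the `uint16` token dtype are assumptions for illustration):

```python
import numpy as np

path = "t5_openwebtext.bin"  # assumed token-id file from create_pretraining_data.py

# mmap-style: build a memory map; the OS pages data in on demand,
# so nothing is copied up front and random access stays cheap.
ids_mmap = np.memmap(path, dtype=np.uint16, mode="r")  # dtype is an assumption
print(ids_mmap[:16])

# lazy-style: read just the bytes you need directly from the file.
with open(path, "rb") as f:
    chunk = np.frombuffer(f.read(16 * 2), dtype=np.uint16)  # first 16 ids
print(chunk)
```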

### GLUE tasks

20 changes: 13 additions & 7 deletions examples/language_model/t5/t5_run_pretrain_trainer.py
100644 → 100755
@@ -120,6 +120,10 @@ class DataArguments:
        default=3,
        metadata={"help": "Max N Grams"},
    )
+    data_impl: str = field(
+        default="mmap",
+        metadata={"help": "mmap/lazy format converted from preprocessed data."},
+    )


@dataclass
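
For context, the new `data_impl` field reaches the command line via dataclass-driven argument parsing; a minimal hand-rolled equivalent with plain `argparse` (for illustration only, not the trainer's actual parser) looks like this:

```python
import argparse
from dataclasses import dataclass, field

@dataclass
class DataArguments:
    data_impl: str = field(
        default="mmap",
        metadata={"help": "mmap/lazy format converted from preprocessed data."},
    )

# Hand-rolled equivalent of what a dataclass-driven parser derives:
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_impl",
    type=str,
    default=DataArguments.data_impl,  # "mmap"
    help="mmap/lazy format converted from preprocessed data.",
)
args = parser.parse_args(["--data_impl", "lazy"])
print(args.data_impl)  # -> lazy
```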
@@ -183,12 +187,13 @@ def create_pretrained_dataset(

    def print_dataset(data, mode="train"):
        logger.info(f"Sample data for {mode} mode")
-        # text_enc, text_dec, labels, loss_mask, truncated, enc_mask, dec_mask, enc_dec_mask = data
-        # print("line 195 t5 run pretain trainer", text_enc)
-        print(data)
-        print(tokenizer.convert_ids_to_tokens(token for token in list(data["text_enc"])))
-        print(tokenizer.convert_ids_to_tokens(token for token in list(data["text_dec"])))
-        # print(tokenizer.convert_ids_to_tokens(token for token in list(data["labels"])))
+        text_enc, text_dec = data["text_enc"], data["text_dec"]
+        if tokenizer.pad_token_id in text_enc:
+            text_enc = text_enc[0 : list(text_enc).index(tokenizer.pad_token_id)]
+        logger.info(tokenizer._decode(text_enc))
+        if tokenizer.pad_token_id in text_dec:
+            text_dec = text_dec[0 : list(text_dec).index(tokenizer.pad_token_id)]
+        logger.info(tokenizer._decode(text_dec))

    print_dataset(train_ds[0], "train")
    print_dataset(valid_ds[0], "valid")
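
The pad-trimming pattern used above, shown on a toy sequence (hypothetical pad id and token ids, for illustration only):

```python
pad_token_id = 0                   # hypothetical pad id
text_enc = [37, 423, 8, 0, 0, 0]   # toy encoder ids with trailing padding

# Cut at the first pad token so the decoded sample isn't cluttered by padding.
if pad_token_id in text_enc:
    text_enc = text_enc[: text_enc.index(pad_token_id)]
print(text_enc)  # -> [37, 423, 8]
```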
@@ -224,9 +229,10 @@ def get_train_data_file(args):
    files = [
        os.path.join(args.input_dir, f)
        for f in os.listdir(args.input_dir)
-        if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f))
+        if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f)))
    ]
    files = [x.replace("_idx.npz", "") for x in files]
+    files = [x.replace(".idx", "") for x in files]

    if len(files) > 1:
        ret = []
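
A toy run of the prefix discovery above (hypothetical file names), showing how both the legacy `_idx.npz` index and the new Megatron-style `.idx` index map back to the same kind of dataset prefix:

```python
names = ["t5_openwebtext.idx", "t5_openwebtext.bin", "baike_sample_idx.npz"]

# Keep only index files, then strip either index suffix to recover the prefix.
files = [n for n in names if "_idx.npz" in n or ".idx" in n]
files = [f.replace("_idx.npz", "").replace(".idx", "") for f in files]
print(files)  # -> ['t5_openwebtext', 'baike_sample']
```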
1 change: 0 additions & 1 deletion model_zoo/ernie-1.0/args.py
@@ -33,7 +33,6 @@ def parse_args(MODEL_CLASSES):
parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.")
parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.")
parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from preprocessed data.")

parser.add_argument("--binary_head", type=strtobool, default=True, help="True for NSP task.")
parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.")
parser.add_argument("--micro_batch_size", default=8, type=int, help="Batch size per device for one step training.", )
