GPT2 #205

Merged
13 commits, merged Sep 15, 2023
49 changes: 49 additions & 0 deletions training/benchmarks/gpt2/README.md
@@ -0,0 +1,49 @@
### Model information
- Model description

GPT-2 Medium is the 345M-parameter version of Megatron-GPT2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text with a causal language modeling (CLM) objective, i.e. next-token prediction (a brief sketch is given at the end of this section).

>[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

- Model code source

This case includes code from the open source project at https://github.com/NVIDIA/Megatron-LM/tree/v3.0/megatron.

Some of the files in this directory were modified by BAAI in 2023 to support FlagPerf.
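
To make the causal language modeling objective mentioned above concrete, the following is a minimal, self-contained sketch (it is not the Megatron-LM implementation): the model is trained to predict token t+1 from the tokens up to t, which reduces to a cross-entropy between shifted logits and labels.

```python
# Minimal CLM loss sketch (illustrative only, not Megatron-LM code).
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq, vocab]; tokens: [batch, seq] token ids."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    shift_labels = tokens[:, 1:]       # targets are the next tokens
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

# Toy check with random data:
vocab, batch, seq = 50257, 2, 8
loss = causal_lm_loss(torch.randn(batch, seq, vocab),
                      torch.randint(vocab, (batch, seq)))
```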


### Dataset
- Dataset download URL
> Dataset website: https://huggingface.co/datasets/lambada

> The training data should be downloaded from Hugging Face. First, download the training data in a loose JSON format, with one JSON object containing a text sample per line. For example, in a Python interpreter:

```python
from datasets import load_dataset

train_data = load_dataset('lambada', split='train')
train_data.to_json("lambada.train.json", lines=True)
```
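
The benchmark config below (`_base.py`) also references a `lambada_test.json` for evaluation. Assuming it is exported from the same Hugging Face dataset, the test split can be written out the same way:

```python
test_data = load_dataset('lambada', split='test')
test_data.to_json("lambada_test.json", lines=True)
```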

- Preprocessing
> The training data requires preprocessing.
The loose JSON is then processed into a binary (mmap) format for training. To convert the JSON into mmap format, use `preprocess_data.py`. An example command to prepare data for GPT-2 training is:

``` bash
python tools/preprocess_data.py \
    --input lambada.train.json \
    --output-prefix lambada \
    --vocab gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 32 \
    --chunk-size 25
```


### Framework and chip support
|            | PyTorch | Paddle | TensorFlow2 |
| ---------- | ------- | ------ | ----------- |
| Nvidia GPU | ✅ | N/A | N/A |
2 changes: 2 additions & 0 deletions training/benchmarks/gpt2/pytorch/config/__init__.py
@@ -0,0 +1,2 @@
from ._base import *
from .mutable_params import mutable_params
122 changes: 122 additions & 0 deletions training/benchmarks/gpt2/pytorch/config/_base.py
@@ -0,0 +1,122 @@
import torch

# Required parameters

vendor: str = None
data_dir: str = None
name: str = "GPT2"
cudnn_benchmark: bool = False
cudnn_deterministic: bool = True

use_env: bool = True
log_freq: int = 1
device: str = None

# =========================================================
# train config
# =========================================================

seed: int = 1234
gradient_accumulation_steps: int = 1

max_steps: int = 23070
train_batch_size: int = 4

eval_iter_start_samples: int = 3200
eval_interval_samples: int = 3200

target_acc: float = 0.60

# =========================================================
# data
# =========================================================

train_data_prefix: str = "lambada_train_text_document"
test_data_prefix: str = "lambada_test.json"
vocab_file: str = "gpt2-vocab.json"
merge_file: str = "gpt2-merges.txt"

# =========================================================
# loss scale
# =========================================================
clip_grad: float = 1.0

# =========================================================
# optimizer & lr scheduler & weight decay
# =========================================================
optimizer: str = "adam"
adam_beta1: float = 0.9
adam_beta2: float = 0.999
adam_eps: float = 1e-8

lr: float = 0.00015
min_lr: float = 1e-05
lr_warmup_fraction: float = 0.01
lr_warmup_iters: int = 0
lr_warmup_samples: int = 0
lr_decay_style: str = "cosine"
lr_decay_samples: int = None

weight_decay: float = 0.01
start_weight_decay: float = 0.01
end_weight_decay: float = 0.01
weight_decay_incr_style: str = "constant"

use_distributed_optimizer: bool = False
barrier_with_L1_time: bool = True

# =========================================================
# transformer
# =========================================================

num_layers: int = 24
encoder_num_layers: int = 24

num_attention_heads: int = 16
hidden_size: int = 1024
ffn_hidden_size: int = 4096
kv_channels: int = 64
seq_length: int = 1024
attention_dropout: float = 0.1
hidden_dropout: float = 0.1
transformer_impl: str = "local"
use_flash_attn: bool = False

layernorm_epsilon: float = 1e-05

fp16: bool = False
bf16: bool = False

init_method_std: float = 0.02
params_dtype: torch.dtype = torch.float32
masked_softmax_fusion: bool = True
bias_gelu_fusion: bool = True
bias_dropout_fusion: bool = True
apply_residual_connection_post_layernorm: bool = False
apply_query_key_layer_scaling: bool = True
fp16_lm_cross_entropy: bool = False
fp32_residual_connection: bool = False
attention_softmax_in_fp32: bool = False

# =========================================================
# dataset
# =========================================================

tokenizer_type: str = "GPT2BPETokenizer"
num_workers: int = 2
mmap_warmup: bool = False
padded_vocab_size: int = 0
make_vocab_size_divisible_by: int = 128
max_position_embeddings: int = 1024

reset_position_ids: bool = False
reset_attention_mask: bool = False
eod_mask_loss: bool = False

# =========================================================
# distributed parallel
# =========================================================

dist_backend: str = None
DDP_impl: str = "native"
gradient_accumulation_fusion: bool = False
use_contiguous_buffers_in_local_ddp: bool = False
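
As an informal illustration of the learning-rate settings above (`lr`, `min_lr`, `lr_warmup_fraction`, `lr_decay_style = "cosine"`), the sketch below models a linear warmup over a fraction of the decay horizon followed by cosine decay down to `min_lr`. It is a simplified picture of the behaviour, not the Megatron-LM scheduler itself.

```python
# Simplified warmup + cosine decay sketch (not the Megatron-LM scheduler).
import math

def lr_at(step: int, decay_steps: int,
          lr: float = 0.00015, min_lr: float = 1e-05,
          warmup_fraction: float = 0.01) -> float:
    warmup_steps = int(warmup_fraction * decay_steps)
    if warmup_steps > 0 and step < warmup_steps:
        return lr * (step + 1) / warmup_steps           # linear warmup
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to min_lr
    return min_lr + coeff * (lr - min_lr)
```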
6 changes: 6 additions & 0 deletions training/benchmarks/gpt2/pytorch/config/mutable_params.py
@@ -0,0 +1,6 @@
mutable_params = [
    'vendor', 'data_dir', 'lr', 'weight_decay',
    'gradient_accumulation_steps', 'max_steps',
    'train_batch_size', 'eval_iter_start_samples', 'eval_interval_samples',
    'dist_backend', 'device',
]
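
The whitelist above indicates which defaults from `_base.py` are expected to be overridden, for example per vendor or per run. A rough, hypothetical sketch of that pattern is shown below; FlagPerf's actual launcher code may differ, and the `config` import assumes the package above is importable.

```python
# Hypothetical sketch: expose only whitelisted config attributes as CLI flags
# and write the parsed values back onto the config package. Not FlagPerf code.
import argparse

import config  # the package above: _base defaults plus mutable_params


def parse_and_apply_overrides(argv=None):
    parser = argparse.ArgumentParser()
    for name in config.mutable_params:
        default = getattr(config, name, None)
        arg_type = type(default) if default is not None else str
        parser.add_argument(f"--{name}", type=arg_type, default=default)
    args = parser.parse_args(argv)
    for name in config.mutable_params:
        setattr(config, name, getattr(args, name))
    return args
```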
3 changes: 3 additions & 0 deletions training/benchmarks/gpt2/pytorch/dataloaders/__init__.py
@@ -0,0 +1,3 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.

from .tokenizer import get_tokenizer
33 changes: 33 additions & 0 deletions training/benchmarks/gpt2/pytorch/dataloaders/dataloader.py
@@ -0,0 +1,33 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.

"""Dataloaders."""

import torch

from mpu import get_data_parallel_rank, get_data_parallel_world_size


def build_data_loader(dataset, train_batch_size, num_workers, drop_last,
                      task_collate_fn=None):
    """Data loader. Note that batch-size is the local (per GPU) batch-size."""

    # Sampler.
    if torch.distributed.is_initialized():
        world_size = get_data_parallel_world_size()
        rank = get_data_parallel_rank()
        sampler = torch.utils.data.distributed.DistributedSampler(
            dataset, num_replicas=world_size, rank=rank)
    else:
        sampler = None

    # Data loader. Note that batch size is the per GPU batch size.
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=train_batch_size,
                                              sampler=sampler,
                                              shuffle=False,
                                              num_workers=num_workers,
                                              drop_last=drop_last,
                                              pin_memory=True,
                                              collate_fn=task_collate_fn)

    return data_loader
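
A hypothetical usage sketch of `build_data_loader` with a toy dataset follows. The import path is an assumption about this repository's layout, and the module-level `mpu` import is assumed to be available; in a single-process run the `DistributedSampler` branch is skipped.

```python
# Hypothetical usage sketch, not part of the PR.
import torch

from dataloaders.dataloader import build_data_loader  # assumed import path

toy = torch.utils.data.TensorDataset(torch.arange(1024).view(256, 4))
loader = build_data_loader(toy, train_batch_size=4, num_workers=2,
                           drop_last=True)
for (batch,) in loader:
    print(batch.shape)  # torch.Size([4, 4])
    break
```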