[kunlunxin] add kunlun2 llama2-7b #348

Merged: 38 commits, Dec 26, 2023
Commits (38)
64a1cbf  add kunlun2 llama2-7b  (shenzhu1993, Dec 1, 2023)
bd494ab  [kunlunxin] add kunlun2 llama2-7b  (shenzhu1993, Dec 1, 2023)
4405c20  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 8, 2023)
76634d1  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 8, 2023)
1c3aa38  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 9, 2023)
ca7ef21  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 9, 2023)
c1d1c3b  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 9, 2023)
dd356d6  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 9, 2023)
429626e  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
362d75b  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
c837e02  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
995ef6f  Delete training/run_benchmarks/config/cluster_conf.py  (shenzhu1993, Dec 15, 2023)
5451cfa  Delete training/benchmarks/llama2_7b/deepspeed/config/config_A100x1x8.py  (shenzhu1993, Dec 15, 2023)
6e60cce  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
37edf66  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
6bf428a  Delete training/benchmarks/llama2_7b/deepspeed/run_llama.sh  (shenzhu1993, Dec 15, 2023)
8545849  Delete training/benchmarks/llama2_7b/deepspeed/run_llama.sh  (shenzhu1993, Dec 15, 2023)
d0d9673  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 15, 2023)
bb75dee  Delete training/benchmarks/llama2_7b/deepspeed/dataset/llama_dataset.py  (shenzhu1993, Dec 15, 2023)
2cae4ea  Delete training/benchmarks/llama2_7b/deepspeed/dataset/llama_dataset.py  (shenzhu1993, Dec 15, 2023)
da49983  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
f575bfd  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
476fe5f  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
9f82254  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
07a09ee  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
476192b  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
c118469  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
0987f8a  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
06c7101  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
1b07da4  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
c900dfc  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
301ad3f  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
de8fb14  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
4667a79  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
0c60298  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)
e897eae  Merge branch 'main' into kunlunxin_llama2  (shenzhu1993, Dec 20, 2023)
94ada4e  Merge branch 'main' into kunlunxin_llama2  (shenzhu1993, Dec 20, 2023)
48e6e08  Merge branch 'kunlunxin_llama2' of https://github.com/shenzhu1993/Fla…  (shenzhu1993, Dec 20, 2023)

2 changes: 1 addition & 1 deletion training/benchmarks/llama2_7b/deepspeed/ds_config.json
@@ -30,7 +30,7 @@
"steps_per_print": 50,
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"bf16": {
"fp16": {
"enabled": true
},
"activation_checkpointing": {
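
The hunk above swaps the DeepSpeed mixed-precision block from bf16 to fp16, matching the fp16 data precision listed in the kunlunxin README below. A minimal sanity-check sketch, assuming the repository-relative path shown in the file header:

```python
import json

# Load the benchmark's DeepSpeed config and confirm the precision switch.
# The path is taken from the file header above and may differ in a checkout.
with open("training/benchmarks/llama2_7b/deepspeed/ds_config.json") as f:
    ds_config = json.load(f)

# After this PR the config should request fp16 and no longer carry a bf16 block.
assert ds_config.get("fp16", {}).get("enabled", False)
assert "bf16" not in ds_config
print("training precision: fp16")
```
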
11 changes: 6 additions & 5 deletions training/benchmarks/llama2_7b/deepspeed/run_pretraining.py
@@ -49,14 +49,14 @@ def get_argument_parser():
type=int,
required=True,
help="how many processes will run on each host.")

return parser


def train(model_engine, dataloader):
model_engine.train()
ave_loss = 0.0
for step, data in enumerate(dataloader):

fake_data = torch.tensor(data).long()
input_ids = fake_data.to(args.local_rank)
labels = fake_data.to(args.local_rank)
@@ -71,13 +71,15 @@ def train(model_engine, dataloader):
ave_loss = 0.0


- def get_deepspeed_engine(args, model_config_dir, flashattn):
+ def get_deepspeed_engine(args, model_config_dir, flashattn, gradient_checkpointing):
with deepspeed.zero.Init(config_dict_or_path=args.deepspeed_config,
enabled=True,
mem_efficient_linear=False,
mpu=None):
model = get_llama_model(model_config_dir, flashattn)

+ if gradient_checkpointing:
+     model.gradient_checkpointing_enable()
model_engine, _, _, _ = deepspeed.initialize(
args=args, model=model, model_parameters=model.parameters())
return model_engine
@@ -98,7 +100,6 @@ def get_metric(texts):
flagperf_config = {}
sys.path.append(os.path.dirname(args.flagperf_config))
config_file = os.path.basename(args.flagperf_config).split('.')[0]

module = import_module(config_file)

seqlength = getattr(module, 'seqlength')
@@ -107,10 +108,10 @@
theoryflops = getattr(module, 'theoryflops')
epochs = getattr(module, 'epochs')
flashattn = getattr(module, 'flashattn')

+ gradient_checkpointing = getattr(module, 'gradient_checkpointing')
deepspeed.init_distributed()
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"),
- flashattn)
+ flashattn, gradient_checkpointing)
dataset = get_llama_dataset(args, seqlength, datafilename)

logger = logging.getLogger("DeepSpeed")
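
The run_pretraining.py hunks above thread a new gradient_checkpointing flag from the vendor config module into get_deepspeed_engine, enabling gradient checkpointing on the model before deepspeed.initialize. A minimal sketch of that config-loading pattern, with a hypothetical helper name and a defaulted getattr so vendor configs that predate the flag would still load (the PR itself reads the attribute without a default):

```python
import os
import sys
from importlib import import_module


def load_vendor_config(config_path):
    """Hypothetical helper mirroring run_pretraining.py's loading pattern:
    the vendor config (a plain .py file) is imported as a module and its
    attributes are read with getattr."""
    sys.path.append(os.path.dirname(config_path))
    module_name = os.path.basename(config_path).split('.')[0]
    module = import_module(module_name)
    return {
        "seqlength": getattr(module, "seqlength"),
        "batchsize": getattr(module, "batchsize"),
        "flashattn": getattr(module, "flashattn"),
        # Defaulting to False keeps older vendor configs without the new flag
        # loadable; the PR reads the attribute directly instead.
        "gradient_checkpointing": getattr(module, "gradient_checkpointing", False),
    }


# cfg = load_vendor_config("/path/to/config_R300x1x8.py")  # illustrative path
```
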
8 changes: 8 additions & 0 deletions training/kunlunxin/docker_image/deepspeed/Dockerfile
@@ -0,0 +1,8 @@
# TODO: this is a temporary docker image from Docker Hub; it needs to be updated to kunlunxin's Harbor registry.
FROM 127.0.0.1:9999/xpytorch/kunlunxin-deepspeed:v1.0

RUN /bin/bash -c "pip config set global.index-url https://mirror.baidu.com/pypi/simple"

WORKDIR /workspace

ENV PATH /root/miniconda/envs/python38_torch201_cuda/bin:$PATH
59 changes: 59 additions & 0 deletions training/kunlunxin/llama2_7b-deepspeed/README.md
@@ -0,0 +1,59 @@
### Kunlunxin XPU configuration and run information reference
#### Environment configuration
- ##### Hardware environment
  - Machine model: Kunlunxin AI accelerator group R480-X8
  - Accelerator card model: Kunlunxin AI accelerator card R300
  - CPU model: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
  - Multi-node network type and bandwidth: InfiniBand, 200Gb/s

- ##### Software environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.4.0-26-generic
  - Accelerator driver version: 4.23
  - Docker image and version: xpytorch/kunlunxin-deepspeed:v1.0
  - Training framework version: XPyTorch 2.0.1 + deepspeed 0.10.1
  - Dependency versions: transformers 4.32.1

- ##### Parallel strategy

  - Parallel technique: sharded data parallel
  - Implemented by: deepspeed ZeRO-DP
  - Implementation details: ZeRO-DP O3, DP_SIZE=8

- ##### Optimization strategies

  - gradient_checkpointing
  - tuning of the tiling strategy for FC (fully connected) operators



### Run information

* Input batch sizes
  1. local_batchsize (micro_batchsize), abbreviated LBS: the tensor batch size actually fed into the model, as set in config_A100x1x8.py; in this case it defaults to 1
  2. seqlength (max_position_embedding), abbreviated MPE: the sequence length actually fed into the model, as set in config_A100x1x8.py; in this case it defaults to 4096
  3. gradient_accumulate_steps, abbreviated GAS: the number of gradient accumulation steps, as set in ds_config.json; in this case it defaults to 1
  4. global_batchsize, abbreviated GBS, always equals local_batchsize \* gradient_accumulate_steps \* data_parallel_size. In this case only data parallelism is used, so data_parallel_size = world_size (see the sketch below).
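
A minimal sketch of the GBS relation in item 4, plugged with this case's single-node 8-card run (LBS=12 from the fix_hp column below, GAS=1 as the ds_config.json default, data_parallel_size = world_size = 8):

```python
# GBS = LBS * GAS * DP_SIZE for the pure data-parallel setup described above.
local_batchsize = 12           # LBS (micro batch per device), from fix_hp
gradient_accumulate_steps = 1  # GAS, ds_config.json default
data_parallel_size = 8         # 1 node x 8 cards, data parallelism only

global_batchsize = (local_batchsize
                    * gradient_accumulate_steps
                    * data_parallel_size)
print(global_batchsize)  # 96 sequences per optimizer step
```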

* Common metrics

| Metric | Value | Notes |
| -------------- | ----------------------- | ------------------------------------------- |
| Task category | natural language understanding | |
| Model | deepspeed-llama2-7b | |
| Dataset | openwebtext | unless otherwise stated, the first 100M tokens are used for training |
| Data precision | fp16 | |
| Hyperparameter changes | fix_hp, see "Performance metrics" | special hyperparameters required to saturate the hardware during throughput evaluation |
| Hardware device (short name) | R300 | |
| Hardware memory usage | mem, see "Performance metrics" | commonly called "device memory", in GiB |
| Throughput | token/p/s, see "Performance metrics" | average number of tokens processed per card per second |
| Loss | loss, see "Performance metrics" | training loss |
| Compute utilization | MFU, see "Performance metrics" | as defined in the PaLM paper (see the sketch after the tables) |

* Performance metrics

| Configuration | fix_hp | tokens/p/s | loss | mem | MFU |
| ------------------- | ------------------- | -------- | ----- | ------- | ------ |
| R300 single node, 8 cards (1x8) | MPE=512 LBS=12 | | 5.27 | 29/32 | |
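
The tokens/p/s and MFU columns above are left blank. A hedged sketch of how MFU would be computed under the PaLM-paper definition the table references, using the simplified 6-times-parameters FLOPs-per-token approximation and a purely illustrative throughput value:

```python
# MFU = achieved model FLOPs per second per chip / peak FLOPs per second per chip.
n_params = 7e9                      # llama2-7b parameter count (approximate)
theoryflops = 128e12                # per-chip peak, from the vendor config below
tokens_per_second_per_chip = 300.0  # placeholder only; the table leaves this blank

flops_per_token = 6 * n_params      # simplified PaLM estimate, attention term ignored
mfu = tokens_per_second_per_chip * flops_per_token / theoryflops
print(f"MFU ~ {mfu:.2%}")
```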


@@ -0,0 +1,7 @@
seqlength = 512
batchsize = 12
datafilename = "openwebtext_llama2_100M.npy"
theoryflops = 128000000000000.0
epochs = 1
flashattn = False
gradient_checkpointing = True
@@ -0,0 +1,7 @@
#!/bin/bash

export PATH=/root/miniconda/envs/python38_torch201_cuda/bin:$PATH

export XDNN_FC_GEMM_DTYPE="float32"
export BKCL_FORCE_SYNC=1
export XPU_FC_AUTOTUNE_FILE="/data/dataset/llama2-7b/fc_autotune_fp16.log"
9 changes: 5 additions & 4 deletions training/run_benchmarks/config/test_conf.py
@@ -72,12 +72,12 @@
#"tacotron2:pytorch_1.13:A100:1:8:1": "/raid/dataset/tacotron2/LJSpeech/",
# "resnet50:pytorch_1.8:A100:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "mask_rcnn:pytorch_1.8:A100:1:8:1": "/raid/dataset/maskrcnn/coco2017",

# "wav2vec2:pytorch_1.13:A100:1:8:1": "/raid/dataset/wav2vec2_data/LibriSpeech",
# "WaveGlow:pytorch_1.13:A100:1:8:1": "/raid/dataset/LJSpeech/",

# "distilbert:pytorch_1.12:A100:1:8:1": "/raid/dataset/distilbert/",

# "transformer:pytorch_1.13:A100:1:8:1": "/raid/dataset/transformer/wmt14_en_de_joined_dict",
# "swin_transformer:pytorch_1.8:A100:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "transformer_xl:pytorch_1.8:A100:1:8:1": "/raid/dataset/transformer_xl/",
@@ -87,10 +87,10 @@
# "bert_hf:pytorch_1.13:A100:1:8:1": "/raid/dataset/bert_hf_train",
# "longformer:pytorch_1.12:A100:1:8:1": "/raid/dataset/longformer_train/",
# "detr:pytorch_1.13:A100:1:8:1": "/raid/dataset/detr/coco2017/",

# "llama2_7b:deepspeed:A100:1:8:1": "/raid/dataset/llama2_7b_pretrain",
# "aquila2_7b:flagscale:A100:1:8:1": "/raid/dataset/aquila2_7b_pretrain",

# "llama1_7B:paddle_2.5.1:TP1PP1SH2SP8A10040G:1:8:1":"/raid/dataset/llama/"
# "llama1_7B:paddle_2.5.1:TP2PP1SH1SP4A10040G:1:8:1":"/raid/dataset/llama/"
# "llama1_7B:paddle_2.5.1:TP2PP1SH2SP4A10040G:1:8:1":"/raid/dataset/llama/"
@@ -110,6 +110,7 @@
# "gpt3_13B:paddle_2.5.1:TP2PP4SH1SP1A10040G:1:8:1":"/raid/dataset/gpt-3/"

# kunlunxin cases
# "llama2_7b:deepspeed:R300:1:8:1": "/data/dataset/llama2-7b",
# "gpt2:pytorch:R300:1:8:1": "/raid/dataset/gpt2",
# "resnet50:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "mask_rcnn:pytorch:R300:1:8:1": "/raid/dataset/coco2017/",
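
The commented-out entry added under "kunlunxin cases" follows test_conf.py's case-key format. A small sketch of how such a key decomposes; the field names follow the usual FlagPerf reading (model:framework:hardware:nnodes:nproc_per_node:repeat) and are an interpretation, not something this diff states:

```python
def parse_case_key(key):
    """Split a test_conf.py case key into its assumed fields."""
    model, framework, hardware, nnodes, nprocs, repeat = key.split(":")
    return {
        "model": model,
        "framework": framework,
        "hardware": hardware,
        "nnodes": int(nnodes),
        "nproc_per_node": int(nprocs),
        "repeat": int(repeat),
    }


print(parse_case_key("llama2_7b:deepspeed:R300:1:8:1"))
# {'model': 'llama2_7b', 'framework': 'deepspeed', 'hardware': 'R300',
#  'nnodes': 1, 'nproc_per_node': 8, 'repeat': 1}
```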