refactor(API): improve usability #414

Open

wants to merge 27 commits into develop from feat/refactor-impl

Commits (27)
d90c1da  enable fsdp training and support huggingface models with ckpt in or out (zigzagcai, Jan 16, 2025)
0bc7552  Merge remote-tracking branch 'origin/develop' into feat/refactor-impl (zigzagcai, Feb 18, 2025)
a70aaf6  initial refactor: (1) reorg src structure to avoid cyclic imports (2)… (zigzagcai, Feb 14, 2025)
1123c08  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Feb 24, 2025)
2a17817  fix ci (zigzagcai, Feb 24, 2025)
9ef40ee  fix ci (zigzagcai, Feb 24, 2025)
99b3555  ljx adapt npu (li126com, Feb 24, 2025)
55e6055  update npu adapt (zigzagcai, Feb 24, 2025)
e0dda21  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Feb 25, 2025)
0167129  fix pylint (zigzagcai, Feb 25, 2025)
73c3ce1  update setup (zigzagcai, Feb 25, 2025)
70865ce  remove unused settings and temporarily remove other model_implementat… (zigzagcai, Mar 3, 2025)
991f07c  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Mar 3, 2025)
6498236  fix pylint (zigzagcai, Mar 3, 2025)
ded6daa  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Mar 4, 2025)
25cf10f  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Mar 5, 2025)
bdf5b0b  fix merge (zigzagcai, Mar 5, 2025)
05b58a9  rename transformers to huggingface_models to avoid name conflict (zigzagcai, Mar 5, 2025)
4cd207e  update args sanity checks and add support for FP8 (zigzagcai, Mar 5, 2025)
1cf9ee6  add 7B_internlm2_hf config and refine some fsdp or fp8 codes (zigzagcai, Mar 5, 2025)
f77b4f2  fix pylint (zigzagcai, Mar 5, 2025)
e854c90  typo fix (zigzagcai, Mar 7, 2025)
8e04b09  update fsdp wrap (zigzagcai, Mar 12, 2025)
cbf73d6  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Mar 13, 2025)
ca942a2  typo fix (zigzagcai, Mar 13, 2025)
f48c4af  Merge branch 'develop' into feat/refactor-impl (zigzagcai, Mar 14, 2025)
5da0471  support ep for fsdp (zigzagcai, Mar 26, 2025)
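
The unifying change across these commits is moving the training entry point from the repo-root train.py into the internlm package. A minimal before/after sketch of the launch flow, using the partition name and config path that appear in the README and CI diffs below:

```bash
# Before this PR: training is launched via the repo-root script.
srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 \
    python train.py --config ./configs/7B_sft.py

# After this PR: training is launched via the packaged module. The repo
# root must be importable, which is presumably why the CI workflows below
# now add `export PYTHONPATH=$PWD:$PYTHONPATH`.
export PYTHONPATH=$PWD:$PYTHONPATH
srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 \
    python -m internlm.launcher.launch --config ./configs/7B_sft.py
```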
2 changes: 2 additions & 0 deletions .github/workflows/demo_in_readme.yaml
@@ -63,6 +63,7 @@ jobs:
export GITHUB_WORKSPACE=$GITHUB_WORKSPACE
export SLURM_PARTITION=$SLURM_PARTITION
source activate ${evo_env_torch21_flash2}
+export PYTHONPATH=$PWD:$PYTHONPATH
sh ./ci_scripts/train/slurm_train.sh ${GITHUB_RUN_ID}-${GITHUB_JOB}
EOF

@@ -97,6 +98,7 @@ jobs:
export GITHUB_WORKSPACE=$GITHUB_WORKSPACE
export SLURM_PARTITION=$SLURM_PARTITION
source activate ${evo_env_torch21_flash2}
+export PYTHONPATH=$PWD:$PYTHONPATH
sh ./ci_scripts/train/torchrun.sh ${GITHUB_RUN_ID}-${GITHUB_JOB}
rm -rf $GITHUB_WORKSPACE/llm_ckpts
EOF
8 changes: 2 additions & 6 deletions .github/workflows/lint_check.yaml
@@ -18,25 +18,21 @@ jobs:
run: |
pip install flake8==v3.8.4
FLAKE_DISABLE_LIST="F403,F405,W504,W503,E203"
-flake8 --max-line-length=120 --ignore=$FLAKE_DISABLE_LIST --exclude=./internlm/model/ops/ring_flash_attn/zigzag_ring_flash_attn_with_sliding_window.py ./internlm/*
-flake8 --max-line-length=120 --ignore=$FLAKE_DISABLE_LIST ./train.py
+flake8 --max-line-length=120 --ignore=$FLAKE_DISABLE_LIST --exclude=./internlm/model/model_ops/ops/ring_flash_attn/zigzag_ring_flash_attn_with_sliding_window.py ./internlm/*

- name: lint-isort
run: |
pip install isort==5.12.0
isort --check --profile=black ./internlm/*
-isort --check --profile=black ./train.py

- name: lint-black
run: |
pip install black==22.8.0
BLACK_EXCLUDE_SETTINGS='\.venv/|\.local/|\.cache/|\.git/'
black --line-length=120 --check --exclude $BLACK_EXCLUDE_SETTINGS ./internlm/*
-black --line-length=120 --check --exclude $BLACK_EXCLUDE_SETTINGS ./train.py

- name: lint-pylint
run: |
pip install pylint==v2.17.2
PYLINT_DISABLE_LIST="C0114,C0415,W0212,W0235,W0238,W0621,C0103,R1735,C2801,E0402,C0412,W0719,R1728,W1514,W0718,W0105,W0707,C0209,W0703,W1203"
-pylint --rcfile .pylintrc --disable=$PYLINT_DISABLE_LIST --ignore=./internlm/model/ops/ring_flash_attn/zigzag_ring_flash_attn_with_sliding_window.py ./internlm/*
-pylint --rcfile .pylintrc --disable=$PYLINT_DISABLE_LIST ./train.py
+pylint --rcfile .pylintrc --disable=$PYLINT_DISABLE_LIST --ignore=./internlm/model/model_ops/ops/ring_flash_attn/zigzag_ring_flash_attn_with_sliding_window.py ./internlm/*
6 changes: 3 additions & 3 deletions README-ja-JP.md
@@ -99,7 +99,7 @@ data = dict(

When using 2 nodes and 16 GPUs in a Slurm environment, the command is as follows:
```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
```

When running with torch on 1 node and 8 GPUs, the command is as follows:
@@ -166,8 +166,8 @@ $ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py -
</td>
<td>
<ul>
<li><a href="tools/transformers/README.md">Convert ckpt to HF</a></li>
<li><a href="tools/transformers/README.md">Revert ckpt from HF</a></li>
<li><a href="huggingface_models/README.md">Convert ckpt to HF</a></li>
<li><a href="huggingface_models/README.md">Revert ckpt from HF</a></li>
<li><a href="tools/tokenizer.py">Raw Data Tokenizer</a></li>
<li><a href="tools/alpaca_tokenizer.py">Alpaca data Tokenizer</a></li>
</ul>
6 changes: 3 additions & 3 deletions README-zh-Hans.md
@@ -99,7 +99,7 @@ data = dict(

In a slurm environment, with 2 nodes and 16 GPUs, start training with the following command:
```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
```

In a torch environment, with 1 node and 8 GPUs, start training with the following command:
@@ -166,8 +166,8 @@ $ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py -
</td>
<td>
<ul>
<li><a href="tools/transformers/README-zh-Hans.md">将ckpt转为huggingface格式</a></li>
<li><a href="tools/transformers/README-zh-Hans.md">将ckpt从huggingface格式转为InternEvo格式</a></li>
<li><a href="huggingface_models/README-zh-Hans.md">将ckpt转为huggingface格式</a></li>
<li><a href="huggingface_models/README-zh-Hans.md">将ckpt从huggingface格式转为InternEvo格式</a></li>
<li><a href="tools/tokenizer.py">原始数据分词器</a></li>
<li><a href="tools/alpaca_tokenizer.py">Alpaca数据分词器</a></li>
</ul>
6 changes: 3 additions & 3 deletions README.md
@@ -99,7 +99,7 @@ Training can be started on slurm or torch distributed environment.

On slurm, using 2 nodes and 16 cards, the command is as follows:
```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
```

On torch, using 1 node and 8 cards, the command is as follows:
@@ -166,8 +166,8 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for ar
</td>
<td>
<ul>
<li><a href="tools/transformers/README.md">Convert ckpt to HF</a></li>
<li><a href="tools/transformers/README.md">Revert ckpt from HF</a></li>
<li><a href="huggingface_models/README.md">Convert ckpt to HF</a></li>
<li><a href="huggingface_models/README.md">Revert ckpt from HF</a></li>
<li><a href="tools/tokenizer.py">Raw Data Tokenizer</a></li>
<li><a href="tools/alpaca_tokenizer.py">Alpaca data Tokenizer</a></li>
</ul>
2 changes: 1 addition & 1 deletion ci_scripts/model/convert_to_hf.sh
@@ -25,7 +25,7 @@ if [[ -d ${CKPTS_OUTPUT} ]]; then
fi
fi

-python ./transformers/convert2hf_internlm.py --src ${CKPTS_INPUT} --tgt ${CKPTS_OUTPUT} --tokenizer ./tools/tokenizer_internlm.model
+python ./huggingface_models/convert2hf_internlm.py --src ${CKPTS_INPUT} --tgt ${CKPTS_OUTPUT} --tokenizer ./tools/tokenizer_internlm.model
[[ $? -ne 0 ]] && { echo "test convert2hf_internlm.py failed."; exit_code=$(($exit_code + 1)); }

#assert exists model
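
Commit 05b58a9 ("rename transformers to huggingface_models to avoid name conflict") addresses module shadowing: with the repository root on sys.path, a local transformers/ directory can take precedence over the pip-installed HuggingFace package. A hypothetical reproduction of the conflict, assuming the pre-rename layout (paths are illustrative):

```bash
# Hypothetical repro of the name conflict fixed by the rename. With the
# repo root on sys.path, `import transformers` may resolve to the local
# ./transformers/ directory instead of the HuggingFace library.
cd /path/to/InternEvo            # illustrative repo-root path
export PYTHONPATH=$PWD:$PYTHONPATH
python -c "import transformers; print(transformers.__file__)"
# Before the rename: prints the local ./transformers/... path (shadowed)
# After the rename:  prints .../site-packages/transformers/__init__.py
```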
2 changes: 0 additions & 2 deletions ci_scripts/train/ci_7B_sft.py
@@ -101,14 +101,12 @@
model = dict(
checkpoint=False,
num_attention_heads=NUM_ATTENTION_HEAD,
-embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
-apply_post_layer_norm=False,
dtype="torch.bfloat16",
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
2 changes: 1 addition & 1 deletion ci_scripts/train/generate_config.py
@@ -5,7 +5,7 @@
import os

from ci_scripts.common import com_func
-from internlm.core.context import Config
+from internlm.utils.config import Config


def generate_new_config(config_py_file, test_config_json, case_name):
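
As the diff above shows, the Config class now lives in internlm.utils.config rather than internlm.core.context, so out-of-tree scripts need the same one-line import update. A quick smoke check of the new path (assuming the repo root is on PYTHONPATH, as in the CI workflows):

```bash
# Verify the relocated import resolves; the module path is taken from the
# diff above, not from separate documentation.
python -c "from internlm.utils.config import Config; print(Config)"
```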
2 changes: 1 addition & 1 deletion ci_scripts/train/load_ckpt.sh
@@ -22,7 +22,7 @@ if [[ ! -f ${file} ]]; then
exit_code=$(($exit_code + 1))
fi

-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ${file}
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launcher/launch.py --config ${file}
[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }


2 changes: 1 addition & 1 deletion ci_scripts/train/slurm_train.sh
@@ -22,7 +22,7 @@ if [[ -d ${CKPTS20_PATH} ]]; then
fi
fi

-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./ci_scripts/train/ci_7B_sft.py
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launcher/launch.py --config ./ci_scripts/train/ci_7B_sft.py
[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }

num=$(num_files "${CKPTS20_OUTPUT}")
2 changes: 1 addition & 1 deletion ci_scripts/train/torchrun.sh
@@ -22,7 +22,7 @@ if [[ -d ${CKPTS20_PATH} ]]; then
fi
fi

-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 train.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 internlm/launcher/launch.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
[[ $? -ne 0 ]] && { echo "test torch training failed."; exit_code=$(($exit_code + 1)); }

num=$(num_files "${CKPTS_OUTPUT}")
2 changes: 0 additions & 2 deletions configs/1.8B_MoE16_sft.py
@@ -136,14 +136,12 @@
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing; optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
-embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=False,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
-apply_post_layer_norm=False,
dtype="torch.bfloat16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,