48 changes: 48 additions & 0 deletions pretrain/scripts/v3-converter/README.md
@@ -0,0 +1,48 @@
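# Checkpoint conversion script

This script converts Megatron-format checkpoints to Hugging Face format.

## Requirements
- Required resources: 1 node on the `gpu` partition
- No VRAM is used; the GPU is needed only for the CUDA check in PyTorch

## How to run

### Notes
Before running this script, use the installer appropriate for your environment to install the environment into the experiment directory (e.g. /data/experiments/{exp-id}/environment).
Make sure the earlier checkpoints have been saved (e.g. /data/experiments/{exp-id}/checkpoints/).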
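
The two prerequisites above can be checked before submitting the job. The following is a minimal sketch, assuming the experiment directory contains subdirectories named `environment` and `checkpoints` as in the examples; `check_experiment_dir` is a hypothetical helper and is not part of `convert.sh`.

```shell
# Hypothetical pre-flight check (not part of convert.sh).
# Returns 0 only if both expected subdirectories exist.
check_experiment_dir() {
  local exp_dir="$1"
  local missing=0
  for sub in environment checkpoints; do
    if [ ! -d "${exp_dir}/${sub}" ]; then
      echo "Missing: ${exp_dir}/${sub}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage (path is illustrative):
#   check_experiment_dir /data/experiments/{exp-id} && echo "ready to convert"
```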

### Steps

1. Move to the experiment directory:
```shell
cd /data/experiments/{exp-id}
```

2. Copy the script into the directory that contains the runtime environment, and create the log output folder:
```shell
cp {this directory}/convert.sh .
mkdir outputs
```

3. Run the script:
```shell
# For a cluster with SLURM
sbatch --partition {partition} convert.sh SOURCE_DIR TARGET_DIR
# For a cluster without SLURM
bash convert.sh SOURCE_DIR TARGET_DIR > outputs/convert.out 2> outputs/convert.err
```


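### Parameters
- `SOURCE_DIR`: Path to the Megatron checkpoint directory, including the `iter_NNNNNNN` component (e.g. `.../checkpoints/iter_0001000`)
- `TARGET_DIR`: Output directory for the Hugging Face format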
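
To see how `SOURCE_DIR` is interpreted, the sketch below mirrors the way `convert.sh` derives the iteration number from the last path component. `iter_from_dir` is a hypothetical name used only for illustration, and the numeric check is written with `case` instead of the script's `[[ ... =~ ... ]]` test.

```shell
# Sketch of the iteration-number extraction performed by convert.sh.
# `iter_from_dir` is a hypothetical helper, not part of the script.
iter_from_dir() {
  local dir="${1%/}"                  # drop one trailing slash, as convert.sh does
  local name iter
  name=$(basename "$dir")             # e.g. iter_0001000
  iter=$(echo "$name" | sed 's/^iter_0*//')   # strip the prefix and zero padding
  case "$iter" in
    ''|*[!0-9]*)
      echo "Error: '$name' does not yield a valid iteration number" >&2
      return 1
      ;;
  esac
  echo "$iter"
}

iter_from_dir "/data/experiments/{exp-id}/checkpoints/iter_0001000"   # prints 1000
```

A directory name that does not reduce to digits (for example `latest`) makes the function return non-zero, which corresponds to the script exiting with `ITER is not a valid number`.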

### Example
```shell
sbatch convert.sh /data/experiments/{exp-id}/checkpoints/iter_0001000 /data/experiments/{exp-id}/hf_checkpoints/iter_0001000
```

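### About the working directory
During execution, a working directory (`ckpt_convert_YYYYMMDDHHMMSS`) is created under $HOME.
If the run fails, this directory is intentionally left in place for debugging, so delete it yourself.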
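
Leftover working directories can be found with a short helper. This is a sketch under the assumption that they all match `$HOME/ckpt_convert_*`; `list_leftover_workdirs` is a hypothetical name, and `-maxdepth` requires GNU or BSD `find`.

```shell
# Hypothetical helper: list working directories left behind by failed runs.
# The optional argument overrides the base directory (default: $HOME).
list_leftover_workdirs() {
  local base="${1:-$HOME}"
  find "$base" -maxdepth 1 -type d -name 'ckpt_convert_*'
}

# Usage: review the list first, then delete what is no longer needed.
#   list_leftover_workdirs
#   rm -r "$HOME/ckpt_convert_YYYYMMDDHHMMSS"
```

The working directory holds only a symbolic link to the original checkpoint, and `rm -r` removes the link itself without following it, so the source checkpoint is not touched.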
71 changes: 71 additions & 0 deletions pretrain/scripts/v3-converter/convert.sh
@@ -0,0 +1,71 @@
#!/bin/bash
# Model conversion script for converting Megatron format checkpoints into Hugging Face format
#
# This script needs one node on the `gpu` partition of the cluster.
# No VRAM is used, but a GPU is still required for the CUDA check in PyTorch.
#
# Usage:
# On a cluster with SLURM:
# Run `sbatch --partition {partition} convert.sh SOURCE_DIR TARGET_DIR`
# On a cluster without SLURM:
# Run `bash convert.sh SOURCE_DIR TARGET_DIR > outputs/convert.out 2> outputs/convert.err`
# - SOURCE_DIR: Megatron checkpoint directory including `iter_NNNNNNN`
# - TARGET_DIR: Output directory for the Hugging Face format
#
# Example:
# sbatch convert.sh /data/experiments/{exp-id}/checkpoints/iter_0001000 /data/experiments/{exp-id}/hf_checkpoints/iter_0001000
#
#SBATCH --job-name=ckpt-convert
#SBATCH --partition=<FIX_ME>
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

set -e

MEGATRON_CHECKPOINT_DIR=${1%/}
HF_CHECKPOINT_DIR=$2

ENV_DIR=environment

source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

TOKENIZER_MODEL_DIR=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2

TARGET_ITER_DIR=$(basename "$MEGATRON_CHECKPOINT_DIR") # iter_NNNNNNN
ITER=$(echo "$TARGET_ITER_DIR" | sed 's/^iter_0*//') # NNNNNNN (no 0 padding)
if [[ -z "$ITER" || ! "$ITER" =~ ^[0-9]+$ ]]; then # check if directory is valid
  echo "Error: ITER is not a valid number. Exiting."
  exit 1
fi

# Create a unique temporary working directory to avoid affecting the original directory and
# to allow multiple runs to execute simultaneously.
TMP_DIR=${HOME}/ckpt_convert_$(date +%Y%m%d%H%M%S)
mkdir -p "${TMP_DIR}"
ln -s "$(readlink -f "$MEGATRON_CHECKPOINT_DIR")" "${TMP_DIR}/${TARGET_ITER_DIR}"
echo "$ITER" > "${TMP_DIR}/latest_checkpointed_iteration.txt"

echo "Converting $MEGATRON_CHECKPOINT_DIR"

python "${ENV_DIR}/src/Megatron-LM/tools/checkpoint/convert.py" \
  --model-type GPT \
  --loader mcore \
  --saver llama2_hf \
  --load-dir "$TMP_DIR" \
  --save-dir "$HF_CHECKPOINT_DIR" \
  --hf-tokenizer-path "$TOKENIZER_MODEL_DIR" \
  --save-dtype bfloat16 \
  --loader-transformer-impl "transformer_engine" \
  --megatron-path "${ENV_DIR}/src/Megatron-LM"

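cp "${TOKENIZER_MODEL_DIR}"/* "$HF_CHECKPOINT_DIR"

# Reached only on success: with `set -e`, a failure above exits first and leaves TMP_DIR for debugging.
rm -r "$TMP_DIR"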
echo "Done"
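
The working directory name in the script is based on a timestamp with one-second resolution, so two runs started within the same second would end up with the same directory. As a possible alternative (a suggestion, not what `convert.sh` does), `mktemp -d` can generate a name that is guaranteed not to exist yet. `make_workdir` is a hypothetical helper name.

```shell
# Sketch only: collision-free working directory creation with mktemp.
# The ckpt_convert_ prefix mirrors convert.sh; the random suffix replaces the timestamp.
make_workdir() {
  local base="${1:-$HOME}"
  mktemp -d "${base}/ckpt_convert_XXXXXX"   # prints the path of the new directory
}

# Usage inside a script:
#   TMP_DIR=$(make_workdir)
```

With this variant the directory name no longer encodes the start time, so the README's `ckpt_convert_YYYYMMDDHHMMSS` description would need to change accordingly.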