Description
Is there an existing issue / discussion for this?
- I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- I have searched FAQ
Current Behavior
I am following the Feishu cloud document to LoRA fine-tune MiniCPM-V-2.6, and running `bash finetune_lora.sh` fails with an error:
My script is as follows:
#!/bin/bash
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="/home/xit/minicpm-v-2_6"
DATA="/home/xit/dataset/alldata.json"
EVAL_DATA="/home/xit/dataset/alldata.json"
LLM_TYPE="qwen2"
export NCCL_P2P_DISABLE=1 # remove this line for GPUs that support NCCL P2P, such as the A100
export NCCL_IB_DISABLE=1 # remove this line for GPUs such as the A100
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
CUDA_VISIBLE_DEVICES="4,5,6,7" torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)" \
    --model_max_length 2048 \
    --max_slice_nums 9 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir "/home/xit/output/loramodel" \
    --logging_dir "/home/xit/output/logging" \
    --logging_strategy "steps" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard"
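Expected Behavior
1. I hope this error can be resolved.
2. My server has 8 GPUs, and I only want to use GPUs 4, 5, 6, and 7. I added `CUDA_VISIBLE_DEVICES="4,5,6,7"` in front of `torchrun`, but I am not sure whether this is correct.
3. Thank you.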
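On the GPU selection question: with `CUDA_VISIBLE_DEVICES="4,5,6,7"` only four devices are visible to the launched processes (renumbered 0 to 3), while the script sets `GPUS_PER_NODE=8`, so `torchrun` would start eight processes. Whether this mismatch is the cause of the error is an assumption, since the error text is not shown. A minimal sketch that derives the process count from the visible devices so the two cannot drift apart:

```shell
#!/bin/sh
# Restrict the job to physical GPUs 4-7; inside the processes they appear as devices 0-3.
export CUDA_VISIBLE_DEVICES="4,5,6,7"

# Count the comma-separated entries so the number of processes matches the visible GPUs.
GPUS_PER_NODE=$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)

echo "GPUS_PER_NODE=$GPUS_PER_NODE"   # prints GPUS_PER_NODE=4
```

The resulting `GPUS_PER_NODE` can then be passed to `--nproc_per_node` as in the script above; exporting `CUDA_VISIBLE_DEVICES` once at the top has the same effect as prefixing it to the `torchrun` command.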
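Steps To Reproduce
1. Created a new conda environment and installed Python 3.9.
2. Installed the dependencies from requirements.txt with pip.
3. Ran the following commands:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip install .
4. pip install peft
5. Ran bash finetune_lora.sh, which then failed with the error.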
Environment
- OS: Ubuntu 22
- Python: 3.9
- Transformers: 4.40.0
- PyTorch: 2.4.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.4
My pip list:
accelerate 1.2.0
addict 2.4.0
aiofiles 23.2.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.16
aiosignal 1.3.2
annotated-types 0.7.0
anyio 4.9.0
async-timeout 5.0.1
attrs 25.3.0
autoawq 0.2.7.post2
bitsandbytes 0.45.5
Brotli 1.0.9
certifi 2025.1.31
charset-normalizer 3.3.2
click 8.1.8
cloudpickle 3.1.1
cmake 4.0.0
colorama 0.4.6
contourpy 1.3.0
cycler 0.12.1
datasets 3.5.0
decord 0.6.0
deepspeed 0.16.6+a21e5b9d
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
e 1.4.5
editdistance 0.6.2
einops 0.7.0
et_xmlfile 2.0.0
exceptiongroup 1.2.2
fairscale 0.4.0
fastapi 0.115.12
ffmpy 0.5.0
filelock 3.13.1
fonttools 4.57.0
frozenlist 1.5.0
fsspec 2024.12.0
gmpy2 2.2.1
gradio 4.41.0
gradio_client 1.3.0
h11 0.14.0
hjson 3.1.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.30.1
idna 3.7
importlib_resources 6.5.2
interegular 0.3.3
Jinja2 3.1.6
jiter 0.9.0
joblib 1.4.2
jsonlines 4.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kiwisolver 1.4.7
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.3
lxml 5.3.2
markdown-it-py 3.0.0
markdown2 2.4.10
MarkupSafe 2.1.5
matplotlib 3.7.4
mdurl 0.1.2
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
modelscope_studio 0.4.0.9
more-itertools 10.1.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.3.2
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.4
nltk 3.8.1
numba 0.60.0
numpy 1.24.4
nvidia-cublas-cu12 12.4.2.65
nvidia-cuda-cupti-cu12 12.4.99
nvidia-cuda-nvrtc-cu12 12.4.99
nvidia-cuda-runtime-cu12 12.4.99
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.0.44
nvidia-curand-cu12 10.3.5.119
nvidia-cusolver-cu12 11.6.0.99
nvidia-cusparse-cu12 12.3.0.142
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.4.99
openai 1.70.0
opencv-python 4.11.0.86
opencv-python-headless 4.5.5.64
openpyxl 3.1.2
orjson 3.10.16
outlines 0.0.46
packaging 23.2
pandas 2.2.3
peft 0.11.1
Pillow 10.1.0
pip 25.0
portalocker 3.1.1
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.3.1
protobuf 4.25.0
psutil 7.0.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 19.0.1
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.19.1
pyparsing 3.2.3
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
pyzmq 26.4.0
ray 2.44.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rich 14.0.0
rpds-py 0.24.0
ruff 0.11.5
sacrebleu 2.3.2
safetensors 0.5.3
scipy 1.13.1
seaborn 0.13.0
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 75.8.0
shellingham 1.5.4
shortuuid 1.0.11
six 1.17.0
sniffio 1.3.1
socksio 1.0.0
starlette 0.46.1
sympy 1.13.3
tabulate 0.9.0
tiktoken 0.9.0
timm 0.9.10
tokenizers 0.19.1
tomlkit 0.12.0
torch 2.4.0+cu124
torchaudio 2.4.0
torchvision 0.19.0+cu124
tqdm 4.66.1
transformers 4.44.0
triton 3.0.0
typer 0.15.2
typing_extensions 4.8.0
typing-inspection 0.4.0
tzdata 2025.2
ultralytics 8.3.104
ultralytics-thop 2.0.14
urllib3 2.3.0
uvicorn 0.24.0.post1
uvloop 0.21.0
vllm 0.5.4
vllm-flash-attn 2.6.1
watchfiles 1.0.4
websockets 12.0
wheel 0.45.1
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.19.0
zipp 3.21.0
zstandard 0.23.0
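The Environment section reports Transformers 4.40.0, while the pip list above shows transformers 4.44.0. A minimal sketch for cross-checking a reported version against `pip list` output; the helper name `pkg_version` and the embedded sample lines are illustrative assumptions, and in practice the text would come from `pip list` itself:

```shell
#!/bin/sh
# Sample lines copied from the pip list in this report.
PIP_LIST='torch 2.4.0+cu124
transformers 4.44.0
peft 0.11.1
deepspeed 0.16.6+a21e5b9d'

# pkg_version NAME: print the version column for an exact package-name match.
# (Hypothetical helper, not part of pip.)
pkg_version() {
    printf '%s\n' "$PIP_LIST" | awk -v name="$1" '$1 == name { print $2 }'
}

REPORTED_TRANSFORMERS="4.40.0"   # value stated in the Environment section
INSTALLED_TRANSFORMERS=$(pkg_version transformers)

# Flag any difference between the reported and the installed version.
if [ "$REPORTED_TRANSFORMERS" != "$INSTALLED_TRANSFORMERS" ]; then
    echo "mismatch: reported $REPORTED_TRANSFORMERS, installed $INSTALLED_TRANSFORMERS"
fi
```

Matching on the exact package name avoids picking up similarly named entries such as `pydantic_core` when looking for `pydantic`.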
Anything else?
No response