Distributed training error in dynamic graph (PyNative) mode #1780

Open
dayunyan opened this issue Oct 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@dayunyan

Describe the bug (Mandatory)
While fine-tuning Qwen2-7b-instruct with IA3, an error was raised in mindnlp.core.nn.modules.container.py:

[screenshot: traceback]

After I changed p.size() to p.shape, an unexpected error was raised:

[screenshot: second traceback]
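For context: MindSpore tensors expose shape as a property and size as an integer element count, whereas PyTorch exposes size() as a method. A minimal sketch (assuming MindSpore 2.x; not code from this report) of the mismatch that likely underlies the first traceback:

import mindspore
from mindspore import Parameter, ops

# MindSpore Parameter inherits from Tensor
p = Parameter(ops.zeros((2, 3), mindspore.float32), name="p")

print(p.shape)  # (2, 3) -- shape is a property
print(p.size)   # 6 -- size is an int (element count), not a method
# p.size()      # TypeError: 'int' object is not callable (PyTorch-style call)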

  • Hardware Environment (Ascend/GPU/CPU):

GPU 0, 1

  • Software Environment (Mandatory):
    -- MindSpore version: 2.2.14
    -- Python version: 3.9
    -- OS platform and distribution: Ubuntu 22.04
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

# Imports as assumed from mindnlp's HF-style API (not shown in the original report);
# args, train_dataset, and eval_dataset come from the rest of the user's script.
from mindnlp.transformers import AutoTokenizer, AutoModelForCausalLM
from mindnlp.peft import IA3Config, TaskType, get_peft_model
from mindnlp.engine import TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path, mirror="modelscope", revision="master"
)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path, mirror="modelscope", revision="master"
).half()  # load weights in fp16
# IA3 adapter config; q_proj/v_proj are Qwen2 attention projections
peft_config = IA3Config(
    peft_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    target_modules=["q_proj", "v_proj"],  # ["query_key_value"]
    feedforward_cells=[],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir=args.save_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=1,
    learning_rate=args.learning_rate,
    num_train_epochs=args.num_epochs,
    lr_scheduler_type="polynomial",
    lr_scheduler_kwargs={
        "lr_end": args.learning_rate * 1e-5,
        "power": args.power,
    },
    logging_steps=200,
    save_strategy="epoch",
    save_total_limit=1,
    # load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
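The report omits how the two-GPU job was launched. For reference, a hedged sketch of a typical MindSpore data-parallel setup on GPU (the launch command and script name are illustrative, and mindnlp's Trainer may perform equivalent initialization internally):

import mindspore
from mindspore.communication import init, get_rank, get_group_size

# Launch with e.g.: mpirun -n 2 python train_ia3.py  (script name is made up)
init("nccl")  # NCCL backend for GPU collectives
mindspore.set_auto_parallel_context(
    parallel_mode="data_parallel",
    gradients_mean=True,
)
print(f"rank {get_rank()} / world size {get_group_size()}")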

Expected behavior (Mandatory)
A clear and concise description of what you expected to happen.

Screenshots / Logs (Mandatory)
If applicable, add screenshots to help explain your problem.

Additional context (Optional)
Add any other context about the problem here.

@dayunyan added the bug label on Oct 27, 2024
@lvyufeng
Collaborator

Are you running on GPU?

@dayunyan
Author

> Are you running on GPU?

Yes, two 3090s.
