Distributed training error in dynamic graph (PyNative) mode #1780

Open
dayunyan opened this issue Oct 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@dayunyan

Describe the bug (Mandatory)
While fine-tuning Qwen2-7b-instruct with IA3, an error was raised in mindnlp.core.nn.modules.container.py:

[screenshot: traceback]

After I changed p.size() to p.shape, an unexpected error was raised:

[screenshot: second traceback]
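For context: MindSpore tensors expose shape as a property and size as an integer element count, whereas PyTorch exposes size() as a method. A minimal sketch (assuming MindSpore 2.x; not code from this report) of the mismatch that likely underlies the first traceback:

import mindspore
from mindspore import Parameter, ops

# MindSpore Parameter inherits from Tensor
p = Parameter(ops.zeros((2, 3), mindspore.float32), name="p")

print(p.shape)  # (2, 3) -- shape is a property
print(p.size)   # 6 -- size is an int (element count), not a method
# p.size()      # TypeError: 'int' object is not callable (PyTorch-style call)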

  • Hardware Environment (Ascend/GPU/CPU):

GPU 0, 1

  • Software Environment (Mandatory):
    -- MindSpore version: 2.2.14
    -- Python version: 3.9
    -- OS platform and distribution: Ubuntu 22.04
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

# Imports as assumed from mindnlp's HF-style API (not shown in the original report);
# args, train_dataset, and eval_dataset come from the rest of the user's script.
from mindnlp.transformers import AutoTokenizer, AutoModelForCausalLM
from mindnlp.peft import IA3Config, TaskType, get_peft_model
from mindnlp.engine import TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path, mirror="modelscope", revision="master"
)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path, mirror="modelscope", revision="master"
).half()  # load weights in fp16
# IA3 adapter config; q_proj/v_proj are Qwen2 attention projections
peft_config = IA3Config(
    peft_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    target_modules=["q_proj", "v_proj"],  # ["query_key_value"]
    feedforward_cells=[],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir=args.save_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=1,
    learning_rate=args.learning_rate,
    num_train_epochs=args.num_epochs,
    lr_scheduler_type="polynomial",
    lr_scheduler_kwargs={
        "lr_end": args.learning_rate * 1e-5,
        "power": args.power,
    },
    logging_steps=200,
    save_strategy="epoch",
    save_total_limit=1,
    # load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
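The report omits how the two-GPU job was launched. For reference, a hedged sketch of a typical MindSpore data-parallel setup on GPU (the launch command and script name are illustrative, and mindnlp's Trainer may perform equivalent initialization internally):

import mindspore
from mindspore.communication import init, get_rank, get_group_size

# Launch with e.g.: mpirun -n 2 python train_ia3.py  (script name is made up)
init("nccl")  # NCCL backend for GPU collectives
mindspore.set_auto_parallel_context(
    parallel_mode="data_parallel",
    gradients_mean=True,
)
print(f"rank {get_rank()} / world size {get_group_size()}")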

Expected behavior (Mandatory)
A clear and concise description of what you expected to happen.

Screenshots / Logs (Mandatory)
If applicable, add screenshots to help explain your problem.

Additional context (Optional)
Add any other context about the problem here.

@dayunyan added the bug label on Oct 27, 2024
@lvyufeng
Collaborator

Are you running on GPU?

@dayunyan
Author

> Are you running on GPU?

Yes, two 3090s.
