Fix training of pipeline based peft's lora model #5477

Merged
9 commits merged into microsoft:master on Oct 29, 2024

Conversation

xuanhua
Contributor

@xuanhua xuanhua commented Apr 29, 2024

Hi, guys

I found an assertion failure when training a Hugging Face LoRA-based model in pipeline-parallel style.

Here are the steps I used to create my model:

  1. Load the pre-trained chatglm-6b model from Hugging Face, as Model_A
  2. Use Hugging Face peft's get_peft_model(...) and my LoraConfig(...) on Model_A to create the LoRA model, as Model_B
  3. Create my own pipeline-based model Model_C from Model_B (a rough sketch follows this list)
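
For context, here is a rough sketch of those three steps. The actual train_pipeline.py is not part of this PR, so the pipeline-wrapping helper below (build_pipeline_layers) is a hypothetical placeholder; only the base-model loading and the peft calls follow the real APIs.

from transformers import AutoModel
from peft import LoraConfig, get_peft_model
from deepspeed.pipe import PipelineModule

# 1. Model_A: the pre-trained base model
model_a = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# 2. Model_B: add LoRA adapters; only the "query_key_value" modules get LoRA
#    weights, everything else (including the embedding) stays frozen
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["query_key_value"],
                         lora_dropout=0.1, bias="none", task_type="CAUSAL_LM",
                         inference_mode=False)
model_b = get_peft_model(model_a, lora_config)

# 3. Model_C: a pipeline-parallel wrapper. build_pipeline_layers(...) is a
#    hypothetical helper that slices Model_B into LayerSpec/TiedLayerSpec stages,
#    with the embedding declared as a tied layer (as described below).
model_c = PipelineModule(layers=build_pipeline_layers(model_b), num_stages=2)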

I run Model_C on two 3090 Ti GPUs, and the assertion failure looks like this:

Traceback (most recent call last):
  File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 372, in <module>
    main()
  File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 351, in main
    loss = engine.train_batch(data_iter=train_dataloader)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 375, in train_batch
    self._exec_schedule(sched)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1375, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 276, in _exec_reduce_tied_grads
    dist.all_reduce(grad, group=group)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 496, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 159, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 463, in _check_single_tensor
    raise RuntimeError(
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.

After some debugging, I found that the root cause is my LoRA configuration (below): it only adds the extra LoRA layers to the QKV-related modules, not to the embedding layer, so all of the embedding layer's parameters are frozen.

lora_config = LoraConfig(r=8,  # copied from finetuning_lora.py
                         lora_alpha=32,
                         target_modules=["query_key_value"],
                         lora_dropout=0.1,
                         bias="none",
                         task_type="CAUSAL_LM",
                         inference_mode=False,
                         )

In my implementation of the pipeline-based model, I declared the embedding layer as a tied layer. So the situation is that the embedding layer has no gradients at all, yet as a tied layer it needs to be synced between the two GPUs. The gradient value is None, but it is still passed to the all_reduce operation.
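
For illustration only (this snippet is not from the PR), a tiny standalone check shows why the gradient ends up as None: a parameter with requires_grad=False never receives a gradient during backward, which is exactly what happens to the fully frozen tied embedding here.

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
emb.weight.requires_grad_(False)   # mimic a fully frozen (non-LoRA) embedding
proj = nn.Linear(4, 2)             # a trainable layer after the embedding

loss = proj(emb(torch.tensor([1, 2, 3]))).sum()
loss.backward()

print(proj.weight.grad is None)    # False: trainable params get gradients
print(emb.weight.grad is None)     # True: the frozen embedding's grad stays None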

Currently, my fix is simple: add a check for whether this grad is None before calling all_reduce.
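
For reference, here is a minimal sketch of that kind of guard. The loop and variable names below are illustrative, not the actual diff in deepspeed/runtime/pipe/engine.py:

# Illustrative sketch, not the actual DeepSpeed code: skip the collective when a
# tied parameter has no gradient (e.g. because it is frozen by the LoRA config).
for p in tied_module.parameters():
    grad = p.grad
    if grad is None:                    # frozen parameter: nothing to synchronize
        continue
    dist.all_reduce(grad, group=group)  # only reduce real gradient tensors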

@xuanhua xuanhua requested a review from duli2012 as a code owner April 29, 2024 12:18
@xuanhua
Contributor Author

xuanhua commented May 7, 2024

@duli2012 Hi, I'm not sure whether this pull request meets the project's requirements. Do you have any suggestions on this PR? Looking forward to your reply :)

@loadams loadams requested review from tjruwase and tohtana May 22, 2024 17:17
Contributor

@tohtana tohtana left a comment


@xuanhua Sorry for the delay. Let's merge this after the tests pass.

@xuanhua
Contributor Author

xuanhua commented Sep 23, 2024

@tohtana, thank you for your reply. I saw some unit test failures above; do I need to look into them?

@tohtana
Contributor

tohtana commented Sep 23, 2024

@xuanhua I wonder if this is an issue on our CI. Let us take a look and restart after it is fixed.

@loadams loadams self-requested a review as a code owner October 28, 2024 20:08
@loadams loadams added this pull request to the merge queue Oct 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 29, 2024
@loadams loadams added this pull request to the merge queue Oct 29, 2024
Merged via the queue into microsoft:master with commit e4a247e Oct 29, 2024
13 checks passed