
[BUG]: shardformer: pipeline forward error with customized layer distribution #5187

@insujang


🐛 Describe the bug

Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies share the following inconsistency:

In get_held_layers(), a policy uses self.distribute_layers() and self.get_stage_index(), which are customizable:

layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)
if stage_manager.is_first_stage():
    held_layers.append(module.wte)
    held_layers.append(module.wpe)
    held_layers.append(module.drop)
start_idx, end_idx = self.get_stage_index(layers_per_stage, stage_manager.stage)

But in set_pipeline_forward(), the policy uses Policy.distribute_layers() and Policy.get_stage_index():

layers_per_stage = Policy.distribute_layers(len(module.h), stage_manager.num_stages)
stage_index = Policy.get_stage_index(layers_per_stage, stage_manager.stage)

If these methods are overridden, the two code paths compute different layer assignments, which raises an error during the pipeline forward.
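A possible direction for a fix (a minimal sketch only, not a tested patch) would be to have set_pipeline_forward() also dispatch through the instance, so that overridden methods are picked up consistently with get_held_layers():

layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)  # honors subclass overrides
stage_index = self.get_stage_index(layers_per_stage, stage_manager.stage)           # honors subclass overrides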

How to reproduce

I tested with examples/language/gpt/hybridparallelism/finetune.py.
For the hybrid_parallel plugin, add a custom policy:

    elif args.plugin == "hybrid_parallel":
        BATCH_SIZE = 128

        from typing import List

        from colossalai.shardformer.policies.base_policy import Policy
        from colossalai.shardformer.policies.gpt2 import GPT2ForSequenceClassificationPolicy

        class CustomGPT2Policy(GPT2ForSequenceClassificationPolicy):
            @staticmethod
            def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
                # Give the first stage 4 extra layers; split the remaining layers evenly.
                layers_per_stage = Policy.distribute_layers(num_layers - 4, num_stages)
                layers_per_stage[0] += 4
                return layers_per_stage

        plugin = HybridParallelPlugin(
            tp_size=1,
            pp_size=4,
            num_microbatches=None,
            microbatch_size=8,
            zero_stage=0,
            precision="fp16",
            initial_scale=1,
            custom_policy=CustomGPT2Policy(),
        )

which distributes layers slightly differently: the first stage gets 4 more layers than the even split.
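For illustration only (assuming a 12-layer GPT-2 and pp_size=4, and that the base Policy.distribute_layers() splits evenly when the division is exact), the two code paths then disagree on the per-stage layer counts:

custom = CustomGPT2Policy.distribute_layers(12, 4)  # used by get_held_layers(); expected [6, 2, 2, 2]
base = Policy.distribute_layers(12, 4)              # used by set_pipeline_forward(); expected [3, 3, 3, 3]
# The layers a stage actually holds no longer match the index range its pipeline forward tries to run.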

This leads to the following error:

...
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pytorch_utils.py", line 107, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
TypeError: addmm(): argument 'input' (position 1) must be Tensor, not NoneType

Environment

torch 2.1.0 + cu118
