🐛 Describe the bug
Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies share the following inconsistent implementation:
In get_held_layers(), a policy uses self.distribute_layers() and self.get_stage_index(), both of which can be overridden in a subclass:
ColossalAI/colossalai/shardformer/policies/gpt2.py, lines 170 to 175 at 79718fa:

```python
layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)
if stage_manager.is_first_stage():
    held_layers.append(module.wte)
    held_layers.append(module.wpe)
    held_layers.append(module.drop)
start_idx, end_idx = self.get_stage_index(layers_per_stage, stage_manager.stage)
```
But in set_pipeline_forward(), the policy calls the base-class Policy.distribute_layers() and Policy.get_stage_index() directly:
ColossalAI/colossalai/shardformer/policies/gpt2.py, lines 192 to 193 at 79718fa:

```python
layers_per_stage = Policy.distribute_layers(len(module.h), stage_manager.num_stages)
stage_index = Policy.get_stage_index(layers_per_stage, stage_manager.stage)
```
If these functions are overridden, the two code paths compute different layer distributions, which raises an error during the pipeline forward pass.
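A minimal sketch of one possible fix (assuming set_pipeline_forward() has no other reason to bypass subclass overrides) is to dispatch both call sites through the instance:

```python
# In set_pipeline_forward(), use instance dispatch instead of the Policy
# base class, so that subclass overrides apply consistently:
layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)
stage_index = self.get_stage_index(layers_per_stage, stage_manager.stage)
```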
How to reproduce
I tested with examples/language/gpt/hybridparallelism/finetune.py.
For the hybrid_parallel plugin, add a custom policy:
```python
elif args.plugin == "hybrid_parallel":
    BATCH_SIZE = 128
    from typing import List

    from colossalai.shardformer.policies.base_policy import Policy
    from colossalai.shardformer.policies.gpt2 import GPT2ForSequenceClassificationPolicy

    class CustomGPT2Policy(GPT2ForSequenceClassificationPolicy):
        @staticmethod
        def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
            # Hold 4 layers back, distribute the rest evenly, then give
            # the extra 4 layers to the first stage.
            layers_per_stage = Policy.distribute_layers(num_layers - 4, num_stages)
            layers_per_stage[0] += 4
            return layers_per_stage

    plugin = HybridParallelPlugin(
        tp_size=1,
        pp_size=4,
        num_microbatches=None,
        microbatch_size=8,
        zero_stage=0,
        precision="fp16",
        initial_scale=1,
        custom_policy=CustomGPT2Policy(),
    )
```
This distributes the layers slightly differently: the first stage holds 4 more layers than it otherwise would, as the sketch below illustrates.
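To see why this breaks, here is a small self-contained sketch comparing the distribution used by get_held_layers() with the one used by set_pipeline_forward(), for a 12-layer GPT-2 across 4 stages. The even split below is an approximation of the default Policy.distribute_layers(); the exact placement of remainder layers may differ upstream.

```python
from typing import List

def default_distribute(num_layers: int, num_stages: int) -> List[int]:
    # Approximation of Policy.distribute_layers: split layers as evenly
    # as possible across stages (remainder handling may differ upstream).
    quotient, remainder = divmod(num_layers, num_stages)
    return [quotient + (1 if i < remainder else 0) for i in range(num_stages)]

def custom_distribute(num_layers: int, num_stages: int) -> List[int]:
    # The override from the reproduction above.
    layers_per_stage = default_distribute(num_layers - 4, num_stages)
    layers_per_stage[0] += 4
    return layers_per_stage

# set_pipeline_forward() always uses the base-class version ...
print(default_distribute(12, 4))  # [3, 3, 3, 3]
# ... while get_held_layers() uses the override:
print(custom_distribute(12, 4))   # [6, 2, 2, 2]
# Stage 1 is told to run layers 3..5 but only holds layers 6..7, so the
# forward pass touches layers whose parameters this stage never materialized.
```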
This leads to the following error, presumably because the mismatched stage indices make a stage run layers it does not actually hold:
```
...
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pytorch_utils.py", line 107, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
TypeError: addmm(): argument 'input' (position 1) must be Tensor, not NoneType
```
Environment
torch 2.1.0+cu118