Bug fix for the "Link bit16 and fp32 parameters in partition" #5681
In the function `_link_all_hp_params`, the line `dp_world_size = dist.get_world_size(group=self.dp_process_group)` means that `dp_world_size` is always the global data-parallel world size. However, for MoE parameter groups, the line `partition_size = self.bit16_groups_flat[i].numel() // dp_world_size` yields an incorrect `partition_size` when `ep_size > 1` (that is, when expert parallelism is enabled). As a result, only some of the MoE parameters are linked correctly by `link_hp_params`, while the remaining parameters have `_hp_mapping` set to `None`. Those parameters are then left unmapped in `self._param_slice_mappings = self._create_param_mapping()`, which directly causes errors when saving the optimizer state file for MoE parameters.

To fix this bug, we need to use the correct `dp_world_size` for each parameter group:
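To make the effect of the wrong divisor concrete, here is a minimal, self-contained sketch in plain Python. It is not DeepSpeed code: the helper `linked_params` is hypothetical, the parameter sizes are made up, and it assumes that an expert parameter group is sharded across a data-parallel group of size `global_dp_world_size // ep_size`. It only models which parameters overlap a rank's partition window, which is roughly what decides whether a parameter receives a high-precision mapping.

```python
def linked_params(param_numels, partition_id, dp_world_size):
    """Return indices of parameters that overlap this rank's partition window.

    Hypothetical helper for illustration only; it mimics the window
    arithmetic (partition_size = numel // dp_world_size), not DeepSpeed's API.
    """
    total = sum(param_numels)
    partition_size = total // dp_world_size
    start = partition_id * partition_size
    end = start + partition_size

    linked = []
    offset = 0
    for idx, numel in enumerate(param_numels):
        # A parameter is linked if its [offset, offset + numel) range
        # intersects the partition window [start, end).
        if offset < end and offset + numel > start:
            linked.append(idx)
        offset += numel
    return linked


# Assumed setup: 8 data-parallel ranks overall, expert parallelism of 4,
# so each expert group is sharded across only 8 // 4 = 2 ranks.
global_dp_world_size = 8
ep_size = 4
expert_dp_world_size = global_dp_world_size // ep_size

# One flattened expert parameter group holding four parameters of 100 elements.
expert_params = [100, 100, 100, 100]

# Dividing by the global size gives a window of 50 elements: too small.
buggy = linked_params(expert_params, partition_id=0,
                      dp_world_size=global_dp_world_size)

# Dividing by the group's own size gives the full 200-element partition.
fixed = linked_params(expert_params, partition_id=0,
                      dp_world_size=expert_dp_world_size)

print(buggy)  # [0]
print(fixed)  # [0, 1]
```

In this toy setup rank 0 actually owns the first 200 elements (parameters 0 and 1), but the global divisor shrinks the window to 50 elements, so parameter 1 is never linked. This corresponds to the parameters described above whose `_hp_mapping` stays `None`.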