
Conformer/Transformer has same initial param value in each layer #216

@albertz

Description

Now that the lazy init was removed (see #212, #215), all params are always created directly. E.g. in nn.Linear, this logic:

self.weight = nn.Parameter((nn.dim_match_priority_when_needed(self.in_dim, self.out_dim), self.out_dim))
self.weight.initial = nn.init.Glorot()

Setting Parameter.initial to a ParamInit instance (such as Glorot) will directly call the ParamInit and then assign the resulting tensor:

  @initial.setter
  def initial(self, value: Optional[Union[nn.Tensor, RawTensorTypes, nn.init.ParamInit]]):
    if isinstance(value, nn.init.ParamInit):
      value = value(shape=self.shape_ordered, dtype=self.dtype)
    ...

Now in Conformer and Transformer (TransformerEncoder, TransformerDecoder), we use copy.deepcopy on the layers/blocks. This effectively copies the same Parameter.initial value into each layer, so all layers start from identical initial parameters. PyTorch actually has the same problem, as I described here: #109 (comment), pytorch/pytorch#86274
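
To make the failure mode concrete, here is a minimal standalone sketch (plain Python/numpy, not the returnn_common API; Param and Block are just illustrative stand-ins for nn.Parameter and a Conformer/Transformer block):

import copy
import numpy

class Param:
  def __init__(self, shape):
    # Stand-in for nn.Parameter: the initial value is materialized eagerly,
    # mirroring the eager ParamInit call in the initial setter above.
    self.initial = numpy.random.uniform(-0.1, 0.1, size=shape)

class Block:
  def __init__(self):
    self.weight = Param((4, 4))

block = Block()
layers = [copy.deepcopy(block) for _ in range(3)]  # as done for the Conformer/Transformer layers
# All copies carry the exact same initial values:
assert all((layer.weight.initial == block.weight.initial).all() for layer in layers)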

A potential solution is to not call the ParamInit directly in the initial setter but to delay it to some later point. Then a deepcopy would only copy the ParamInit object but not the tensor, and the ParamInit would get called independently for each Parameter copy, which would solve it. It's only a bit unclear when exactly it should be called. It could be in prepare_for_config_serialization, but I'm not sure; that might be an unexpected side effect of serializing the model.
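
For illustration, here is a rough sketch of the delayed-init idea, again standalone Python/numpy with hypothetical names (GlorotLike, materialize_initial) that are not the actual returnn_common API:

import copy
import numpy

class GlorotLike:
  """Stand-in for nn.init.Glorot / ParamInit: draws a fresh sample on every call."""
  def __call__(self, shape):
    limit = numpy.sqrt(6.0 / sum(shape))
    return numpy.random.uniform(-limit, limit, size=shape)

class Param:
  def __init__(self, shape):
    self.shape = shape
    self._initial = None  # either a concrete array or a ParamInit-like callable

  @property
  def initial(self):
    return self._initial

  @initial.setter
  def initial(self, value):
    self._initial = value  # keep the ParamInit object as-is, do not call it here

  def materialize_initial(self):
    # Called at some later point (e.g. in prepare_for_config_serialization):
    # only now the ParamInit is invoked, independently for each Parameter copy.
    if callable(self._initial):
      self._initial = self._initial(self.shape)

p = Param((4, 4))
p.initial = GlorotLike()
p_copy = copy.deepcopy(p)  # copies the ParamInit object, not a tensor
p.materialize_initial()
p_copy.materialize_initial()
print((p.initial == p_copy.initial).all())  # False: the two copies got independent samples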
