feat(transformers/models): add dinov3_vit and dinov3_convnext (v4.57.1) #1439
base: master
Conversation
Summary of Changes

Hello @zhangfeiran, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request significantly expands the supported models by adding the dinov3_vit and dinov3_convnext architectures.
Code Review
This PR adds the dinov3_vit and dinov3_convnext models. The changes include model implementations, configurations, and tests. The overall structure is good, but there are several critical issues in the implementation that need to be addressed: incorrect weight initialization syntax (using PyTorch's .data API, which is not supported in MindSpore), a bug in the image preprocessing logic, and use of PyTorch-specific .contiguous() calls. I've also pointed out some minor issues, such as wildcard imports. Please address these points to ensure the models work correctly.
    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, (mint.nn.LayerNorm, DINOv3ConvNextLayerNorm)):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        elif isinstance(module, DINOv3ConvNextLayer):
            if module.gamma is not None:
                module.gamma.data.fill_(self.config.layer_scale_init_value)
The weight initialization method _init_weights uses PyTorch-style in-place modification on .data, which is not supported for mindspore.Parameter. You should use helper functions like normal_, zeros_, and constant_ from mindone.models.utils to initialize the parameters correctly. Please also add from mindone.models.utils import constant_, normal_, zeros_ to the imports at the top of the file.
Suggested change:

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                zeros_(module.bias)
        elif isinstance(module, (mint.nn.LayerNorm, DINOv3ConvNextLayerNorm)):
            zeros_(module.bias)
            constant_(module.weight, 1.0)
        elif isinstance(module, DINOv3ConvNextLayer):
            if module.gamma is not None:
                constant_(module.gamma, self.config.layer_scale_init_value)
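For reference, here is the matching import line together with a tiny standalone check that these helpers update a mindspore.Parameter in place. This is a sketch only; it assumes the mindone.models.utils initializers follow the torch.nn.init-style in-place signatures used in the suggestion above.

    import numpy as np
    import mindspore as ms
    from mindone.models.utils import constant_, normal_, zeros_  # import requested in the comment above

    # Hypothetical standalone Parameters, only to illustrate the in-place updates
    w = ms.Parameter(ms.Tensor(np.random.randn(3, 3), ms.float32), name="demo_weight")
    b = ms.Parameter(ms.Tensor(np.ones(3), ms.float32), name="demo_bias")

    normal_(w, mean=0.0, std=0.02)   # draw new values in place
    constant_(w, 1.0)                # overwrite with a constant
    zeros_(b)                        # zero the bias

    print(w.asnumpy()[0, 0], b.asnumpy()[0])  # expected: 1.0 0.0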
    for i in range(len(stacked_images)):
        image = stacked_images[i]
        # TODO mindspore.dataset.vision.Resize could only support (H, W, 3) format,
        # batch_size stacked image should be computed in one iteration
        # batch_size, channels = stacked_images.shape[0], stacked_images.shape[1]
        # stacked_images_updated = mint.zeros((batch_size, channels, resized_height, resized_width), dtype=stacked_images.dtype)
        # TODO: current implementation of resize require input to be unscaled image, so the order is changed to:
        # resize -> rescale -> normalize, causing ~e-3 precision difference
        if do_resize:
            image = self.resize(
                image=image, size=size, interpolation=interpolation, antialias=True
            )
        if do_rescale:
            image = self.rescale(image, rescale_factor)
    stacked_images_updated.append(image)
There is a bug in the _preprocess method. The stacked_images_updated.append(image) call is outside the for loop, which means only the last image of each group is processed and kept. This should be inside the loop to process all images.
Suggested change:

    for i in range(len(stacked_images)):
        image = stacked_images[i]
        # TODO mindspore.dataset.vision.Resize could only support (H, W, 3) format,
        # batch_size stacked image should be computed in one iteration
        # batch_size, channels = stacked_images.shape[0], stacked_images.shape[1]
        # stacked_images_updated = mint.zeros((batch_size, channels, resized_height, resized_width), dtype=stacked_images.dtype)
        # TODO: current implementation of resize require input to be unscaled image, so the order is changed to:
        # resize -> rescale -> normalize, causing ~e-3 precision difference
        if do_resize:
            image = self.resize(
                image=image, size=size, interpolation=interpolation, antialias=True
            )
        if do_rescale:
            image = self.rescale(image, rescale_factor)
        stacked_images_updated.append(image)
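The effect of the misplaced append is easy to reproduce in isolation; here is a minimal plain-Python sketch with a stand-in transform instead of the real resize/rescale:

    images = ["img0", "img1", "img2"]

    # Buggy variant: the append sits outside the loop, so only the last image survives
    processed_wrong = []
    for image in images:
        image = image.upper()          # stand-in for resize/rescale
    processed_wrong.append(image)
    print(processed_wrong)             # ['IMG2']

    # Fixed variant: appending inside the loop keeps every image
    processed_right = []
    for image in images:
        image = image.upper()
        processed_right.append(image)
    print(processed_right)             # ['IMG0', 'IMG1', 'IMG2']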
    attn_weights = attn_weights * attention_mask

    attn_output = mint.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()
The .contiguous() method is a PyTorch-specific call and is not available for MindSpore tensors. It should be removed. The transpose operation in MindSpore returns a contiguous tensor by default in most cases.
Suggested change:

    attn_output = attn_output.transpose(1, 2)
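If it helps, this is easy to verify on the MindSpore build in use; a quick local check (a sketch for verification only, not part of the suggested change):

    # Check whether this MindSpore build exposes Tensor.contiguous at all
    from mindspore import mint

    x = mint.ones((2, 8, 16, 32))
    print(hasattr(x, "contiguous"))   # if False, calling .contiguous() would raise AttributeError
    y = x.reshape(2, 8, -1)           # reshape/transpose work without any explicit contiguity call
    print(y.shape)                    # (2, 8, 512)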
        **kwargs,
    )

    attn_output = attn_output.reshape(batch_size, patches, -1).contiguous()
The .contiguous() method is a PyTorch-specific call and is not available for MindSpore tensors. It should be removed. The reshape operation in MindSpore returns a contiguous tensor.
Suggested change:

    attn_output = attn_output.reshape(batch_size, patches, -1)
    def _init_weights(self, module) -> None:
        """Initialize the weights"""
        if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
            trunc_normal_(module.weight,mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, mint.nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        elif isinstance(module, DINOv3ViTEmbeddings):
            trunc_normal_(module.cls_token.data,mean=0.0, std=self.config.initializer_range)
            if module.config.num_register_tokens > 0:
                trunc_normal_(module.register_tokens,mean=0.0, std=self.config.initializer_range)
            module.mask_token.data.zero_()
        elif isinstance(module, DINOv3ViTLayerScale):
            module.lambda1.data.fill_(self.config.layerscale_value)
The weight initialization method _init_weights uses PyTorch-style in-place modification on .data, which is not supported for mindspore.Parameter. You should use helper functions like zeros_ and constant_ from mindone.models.utils to initialize the parameters correctly. Also, trunc_normal_ should be called on the Parameter object directly, not on its .data attribute. Please also add from mindone.models.utils import zeros_, constant_ to the imports at the top of the file.
Suggested change:

    def _init_weights(self, module) -> None:
        """Initialize the weights"""
        if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
            trunc_normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                zeros_(module.bias)
        elif isinstance(module, mint.nn.LayerNorm):
            zeros_(module.bias)
            constant_(module.weight, 1.0)
        elif isinstance(module, DINOv3ViTEmbeddings):
            trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range)
            if module.config.num_register_tokens > 0:
                trunc_normal_(module.register_tokens, mean=0.0, std=self.config.initializer_range)
            zeros_(module.mask_token)
        elif isinstance(module, DINOv3ViTLayerScale):
            constant_(module.lambda1, self.config.layerscale_value)
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    from .modeling_dinov3_convnext import *
Wildcard imports (from ... import *) are discouraged by PEP 8 as they make it unclear which names are present in the namespace. It's better to explicitly import the required names. Based on __all__ in modeling_dinov3_convnext.py, you should import DINOv3ConvNextModel and DINOv3ConvNextPreTrainedModel.
Suggested change:

    from .modeling_dinov3_convnext import DINOv3ConvNextModel, DINOv3ConvNextPreTrainedModel
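For context, the explicit form simply spells out what the module's __all__ already exports; an illustrative sketch (the __all__ contents here are taken from the review comment, not read from the file):

    # modeling_dinov3_convnext.py -- illustrative excerpt; names taken from the review comment
    __all__ = ["DINOv3ConvNextModel", "DINOv3ConvNextPreTrainedModel"]

    # __init__.py -- importing the names explicitly keeps them visible to readers, linters, and IDEs
    from .modeling_dinov3_convnext import DINOv3ConvNextModel, DINOv3ConvNextPreTrainedModel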
    # See the License for the specific language governing permissions and
    # limitations under the License.
    from .image_processing_dinov3_vit_fast import DINOv3ViTImageProcessorFast
    from .modeling_dinov3_vit import *
Wildcard imports (from ... import *) are discouraged by PEP 8 as they make it unclear which names are present in the namespace. It's better to explicitly import the required names. Based on __all__ in modeling_dinov3_vit.py, you should import DINOv3ViTModel and DINOv3ViTPreTrainedModel.
Suggested change:

    from .modeling_dinov3_vit import DINOv3ViTModel, DINOv3ViTPreTrainedModel
What does this PR do?
Fixes # (issue)
Adds # (feature)
Before submitting
What's New. Here are the documentation guidelines
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@xxx