
Conversation

@zhangfeiran

What does this PR do?

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type / MindSpore version) and performance in the doc? (Ideally record this for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@zhangfeiran requested a review from vigo999 as a code owner on November 25, 2025 at 10:05
@gemini-code-assist
Contributor

Summary of Changes

Hello @zhangfeiran, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the mindone.transformers library by introducing the DINOv3 family of vision models, specifically the ConvNext and Vision Transformer variants. This addition provides users with state-of-the-art self-supervised learning models for computer vision tasks, complete with their respective image processing capabilities, all adapted for the MindSpore framework. The integration into the auto-configuration system ensures seamless usability.

Highlights

  • New Model Integration: Added DINOv3ConvNext and DINOv3ViT models to the mindone.transformers library.
  • Image Processor: Introduced DINOv3ViTImageProcessorFast for efficient image preprocessing specific to the DINOv3 ViT model.
  • Auto-Configuration Support: Integrated the new models and their configurations into the mindone.transformers auto-configuration system, enabling easy loading and usage (see the usage sketch after this list).
  • MindSpore Adaptation: The new models and image processor are adapted from Hugging Face Transformers to run on the MindSpore framework.
  • Comprehensive Testing: Included dedicated unit tests for both DINOv3ConvNextModel and DINOv3ViTModel to ensure correctness and compatibility.
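
For reviewers who want a quick feel for the intended API, here is a minimal usage sketch based on the highlights above. It assumes the auto classes in mindone.transformers mirror the Hugging Face Transformers API; the checkpoint id is purely illustrative and not taken from this PR.

import numpy as np
import mindspore as ms
from mindone.transformers import AutoImageProcessor, AutoModel

# Illustrative checkpoint id (hypothetical here); substitute a real DINOv3 checkpoint.
ckpt = "facebook/dinov3-vits16-pretrain-lvd1689m"

processor = AutoImageProcessor.from_pretrained(ckpt)  # should resolve to DINOv3ViTImageProcessorFast
model = AutoModel.from_pretrained(ckpt)               # should resolve to DINOv3ViTModel

image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
inputs = processor(images=image, return_tensors="np")  # return-tensor handling may differ in the final API
outputs = model(pixel_values=ms.Tensor(inputs["pixel_values"]))
print(outputs[0].shape)  # roughly (batch, 1 cls token + register tokens + patches, hidden_size)
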
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot) left a comment:

Code Review

This PR adds dinov3_vit and dinov3_convnext models. The changes include model implementations, configurations, and tests. The overall structure is good, but there are several critical issues in the implementation that need to be addressed. These include incorrect weight initialization syntax (using PyTorch's .data API which is not supported in MindSpore), a bug in the image preprocessing logic, and usage of PyTorch-specific .contiguous() calls. I've also pointed out some minor issues like wildcard imports. Please address these points to ensure the models work correctly.

Comment on lines +196 to +209
def _init_weights(self, module):
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        # Slightly different from the TF version which uses truncated_normal for initialization
        # cf https://github.com/pytorch/pytorch/pull/5617
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, (mint.nn.LayerNorm, DINOv3ConvNextLayerNorm)):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
    elif isinstance(module, DINOv3ConvNextLayer):
        if module.gamma is not None:
            module.gamma.data.fill_(self.config.layer_scale_init_value)

critical

The weight initialization method _init_weights uses PyTorch-style in-place modification on .data, which is not supported for mindspore.Parameter. You should use helper functions like normal_, zeros_, and constant_ from mindone.models.utils to initialize the parameters correctly. Please also add from mindone.models.utils import constant_, normal_, zeros_ to the imports at the top of the file.

Suggested change

Original:

def _init_weights(self, module):
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        # Slightly different from the TF version which uses truncated_normal for initialization
        # cf https://github.com/pytorch/pytorch/pull/5617
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, (mint.nn.LayerNorm, DINOv3ConvNextLayerNorm)):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
    elif isinstance(module, DINOv3ConvNextLayer):
        if module.gamma is not None:
            module.gamma.data.fill_(self.config.layer_scale_init_value)

Suggested:

def _init_weights(self, module):
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        # Slightly different from the TF version which uses truncated_normal for initialization
        # cf https://github.com/pytorch/pytorch/pull/5617
        normal_(module.weight, mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            zeros_(module.bias)
    elif isinstance(module, (mint.nn.LayerNorm, DINOv3ConvNextLayerNorm)):
        zeros_(module.bias)
        constant_(module.weight, 1.0)
    elif isinstance(module, DINOv3ConvNextLayer):
        if module.gamma is not None:
            constant_(module.gamma, self.config.layer_scale_init_value)
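
For context on why such helpers are needed: below is a minimal sketch of how in-place initialization functions are typically written for mindspore.Parameter (the actual normal_/zeros_/constant_ in mindone.models.utils may differ; this only illustrates that parameters are updated via Parameter.set_data rather than by mutating a .data attribute).

import mindspore as ms
from mindspore.common.initializer import Constant, Normal, Zero, initializer

def normal_(param: ms.Parameter, mean: float = 0.0, std: float = 1.0) -> None:
    # Normal(sigma, mean) builds an init spec; set_data swaps in a tensor of the same shape/dtype.
    param.set_data(initializer(Normal(std, mean), param.shape, param.dtype))

def zeros_(param: ms.Parameter) -> None:
    param.set_data(initializer(Zero(), param.shape, param.dtype))

def constant_(param: ms.Parameter, value: float) -> None:
    param.set_data(initializer(Constant(value), param.shape, param.dtype))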

Comment on lines +65 to +79
for i in range(len(stacked_images)):
    image = stacked_images[i]
    # TODO mindspore.dataset.vision.Resize could only support (H, W, 3) format,
    # batch_size stacked image should be computed in one iteration
    # batch_size, channels = stacked_images.shape[0], stacked_images.shape[1]
    # stacked_images_updated = mint.zeros((batch_size, channels, resized_height, resized_width), dtype=stacked_images.dtype)
    # TODO: current implementation of resize require input to be unscaled image, so the order is changed to:
    # resize -> rescale -> normalize, causing ~e-3 precision difference
    if do_resize:
        image = self.resize(
            image=image, size=size, interpolation=interpolation, antialias=True
        )
    if do_rescale:
        image = self.rescale(image, rescale_factor)
stacked_images_updated.append(image)

critical

There is a bug in the _preprocess method. The stacked_images_updated.append(image) call is outside the for loop, which means only the last image of each group is processed and kept. This should be inside the loop to process all images.

Suggested change

Original:

for i in range(len(stacked_images)):
    image = stacked_images[i]
    # TODO mindspore.dataset.vision.Resize could only support (H, W, 3) format,
    # batch_size stacked image should be computed in one iteration
    # batch_size, channels = stacked_images.shape[0], stacked_images.shape[1]
    # stacked_images_updated = mint.zeros((batch_size, channels, resized_height, resized_width), dtype=stacked_images.dtype)
    # TODO: current implementation of resize require input to be unscaled image, so the order is changed to:
    # resize -> rescale -> normalize, causing ~e-3 precision difference
    if do_resize:
        image = self.resize(
            image=image, size=size, interpolation=interpolation, antialias=True
        )
    if do_rescale:
        image = self.rescale(image, rescale_factor)
stacked_images_updated.append(image)

Suggested (append moved inside the loop):

for i in range(len(stacked_images)):
    image = stacked_images[i]
    # TODO mindspore.dataset.vision.Resize could only support (H, W, 3) format,
    # batch_size stacked image should be computed in one iteration
    # batch_size, channels = stacked_images.shape[0], stacked_images.shape[1]
    # stacked_images_updated = mint.zeros((batch_size, channels, resized_height, resized_width), dtype=stacked_images.dtype)
    # TODO: current implementation of resize require input to be unscaled image, so the order is changed to:
    # resize -> rescale -> normalize, causing ~e-3 precision difference
    if do_resize:
        image = self.resize(
            image=image, size=size, interpolation=interpolation, antialias=True
        )
    if do_rescale:
        image = self.rescale(image, rescale_factor)
    stacked_images_updated.append(image)

attn_weights = attn_weights * attention_mask

attn_output = mint.matmul(attn_weights, value)
attn_output = attn_output.transpose(1, 2).contiguous()

critical

The .contiguous() method is a PyTorch-specific call and is not available for MindSpore tensors. It should be removed. The transpose operation in MindSpore returns a contiguous tensor by default in most cases.

Suggested change

Original:
attn_output = attn_output.transpose(1, 2).contiguous()

Suggested:
attn_output = attn_output.transpose(1, 2)

**kwargs,
)

attn_output = attn_output.reshape(batch_size, patches, -1).contiguous()

critical

The .contiguous() method is a PyTorch-specific call and is not available for MindSpore tensors. It should be removed. The reshape operation in MindSpore returns a contiguous tensor.

Suggested change

Original:
attn_output = attn_output.reshape(batch_size, patches, -1).contiguous()

Suggested:
attn_output = attn_output.reshape(batch_size, patches, -1)

Comment on lines +450 to +465
def _init_weights(self, module) -> None:
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        trunc_normal_(module.weight,mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, mint.nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
    elif isinstance(module, DINOv3ViTEmbeddings):
        trunc_normal_(module.cls_token.data,mean=0.0, std=self.config.initializer_range)
        if module.config.num_register_tokens > 0:
            trunc_normal_(module.register_tokens,mean=0.0, std=self.config.initializer_range)
        module.mask_token.data.zero_()
    elif isinstance(module, DINOv3ViTLayerScale):
        module.lambda1.data.fill_(self.config.layerscale_value)

critical

The weight initialization method _init_weights uses PyTorch-style in-place modification on .data, which is not supported for mindspore.Parameter. You should use helper functions like zeros_ and constant_ from mindone.models.utils to initialize the parameters correctly. Also, trunc_normal_ should be called on the Parameter object directly, not on its .data attribute. Please also add from mindone.models.utils import zeros_, constant_ to the imports at the top of the file.

Suggested change

Original:

def _init_weights(self, module) -> None:
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        trunc_normal_(module.weight,mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, mint.nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
    elif isinstance(module, DINOv3ViTEmbeddings):
        trunc_normal_(module.cls_token.data,mean=0.0, std=self.config.initializer_range)
        if module.config.num_register_tokens > 0:
            trunc_normal_(module.register_tokens,mean=0.0, std=self.config.initializer_range)
        module.mask_token.data.zero_()
    elif isinstance(module, DINOv3ViTLayerScale):
        module.lambda1.data.fill_(self.config.layerscale_value)

Suggested:

def _init_weights(self, module) -> None:
    """Initialize the weights"""
    if isinstance(module, (mint.nn.Linear, mint.nn.Conv2d)):
        trunc_normal_(module.weight, mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            zeros_(module.bias)
    elif isinstance(module, mint.nn.LayerNorm):
        zeros_(module.bias)
        constant_(module.weight, 1.0)
    elif isinstance(module, DINOv3ViTEmbeddings):
        trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range)
        if module.config.num_register_tokens > 0:
            trunc_normal_(module.register_tokens, mean=0.0, std=self.config.initializer_range)
        zeros_(module.mask_token)
    elif isinstance(module, DINOv3ViTLayerScale):
        constant_(module.lambda1, self.config.layerscale_value)

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .modeling_dinov3_convnext import *

medium

Wildcard imports (from ... import *) are discouraged by PEP 8 as they make it unclear which names are present in the namespace. It's better to explicitly import the required names. Based on __all__ in modeling_dinov3_convnext.py, you should import DINOv3ConvNextModel and DINOv3ConvNextPreTrainedModel.

Suggested change

Original:
from .modeling_dinov3_convnext import *

Suggested:
from .modeling_dinov3_convnext import DINOv3ConvNextModel, DINOv3ConvNextPreTrainedModel

# See the License for the specific language governing permissions and
# limitations under the License.
from .image_processing_dinov3_vit_fast import DINOv3ViTImageProcessorFast
from .modeling_dinov3_vit import *

medium

Wildcard imports (from ... import *) are discouraged by PEP 8 as they make it unclear which names are present in the namespace. It's better to explicitly import the required names. Based on __all__ in modeling_dinov3_vit.py, you should import DINOv3ViTModel and DINOv3ViTPreTrainedModel.

Suggested change

Original:
from .modeling_dinov3_vit import *

Suggested:
from .modeling_dinov3_vit import DINOv3ViTModel, DINOv3ViTPreTrainedModel
