Conversation

@konstantinos-p
Contributor

What does this PR do?

This PR introduces the DINO DETR (DETR with Improved deNoising anchOr boxes) model (https://arxiv.org/abs/2203.03605) to the Hugging Face Transformers library. DINO DETR is a state-of-the-art object detection model that builds upon the original DETR architecture, incorporating improvements such as:

  • Contrastive denoising training that enhances the object queries and stabilizes the bipartite matching.
  • Mixed query selection for better anchor initialization of the object queries.
  • A look forward twice (LFT) scheme that refines box predictions using information from later decoder layers.

The model achieves strong performance on COCO test-dev (https://paperswithcode.com/sota/object-detection-on-coco).

Fixes #36205

What's included

  • Implementation of DinoDetrModel and DinoDetrForObjectDetection
  • Implementation of DinoDetrImageProcessor
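
A minimal usage sketch, assuming the classes listed above and a standard DETR-style post-processing API (the checkpoint name below is a placeholder, not a real Hub repo):

import torch
from PIL import Image
from transformers import DinoDetrForObjectDetection, DinoDetrImageProcessor

# Placeholder checkpoint name; the final Hub repo may differ
processor = DinoDetrImageProcessor.from_pretrained("IDEA-Research/dino-detr-resnet-50")
model = DinoDetrForObjectDetection.from_pretrained("IDEA-Research/dino-detr-resnet-50")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Assumes the image processor exposes the usual DETR-style post-processing
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=[image.size[::-1]]
)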

Resources I've used

Who can review?

@amyeroberts, @qubvel

@konstantinos-p
Contributor Author

Hello @qubvel! The state of the PR is that I've gotten the forward pass to match the original implementation up to the required precision. I've marked this as a draft because I wanted to get a first opinion on whether I'm modifying the correct files in the codebase. Let me know if I'm missing anything big. Regarding tests, I've copied some from Deformable DETR but haven't tried getting them to work yet. Let me know if I need to add any beyond what's already there.

@qubvel
Contributor

qubvel commented Mar 14, 2025

Hi @konstantinos-p! Thanks a lot for working on the model, super excited to see it merged! 🚀

Before diving into the implementation details, here are some general comments we need to address in the PR:

  • Integration Tests
    Let's set up integration tests to ensure output consistency is maintained while refactoring the modeling code. Please use torch.testing.assert_close when comparing tensors (rather than self.assertTrue(torch.allclose(...))).
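    A minimal sketch of that style (the output slice and reference values below are placeholders, not real expected outputs):

      import torch

      # Hypothetical integration-test check; `outputs` would come from a forward pass on a fixed image
      expected_logits = torch.tensor([[-4.5, -4.2, -4.7]])
      torch.testing.assert_close(outputs.logits[0, :1, :3], expected_logits, rtol=1e-4, atol=1e-4)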

  • Modeling Code
    It would be super nice to use the modular approach! It's the way to use inheritance in transformers while keeping the one-model-one-file format; please see more information here: https://huggingface.co/docs/transformers/main/en/modular_transformers and in the examples/modular-transformers folder of the repo. Also, the Siglip2 and RT-DETRv2 models were added using modular.
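    As a rough illustration of the modular format (the parent classes below are assumptions; DINO DETR may need to inherit from Deformable DETR for some modules and define others from scratch):

      # modular_dino_detr.py (hypothetical sketch); the modular converter generates modeling_dino_detr.py from it
      from transformers.models.deformable_detr.configuration_deformable_detr import DeformableDetrConfig
      from transformers.models.deformable_detr.modeling_deformable_detr import DeformableDetrEncoderLayer


      class DinoDetrConfig(DeformableDetrConfig):
          model_type = "dino_detr"


      class DinoDetrEncoderLayer(DeformableDetrEncoderLayer):
          pass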

  • Consistent Naming for Classes & Modules

    • Model names and types: DinoDetr and dino_detr
    • All module names should be prefixed with the model name. For example, instead of ResidualBlock, use DinoDetrResidualBlock.
  • Code Paths & Cleanup

    • No asserts in the modeling code and a minimum of raise statements; config params should be validated in the config.
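      For example (a hypothetical sketch; the parameter name is made up for illustration):

      # In DinoDetrConfig.__init__, instead of an `assert` in the modeling code:
      if num_feature_levels <= 0:
          raise ValueError(f"`num_feature_levels` should be a positive integer, got {num_feature_levels}.")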

    • Remove unused conditional statements if no existing configuration requires them. Also, remove the unnecessary config parameters.

      # Instead of:
      if config.use_last_layer_norm:
          self.norm = nn.LayerNorm(config.hidden_size)
      else:
          self.norm = nn.Identity()
      # If `use_last_layer_norm = True` in all configs, simplify it to:
      self.norm = nn.LayerNorm(config.hidden_size)
  • Code Style

    • Use clear, descriptive variable names: avoid single-letter (x) or placeholder (tmp) variables.
    • Add comments for non-obvious code.
    • Use type hints for modules (e.g. in __init__ and forward).
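      For instance (a small sketch; the module name and config attributes are made up):

      import torch
      from torch import nn


      class DinoDetrMLP(nn.Module):
          def __init__(self, config: "DinoDetrConfig") -> None:
              super().__init__()
              self.fc1 = nn.Linear(config.d_model, config.encoder_ffn_dim)
              self.fc2 = nn.Linear(config.encoder_ffn_dim, config.d_model)

          def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
              hidden_states = torch.relu(self.fc1(hidden_states))
              return self.fc2(hidden_states)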
  • Conversion Script Standard
    Please follow the mllama model format for the conversion script. Define a single key mapping using regex (see rt_detr_v2, ijepa, mllama, superglue models for reference). This is our new standard for all newly added models.
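    A rough sketch of that pattern (the key patterns below are invented for illustration, not the actual DINO DETR checkpoint keys):

      import re

      # Hypothetical mapping from original checkpoint key patterns to Transformers key patterns
      ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
          r"transformer\.encoder\.layers\.(\d+)\.self_attn": r"encoder.layers.\1.self_attn",
          r"backbone\.0\.body": r"backbone.conv_encoder.model",
      }


      def convert_old_key(old_key: str) -> str:
          # Apply each regex substitution; keys that match no pattern are returned unchanged
          new_key = old_key
          for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
              new_key = re.sub(pattern, replacement, new_key)
          return new_key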

Thanks again! Looking forward to the updates 🚀

@konstantinos-p
Contributor Author

Thanks for the comments! I'll start addressing them!

@konstantinos-p
Contributor Author

konstantinos-p commented Apr 10, 2025

Hello @qubvel, I'm facing a similar issue to this one. Basically, I'm rewriting the model to use modular, but instead of DinoDetr, Dino gets captured by the compiler as the model name. Is there a fix for this? Alternatively, I could apply modular only to the processor. @Cyrilvallez was mentioned as possibly having faced something similar.

@Cyrilvallez
Member

Yes, see my answer here #36895 (comment)!

@konstantinos-p
Contributor Author

konstantinos-p commented May 11, 2025

@Cyrilvallez thanks for the fix, it seems to work for me!

@konstantinos-p konstantinos-p marked this pull request as ready for review May 11, 2025 17:16
@konstantinos-p
Contributor Author

konstantinos-p commented May 11, 2025

Hello @qubvel! I've addressed your initial comments and added tests that pass, plus a first draft of the docs, etc. I think it would be a good time to have a second look! The CircleCI tests were passing as well, but they are now failing after I rebased on recent changes. I'll need some help with these as well!

@ArthurZucker ArthurZucker requested review from qubvel and removed request for ArthurZucker and Rocketknight1 July 7, 2025 12:18
@konstantinos-p
Contributor Author

Hello @qubvel! It would be great if we could make a final push and merge this! I'd rather work on it while the key points are still fresh.

DINO DETR is still very competitive and an important model in the DETR line, on top of which one could build new architectures.

@Rocketknight1
Member

cc @NielsRogge @yonigozlan @molbap

@molbap molbap requested review from molbap and removed request for qubvel October 7, 2025 08:25
@molbap
Contributor

molbap commented Oct 9, 2025

Hey @konstantinos-p, I will be taking a look at your PR now! First of all, I'm seeing a few conflicts with main, mainly due to #40760; could you run

git checkout main
git pull
git checkout <your-branch>
git merge main
pip install -e .[quality]

and for the code_quality check, run make fixup once that's done.
I will do a pass on your PR now, and another one once the conflicts are solved and the branch is up to date. Let me know if you encounter issues!

Contributor

@molbap molbap left a comment

Did an initial review and tried to leave pointers for modular; when it is not possible to inherit from another model, let's try to stay aligned with existing methods! Feel free to ping me when you've had time to update the branch and address the initial comments 🤗

@molbap
Contributor

molbap commented Oct 13, 2025

@konstantinos-p can you merge main into your branch by the way please? 🤗 The base version is from May and this particular one was yanked 😅

@konstantinos-p
Contributor Author

Hello @molbap! Sounds good! I've been doing fetch/rebase on main (I'm not really familiar with the finer differences between rebase and merge). What is the difference compared to doing a merge? (Also, could you elaborate a bit on the yanked part? :) ) I've addressed the comments as far as I can tell and also did another pass over the modular/modeling code to add some comments and clean-up. I guess you can have a second look!

@molbap
Contributor

molbap commented Oct 15, 2025

Hey @konstantinos-p, are you sure you've rebased on origin/main? Merge or rebase makes no difference here as we'll squash in the end, but the transformers version should be v5-dev, not 4.52 (and that one in particular was yanked, i.e. removed from PyPI because it had several critical issues)

taking another look now :)

@konstantinos-p
Contributor Author

konstantinos-p commented Oct 15, 2025

I'll have a look at the rebase! By the way, I tried using the SinePositionEmbedding from Deformable DETR, but unfortunately it doesn't exactly match as far as I can tell, although I think I can use LearnedPositionEmbedding as-is.

I've also removed a lot of the original config parameters. I did this by looking at the original repo and checking whether they change in any of the configs for other variants of the model (I also used my own judgement). I think it might be possible to also remove two_stage_type and simplify things further; then the only implementation would be the standard one, which is the default.

@konstantinos-p
Contributor Author

konstantinos-p commented Oct 20, 2025

Regarding the rebase: I see for example that my commit history on this branch includes

72a3fc2

from two weeks ago, because I rebased then. At that time I ran

git fetch upstream
git rebase upstream/main

I also see that test_modeling_common.py keeps having conflicts even though I ran make style; I'll try to fix that (it should be fixed by a rebase or by manually adding the longer lines).

@molbap molbap self-requested a review October 21, 2025 09:38
@molbap
Contributor

molbap commented Oct 21, 2025

To fix conflicts, I usually just do the following (after syncing my fork):

git checkout main
git pull 
git checkout my_branch_to_update
git merge main

then fix the conflicts that appear (test_modeling_common should pop up), among a few others.

I did that with your branch and got 3 conflicting files, mostly linked to our removal of support for TensorFlow and TorchScript, plus some changes to the Deformable DETR tests.
[screenshot: the three conflicting files]
You can just accept all incoming changes.

Once this is done, I'd rather suggest you use fixup; once back on your branch:

pip install -e .[quality]
make fixup

That should help clear out some of the code-quality issues. The linter has been updated, a few things have changed regarding typing (list, not List), etc., so doing that should clean up the code and the CI a bit 🤗

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, dino_detr

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=36711&sha=5a3fec
