
Conversation

@qubvel
Contributor

@qubvel qubvel commented Mar 3, 2025

What does this PR do?

Improves code readability for object-detection models; should be merged after

Note:

RT-DETR and RT-DETR-v2 are refactored; for the other models, only the frozen BatchNorm2d module is updated.

RT-DETR refactoring:

  • Make the code closer to transformers standards (no one-letter variables, common variable names, ...)
  • Remove some unused outputs (they don't seem necessary, but they add overhead to the implementation and docs)
  • Remove unused arguments from signatures (where possible)
  • Better comments and shape comments
  • Add default docstring for object-detection
  • Unify RT-DETR and RT-DETRv2 implementations (simpler modular file for v2)

In addition:

  • Fix CoreML export for RT-DETR (avoid 6D tensors in deformable_attention; see the sketch below)

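For context, a rough sketch of the reshaping idea behind the CoreML fix (illustrative shapes and a plain nn.Linear, not the actual PR code): keep the levels and points axes merged so that no intermediate tensor in the deformable attention exceeds 5 dimensions.

import torch
import torch.nn as nn

# Illustrative shapes only; the actual modeling code may differ.
batch_size, num_queries, hidden_dim = 2, 300, 256
num_heads, num_levels, num_points = 8, 4, 4

sampling_offsets = nn.Linear(hidden_dim, num_heads * num_levels * num_points * 2)
hidden_states = torch.rand(batch_size, num_queries, hidden_dim)

# (batch, queries, heads, levels * points, 2): 5D, instead of the usual
# (batch, queries, heads, levels, points, 2) 6D layout
offsets = sampling_offsets(hidden_states).view(
    batch_size, num_queries, num_heads, num_levels * num_points, 2
)

# per-level chunks can still be recovered along the merged axis
per_level_offsets = offsets.split(num_points, dim=3)
assert len(per_level_offsets) == num_levels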
Mostly, the changes are backward compatible, but some internal modules' signatures have changed. I suspect RT-DETR internal modules are used on their own somewhere outside the transformers repository. A quick search on GitHub shows only transformers forks, and I have never seen custom code samples on HF Hub for object detection. Therefore, I think we can change them to establish better standards for upcoming models such as D-Fine (a state-of-the-art real-time object detector) and RT-DETRv3, both of which are based on RT-DETR.

Test changes

I tried to keep tests untouched, but there are a few modifications:

  • the number of outputs in ModelOutput is reduced

Who can review?

cc @SangbumChoi @jadechoghari as model contributors, in case you want to have a look. Please let me know if anything is missing / incorrectly re-implemented

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel qubvel added the Vision label Mar 3, 2025
Contributor

@SangbumChoi SangbumChoi left a comment

Good :) (In general, I think the PR is still in draft.)

backbone = load_backbone(config)

# replace batch norm by frozen batch norm
with torch.no_grad():
Contributor

Does this replace_batch_norm already have no_grad inside its definition?

Contributor Author

No, we are just copying tensor data, so there is no need for no_grad. I tested with no_grad removed, and replace_batch_norm works fine.
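For illustration, a simplified sketch of such a replacement helper (names and details are illustrative; the real helper in the modeling file differs): copying into .data only moves values and does not build an autograd graph, so a no_grad context adds nothing.

import torch
from torch import nn

class FrozenBatchNorm2d(nn.Module):
    # Simplified stand-in: stats and affine parameters are plain buffers, never updated.
    def __init__(self, num_features):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        scale = self.weight * (self.running_var + 1e-5).rsqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale[None, :, None, None] + shift[None, :, None, None]

def replace_batch_norm(model):
    # .data.copy_ only copies tensor values and does not track gradients,
    # so no torch.no_grad() context is needed here.
    for name, module in model.named_children():
        if isinstance(module, nn.BatchNorm2d):
            frozen = FrozenBatchNorm2d(module.num_features)
            frozen.weight.data.copy_(module.weight.data)
            frozen.bias.data.copy_(module.bias.data)
            frozen.running_mean.data.copy_(module.running_mean)
            frozen.running_var.data.copy_(module.running_var)
            setattr(model, name, frozen)
        else:
            replace_batch_norm(module)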

@@ -326,141 +224,411 @@ class RTDetrObjectDetectionOutput(ModelOutput):
pred_boxes: torch.FloatTensor = None
auxiliary_outputs: Optional[List[Dict]] = None
last_hidden_state: torch.FloatTensor = None
intermediate_hidden_states: torch.FloatTensor = None
Contributor

I also agree with removing these variables. Just for the record, I thought this was legacy code, since the other detection models all share this style. The HF team might want to refactor this across implementations, since it is not used for inference. However, some of the transformer-based models might require it for fine-tuning, e.g. for auxiliary or intermediate losses.

Contributor Author

@qubvel qubvel Mar 13, 2025

Thanks, I will double-check, but it seems redundant. As far as I recall, it is similar to the decoder hidden states, shifted by one because it is taken after each layer is applied.
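A toy loop (not the RT-DETR decoder) showing the shift-by-one relationship between the states collected before and after each layer:

import torch
from torch import nn

decoder_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])
hidden_states = torch.rand(1, 16)

all_hidden_states, intermediate = [], []
for layer in decoder_layers:
    all_hidden_states.append(hidden_states)  # state *before* the layer
    hidden_states = layer(hidden_states)
    intermediate.append(hidden_states)       # state *after* the layer
all_hidden_states.append(hidden_states)

# intermediate[i] matches all_hidden_states[i + 1] for every layer i
for i in range(len(decoder_layers)):
    assert torch.equal(intermediate[i], all_hidden_states[i + 1])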

# add position embeddings to the hidden states before projecting to queries and keys
if position_embeddings is not None:
hidden_states = self.with_pos_embed(hidden_states, position_embeddings)
params = {"kernel_size": 1, "stride": 1, "activation": config.activation_function}
Contributor

Is there a reason for writing params as a dictionary rather than just writing all the arguments explicitly in the corresponding function call?

Contributor Author

I agree it might not be the best approach, I did it just for code readability:

  • No line breaks for module definitions (in my opinion, it's a bit easier to read)
  • It also shows that parameters are identical across modules

However, it's very subjective.
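To make the comparison concrete, an illustrative example with plain nn.Conv2d (the actual RT-DETR layers take more arguments): both spellings build identical modules; the dict just keeps the shared arguments in one place and the constructor calls on one line.

from torch import nn

# shared arguments collected once
params = {"kernel_size": 1, "stride": 1}
conv1 = nn.Conv2d(256, 256, **params)
conv2 = nn.Conv2d(256, 512, **params)

# equivalent explicit form
conv1_explicit = nn.Conv2d(256, 256, kernel_size=1, stride=1)
conv2_explicit = nn.Conv2d(256, 512, kernel_size=1, stride=1)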

num_heads=config.decoder_attention_heads,
dropout=config.attention_dropout,
embed_dim=config.encoder_hidden_dim,
num_heads=config.num_attention_heads,
Contributor

Are you sure about this change? Even though the default config values are the same, there are two different configuration options for num_heads, one for the decoder layer and one for the encoder layer.

The same comment applies below: attention_dropout and dropout are also quite different.

Contributor Author

Nice catch! That indeed should be reverted, thanks a lot.

Contributor Author

Ah, that's actually a git diff issue! These are two different modules, DecoderLayer (red) and EncoderLayer (green), so there were no changes actually.

level_anchors = torch.concat([grid_xy, grid_wh], dim=-1).reshape(height * width, 4)
anchors.append(level_anchors)
anchors = torch.concat(anchors).unsqueeze(0)

# define the valid range for anchor coordinates
eps = 1e-2
Contributor

maybe define this value as an additional argument?
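For reference, a sketch of what exposing eps as an argument could look like, based on a simplified version of the anchor generation above (defaults and shapes are illustrative, not the PR's actual values):

import torch

def generate_anchors(spatial_shapes, grid_size=0.05, eps=1e-2, dtype=torch.float32, device=None):
    anchors = []
    for level, (height, width) in enumerate(spatial_shapes):
        grid_y, grid_x = torch.meshgrid(
            torch.arange(height, dtype=dtype, device=device),
            torch.arange(width, dtype=dtype, device=device),
            indexing="ij",
        )
        # normalized box centers and per-level box sizes
        grid_xy = torch.stack([grid_x, grid_y], dim=-1)
        grid_xy = (grid_xy + 0.5) / torch.tensor([width, height], dtype=dtype, device=device)
        grid_wh = torch.ones_like(grid_xy) * grid_size * (2.0 ** level)
        level_anchors = torch.concat([grid_xy, grid_wh], dim=-1).reshape(height * width, 4)
        anchors.append(level_anchors)
    anchors = torch.concat(anchors).unsqueeze(0)

    # the valid range for anchor coordinates is now controlled by the caller
    valid_mask = ((anchors > eps) & (anchors < 1 - eps)).all(dim=-1, keepdim=True)
    return anchors, valid_mask

A call like generate_anchors([(80, 80), (40, 40), (20, 20)], eps=1e-2) would then make the clamping range explicit at the call site.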
