[Training is in progress] [Feature] Support RT-DETR #10498
base: dev-3.x
Current Status
Training performance is not reproduced yet.
With the current config, I got the result below. The performance fluctuates after 18 epochs.
I thought the reason was a difference in the training pipeline, so I tried the following:
train_pipeline = [
dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='PhotoMetricDistortion'),
dict(
type='Expand',
mean=[123.675, 116.28, 103.53],
to_rgb=True,
ratio_range=(1, 4)),
dict(type='RandomCrop', crop_size=(0.3, 1.0), crop_type='relative_range'),
dict(type='RandomFlip', prob=0.5),
dict(type='RandomChoiceResize',
scales=[(480, 480), (512, 512), (544, 544), (576, 576),
(608, 608), (640, 640), (640, 640), (640, 640),
(672, 672), (704, 704), (736, 736), (768, 768),
(800, 800)],
keep_ratio=False),
dict(type='PackDetInputs')
]
However, it shows slower convergence and a poorer result.
I'm still trying to figure this out. |
You can refer to the rt-detr reproduced in yolov8, maybe you will find something new. |
Unfortunately, it seems that the maintainers of yolov8 did not reproduce the performance reported in the paper.
from ultralytics import RTDETR
model = RTDETR()
model.info() # display model information
model.train(data="coco.yaml") # train
model.predict("path/to/image.jpg") # predict log
Now I'm going to compare it with the one in ppdet, module by module. |
@hhaAndroid I'm back from holidays. Sorry for the slow progress. It seems that the repository only provides the inference code.
The training loss seems quite different between the ported one and the original one, especially in loss_class.
|
The ppdet log above has an issue: PaddlePaddle/PaddleDetection#8409
|
@nijkah So does this mean there is an issue with the official code? Or is it that the official code training is fine, but there are issues with reproducing it in mmdetection? Converting from Paddle to PyTorch is difficult, so if it's too challenging, perhaps we can wait for the official release of the PyTorch code or try training rtdetr in yolov8 to see if we can reproduce it. |
@nijkah I think it's worth spending some more time going through the model section. Because I just found out that the
else:
    target = output_memory.gather(dim=1, \
        index=topk_ind.unsqueeze(-1).repeat(1, 1, output_memory.shape[-1]))
if denoising_class is not None:
    target = torch.concat([denoising_class, target], 1)
return target.detach(), reference_points_unact.detach(), enc_topk_bboxes, enc_topk_logits
Apart from the model section, I also noticed a few differences: (1) SyncBN, and (2) the decay parameter. However, they don't seem to have a significant impact.
optimizer:
  type: AdamW
  params:
    -
      params: '^(?=.*backbone)(?=.*norm).*$'
      lr: 0.00001
      weight_decay: 0.  # backbone.norm
    -
      params: '^(?=.*backbone)(?!.*norm).*$'
      lr: 0.00001  # backbone parameters other than norm
    -
      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bias)).*$'
      weight_decay: 0.  # norm and bias of the remaining layers
  lr: 0.0001
  betas: [0.9, 0.999]
  weight_decay: 0.0001
The data augmentation section does have some differences, so if we want to check whether the data augmentation is really what causes the gap, we can take the same weights and samples and run 100 iterations to see whether the loss difference is consistent. If it's confirmed that data augmentation is indeed the cause, then it might be worth trying to replace it with data augmentation from torchvision. There are still quite a few uncertainties at the moment, so we need to investigate them one by one. |
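To make the per-group settings in the optimizer config quoted above easier to compare with mmdet's `paramwise_cfg`, here is a minimal sketch that resolves the effective `lr` and `weight_decay` for a parameter name. It only uses Python's `re` module. The "first matching rule wins" behaviour and the parameter names in the loop are assumptions made for illustration, not taken from the official code.

```python
import re

# Rules transcribed from the YAML above. Assumption: first matching pattern wins.
RULES = [
    (r'^(?=.*backbone)(?=.*norm).*$', {'lr': 1e-5, 'weight_decay': 0.0}),
    (r'^(?=.*backbone)(?!.*norm).*$', {'lr': 1e-5}),
    (r'^(?=.*(?:encoder|decoder))(?=.*(?:norm|bias)).*$', {'weight_decay': 0.0}),
]
DEFAULTS = {'lr': 1e-4, 'weight_decay': 1e-4}


def resolve_hyperparams(param_name, rules=RULES, defaults=DEFAULTS):
    """Return the effective lr and weight_decay for one parameter name."""
    for pattern, overrides in rules:
        if re.match(pattern, param_name):
            return {**defaults, **overrides}
    return dict(defaults)


# Hypothetical parameter names, for illustration only.
for name in ['backbone.res2.0.norm.weight',
             'backbone.conv1.weight',
             'encoder.layers.0.norm1.weight',
             'decoder.layers.0.linear1.weight']:
    print(name, resolve_hyperparams(name))
```

Note that the patterns key on the substring `norm`, so the outcome depends on how the normalization layers are actually named in each implementation.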
@nijkah The author of rt-detr made some modifications in the init query section, taking references from both deformable detr and dino. I'm not sure if it was intentionally set up that way. |
r50vd, 4 bs X 4 gpu
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.516
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.707
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.557
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.334
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.563
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.684
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.705
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.707
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.707
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.747
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.865
08/25 18:16:31 - mmengine - INFO - bbox_mAP_copypaste: 0.516 0.707 0.557 0.334 0.563 0.684
08/25 18:16:34 - mmengine - INFO - Epoch(val) [72][1250/1250] coco/bbox_mAP: 0.5160 coco/bbox_mAP_50: 0.7070 coco/bbox_mAP_75: 0.5570 coco/bbox_mAP_s: 0.3340 coco/bbox_mAP_m: 0.5630 coco/bbox_mAP_l: 0.6840 data_time: 0.0013 time: 0.0266 |
@nijkah Hi. Shouldn't we focus on migrating the data augmentation pipeline and confirm if it's the cause of the issue? This part has a significant impact. |
@hhaAndroid Okay. Actually, I couldn't investigate the model section part yet. I'll follow these steps.
|
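One way to run the 100-iteration check proposed earlier in this thread (same weights, same samples, compare losses) is to dump the per-iteration loss from both implementations and look for the first iteration where they disagree. The sketch below assumes the losses have already been collected into two Python lists; the helper name, the tolerance, and the numbers are made up for illustration and do not come from any real run.

```python
def first_divergence(losses_a, losses_b, rel_tol=1e-3):
    """Return the first iteration index where the two loss curves differ
    by more than rel_tol (relative), or None if they agree throughout."""
    for i, (a, b) in enumerate(zip(losses_a, losses_b)):
        scale = max(abs(a), abs(b), 1e-12)  # avoid division by zero
        if abs(a - b) / scale > rel_tol:
            return i
    return None


# Made-up numbers, not taken from any real run.
ported = [2.310, 2.105, 1.982, 1.870]
reference = [2.310, 2.105, 1.951, 1.802]
print(first_divergence(ported, reference))
```

If the curves already disagree at iteration 0 with identical weights and inputs, the forward pass or the loss differs. If they only drift apart later, the optimizer settings or the augmentation are more likely candidates.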
Hi, my experiment is done. I also faced a metrics drop. Interestingly, I can't reproduce the authors' result of 46.5 that was claimed in the PaddleDetection repo. My augmentation is a little bit different, though not the same as Paddle's, and I got 44.6 instead of 44.1 like @nijkah. |
@nijkah @rydenisbak I trained the official code r18vd using 4x3090, and the performance of the best model is 46.0.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.460
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.627
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.498
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.282
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.494
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.622
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.616
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.686
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.492
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.732
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.862
best_stat: {'epoch': 69, 'coco_eval_bbox': 0.45973773294988696} rtdetr_r18vd_dlc1m8i5txcccsq8-master-0_2023-08-30 10_21_35.txt |
r18vd, 32 bs X 4 gpu
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.456
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.641
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.495
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.299
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.490
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.606
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.669
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.672
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.672
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.501
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.710
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.835
09/12 18:00:05 - mmengine - INFO - bbox_mAP_copypaste: 0.456 0.641 0.495 0.299 0.490 0.606
09/12 18:00:08 - mmengine - INFO - Epoch(val) [72][40/40] coco/bbox_mAP: 0.4560 coco/bbox_mAP_50: 0.6410 coco/bbox_mAP_75: 0.4950 coco/bbox_mAP_s: 0.2990 coco/bbox_mAP_m: 0.4900 coco/bbox_mAP_l: 0.6060 |
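Since several runs in this thread are compared by their final validation line, a small parser can make the comparison less error-prone. This is a minimal sketch that assumes the log line has the `coco/bbox_mAP...: value` layout shown above; the function name is hypothetical.

```python
import re

# Matches e.g. "coco/bbox_mAP: 0.4560" and "coco/bbox_mAP_50: 0.6410".
METRIC_RE = re.compile(r'coco/(bbox_mAP(?:_\w+)?): ([0-9.]+)')


def parse_val_line(line):
    """Extract coco/bbox_mAP* metrics from one validation log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


line = ('09/12 18:00:08 - mmengine - INFO - Epoch(val) [72][40/40] '
        'coco/bbox_mAP: 0.4560 coco/bbox_mAP_50: 0.6410 '
        'coco/bbox_mAP_75: 0.4950 coco/bbox_mAP_s: 0.2990 '
        'coco/bbox_mAP_m: 0.4900 coco/bbox_mAP_l: 0.6060')
metrics = parse_val_line(line)
print(metrics['bbox_mAP'], metrics['bbox_mAP_50'])
```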
valid_idx = labels < self.cls_out_channels
# assign iou score to the corresponding label
cls_iou_targets[valid_idx,
                labels[valid_idx]] = iou_score[valid_idx]
Suggestion for AMP training:
iou_score[valid_idx] -> iou_score[valid_idx].to(cls_iou_targets.dtype)
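The cast matters because PyTorch's indexed assignment expects the source and destination tensors to have the same dtype. The toy example below is only a sketch: it assumes that under AMP `cls_iou_targets` ends up in half precision while `iou_score` stays in float32, which is what the suggestion implies but is not shown in the diff. The shapes and values are made up.

```python
import torch

num_classes = 3  # stands in for self.cls_out_channels
labels = torch.tensor([0, 2, 3, 1])             # 3 == background in this toy setup
iou_score = torch.tensor([0.9, 0.5, 0.7, 0.3])  # float32
# Assumption: the target tensor is half precision, as it might be under AMP.
cls_iou_targets = torch.zeros(4, num_classes, dtype=torch.float16)

valid_idx = labels < num_classes
# Cast the source to the destination dtype before the indexed assignment.
cls_iou_targets[valid_idx, labels[valid_idx]] = \
    iou_score[valid_idx].to(cls_iou_targets.dtype)
print(cls_iou_targets)
```

Casting to `cls_iou_targets.dtype` instead of a hard-coded dtype keeps the line correct whether or not AMP is enabled.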
r18, 4x3090, batch_size 16. The data augmentation section and the decay parameter do have some differences. |
Can you provide your configuration file? Thank you. |
The backbone is the official RT-DETR backbone, because of the BN layer names in mmdet's ResNet deep_stem. |
I have followed your advice for training, but have not made any changes to the source code.
paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)},
|
Can you provide your version of the source code? Thank you. I can't get the same results with batch size 16. |
        0, W - 1, W, dtype=torch.float32, device=device))
grid = torch.cat([grid_x.unsqueeze(-1), grid_y.unsqueeze(-1)], -1)

valid_wh = torch.tensor([H, W], dtype=torch.float32, device=device)
We need to change this line to
valid_wh = torch.tensor([W, H], dtype=torch.float32, device=device)
otherwise I get a problem with non-square images.
This bug is also in the original code, so I opened an issue about it:
PaddlePaddle/PaddleDetection#8680
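A small pure-Python sketch of why the order matters. The grid stores `(x, y)` pairs, so the divisor has to be `(W, H)`. The normalization step `(grid + 0.5) / valid_wh` is not part of the diff shown above and is assumed here; the helper name is hypothetical.

```python
def normalized_centers(H, W, divisor):
    """Return (x, y) cell centers of an H x W grid, divided element-wise
    by `divisor` (assumed normalization: (grid + 0.5) / valid_wh)."""
    dx, dy = divisor
    return [((x + 0.5) / dx, (y + 0.5) / dy)
            for y in range(H) for x in range(W)]


H, W = 2, 4  # a non-square feature map
wrong = normalized_centers(H, W, divisor=(H, W))  # x is divided by H
right = normalized_centers(H, W, divisor=(W, H))  # x is divided by W

print(max(x for x, _ in wrong))  # exceeds 1.0 for this non-square grid
print(max(x for x, _ in right))  # stays inside (0, 1)
```

For square inputs such as 640 x 640 both orders give the same result, which may be why the issue only shows up with non-square images.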
My code and weights are based on https://github.com/nijkah/mmdetection/blob/rtdetr/ and #10498 (comment).
_base_ = ['../_base_/datasets/coco_detection.py', '../_base_/default_runtime.py']
max_epochs = 72
train_batch_size_per_gpu = 32
train_num_workers = 8
persistent_workers = True
eval_size = (640, 640)
norm_cfg = dict(type='SyncBN', requires_grad=True)
pretrained = './pretrained/resnet18vd_pretrained.pth' # noqa
model = dict(
type='RTDETR',
num_queries=300, # num_matching_queries
with_box_refine=True,
as_two_stage=True,
eval_size=eval_size,
data_preprocessor=dict(
type='DetDataPreprocessor',
mean=[0, 0, 0],
std=[255, 255, 255],
bgr_to_rgb=True,
pad_size_divisor=32,
batch_augments=[dict(type='BatchSyncRandomResize', random_size_range=(480, 800))]),
backbone=dict(type='ResNetV1d',
depth=18,
num_stages=4,
out_indices=(1, 2, 3),
frozen_stages=-1,
norm_cfg=norm_cfg,
norm_eval=False,
style='pytorch',
init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
neck=dict(
type='HybridEncoder',
num_encoder_layers=1,
in_channels=[128, 256, 512],
use_encoder_idx=[2],
expansion=0.5,
norm_cfg=norm_cfg,
layer_cfg=dict(
self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0), # 0.1 for DeformDETR
ffn_cfg=dict(
embed_dims=256,
feedforward_channels=1024, # 1024 for DeformDETR
ffn_drop=0.0,
act_cfg=dict(type='GELU'))),
projector=dict(type='ChannelMapper',
in_channels=[256, 256, 256],
kernel_size=1,
out_channels=256,
act_cfg=None,
norm_cfg=norm_cfg,
num_outs=3)), # 0.1 for DeformDETR
encoder=None,
decoder=dict(
num_layers=3,
eval_idx=-1,
layer_cfg=dict(
self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0), # 0.1 for DeformDETR
cross_attn_cfg=dict(
embed_dims=256,
num_levels=3, # 4 for DeformDETR
dropout=0.0), # 0.1 for DeformDETR
ffn_cfg=dict(
embed_dims=256,
feedforward_channels=1024, # 2048 for DINO
ffn_drop=0.0)), # 0.1 for DeformDETR
post_norm_cfg=None),
positional_encoding=dict(
num_feats=128,
normalize=True,
offset=0.0, # -0.5 for DeformDETR
temperature=20), # 10000 for DeformDETR
bbox_head=dict(
type='RTDETRHead',
num_classes=80,
sync_cls_avg_factor=True,
loss_cls=dict(
type='VarifocalLoss',
use_sigmoid=True,
use_rtdetr=True,
gamma=2.0,
alpha=0.75, # 0.25 in DINO
loss_weight=1.0), # 2.0 in DeformDETR
loss_bbox=dict(type='L1Loss', loss_weight=5.0),
loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
dn_cfg=dict(
label_noise_scale=0.5,
box_noise_scale=1.0, # 0.4 for DN-DETR
group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=100)),
# training and testing settings
train_cfg=dict(assigner=dict(type='HungarianAssigner',
match_costs=[
dict(type='FocalLossCost', weight=2.0),
dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
dict(type='IoUCost', iou_mode='giou', weight=2.0)
])),
test_cfg=dict(max_per_img=300)) # 100 for DeformDETR
# train_pipeline, NOTE the img_scale and the Pad's size_divisor is different
# from the default setting in mmdet.
train_pipeline = [
dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='PhotoMetricDistortion', prob=0.8),
# dict(type='Expand', mean=[103.53, 116.28, 123.675], to_rgb=True, ratio_range=(1, 4), prob=0.5),
dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 4), prob=0.5),
# dict(type='RandomCrop', crop_size=(0.3, 1.0), crop_type='relative_range', prob=0.8),
dict(type='MinIoURandomCrop', aspect_ratio=[.5, 2.], prob=0.8),
dict(type='RandomFlip', prob=0.5),
dict(type='Resize', scale=eval_size, keep_ratio=False, interpolation='bicubic'),
dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
dict(type='PackDetInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
dict(type='Resize', scale=eval_size, keep_ratio=False, interpolation='bicubic'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor'))
]
train_dataloader = dict(batch_size=train_batch_size_per_gpu,
num_workers=train_num_workers,
persistent_workers=persistent_workers,
dataset=dict(filter_cfg=dict(filter_empty_gt=False),
pipeline=train_pipeline))
val_dataloader = dict(batch_size=train_batch_size_per_gpu,
num_workers=train_num_workers,
persistent_workers=persistent_workers,
dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(
type='AdamW',
lr=0.0001, # 0.0002 for DeformDETR
weight_decay=0.0001),
clip_grad=dict(max_norm=0.1, norm_type=2),
paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)},
norm_decay_mult=0,
bias_decay_mult=0)
) # custom_keys contains sampling_offsets and reference_points in DeformDETR # noqa
# learning policy
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=max_epochs, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [
dict(type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=2000),
dict(type='MultiStepLR', begin=0, end=max_epochs, by_epoch=True, milestones=[100], gamma=1.0)
]
# NOTE: `auto_scale_lr` is for automatically scaling LR,
# USER SHOULD NOT CHANGE ITS VALUES.
# base_batch_size = (8 GPUs) x (2 samples per GPU)
auto_scale_lr = dict(enable=True, base_batch_size=16)
custom_hooks = [
dict(type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
priority=49),
] |
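Because `auto_scale_lr` is enabled with `base_batch_size=16`, the learning rate actually used depends on the total batch size. The sketch below applies the linear scaling rule to the values in the config above. The GPU count of 4 is an assumption (the config itself does not state it), and the function name is hypothetical.

```python
import math


def effective_lr(base_lr, samples_per_gpu, num_gpus, base_batch_size):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    return base_lr * (samples_per_gpu * num_gpus) / base_batch_size


# lr=0.0001 and batch size 32 per GPU come from the config; 4 GPUs is assumed.
lr = effective_lr(1e-4, samples_per_gpu=32, num_gpus=4, base_batch_size=16)
backbone_lr = lr * 0.1  # lr_mult=0.1 for 'backbone' in paramwise_cfg
print(lr, backbone_lr)
```

Under these assumptions the effective learning rate is eight times the configured value, so runs with a different total batch size are not directly comparable unless the scaling is taken into account.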
@jiesonshan @hhaAndroid @nijkah
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.470
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.642
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.510
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.298
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.505
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.622
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.690
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.693
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.693
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.505
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.731
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.859
10/29 07:11:54 - mmengine - INFO - bbox_mAP_copypaste: 0.470 0.642 0.510 0.298 0.505 0.622
10/29 07:11:58 - mmengine - INFO - Epoch(val) [72][40/40] coco/bbox_mAP: 0.4700 coco/bbox_mAP_50: 0.6420 coco/bbox_mAP_75: 0.5100 coco/bbox_mAP_s: 0.2980 coco/bbox_mAP_m: 0.5050 coco/bbox_mAP_l: 0.6220 data_time: 0.0351 time: 0.2554
10/29 07:11:58 - mmengine - INFO - The previous best checkpoint /dahuafs/userdata/229288/00_deeplearning/08_meta/02_OD/mmdetection/runs/rtdetr/debug_1026/best_coco_bbox_mAP_epoch_69.pth is removed
10/29 07:12:02 - mmengine - INFO - The best checkpoint with 0.4700 coco/bbox_mAP at 72 epoch is saved to best_coco_bbox_mAP_epoch_72.pth.
Whether Expand uses dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 4), prob=0.5) or dict(type='Expand', mean=[103.53, 116.28, 123.675], to_rgb=True, ratio_range=(1, 4), prob=0.5), there is no difference in the result. |
I plan to add support for RT-DETR to MMDetection and have already completed the code for the r18vd arch model in rtdetr. I didn't notice this existing PR during my coding, and there are some differences between my implementation and this PR. I haven't encountered this situation before. @hhaAndroid What should I do next? Merge my work into this PR? Or submit a new PR? |
Any plan to merge this? Does this PR reproduce the result? |
Any update on this? |
I trained rtdetr_r18vd on 1 gpu (V100) with total batch size 16, with amp
without amp
|
I trained rtdetr_r50vd on 1 gpu (V100) with total batch size 16,
|
After fixing the initialization method, I got
@hhaAndroid I think the reproduction of rtdetr is done ( |
Can you help to propose a PR? |
Really looking forward to RT-DETR going live |
Really looking forward to the release of RT-DETR |
+1 |
Checklist
Motivation
Support RT-DETR https://arxiv.org/abs/2304.08069
resolves #10186
Consideration
Modification
COCO_val evaluation.