Inconsistent evaluation results #2594

Open
GuoSicen opened this issue Feb 14, 2023 · 13 comments

GuoSicen commented Feb 14, 2023

I use "tools/test.py --eval" to test the test set of the results, and I also save the predicted pictures after the test, then they are compared with the ground truth to get the iou, fscore result. Two result is not consistent, if my config file on the test set is wrong, which due to the different result, the config file is as follows.

norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='ANNHead',
        in_channels=[1024, 2048],
        in_index=[2, 3],
        channels=512,
        project_channels=256,
        query_scales=(1, ),
        key_pool_scales=(1, 3, 6, 8),
        dropout_ratio=0.1,
        num_classes=21,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=21,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))
dataset_type = 'PascalVOCDataset'
data_root = 'data/VOC'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split=['ImageSets/Segmentation/train.txt'],#
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',#
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split='ImageSets/Segmentation/val.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',#
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split='ImageSets/Segmentation/test.txt',#
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=True),
        dict(type='TensorboardLoggerHook')
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=4000)
checkpoint_config = dict(by_epoch=False, interval=100)
evaluation = dict(interval=100, metric='mIoU', pre_eval=True)
work_dir = './work_dirs/ann_r50-d8_512x512_20k_voc12aug/pretrain2'
gpu_ids = [0]
auto_resume = False
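
For reference, this kind of comparison is usually run with mmseg 0.x's tools/test.py roughly as below; the paths are placeholders and the exact flags used are not stated in the issue, so treat this only as a sketch:

# evaluate with mmseg's built-in metric computation
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --eval mIoU

# run inference again and save the results as images for offline comparison;
# note that --show-dir saves visualizations, so check how the saved files
# encode class labels before comparing them against the ground truth
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --show-dir ${SAVE_DIR}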

Rowan-L commented Feb 16, 2023

I also encountered this problem and later found out that it comes from the way IoU is calculated. mmseg computes IoU over the whole dataset, not as the average of the per-image IoUs.

GuoSicen (Author) commented:

If there are 1000 images, with 600 used for training, 100 for validation, and 300 for testing, what does "for the whole dataset" mean here? Does it mean dividing by the total number of images rather than by the number of images in each set? I don't quite understand. Could you please explain it in detail? Thank you. And could you please share a suitable solution?


Rowan-L commented Feb 16, 2023

If there are 1000 images, with 600 used for training, 100 for validation, and 300 for testing, what does "for the whole dataset" mean here? Does it mean dividing by the total number of images rather than by the number of images in each set? I don't quite understand. Could you please explain it in detail? Thank you. And could you please share a suitable solution?
I have noticed from some open-source projects that two ways of calculating this metric exist. The first, used by mmseg, is to accumulate the intersection and union of labels and predictions over the entire test set, which you can imagine as stitching the whole test set into one big image and then computing the IoU of that big image. The second, which I have seen in other projects, computes the IoU of each sample in the test set separately, then sums the per-sample IoUs and divides by the number of samples. These two calculations lead to different results.
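
To make the difference concrete, here is a minimal NumPy sketch of the two aggregation schemes for a single foreground class; the masks and shapes are made up for illustration and this is not mmseg's actual code:

import numpy as np

def single_iou(pred, gt):
    # IoU of one binary prediction/ground-truth pair
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else float('nan')

# toy "dataset": three prediction/ground-truth mask pairs of the same size
rng = np.random.default_rng(0)
preds = [rng.random((512, 512)) > 0.5 for _ in range(3)]
gts = [rng.random((512, 512)) > 0.5 for _ in range(3)]

# Way 1 (what mmseg does): accumulate intersection and union over all
# images first, then divide once at the end
total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
dataset_iou = total_inter / total_union

# Way 2: compute the IoU of each image separately, then average
mean_image_iou = np.nanmean([single_iou(p, g) for p, g in zip(preds, gts)])

# The two numbers generally differ: Way 1 implicitly weights each image
# by the size of its union, while Way 2 weights every image equally.
print(dataset_iou, mean_image_iou)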

GuoSicen (Author) commented:

But before calculating the metrics, the images are resized to the same size (see img_scale=(2048, 512) in the config file). In that situation, the two methods you mentioned above should lead to the same results. I don't understand.


Rowan-L commented Feb 17, 2023

But before calculating the metrics, the images are resized to the same size (see img_scale=(2048, 512) in the config file). In that situation, the two methods you mentioned above should lead to the same results. I don't understand.

In fact, the two approaches yield different results; or perhaps we are not talking about the same issue.

GuoSicen (Author) commented:

When all images in the dataset have the same size, the results of the two methods you mentioned should be the same. Since the pipeline already contains code that resizes the images, I don't think that is the reason for the different results.

GuoSicen (Author) commented:

@xiexinch Could you please help me take a look at where the problem is? I'm still a bit confused.

GuoSicen (Author) commented:

@xiexinch @Rowan-L I think I may have found the problem: the Resize transform is followed by the parameter keep_ratio=True. When it is True, the img_scale given to Resize is not the final resized size but a max/min range; see https://zhuanlan.zhihu.com/p/381117525 . But that raises a question: if this is the case, are the test results still correct? And if they are, how can I obtain the resized prediction images rather than predictions at the original image size?
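
For reference, the keep_ratio=True behaviour can be sketched roughly as follows; this is a simplified illustration of the scaling rule, not mmcv's exact implementation:

def rescale_size(old_size, scale=(2048, 512)):
    # Rough sketch of keep_ratio=True resizing: choose one scale factor so
    # that the long edge fits within max(scale) and the short edge fits
    # within min(scale); the output size therefore depends on the input size.
    w, h = old_size
    max_long, max_short = max(scale), min(scale)
    factor = min(max_long / max(w, h), max_short / min(w, h))
    return round(w * factor), round(h * factor)

# e.g. a 500x375 VOC image is scaled to roughly 683x512, not stretched to 2048x512
print(rescale_size((500, 375)))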

MeowZheng added the FAQ label Feb 22, 2023

MeowZheng commented Feb 22, 2023

keep_ratio=True just keeps the aspect ratio the same as before resizing. The model's prediction will be resized back to the original size of the image. If you want to test with the original image, just modify the test pipeline as:

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]

GuoSicen (Author) commented:

With the same config file, and regardless of the keep_ratio=True operation, the evaluation result from tools/test.py should be the same as the IoU computed between the saved prediction images and the ground truth, so why are they not the same?

MeowZheng (Collaborator) commented:

Would you like to tell me the specific mIoU values that differ?
Please check that there is no randomness in the model, that the model is in eval mode, and that the checkpoint you used is consistent across both evaluations.

GuoSicen (Author) commented:

The IoU result obtained with test.py is 0.7593. The result obtained by running the model, saving the predicted images (in PNG format), and comparing them with the ground truth is 0.6960. There should be no randomness in the model, as the same result is obtained across several runs of test.py.
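
One thing worth double-checking in the offline comparison is that it uses the same dataset-level aggregation and the same ignore label (255 in Pascal VOC) as mmseg. Below is a minimal sketch of such a check; the paths are hypothetical and it assumes the saved predictions are class-index PNGs at ground-truth resolution with the same file names as the annotations:

import numpy as np
from PIL import Image
from pathlib import Path

NUM_CLASSES = 21
IGNORE_INDEX = 255  # VOC boundary pixels, excluded by mmseg during evaluation

pred_dir = Path('work_dirs/preds')                 # hypothetical folder of saved predictions
gt_dir = Path('data/VOC/SegmentationClassPNG')     # ground-truth index maps

inter = np.zeros(NUM_CLASSES)
union = np.zeros(NUM_CLASSES)

for pred_path in sorted(pred_dir.glob('*.png')):
    pred = np.array(Image.open(pred_path))
    gt = np.array(Image.open(gt_dir / pred_path.name))
    assert pred.shape == gt.shape, 'predictions must be at ground-truth resolution'
    valid = gt != IGNORE_INDEX                     # drop ignored pixels, as mmseg does
    for c in range(NUM_CLASSES):
        p, g = pred[valid] == c, gt[valid] == c
        inter[c] += np.logical_and(p, g).sum()
        union[c] += np.logical_or(p, g).sum()

iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
print('per-class IoU:', iou)
print('mIoU:', np.nanmean(iou))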

FlorinAndrei commented:

@GuoSicen I have not examined your report in detail, so I can't be sure, but the magnitude of the difference suggests it may be related to this:

#2655
