Zero-shot performance about YOLOWorldPromptDetector #154

Open
taofuyu opened this issue Mar 19, 2024 · 42 comments
Labels
bug: Something isn't working
discussions: The issue might be helpful or contains useful information

Comments

@taofuyu
Contributor

taofuyu commented Mar 19, 2024

I ran into the same problem as before, #71 , #78 .
I modified the config in configs/prompt_tuning_coco/ and generated a custom embedding file to fine-tune on my dataset, which has 4 categories.
At inference time, I generate a new embedding file with 7 categories (the 4 classes seen in training plus 3 new classes) and replace the old embedding file in the config.
The 3 new classes CANNOT be detected, even with the score threshold set to 0.01.
It seems the model loses its open-vocabulary/zero-shot ability.

@wondervictor
Collaborator

Hi @taofuyu, you need to freeze all parameters (backbone, head, and neck) except the embeddings. However, I need to double-check whether all layers are frozen.
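A minimal plain-PyTorch sketch of this idea (the "embeddings" name filter below is an assumption, not necessarily the exact parameter name in this repo):

import torch.nn as nn

def freeze_all_but_embeddings(model: nn.Module, keyword: str = 'embeddings') -> None:
    # Freeze backbone, neck, and head; leave only the prompt embeddings trainable.
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
    trainable = [name for name, p in model.named_parameters() if p.requires_grad]
    print(f'{len(trainable)} trainable tensors:', trainable)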

@taofuyu
Contributor Author

taofuyu commented Mar 19, 2024

OK, I will give it a try and update with the result.

@wondervictor
Collaborator

You can evaluate the 4-category detection and 3-category detection separately and then perform the joint evaluation.
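For example, with pycocotools the evaluation can be restricted to a subset of category ids; the ids and file paths below are illustrative:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val.json')           # ground truth (illustrative path)
coco_dt = coco_gt.loadRes('work_dirs/results.bbox.json')   # detector outputs in COCO format

for name, cat_ids in {'seen (4)': [1, 2, 3, 4], 'novel (3)': [5, 6, 7]}.items():
    ev = COCOeval(coco_gt, coco_dt, iouType='bbox')
    ev.params.catIds = cat_ids        # evaluate only this subset of categories
    ev.evaluate()
    ev.accumulate()
    print(name)
    ev.summarize()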

@taofuyu
Contributor Author

taofuyu commented Mar 19, 2024

But the backbone, head, and neck parameters are all frozen, and the only updated parameters, the 'embeddings', are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?
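If the tuned embeddings do get written into the checkpoint, one way to export them back to a .npy file for inference could look like this (the checkpoint path and the 'embedding' key filter are assumptions):

import numpy as np
import torch

ckpt = torch.load('work_dirs/prompt_tuning/epoch_10.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)
emb_keys = [k for k in state if 'embedding' in k.lower()]   # inspect what is actually stored
print(emb_keys)
emb = state[emb_keys[0]].float().numpy()
np.save('data/texts/tuned_class_embeddings.npy', emb)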

@taofuyu
Contributor Author

taofuyu commented Mar 19, 2024

This seems to validate my idea. After running 10 epochs, the model can only detect 'car', which appears in the pre-training datasets; the other new categories cannot be detected (they can be detected when the model is not frozen).

@Hudaodao99

@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear about the all_fine_tuning config file.

@taofuyu
Contributor Author

taofuyu commented Mar 19, 2024

@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear about the all_fine_tuning config file.

You can compare the two files in VS Code or a similar diff tool.
The main difference is the value of freeze_all (True or False).

@wondervictor
Collaborator

@Hudaodao99 It's my fault; I should start a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters without needing a text encoder.
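A rough plain-PyTorch sketch of what "only optimizes the embeddings" means (not the repo's exact module; the 512-dim CLIP width is assumed from the configs in this thread):

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptEmbeddings(nn.Module):
    # Class embeddings initialized from the pre-computed CLIP .npy file and
    # updated by backprop of the detection loss; no text encoder is needed.
    def __init__(self, embedding_path: str):
        super().__init__()
        init = torch.from_numpy(np.load(embedding_path)).float()  # (num_classes, 512)
        self.embeddings = nn.Parameter(init)

    def forward(self) -> torch.Tensor:
        # The detection head matches these normalized embeddings against region features.
        return F.normalize(self.embeddings, dim=-1)

# Prompt tuning would pass only these parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(PromptEmbeddings('embeddings.npy').parameters(), lr=2e-4)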

@taofuyu
Contributor Author

taofuyu commented Mar 19, 2024

But the backbone, head, and neck parameters are all frozen, and the only updated parameters, the 'embeddings', are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?

@wondervictor

@Hudaodao99

@Hudaodao99 It's my fault; I should start a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters without needing a text encoder.

Thanks for your answer!

@wondervictor
Collaborator

But the backbone, head, and neck parameters are all frozen, and the only updated parameters, the 'embeddings', are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?

@taofuyu I'll check it.

@Hudaodao99

Hudaodao99 commented Mar 20, 2024

@taofuyu I met the same problem. But when prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 prompt texts, I get an error like the one below (I passed only 2 prompt texts, neither of which is in my dataset, while the number of classes is more than 2):

class= [1 2 4 4 3 6]
confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985]
Traceback (most recent call last):
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in <module>
    inference_detector(runner,
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector
    labels = [
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in <listcomp>
    f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in
IndexError: list index out of range

Have you encountered the same problem?

@taofuyu
Contributor Author

taofuyu commented Mar 20, 2024

@taofuyu I met the same problem. But when prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 prompt texts, I get an error like the one below (I passed only 2 prompt texts, neither of which is in my dataset, while the number of classes is more than 2):

class= [1 2 4 4 3 6] confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985] Traceback (most recent call last): File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in inference_detector(runner, File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector labels = [ File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in IndexError: list index out of range

Have you encountered the same problem?

The detection results still follow the embeddings / num_classes set in the config, while the texts are what you typed on the command line; if the counts differ, the dimensions no longer match.
The correct approach is: when testing, generate new embeddings for exactly the classes you need, change num_classes accordingly, and keep both consistent with the texts passed on the command line.
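A hedged sketch of regenerating a matching embedding file with Hugging Face CLIP; check it against the repo's own embedding-generation script, since the output path and normalization here are assumptions:

import numpy as np
import torch
from transformers import CLIPModel, CLIPTokenizer

texts = ['car', 'truck']  # must match the texts passed to image_demo.py, in the same order
name = 'openai/clip-vit-base-patch32'
tokenizer = CLIPTokenizer.from_pretrained(name)
model = CLIPModel.from_pretrained(name)

with torch.no_grad():
    inputs = tokenizer(texts, padding=True, return_tensors='pt')
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as CLIP does

np.save('data/texts/my_2_class_embeddings.npy', feats.numpy())
# Then set num_classes (and num_training_classes if relevant) to len(texts) in the config.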

@Hudaodao99

The detection results still follow the embeddings / num_classes set in the config, while the texts are what you typed on the command line; if the counts differ, the dimensions no longer match. The correct approach is: when testing, generate new embeddings for exactly the classes you need, change num_classes accordingly, and keep both consistent with the texts passed on the command line.

Thanks!

@taofuyu
Contributor Author

taofuyu commented Mar 21, 2024

I am trying to find a way out of this issue, so I have been learning more about OVD algorithms. MM-Grounding-DINO mentions that closed-set fine-tuning will lose OVD generality.
Maybe this is the reason why my model cannot detect the 3 new classes. I'm not sure; you can take this as a reference.
@wondervictor

@taofuyu
Contributor Author

taofuyu commented Mar 21, 2024

Furthermore, it mentions that mixing COCO data with some of the pre-training data improves performance on the COCO dataset as much as possible without compromising generalization.
My experiments confirm this. I mixed Flickr30k/GQA with my custom data to train YOLOWorldDetector, and the model can detect my categories while retaining its OVD ability.
But if so, it means YOLOWorldPromptDetector can only be fine-tuned as a closed-set detector, because grounding data cannot be used when training YOLOWorldPromptDetector.
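A sketch of what that mixing can look like as a config fragment; the dataset types follow the ones used later in this thread, while the paths, the class-text file, and the reuse of train_pipeline from the fine-tuning config are placeholders/assumptions:

# Custom detection data wrapped with its own class texts.
custom_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5CocoDataset',
                 data_root='data/my_dataset/',
                 ann_file='annotations/train.json',
                 data_prefix=dict(img='images/'),
                 filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/my_class_texts.json',
    pipeline=train_pipeline)

# Grounding data (GQA/Flickr30k, i.e. GoldG) to help preserve open-vocabulary ability.
mg_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/mixed_grounding/',
    ann_file='annotations/final_mixed_train_no_coco.json',
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline)

train_dataloader = dict(
    dataset=dict(_delete_=True,
                 type='ConcatDataset',
                 datasets=[custom_train_dataset, mg_train_dataset],
                 ignore_keys=['classes', 'palette']))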

@wondervictor
Collaborator

We did not expect this; the original intention of prompt tuning is to retain the zero-shot capability and generalization while achieving stronger performance on custom datasets.

@wondervictor added the bug and discussions labels Mar 21, 2024
@wondervictor
Collaborator

Hi @taofuyu, it seems that the configs in configs/prompt_tuning_coco wrongly use base_lr=2e-3. It's a mistake I've made. For fine-tuning all modules, the base_lr should be set to 2e-4. As for training prompts only, I'm going to check again.

@taofuyu
Contributor Author

taofuyu commented Mar 21, 2024

Hi @taofuyu, it seems that the configs in configs/prompt_tuning_coco wrongly use base_lr=2e-3. It's a mistake I've made. For fine-tuning all modules, the base_lr should be set to 2e-4. As for training prompts only, I'm going to check again.

Thanks, I had already changed the lr to 2e-4 for my fine-tuning.

@Hudaodao99

Hudaodao99 commented Mar 22, 2024

@Hudaodao99 It's my fault; I should start a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters without needing a text encoder.

@wondervictor Hi! I'm not quite sure what the difference is between the purpose of all-tuning and prompt tuning. Can all-tuning achieve open-vocabulary detection and custom detection together, like prompt tuning? Also, through prompt tuning, can we generate and export our own custom .npy file?

@mio410

mio410 commented Mar 26, 2024

@taofuyu Hi, how did the fine-tuning go after you changed the learning rate to 2e-4? Does it solve the problem of losing open-set detection ability after fine-tuning?

@xiyangyang99

I have the same problem. After fine-tuning on my own dataset locally (20 classes, each with its own text prompt), I want the fine-tuned model to keep the zero-shot ability of the original pre-trained CLIP weights. But the result doesn't seem to work that way: common prompts such as person, people, and human can all be detected, but different texts from my own dataset cannot be detected.

@wondervictor wondervictor pinned this issue Mar 27, 2024
@taofuyu
Contributor Author

taofuyu commented Apr 3, 2024

@mio410 No.
@xiyangyang99 Same question here.
@wondervictor Hello, any updates on this issue?

@wondervictor
Collaborator

Hi @taofuyu, @xiyangyang99, @Hudaodao99, and @mio410, sorry for the delay. I'll check it and provide solutions asap. Please stay tuned and please let me know if you have any updates.

@Yindong-Zhang

Could separate inference solve the problem? It occurs to me that interference between the prompts may be causing it. @taofuyu

@taofuyu
Contributor Author

taofuyu commented Apr 8, 2024

Could separate inference solve the problem? It occurs to me that interference between the prompts may be causing it. @taofuyu

Sorry, could you please explain this in detail?

@Yindong-Zhang

Yindong-Zhang commented Apr 8, 2024

One text prompt may interfere with the inference process of another; you can refer to the text-guided CSPLayer in the paper. I would also like to use the prompt tuning technique and hope this issue gets solved, as mentioned in:
#154 (comment)
If separate inference and evaluation works correctly, it may get around the problem.

@Yindong-Zhang

@taofuyu Any update? (In case you didn't notice the answer above.)

@wondervictor
Collaborator

@Yindong-Zhang, ongoing

@taofuyu
Contributor Author

taofuyu commented Apr 16, 2024

I think just tuning custom data together with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.

@wondervictor
Collaborator

Adding VG (or GoldG) for fine-tuning does maintain the zero-shot performance. I'm now looking for more efficient approaches, such as regularization, for efficient fine-tuning.
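One possible form such regularization could take, sketched in plain PyTorch (a generic L2-to-initialization penalty, not a method confirmed for YOLO-World): keep the tuned embeddings close to their CLIP initialization so the open-vocabulary embedding space is not distorted.

import torch

def embedding_anchor_loss(tuned: torch.Tensor,
                          init: torch.Tensor,
                          weight: float = 0.1) -> torch.Tensor:
    # Penalize drift of the tuned class embeddings away from their frozen
    # CLIP initialization (an L2-SP-style regularizer).
    return weight * (tuned - init.detach()).pow(2).sum(dim=-1).mean()

# total_loss = detection_loss + embedding_anchor_loss(prompt.embeddings, clip_init)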

@mandyxiaomeng

Hi all,
I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?

Thank you!

@taofuyu
Contributor Author

taofuyu commented May 10, 2024

Hi all, I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?

Thank you!

tuning custom data with GoldG

@trihook

trihook commented May 14, 2024

Hi all, I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?
Thank you!

tuning custom data with GoldG

Does that mean the grounding dataset is key to building open-vocabulary/zero-shot ability?

@taofuyu
Contributor Author

taofuyu commented May 15, 2024

Hi all, I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?
Thank you!

tuning custom data with GoldG

Does that mean the grounding dataset is key to building open-vocabulary/zero-shot ability?

Yes, I think so

@wondervictor wondervictor unpinned this issue May 23, 2024
@Ricardoluffy

Hi all, I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?
Thank you!

tuning custom data with GoldG

Hi, I'm fine-tuning with COCO + GQA, but I've run into a problem: no matter how I set the parameters, after a few epochs grad_norm becomes very large, the loss also becomes very large, and then the loss stays at 0. Could you tell me what might cause this?

@Ricardoluffy

The config file I'm using is as follows:

_base_ = ('../../third_party/mmyolo/configs/yolov8/'
          'yolov8_l_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'],
                      allow_failed_imports=False)

# hyper-parameters
num_classes = 80
num_training_classes = 80
max_epochs = 30  # Maximum training epochs
close_mosaic_epochs = 30
save_epoch_intervals = 2
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 1e-4
weight_decay = 0.05
train_batch_size_per_gpu = 8

load_from = '/mnt/sdc/lishen/yolo-world-model/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
text_model_name = 'openai/clip-vit-base-patch32'

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            model_name=text_model_name,
            frozen_modules=['all'])),
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
    bbox_head=dict(type='YOLOWorldHead',
                   head_module=dict(type='YOLOWorldHeadModule',
                                    use_bn_head=True,
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]

train_pipeline = [
    *_base_.pre_transform,
    dict(type='MultiModalMosaic',
         img_scale=_base_.img_scale,
         pad_val=114.0,
         pre_transform=_base_.pre_transform),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        max_aspect_ratio=_base_.max_aspect_ratio,
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114)),
    *_base_.last_transform[:-1],
    *text_transform,
]
train_pipeline_stage2 = [
    *_base_.train_pipeline_stage2[:-1], *text_transform]

mg_train_dataset = dict(type='YOLOv5MixedGroundingDataset',
                        data_root='/mnt/sdc/lishen/Dataset/GQA',
                        ann_file='annotations/final_mixed_train_no_coco.json',
                        data_prefix=dict(img='images/'),
                        filter_cfg=dict(filter_empty_gt=False, min_size=32),
                        pipeline=train_pipeline)

coco_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='/mnt/sdc/Datasets/public/COCO',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False,
                        min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)

train_dataloader = dict(batch_size=train_batch_size_per_gpu,
                        collate_fn=dict(type='yolow_collate'),
                        dataset=dict(_delete_=True,
                                     type='ConcatDataset',
                                     datasets=[
                                         mg_train_dataset,
                                         coco_train_dataset
                                     ],
                                     ignore_keys=['classes', 'palette']))

test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                    'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='/mnt/sdc/Datasets/public/COCO',
        test_mode=True,
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        batch_shapes_cfg=None),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=test_pipeline)

val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader

val_evaluator = dict(_delete_=True,
                     type='mmdet.CocoMetric',
                     proposal_nums=(100, 1, 10),
                     ann_file='/mnt/sdc/Datasets/public/COCO/annotations/instances_val2017.json',
                     metric='bbox')
test_evaluator = val_evaluator

default_hooks = dict(
    param_scheduler=dict(
        scheduler_type='linear',
        lr_factor=0.01,
        max_epochs=max_epochs),
    checkpoint=dict(
        max_keep_ckpts=-1,
        save_best=None,
        interval=save_epoch_intervals))
custom_hooks = [
    dict(
        type='EMAHook',
        ema_type='ExpMomentumEMA',
        momentum=0.0001,
        update_buffers=True,
        strict_load=False,
        priority=49),
    dict(
        type='mmdet.PipelineSwitchHook',
        switch_epoch=max_epochs - close_mosaic_epochs,
        switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=2,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])

optim_wrapper = dict(optimizer=dict(
    _delete_=True,
    type='SGD',
    lr=base_lr,
    momentum=0.937,
    nesterov=True,
    weight_decay=weight_decay,
    batch_size_per_gpu=train_batch_size_per_gpu),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.text_model': dict(lr_mult=0.01),
            'logit_scale': dict(weight_decay=0.0)
        }),
    constructor='YOLOWv5OptimizerConstructor')

@lvke9529

@taofuyu

I think just tuning custom data together with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.

Hi, do you mean that training the custom dataset with a config that includes GoldG, such as yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py, is enough, without mixing in other datasets such as Flickr or GQA for training?

@taofuyu
Contributor Author

taofuyu commented Aug 8, 2024

@taofuyu

I think just tuning custom data together with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.

Hi, do you mean that training the custom dataset with a config that includes GoldG, such as yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py, is enough, without mixing in other datasets such as Flickr or GQA for training?

GoldG is just the collective name for Flickr and those other grounding datasets.

@taofuyu
Contributor Author

taofuyu commented Aug 8, 2024

Hi all, I wonder if there is any update? How can I fine-tune to retain the zero-shot capability and generalization while still achieving stronger performance on my custom datasets?
Thank you!

tuning custom data with GoldG

Hi, I'm fine-tuning with COCO + GQA, but I've run into a problem: no matter how I set the parameters, after a few epochs grad_norm becomes very large, the loss also becomes very large, and then the loss stays at 0. Could you tell me what might cause this?

The config looks fine to me; I'm not sure of the exact cause.
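For the exploding grad_norm specifically, one generic knob worth checking is gradient clipping on the MMEngine optim_wrapper; whether it fixes this particular run is an assumption, and the max_norm value below is illustrative:

# Merged into the existing optim_wrapper by the config system.
optim_wrapper = dict(
    clip_grad=dict(max_norm=10.0, norm_type=2))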

@goodbbboy

But the backbone, head, and neck parameters are all frozen, and the only updated parameters, the 'embeddings', are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?

I'll check it.

Hi, do the embeddings here refer to the text embeddings? How are they updated? Through I-Pooling Attention?

@qiiiiiiiiiiiiiiiii

@taofuyu Hello, I also want to add the GoldG dataset to keep the zero-shot ability, but I don't know how to set it up. Could you share your relevant config file? If possible, please send it to this email: mr.pengc@foxmail.com. Thank you very much for sharing.

Furthermore, it mentions that mixing COCO data with some of the pre-training data improves performance on the COCO dataset as much as possible without compromising generalization. My experiments confirm this. I mixed Flickr30k/GQA with my custom data to train YOLOWorldDetector, and the model can detect my categories while retaining its OVD ability. But if so, it means YOLOWorldPromptDetector can only be fine-tuned as a closed-set detector, because grounding data cannot be used when training YOLOWorldPromptDetector.
