Zero-shot performance of YOLOWorldPromptDetector #154
Hi @taofuyu, you need to freeze all parameters (backbone, head, and neck) except the embeddings. However, I need to double-check whether all layers are frozen.
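The freezing rule described above can be sketched as follows. This is a minimal sketch: the parameter names below are illustrative stand-ins, not the exact names in a YOLO-World checkpoint, so check them against your own model.

```python
# Minimal sketch of prompt tuning's freezing rule: keep gradients only
# for parameters whose name mentions "embeddings". Parameter names here
# are illustrative placeholders.
def trainable_mask(param_names, keep="embeddings"):
    """Map each parameter name to whether it should stay trainable."""
    return {name: (keep in name) for name in param_names}

# In PyTorch this would be applied as:
#   for name, p in model.named_parameters():
#       p.requires_grad = keep in name
names = [
    "backbone.stage1.conv.weight",    # frozen
    "neck.top_down_layers.0.weight",  # frozen
    "bbox_head.cls_pred.0.weight",    # frozen
    "embeddings",                     # the only trainable tensor
]
mask = trainable_mask(names)
```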
OK, I will give it a try and report back with the result.
You can evaluate the 4-category detection and the 3-category detection separately, and then perform the joint evaluation.
But the parameters of the backbone, head, and neck are all frozen, and the only updated parameters, the embeddings, are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?
This seems to confirm my suspicion. After running 10 epochs, the model can only detect 'car', which appears in the pre-training datasets; the other new categories cannot be detected (they can be detected when the model is not frozen).
@taofuyu Do you know the difference between all fine-tuning and prompt tuning? I'm not clear about the config file for all fine-tuning.
You can compare the two config files with VSCode or a similar diff tool.
@Hudaodao99 It's my fault, I should have started a separate branch to avoid misleading anyone. Prompt tuning only optimizes the embeddings, while all fine-tuning updates all parameters.
|
Thanks for your answer! |
@taofuyu I'll check it. |
@taofuyu I met the same problem. But when prompt tuning on my custom dataset (10 classes), I find that if I provide fewer than 10 text prompts, it raises an error, like this (I provide just 2 text prompts, neither of which is in my dataset, yet the predicted classes go beyond 2): class = [1 2 4 4 3 6] Have you met the same issue?
The detection results still follow the embeddings/num_classes set in the config, while the texts are whatever you typed on the command line; when the two counts differ, the dimensions no longer match.
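The mismatch described above can be guarded against with a quick shape check before inference. This is a sketch; `check_prompts` is a hypothetical helper, not part of the repo:

```python
def check_prompts(texts, embeddings, num_classes):
    """Hypothetical guard: the number of command-line text prompts must
    match both num_classes in the config and the number of rows in the
    pre-computed embedding file, otherwise the head's dimensions mismatch."""
    if not (len(texts) == num_classes == len(embeddings)):
        raise ValueError(
            f"{len(texts)} prompts vs {len(embeddings)} embedding rows "
            f"vs num_classes={num_classes}: these must all agree")

emb_10 = [[0.0] * 512 for _ in range(10)]  # stand-in for a 10-row .npy file
try:
    # 2 prompts against a 10-class config reproduces the reported error
    check_prompts(["cat", "dog"], emb_10, num_classes=10)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```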
Thanks! |
I am trying to find a way out of this issue, so I am learning more about OVD algorithms. MM-Grounding-DINO mentions that closed-set fine-tuning loses OVD generality.
Furthermore, it mentions that, |
We did not expect this; the original intention of prompt tuning is to retain the zero-shot capability and generalization while achieving stronger performance on custom datasets.
Thanks, I have already changed the learning rate to 2e-4 for my fine-tuning.
@wondervictor Hi! I'm not quite sure what the difference is between the purpose of |
@taofuyu Hello, how did fine-tuning go after you adjusted the learning rate to 2e-4? Did it solve the problem of losing open-set detection ability after fine-tuning?
I have the same problem. After fine-tuning on my local custom dataset (20 classes, each with a different text prompt), I want to retain the zero-shot ability of the original pretrained CLIP weights, but that does not seem to happen: common prompts like 'person', 'people', and 'human' are still detected, but for my own dataset, the other text prompts are not.
@mio410 No |
Hi @taofuyu, @xiyangyang99, @Hudaodao99, and @mio410, sorry for the delay. I'll check it and provide solutions asap. Please stay tuned and please let me know if you have any updates. |
Can separate inference solve the problem? It occurs to me that interference between the prompts may be the cause. @taofuyu
Sorry, could you please explain this in detail?
One text prompt may interfere with the inference of another; you can refer to the text-guided CSPLayer in the paper. I would also like to use the prompt tuning technique and hope to solve this issue, as mentioned in:
@taofuyu Any update? In case you didn't notice the answer above.
@Yindong-Zhang, ongoing |
I think just tuning on custom data together with GoldG is fine. The model can detect the custom categories and retain its OVD ability at the same time.
Adding VG (or GoldG) to fine-tuning does maintain the zero-shot performance. I'm now seeking more efficient approaches, such as regularization, for efficient fine-tuning.
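The mixing described above could look like the following sketch of an mmengine-style config. Dataset type names follow those used in the YOLO-World configs (e.g. `YOLOv5MixedGroundingDataset`); every path and the empty pipeline here are placeholders you would replace with values from your own base config.

```python
train_pipeline = []  # placeholder: reuse the train_pipeline from your base config

# Custom detection data with per-class text prompts
custom_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5CocoDataset',
                 data_root='data/custom/',
                 ann_file='annotations/train.json'),
    class_text_path='data/texts/custom_classes.json',
    pipeline=train_pipeline)

# GoldG-style grounding data (GQA + Flickr) to help preserve zero-shot ability
goldg_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/mixed_grounding/',
    ann_file='annotations/final_mixed_train_no_coco.json',
    pipeline=train_pipeline)

# Concatenate both sources so every epoch sees custom and grounding data
train_dataloader = dict(
    dataset=dict(type='ConcatDataset',
                 datasets=[custom_dataset, goldg_dataset]))
```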
Hi all, Thank you! |
tuning custom data with GoldG |
Does this mean the grounding dataset is key to building open-vocabulary/zero-shot ability?
Yes, I think so |
Hello, I fine-tuned with COCO+GQA but ran into a problem: no matter how I set the parameters, after a few epochs grad_norm becomes very large, the loss blows up as well, and then stays at 0. Could you advise what causes this?
The config I used is as follows (the comment body is truncated, so most entries are cut off mid-definition):

```python
_base_ = ('../../third_party/mmyolo/configs/yolov8/'
# hyper-parameters
num_classes = 80
load_from = '/mnt/sdc/lishen/yolo-world-model/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
model = dict(
text_transform = [
train_pipeline = [
mg_train_dataset = dict(type='YOLOv5MixedGroundingDataset',
coco_train_dataset = dict(
    type='MultiModalDataset',
train_dataloader = dict(batch_size=train_batch_size_per_gpu,
test_pipeline = [
val_dataloader = dict(dataset=coco_val_dataset)
val_evaluator = dict(_delete_=True,
default_hooks = dict(
optim_wrapper = dict(optimizer=dict(
```
Hello, do you mean that training the custom dataset with just a GoldG-containing config, such as yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py, is enough, without mixing in other datasets such as Flickr or GQA?
GoldG is the collective name for Flickr and those other grounding datasets.
The config looks fine to me; I'm not sure of the exact cause.
Hello, do the embeddings here refer to the text embeddings? How are they updated? Through the I-Pooling Attention?
@taofuyu Hello, I also want to add the GoldG dataset to keep zero-shot ability, but I don't know how to set it up. Could you share your relevant config? If you can, please send it to this email: mr.pengc@foxmail.com. Thank you very much for sharing.
|
I ran into the same issue as before: #71, #78.
I modified the config in configs/prompt_tuning_coco/ and generated a custom embedding file to fine-tune on my dataset, which has 4 categories.
At inference time, I generate a new embedding file with 7 categories (the 4 old classes seen in training plus 3 new classes) and replace the old embedding file in the config.
These 3 new classes CANNOT be detected, even with the score threshold set to 0.01.
It seems the model has lost its open-vocabulary/zero-shot ability.
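For reference, regenerating the embedding file for the 7-category setup could look like the sketch below. Only the normalize-and-save logic is shown; `fake_encode` is a random stand-in for the real CLIP text encoder, which is the part the repo provides, and the output path is a placeholder.

```python
import numpy as np

def build_embedding_file(class_names, encode, out_path):
    """Encode each class name, L2-normalize row-wise, save as float32 .npy."""
    embs = np.stack([encode(name) for name in class_names])    # shape (N, dim)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit-norm rows
    np.save(out_path, embs.astype(np.float32))
    return embs.shape

# 4 classes seen during prompt tuning plus 3 unseen ones
classes = ["car", "truck", "bus", "van", "bicycle", "scooter", "tricycle"]
# deterministic random stand-in for the CLIP text encoder (512-d features)
fake_encode = lambda n: np.random.default_rng(sum(map(ord, n))).normal(size=512)
shape = build_embedding_file(classes, fake_encode, "/tmp/custom_7cls.npy")
```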