bugfix: image demo & support image and text prompts
wondervictor committed Mar 16, 2024
1 parent 323386a commit ee57525
Showing 13 changed files with 475 additions and 78 deletions.
13 changes: 7 additions & 6 deletions README.md
@@ -31,12 +31,13 @@
</div>


## 🔥 Updates
`[2024-3-16]:` We fix several demo bugs ([#110](https://github.com/AILab-CVC/YOLO-World/issues/110), [#94](https://github.com/AILab-CVC/YOLO-World/issues/94), [#129](https://github.com/AILab-CVC/YOLO-World/issues/129), [#125](https://github.com/AILab-CVC/YOLO-World/issues/125)), including the visualization of segmentation masks, and release [**YOLO-World with Embeddings**](./docs/prompt_yolo_world.md), which supports prompt tuning, text prompts, and image prompts.
`[2024-3-3]:` We add the **high-resolution YOLO-World**, which supports `1280x1280` resolution with higher accuracy and better performance for small objects!
`[2024-2-29]:` We release the newest version of [**YOLO-World-v2**](./docs/updates.md) with higher accuracy and faster speed! We hope the community can join us to improve YOLO-World!
`[2024-2-28]:` Excited to announce that YOLO-World has been accepted by **CVPR 2024**! We're continuing to make YOLO-World faster and stronger, as well as making it better to use for all.
`[2024-2-22]:` We sincerely thank [RoboFlow](https://roboflow.com/) and [@Skalskip92](https://twitter.com/skalskip92) for the [**Video Guide**](https://www.youtube.com/watch?v=X7gKBGVz4vs) about YOLO-World, nice work!
`[2024-2-18]:` We thank [@Skalskip92](https://twitter.com/skalskip92) for developing the wonderful segmentation demo by connecting YOLO-World and EfficientSAM. You can try it now at the [🤗 HuggingFace Spaces](https://huggingface.co/spaces/SkalskiP/YOLO-World).
`[2024-2-17]:` The largest model **X** of YOLO-World is released, which achieves better zero-shot performance!
`[2024-2-17]:` We release the code & models for **YOLO-World-Seg** now! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
`[2024-2-15]:` The pre-trained YOLO-World-L with CC3M-Lite is released!
@@ -15,6 +15,8 @@
weight_decay = 0.05
train_batch_size_per_gpu = 16
load_from = 'pretrained_models/yolo_world_l_clip_base_dual_vlpan_2e-3adamw_32xb16_100e_o365_goldg_train_pretrained-0e566235.pth'
# huggingface text model
text_model_name = 'openai/clip-vit-base-patch32'
persistent_workers = False

# model settings
@@ -30,7 +32,7 @@
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name='openai/clip-vit-base-patch32',
model_name=text_model_name,
frozen_modules=['all'])),
neck=dict(type='YOLOWorldPAFPN',
guide_channels=text_channels,
@@ -15,7 +15,8 @@
base_lr = 2e-3
weight_decay = 0.05 / 2
train_batch_size_per_gpu = 16

# text_model_name = '../pretrained_models/clip-vit-base-patch32-projection'
text_model_name = 'openai/clip-vit-base-patch32'
# model settings
model = dict(
type='YOLOWorldDetector',
@@ -29,7 +30,7 @@
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name='openai/clip-vit-base-patch32',
model_name=text_model_name,
frozen_modules=['all'])),
neck=dict(type='YOLOWorldPAFPN',
guide_channels=text_channels,
@@ -0,0 +1,161 @@
_base_ = ('../../third_party/mmyolo/configs/yolov8/'
'yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'], allow_failed_imports=False)

# hyper-parameters
num_classes = 80
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-3
weight_decay = 0.05
train_batch_size_per_gpu = 16
load_from = 'pretrained_models/yolo_world_l_clip_t2i_bn_2e-3adamw_32xb16-100e_obj365v1_goldg_cc3mlite_train-ca93cd1f.pth'
persistent_workers = False

# model settings
model = dict(type='YOLOWorldPromptDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
embedding_path='embeddings/clip_vit_b32_coco_80_embeddings.npy',
prompt_dim=text_channels,
num_prompts=80,
data_preprocessor=dict(type='YOLOv5DetDataPreprocessor'),
backbone=dict(_delete_=True,
type='MultiModalYOLOBackbone',
text_model=None,
image_model={{_base_.model.backbone}},
frozen_stages=4,
with_text_model=False),
neck=dict(type='YOLOWorldPAFPN',
freeze_all=True,
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
bbox_head=dict(type='YOLOWorldHead',
head_module=dict(
type='YOLOWorldHeadModule',
freeze_all=True,
use_bn_head=True,
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

# dataset settings
final_transform = [
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction'))
]
mosaic_affine_transform = [
dict(type='Mosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform, *mosaic_affine_transform,
dict(type='YOLOv5MixUp',
prob=_base_.mixup_prob,
pre_transform=[*_base_.pre_transform, *mosaic_affine_transform]),
*_base_.last_transform[:-1], *final_transform
]

train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *final_transform]

coco_train_dataset = dict(type='YOLOv5CocoDataset',
data_root='data/coco',
ann_file='annotations/instances_train2017.json',
data_prefix=dict(img='train2017/'),
filter_cfg=dict(filter_empty_gt=False, min_size=32),
pipeline=train_pipeline)

train_dataloader = dict(persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=coco_train_dataset)

test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param'))
]
coco_val_dataset = dict(type='YOLOv5CocoDataset',
data_root='data/coco',
ann_file='annotations/instances_val2017.json',
data_prefix=dict(img='val2017/'),
filter_cfg=dict(filter_empty_gt=False, min_size=32),
pipeline=test_pipeline)

val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
# training settings
default_hooks = dict(param_scheduler=dict(scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(bias_decay_mult=0.0,
norm_decay_mult=0.0,
custom_keys={
'backbone.text_model':
dict(lr_mult=0.01),
'logit_scale':
dict(weight_decay=0.0),
'embeddings':
dict(weight_decay=0.0)
}),
constructor='YOLOWv5OptimizerConstructor')

# evaluation settings
val_evaluator = dict(_delete_=True,
type='mmdet.CocoMetric',
proposal_nums=(100, 1, 10),
ann_file='data/coco/annotations/instances_val2017.json',
metric='bbox')
find_unused_parameters = True
@@ -17,7 +17,8 @@
train_batch_size_per_gpu = 8
load_from = 'pretrained_models/yolo_world_l_clip_base_dual_vlpan_2e-3adamw_32xb16_100e_o365_goldg_train_pretrained-0e566235.pth'
persistent_workers = False

# text_model_name = '../pretrained_models/clip-vit-base-patch32-projection'
text_model_name = 'openai/clip-vit-base-patch32'  # default to the HuggingFace model; switch to a local checkpoint if needed
# Polygon2Mask
downsample_ratio = 4
mask_overlap = False
@@ -38,7 +39,7 @@
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name='openai/clip-vit-base-patch32',
model_name=text_model_name,
frozen_modules=[])),
neck=dict(type='YOLOWorldDualPAFPN',
guide_channels=text_channels,
1 change: 1 addition & 0 deletions deploy/deploy.py
@@ -9,6 +9,7 @@
import torch.multiprocessing as mp
from torch.multiprocessing import Process, set_start_method


from mmdeploy.apis import (create_calib_input_data, extract_model,
get_predefined_partition_cfg, torch2onnx,
torch2torchscript, visualize_model)
17 changes: 15 additions & 2 deletions docs/data.md
@@ -10,7 +10,7 @@ For pre-training YOLO-World, we adopt several datasets as listed in the below ta
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |

### Dataset Directory

We put all data into the `data` directory, such as:
@@ -84,4 +84,17 @@ For custom dataset, we suggest the users convert the annotation files according

1. **Large vocabulary, grounding, referring:** you can follow the annotation format of the `MixedGrounding` dataset, which adds `caption` and `tokens_positive` to assign texts to each object. The texts can be category names or noun phrases.

2. **Custom vocabulary (fixed):** you can adopt the `MultiModalDataset` wrapper, as used for `Objects365`, and create a **text json** for your custom categories (see the sketch below).
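
A minimal sketch for generating such a text json is shown below. The list-of-lists layout (one inner list of prompts per class) follows the `data/texts/*.json` files shipped with the repository; the class names and output path are placeholders, so compare with an existing file such as the COCO class texts before training.

```python
# Minimal sketch: write a "text json" for a fixed custom vocabulary.
# The per-class list-of-lists layout and the data/texts/ location are assumptions.
import json
import os

custom_classes = [['helmet'], ['safety vest'], ['forklift']]  # hypothetical classes

os.makedirs('data/texts', exist_ok=True)
with open('data/texts/custom_class_texts.json', 'w') as f:
    json.dump(custom_classes, f)
```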


### CC3M Pseudo Annotations

The following annotations are generated with the automatic labeling process described in our paper, and the reported results are based on these annotations.

To use CC3M annotations, you need to prepare the `CC3M` images first.

| Data | Images | Boxes | File |
| :--: | :----: | :---: | :---: |
| CC3M-246K | 246,363 | 820,629 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_annotations.json) |
| CC3M-500K | 536,405 | 1,784,405 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_500k_annotations.json) |
| CC3M-750K | 750,000 | 4,504,805 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_750k_annotations.json) |
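
The annotation files can also be fetched programmatically; a minimal sketch with `huggingface_hub` follows (the local target directory is an assumption, adjust it to your data layout):

```python
# Minimal sketch: download the CC3M-246K pseudo annotations from the
# wondervictor/YOLO-World Hugging Face repo; local_dir is an assumption.
from huggingface_hub import hf_hub_download

ann_path = hf_hub_download(repo_id='wondervictor/YOLO-World',
                           filename='cc3m_pseudo_annotations.json',
                           local_dir='data/cc3m/annotations')
print(ann_path)
```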
73 changes: 73 additions & 0 deletions docs/prompt_yolo_world.md
@@ -0,0 +1,73 @@
## Prompt YOLO-World


### 1. Simple YOLO-World with Embeddings

To simplify YOLO-World and remove the language model, we define a new basic detector, `YOLOWorldPromptDetector`:

The `YOLOWorldPromptDetector` takes prompt embeddings as input and no longer contains a language model!
YOLO-World now adopts `embeddings` as language inputs, and several kinds of embeddings are supported: (1) text embeddings from a language model, e.g., the CLIP text encoder, (2) image embeddings from a vision model, e.g., the CLIP vision encoder, (3) image-text fused embeddings, and (4) random embeddings.
Kinds (1), (2), and (3) support zero-shot inference, while (4), together with (1), (2), and (3), is designed for prompt tuning on your custom data.

The basic detector is defined as follows:

```python
class YOLOWorldPromptDetector(YOLODetector):
    """Implementation of YOLO World Series"""

    def __init__(self,
                 *args,
                 mm_neck: bool = False,
                 num_train_classes=80,
                 num_test_classes=80,
                 prompt_dim=512,
                 num_prompts=80,
                 embedding_path='',
                 freeze_prompt=False,
                 use_mlp_adapter=False,
                 **kwargs)
```

To use it in a zero-shot manner, you need to pre-compute the text embeddings (or image embeddings) and save them as a NumPy array (`*.npy`) of shape `N×D`, where `N` is the number of prompts and `D` is the embedding dimension. Currently, we only support one prompt per class. You can use several prompts for one class, but you then need to merge the results in the post-processing step.
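
A minimal sketch for pre-computing such embeddings with the Hugging Face CLIP API is shown below; the prompt list, the output path, and the L2-normalization step are assumptions and should be checked against the released `clip_vit_b32_coco_80_embeddings.npy`.

```python
# Minimal sketch: pre-compute CLIP prompt embeddings and save them as an N x D
# numpy array. Prompts, output path, and normalization are assumptions.
import os

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

prompts = ['person', 'bicycle', 'car']  # one prompt per class -> N = 3
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32').eval()
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

with torch.no_grad():
    inputs = processor(text=prompts, return_tensors='pt', padding=True)
    embeds = model.get_text_features(**inputs)           # N x 512
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)  # L2-normalize (assumed)

os.makedirs('embeddings', exist_ok=True)
np.save('embeddings/custom_prompt_embeddings.npy',
        embeds.cpu().numpy().astype(np.float32))
# Image prompts work analogously via processor(images=...) and
# model.get_image_features(...).
```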


### 2. Prompt Tuning YOLO-World

We introduce prompt tuning for YOLO-World to maintain the zero-shot ability while improving the performance on your custom datasets.

For more details about writing configs for prompt tuning, you can refer to [`prompt tuning for COCO data`](./../configs/prompt_tuning_coco/yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_mask-refine_prompt_tuning_coco.py).

1. Use random prompts

```python
dict(type='YOLOWorldPromptDetector',
     mm_neck=True,
     num_train_classes=num_training_classes,
     num_test_classes=num_classes,
     prompt_dim=text_channels,
     num_prompts=80,
     ...)
```

2. Use CLIP embeddings (text, image, or text-image embeddings)

The `clip_vit_b32_coco_80_embeddings.npy` file can be downloaded from [HuggingFace](https://huggingface.co/wondervictor/YOLO-World/blob/main/clip_vit_b32_coco_80_embeddings.npy).

```python
dict(type='YOLOWorldPromptDetector',
     mm_neck=True,
     num_train_classes=num_training_classes,
     num_test_classes=num_classes,
     embedding_path='embeddings/clip_vit_b32_coco_80_embeddings.npy',
     prompt_dim=text_channels,
     num_prompts=80,
     ...)
```

Using the CLIP model to obtain the image and text embeddings maintains the zero-shot performance.


| Model | Config | AP | AP50 | AP75 | APS | APM | APL |
| :---- | :----: | :--: | :--: | :---: | :-: | :-: | :-: |
| YOLO-World-v2-L | Zero-shot | 45.7 | 61.6 | 49.8 | 29.9 | 50.0 | 60.8 |
| [YOLO-World-v2-L](./../configs/prompt_tuning_coco/yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_mask-refine_prompt_tuning_coco.py) | Prompt tuning | 47.9 | 64.3 | 52.5 | 31.9 | 52.6 | 61.3 |