Adding grounding dino #26087

EduardoPach · 2023-09-11T04:04:20Z

What does this PR do?

This PR adds Grounding DINO

To-Do's:

amyeroberts · 2023-09-12T12:16:31Z

@EduardoPach Thanks for opening this model PR! From next week, I'll be away for a few weeks. If you need a review in that time please ping @rafaelpadilla.

EduardoPach · 2023-09-20T15:33:43Z

@rafaelpadilla hey, so I've finished implementing the model and validated it against the original implementation. I still have to clean up some things and make sure the documentation is correct.

My main question is about pushing the model to the hub. The authors have already uploaded the checkpoints (two checkpoints in the same repo), but they sit under a user account instead of their org (IDEA-Research). What is usually done in this case?

rafaelpadilla · 2023-09-21T17:26:11Z

Hi @EduardoPach ,

Are you referring to groundingdino_swinb_cogcoor.pth and groundingdino_swint_ogc.pth, placed here, right?

In this case, let's consult @ArthurZucker and @younesbelkada.

EduardoPach · 2023-09-21T18:00:00Z

Precisely, I'm asking this because I had the impression that model repos contain only one checkpoint. Also, the IDEA-Research group has other models that we could add to the transformers library later on, so it might be helpful if there was an account for this org.

rafaelpadilla · 2023-09-22T00:57:18Z

For now, you can upload weights to the hub under your own personal profile and use them until this PR is ready to merge.
Afterwards, we'll move the weights under the organization on the hub, and update all the paths to point to those.

younesbelkada · 2023-09-25T12:10:07Z

Hi @EduardoPach
I second what @rafaelpadilla said. For groundingdino_swinb_cogcoor.pth and groundingdino_swint_ogc.pth you can create two different repositories under your personal namespace with a distinguishable suffix (e.g. yournamespace/groundingdino-swinb-cogcoor and yournamespace/groundingdino-swint-ogc), and make sure the files have been renamed to pytorch_model.bin.
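
The naming scheme suggested above can be sketched roughly as follows. The helper `repo_id_for_checkpoint` is hypothetical (it is not part of transformers or huggingface_hub), and the rule of swapping underscores for hyphens is only inferred from the two example repo names:

```python
from pathlib import Path

# Target file name suggested in the comment above.
WEIGHTS_NAME = "pytorch_model.bin"


def repo_id_for_checkpoint(checkpoint_file: str, namespace: str) -> str:
    """Derive a repo id such as `yournamespace/groundingdino-swinb-cogcoor`.

    Hypothetical helper: drops the extension and swaps underscores for hyphens.
    """
    stem = Path(checkpoint_file).stem  # e.g. "groundingdino_swinb_cogcoor"
    return f"{namespace}/{stem.replace('_', '-')}"


checkpoints = ["groundingdino_swinb_cogcoor.pth", "groundingdino_swint_ogc.pth"]

# One repository per checkpoint, each holding a single renamed weights file.
upload_plan = {
    repo_id_for_checkpoint(name, "yournamespace"): WEIGHTS_NAME
    for name in checkpoints
}

for repo_id, target in upload_plan.items():
    print(f"{repo_id} <- {target}")
```

This only plans the repo names; the actual upload step is left out on purpose.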

EduardoPach · 2023-10-04T23:12:21Z

@rafaelpadilla Hey! Could you help me out with these questions?

The ImageProcessor from the original implementation is exactly the same as we have in DeformableDetr. Should I copy the ImageProcessor and just remove the segmentation-related things? (Since GroundingDINO is used only for Object Detection)
Their tokenizer is the same as Bert with a few extra steps after tokenizing so I copied and added this step, but I'm unsure how to push the pre-trained tokenizer to the hub
My implementation of GroundingDINOConfig has an attribute called text_backbone_config which is a GroundingDINOTextPrenetConfig which is just a copy of Bert config. However, after pushing the model to the hub when I try to instantiate the model with .from_pretrained I get an error saying:

ValueError: Parameter config in `GroundingDINOTextPrenet(config)` should be an instance of class `PretrainedConfig`. To create a model from a pretrained model use `model = GroundingDINOTextPrenet.from_pretrained(PRETRAINED_MODEL_NAME)`

and when I do AutoConfig.from_pretrained("EduardoPacheco/grounding-dino-base").text_backbone_config I get {'model_type': 'grounding-dino-text-prenet'}. Is there anything different that I need to do to have a config as an attribute? I've tried looking at CLIP's configuration for ideas, but I'm unsure why I'm not getting the full GroundingDINOTextPrenetConfig after pushing the model to the hub.

rafaelpadilla · 2023-10-05T10:19:26Z

Hi @EduardoPach ,

If your ImageProcessor is an exact copy from another model, you must include the # Copied from comment. If your ImageProcessor only reuses parts of other code, it would be good to have a # Modified from comment.

If I understood correctly, you have already generated the tokens using the new extra steps, right? For pushing your tokens to the hub you could use the hub API. See an example here

I'm not sure whether the problem lies in AutoConfig, since it could not load your GroundingDINOConfig correctly. Have you tried loading it directly with GroundingDINOTextPrenet.from_pretrained("EduardoPacheco/grounding-dino-base")?

EduardoPach · 2023-10-06T16:57:01Z

@rafaelpadilla the ImageProcessor is precisely the same, but the DeformableDetr one works for object detection and segmentation. Right now I've copied the processor and just removed the segmentation stuff, is that okay?

Also, about the config: sorry, I had forgotten to push the modifications I made to the configuration_grounding_dino.py file.

EDIT

I figured out what the issue was haha, it was somewhat dumb. Either way, I wasn't aware that when we push the config to the hub the config class is converted to a config.json and any nested configuration is also converted to a dictionary, so I only had to change my GroundingDINOConfig implementation a bit when creating the text_backbone_config attribute.
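
A minimal sketch of the round trip described above, using plain Python classes rather than the real transformers config classes. `TextConfig` and `ModelConfig` are hypothetical stand-ins, and the default values are illustrative only. The point is that the nested config comes back from JSON as a dict, so the parent constructor has to rebuild it:

```python
import json


class TextConfig:
    """Hypothetical stand-in for a nested text-backbone config."""

    def __init__(self, model_type="grounding-dino-text-prenet", hidden_size=768):
        self.model_type = model_type
        self.hidden_size = hidden_size  # illustrative default

    def to_dict(self):
        return dict(vars(self))


class ModelConfig:
    """Hypothetical stand-in for the top-level model config."""

    def __init__(self, text_backbone_config=None, d_model=256):
        # After a round trip through config.json the nested config arrives as
        # a plain dict, so rebuild the config object before storing it.
        if text_backbone_config is None:
            text_backbone_config = TextConfig()
        elif isinstance(text_backbone_config, dict):
            text_backbone_config = TextConfig(**text_backbone_config)
        self.text_backbone_config = text_backbone_config
        self.d_model = d_model

    def to_json(self):
        data = dict(vars(self))
        # Serialisation flattens the nested config into a dictionary.
        data["text_backbone_config"] = self.text_backbone_config.to_dict()
        return json.dumps(data)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


original = ModelConfig(text_backbone_config=TextConfig(hidden_size=1024))
restored = ModelConfig.from_json(original.to_json())
print(type(restored.text_backbone_config).__name__, restored.text_backbone_config.hidden_size)
```

Without the `isinstance(..., dict)` branch, `restored.text_backbone_config` would stay a bare dictionary, which matches the symptom reported earlier in the thread.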

NielsRogge · 2023-10-13T11:30:28Z

Hi @EduardoPach do you need any help in finishing this PR? Really great to see you're leveraging Copied from for the text encoder and all parts taken from Deformable DETR. Also, if the image processor is exactly the same as Deformable DETR, then we typically don't add a new image processor to the library, but rather just add a line in image_processing_auto, which will allow people to do:

```python
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("sensetime/grounding-dino-base")
```

this will then automatically create a DeformableDetrImageProcessor.
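
The effect of that single mapping line can be pictured with a small sketch. The table and function below are hypothetical stand-ins, not the actual contents of image_processing_auto, and the "grounding-dino" key is an assumed model type; the real auto class also reads information from the checkpoint's files.

```python
# Hypothetical mapping from a config's model_type to an image processor class name.
IMAGE_PROCESSOR_NAMES = {
    "deformable_detr": "DeformableDetrImageProcessor",
    # The single added line: reuse the existing processor for the new model.
    "grounding-dino": "DeformableDetrImageProcessor",
}


def resolve_image_processor(model_type: str) -> str:
    """Return the processor class name registered for `model_type`."""
    try:
        return IMAGE_PROCESSOR_NAMES[model_type]
    except KeyError:
        raise ValueError(f"No image processor registered for {model_type!r}") from None


print(resolve_image_processor("grounding-dino"))
```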

EduardoPach · 2023-10-13T19:20:29Z

Writing here just for record

As we discussed on Discord, I'll do that, do the same for the tokenizer, and just create a GroundingDINOProcessor.

amyeroberts

Thanks for all the work on this - looks great!

Only thing is removing the Bert implementation and using AutoModel instead and some nits. Otherwise we're good to merge 🤗

jiangtann · 2024-04-05T19:36:50Z

> > @EduardoPach https://github.com/EduardoPach/transformers/blob/6f13fbb5f46a8c949a02c5c087de104fdf254f67/src/transformers/models/grounding_dino/modeling_grounding_dino.py#L1436 Can you modify the GroundingDinoMultiheadAttention to support text cross-attention mask?
>
> Any particular reason for this? In the original implementation, they didn't use the cross-attention mask

If the text cross-attention mask is not used, we can only run training and inference with batch_size == 1, which makes inefficient use of GPU memory.

In a batch, multiple texts are padded to a fixed length, so we need a text cross-attention mask.
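
The padding argument can be illustrated with a pure-Python sketch. It is not the GroundingDinoMultiheadAttention code; `PAD_ID`, the token ids, and the attention scores are made up for the example. It shows how prompts of different lengths are padded and how a mask keeps padded positions from receiving attention weight:

```python
import math

PAD_ID = 0  # assumed padding token id for this sketch


def pad_batch(token_ids, pad_id=PAD_ID):
    """Pad token id lists to the longest length and return (padded, mask)."""
    max_len = max(len(ids) for ids in token_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    # True marks a real token, False marks padding.
    mask = [[i < len(ids) for i in range(max_len)] for ids in token_ids]
    return padded, mask


def masked_softmax(scores, mask):
    """Softmax over text positions that gives padded positions zero weight."""
    kept = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    peak = max(kept)
    exps = [math.exp(s - peak) for s in kept]
    total = sum(exps)
    return [e / total for e in exps]


# Two prompts of different lengths (the ids are made up).
padded, mask = pad_batch([[101, 4937, 102], [101, 6556, 2491, 1012, 102]])

# Attention scores of one image query over the first, shorter prompt.
weights = masked_softmax([0.2, 1.5, 0.3, 0.9, 0.9], mask[0])
print(padded[0], weights)
```

Without the mask, the last two scores would take a share of the attention even though they belong to padding.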

EduardoPach · 2024-04-05T20:20:31Z

Yeah, I was under the assumption that one would always use the same labels when doing inference, but thinking more about it I can see some cases where that wouldn't hold, and training would be one as well.

I do think though, that we could fix this in a different PR as this PR is quite old and having a working version of the model in the main repo would be beneficial IMO and then I can work on adding the cross-attention masks as well 🤗.

Are you using the implementation for a project?

jiangtann · 2024-04-07T10:45:50Z

Yes, I'm currently working on the development of extra-visual-module-based MLLM (e.g. LISA, GLaMM, etc.). And I'm using your code with nn.MultiheadAttention instead of GroundingDinoMultiheadAttention for training and inference.

Due to consistent Transformers code style, your code is more suitable to use with LLM (e.g. LlamaForCausalLM). Thanks for your work!

EduardoPach · 2024-04-09T12:57:16Z

Adding here a screenshot of running the tests with RUN_SLOW=1 pytest tests/models/grounding_dino/ -vv, where we can see that the tests are passing and the GPU-specific test test_inference_object_detection_head_equivalence_cpu_gpu is green.

cc @amyeroberts

rb-synth · 2024-04-09T17:00:27Z

I just tried out the model readme, and think it might be slightly outdated. Here are some changes I had to make:

the post-processor needs to be put on the device: inputs = {k: v.to(device) for k, v in inputs.items()}.
inputs is a dictionary, so needs to be inputs["input_ids"] not inputs.input_ids
bbox_threshold needs to be box_threshold.

Otherwise it seems to work well for me, thanks for the hard work!

EduardoPach · 2024-04-10T08:33:50Z

Hey, thanks for the heads up. The output of GroundingDinoProcessor is of type BatchEncoding, so for your first point you can simply do inputs = inputs.to(device) (added that to the model readme). For your second point, if you didn't convert inputs to a dict it should still be a BatchEncoding, so no problem there either. For your third point, I fixed that 😄

rb-synth · 2024-04-10T09:54:42Z

Good points, thanks! Last point, have you checked with a list of text prompts? The type hinting implies this should be possible (text: List[TextInput]), but I haven't succeeded. I tried both with and without padding=True:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = torch.device("cuda")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Check for cats and remote controls
text = ["cat", "remote control"]

inputs = processor(images=image, text=text, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]]
)
```

If padding is not given, error is:

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

If padding=True is passed, the error is:

The expanded size of the tensor (8) must match the existing size (4) at non-singleton dimension 2.  Target sizes: [4, 17821, 8].  Tensor sizes: [8, 1, 4]

rb-synth · 2024-04-10T10:00:10Z

I see that in your example, you separate with full stops, so a list of text could be converted with ". ".join(texts). I tried this on an image with some items of clothing, so the prompt was 'shirt. dress. blouse. jacket. jumper. sweater. undershirt. t-shirt. tie'. I would expect each detected object to be one of the full-stop enclosed phrases, but the results were pretty mangled:

jumper
shirt -
shirt blouse undershirt t - shirt
shirt blouse jacket sweater
blouse undershirt t - shirt tie
shirt blouse sweater -
blouse sweater undershirt t - shirt
blouse jacket jumper sweater t - shirt
shirt blouse jacket sweater undershirt t - shirt
undershirt
##shirt
t - shirt
t - shirt

Is this to be expected?

EduardoPach · 2024-04-10T15:23:58Z

The prompt should also end with a ".", so it would be something like ". ".join(texts) + ".". Fixed that in the example in the model readme as well!

And about your outputs, they are indeed a bit weird haha. It's unexpected to get outputs like shirt blouse jacket sweater if you separated the classes with "." in your text input. Can you share your example?
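
The prompt format discussed in this exchange can be sketched as a small helper. `build_prompt` and `split_prompt` are hypothetical names, not part of GroundingDinoProcessor; the only rule taken from the thread is that labels are joined with ". " and the prompt ends with ".":

```python
def build_prompt(labels):
    """Join class names into one prompt: ". "-separated and ending in "."."""
    cleaned = [label.strip().rstrip(".") for label in labels if label.strip()]
    return ". ".join(cleaned) + "."


def split_prompt(prompt):
    """Recover the individual phrases from a prompt built by `build_prompt`."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


labels = ["cat", "remote control"]
prompt = build_prompt(labels)
print(prompt)
print(split_prompt(prompt))
```

Splitting the prompt back into phrases gives a simple way to check whether a predicted label matches exactly one of the requested classes.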

amyeroberts

Huge piece of work - thanks for adding this model!

EduardoPach · 2024-04-10T17:27:51Z

Thank you for reviewing and for your patience as well 😅. I'll open an issue to add the cross-attention mask that @jiangtann mentioned as soon as this gets merged.

EduardoPach changed the title ~~Adding grounding dino~~ [WIP] Adding grounding dino Sep 11, 2023

EduardoPach mentioned this pull request Sep 11, 2023

[WIP] Add Grounding DINO #25451

Closed

4 tasks

EduardoPach added 4 commits September 17, 2023 23:48

Fixed typo when converting weigths to GroundingDINO vision backbone

6e37211

Final modifications on modeling

0db05e0

Removed unnecessary class

a1eba2e

Fixed convert structure

9cf7c3a

EduardoPach added 2 commits September 24, 2023 01:35

Added image processing

9c55b24

make fixup partially completed

ae570bb

EduardoPach added 2 commits October 6, 2023 13:45

Now text_backbone_config has its own class

1f6475f

Modified convert script

d763e04

Removed unnecessary config attribute

04022d4

NielsRogge reviewed Oct 13, 2023

EduardoPach added 7 commits October 13, 2023 17:06

Added new function to generate sub sentence mask

938f805

Renamed parameters with gamma in the name as it's currently not allowed

6f08b04

Removed tokenization and image_processing scripts since we'll map from existing models

7666253

Fixed some issues with configuration

046e0c5

Just some modifications on conversion script

70b248d

Other modifications

3bc92b7

Copied deformable detr

4cae0ca

amyeroberts reviewed Apr 4, 2024

EduardoPach added 4 commits April 9, 2024 00:12

Addressed more comments

a1e9ff0

Added a new comment

38a2e97

Reduced image size

e9633b4

Addressed more comments

89e070f

EduardoPach requested a review from amyeroberts April 9, 2024 12:54

Nits

a961ab7

Merge remote-tracking branch 'upstream/main' into adding-grounding-dino

6c2a617

Nits

f945c7a

amyeroberts reviewed Apr 10, 2024

Changed the way text_config is initialized

b0891ca

EduardoPach requested a review from amyeroberts April 10, 2024 17:18

amyeroberts reviewed Apr 10, 2024

amyeroberts approved these changes Apr 10, 2024

Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

c630a9c

amyeroberts merged commit b752ad3 into huggingface:main Apr 11, 2024
22 checks passed

EduardoPach mentioned this pull request Apr 11, 2024

Add Cross Attention Masking for Grounding DINO #30176

Closed

xenova mentioned this pull request Jun 5, 2024

Feature request: YOLO-World/Grounding DINO (Zero shot object detection) huggingface/transformers.js#792

Open
