🚨Fixed wrong padding value in OWLv2 #41938
Conversation
cc @yonigozlan
yonigozlan left a comment:
Hello @gjamesgoenawan ! Thanks a lot for investigating this and providing sources! Ok to merge for me, just waiting on the CI and the slow CI to pass, as this might change some integration tests. If it does I'll push the new (correct) results directly on this PR
run-slow: owlv2
[For maintainers] Suggested jobs to run (before merge): run-slow: owlv2
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@yonigozlan Thanks for reviewing! I think it would be helpful to mention some inference optimizations from the JAX implementation (text embedding caching and prompt ensembling), as these aren't immediately obvious from the current documentation. Including brief references or examples for these techniques would make them clearer for users.
Indeed, at least for text embedding caching, it seems there's no easy way to use it with the current API. I'm working on a refactor of vision models, and I'll add this to the list of to-dos. |
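As a rough illustration of what text embedding caching could look like, here is a minimal sketch. The `encode_text` stand-in and `CachedTextEncoder` wrapper are hypothetical, not part of the Transformers API; in practice the expensive call would be something like the model's text tower, invoked once per unique prompt instead of once per image.

```python
import numpy as np

# Hypothetical stand-in for an expensive text encoder (e.g. the OWLv2
# text tower). Deterministic per prompt within a process.
def encode_text(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

class CachedTextEncoder:
    """Compute each prompt's embedding once and reuse it across images."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.cache = {}

    def __call__(self, prompts):
        embeddings = []
        for p in prompts:
            if p not in self.cache:
                self.cache[p] = self.encode_fn(p)  # computed only once
            embeddings.append(self.cache[p])
        return np.stack(embeddings)

encoder = CachedTextEncoder(encode_text)
first = encoder(["a photo of a cat", "a photo of a dog"])
second = encoder(["a photo of a cat", "a photo of a dog"])  # cache hits only
```

When evaluating on a dataset like LVIS, where the same class prompts are scored against thousands of images, this turns the text-encoding cost from per-image into per-vocabulary.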
Prompt ensembling is explored in CLIP [1], Section 3.1.4 and Figure 4. TL;DR: it is an inference technique that significantly improves zero-shot performance by averaging logits over multiple text prompts. You can find traces of this in the original JAX evaluator code and in the code snippet I shared, where I define the set of prompt templates and effectively average logits across all of them. Additionally, the top-k operation is a bit different from Transformers' implementation as well: specifically, this line in the post-processing strictly enforces a maximum of 1 detection per object proposal. My implementation follows the original closely, which doesn't have this restriction; instead, it takes the top-k over all logits, meaning one object proposal can yield more than one detection. Without these two methods, evaluation performance differs significantly from the original. [1] CLIP
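The two differences described above (template averaging, and global top-k versus per-proposal argmax) can be sketched with toy numbers. The shapes and variable names here are illustrative only, not the actual scenic or Transformers code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logits: (num_templates, num_proposals, num_classes). In the real
# evaluator the templates are strings like "a photo of a {}".
num_templates, num_proposals, num_classes = 7, 4, 3
logits = rng.standard_normal((num_templates, num_proposals, num_classes))

# Prompt ensembling: average logits over all prompt templates.
ensembled = logits.mean(axis=0)  # (num_proposals, num_classes)

# Original-style top-k over ALL (proposal, class) pairs: one proposal
# may contribute several detections.
k = 5
flat = ensembled.ravel()
topk_idx = np.argsort(flat)[::-1][:k]
proposals, classes = np.unravel_index(topk_idx, ensembled.shape)

# Contrast: taking each proposal's single best class enforces at most
# one detection per proposal (the restriction discussed above).
per_proposal_best = ensembled.argmax(axis=-1)  # (num_proposals,)
```

With only 4 proposals and k=5, the global top-k necessarily reuses at least one proposal index, which is exactly the behavior the per-proposal argmax rules out.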
Thanks for the explanation @gjamesgoenawan, it does sound like some elements were overlooked when adding this model. Feel free to open PRs to add/fix these two features, I'll make sure to review them quickly. |
* Update image_processing_owlv2_fast.py: fixed padding value
* fixed padding value
* Change padding constant value from 0.5 to 0.0
* Fixed missed padding value in modular_owlv2.py

---------

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
What does this PR do?
This PR proposes changing the default padding value from 0.5 to 0.0 in OWLv2. While OWLv1 originally used a padding value of 0.5 (gray) as described in its paper [1], OWLv2 adopts 0.0 instead [2], consistent with its official implementation [3]. Using the incorrect padding value (0.5) leads to degraded performance on the LVIS dataset.
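To make the difference concrete, here is a minimal sketch of square padding with each constant. The `pad_to_square` helper is hypothetical, written for illustration rather than copied from the image processor:

```python
import numpy as np

def pad_to_square(image: np.ndarray, constant_value: float) -> np.ndarray:
    """Pad an HWC image (values in [0, 1]) to a square on the bottom/right."""
    h, w, c = image.shape
    size = max(h, w)
    out = np.full((size, size, c), constant_value, dtype=image.dtype)
    out[:h, :w] = image
    return out

image = np.full((2, 4, 3), 0.25, dtype=np.float32)  # non-square dummy image

gray_padded = pad_to_square(image, 0.5)   # OWLv1-style gray padding (old default)
black_padded = pad_to_square(image, 0.0)  # OWLv2's padding, matching the original repo

# The padded regions differ, which shifts the model's input distribution
# away from what the OWLv2 checkpoint was trained on.
assert gray_padded[-1, -1, 0] == 0.5
assert black_padded[-1, -1, 0] == 0.0
```

Since padding can cover a large fraction of a non-square input, feeding the checkpoint gray borders it never saw in training plausibly explains the LVIS degradation reported here.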
Reproducing the results
Testing scripts:
The following scripts explicitly resize and pad the image beforehand, so no padding is done in the processor.
Commands:
Please prepare LVIS dataset beforehand with the following structure:
After running the scripts, the following logs should be printed:
0.5 padding
0.0 padding
Reference:
[1] OWLv1 (Figure A4.)
[2] OWLv2 (Figure A3).
[3] OWLv2 original implementation, which is changed with this PR (scenic/projects/owl_vit/evaluator.py, line 158).