
More details about VQModel used in OFA? #396

Closed
YAOYI626 opened this issue Jun 5, 2023 · 4 comments
@YAOYI626

YAOYI626 commented Jun 5, 2023

Hi team,

Thanks for the amazing work on OFA! I'd like to know more about the VQ model it uses.

Is the same VQ model shared across different tasks, such as captioning and generation? How is the VQ model trained? @logicwong @JustinLin610

Thanks,
Xiaoyi

@logicwong
Member

@YAOYI626 Thanks for your interest.

  1. The VQ model is exclusively employed for image infilling and generation. We discretize the raw image into a sequence of codes using the VQ model, and OFA learns to generate the codes based on the text descriptions or masked images.
  2. For other tasks, like image captioning, we directly embed raw images into vectors via ResNet.
  3. We utilize the pre-trained VQ model from here.
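The discretization in point 1 boils down to a nearest-codebook lookup over the encoder's spatial features. A minimal sketch of that lookup, assuming NumPy and made-up names (this is generic VQ logic, not the actual OFA/VQGAN code):

```python
import numpy as np

def vq_tokenize(features, codebook):
    """Map each spatial feature vector to the index of its nearest codebook entry.

    features: (H, W, D) array of encoder outputs for one image.
    codebook: (K, D) array of learned code vectors.
    Returns a flat (H*W,) sequence of integer codes.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)  # (H*W, D)
    # Squared Euclidean distance from every feature vector to every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (H*W,) indices into the codebook

# Toy example: a 4x4 feature map against an 8-entry codebook.
rng = np.random.default_rng(0)
codes = vq_tokenize(rng.normal(size=(4, 4, 16)), rng.normal(size=(8, 16)))
print(codes.shape)  # (16,)
```

The resulting integer sequence is what a model like OFA would learn to generate autoregressively, conditioned on text or a masked image.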

@YAOYI626
Author

YAOYI626 commented Jun 9, 2023

Hey @logicwong, thanks for your reply!

Just curious: is there a specific reason for doing captioning without VQ? For instance, is there a big performance gap between captioning with VQ codes and captioning with ResNet embeddings?

Thanks
Xiaoyi

@logicwong
Member

@YAOYI626 There are two main reasons:

  1. Discretizing images with VQ loses information from the original image. In our preliminary experiments, using VQ led to a significant performance drop on the captioning and VQA tasks.
  2. We use a compression ratio of f8 to discretize images, so a 256x256 image becomes a 32x32 grid of codes, i.e. a sequence of length 1024. This increases the training cost.
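The f8 arithmetic in point 2 can be checked with a tiny helper (the function name is mine, not from the OFA code):

```python
def code_seq_len(resolution: int, ratio: int) -> int:
    """Length of the discrete code sequence for a square image.

    With a spatial compression ratio of f (e.g. f8), each side shrinks
    by that factor, so the code grid is (res // f) x (res // f).
    """
    side = resolution // ratio
    return side * side

print(code_seq_len(256, 8))   # 1024, matching the sequence length in the thread
print(code_seq_len(256, 16))  # 256 under a stronger f16 compression
```

This also shows why compression ratio matters for cost: halving the ratio quadruples the sequence length the transformer must process.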

@YAOYI626
Author

Thanks @logicwong for the helpful information. I'd like to close this issue.
