GoogleBard-VisUnderstand

This is the official project of How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges, created by Haotong Qin*, Ge-Peng Ji*, Salman Khan, Deng-Ping Fan#, Fahad Shahbaz Khan, and Luc Van Gool from ETH Zurich, MBZUAI, and ANU.

1 Introduction

Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record on textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 13 diverse task scenarios encompassing regular, camouflaged, medical, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data.
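
For reference, the minimal sketch below (hypothetical, not part of this repository; all file paths and questions are illustrative placeholders) shows one way the (image, question) probes behind Figs. 1-11 could be organized before being submitted, pair by pair, through Bard's chat interface:

# Hypothetical sketch of how the qualitative probes can be organized: each
# task scenario pairs an image with a text question, and every (image,
# question) pair is submitted to Bard's web chat interface by hand.
# Paths and questions are illustrative placeholders, not files in this repo.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Probe:
    scenario: str   # e.g., regular, camouflaged, medical, remote sensing
    image: Path     # image shown to Bard together with the question
    question: str   # text prompt conditioning the visual task

probes = [
    Probe("regular (COCO)", Path("images/coco_example.jpg"),
          "How many people are in this image, and what are they doing?"),
    Probe("camouflaged (COD10K)", Path("images/cod10k_example.jpg"),
          "Is there a camouflaged animal hidden in this scene? Where?"),
    Probe("medical (SUN-SEG)", Path("images/sunseg_example.jpg"),
          "Can you locate a polyp in this colonoscopy frame?"),
]

for p in probes:
    # Each probe is pasted into the Bard UI with its image; the returned
    # answer is recorded as one qualitative example (cf. Figs. 1-11).
    print(f"[{p.scenario}] {p.image} -> {p.question}")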

2 Empirical Experiments


Fig. 1. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the Microsoft COCO dataset.


Fig. 2. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the Tiny-ImageNet-C dataset.


Fig. 3. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the MPID dataset.


Fig. 4. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the Image Sentiment dataset.


Fig. 5. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the FGVC dataset.


Fig. 6. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the COD10K dataset.


Fig. 7. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the IOCfish5K dataset.


Fig. 8. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the CDS2K dataset.


Fig. 9. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the TextVQA dataset.


Fig. 10. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the SUN-SEG dataset.


Fig. 11. Several examples of multi-modal interactive sessions using Google’s BARD, wherein the AI system responds to the user’s question based on images sourced from the RAVQA-LR dataset.

Citation

If you find our work useful in your research, please consider citing:

@article{GoogleBard_VisUnderstand:MIR23,
  author  = {Haotong Qin and Ge-Peng Ji and Salman Khan and Deng-Ping Fan and Fahad Shahbaz Khan and Luc Van Gool},
  title   = {How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges},
  journal = {Machine Intelligence Research (MIR)},
  doi     = {10.1007/s11633-023-1469-x},
  volume  = {20},
  number  = {5},
  pages   = {605--613},
  year    = {2023}
}
