This is the official repository for *How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges*, created by Haotong Qin*, Ge-Peng Ji*, Salman Khan, Deng-Ping Fan#, Fahad Shahbaz Khan, and Luc Van Gool from ETH Zurich, MBZUAI, and ANU.
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in conversational AI. Notably, Bard was recently updated to accept visual inputs alongside text prompts. Given Bard's impressive track record on textual inputs, we explore its ability to understand and interpret visual data (images) conditioned on text questions. This exploration has the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study we focus on 13 diverse task scenarios spanning regular, camouflaged, medical, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding is that Bard still struggles in these vision scenarios, highlighting a significant gap in vision-based understanding that future developments need to bridge. We hope this empirical study proves valuable in advancing future models toward enhanced comprehension and interpretation of fine-grained visual data.
Fig. 1. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the Microsoft COCO dataset.
Fig. 2. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the Tiny-ImageNet-C dataset.
Fig. 3. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the MPID dataset.
Fig. 4. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the Image Sentiment dataset.
Fig. 5. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the FGVC dataset.
Fig. 6. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the COD10K dataset.
Fig. 7. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the IOCfish5K dataset.
Fig. 8. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the CDS2K dataset.
Fig. 9. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the TextVQA dataset.
Fig. 10. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the SUN-SEG dataset.
Fig. 11. Several examples of multi-modal interactive sessions using Google's Bard, wherein the AI system responds to the user's question based on images sourced from the RAVQA-LR dataset.
If you find our work useful in your research, please consider citing:
@article{GoogleBard_VisUnderstand:MIR23,
  author  = {Haotong Qin and Ge-Peng Ji and Salman Khan and Deng-Ping Fan and Fahad Shahbaz Khan and Luc Van Gool},
  title   = {How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges},
  journal = {Machine Intelligence Research (MIR)},
  doi     = {10.1007/s11633-023-1469-x},
  volume  = {20},
  number  = {5},
  pages   = {605--613},
  year    = {2023}
}