# Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding #31
## Summary
The neural-symbolic VQA (NS-VQA) model has three components (a minimal sketch of how they fit together follows this list):

1. Scene Parser
2. Question Parser
3. Program Executor
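To make the three-stage pipeline concrete, here is a minimal Python sketch. The object attributes, module names, and function signatures are illustrative assumptions, not the authors' released code: in the paper the scene parser is a trained segmentation-plus-attribute network and the question parser is an attention-based seq2seq model, both of which are stubbed out with hard-coded values below.

```python
# Minimal sketch of the NS-VQA pipeline (illustrative only; names and
# signatures are assumptions, not the authors' code).

from typing import Dict, List

# --- 1. Scene parser output: a structural, symbolic scene representation. ---
# Each object is a small attribute record; NS-VQA derives these from images
# with a segmentation network plus an attribute network, stubbed here.
Scene = List[Dict[str, object]]

scene: Scene = [
    {"shape": "cube",     "color": "red",  "material": "metal",
     "size": "large", "position": (0.5, 1.2, 0.35)},
    {"shape": "sphere",   "color": "blue", "material": "rubber",
     "size": "small", "position": (2.1, 0.4, 0.35)},
    {"shape": "cylinder", "color": "red",  "material": "rubber",
     "size": "small", "position": (1.0, 2.0, 0.35)},
]

# --- 2. Question parser output: a program, i.e. a sequence of modules. ---
# NS-VQA predicts this with a seq2seq model; here it is hard-coded for the
# question "How many red objects are there?"
program = ["scene", "filter_color[red]", "count"]

# --- 3. Program executor: deterministically run modules over the scene. ---
def run_module(name: str, state):
    if name == "scene":                    # start from all objects
        return list(scene)
    if name.startswith("filter_color["):   # keep objects of a given color
        color = name[len("filter_color["):-1]
        return [o for o in state if o["color"] == color]
    if name == "count":                    # terminal module: the answer
        return len(state)
    raise ValueError(f"unknown module: {name}")

def execute(program: List[str]):
    state = None
    for name in program:
        state = run_module(name, state)
    return state

print(execute(program))  # -> 2 (the red cube and the red cylinder)
```

The design point this illustrates: once both the scene and the program are symbolic, execution is fully deterministic and inspectable, which is what disentangles reasoning from vision and language understanding.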
## Evaluation: Data-Efficient, Interpretable Reasoning

### Dataset
CLEVR. The dataset contains synthetic images of 3D primitives with multiple attributes: shape, color, material, size, and 3D coordinates. Each image comes with a set of questions, each of which is associated with a program (a sequence of symbolic modules) machine-generated from 90 logic templates; a sketch of template instantiation appears below.

### Quantitative results
Repeated experiments starting from different sets of programs show a standard deviation of less than 0.1 percent on the results for 270 pretraining programs (and beyond). The variance is larger when the model is trained with fewer programs (90 and 180). The reported numbers are the mean of three runs.

### Data-efficiency comparison
See the data-efficiency table in the paper.
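To make concrete what "a program machine-generated from logic templates" means, here is a hedged sketch of template instantiation. The template text, placeholder scheme, and module names are illustrative stand-ins, not one of the actual 90 CLEVR templates.

```python
# Sketch of CLEVR-style template instantiation: a question template and its
# paired program skeleton share placeholders that are filled with concrete
# attribute values. The template below is illustrative only.

import itertools

template = {
    "text": "How many {color} {material} objects are there?",
    "program": ["scene",
                "filter_color[{color}]",
                "filter_material[{material}]",
                "count"],
}

colors = ["red", "blue"]
materials = ["metal", "rubber"]

def instantiate(template, **values):
    """Fill the shared placeholders in both the text and the program."""
    text = template["text"].format(**values)
    program = [m.format(**values) for m in template["program"]]
    return text, program

# One template yields many (question, program) pairs.
for color, material in itertools.product(colors, materials):
    text, program = instantiate(template, color=color, material=material)
    print(text, "->", program)
```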
### Qualitative examples
IEP tends to fake a long, wrong program that still leads to the correct answer. In contrast, NS-VQA achieves 88% program accuracy with 500 annotations, and performs almost perfectly on both question answering and program recovery with 9K programs.

## Evaluation: Generalizing to Unseen Attribute Combinations

### Dataset
CLEVR-CoGenT. Derived from CLEVR and separated into two biased splits (a small encoding of the constraints appears after this list):

- Split A: cubes are gray, blue, brown, or yellow; cylinders are red, green, purple, or cyan.
- Split B: the palettes are swapped, so cubes are red, green, purple, or cyan and cylinders are gray, blue, brown, or yellow.
- Spheres may take any color in both splits.
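The split constraints can be written down as a small lookup table. This is a sketch assuming the split definitions listed above; the dictionary layout and the `allowed` helper are illustrative, not part of any released dataset tooling.

```python
# Sketch of the CLEVR-CoGenT color constraints (layout and helper name are
# illustrative). Spheres are unconstrained in both splits.

COGENT_COLORS = {
    "A": {"cube":     {"gray", "blue", "brown", "yellow"},
          "cylinder": {"red", "green", "purple", "cyan"}},
    "B": {"cube":     {"red", "green", "purple", "cyan"},
          "cylinder": {"gray", "blue", "brown", "yellow"}},
}

def allowed(split: str, shape: str, color: str) -> bool:
    """Return True if a (shape, color) pair can appear in the given split."""
    constraints = COGENT_COLORS[split]
    if shape not in constraints:        # e.g. spheres: any color allowed
        return True
    return color in constraints[shape]

assert allowed("A", "cube", "blue")
assert not allowed("A", "cube", "red")  # red cubes only appear in split B
assert allowed("A", "sphere", "red")    # spheres unconstrained in both
```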
### Results
See Table 2a.
## Evaluation: Generalizing to Questions from Humans

### Dataset
CLEVR-Humans: free-form questions written by humans about CLEVR images.
### Results
See Table 2b. This shows our structural scene representation and symbolic program executor help the model generalize to free-form human questions.

## Evaluation: Extending to New Scene Context

### Dataset
Minecraft scenes, which introduce object categories and layouts beyond the CLEVR primitives.
### Results
## Future Research
Beyond supervised learning, recent papers by Irina Higgins et al. [4] have made inspiring attempts to explore how concepts naturally emerge during unsupervised learning (see the related work on structural scene representation in the paper). We see integrating our model with these approaches as a promising future direction.

## Reference