Knowledge Vault 5/50 - CVPR 2019
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
Drew A. Hudson; Christopher D. Manning

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
classDef gqa fill:#f9d4d4, font-weight:bold, font-size:14px
classDef vqa fill:#d4f9d4, font-weight:bold, font-size:14px
classDef structure fill:#d4d4f9, font-weight:bold, font-size:14px
classDef metrics fill:#f9f9d4, font-weight:bold, font-size:14px
A[GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering] --> B[GQA: new visual reasoning dataset. 1]
A --> C[Existing VQA datasets have weaknesses. 2]
B --> D[GQA provides structured scene graphs. 3]
D --> E[Questions represented as functional programs. 4]
D --> F[Scene graph generates multi-step questions. 5]
F --> G[Question engine translates graph paths. 6]
G --> H[Generates diverse, multi-step inference questions. 7]
D --> I[Structure reduces exploitable question biases. 8]
I --> J[Balancing method reduces answer bias. 9]
D --> K[Structure enables consistency and grounding metrics. 10]
K --> L[Metrics provide model behavior insights. 11]
B --> M[More info: visualreasoning.org, CVPR poster. 12]
class A,B,M gqa
class C vqa
class D,E,F,G,H,I,J structure
class K,L metrics

Summary:

1.- GQA is a new dataset for real-world visual reasoning and compositional question answering over images.

2.- Existing VQA datasets have weaknesses like short/simple questions and language biases that limit their usefulness for measuring visual understanding.

3.- GQA provides structure throughout: each image has a scene graph specifying its objects, their attributes, and the relations between them.

4.- Questions also have structural representations as functional programs listing the reasoning steps needed to answer them over the scene graph.

5.- The scene graph allows automatically creating 22 million multi-step questions of varying compositionality, each corresponding to a graph path.

6.- A robust question engine traverses the scene graph and translates each path into a natural language question, handling grammar and syntax.

7.- This generates linguistically rich and semantically diverse questions covering spatial reasoning, comparisons, logic, relations, and multi-step inference.

8.- Structural representations help reduce question biases that models previously exploited to guess answers without true scene understanding.

9.- An iterative balancing method uses question semantics to make answer distributions more uniform and reduce bias.

10.- Structural representations also enable new evaluation metrics beyond accuracy, like consistency in answering equivalent questions and grounding answers in images.

11.- The new metrics provide further insight into model behavior and inner workings.

12.- More info is at visualreasoning.org or CVPR poster 189.
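
Points 3 and 4 above can be illustrated with a minimal sketch: a toy scene graph and a question expressed as a functional program whose steps run over that graph. All names here (`SceneObject`, `run_program`, the operation names, and the toy scene itself) are hypothetical and chosen for illustration; they are not the GQA data format or the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SceneObject:
    """One node of a toy scene graph (hypothetical structure)."""
    name: str
    attributes: Set[str] = field(default_factory=set)
    # relation name -> ids of the objects it points to
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Toy scene: a red apple on a wooden table, plus a green apple elsewhere.
scene: Dict[str, SceneObject] = {
    "o1": SceneObject("apple", {"red"}, {"on": ["o2"]}),
    "o2": SceneObject("table", {"wooden"}),
    "o3": SceneObject("apple", {"green"}),
}


def run_program(graph: Dict[str, SceneObject],
                program: List[Tuple[str, Optional[str]]]) -> Optional[str]:
    """Execute reasoning steps in order; each step consumes the previous result."""
    ids: List[str] = []
    for op, arg in program:
        if op == "select":      # find all objects with a given name
            ids = [i for i, o in graph.items() if o.name == arg]
        elif op == "filter":    # keep objects that have a given attribute
            ids = [i for i in ids if arg in graph[i].attributes]
        elif op == "relate":    # follow a relation edge to its targets
            ids = [t for i in ids for t in graph[i].relations.get(arg, [])]
        elif op == "query_name":  # final step: name of the first remaining object
            return graph[ids[0]].name if ids else None
        else:
            raise ValueError(f"unknown operation: {op}")
    return None


# "What is the red apple on?" as a four-step program.
program = [("select", "apple"), ("filter", "red"),
           ("relate", "on"), ("query_name", None)]
answer = run_program(scene, program)
print(answer)  # table
```

Each step corresponds to one hop or filter on the graph, which is why a question's program length is a natural measure of how many reasoning steps it requires.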

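The balancing idea in point 9 can be sketched in a simplified form. The code below is an assumption-laden illustration, not the procedure used to build GQA: within one group of semantically similar questions, it walks the answers from least to most frequent and downsamples each so that it is at most `max_ratio` times as frequent as the previous one. The function name `balance_group`, the `max_ratio` parameter, and the sample data are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def balance_group(pairs: List[QA], max_ratio: float = 1.5, seed: int = 0) -> List[QA]:
    """Downsample frequent answers in one question group (simplified sketch).

    Answers keep their relative frequency order, but the gap between
    neighbouring answers is capped at max_ratio (assumed >= 1).
    """
    rng = random.Random(seed)
    by_answer: Dict[str, List[QA]] = defaultdict(list)
    for question, answer in pairs:
        by_answer[answer].append((question, answer))

    kept: List[QA] = []
    previous_count = None
    # Least frequent answer first, so every cap is based on an already final count.
    for items in sorted(by_answer.values(), key=len):
        if previous_count is None:
            cap = len(items)
        else:
            cap = min(len(items), math.ceil(max_ratio * previous_count))
        kept.extend(rng.sample(items, cap))
        previous_count = cap
    return kept


# Hypothetical group "what color is the X?" with a heavily skewed answer prior.
pairs = ([(f"q_white_{i}", "white") for i in range(10)]
         + [(f"q_black_{i}", "black") for i in range(4)]
         + [(f"q_red_{i}", "red") for i in range(2)])

balanced = balance_group(pairs)
counts = Counter(answer for _, answer in balanced)
print(counts.most_common())  # [('white', 5), ('black', 3), ('red', 2)]
```

After balancing, "white" is still the most common answer, but guessing it blindly is right only half the time instead of about 62% of the time, which weakens the prior a model could exploit without looking at the image.
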
Knowledge Vault built by David Vivancos 2024