Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:
Summary:
1.- GQA is a new dataset for real-world visual reasoning and compositional question answering over images.
2.- Existing VQA datasets have weaknesses like short/simple questions and language biases that limit their usefulness for measuring visual understanding.
3.- GQA attaches structure to everything: each image has a scene graph specifying its objects, their attributes, and the relations between them.
4.- Questions also have structural representations as functional programs listing the reasoning steps needed to answer them over the scene graph.
5.- The scene graph allows automatically creating 22 million multi-step questions of varying compositionality, each corresponding to a graph path.
6.- A robust question engine traverses the graph and translates the path into a natural language question, handling grammar and syntax.
7.- This generates linguistically rich and semantically diverse questions covering spatial reasoning, comparisons, logic, relations, and multi-step inference.
8.- Structural representations help reduce question biases that models previously exploited to guess answers without true scene understanding.
9.- An iterative balancing method uses question semantics to make answer distributions more uniform and reduce bias.
10.- Structural representations also enable new evaluation metrics beyond accuracy, like consistency in answering equivalent questions and grounding answers in images.
11.- The new metrics provide further insight into model behavior and inner workings.
12.- More information is available at visualreasoning.org or at CVPR poster 189.
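
Points 3, 4, and 10 can be made concrete with a small sketch. The Python below is an illustrative toy, not GQA's actual data format or tooling: the class names, the four operations (`select`, `relate`, `query`, `verify`), and the example scene are all assumptions made for this example. It shows a scene graph, a functional program executed step by step over that graph, and a simplified consistency score computed over questions entailed by a correctly answered one.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red"}


@dataclass
class SceneGraph:
    objects: dict    # object id -> SceneObject
    relations: list  # (subject id, relation name, object id)


def run_program(graph, program):
    """Execute (operation, argument) steps over the graph and return an answer."""
    state = []  # ids of the objects currently in focus
    for op, arg in program:
        if op == "select":
            # Focus on every object with the given name.
            state = [i for i, o in graph.objects.items() if o.name == arg]
        elif op == "relate":
            # Move to subjects named `name` that stand in `rel` to a focused object.
            rel, name = arg
            state = [s for s, r, o in graph.relations
                     if r == rel and o in state and graph.objects[s].name == name]
        elif op == "query":
            # Read an attribute off the first focused object.
            return graph.objects[state[0]].attributes.get(arg) if state else None
        elif op == "verify":
            # Yes/no check of an attribute value on the focused objects.
            attr, value = arg
            found = any(graph.objects[i].attributes.get(attr) == value for i in state)
            return "yes" if found else "no"
    return state


def consistency(predictions, gold, entailed):
    """For each correctly answered question, score its entailed questions; average."""
    scores = []
    for qid, related in entailed.items():
        if predictions.get(qid) != gold[qid] or not related:
            continue  # only correctly answered source questions count
        correct = sum(predictions.get(r) == gold[r] for r in related)
        scores.append(correct / len(related))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical scene: a red apple on a brown table, plus a green apple elsewhere.
graph = SceneGraph(
    objects={
        "o1": SceneObject("apple", {"color": "red"}),
        "o2": SceneObject("table", {"color": "brown"}),
        "o3": SceneObject("apple", {"color": "green"}),
    },
    relations=[("o1", "on", "o2")],
)

# "What color is the apple on the table?"
color_question = [("select", "table"), ("relate", ("on", "apple")), ("query", "color")]
# "Is the apple on the table red?"
verify_question = [("select", "table"), ("relate", ("on", "apple")), ("verify", ("color", "red"))]

color_answer = run_program(graph, color_question)
verify_answer = run_program(graph, verify_question)

# A model that answers q1 correctly but contradicts itself on one entailed question.
gold = {"q1": "red", "q2": "yes", "q3": "no"}
predictions = {"q1": "red", "q2": "yes", "q3": "yes"}
score = consistency(predictions, gold, {"q1": ["q2", "q3"]})
```

Because both questions share the same first two program steps, they are linked by construction. That shared structure is what would let a consistency metric of this kind detect a model that answers "red" to the first question but "no" to the second.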
Knowledge Vault built by David Vivancos 2024