Knowledge Vault 5/19 - CVPR 2016
Neural Module Networks
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
< Summary Image >

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef vision fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef networks fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef parsing fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef attention fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef vqa fill:#f9d4f9, font-weight:bold, font-size:14px
  A[Neural Module Networks] --> B[Answering questions based on images. 1]
  A --> C[Constructing neural networks from questions. 2]
  A --> D[Colored shapes for teaching. 3]
  A --> E[Networks built from question analysis. 4]
  A --> F[Using network for image processing. 5]
  A --> G[Vision-related structured neural models. 6]
  A --> H[Related work in NLP. 7]
  A --> I[Computation as a neural network. 8]
  A --> J[VQA model capabilities. 9]
  A --> K[Identifying red objects in images. 10]
  F --> L[Focusing on an image's relevant parts. 11]
  L --> M[Mapping image to red objects. 12]
  M --> N[Attention from circles to objects above. 13]
  N --> O[Combining multiple concepts for answers. 14]
  B --> P[Parsing question structure for networks. 15]
  P --> Q[Network fragments building specific networks. 16]
  Q --> R[Custom networks for each question. 17]
  R --> S[Processing image with built network. 18]
  S --> T[Generating answer from network's output. 19]
  I --> U[Structured neural vision model research. 20]
  H --> V[Research on mapping natural language. 21]
  I --> W[Encoding computation as neural network. 22]
  J --> X[VQA models' expected capabilities. 23]
  X --> Y[Associating words with visual concepts. 24]
  Y --> Z[Using words like above to modify attention. 25]
  class A vision
  class B,C,D,E,F,G,H,I,J,K networks
  class L,M,N,O attention
  class P,Q,R,S,T parsing
  class U,V,W,X,Y,Z vqa


1.- Visual question answering: Answering questions based on an image input.

2.- Neural Module Networks: Dynamically constructing neural networks based on the syntactic structure of the question.
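The core idea of point 2 can be sketched as composing a question-specific function from a fixed inventory of fragments. The stand-in modules and layout names below are illustrative NumPy toys, not the paper's trained, parameterized modules:

```python
import numpy as np

# Illustrative stand-ins for trained module fragments (hypothetical names).
MODULES = {
    "attend_red": lambda img: (img == "red"),      # image -> attention map
    "count":      lambda att: str(int(att.sum())), # attention -> answer
}

def build_network(layout):
    """Compose the named modules, in order, into a single callable."""
    def network(image):
        out = image
        for name in layout:
            out = MODULES[name](out)
        return out
    return network

# "How many red things are there?" -> layout ["attend_red", "count"]
image = np.array([["red", "blue"], ["red", "green"]])
net = build_network(["attend_red", "count"])
answer = net(image)    # "2"
```

In the paper the layout comes from the question's parse and the modules share parameters across questions; here the layout is hard-coded to keep the sketch self-contained.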

3.- Abstract scenes dataset: Colored shapes used for pedagogical examples.

4.- Question-specific neural networks: Networks built on the fly from modules based on the question's syntactic analysis.

5.- Applying dynamic networks: Using the constructed network to process the input image and produce an answer.

6.- Structured neural models for vision: Related work on incorporating structure into neural architectures for computer vision.

7.- Semantic parsing: Related work in natural language processing.

8.- Neural representation of question-specific computation: Representing the dynamically constructed computation as a neural network.

9.- Capabilities of VQA models: Expectations for what visual question answering models should be able to do.

10.- Understanding "red": Identifying red objects in an image.

11.- Visual attention mechanism: Focusing on relevant parts of the image, used in vision and language models.
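The attention mechanism in point 11 can be sketched as scoring each spatial cell of a feature map against a query and taking a softmax-weighted average. Shapes and names below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def spatial_attention(features, query):
    """features: (H, W, C) map; query: (C,) vector. Returns summary, weights."""
    scores = features @ query                  # (H, W) relevance scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    # Attention-weighted average of the feature vectors.
    summary = (features * weights[..., None]).sum(axis=(0, 1))
    return summary, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 3, 4))
q = rng.normal(size=4)
summary, w = spatial_attention(feats, q)       # w sums to 1 over the grid
```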

12.- "Red" as a function: Mapping an image to an attention map highlighting red objects.

13.- Understanding "above": Transforming attention from one object (circles) to another (objects above circles).

14.- Complex questions: Combining multiple concepts (e.g., "red shape above a circle") to answer a question.
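Points 12-14 chain together: "red" produces an attention map, "above" transforms one, and the intersection answers the compound question. A minimal sketch on a toy label grid (the functions are illustrative stand-ins for the paper's trained modules):

```python
import numpy as np

# Toy scene: 0 = empty, 1 = red shape, 2 = circle (labels are illustrative).
grid = np.array([[1, 0],
                 [2, 0]])

def attend(grid, label):
    # "red"/"circle" as functions: grid -> attention map over matching cells.
    return (grid == label).astype(float)

def reattend_above(attention):
    # "above": shift attention one row up, onto cells above attended cells.
    shifted = np.zeros_like(attention)
    shifted[:-1, :] = attention[1:, :]
    return shifted

def exists(attention):
    return "yes" if attention.any() else "no"

# "Is there a red shape above a circle?"
above_circles = reattend_above(attend(grid, 2))
answer = exists(attend(grid, 1) * above_circles)   # intersect attention maps
```

Here the red shape at row 0 sits directly above the circle at row 1, so the intersected attention is non-empty and `answer` is "yes".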

15.- Syntactic analysis: Parsing the structure of the question to guide network construction.

16.- Modules: Small network fragments used to build the question-specific neural network.

17.- Dynamically constructing networks: Building a custom network for each question based on its syntactic structure.
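Points 15-17 amount to turning a parse of the question into a module layout. A minimal assembly step, assuming a hypothetical nested-tuple parse and a fixed module inventory (both invented for illustration, not the paper's parser output):

```python
# Hypothetical parse of "is there a red shape above a circle?"
parse = ("exists", ("intersect", "red", ("above", "circle")))

MODULES = {"exists", "intersect", "red", "above", "circle"}

def assemble(node):
    """Recursively turn a parse node into a nested call plan (a layout)."""
    if isinstance(node, str):
        assert node in MODULES, f"unknown module: {node}"
        return {"module": node, "args": []}
    head, *children = node
    assert head in MODULES, f"unknown module: {head}"
    return {"module": head, "args": [assemble(c) for c in children]}

layout = assemble(parse)
# layout["module"] == "exists"; its single argument is the "intersect" subtree,
# which in turn holds the "red" leaf and the "above"("circle") subtree.
```

Each question thus yields its own tree of module instances, which is what makes the constructed network question-specific.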

18.- Applying constructed networks to images: Using the dynamically built network to process the input image.

19.- Producing answers: Generating an answer to the question based on the network's output.

20.- Related work in structured neural models: Other research on incorporating structure into neural networks for vision tasks.

21.- Related work in semantic parsing: Other research on mapping natural language to executable representations.

22.- Neural representation of constructed computation: Encoding the dynamically built question-specific computation as a neural network.

23.- Expectations for VQA models: Capabilities that visual question answering models should possess.

24.- Mapping words to visual concepts: Associating words like "red" with their corresponding visual representations.

25.- Transforming attention: Using words like "above" to modify attention from one object to another in the image.

Knowledge Vault built by David Vivancos 2024