Knowledge Vault 5/18 - CVPR 2016
Stacked Attention Networks for Image Question Answering
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
classDef IQA fill:#f9d4d4,font-weight:bold,font-size:14px
classDef applications fill:#d4f9d4,font-weight:bold,font-size:14px
classDef challenges fill:#d4d4f9,font-weight:bold,font-size:14px
classDef attention fill:#f9f9d4,font-weight:bold,font-size:14px
classDef results fill:#f9d4f9,font-weight:bold,font-size:14px
A[Stacked Attention Networks for Image Question Answering] --> B[IQA: answer questions from images. 1]
B --> C[Applications: aid visually impaired comprehension. 2]
B --> D[Challenges: understand relationships, focus regions. 3]
B --> E[Reasoning: narrow focus to infer answer. 4]
B --> F[SAN: encode question, image, attention, predict. 5]
E --> G[First attention: correlate question with regions. 8]
G --> H[Second attention: focus further, suppress noise. 11]
E --> I[Aggregate weighted features: sum based on attention. 9]
E --> J[Multimodal pooling: combine image and text features. 10]
F --> K[Image encoding: capture spatial features with VGG. 6]
F --> L[Question encoding: LSTM or CNN capture structure. 7]
B --> M[Benchmarks: VQA, COCO-QA, DAQUAR. 13]
M --> N[VQA results: improvement on what/color questions. 14]
B --> O[Impact: two attention layers outperform one. 15]
O --> P[LSTM, CNN perform similarly. 16]
O --> Q[Qualitative: focus on relevant, ignore irrelevant. 17]
O --> R[Error analysis: correct region, wrong answer. 18]
R --> S[Error types: ambiguous answers, label errors. 19]
S --> T[Examples: object confusion, label mistakes. 20]
B --> U[Interest: increased conference papers. 21]
U --> V[Comparison: IQA needs detailed, focused reasoning. 22]
U --> W[SAN motivation: enable progressive, multi-level reasoning. 23]
U --> X[Visual grounding: clearer reasoning grounding. 24]
U --> Y[Shared code: available on GitHub. 25]
class A,B IQA
class C applications
class D challenges
class E,F,G,H,I,J attention
class K,L results
class M,N,O,P,Q,R,S,T results
class U,V,W,X,Y results


1.- Image Question Answering (IQA): Answering natural language questions based on an image's content.

2.- IQA applications: Helping visually impaired people understand their surroundings.

3.- IQA challenges: Requires understanding relationships between objects and focusing on relevant regions.

4.- Multi-step reasoning: Progressively narrowing focus to infer the answer.

5.- Stacked Attention Network (SAN) model: 4 steps - encode question, encode image, multi-level attention, predict answer.

6.- Image encoding: Using last convolutional layer of VGG network to capture spatial features.
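
The paper resizes each image to 448x448 and takes the last pooling layer of VGG, giving a 512-channel 14x14 feature map, i.e. 196 spatial regions of 512-d each. A minimal NumPy sketch of that reshaping step (the VGG forward pass itself is stubbed with random features):

```python
import numpy as np

# Stand-in for the VGG feature map of a 448x448 image:
# 512 channels over a 14x14 spatial grid, as in the paper.
feature_map = np.random.rand(512, 14, 14)

def to_region_features(fmap):
    """Flatten a C x H x W conv feature map into one column vector per region."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w)  # shape (512, 196): 196 regions

v_I = to_region_features(feature_map)
print(v_I.shape)  # (512, 196)
```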

7.- Question encoding: Using LSTM or CNN to capture semantic and syntactic structure.
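
For the CNN variant, filters of widths 1-3 slide over the word embeddings, followed by a nonlinearity and max-pooling over time. A minimal NumPy sketch with toy dimensions (the sizes below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_question_encoder(embeddings, filters):
    """embeddings: (T, d) word vectors; filters: dict width -> (k, width*d) matrix.
    Returns the concatenated max-pooled feature maps of all filter widths."""
    T, d = embeddings.shape
    pooled = []
    for width, W in sorted(filters.items()):
        # Slide a window of `width` words, project, tanh, then max over time.
        windows = np.stack([embeddings[t:t + width].reshape(-1)
                            for t in range(T - width + 1)])  # (T-w+1, w*d)
        fmap = np.tanh(windows @ W.T)                        # (T-w+1, k)
        pooled.append(fmap.max(axis=0))                      # (k,)
    return np.concatenate(pooled)

T, d, k = 6, 8, 4                    # toy sizes (illustrative)
q = rng.standard_normal((T, d))      # embedded question words
filters = {w: rng.standard_normal((k, w * d)) for w in (1, 2, 3)}
v_Q = cnn_question_encoder(q, filters)
print(v_Q.shape)  # (12,) = unigram + bigram + trigram features
```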

8.- First attention layer: Computes correlation between question entities and image regions.

9.- Aggregate weighted image features: Sums image features based on attention.

10.- Multimodal pooling: Combines pruned image and text features.

11.- Second attention layer: Further narrows focus to answer-relevant regions and suppresses noise.
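
Concepts 8-11 describe one attention hop, which SAN applies twice. A minimal NumPy sketch with toy dimensions (biases omitted; variable names follow the paper's v_I, v_Q, u):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_layer(v_I, query, W_I, W_Q, w_P):
    """One SAN attention hop. v_I: (d, m) region features; query: (d,)."""
    h_A = np.tanh(W_I @ v_I + (W_Q @ query)[:, None])  # (k, m) correlation
    p = softmax(w_P @ h_A)                             # (m,) attention over regions
    v_tilde = v_I @ p                                  # weighted sum of region features
    u = v_tilde + query                                # refined query for the next hop
    return u, p

d, m, k = 8, 5, 6                        # toy dimensions (illustrative)
v_I = rng.standard_normal((d, m))        # m image regions
v_Q = rng.standard_normal(d)             # encoded question vector
params = [(rng.standard_normal((k, d)), rng.standard_normal((k, d)),
           rng.standard_normal(k)) for _ in range(2)]

u = v_Q
for W_I, W_Q, w_P in params:             # two stacked attention hops
    u, p = attention_layer(v_I, u, W_I, W_Q, w_P)
print(p.shape)  # (5,); the attention weights sum to 1
```

The second hop reuses the refined vector u as its query, which is what lets the model sharpen its focus and suppress regions that the first hop attended to spuriously.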

12.- Answer prediction: Treated as a 400-way classification problem over the multimodal features.
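
A minimal sketch of that final classification step, assuming a single linear layer plus softmax over the candidate answers (toy feature size, random weights):

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_answer(u, W_u, b_u):
    """Softmax distribution over the answer vocabulary from the refined vector u."""
    logits = W_u @ u + b_u
    e = np.exp(logits - logits.max())
    return e / e.sum()

num_answers, d = 400, 8                  # 400-way classification; toy feature size
u = rng.standard_normal(d)               # final multimodal vector from the attention stack
W_u = rng.standard_normal((num_answers, d))
b_u = rng.standard_normal(num_answers)
p_ans = predict_answer(u, W_u, b_u)
print(p_ans.shape)  # (400,); a probability distribution over candidate answers
```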

13.- Benchmarks: Evaluated on the Visual Question Answering (VQA), COCO-QA, and DAQUAR datasets.

14.- VQA results: Major improvement over baselines, especially for "what is/color" type questions.

15.- Impact of attention layers: Using 2 layers of attention significantly outperforms 1 layer.

16.- LSTM vs CNN for question encoding: Perform similarly.

17.- Qualitative examples: Model learns to focus on relevant regions and ignore irrelevant ones.

18.- Error analysis: In 78% of cases the model attends to the correct region; in 42% it still predicts the wrong answer.

19.- Error types: Ambiguous answers, label errors.

20.- Example errors: Confusion between similar objects, ground truth label mistakes.

21.- Increased interest in IQA: Many related papers at the conference.

22.- Comparison to captioning: IQA requires understanding subtle details and focused reasoning.

23.- SAN motivation: Provide capacity for progressive, multi-level reasoning.

24.- Visual grounding: SAN enables clearer grounding of reasoning in the image.

25.- Shared code: Available on GitHub.

Knowledge Vault built by David Vivancos 2024