Knowledge Vault 2/66 - ICLR 2014-2023
Devi Parikh ICLR 2020 - Invited Speaker - AI Systems That Can See And Talk

Concept Graph & Summary using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef vision fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef vqa fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef models fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef bias fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef datasets fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef challenges fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef thoughts fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Devi Parikh ICLR 2020] --> B[Vision-language: interesting, important area 1]
A --> C[VQA: broad, open-ended, evaluable 2]
C --> D[Models describe images, videos, converse 3]
C --> E[Models biased, fail grounding 4]
C --> F[Neural Baby Talk grounds captions 5]
C --> G[VQA-CP tests language prior reliance 6]
G --> H[Separate vision, language reduces bias 7]
C --> I[Difficult VQA needs OCR 8]
I --> J[TextVQA focuses on text questions 9]
A --> K[Most work trains task-specific models 10]
K --> L[Vision-language should learn generic representations 11]
L --> M[ViLBERT learns generic representations, fine-tunes 12]
M --> N[Multitask ViLBERT outperforms, shares representations 13]
M --> O[Demo: one model, 8 tasks 14]
A --> P[Challenges: diverse data, external knowledge 15]
P --> Q[Better downstream evaluation with humans 16]
P --> R[Non-English, bias mitigation needed 17]
A --> S[Vision-language: exciting ground for research 18]
S --> T[Current capabilities impressive but brittle 19]
A --> U[Additional thoughts: time management, AI creativity, climate, women in AI, philosophy 20]
class A,B,S,T vision;
class C,D,E,F,G,H,I,J vqa;
class K,L,M,N,O models;
class E,G,H,Q,R bias;
class J,P,Q datasets;
class P,Q,R challenges;
class U thoughts;


1.-Vision and language is an interesting and important area to study for applications, conceptual reasons, and technical challenges.

2.-Visual question answering (VQA) is a broad, open-ended vision-language task that is quantitatively evaluable and showcases current model capabilities.

3.-Models can now describe images and videos and hold back-and-forth conversations about image content in an impressive fashion.

4.-Though impressive, vision-language models can still be biased towards language priors and fail to ground their answers sufficiently in the image.

5.-Neural Baby Talk generates captions grounded in object detector outputs, allowing description of novel scenes and robust captioning.

6.-VQA-CP tests models' reliance on language priors by having different answer distributions for question types between train and test.

7.-Separating vision and language reasoning into distinct modules that combine later helps reduce language bias in VQA.

8.-Difficult VQA questions often involve reading text in images, but top models fail at this due to lack of OCR integration.

9.-The TextVQA dataset and challenge focuses on questions that require reading and reasoning about text in images.

10.-Most vision-language work trains separate task-specific models on different datasets, learning non-generic representations.

11.-Vision and language should aim to learn generic representations that enable solving multiple tasks with one model.

12.-ViLBERT learns generic vision-language representations through pre-training that can be fine-tuned for various downstream tasks.

13.-A multitask ViLBERT model with task-specific heads outperforms specialist models and benefits from shared representations.

14.-A live demo shows a single model performing 8 vision-language tasks: VQA, referring expressions, entailment, retrieval and more.

15.-Open challenges remain in using diverse visual data beyond COCO and incorporating external knowledge for VQA.

16.-Vision-language models need better evaluation on downstream tasks with humans in the loop, not just static benchmarks.

17.-More work is needed on non-English languages and on identifying and mitigating biases in vision-language datasets and models.

18.-Vision-language is an exciting, fertile ground for research on tasks, datasets, evaluation, applications, biases, and human-AI interaction.

19.-Current vision-language capabilities are impressive but still easy to break; much more work remains to be done.

20.-The speaker has additional thoughts on time management, AI creativity, climate change, experiences as a woman in AI, and philosophy.
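
Points 4 and 6 can be made concrete with a small sketch. The code below is not the VQA-CP construction itself, only an illustration of the idea behind it under stated assumptions: the toy (question type, answer) pairs are made up, and the helper names `answer_priors` and `prior_only_accuracy` are hypothetical. It builds a "blind" baseline that never looks at the image and always predicts the most frequent training answer for a question type, then scores it on a test split with the same answer priors and on one where the priors have changed.

```python
from collections import Counter, defaultdict


def answer_priors(examples):
    """Map each question type to a Counter of the answers seen for it."""
    priors = defaultdict(Counter)
    for question_type, answer in examples:
        priors[question_type][answer] += 1
    return priors


def prior_only_accuracy(train, test):
    """Accuracy of a baseline that ignores the image and always predicts
    the most frequent training answer for the question type."""
    priors = answer_priors(train)
    correct = 0
    for question_type, answer in test:
        counts = priors.get(question_type)
        guess = counts.most_common(1)[0][0] if counts else None
        correct += guess == answer
    return correct / len(test)


# Toy data, invented for illustration: (question_type, answer) pairs.
train = (
    [("what color", "white")] * 8 + [("what color", "red")] * 2
    + [("is there", "yes")] * 7 + [("is there", "no")] * 3
)
# Test split whose per-type answer distribution matches training.
test_same_priors = (
    [("what color", "white")] * 4 + [("what color", "red")] * 1
    + [("is there", "yes")] * 4 + [("is there", "no")] * 1
)
# Test split whose per-type answer distribution is flipped.
test_changed_priors = (
    [("what color", "white")] * 1 + [("what color", "red")] * 4
    + [("is there", "yes")] * 1 + [("is there", "no")] * 4
)

acc_same = prior_only_accuracy(train, test_same_priors)
acc_changed = prior_only_accuracy(train, test_changed_priors)
print(f"same priors: {acc_same:.1f}, changed priors: {acc_changed:.1f}")
```

On this toy data the blind baseline looks strong when the test priors match training and collapses when they change, which is the kind of reliance on language priors that a changing-priors split is meant to expose.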

Knowledge Vault built by David Vivancos 2024