Knowledge Vault 5/87 - CVPR 2023
Visual Programming: Compositional visual reasoning without training
Tanmay Gupta, Aniruddha Kembhavi

Concept Graph & Summary using Claude 3 Opus | Chat GPT4o | Llama 3:

graph LR
classDef visual fill:#f9d4d4, font-weight:bold, font-size:14px
classDef reasoning fill:#d4f9d4, font-weight:bold, font-size:14px
classDef models fill:#d4d4f9, font-weight:bold, font-size:14px
classDef program fill:#f9f9d4, font-weight:bold, font-size:14px
classDef visprog fill:#f9d4f9, font-weight:bold, font-size:14px
A[Visual Programming: Compositional visual reasoning without training] --> B[Language-based computer vision system design 1]
A --> C[Combining skills for complex tasks 2]
A --> D[User-driven models limited in scope 3]
D --> E[Increasing tasks demand more capabilities 4]
A --> F[Programs generated from task descriptions 5]
F --> G[Leveraging specialized models as components 6]
F --> H[Modifying programs adapts to tasks 7]
F --> I[Automated program creation from descriptions 8]
A --> J[Visprog: GPT-3 generates Python code 9]
J --> K[Visprog modules: vision models, routines 10]
J --> L[Programs are interpretable and debuggable 11]
J --> M[Models need intent and module understanding 12]
M --> N[Examples enable practical program generation 13]
J --> O[Visual rationale links execution steps 14]
O --> P[Users interpret, debug, intervene visually 15]
class B,C visual
class D,E models
class F,G,H,I program
class J,K,L,M,N,O,P visprog


1.- Visual Programming: A new paradigm for designing computer vision systems using language descriptions to generate programs that solve visual tasks.

2.- Compositional Visual Reasoning: Combining multiple computer vision skills to perform complex tasks, like tagging characters in an image.

3.- End-to-End Models: Large neural networks that consume user inputs and task descriptions to directly output results, but are limited in scope.

4.- Increasing Task Complexity: As users become more creative with visual tasks, end-to-end models will require more skills and parameters.

5.- Program Generation: An alternative approach where a program generator creates a computer program based on the task description.

6.- Specialized Computer Vision Models: Visual Programming leverages existing, specialized models as building blocks for generated programs.

7.- Modifying Programs: Users can modify generated programs to adapt to new tasks, invoking different sets of skills.

8.- Automatic Program Generation: The goal is to automatically generate programs using task descriptions provided by users.

9.- Visprog: A specific implementation of Visual Programming using GPT-3 and in-context learning to generate Python programs.

10.- Visprog Modules: The heart of Visprog, consisting of various computer vision models, image processing routines, and arithmetic/logical operations.

11.- Program Interpretation and Debugging: Visprog programs are easy to interpret and debug, with each step invoking a Visprog module.

12.- Language Model Limitations: A large language model on its own cannot generate useful programs unless it is told the user's intent and which modules are available.

13.- In-Context Examples: Providing examples of tasks, accepted programs, and available modules enables language models to generate useful programs.

14.- Visual Rationale: Visprog provides a visual rationale by stitching together the input and output of each execution step.

15.- Interpreting and Intervening: Users can interpret, debug, diagnose, and intervene within the visual reasoning process using the visual rationale.
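
Points 9 to 15 can be illustrated with a minimal Python sketch. It is not the Visprog implementation: the module names (DETECT, COUNT, GREATER), the one-step-per-line program format, and the helper functions build_prompt and run_program are all hypothetical, and the "image" is just a list of labels so that the example runs without any vision model. The sketch shows how a prompt could pair instructions with accepted programs, how a program could be executed one module call at a time, and how recording each step's inputs and output yields a trace that plays the role of a rationale.

```python
import ast
import re
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical toy modules. Real modules would wrap vision models and image
# routines; here the "image" is simply a list of object labels.
def detect(image: List[str], label: str) -> List[str]:
    return [obj for obj in image if obj == label]

def count(boxes: List[str]) -> int:
    return len(boxes)

def greater(left: int, right: int) -> bool:
    return left > right

MODULES: Dict[str, Callable[..., Any]] = {
    "DETECT": detect,
    "COUNT": count,
    "GREATER": greater,
}

# Assumed step format for this sketch: OUT=MODULE(arg=value,arg=value)
STEP = re.compile(r"^(\w+)=(\w+)\((.*)\)$")

def build_prompt(examples: List[Tuple[str, str]], instruction: str) -> str:
    """List the modules, then (instruction, program) pairs, then the new task."""
    parts = ["Modules: " + ", ".join(sorted(MODULES))]
    for ex_instruction, ex_program in examples:
        parts.append(f"Instruction: {ex_instruction}\nProgram:\n{ex_program}")
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)

def run_program(program: str, inputs: Dict[str, Any]):
    """Execute one module call per line and record every step in a trace."""
    state = dict(inputs)
    trace = []
    for raw in program.strip().splitlines():
        line = raw.strip()
        match = STEP.match(line)
        if match is None:
            raise ValueError(f"cannot parse step: {line!r}")
        out_name, module_name, arg_text = match.groups()
        kwargs = {}
        for pair in filter(None, (p.strip() for p in arg_text.split(","))):
            key, value = (part.strip() for part in pair.split("=", 1))
            # A value is either an earlier variable or a Python literal.
            kwargs[key] = state[value] if value in state else ast.literal_eval(value)
        state[out_name] = MODULES[module_name](**kwargs)
        trace.append({"step": line, "inputs": kwargs, "output": state[out_name]})
    return state, trace

EXAMPLE_PROGRAM = """
CATS=DETECT(image=IMAGE,label='cat')
DOGS=DETECT(image=IMAGE,label='dog')
N_CATS=COUNT(boxes=CATS)
N_DOGS=COUNT(boxes=DOGS)
ANSWER=GREATER(left=N_CATS,right=N_DOGS)
"""

if __name__ == "__main__":
    prompt = build_prompt(
        [("Are there more cats than dogs?", EXAMPLE_PROGRAM.strip())],
        "Are there more dogs than cats?",
    )
    print(prompt)  # this text would be sent to the language model
    state, trace = run_program(EXAMPLE_PROGRAM, {"IMAGE": ["cat", "dog", "cat"]})
    for record in trace:
        print(record["step"], "->", record["output"])
```

Because every step is an explicit module call with named inputs and outputs, a wrong answer can be traced to the step that produced it, and a user can edit a single line of the program and run it again.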

Knowledge Vault built by David Vivancos 2024