Knowledge Vault 5/71 - CVPR 2022
Learning to see the human way
Josh Tenenbaum
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef human fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef machine fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef vision fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef intelligence fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
  A[Learning to see the human way] --> B[Human vision models physical world, beyond patterns 1]
  A --> C[Early vision labeled images, limited approach 2]
  A --> D[True vision sees occluded, invisible via physics 3]
  A --> E[Intelligence models world, not just data 4]
  A --> F[Cognitive AI: intelligence as model-building 5]
  A --> G[Humans infer invisible objects, properties via reasoning 6]
  A --> H[AI pattern recognition captures partial intelligence 7]
  A --> I[Seeing human way: labeling vs world understanding 8]
  A --> J[Core infant knowledge: physics, psychology, reasoning 9]
  J --> K[Intuitive physics: object permanence, solidity, causes 10]
  J --> L[Mental simulation may underlie scene understanding 11]
  A --> M[Cognitive AI integrates probabilistic, symbolic, neural 12]
  M --> N[Inverse graphics infers 3D from images 13]
  M --> O[Ideal vision sees independently movable objects 14]
  M --> P[Neural architectures with physics, objects, causality 15]
  M --> Q[Flexible scene representations overcome recognition limits 16]
  M --> R[Probabilistic programs express physics-based vision models 17]
  A --> S[Challenges test physics, reasoning, transfer 18-19]
  S --> T[Progress via differentiable rendering, probabilistic programming, cognition 20]
  A --> U[Humans parse 3D, dynamics, affordances in novel scenes 21]
  A --> V[Goal: vision learns generative models from few observations 22]
  A --> W[Biology likely uses physics, simulation, not just recognition 23]
  A --> X[Probability crucial for uncertainty, information-seeking 24]
  A --> Y[Task performance ≠ human-level understanding 25]
  A --> Z[Exciting progress in physically-grounded, human-like AI 26]
  Z --> AA[Open challenges in flexible knowledge transfer 27]
  Z --> AB[Progress requires ML, vision, cognition, biology 28]
  Z --> AC[Aim for rapid learning, flexible transfer, not incrementalism 29]
  Z --> AD[Precise challenges push boundaries of visual understanding 30]
  class B,D,I,U,W,Y human
  class C,H,S machine
  class E,F,G,J,K,L,M,N,O,P,Q,R,T,V,X vision
  class Z,AA,AB,AC,AD future

Resume:

1.- Learning to see the human way involves modeling the physical world, not just finding patterns in images.

2.- Early computer vision focused on labeling images with what humans can name in them, an approach that is coherent and practical but limited.

3.- True human vision involves "seeing" things that are occluded or invisible by leveraging knowledge of physics and objecthood.

4.- Human intelligence is about modeling the world, not just data, to explain observations, imagine possibilities, and achieve goals through planning.

5.- Cognitive AI views intelligence as model-building, with learning as the construction of new models based on interactions with the world.

6.- Humans can infer invisible objects and properties in scenes by reasoning about physics, objecthood, and causality.

7.- AI has focused more on pattern recognition and function approximation, capturing only part of what constitutes intelligence.

8.- Two views of "seeing the human way": 1) labeling images like humans do, and 2) making sense of the world from visual input.

9.- Core knowledge in humans, present early in infancy, includes intuitive physics, psychology, and other domains for reasoning about the world.

10.- Intuitive physics allows infants and adults to understand object permanence, solidity, support, stability, and causal interactions from visual observations.

11.- Mental simulation, akin to a "game engine in your head", may underlie human physical scene understanding and interaction planning.
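
A minimal sketch of what such a simulation-based physical judgment could look like (a hypothetical toy example in Python; the function names and the center-of-mass stability rule are assumptions, not the model presented in the talk). Many noisy simulations of a block tower are run, and the fraction that topple gives a graded stability judgment:

import random

# Toy "mental physics engine" (hypothetical example): estimate the probability
# that a stack of blocks topples by running many noisy simulations of the tower.

def tower_falls(centers, width):
    # The tower falls if, at some interface, the center of mass of the blocks
    # above is not supported by the block directly below.
    for i in range(1, len(centers)):
        com_above = sum(centers[i:]) / len(centers[i:])
        if abs(com_above - centers[i - 1]) > width / 2:
            return True
    return False

def p_fall(centers, width=1.0, noise=0.1, n_sims=1000):
    # Monte Carlo over perceptual noise in the perceived block positions.
    falls = sum(
        tower_falls([c + random.gauss(0.0, noise) for c in centers], width)
        for _ in range(n_sims)
    )
    return falls / n_sims

print(p_fall([0.0, 0.3, 0.55]))  # graded judgment for a slightly offset tower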

12.- Cognitive AI aims to combine the strengths of probabilistic, symbolic, and neural approaches, integrated via techniques like probabilistic programming.

13.- Inverse graphics infers 3D scene structure from images, a foundation for human-like vision; recent progress comes from differentiable rendering.
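
To illustrate the analysis-by-synthesis idea behind inverse graphics, here is a toy sketch (using PyTorch, with an assumed soft-disc renderer; not any specific system from the talk) in which scene parameters are recovered by gradient descent through a differentiable renderer:

import torch

# Toy inverse graphics (illustrative only): recover the position and radius of
# a disc from an image by gradient descent through a soft, differentiable renderer.

H = W = 64
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")

def render(cx, cy, r, sharpness=0.5):
    # Soft disc: intensity falls off smoothly at the boundary, so the rendering
    # is differentiable with respect to the scene parameters.
    dist = torch.sqrt((xs - cx) ** 2 + (ys - cy) ** 2 + 1e-6)
    return torch.sigmoid(sharpness * (r - dist))

observed = render(torch.tensor(40.0), torch.tensor(25.0), torch.tensor(10.0))

params = torch.tensor([30.0, 30.0, 8.0], requires_grad=True)  # initial guess
opt = torch.optim.Adam([params], lr=0.3)
for _ in range(500):
    cx, cy, r = params
    loss = ((render(cx, cy, r) - observed) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(params.detach())  # should move toward (40, 25, 10)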

14.- Ideal vision systems see "independently movable objects" to support reasoning about scene dynamics and affordances for action.

15.- Neural architectures incorporating inductive biases about physics, objecthood, and causality show promise for human-like visual scene understanding.

16.- Flexible, object-centric scene representations that combine logic, probability, and neural networks can overcome limitations of conventional recognition pipelines.

17.- Probabilistic programs can express rich generative models for physics-based vision, with programmable inference to adaptively solve scene understanding tasks.
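
A hedged, minimal sketch of what a probabilistic program for vision might look like (a hypothetical toy with an invented 1D renderer and invented names; not an actual system from the talk): a generative model samples a hidden object position, renders an intensity profile, and generic inference, here likelihood weighting, inverts the model given a noisy observation:

import math, random

WIDTH = 20        # pixels in the toy 1D image
PIX_NOISE = 0.2   # assumed per-pixel Gaussian observation noise

def render(pos, obj_width=3):
    # Deterministic toy renderer: bright where the object covers a pixel.
    return [1.0 if abs(x - pos) <= obj_width / 2 else 0.0 for x in range(WIDTH)]

def log_likelihood(image, observed):
    return sum(-(o - i) ** 2 / (2 * PIX_NOISE ** 2) for i, o in zip(image, observed))

def infer_position(observed, n_particles=2000):
    # Likelihood weighting: sample positions from a uniform prior and weight
    # each sample by how well its rendering explains the observation.
    positions = [random.uniform(0, WIDTH) for _ in range(n_particles)]
    log_w = [log_likelihood(render(p), observed) for p in positions]
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    return sum(p * w for p, w in zip(positions, weights)) / sum(weights)

observed = [v + random.gauss(0, PIX_NOISE) for v in render(12.0)]
print(infer_position(observed))  # posterior mean, should be near 12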

18.- The "Bottle Cap Challenge" tests whether vision systems can segment and model novel objects with partial observability by leveraging physics understanding.

19.- The "General Game Inverse Graphics Challenge" tests transfer of visual understanding to novel virtual worlds with different appearance and physics.

20.- Progress on these challenges may come from integrating differentiable rendering, probabilistic programming, and insights from studies of human cognition.

21.- Humans effortlessly parse 3D structure, dynamics, and affordances in novel scenes and transfer knowledge to new environments in near-zero-shot ways.

22.- A key goal is vision systems that rapidly learn generative models to infer occluded objects/properties by combining physics and few observations.

23.- Biological vision likely relies on physics-based representations and simulation, not just pattern recognition, implicating areas beyond the ventral stream.

24.- Probabilistic approaches are crucial for quantifying uncertainty in vision to drive information-seeking behaviors when world models are inapplicable.
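
One way to make this concrete (an illustrative sketch with made-up numbers, not a method from the talk) is to choose the next observation that maximizes expected information gain, i.e. the expected reduction in posterior entropy:

import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

posterior = [0.5, 0.3, 0.2]   # belief over three hypotheses h0, h1, h2

# likelihoods[view][outcome][hypothesis] = P(outcome | hypothesis, view)
likelihoods = {
    "left":  [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]],   # separates h0 from h1, blind to h2
    "right": [[0.8, 0.7, 0.1], [0.2, 0.3, 0.9]],   # mostly isolates h2
}

def expected_info_gain(view):
    # Current entropy minus the expected posterior entropy after the observation.
    gain = entropy(posterior)
    for outcome_lik in likelihoods[view]:
        joint = [l * p for l, p in zip(outcome_lik, posterior)]
        p_outcome = sum(joint)
        if p_outcome > 0:
            gain -= p_outcome * entropy([j / p_outcome for j in joint])
    return gain

best = max(likelihoods, key=expected_info_gain)
print(best, {v: round(expected_info_gain(v), 3) for v in likelihoods})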

25.- Improved performance on tasks is not equivalent to human-level understanding; we must distinguish small steps from reaching key human abilities.

26.- Exciting progress is being made in physically-grounded AI systems that learn generative world models in a more human-like way.

27.- However, major open challenges remain in flexibly transferring knowledge to novel environments with different appearance and physical dynamics.

28.- Success on these challenges will require combining tools from modern ML, classic vision, cognitive science, and studies of biological intelligence.

29.- The field should aim to create systems that learn rapidly and transfer flexibly, not just achieve incremental gains on narrow tasks.

30.- Key to progress is precise formulation of challenges that push the boundaries of artificial visual understanding towards more human-like capabilities.

Knowledge Vault built by David Vivancos 2024