Knowledge Vault 5/68 - CVPR 2021
Learning to see like humans
Matthias Bethge
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef turing fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef brain fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef learning fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef imagenet fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef generalization fill:#f9d4f9, font-weight:bold, font-size:14px
  classDef experimental fill:#d4f9f9, font-weight:bold, font-size:14px
  classDef generative fill:#f9d4d4, font-weight:bold, font-size:14px
  A[Learning to see like humans] --> B[Turing: machines mimic mind, memory, learning. 1]
  A --> C[Brain: complex data, good decisions, vision. 2]
  A --> D[Supervised learning: task, decisions, train machines. 3]
  A --> E[ImageNet: image classification, 1000 categories. 4]
  E --> F[ImageNet performance vs. brain-like vision. 5]
  A --> G[Generalization: key to intelligence, task changes. 6]
  G --> H[Transfer learning: reuse features, other tasks. 7]
  H --> I[Transfer success: saliency, pose, tracking. 8]
  G --> J[Limitations: adversarial examples, texture bias. 9]
  J --> K[CNNs: texture, humans: shape. 10]
  J --> L[Out-of-domain noise: CNNs vs. shape models. 11]
  J --> M[Texture bias removal: augmentation, robustness. 12]
  G --> N[Out-of-domain accuracy correlates human-like decisions. 13]
  G --> O[Counterfactual testing: smallest input changes, decisions. 14]
  O --> P[Generative modeling enables on MNIST. 15]
  G --> Q[Controversial stimuli: model disagreements, human alignment. 16]
  Q --> R[ABS: best alignment, ambiguous digits. 17]
  A --> S[Scaling generative models: objects, scenes complexity. 18]
  S --> T[Compositional scene model: background, objects, segmentation. 19]
  T --> U[Latent representation: properties, recombination, intervention. 20]
  A --> V[Invariance manifolds: information preserved, discarded. 21]
  V --> W[Invertible networks: metameric images, nuisance information. 22]
  W --> X[CNNs: metamers perceived as nuisance, misaligned. 23]
  W --> Y[Training: nuisance invariant to class, consistency. 24]
  V --> Z[Shaping invariances: human-like decisions. 25]
  A --> AA[Data, inductive biases: constrain human-like rules. 26]
  AA --> AB[Object-centric generative models, compositionality, training data. 27]
  A --> AC[Human-like decisions, not just benchmark performance. 28]
  A --> AD[Out-of-domain, counterfactual testing: assess, improve consistency. 29]
  A --> AE[Generative models: robust, generalizable, human-aligned vision. 30]
  class B turing
  class C brain
  class D,H,I learning
  class E,F imagenet
  class G,J,K,L,M,N,O generalization
  class P,Q,R,V,W,X,Y,Z experimental
  class S,T,U,AA,AB,AE generative

Resume:

1.- Alan Turing's ideas on using machines to mimic the human mind, including the Turing test, universality of machines, memory requirements, and machine learning.

2.- Brains as decision-making devices that receive complex data and utilize it to make good decisions, with a focus on vision.

3.- Supervised learning approach of defining a task, collecting human decisions, and training machines to generate the same responses.
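
A minimal sketch of this recipe, assuming PyTorch; the model, images, and labels below are placeholder stand-ins, not the talk's actual setup:

```python
import torch
import torch.nn as nn

# 1. Define the task: map 32x32 RGB images to one of 10 categories (toy example).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# 2. Collect human decisions: here just random placeholders standing in for annotations.
images = torch.randn(64, 3, 32, 32)
human_labels = torch.randint(0, 10, (64,))

# 3. Train the machine to reproduce the same responses.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(images), human_labels)
    loss.backward()
    optimizer.step()
```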

4.- ImageNet challenge of assigning images to 1000 categories. Performance has improved from 50% to 90% accuracy over 10 years.

5.- Question of whether high ImageNet performance implies brain-like visual decision making in neural networks.

6.- Testing generalization ability as a key to intelligence when input data or task changes.

7.- Transfer learning: Reusing features from pretrained ImageNet models as fixed representations for other vision tasks.
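
A hedged sketch of that transfer-learning recipe, assuming torchvision's ResNet-50 as the pretrained backbone and a hypothetical 5-class downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ImageNet backbone, used as a fixed feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()          # drop the 1000-way ImageNet classifier
for p in backbone.parameters():
    p.requires_grad = False          # freeze: the representation stays fixed
backbone.eval()

head = nn.Linear(2048, 5)            # only this small head is trained for the new task

x = torch.randn(8, 3, 224, 224)      # placeholder batch from the new task
with torch.no_grad():
    features = backbone(x)           # reused ImageNet features
logits = head(features)
```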

8.- Successes of transfer learning in saliency prediction, pose estimation, behavior tracking, showing useful generalization beyond ImageNet.

9.- Limitations: Adversarial examples, sensitivity to domains/backgrounds, texture bias show ImageNet features alone don't imply brain-like vision.

10.- Controlled experiments showing CNNs rely more on texture while humans rely more on shape for object recognition.

11.- Out-of-domain testing on noise perturbations reveals non-human-like sensitivity of standard CNNs compared to shape-based models.
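
A sketch of such an out-of-domain test, assuming a trained classifier `model` and a labelled batch `images`, `labels` (hypothetical here) with pixel values in [0, 1]:

```python
import torch

def accuracy(model, images, labels):
    # Fraction of images whose predicted class matches the label.
    with torch.no_grad():
        return (model(images).argmax(dim=1) == labels).float().mean().item()

def add_gaussian_noise(images, sigma):
    # Additive Gaussian noise, clipped back to the valid pixel range.
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)

# Accuracy as a function of noise strength: clean vs. increasingly out-of-domain inputs.
for sigma in [0.0, 0.05, 0.1, 0.2, 0.4]:
    acc = accuracy(model, add_gaussian_noise(images, sigma), labels)
    print(f"noise sigma={sigma:.2f}  accuracy={acc:.3f}")
```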

12.- Removing texture bias through data augmentation makes CNNs more robust to noise like humans.
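
One simple way to weaken texture cues during training is aggressive appearance augmentation. The sketch below uses plain torchvision transforms and is only in the spirit of the idea; the actual augmentation scheme discussed in the talk may differ (e.g. style-transfer-based data):

```python
from torchvision import transforms

# Perturb colour and local texture while leaving global shape largely intact.
texture_perturbing_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=9, sigma=(0.5, 2.0)),
    transforms.ToTensor(),
])
```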

13.- Empirical findings that better out-of-domain accuracy on some datasets correlates with more human-like decisions.
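
The kind of analysis behind this point can be as simple as correlating two per-model scores; the numbers below are invented placeholders, not results from the talk:

```python
import numpy as np

# Placeholder scores for five hypothetical models.
ood_accuracy      = np.array([0.31, 0.45, 0.52, 0.60, 0.68])
human_consistency = np.array([0.40, 0.48, 0.55, 0.63, 0.70])

# Pearson correlation across models between OOD accuracy and human-likeness.
r = np.corrcoef(ood_accuracy, human_consistency)[0, 1]
print(f"Pearson correlation across models: r = {r:.2f}")
```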

14.- Counterfactual testing of the smallest input changes that alter model decisions is an even stronger test of human-like vision.
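
A generic gradient-based sketch of such a counterfactual search, assuming a differentiable classifier `model` in eval mode and a single input image `x` (batch size 1); this is an illustration, not the specific method presented in the talk:

```python
import torch

def smallest_decision_change(model, x, step_size=1e-2, penalty=10.0, max_steps=200):
    cls = model(x).argmax(dim=1).item()              # current decision
    delta = torch.zeros_like(x, requires_grad=True)  # perturbation to optimise
    for _ in range(max_steps):
        logits = model(x + delta)
        if logits.argmax(dim=1).item() != cls:
            break                                    # decision has flipped
        # Push down the current class while penalising large perturbations.
        loss = logits[0, cls] + penalty * delta.norm()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad
        delta.grad.zero_()
    return (x + delta).detach(), delta.detach()
```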

15.- Generative modeling enables such counterfactual testing on MNIST, revealing human-interpretable perturbations at class boundaries.

16.- Controversial stimuli experiments, which systematically compare model disagreements, enable quantifying alignment with human decisions.
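
A hedged sketch of how such stimuli can be synthesised by gradient ascent, assuming two MNIST-sized classifiers `model_a` and `model_b` (hypothetical callables) and a pair of classes on which they should disagree:

```python
import torch
import torch.nn.functional as F

def make_controversial(model_a, model_b, class_a, class_b,
                       shape=(1, 1, 28, 28), steps=300, lr=0.05):
    # Optimise an image so that model_a assigns it to class_a
    # while model_b assigns it to class_b.
    x = torch.rand(shape, requires_grad=True)        # start from noise
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        log_p_a = F.log_softmax(model_a(x), dim=1)[0, class_a]
        log_p_b = F.log_softmax(model_b(x), dim=1)[0, class_b]
        loss = -(log_p_a + log_p_b)                  # maximise both jointly
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                       # stay in image range
    return x.detach()
```

Human judgements on the resulting images then reveal which model's preferred class people actually see.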

17.- A generative model (ABS, analysis by synthesis) shows the best alignment with human interpretation of ambiguous digits.

18.- Scaling up generative models to natural images requires handling combinatorial complexity of objects and scenes.

19.- Compositional generative scene model learns to sequentially render background and objects from noisy unsupervised segmentation.

20.- Learned latent representation captures meaningful perceptual properties, enables plausible recombination and intervention on scenes.

21.- Exploring invariance manifolds in neural networks to study what information is preserved or discarded.

22.- Invertible neural networks allow synthesizing "metameric" images with same output but different nuisance (non-class-specific) information.
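
A conceptual sketch of the metamer construction, assuming a hypothetical invertible model `flow` with `forward`/`inverse` methods and a flat latent whose first `n_class_dims` dimensions carry the class-relevant information; the actual architecture and latent split in the talk may differ:

```python
import torch

def make_metamer(flow, class_image, nuisance_image, n_class_dims=10):
    # Encode both images with the (hypothetical) invertible network.
    z_a = flow.forward(class_image)      # latent of the "class" image
    z_b = flow.forward(nuisance_image)   # latent of the "nuisance" image
    # Keep the class-relevant dimensions of z_a, swap in the nuisance
    # dimensions of z_b, then invert back to pixel space. The result has the
    # same class read-out as class_image but the nuisance content of
    # nuisance_image: a "metamer" for the model.
    z_mix = torch.cat([z_a[:, :n_class_dims], z_b[:, n_class_dims:]], dim=1)
    return flow.inverse(z_mix)
```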

23.- For standard CNNs, humans perceive metamers as identical to the nuisance image, not the class image, exposing misaligned invariances.

24.- Modified training to encourage nuisance space to be invariant to class improves human consistency of CNN invariances.
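
One illustrative way to implement such a training signal: an auxiliary classifier tries to read the class from the nuisance code, and the encoder is penalised whenever it succeeds. This is a sketch under that assumption, not the exact objective used in the work:

```python
import torch
import torch.nn.functional as F

def encoder_invariance_term(z_nuisance, aux_classifier):
    # Encoder side: push the class prediction made from the nuisance code
    # towards the uniform distribution, i.e. remove class information.
    log_p = F.log_softmax(aux_classifier(z_nuisance), dim=1)
    uniform = torch.full_like(log_p, 1.0 / log_p.size(1))
    return F.kl_div(log_p, uniform, reduction="batchmean")

def adversary_term(z_nuisance, labels, aux_classifier):
    # Adversary side: the auxiliary classifier keeps trying to recover the
    # class from the (detached) nuisance code, so the game stays informative.
    return F.cross_entropy(aux_classifier(z_nuisance.detach()), labels)
```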

25.- Actively shaping invariances in neural networks is important direction to make their decisions more human-like.

26.- Overall perspective on using data and inductive biases to constrain learnable decision rules towards intended human-like solutions.

27.- Object-centric generative models and compositionality across scales as key ingredients for generating training data.

28.- Implicit argument that more human-like decision making, not just benchmark performance, should be goal of computer vision.

29.- Importance of out-of-domain and counterfactual testing to assess and improve human-consistency of vision models.

30.- Central role of generative models in future work to build more robust, generalizable and human-aligned computer vision systems.

Knowledge Vault built by David Vivancos 2024