Knowledge Vault 5/2 - CVPR 2015
Reverse Engineering the Human Visual System
Jack Gallant
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
classDef vision fill:#f9d4d4, font-weight:bold, font-size:14px
classDef fmri fill:#d4f9d4, font-weight:bold, font-size:14px
classDef cnn fill:#d4d4f9, font-weight:bold, font-size:14px
classDef challenges fill:#f9f9d4, font-weight:bold, font-size:14px
classDef community fill:#f9d4f9, font-weight:bold, font-size:14px
A[Reverse Engineering the Human Visual System] --> B[Human vision: hierarchical, 30-40 distinct areas. 1]
B --> C[Visual areas represent information as neurons. 2]
B --> D[Attention influences vision feedforward and feedback. 3]
A --> E[fMRI measures slow responses, maps activity. 4]
E --> F[fMRI delineated early and intermediate vision. 5]
E --> G[fMRI shows complex patterns, high-dimensional regression. 6]
G --> H[Encoding models predict responses using regression. 7]
G --> I[Decoding reconstructs stimuli from brain activity. 8]
B --> J[Semantic representations: rich gradients across areas. 9]
B --> K[Semantic tuning shifts based on task. 10]
A --> L[CNNs advanced vision, mimic biological aspects. 11]
L --> M[CNN layers predict brain activity, outperform. 12]
M --> N[CNN layers predict visual areas hierarchically. 13]
L --> O[CNNs reveal selectivity in visual areas. 14]
L --> P[Discrepancies exist between CNNs and vision. 15]
P --> Q[Human vision has attention, CNNs lack. 16]
L --> R[Weight sharing in CNNs akin to normalization. 17]
A --> S[Reasoning and cognition lag behind vision. 18]
S --> T[Big data approaches study complex cognition. 19]
B --> U[Color difficult from V1 due to luminance. 20]
B --> V[Natural vision video-based, important to study. 21]
L --> W[CNNs predict movie responses despite training. 22]
E --> X[Better 3D brain data needed for CNNs. 23]
A --> Y[Computer vision: abstract vs biologically-inspired communities. 24]
Y --> Z[Workshops bring together vision and biology. 25]
class B,C,D,J,K,U,V vision
class E,F,G,H,I,X fmri
class L,M,N,O,P,Q,R,W cnn
class S,T challenges
class Y,Z community

Resume:

1.- Human vision is organized hierarchically, with 30-40 distinct visual areas arranged in an interconnected network.

2.- Each visual area represents certain information about the visual world, with neurons acting as basis functions in a high-dimensional space.

3.- Attentional influences occur throughout the visual system via feedforward and feedback connections between layers.

4.- Functional MRI (fMRI) measures slow hemodynamic responses in 3D voxels across the brain, allowing mapping of functional activity.

5.- The early and intermediate human visual system was delineated using fMRI over roughly 20 years, identifying various functional visual areas.

6.- fMRI data shows rich, complex patterns of activity corresponding to different stimuli, posing a high-dimensional regression problem.

7.- Encoding models using ridge regression can predict fMRI responses to novel stimuli based on previously learned feature spaces.
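
The encoding-model idea in point 7 can be sketched numerically. This is a toy numpy illustration, not the talk's actual pipeline: the data are synthetic, the shapes are arbitrary, and the ridge penalty would in practice be chosen per voxel by cross-validation.

```python
import numpy as np

# Toy fMRI encoding model: X holds stimulus features (n_stimuli x n_features),
# Y holds voxel responses (n_stimuli x n_voxels). Ridge regression learns one
# weight vector per voxel; predictions on held-out stimuli test the model.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 30, 10

W_true = rng.normal(size=(n_features, n_voxels))   # ground-truth voxel tuning
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_voxels))

alpha = 1.0  # ridge penalty (assumed; normally tuned by cross-validation)
W_hat = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(n_features),
                        X_train.T @ Y_train)

Y_pred = X_test @ W_hat  # predicted responses to novel stimuli
```

The closed-form solve is just ridge regression written out; with many more voxels than this toy example, the same normal equations are solved once and reused for every voxel column.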

8.- Decoding models, derived from encoding models, can reconstruct stimuli from brain activity patterns, e.g., decoding movies from visual cortex activity.
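
One common way to derive a decoder from an encoding model is stimulus identification: for each observed activity pattern, pick the candidate stimulus whose predicted response matches it best. The sketch below is a hedged toy version with synthetic weights and data, not the movie-reconstruction method itself.

```python
import numpy as np

# Model-based decoding by identification: correlate each observed activity
# pattern with the encoding model's prediction for every candidate stimulus,
# then decode by picking the best-matching candidate.
rng = np.random.default_rng(1)
n_stim, n_features, n_voxels = 40, 20, 60

W = rng.normal(size=(n_features, n_voxels))        # learned encoding weights
stimuli = rng.normal(size=(n_stim, n_features))    # candidate stimulus set
activity = stimuli @ W + 0.2 * rng.normal(size=(n_stim, n_voxels))

predicted = stimuli @ W                            # model predictions
obs = activity - activity.mean(1, keepdims=True)
pred = predicted - predicted.mean(1, keepdims=True)
corr = (obs @ pred.T) / (np.linalg.norm(obs, axis=1)[:, None]
                         * np.linalg.norm(pred, axis=1)[None, :])

decoded = corr.argmax(axis=1)                      # identified stimulus index
accuracy = (decoded == np.arange(n_stim)).mean()
```

Full reconstruction (rather than identification) additionally needs a prior over possible stimuli, but the matching step is the same.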

9.- Semantic representations in the brain are organized in rich gradients distributed across multiple areas, not just individual punctate regions.

10.- Semantic tuning across the brain dynamically shifts based on task demands, allocating representational resources to task-relevant information.

11.- Deep convolutional neural networks (CNNs) have advanced computer vision, mimicking aspects of biological vision.

12.- CNN layers can be used as regressors to predict brain activity in response to stimuli, outperforming conventional feature-based models.
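
The comparison in point 12 amounts to fitting the same regression with two feature spaces and scoring held-out prediction accuracy. In this toy sketch the "CNN" and "conventional" features are both random matrices and the voxels are simulated from the CNN features, so the outcome is built in; only the evaluation procedure is the point.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    # Closed-form ridge regression, one weight column per voxel.
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def held_out_r(X_tr, Y_tr, X_te, Y_te):
    # Pooled correlation between predicted and observed held-out responses.
    P = X_te @ fit_ridge(X_tr, Y_tr)
    Pm, Ym = P - P.mean(0), Y_te - Y_te.mean(0)
    return float((Pm * Ym).sum() / (np.linalg.norm(Pm) * np.linalg.norm(Ym)))

rng = np.random.default_rng(2)
n_tr, n_te, d_cnn, d_old, n_vox = 300, 80, 40, 40, 25
F_cnn_tr, F_cnn_te = rng.normal(size=(n_tr, d_cnn)), rng.normal(size=(n_te, d_cnn))
F_old_tr, F_old_te = rng.normal(size=(n_tr, d_old)), rng.normal(size=(n_te, d_old))

# Simulated voxels driven by the "CNN" feature space (toy ground truth).
W = rng.normal(size=(d_cnn, n_vox))
Y_tr = F_cnn_tr @ W + 0.5 * rng.normal(size=(n_tr, n_vox))
Y_te = F_cnn_te @ W + 0.5 * rng.normal(size=(n_te, n_vox))

r_cnn = held_out_r(F_cnn_tr, Y_tr, F_cnn_te, Y_te)
r_old = held_out_r(F_old_tr, Y_tr, F_old_te, Y_te)
```

With real data, the features would come from CNN-layer activations and from a conventional model (e.g., Gabor energy), and the winner is whichever yields higher held-out correlation per voxel.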

13.- Early visual areas are best predicted by early CNN layers, while higher-level areas are predicted by later layers.

14.- Probing CNNs can reveal features represented in each visual area, e.g., curvature selectivity in V4, face selectivity in fusiform face area.

15.- Some discrepancies exist between CNNs and human vision, such as idiosyncratic categorization artifacts in CNNs and the unclear emergence of figure-ground organization.

16.- Attentional control in human vision influences processing throughout the hierarchy, while CNNs lack short-term attentional mechanisms.

17.- Weight sharing across retinotopic positions, akin to divisive normalization in biological vision, is a key feature of CNNs.
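
Weight sharing just means one kernel is reused at every spatial position. A minimal 1-D numpy illustration (synthetic signal and kernel, not from the talk):

```python
import numpy as np

def conv1d_valid(signal, kernel):
    # Slide the SAME kernel across every position: this reuse of one weight
    # vector at all locations is what "weight sharing" means in a CNN.
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_kernel = np.array([1., -1.])          # one shared weight pair
responses = conv1d_valid(signal, edge_kernel)
```

The two nonzero responses mark the step edges, and every position was scored by the identical two weights; a fully connected layer would instead learn separate weights per position.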

18.- Understanding reasoning and complex cognition in mammals lags behind vision research, due to the difficulty of varying top-down state variables experimentally.

19.- Big data approaches are beginning to be applied to study complicated cognitive tasks in humans and animals.

20.- Color information is difficult to recover from V1 voxels in fMRI experiments with natural images due to luminance dominance.

21.- Studying vision using video stimuli is important, as natural vision is essentially video-based.

22.- CNNs trained on static images can still predict brain responses to movie stimuli, possibly due to slow fMRI hemodynamic responses.

23.- Better human brain data measuring neural activity in 3D is needed to fully leverage CNNs trained on movies.

24.- Two communities in computer vision: one favoring abstract, theoretical approaches, and another using biology for inspiration (e.g., Jitendra Malik).

25.- Workshops have been organized to bring together the biologically-interested computer vision community and the biological vision community.

Knowledge Vault built by David Vivancos 2024