The End Of Knowledge - Vault 1 - Lex 100 - 39 (2024) - Jitendra Malik: Computer Vision

graph LR
classDef vision fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef challenges fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef cognition fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef development fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef future fill:#d4f9f9, font-weight:bold, font-size:14px;
linkStyle default stroke:white;
Z[Jitendra Malik: Computer Vision] -.-> A[The complexity and challenges of computer vision]
Z -.-> G[The role of perception, cognition, and memory in vision]
Z -.-> M[Learning and knowledge representation in computer vision]
Z -.-> T[Vision's role in AI development]
Z -.-> AA[Current and future impacts of AI]
Z -.-> AG[Mentorship and problem selection in AI]
A -.-> B[Underestimate computer vision complexity. 2]
A -.-> C[Neuroscience perspective on vision. 3]
A -.-> D[Autonomous driving vision limitations. 4]
A -.-> E[Vision's role in AI fallacy. 5]
A -.-> F[Common sense in computer vision. 6]
G -.-> H[Compare human learning to AI learning. 7]
G -.-> I[Challenges in computer vision learning. 8]
G -.-> J[Computing power disparity brain vs AI. 9]
G -.-> K[The fundamental problem of computer vision. 10]
G -.-> L[Perception, cognition, memory in vision. 11]
M -.-> N[Learning vs. coding knowledge in AI. 12]
M -.-> O[Benchmarks reflect child learning in vision. 13]
M -.-> P[Embodiment for AI development. 14]
M -.-> Q[Break correlation-causation barrier in vision. 15]
M -.-> R[Exploratory learning and simulation for vision. 23]
M -.-> S[Feedback mechanisms in vision systems. 24]
T -.-> U[Vision before language in AI development. 16,25]
T -.-> V[The role of vision in AI. 17]
T -.-> W[End-to-end learning in computer vision. 18]
T -.-> X[Deep learning's potential in vision. 19]
T -.-> Y[Multimodal learning's importance in vision. 26]
AA -.-> AB[AI's current and future impacts. 20]
AA -.-> AC[Long-term video understanding challenges. 21]
AA -.-> AD[3D world understanding challenges. 22]
AA -.-> AE[Ethical and societal impacts of AI. 27]
AA -.-> AF[The future of AI and unknown unknowns. 28]
Z -.-> AH[Deep learning's surprises and limitations. 29]
Z -.-> AI[Jitendra Malik's impact on computer vision. 1]
class A,B,C,D,E,F,AC,AD vision;
class G,H,I,J,K,L challenges;
class M,N,O,P,Q,R,S cognition;
class T,U,V,W,X,Y,AH learning;
class AA,AB,AE,AF development;
class AG,AI future;

Custom ChatGPT summary of the OpenAI Whisper transcription:

1.- Jitendra Malik's Influence in Computer Vision: Malik has shaped computer vision both before and after the deep learning revolution. His work has received over 180,000 citations, and he has mentored many of the field's top researchers.

2.- The Underestimation of Computer Vision Complexity: The conversation opens with the 1966 Summer Vision Project by Seymour Papert as an example of how long computer vision's complexity has been underestimated. This underestimation stems from the fact that humans process vision unconsciously, in sharp contrast to the conscious effort required for tasks like chess or mathematical theorem proving.

3.- The Neuroscience Perspective on Vision: Malik points out that a considerable part of the cerebral cortex is dedicated to visual processing in humans and other primates, underscoring the complexity of the task from a neuroscience viewpoint.

4.- Autonomous Driving and Vision's Limitations: Malik expresses skepticism about fully autonomous driving in the near future, attributing this to the occasional need for sophisticated cognitive reasoning, a challenge not yet overcome by current computer vision systems.

5.- Vision's Role in AI and the Fallacy of the First Step: The conversation delves into the perception that vision might be an easier problem compared to language processing in AI, discussing the "fallacy of the successful first step" where initial progress in vision can be misleadingly fast compared to achieving near-perfect accuracy.

6.- Need for Common Sense and Understanding in Vision: Malik argues that vision operates at all levels of processing, requiring a combination of sensation, perception, and cognition to interpret scenes accurately. This complexity calls for common sense reasoning and understanding, which remain challenging for current AI models.

7.- Comparison with Human Learning: Discussing the difference between how humans learn to drive and how AI is trained, Malik notes that humans come to driving with a vast repertoire of visual knowledge gained from infancy, unlike AI systems which start from scratch.

8.- Challenges in Learning Techniques: Malik believes that while neural networks have potential, the learning techniques need significant evolution to mimic the richer learning processes humans undergo, including exploration and interaction with the world.

9.- The Computing Power Disparity: The discussion touches on the computational power difference between the human brain and computers, emphasizing that despite advances in computing power, the style and efficiency of biological computing are far from replicated in AI.

10.- The Fundamental Problem of Computer Vision: Malik frames the essence of computer vision as sensing the world in a way that guides action, drawing parallels with biological systems, where perception and action are tightly linked.

11.- Perception, Cognition, and Memory: Malik discusses how perception blends into cognition and emphasizes the role of memory and psychological schemas, such as the schema for a restaurant visit. These illustrate the sophisticated knowledge AI systems need, especially for long-term video understanding.

12.- Learning Versus Coding Knowledge: Malik contrasts the 1970s approach of hand-coding knowledge in AI with the current emphasis on learning from data. He advocates for learning methodologies as more robust, drawing parallels with how children acquire knowledge through observation and experience rather than explicit instruction.

13.- Benchmarks Reflecting Child Learning: The discussion moves towards the inadequacy of current benchmarks in computer vision, suggesting that they are skewed towards adult capabilities rather than capturing the incremental and multimodal learning process observed in children.

14.- Importance of Embodiment in AI: Malik stresses the significance of physical interaction with the world for learning, mentioning robotics and simulation environments as critical for developing AI systems capable of understanding and interacting with their surroundings effectively.

15.- Breaking the Correlation-Causation Barrier: He discusses the importance of active learning and experimentation in breaking the correlation versus causation barrier, suggesting that controlled experiments, akin to those performed by children, are crucial for building and refining causal models of the world.

16.- Vision and Language in AI Development: Malik argues that vision is more fundamental than language in cognitive development, both evolutionarily (phylogeny) and in individual development (ontogeny), suggesting AI development should mirror this sequence.

17.- The Role of Vision in AI: The conversation covers the "three R's" of computer vision (recognition, reconstruction, and reorganization) and Malik's vision of a unified treatment of these aspects to improve AI's ability to interpret the world.

18.- Learning End-to-End in Computer Vision: Malik critiques the narrow focus on end-to-end supervised learning, advocating for a broader, lifelong learning approach that mirrors human development and emphasizes the integration of multiple learning modes.

19.- The Potential of Deep Learning and the Need for More: Despite being positively surprised by deep learning's capabilities, Malik highlights the ongoing need for breakthroughs in AI, especially to address the complex, long-term challenges in computer vision and cognition.

20.- AI's Current and Future Impacts on Society: Malik emphasizes the importance of addressing the ethical and societal implications of AI now, rather than waiting for hypothetical future developments, pointing out that AI's deployment in various domains already necessitates a careful consideration of its consequences.

21.- Long-term Video Understanding: Malik emphasizes the need for computer vision systems to progress beyond short-term video clip analysis to comprehend long-form videos, which requires understanding agents, their goals and intentions, and predicting their future actions. This leap demands a more sophisticated understanding of scenes, blending short-term observations with deeper cognitive models.

22.- 3D World Understanding Challenges: While acknowledging progress in 3D understanding, Malik points out the current limitations and the need for a more integrated approach that combines multiple viewpoints and single-view reconstruction. He advocates for a learning model that develops a nuanced 3D understanding as one moves through the world, similar to human experience.

23.- Exploratory Learning and Simulation: Malik highlights the importance of exploratory learning and simulations for AI to understand and interact with the world more effectively. He discusses the potential of simulation environments, like Habitat, to offer photorealistic and interactive experiences that could advance AI's understanding of complex real-world scenarios.

24.- The Need for Feedback Mechanisms in Vision Systems: Drawing a parallel between biological systems and AI, Malik discusses the lack of feedback mechanisms in current vision systems, which are predominantly feed-forward. He suggests incorporating feedback and recurrence mechanisms to handle ambiguous stimuli more effectively, similar to biological vision systems.

25.- Vision Before Language in AI Development: Malik argues that vision is more fundamental to cognitive development and AI than language, reflecting evolutionary and individual development processes. He suggests AI development should prioritize vision to mimic human cognitive development more accurately.

26.- Multimodal Learning's Importance: Emphasizing the significance of multimodal learning, Malik describes how combining different sensory inputs, such as tactile and visual or audio and visual signals, can provide strong cross-calibration signals for AI systems, promoting a more integrated and robust understanding of the world.

27.- Ethical and Societal Impacts of AI: Malik urges attention to the ethical and societal implications of AI, highlighting the importance of considering AI's effects not just in hypothetical future scenarios but in its current applications, including potential biases, safety concerns, and impacts on decision-making processes.

28.- The Future of AI and Unknown Unknowns: While optimistic about the potential for significant breakthroughs in AI, Malik expresses caution regarding overestimating the pace of progress, emphasizing the distinction between known unknowns and unknown unknowns in AI research and development.

29.- Deep Learning's Surprises and Limitations: Acknowledging deep learning's impressive capabilities, Malik reflects on his initial skepticism and the surprising effectiveness of deep learning systems. However, he also notes the limitations of current models and the need for future breakthroughs to address more complex challenges in AI.

30.- Mentorship and Problem Selection: Reflecting on his role as a mentor to many leading researchers in computer vision and AI, Malik emphasizes the importance of selecting the right problems to focus on. He shares his approach to guiding students towards problems that are both significant and solvable, highlighting the value of intellectual breadth and interdisciplinary connections.
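
As a toy illustration of point 15 (breaking the correlation-causation barrier through experimentation), the sketch below simulates a world in which a hidden factor drives an outcome. It is not taken from the interview: the variables `x`, `y`, `z`, the noise level 0.3, and the function names are all assumptions chosen for illustration. A passive observer sees `x` and `y` move together; a learner that sets `x` itself, as a child running a small experiment might, finds that `x` has no effect on `y`.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, seed=0):
    """Sample (x, y) pairs from a toy world where a hidden factor z drives y.

    Observational mode: x passively tracks z, so x and y are correlated.
    Interventional mode: the learner sets x itself, independently of z.
    In neither mode does x have any causal effect on y.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)          # hidden common cause (hypothetical)
        if intervene:
            x = rng.gauss(0.0, 1.0)      # learner chooses x, ignoring z
        else:
            x = z + rng.gauss(0.0, 0.3)  # x follows z plus a little noise
        y = z + rng.gauss(0.0, 0.3)      # y depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return xs, ys


observed = pearson(*simulate(5000, intervene=False))
experiment = pearson(*simulate(5000, intervene=True))
print(f"passive observation: r = {observed:.2f}")
print(f"active intervention: r = {experiment:.2f}")
```

Under these assumptions the passive correlation should come out strongly positive, while the correlation under intervention should sit close to zero. This is the sense in which a controlled experiment can reveal what observation alone cannot.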

Interview by Lex Fridman | Custom GPT and Knowledge Vault built by David Vivancos 2024