The End Of Knowledge - Vault 6/11 - CVPR - 2016 - A Quest for Visual Intelligence in Computers

graph LR
classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px
classDef evolution fill:#d4d4f9, font-weight:bold, font-size:14px
classDef progress fill:#f9f9d4, font-weight:bold, font-size:14px
classDef datasets fill:#f9d4f9, font-weight:bold, font-size:14px
classDef future fill:#d4f9f9, font-weight:bold, font-size:14px
Main[A Quest for Visual Intelligence in Computers]
Main --> A[Introduction and Background]
A --> A1[Fei-Fei Li: Stanford professor, AI researcher 1]
A --> A2[Eyes triggered animal speciation explosion 2]
A --> A3[Human vision: powerful evolutionary machinery 3]
A --> A4[Humans understand brief scene presentations quickly 4]
A --> A5[Total scene understanding: computer vision goal 5]
Main --> B[Evolution of Computer Vision]
B --> B1[Early vision: handcrafted 3D world models 6]
B --> B2[Vision challenging: context, knowledge required 7]
B --> B3[Big data learning: path to intelligence 8]
B --> B4[Machine learning tools advanced computer vision 9]
B --> B5[One-shot learning inspired by humans 10]
B --> B6[Infant vision requires extensive training 11]
Main --> C[Data and Datasets]
C --> C1[Internet explosion of visual data 12]
C --> C2[ImageNet: massive labeled image dataset 13]
C --> C3[Algorithms recognize objects, avoid mistakes 14]
C --> C4[Deep learning breakthrough in object recognition 15]
C --> C5[Object detection challenging for small objects 16]
C --> C6[Visual Genome: rich, interconnected annotations 22]
Main --> D[Image Understanding Progress]
D --> D1[Humans excel at contextual recognition 17]
D --> D2[Early image descriptions: limited datasets 18]
D --> D3[Image-sentence matching through semantic similarity 19]
D --> D4[RNNs generate descriptions from CNN features 20]
D --> D5[Image captioning models: novel, some errors 21]
D --> D6[Dense captioning: detailed region descriptions 23]
Main --> E[Current State and Future Challenges]
E --> E1[Context enables small object detection 24]
E --> E2[Deep understanding requires relationships, knowledge 25]
E --> E3[Vision progress: modeling to question-answering 26]
E --> E4[Algorithms lack human-level story understanding 27]
E --> E5[Human vision grasps complex scene meanings 28]
E --> E6[Much work remains for visual intelligence 29]
class Main main
class A,A1,A2,A3,A4,A5 intro
class B,B1,B2,B3,B4,B5,B6 evolution
class C,C1,C2,C3,C4,C5,C6 datasets
class D,D1,D2,D3,D4,D5,D6 progress
class E,E1,E2,E3,E4,E5,E6 future

Summary:

1.- Fei-Fei Li is an associate professor at Stanford and director of the AI Lab, Computer Vision Lab, and Toyota Human-Centric AI Research.

2.- About 540 million years ago, the emergence of eyes in animals triggered an explosion in animal speciation and set off an evolutionary arms race.

3.- The human visual system, developed over 540 million years of evolution, is the most powerful visual machinery in the known universe.

4.- Experiments show that humans can grasp the gist of a scene presented very briefly: after seeing an image flashed for only 500 ms, subjects could type detailed descriptions of it.

5.- Total scene understanding - being able to see and understand an entire visual scene the way humans can - is a key goal for computer vision.

6.- Early computer vision work in the 1960s-80s focused on handcrafting models to explain the 3D world from 2D pictures.

7.- Vision is hard because measuring pixels is not the same as understanding scenes; the brain interprets scenes using context and prior knowledge.

8.- After 30 years of modeling, the computer vision community realized learning from big data was the path to visual intelligence.

9.- Around 2000, computer vision found machine learning, gaining tools like SVMs, graphical models, and neural networks to make real progress.

10.- One-shot learning algorithms aimed to learn to recognize objects from very few training examples, inspired by human learning.

11.- However, the human visual system actually requires extensive training - infants' eyes collect hundreds of millions of "training images" in their first years.

12.- The early 2000s saw an explosion of visual data on the internet, with over 85% of cyberspace data being pixel-based by 2016.

13.- The ImageNet project, started in 2009, collected a massive crowd-sourced dataset of 15 million labeled images across 22,000 categories.

14.- Using ImageNet, algorithms were developed to recognize objects while avoiding mistakes by backing off to more general categories when uncertain.

15.- The 2012 ImageNet competition saw a breakthrough in object recognition accuracy with deep convolutional neural networks, ushering in the deep learning renaissance.

16.- Progress on the ImageNet object detection challenge has been slower, with small, textureless objects proving very difficult for algorithms to detect.

17.- Humans excel at using context and a holistic view of an image for recognition, noticing meaningful differences while ignoring irrelevant ones.

18.- Early efforts at generating descriptions of images were limited, using small datasets and known object categories and sentence structures.

19.- Later work aimed to match images to sentences by ranking them, learning features that encoded the semantic similarity between the two.

20.- To generate descriptions directly, recurrent neural networks were used as language models, combined with convolutional neural networks representing the image.

21.- These image captioning models could generate novel descriptions of images, though they still made some errors due to a lack of context.

22.- The Visual Genome dataset was introduced with richer annotations, including entities, attributes, and relationships, all mapped to knowledge bases so that images become interconnected.

23.- Using Visual Genome, a dense captioning model was developed to provide detailed, contextual descriptions of many regions within an image.

24.- The dense captioning model also enabled detection of small objects by providing context, solving a limitation of earlier object detection methods.

25.- Deep understanding of images requires going beyond labeling objects to also considering relationships, knowledge, and structure.

26.- The journey of computer vision has progressed from early modeling to big data, machine learning, rich descriptions and question-answering.

27.- However, vision is still an unsolved problem - while algorithms can detect objects and describe scenes, they lack human-level story understanding.

28.- Today's algorithms cannot match the depth of human visual understanding, which effortlessly grasps stories, emotions, humor, and intentions from images.

29.- While computer vision has made remarkable strides, there remains much work to be done to approach human-like visual intelligence.

30.- This progress is thanks to the work of many students and collaborators contributing to advance the state-of-the-art in visual understanding.
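
Point 11 says infants collect hundreds of millions of "training images" in their first years. The back-of-the-envelope calculation below shows how a number of that magnitude can arise. All three inputs (one fixation every 200 ms, 12 waking hours per day, 3 years) are illustrative assumptions and are not taken from this summary.

```python
# All three inputs are illustrative assumptions, not figures from the summary.
FIXATION_MS = 200            # assumed: one new "image" per eye fixation
WAKING_HOURS_PER_DAY = 12    # assumed waking time per day
YEARS = 3                    # assumed length of the "first years"


def estimated_fixations(fixation_ms, waking_hours_per_day, years):
    """Estimate how many fixations the eyes make over the given period."""
    per_second = 1000 / fixation_ms
    per_day = per_second * waking_hours_per_day * 3600
    return int(per_day * 365 * years)


total = estimated_fixations(FIXATION_MS, WAKING_HOURS_PER_DAY, YEARS)
print(f"{total:,}")  # 236,520,000 under these assumptions
```

Changing any input by a factor of two still leaves the estimate in the hundreds of millions, which is the order of magnitude the summary cites.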
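
Point 14 describes recognizers that back off to a more general category when uncertain. A minimal sketch of that idea follows, assuming a tiny hand-made hierarchy and a simple rule: start from the most likely leaf and climb until the category's total probability reaches a threshold. The hierarchy, the `hedged_label` name, and the threshold rule are hypothetical simplifications, not the method used in the ImageNet work.

```python
# Hypothetical toy hierarchy: child -> parent. "animal" is the root.
PARENT = {
    "golden retriever": "dog",
    "husky": "dog",
    "tabby cat": "cat",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def ancestors(node):
    """Yield the node itself, then each ancestor up to the root."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def subtree_mass(node, leaf_probs):
    """Sum the probability of every leaf that lies at or below node."""
    return sum(p for leaf, p in leaf_probs.items() if node in ancestors(leaf))


def hedged_label(leaf_probs, threshold=0.8):
    """Return the most specific category on the best leaf's path whose
    accumulated probability reaches the threshold (assumed rule)."""
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    node = best_leaf
    for node in ancestors(best_leaf):
        if subtree_mass(node, leaf_probs) >= threshold:
            return node
    return node  # fall back to the root if nothing reaches the threshold


confident = {"golden retriever": 0.90, "husky": 0.05, "tabby cat": 0.03, "sparrow": 0.02}
unsure = {"golden retriever": 0.45, "husky": 0.40, "tabby cat": 0.10, "sparrow": 0.05}
print(hedged_label(confident))  # golden retriever
print(hedged_label(unsure))     # dog
```

The second call shows the trade-off: the answer is less specific, but the classifier avoids committing to a breed it cannot tell apart.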
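
Point 19 describes matching images to sentences by ranking them according to semantic similarity. The sketch below illustrates only the ranking step, assuming that images and sentences have already been embedded in a shared vector space. The three-dimensional vectors are made-up placeholders and cosine similarity is an assumed choice of score; in practice the embeddings would be learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_sentences(image_vec, sentence_vecs):
    """Return sentences ordered from most to least similar to the image."""
    return sorted(
        sentence_vecs,
        key=lambda s: cosine(image_vec, sentence_vecs[s]),
        reverse=True,
    )


# Placeholder embeddings (hypothetical values, not model outputs).
image_vec = [0.9, 0.1, 0.2]
sentence_vecs = {
    "a dog runs on grass": [0.8, 0.2, 0.1],
    "a plate of food": [0.1, 0.9, 0.1],
    "a city street at night": [0.2, 0.1, 0.9],
}

ranking = rank_sentences(image_vec, sentence_vecs)
print(ranking[0])  # a dog runs on grass
```

The same function can rank images for a given sentence by swapping the roles of the two inputs.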
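
Point 22 describes annotations made of entities, attributes, and relationships that are mapped to knowledge bases and thereby interconnect images. One way to picture such an annotation is as a small scene graph, sketched below. The class names, field names, and the `kb_id` strings are hypothetical and do not reproduce the actual Visual Genome schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple = ()
    kb_id: str = ""  # hypothetical identifier of a knowledge-base concept


@dataclass(frozen=True)
class Relationship:
    subject: Entity
    predicate: str
    obj: Entity


@dataclass
class SceneGraph:
    image_id: str
    relationships: list = field(default_factory=list)

    def triples(self):
        """Return (subject, predicate, object) name triples for the image."""
        return [(r.subject.name, r.predicate, r.obj.name) for r in self.relationships]


def images_sharing(graphs, kb_id):
    """Return ids of images whose graphs mention the given concept."""
    hits = []
    for g in graphs:
        ids = {e.kb_id for r in g.relationships for e in (r.subject, r.obj)}
        if kb_id in ids:
            hits.append(g.image_id)
    return hits


man = Entity("man", ("smiling",), "kb:person")
woman = Entity("woman", ("standing",), "kb:person")
horse = Entity("horse", ("brown",), "kb:horse")
hay = Entity("hay", ("dry",), "kb:hay")

g1 = SceneGraph("img1", [Relationship(man, "riding", horse)])
g2 = SceneGraph("img2", [Relationship(woman, "carrying", hay)])

print(g1.triples())                           # [('man', 'riding', 'horse')]
print(images_sharing([g1, g2], "kb:person"))  # ['img1', 'img2']
```

Because both images contain an entity mapped to the same concept, a query on that concept links them, which is the sense in which shared knowledge-base mappings interconnect images.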

Knowledge Vault built by David Vivancos 2024