Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Kristen Grauman is a professor at UT Austin researching computer vision and machine learning, focusing on visual recognition and search.
2.-Visual recognition has made exciting progress in recent years, as exemplified by performance on the ImageNet benchmark.
3.-Most visual recognition systems today learn through supervised classification on disembodied web photos, which has limitations.
4.-In contrast, real-world visual data is captured from an agent's first-person perspective with uncontrolled motions, irrelevant clutter, and multimodal sensory input.
5.-The goal is to move towards embodied visual learning that considers learning in the context of an agent's own behavior and observations.
6.-Held and Hein's classic kitten carousel study demonstrated the importance of learning visual representations in the context of an agent's own ego-motion.
7.-Their approach learns an equivariant visual embedding from unlabeled video that is predictive of how the scene will change with ego-motion (a minimal sketch of this objective follows the list).
8.-The learned representation captures semantics, context, geometry and relative depth in order to enable prediction of new viewpoints from a single view.
9.-Using the equivariant representation as an unsupervised pretraining step improves recognition accuracy by 30% while reducing the need for labeled data.
10.-Next they consider how an agent can learn visual representations by actively moving to inspect an object from different views.
11.-A self-supervised task of predicting a complete set of viewpoints from a single view encourages learning of 3D shape semantics (sketched after this list).
12.-The category-agnostic shape representation, called "shape codes", improves recognition accuracy on ModelNet and ShapeNet compared to other unsupervised approaches.
13.-In a related line of work, they recover 3D human body pose from egocentric video by leveraging the correlation with scene motion.
14.-Visual recognition is traditionally studied as a silent, vision-only problem, but in the real world visual observations are coupled with informative multisensory signals like audio.
15.-Their goal is to learn object-specific sound models from unlabeled video where multiple objects are making sounds simultaneously.
16.-They use a deep multi-instance multi-label (MIML) learning framework to disentangle which visual objects make which sounds, based on spectral bases from non-negative matrix factorization (NMF) of the audio; the test-time separation step is sketched after this list.
17.-At test time, they detect objects present in a new video and use the learned audio bases to guide separation of the sound sources.
18.-The approach successfully learns to separate sounds of musical instruments and objects in unlabeled video, outperforming traditional audio source separation.
19.-Remaining challenges include determining when visually detected objects are actually making sound in the video.
20.-Next they discuss learning policies for how agents should move to quickly recognize objects and scenes.
21.-In active recognition, the goal is to learn intelligent action selection, evidence fusion over a sequence of views, and perception.
22.-They propose an end-to-end approach to simultaneously learn the three components by sharing representations, outperforming several recent baselines.
23.-A recurrent neural network fuses evidence over the views and updates the category beliefs to recognize the object in a few glimpses (sketched after this list).
24.-They demonstrate results for active recognition in three scenarios: an agent looking around a scene, manipulating an object, or moving around an object.
25.-However, active recognition assumes a predefined, closed-world task, so next they consider learning generic exploratory policies for new environments.
26.-The idea is to learn policies that actively select a small set of observations allowing reconstruction of the rest of the environment.
27.-This "observation completion" objective encourages efficient, non-myopic exploratory behaviors to quickly reduce uncertainty in new scenes.
28.-Results show the learned look-around policies can reconstruct novel 360° scenes and new objects from very few glimpses.
29.-Preliminary experiments show these task-independent exploratory policies can be transferred to active recognition, performing competitively with closed-world policies.
30.-To summarize, embodied visual learning that exploits unlabeled video, active perception, and interaction leads to more robust, general, and efficient recognition.
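Sketches:
The following code sketches are illustrative only; they assume PyTorch/NumPy and use made-up module names, layer sizes, and random data, not the authors' released code or exact architectures.

Sketch for point 7 (ego-motion equivariance): a minimal version of an equivariance objective in which a learned, motion-specific linear map M_g must carry the embedding of one frame to the embedding of the next, assuming frame pairs with discretized ego-motion labels taken from the agent's own motor signals.

# Assumption: unlabeled video gives frame pairs (x_t, x_t+1) plus a discretized
# ego-motion class g (e.g. "turn left", "move forward"). Names are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small CNN mapping an image to a D-dimensional embedding z(x)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )
    def forward(self, x):
        return self.net(x)

class EquivariantHead(nn.Module):
    """One learned linear map M_g per discretized ego-motion class g."""
    def __init__(self, num_motions=4, dim=64):
        super().__init__()
        self.maps = nn.Parameter(torch.stack([torch.eye(dim) for _ in range(num_motions)]))
    def forward(self, z, g):
        # Apply the motion-specific transform: z_pred = M_g z
        return torch.bmm(self.maps[g], z.unsqueeze(-1)).squeeze(-1)

encoder, head = Encoder(), EquivariantHead()
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

# One illustrative training step on a random batch of frame pairs.
x_before, x_after = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
g = torch.randint(0, 4, (8,))                     # ego-motion class for each pair
z_pred = head(encoder(x_before), g)               # where the embedding *should* move
loss = ((z_pred - encoder(x_after)) ** 2).mean()  # equivariance loss
opt.zero_grad(); loss.backward(); opt.step()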
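Sketch for points 11-12 (viewgrid prediction and shape codes): a toy encoder-decoder that maps a single view to a full grid of viewpoints; the bottleneck vector stands in for the "shape code" reused as an unsupervised feature. Viewgrid size, resolutions, and layer sizes are assumptions.

import torch
import torch.nn as nn

NUM_VIEWS, CODE = 12, 128   # assumed viewgrid size and code dimension

class ViewgridNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(          # single view -> shape code
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, CODE),
        )
        self.decode = nn.Sequential(          # shape code -> all views at once
            nn.Linear(CODE, NUM_VIEWS * 32 * 32),
        )
    def forward(self, view):
        code = self.encode(view)
        grid = self.decode(code).view(-1, NUM_VIEWS, 32, 32)
        return code, grid

net = ViewgridNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

one_view = torch.rand(8, 1, 32, 32)             # a single observed view
true_grid = torch.rand(8, NUM_VIEWS, 32, 32)    # renders from all viewpoints
code, pred_grid = net(one_view)
loss = ((pred_grid - true_grid) ** 2).mean()    # self-supervised reconstruction
opt.zero_grad(); loss.backward(); opt.step()
# After pretraining, `code` is the category-agnostic feature fed to a classifier.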
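Sketch for points 16-18 (audio bases guiding separation): only the test-time step is shown, assuming per-object NMF spectral bases were already learned and assigned by the MIML stage (here they are random placeholders); activations are solved with standard multiplicative updates and each detected object's bases soft-mask the mixture.

import numpy as np

F, T, K = 257, 200, 8                     # freq bins, time frames, bases per object
rng = np.random.default_rng(0)

# Bases assumed already learned for the objects detected in the test video.
bases = {"guitar": rng.random((F, K)), "violin": rng.random((F, K))}
W = np.hstack(list(bases.values()))       # (F, 2K) fixed dictionary
V = rng.random((F, T))                    # magnitude spectrogram of the mixture

# Solve for activations H with W fixed (multiplicative updates, Euclidean cost).
H = rng.random((W.shape[1], T))
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)

# Soft-mask the mixture so each object keeps the energy its bases explain.
full = W @ H + 1e-9
separated = {}
for i, name in enumerate(bases):
    Wi, Hi = W[:, i*K:(i+1)*K], H[i*K:(i+1)*K]
    separated[name] = (Wi @ Hi / full) * V   # estimated spectrogram for this object

print({k: v.shape for k, v in separated.items()})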
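Sketch for points 21-23 (end-to-end active recognition): per-view features, a GRU that fuses evidence across glimpses, and heads for classification and next-motion selection. Glimpse acquisition is a stub, and the policy-gradient training of the action head is omitted; all names and sizes are assumptions.

import torch
import torch.nn as nn

NUM_CLASSES, NUM_ACTIONS, FEAT, HID, STEPS = 10, 5, 64, 128, 3

class ActiveRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, FEAT), nn.ReLU())
        self.fuse = nn.GRUCell(FEAT, HID)        # belief updated view by view
        self.classify = nn.Linear(HID, NUM_CLASSES)
        self.act = nn.Linear(HID, NUM_ACTIONS)   # scores for candidate motions

    def forward(self, get_view):
        h = torch.zeros(1, HID)
        action = torch.tensor([0])               # start with a default motion
        for _ in range(STEPS):
            view = get_view(action)              # environment returns next glimpse
            h = self.fuse(self.features(view), h)
            action = self.act(h).argmax(dim=1)   # greedy next motion (policy stub)
        return self.classify(h)

model = ActiveRecognizer()
logits = model(lambda a: torch.rand(1, 1, 32, 32))   # dummy environment
loss = nn.functional.cross_entropy(logits, torch.tensor([3]))
loss.backward()   # in the full system the action head is trained with REINFORCE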
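Sketch for points 26-28 (observation completion): the agent keeps only a few patches of a panorama, and a completion network must reconstruct the rest; in the real system the selection policy is trained to minimize this error, whereas here selection is a non-differentiable placeholder and everything else is an illustrative assumption.

import torch
import torch.nn as nn

PATCHES, PATCH_DIM, GLIMPSES = 16, 64, 4    # panorama split into 16 patch features

class LookAround(nn.Module):
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(PATCH_DIM, 1)                  # which patch to look at next
        self.complete = nn.Sequential(                        # observed -> full panorama
            nn.Linear(PATCHES * PATCH_DIM, 256), nn.ReLU(),
            nn.Linear(256, PATCHES * PATCH_DIM),
        )

    def forward(self, panorama):
        # panorama: (B, PATCHES, PATCH_DIM); the agent only keeps a few patches.
        scores = self.score(panorama).squeeze(-1)             # (B, PATCHES)
        picked = scores.topk(GLIMPSES, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, picked, 1.0)
        observed = panorama * mask.unsqueeze(-1)              # unseen patches zeroed out
        return self.complete(observed.flatten(1)).view_as(panorama)

model = LookAround()
pano = torch.rand(8, PATCHES, PATCH_DIM)
loss = ((model(pano) - pano) ** 2).mean()     # completion error over the whole scene
loss.backward()
# Hard top-k selection passes no gradient to the scorer; the actual look-around
# policy would be trained (e.g. with reinforcement learning) to reduce this error.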
Knowledge Vault built by David Vivancos 2024