Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Kristen Grauman is a professor at UT Austin researching computer vision and machine learning, focusing on visual recognition and search.
2.-Visual recognition has made exciting progress in recent years, as exemplified by performance on the ImageNet benchmark.
3.-Most visual recognition systems today learn through supervised classification on disembodied web photos, which has limitations.
4.-In contrast, real-world visual data is captured from an agent's first-person perspective with uncontrolled motions, irrelevant clutter, and multimodal sensory input.
5.-The goal is to move towards embodied visual learning that considers learning in the context of an agent's own behavior and observations.
6.-Held and Hein's classic kitten carousel study demonstrated the importance of learning visual representations in the context of an agent's own ego-motion.
7.-Their approach learns an equivariant visual embedding from unlabeled video that is predictive of how the scene will change with ego-motion (a minimal sketch of this objective follows the list).
8.-The learned representation captures semantics, context, geometry and relative depth in order to enable prediction of new viewpoints from a single view.
9.-Using the equivariant representation as an unsupervised pretraining step improves recognition accuracy by 30% while reducing the need for labeled data.
10.-Next they consider how an agent can learn visual representations by actively moving to inspect an object from different views.
11.-A self-supervised task of predicting a complete set of viewpoints from a single view encourages learning of 3D shape semantics (sketched after this list).
12.-The category-agnostic shape representation, called "shape codes", improves recognition accuracy on ModelNet and ShapeNet compared to other unsupervised approaches.
13.-In a related line of work, they recover 3D human body pose from egocentric video by leveraging the correlation with scene motion.
14.-Visual recognition is traditionally studied as a silent, vision-only problem, but in the real world visual observations are coupled with informative multisensory signals like audio.
15.-Their goal is to learn object-specific sound models from unlabeled video where multiple objects are making sounds simultaneously.
16.-They use a deep multi-instance multi-label (MIML) learning framework to disentangle which visual objects make which sounds, based on spectral bases from non-negative matrix factorization (NMF) of the audio; the test-time separation step is sketched after this list.
17.-At test time, they detect objects present in a new video and use the learned audio bases to guide separation of the sound sources.
18.-The approach successfully learns to separate sounds of musical instruments and objects in unlabeled video, outperforming traditional audio source separation.
19.-Remaining challenges include determining when visually detected objects are actually making sound in the video.
20.-Next they discuss learning policies for how agents should move to quickly recognize objects and scenes.
21.-In active recognition, the goal is to learn intelligent action selection, evidence fusion over a sequence of views, and perception.
22.-They propose an end-to-end approach to simultaneously learn the three components by sharing representations, outperforming several recent baselines.
23.-A recurrent neural network fuses evidence over the views and updates the category beliefs to recognize the object in a few glimpses (sketched after this list).
24.-They demonstrate results for active recognition in three scenarios: an agent looking around a scene, manipulating an object, or moving around an object.
25.-However, active recognition assumes a predefined, closed-world task, so next they consider learning generic exploratory policies for new environments.
26.-The idea is to learn policies that actively select a small set of observations allowing reconstruction of the rest of the environment.
27.-This "observation completion" objective encourages efficient, non-myopic exploratory behaviors to quickly reduce uncertainty in new scenes.
28.-Results show the learned look-around policies can reconstruct novel 360° scenes and new objects from very few glimpses.
29.-Preliminary experiments show these task-independent exploratory policies can be transferred to active recognition, performing competitively with closed-world policies.
30.-To summarize, embodied visual learning that exploits unlabeled video, active perception, and interaction leads to more robust, general, and efficient recognition.
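Sketches:
The following code sketches are illustrative only; they assume PyTorch/NumPy and use made-up module names, layer sizes, and random data, not the authors' released code or exact architectures.

Sketch for point 7 (ego-motion equivariance): a minimal version of an equivariance objective in which a learned, motion-specific linear map M_g must carry the embedding of one frame to the embedding of the next, assuming frame pairs with discretized ego-motion labels taken from the agent's own motor signals.

# Assumption: unlabeled video gives frame pairs (x_t, x_t+1) plus a discretized
# ego-motion class g (e.g. "turn left", "move forward"). Names are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small CNN mapping an image to a D-dimensional embedding z(x)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )
    def forward(self, x):
        return self.net(x)

class EquivariantHead(nn.Module):
    """One learned linear map M_g per discretized ego-motion class g."""
    def __init__(self, num_motions=4, dim=64):
        super().__init__()
        self.maps = nn.Parameter(torch.stack([torch.eye(dim) for _ in range(num_motions)]))
    def forward(self, z, g):
        # Apply the motion-specific transform: z_pred = M_g z
        return torch.bmm(self.maps[g], z.unsqueeze(-1)).squeeze(-1)

encoder, head = Encoder(), EquivariantHead()
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

# One illustrative training step on a random batch of frame pairs.
x_before, x_after = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
g = torch.randint(0, 4, (8,))                     # ego-motion class for each pair
z_pred = head(encoder(x_before), g)               # where the embedding *should* move
loss = ((z_pred - encoder(x_after)) ** 2).mean()  # equivariance loss
opt.zero_grad(); loss.backward(); opt.step()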
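Sketch for points 11-12 (viewgrid prediction and shape codes): a toy encoder-decoder that maps a single view to a full grid of viewpoints; the bottleneck vector stands in for the "shape code" reused as an unsupervised feature. Viewgrid size, resolutions, and layer sizes are assumptions.

import torch
import torch.nn as nn

NUM_VIEWS, CODE = 12, 128   # assumed viewgrid size and code dimension

class ViewgridNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(          # single view -> shape code
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, CODE),
        )
        self.decode = nn.Sequential(          # shape code -> all views at once
            nn.Linear(CODE, NUM_VIEWS * 32 * 32),
        )
    def forward(self, view):
        code = self.encode(view)
        grid = self.decode(code).view(-1, NUM_VIEWS, 32, 32)
        return code, grid

net = ViewgridNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

one_view = torch.rand(8, 1, 32, 32)             # a single observed view
true_grid = torch.rand(8, NUM_VIEWS, 32, 32)    # renders from all viewpoints
code, pred_grid = net(one_view)
loss = ((pred_grid - true_grid) ** 2).mean()    # self-supervised reconstruction
opt.zero_grad(); loss.backward(); opt.step()
# After pretraining, `code` is the category-agnostic feature fed to a classifier.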
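Sketch for points 16-18 (audio bases guiding separation): only the test-time step is shown, assuming per-object NMF spectral bases were already learned and assigned by the MIML stage (here they are random placeholders); activations are solved with standard multiplicative updates and each detected object's bases soft-mask the mixture.

import numpy as np

F, T, K = 257, 200, 8                     # freq bins, time frames, bases per object
rng = np.random.default_rng(0)

# Bases assumed already learned for the objects detected in the test video.
bases = {"guitar": rng.random((F, K)), "violin": rng.random((F, K))}
W = np.hstack(list(bases.values()))       # (F, 2K) fixed dictionary
V = rng.random((F, T))                    # magnitude spectrogram of the mixture

# Solve for activations H with W fixed (multiplicative updates, Euclidean cost).
H = rng.random((W.shape[1], T))
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)

# Soft-mask the mixture so each object keeps the energy its bases explain.
full = W @ H + 1e-9
separated = {}
for i, name in enumerate(bases):
    Wi, Hi = W[:, i*K:(i+1)*K], H[i*K:(i+1)*K]
    separated[name] = (Wi @ Hi / full) * V   # estimated spectrogram for this object

print({k: v.shape for k, v in separated.items()})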
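Sketch for points 21-23 (end-to-end active recognition): per-view features, a GRU that fuses evidence across glimpses, and heads for classification and next-motion selection. Glimpse acquisition is a stub, and the policy-gradient training of the action head is omitted; all names and sizes are assumptions.

import torch
import torch.nn as nn

NUM_CLASSES, NUM_ACTIONS, FEAT, HID, STEPS = 10, 5, 64, 128, 3

class ActiveRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, FEAT), nn.ReLU())
        self.fuse = nn.GRUCell(FEAT, HID)        # belief updated view by view
        self.classify = nn.Linear(HID, NUM_CLASSES)
        self.act = nn.Linear(HID, NUM_ACTIONS)   # scores for candidate motions

    def forward(self, get_view):
        h = torch.zeros(1, HID)
        action = torch.tensor([0])               # start with a default motion
        for _ in range(STEPS):
            view = get_view(action)              # environment returns next glimpse
            h = self.fuse(self.features(view), h)
            action = self.act(h).argmax(dim=1)   # greedy next motion (policy stub)
        return self.classify(h)

model = ActiveRecognizer()
logits = model(lambda a: torch.rand(1, 1, 32, 32))   # dummy environment
loss = nn.functional.cross_entropy(logits, torch.tensor([3]))
loss.backward()   # in the full system the action head is trained with REINFORCE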
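Sketch for points 26-28 (observation completion): the agent keeps only a few patches of a panorama, and a completion network must reconstruct the rest; in the real system the selection policy is trained to minimize this error, whereas here selection is a non-differentiable placeholder and everything else is an illustrative assumption.

import torch
import torch.nn as nn

PATCHES, PATCH_DIM, GLIMPSES = 16, 64, 4    # panorama split into 16 patch features

class LookAround(nn.Module):
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(PATCH_DIM, 1)                  # which patch to look at next
        self.complete = nn.Sequential(                        # observed -> full panorama
            nn.Linear(PATCHES * PATCH_DIM, 256), nn.ReLU(),
            nn.Linear(256, PATCHES * PATCH_DIM),
        )

    def forward(self, panorama):
        # panorama: (B, PATCHES, PATCH_DIM); the agent only keeps a few patches.
        scores = self.score(panorama).squeeze(-1)             # (B, PATCHES)
        picked = scores.topk(GLIMPSES, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, picked, 1.0)
        observed = panorama * mask.unsqueeze(-1)              # unseen patches zeroed out
        return self.complete(observed.flatten(1)).view_as(panorama)

model = LookAround()
pano = torch.rand(8, PATCHES, PATCH_DIM)
loss = ((model(pano) - pano) ** 2).mean()     # completion error over the whole scene
loss.backward()
# Hard top-k selection passes no gradient to the scorer; the actual look-around
# policy would be trained (e.g. with reinforcement learning) to reduce this error.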
Knowledge Vault built by David Vivancos 2024