Knowledge Vault 5/74 - CVPR 2022
Embodied Computer Vision
Martial Hebert, Kristen Grauman, Nicholas Roy, Michael Ryoo

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef embodied fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef representation fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef simulation fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
  A[Embodied Computer Vision] --> B[Embodied vision: perception, action combined 1]
  B --> C[Embodied intelligence: energy, info exchange 2]
  B --> D[CV uncertainty crucial for embodiment 3]
  B --> E[Agents actions change encountered data 4]
  A --> F[Visual representations for robot learning 5]
  F --> G[Self-supervised RL losses need framework 6]
  A --> H[Sim-to-real cheaper, challenging transfer 7]
  A --> I[Understand intention, affordance, anticipate actions 8]
  A --> J[Ego4D informs embodied agent behavior 9]
  A --> K[Spatial audio for 3D interaction 10]
  A --> L[Abstraction vs reasoning about dynamics 11]
  A --> M[Photorealistic sim lacks physics, camera 12]
  M --> N[Embodiment changes data distribution continuously 13]
  M --> O[World models simulate, high potential 14]
  M --> P[Simulation benchmarks, sim-to-real challenges 15]
  M --> Q[Simulation scales experience, lacks realism 16]
  M --> R[Simulated environments lack dynamic humans 17]
  A --> S[Assistive robotics: uncertain, overhyped timeline 18]
  A --> T[Self-driving: focused domains before ubiquity 19]
  A --> U[HRI critical, neglected in research 20]
  A --> V[Performance modeling lacking for deployment 21]
  A --> W[Embodiment drives task-specific representation learning 22]
  W --> X[Hierarchical representations for reasoning, generalization 23]
  W --> Y[Language priors for embodied AI 24]
  W --> Z[Composable models to handle complexity 25]
  A --> AA[Robust, generalizable to novel situations 26]
  A --> AB[Integrate vision, audio, touch, etc. 27]
  A --> AC[Lifelong learning from new experiences 28]
  A --> AD[Human-centric sim could accelerate development 29]
  A --> AE[Rethink CV for real-world integration 30]
  class B,C,D,E embodied
  class F,G,W,X,Y,Z representation
  class H,M,N,O,P,Q,R,AD simulation
  class AC,AE future

Resume:

1.- Embodied vision: Vision systems for agents that act purposefully in their environment, not just static systems. Combines perception and action.

2.- Embodied intelligence: Purposeful exchange of energy and information with the environment. Requires thinking about consequences of movement and uncertainty.

3.- Uncertainty in computer vision: Modern CV techniques like deep learning often don't handle uncertainty well, which is crucial for embodied systems.
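
As a concrete illustration (not from the panel), a minimal Monte Carlo dropout sketch in PyTorch: keeping dropout active at inference time and averaging several stochastic forward passes gives a rough predictive-uncertainty signal. The network and all sizes here are hypothetical.

import torch
import torch.nn as nn

# Hypothetical classifier with dropout; sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 10),
)

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: sample stochastic forward passes to get
    both a mean prediction and a spread that signals uncertainty."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])
    return probs.mean(0), probs.std(0)  # mean prediction, per-class spread

x = torch.randn(1, 128)            # stand-in for an image embedding
mean, std = mc_dropout_predict(model, x)
print(mean.argmax(-1), std.max())  # predicted class and a rough confidence signal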

4.- Interaction with environment: In embodied vision, the agent's actions change the environment, objects, and data distribution it encounters.

5.- Robot learning for action policies: Much robotics research neglects advances in computer vision. More work needed on visual representations for RL.

6.- Self-supervised learning for RL: Explored using self-supervised computer vision losses for RL, but encountered difficulties. A better framework is needed.
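
One common way to wire this up (a sketch of a CURL-style auxiliary objective under assumed shapes, not the panelists' method): an InfoNCE contrastive loss over two augmented views of each observation, added as a term to the usual RL loss.

import torch
import torch.nn.functional as F

def contrastive_aux_loss(encoder, obs, augment, temperature=0.1):
    """InfoNCE over two augmented views of the same observations:
    matching views are positives, every other pairing is a negative."""
    z1 = F.normalize(encoder(augment(obs)), dim=-1)
    z2 = F.normalize(encoder(augment(obs)), dim=-1)
    logits = z1 @ z2.t() / temperature    # pairwise cosine similarities
    labels = torch.arange(obs.shape[0])   # view i of obs i matches view i
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Linear(64, 32)                    # stand-in image encoder
augment = lambda x: x + 0.05 * torch.randn_like(x)   # toy augmentation
obs = torch.randn(16, 64)                            # batch of observations
aux = contrastive_aux_loss(encoder, obs, augment)
# total_loss = rl_loss + 0.1 * aux    # weighted into the RL objective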

7.- Sim-to-real transfer: Interaction with the environment is expensive in the real world. Simulation is cheaper but sim-to-real transfer is challenging.
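
A widely used mitigation worth naming here (my example, not the panel's prescription) is domain randomization: resample simulator parameters every episode so the real world looks like one more draw. The `sim` handle and its setters below are hypothetical placeholders, not a real simulator API.

import random

def randomize_sim(sim):
    """Resample appearance and physics each episode so a policy trained
    in simulation treats the real world as just another variation."""
    sim.set_lighting(intensity=random.uniform(0.4, 1.6))      # hypothetical API
    sim.set_camera_noise(std=random.uniform(0.0, 0.05))
    sim.set_friction(random.uniform(0.5, 1.2))
    sim.set_texture(random.choice(["wood", "metal", "fabric"]))

# for episode in range(num_episodes):
#     randomize_sim(sim)               # new physics/appearance draw
#     collect_rollout_and_update(policy, sim)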

8.- Video understanding for human activity: Going beyond recognition to understand human intentions, affordances, and anticipate actions.

9.- Egocentric video datasets: Large egocentric video datasets like Ego4D enable learning from human experience to inform embodied agent behavior.

10.- Embodied audiovisual learning: Important for embodied agents to learn spatial audio to understand 3D environments and interact.
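
For a flavor of what learned spatial audio builds on (an illustrative classical baseline, not something the panel specified): the interaural time difference between two microphones already localizes a source in azimuth.

import numpy as np

def itd_direction(left, right, sr=16_000, mic_distance=0.2, c=343.0):
    """Estimate source azimuth from the interaural time difference:
    the lag that best aligns the two microphone signals."""
    corr = np.correlate(left, right, mode="full")
    lag = (corr.argmax() - (len(right) - 1)) / sr   # seconds; sign gives side
    itd = np.clip(lag * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(itd))               # azimuth in degrees

t = np.arange(800) / 16_000                  # 50 ms of audio at 16 kHz
left = np.sin(2 * np.pi * 440 * t)           # toy source signal
right = np.roll(left, 5)                     # right mic hears it 5 samples later
print(itd_direction(left, right))            # negative angle: source to the left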

11.- Abstracting away low-level control: Some argue abstracted actions are okay if the application allows. Others believe reasoning about dynamics/forces is essential.

12.- Photorealistic simulation: Has improved but still falls short of realism in areas like physics and camera/sensor modeling. Not a silver bullet.

13.- Data distribution shifts: In embodied vision, the visual data distribution continuously changes based on the agent's actions, unlike static datasets.

14.- World models and dreaming: Learning dynamics models from data to simulate environments. Extremely difficult, but high potential if possible.
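
A toy version of the idea (an illustrative sketch; the sizes and the stand-in policy are assumptions): learn a latent dynamics model, then "dream" by rolling the policy out inside the model instead of the real environment.

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Toy latent dynamics model: predicts the next latent state from
    the current latent state and action (sizes are illustrative)."""
    def __init__(self, state_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def dream(model, policy, state, horizon=15):
    """Roll out imagined trajectories entirely inside the learned
    model, never touching the real environment."""
    trajectory = [state]
    for _ in range(horizon):
        action = policy(state)
        state = model(state, action)
        trajectory.append(state)
    return torch.stack(trajectory)

model = DynamicsModel()
policy = nn.Sequential(nn.Linear(32, 4), nn.Tanh())  # stand-in policy
imagined = dream(model, policy, torch.randn(1, 32))
print(imagined.shape)  # (horizon + 1, 1, state_dim)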

15.- Benchmarking and reproducibility: Simulation enables benchmarking embodied vision systems, but reproducibility and sim-to-real remain significant open challenges.

16.- Value of simulation: Blessing for scaling up experience and evaluation. Curse in still lacking complete realism. Important research tool.

17.- Lack of humans in simulators: Current embodied AI simulators lack humans. Dynamic modeling of human behavior in simulated environments is a key opportunity.

18.- Assistive robotics timeline: A steady rollout of robotic systems helping in daily life is expected, but the timeline is uncertain and often overhyped.

19.- Self-driving progress: Significant strides like initial commercial operations, but ubiquitous autonomy still far off. Will emerge in focused domains first.

20.- Human-robot interaction: Often neglected in robotics research in favor of navigation and manipulation. Proper HRI critical for real-world deployment.

21.- Performance modeling and guarantees: Formal methods to model robotic system performance are critical for real-world deployment, but currently lacking.

22.- Task-driven representations: Embodied vision provides a concrete task to drive representation learning, not just accuracy for its own sake.

23.- Hierarchical representations: Potential need for hierarchical, symbolic, abstract representations to enable efficient reasoning and strong generalization.

24.- Language as a prior: Language models may provide useful priors or knowledge for embodied AI, but significant work remains to leverage them.
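
A minimal sketch of one such use (all components are assumed stand-ins: in practice the text encoder would be a pretrained, frozen language model rather than the toy EmbeddingBag here): embed the instruction once and condition the policy on it.

import torch
import torch.nn as nn

# Stand-ins for a frozen language model and a visual encoder.
text_encoder = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=64)
vision_encoder = nn.Linear(128, 64)
policy_head = nn.Sequential(nn.Linear(64 + 64, 64), nn.ReLU(), nn.Linear(64, 4))

def act(image_feat, token_ids):
    """Condition the action on a frozen language embedding of the instruction."""
    with torch.no_grad():                    # keep the language prior frozen
        lang = text_encoder(token_ids)
    fused = torch.cat([vision_encoder(image_feat), lang], dim=-1)
    return policy_head(fused)

action = act(torch.randn(1, 128), torch.randint(0, 10_000, (1, 6)))
print(action.shape)  # (1, action_dim)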

25.- Composable models: Future embodied AI likely requires composable models to handle complexity, similar to early AI paradigms, not just end-to-end neural nets.

26.- Robustness and generalization: Embodied AI systems need to be robust and generalize well to real-world deployment with novel situations.

27.- Integrating multiple modalities: Embodied perception should leverage multiple sensor modalities (vision, audio, touch, etc.) to better understand and act.
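
As a sketch of the simplest integration strategy (late fusion; every dimension here is an illustrative assumption): encode each modality separately, then concatenate before a shared head.

import torch
import torch.nn as nn

# One small encoder per modality; dimensions are illustrative only.
encoders = nn.ModuleDict({
    "vision": nn.Linear(128, 32),
    "audio":  nn.Linear(64, 32),
    "touch":  nn.Linear(16, 32),
})
fusion = nn.Sequential(nn.Linear(3 * 32, 64), nn.ReLU())

def fuse(inputs):
    """Late fusion: encode each modality on its own, then concatenate
    in a fixed key order before a shared fusion layer."""
    feats = [encoders[m](inputs[m]) for m in encoders]
    return fusion(torch.cat(feats, dim=-1))

state = fuse({
    "vision": torch.randn(1, 128),
    "audio":  torch.randn(1, 64),
    "touch":  torch.randn(1, 16),
})
print(state.shape)  # (1, 64) fused state for the policy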

28.- Lifelong learning: Embodied agents have the opportunity to keep learning and adapting over their lifespan as they encounter new experiences.
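
One standard ingredient for this (an illustrative sketch, not the panel's proposal) is experience rehearsal: a reservoir-sampled buffer keeps an unbiased mix of old and new experience to replay alongside fresh data, which helps counter forgetting.

import random

class ReservoirBuffer:
    """Fixed-size replay buffer using reservoir sampling, so every
    experience ever seen has an equal chance of being retained."""
    def __init__(self, capacity=10_000):
        self.items, self.capacity, self.seen = [], capacity, 0

    def add(self, experience):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(experience)
        else:
            i = random.randrange(self.seen)   # uniform over all seen so far
            if i < self.capacity:
                self.items[i] = experience    # evict a random resident

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

buf = ReservoirBuffer(capacity=100)
for step in range(1_000):
    buf.add({"obs": step})    # stand-in experience record
batch = buf.sample(8)         # rehearse old and new experience together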

29.- Simulation for human environments: Photorealistic simulation of human-centric spaces and activities could accelerate development of assistive embodied AI if done well.

30.- Rethinking problem formulations: As embodied AI advances, many existing computer vision problem setups and assumptions may need fundamental rethinking to integrate with real-world systems.

Knowledge Vault built by David Vivancos 2024