Knowledge Vault 5/44 - CVPR 2019
Learning the Depths of Moving People by Watching Frozen People
Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Ce Liu, Bill Freeman and Noah Snavely
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
    classDef depth fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef stereo fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef dataset fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px
    classDef applications fill:#f9d4f9, font-weight:bold, font-size:14px
    A[Learning the Depths of Moving People by Watching Frozen People] --> B[Learning depth of moving people 1]
    A --> C[Classical stereo unsuitable for moving objects 2]
    A --> D[Data-driven approach using Mannequin Challenge dataset 3]
    D --> E[Dataset spans scenes, poses, people 4]
    D --> F[Structure-from-motion, multi-view stereo recover poses, depths 5]
    F --> G[Multi-view stereo depths train neural network 6]
    A --> H[Single-image prediction ignores neighboring frames 7]
    A --> I[Flow between frames converted to depths 8]
    I --> J[Inaccurate moving people depths masked out 9]
    A --> K[Model inputs: RGB, mask, parallax depths, confidence 10]
    K --> L[Network inpaints masked depth, refines scene 11]
    K --> M[Model applied to moving people videos 12]
    M --> N[Outperforms baselines on TUM RGB-D dataset 13]
    M --> O[Qualitative comparison shows model's predictions most similar 14]
    M --> P[Accurate, coherent predictions on internet videos 15]
    P --> Q[Enables defocus, focus pull effects 16]
    P --> R[Synthetic objects inserted, occluded using depth 17]
    P --> S[Novel view synthesis using near-field frames 18]
    P --> T[Human regions inpainted when camera, people move 19]
    A --> U[Code, dataset released on project website 20]
    class B,H,I,J depth
    class C,F,G stereo
    class D,E dataset
    class K,L,M,N,O learning
    class P,Q,R,S,T applications

Resume:

1.- Learning the depth of moving people from a dataset of "frozen people" (the Mannequin Challenge).

2.- Classical stereo algorithms assume a rigid scene, making them unsuitable for moving objects (see the worked equation below).
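
To make the rigid-scene assumption of point 2 concrete, these are the standard two-view relations it rests on (textbook notation, not taken from the talk itself):

\[ x'^{\top} F \, x = 0, \qquad F = K'^{-\top} [t]_{\times} R K^{-1} \]
\[ z = \frac{f \, b}{d} \]

For a static point, corresponding pixels x and x' satisfy the epipolar constraint above, and depth z follows from focal length f, baseline b, and disparity d. A point that moves between the two exposures lands off its epipolar line, so the recovered disparity, and with it z, is meaningless.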

3.- Data-driven approach built on the Mannequin Challenge dataset, in which people hold still while the camera moves.

4.- Dataset spans a wide variety of scenes, poses, and numbers of people.

5.- Structure-from-motion and multi-view stereo recover camera poses and depths.
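
Point 5 in practice: the summary does not name the tooling, but COLMAP is a common open-source way to run such an SfM + MVS pipeline. A minimal sketch, assuming video frames have already been extracted to frames/; the authors' exact tools and settings may differ.

import subprocess

def colmap(*args):
    # Thin wrapper around the COLMAP command-line interface.
    subprocess.run(["colmap", *args], check=True)

db, frames, out = "work/db.db", "frames", "work"
colmap("feature_extractor", "--database_path", db, "--image_path", frames)
colmap("sequential_matcher", "--database_path", db)         # frames are an ordered video
colmap("mapper", "--database_path", db, "--image_path", frames,
       "--output_path", out + "/sparse")                    # SfM: camera poses
colmap("image_undistorter", "--image_path", frames,
       "--input_path", out + "/sparse/0", "--output_path", out + "/dense")
colmap("patch_match_stereo", "--workspace_path", out + "/dense")  # MVS: depth maps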

6.- Multi-view stereo depth maps serve as ground truth for training the neural network (loss sketch below).
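
For point 6, supervision is only valid where MVS produced a depth estimate. A minimal PyTorch sketch of a scale-invariant log-depth loss in the style of Eigen et al.; the paper's full objective also includes gradient-based terms not shown here.

import torch

def scale_invariant_loss(pred_log, gt_log, valid):
    # Scale-invariant MSE in log-depth space (Eigen et al. 2014 form).
    # `valid` marks pixels where multi-view stereo produced a depth value.
    d = (pred_log - gt_log)[valid]
    return (d ** 2).mean() - d.mean() ** 2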

7.- Single-image depth prediction ignores 3D information in neighboring frames.

8.- Optical flow between the reference and neighbor frames is converted to depths using the camera poses (see the triangulation sketch below).
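
A sketch of point 8 under simple assumptions: per-pixel triangulation of reference-frame depth from optical flow and a known relative pose. depth_from_flow is a hypothetical helper, not the authors' code.

import numpy as np

def depth_from_flow(flow, K, R, t):
    # Triangulate per-pixel depth in the reference frame from optical flow
    # to a neighbor frame with relative pose X_nbr = R @ X_ref + t (t: shape (3,)).
    H, W = flow.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    ones = np.ones_like(xs)
    Kinv = np.linalg.inv(K)
    r1 = Kinv @ np.stack([xs, ys, ones]).reshape(3, -1)            # rays, ref cam
    r2 = Kinv @ np.stack([xs + flow[..., 0], ys + flow[..., 1], ones]).reshape(3, -1)
    a = R @ r1                          # ref rays expressed in the neighbor frame
    # Least-squares solve of  z1 * a - z2 * r2 = -t  per pixel (normal equations).
    aa, rr, ar = (a * a).sum(0), (r2 * r2).sum(0), (a * r2).sum(0)
    at, rt = a.T @ t, r2.T @ t
    z1 = (ar * rt - at * rr) / (aa * rr - ar ** 2 + 1e-12)
    return z1.reshape(H, W)             # non-positive values mark failed pixels

The confidence map of point 10 can then be derived from, e.g., forward-backward flow consistency; the paper defines its own measure.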

9.- Depths in human regions, unreliable when people move, are masked out using a person segmentation mask.

10.- Full model inputs: RGB frame, human segmentation mask, masked depths from motion parallax, and a confidence map (stacking sketch below).
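
Points 9 and 10 together amount to channel stacking with the human regions zeroed out, so the network must fill in depth there. A minimal sketch; channel names and ordering are illustrative, not the paper's definition.

import torch

def build_network_input(rgb, human_mask, parallax_depth, confidence):
    # rgb: (3,H,W); human_mask: (1,H,W), 1.0 on people; parallax_depth and
    # confidence: (1,H,W) from the flow-based depth step.
    env = 1.0 - human_mask              # keep only the static environment
    return torch.cat([rgb, human_mask, parallax_depth * env, confidence * env], dim=0)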

11.- Network learns to inpaint masked human depth and refine entire scene depth.

12.- At inference, the model is applied to videos in which both the people and the camera move.

13.- Outperforms RGB-only, motion-stereo, and single-view baselines on the TUM RGB-D dataset.
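
One standard metric for such depth comparisons is scale-invariant RMSE in log space; a sketch is below (the paper's exact evaluation protocol may differ).

import numpy as np

def si_rmse(pred, gt, valid):
    # Scale-invariant RMSE in log-depth space over valid ground-truth pixels.
    d = np.log(pred[valid]) - np.log(gt[valid])
    return np.sqrt(np.mean((d - d.mean()) ** 2))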

14.- Qualitative comparisons show the model's depth predictions are the closest to ground truth.

15.- Accurate and coherent depth predictions on regular internet video clips.

16.- Depth predictions enable visual effects such as synthetic defocus and focus pulls.
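
A toy version of the synthetic-defocus effect from point 16: split the scene into depth layers, blur each layer in proportion to its distance from a chosen focal plane, then composite. OpenCV + NumPy; not the renderer used by the authors.

import numpy as np
import cv2

def synthetic_defocus(image, depth, focal_depth, max_blur=15, n_layers=8):
    img = image.astype(np.float64)
    acc = np.zeros_like(img)
    wsum = np.zeros(depth.shape, dtype=np.float64)
    edges = np.quantile(depth, np.linspace(0.0, 1.0, n_layers + 1))
    span = max(depth.max() - focal_depth, focal_depth - depth.min())
    for lo, hi in zip(edges[:-1], edges[1:]):
        mid = 0.5 * (lo + hi)
        # blur radius grows with distance from the focal plane; kernel must be odd
        k = 2 * int(abs(mid - focal_depth) / (span + 1e-9) * max_blur) + 1
        layer = ((depth >= lo) & (depth <= hi)).astype(np.float64)
        m = cv2.GaussianBlur(layer, (k, k), 0)          # soft layer matte
        acc += cv2.GaussianBlur(img, (k, k), 0) * m[..., None]
        wsum += m
    return (acc / np.maximum(wsum, 1e-6)[..., None]).astype(image.dtype)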

17.- Synthetic objects inserted and occluded using depth predictions.
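
Point 17 reduces to a per-pixel depth test: the inserted object wins only where it is closer than the predicted scene depth. A minimal z-test sketch, assuming the object has been rendered at the frame's resolution.

import numpy as np

def insert_object(frame, scene_depth, obj_rgb, obj_depth, obj_mask):
    # Object is visible only where it sits in front of the predicted scene.
    visible = obj_mask & (obj_depth < scene_depth)
    out = frame.copy()
    out[visible] = obj_rgb[visible]
    return out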

18.- Novel view synthesis of near-field views, with nearby frames used to fill disocclusions.
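
A bare-bones version of point 18: forward-project the reference frame into a nearby virtual camera using its depth map; pixels that land nowhere are disocclusions, which the paper fills from neighboring frames. A sketch without z-buffering (last write wins).

import numpy as np

def render_novel_view(image, depth, K, R, t):
    # Splat each reference pixel into the novel view, X_new = R @ X_ref + t.
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    rays = np.linalg.inv(K) @ np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    proj = K @ (R @ (rays * depth.ravel()) + t.reshape(3, 1))
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros_like(image)          # untouched pixels are disocclusions
    out[v[ok], u[ok]] = image.reshape(-1, image.shape[-1])[ok]
    return out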

19.- Depth in human regions is effectively inpainted even when both the camera and people move freely.

20.- Code and dataset released on the project website.

Knowledge Vault built by David Vivancos 2024