Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Ce Liu, Bill Freeman and Noah Snavely
Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:
graph LR
classDef depth fill:#f9d4d4, font-weight:bold, font-size:14px
classDef stereo fill:#d4f9d4, font-weight:bold, font-size:14px
classDef dataset fill:#d4d4f9, font-weight:bold, font-size:14px
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px
classDef applications fill:#f9d4f9, font-weight:bold, font-size:14px
A[Learning the Depths
of Moving People
by Watching Frozen
People] --> B[Learning depth of
moving people 1]
A --> C[Classical stereo unsuitable
for moving objects 2]
A --> D[Data-driven approach using
Mannequin Challenge dataset 3]
D --> E[Dataset spans scenes,
poses, people 4]
D --> F[Structure-from-motion, multi-view
stereo recover poses, depths 5]
F --> G[Multi-view stereo depths
train neural network 6]
A --> H[Single-image prediction ignores
neighboring frames 7]
A --> I[Flow between frames
converted to depths 8]
I --> J[Inaccurate moving people
depths masked out 9]
A --> K[Model inputs: RGB,
mask, parallax depths, confidence 10]
K --> L[Network inpaints masked
depth, refines scene 11]
K --> M[Model applied to
moving people videos 12]
M --> N[Outperforms baselines on
TUM RGB-D dataset 13]
M --> O[Qualitative comparison shows
model's predictions most similar 14]
M --> P[Accurate, coherent predictions
on internet videos 15]
P --> Q[Enables defocus, focus
pause effects 16]
P --> R[Synthetic objects inserted,
occluded using depth 17]
P --> S[Novel view synthesis
using near-field frames 18]
P --> T[Human regions inpainted
when camera, people move 19]
A --> U[Code, dataset released
on project website 20]
class B,H,I,J depth
class C,F,G stereo
class D,E dataset
class K,L,M,N,O learning
class P,Q,R,S,T applications
Resume:
1.- Learning depth of moving people using frozen people dataset (Mannequin Challenge).
2.- Classical stereo algorithms assume rigid scenes, unsuitable for moving objects.
3.- Data-driven approach using Mannequin Challenge dataset with stationary people.
4.- Dataset spans various scenes, poses, and number of people.
5.- Structure-from-motion and multi-view stereo recover camera poses and depths.
6.- Multi-view stereo depth maps used as ground-truth for training neural network.
7.- Single-image depth prediction ignores 3D information in neighboring frames.
8.- Optical flow between reference and neighbor frames converted to depths using camera poses (see the triangulation sketch after this list).
9.- Inaccurate depths from moving people masked out using segmentation.
10.- Full model inputs: RGB frame, segmentation mask, depths from motion parallax, confidence map (input assembly sketched after this list).
11.- Network learns to inpaint the masked human depth and refine the depth of the entire scene (a toy network sketch follows below).
12.- Model applied to moving people videos during inference.
13.- Outperforms baseline RGB-only, motion-stereo, and single-view methods on the TUM RGB-D dataset (a standard depth error metric is sketched after this list).
14.- Qualitative comparisons show the model's depth predictions are the most similar to ground truth.
15.- Accurate and coherent depth predictions on regular internet video clips.
16.- Depth predictions enable visual effects like synthetic defocus and focus pause (a layered defocus sketch follows below).
17.- Synthetic objects inserted and occluded using depth predictions (occlusion-aware compositing sketched below).
18.- Novel view synthesis, using nearby frames to fill occlusions in near-field views.
19.- Human regions effectively inpainted using depth predictions when camera and people move freely.
20.- Code and dataset released on the project website.
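Point 8 compresses a concrete geometric step: given camera poses recovered by structure-from-motion (points 5-6), dense optical flow between a reference frame and a neighboring frame can be triangulated into per-pixel depth for the static parts of the scene. Below is a minimal sketch using OpenCV; the Farneback flow, the intrinsics `K`, and the relative pose `(R, t)` are stand-ins, since the paper uses its own flow estimator and parallax formulation.

```python
import cv2
import numpy as np

def flow_to_depth(ref_img, src_img, K, R, t):
    """Triangulate per-pixel depth for the reference frame from dense
    optical flow and a known relative camera pose (a generic sketch,
    not the paper's exact formulation)."""
    gray_ref = cv2.cvtColor(ref_img, cv2.COLOR_BGR2GRAY)
    gray_src = cv2.cvtColor(src_img, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_ref, gray_src, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_ref.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts_ref = np.stack([xs.ravel(), ys.ravel()]).astype(np.float64)  # 2xN
    pts_src = pts_ref + flow.reshape(-1, 2).T                        # flow-shifted
    P_ref = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # reference camera
    P_src = K @ np.hstack([R, t.reshape(3, 1)])            # neighbor camera
    X = cv2.triangulatePoints(P_ref, P_src, pts_ref, pts_src)  # 4xN homogeneous
    depth = (X[2] / X[3]).reshape(h, w)   # z-coordinate in the reference frame
    return depth.astype(np.float32)
```

These triangulated depths are only trustworthy where the rigid-scene assumption holds, which is exactly why point 9 masks out the moving people.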
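Points 9-10 describe how the cues are combined into the network input. A minimal sketch of one plausible assembly; the 6-channel layout, the log-depth parameterization, and the normalization are assumptions, not the published preprocessing.

```python
import numpy as np

def build_network_input(rgb, human_mask, parallax_depth, confidence):
    """Stack the cues from point 10 into one H x W x 6 tensor.
    human_mask: bool array, True on people (whose parallax depths
    are unreliable and therefore zeroed out)."""
    env_mask = (~human_mask).astype(np.float32)            # 1 = static scene
    masked_depth = np.where(env_mask > 0, parallax_depth, 0.0)
    masked_conf = np.where(env_mask > 0, confidence, 0.0)
    log_depth = np.log(np.clip(masked_depth, 1e-6, None)) * env_mask
    return np.concatenate([
        rgb.astype(np.float32) / 255.0,                    # 3 channels
        env_mask[..., None],                               # 1 channel
        log_depth[..., None],                              # 1 channel
        masked_conf[..., None],                            # 1 channel
    ], axis=-1)
```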
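Point 11 states the network's job: inpaint depth inside the masked human regions while refining depth everywhere else. The toy PyTorch encoder-decoder below only illustrates the input/output contract; the paper's actual architecture is a much larger hourglass-style network, and the layer counts here are arbitrary.

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Toy encoder-decoder: consumes the 6-channel input from the
    sketch above and predicts full-frame log depth, including the
    masked human regions. Illustrative only."""
    def __init__(self, in_ch=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):               # x: B x 6 x H x W
        return self.dec(self.enc(x))    # B x 1 x H x W log-depth

# usage sketch: pred = DepthRefineNet()(torch.randn(1, 6, 64, 64))
```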
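For the quantitative comparison in point 13, depth-prediction work in this family is typically evaluated with a scale-invariant error in log depth (in the style of Eigen et al.), since monocular depth is only recoverable up to scale. The sketch below computes that metric; whether it matches the paper's exact TUM RGB-D protocol is an assumption.

```python
import numpy as np

def scale_invariant_rmse(pred, gt, valid):
    """Scale-invariant RMSE in log depth over pixels where the
    boolean mask `valid` is True; pred and gt must be positive."""
    d = np.log(pred[valid]) - np.log(gt[valid])
    return np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)
```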
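The synthetic defocus of point 16 follows directly from a per-pixel depth map: blur strength grows with distance from a chosen focal plane. Below is a layered approximation; the layer count, kernel schedule, and function name are illustrative choices, not the paper's rendering pipeline.

```python
import cv2
import numpy as np

def synthetic_defocus(img, depth, focal_depth, max_kernel=21, n_layers=6):
    """Split the scene into depth layers, blur each with a kernel
    that grows with distance from the focal plane, and composite."""
    img_f = img.astype(np.float32)
    out = np.zeros_like(img_f)
    weight = np.zeros(depth.shape, np.float32)
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mid = 0.5 * (lo + hi)
        r = abs(mid - focal_depth) / (edges[-1] - edges[0] + 1e-6)
        k = 1 + 2 * int(round(r * (max_kernel // 2)))   # odd kernel size
        blurred = img_f if k == 1 else cv2.GaussianBlur(img_f, (k, k), 0)
        mask = ((depth >= lo) & (depth <= hi)).astype(np.float32)
        out += blurred * mask[..., None]
        weight += mask
    return (out / np.clip(weight[..., None], 1e-6, None)).astype(np.uint8)
```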
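Point 17's occlusion-aware insertion reduces to a per-pixel depth test: an inserted object's pixel is shown only where the object is closer than the predicted scene depth. A sketch assuming a hypothetical pre-rendered RGBA sprite with its own depth map, aligned with the frame.

```python
import numpy as np

def insert_object(frame, scene_depth, obj_rgba, obj_depth):
    """Depth-test compositing: scene pixels with smaller predicted
    depth occlude the object (obj_rgba/obj_depth are hypothetical
    pre-rendered inputs, same H x W as the frame)."""
    visible = (obj_depth < scene_depth).astype(np.float32)[..., None]
    alpha = (obj_rgba[..., 3:4].astype(np.float32) / 255.0) * visible
    out = (frame.astype(np.float32) * (1 - alpha)
           + obj_rgba[..., :3].astype(np.float32) * alpha)
    return out.astype(np.uint8)
```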
Knowledge Vault built by David Vivancos 2024