Knowledge Vault 5/8 - CVPR 2015
SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite
Shuran Song, Samuel P. Lichtenberg, Jianxiong Xiao
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
    classDef sunrgbd fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef understanding fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef dataset fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef annotations fill:#f9f9d4, font-weight:bold, font-size:14px
    classDef baselines fill:#f9d4f9, font-weight:bold, font-size:14px
    A[SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite] --> B[SunRGBD: large RGB-D dataset, benchmark suite. 1]
    A --> C[Scene understanding: crucial, challenging computer vision task. 2]
    C --> D[Existing datasets: small, mostly 2D labels. 3]
    B --> E[SunRGBD size: 10,000+ images, comparable to PASCAL VOC. 4]
    B --> F[SunRGBD sensors: RealSense, Xtion, Kinect v1/v2, differing attributes. 5]
    B --> G[Dense annotations: 2D/3D labels, orientation, room layout. 6]
    G --> H[Data collection: challenging, extensive worldwide effort. 7]
    G --> I[Annotation tools: custom 2D/3D interfaces for dense labeling. 8]
    B --> J[Object categories: diverse indoor objects, chairs most common. 9]
    A --> K[Benchmark tasks: 6 scene understanding tasks evaluated. 10]
    K --> L[Scene classification: deep features beat hand-crafted, RGB-D improves. 11]
    K --> M[Semantic segmentation: per-pixel category, nearest neighbor, optical flow baselines. 12]
    K --> N[2D detection: box and category, limited for reasoning. 13]
    K --> O[3D detection: location, dimensions, orientation key for interactions. 14]
    K --> P[Layout estimation: infers 3D room geometry, challenging. 15]
    P --> Q[Layout baselines: convex hull, Manhattan box, single-view geometry. 16]
    P --> R[Layout evaluation: 3D IoU, not 2D segmentation. 17]
    K --> S[Holistic understanding: joint objects and layout prediction. 18]
    F --> T[Sensor details: RealSense low quality, Kinect v2 accurate. 19]
    B --> U[Additional data: distinct 3D Objects, SUN3D frames added. 20]
    C --> V[Object orientation: 3D pose estimation for interaction understanding. 21]
    B --> W[Object distribution: naturalistic, long-tailed, chairs, sofas, tables common. 22]
    K --> X[Detection metrics: precision-recall 2D/3D boxes, 3D IoU proposed. 23]
    K --> Y[Free space evaluation: objects and room space considered. 24]
    S --> Z[Holistic approaches: four methods combine detections and layout. 25]
    B --> AA[Limitations: 2-3 images per scene, multi-view future work. 26]
    B --> AB[Funding: Intel gift funds, data and code released. 27]
    H --> AC[Data gathering: portable rig with laptop, sensors, batteries. 28]
    I --> AD[Labeling: Mechanical Turk initial 3D labels, researcher verified. 29]
    A --> AE[Impact: fuels RGB-D understanding advances, aids interior design. 30]
    class A,B,E,F,J,U,W,AA,AB,AC dataset
    class C,V understanding
    class D limitations
    class G,H,I,AD annotations
    class K,L,M,N,O,P,Q,R,S,X,Y,Z baselines
    class T sensor
    class AE impact

Resume:

1.- SUN RGB-D: A large RGB-D scene understanding dataset and benchmark suite introduced by Princeton researchers.

2.- Scene understanding: A crucial but challenging computer vision task that benefits from RGB-D sensors.

3.- Existing RGB-D datasets: Too small (e.g. NYU Depth v2) to train data-hungry algorithms, and most provide only 2D labels.

4.- SUN RGB-D size: Over 10,000 RGB-D images, comparable in scale to the PASCAL VOC dataset.

5.- SUN RGB-D sensors: Captured with Intel RealSense, Asus Xtion, and Kinect v1 and v2, each with different resolution and depth quality.

6.- Dense annotations: 2D segmentation, 3D bounding boxes, object orientation, and room layout labeled for each image.
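
Concretely, each image's labels can be thought of as one record bundling the 2D segmentation map, a list of oriented 3D boxes, and the room layout. The sketch below is only an illustration of such a record; the field names and coordinate conventions are assumptions, not the official SUN RGB-D toolbox format.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectAnnotation:
    """One labeled object; illustrative fields, not the released annotation format."""
    category: str                 # e.g. "chair"
    centroid: np.ndarray          # (3,) 3D box center in room coordinates (meters)
    size: np.ndarray              # (3,) box dimensions: length, width, height
    yaw: float                    # rotation about the gravity (up) axis, radians

@dataclass
class FrameAnnotation:
    """Per-image labels: 2D segmentation, oriented 3D boxes, room layout polygon."""
    segmentation: np.ndarray      # (H, W) per-pixel category ids
    objects: list[ObjectAnnotation] = field(default_factory=list)
    layout_corners: np.ndarray | None = None   # (N, 3) 3D floor-outline polygon
```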

7.- Data collection: Challenging, requiring extensive effort to capture RGB-D images across many locations worldwide.

8.- Annotation tools: Custom 2D and 3D interfaces used to densely label objects, orientations and room geometry.

9.- Object categories: A diverse set of indoor objects, with chairs being the most common category; these statistics can also inform applications such as furniture selection.

10.- Benchmark tasks: Evaluates six scene understanding tasks, including scene classification, semantic segmentation, 2D and 3D object detection, object orientation estimation, and room layout estimation.

11.- Scene classification baselines: Deep learning features outperform hand-crafted ones. RGB-D improves over just RGB.
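
A minimal sketch of the RGB vs. RGB-D comparison, assuming precomputed CNN features for the color image and an encoded depth image (the paper's exact feature extractor and classifier may differ): concatenate the two feature vectors and train a linear classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def classify_scenes(rgb_feats, depth_feats, labels, train_idx, test_idx):
    """Late-fusion baseline: concatenate RGB and depth CNN features, linear SVM.
    rgb_feats/depth_feats: (N, D) arrays from any pretrained CNN; illustrative only."""
    fused = np.concatenate([rgb_feats, depth_feats], axis=1)
    clf = LinearSVC(C=1.0).fit(fused[train_idx], labels[train_idx])
    pred = clf.predict(fused[test_idx])
    return (pred == labels[test_idx]).mean()   # mean accuracy over test images
```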

12.- Semantic segmentation: Predicts the per-pixel object category. Nearest-neighbor and optical-flow baselines are evaluated.
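
Segmentation baselines are typically scored by per-pixel and mean per-class accuracy over the labeled categories; a small sketch under the assumption of integer label maps with 0 marking unlabeled pixels:

```python
import numpy as np

def segmentation_accuracy(pred, gt, num_classes, ignore_id=0):
    """Per-pixel and mean per-class accuracy; pred/gt are (H, W) integer label maps."""
    valid = gt != ignore_id
    pixel_acc = (pred[valid] == gt[valid]).mean()
    class_accs = []
    for c in range(1, num_classes + 1):
        mask = gt == c
        if mask.any():                                  # skip classes absent from gt
            class_accs.append((pred[mask] == c).mean())
    return pixel_acc, (float(np.mean(class_accs)) if class_accs else 0.0)
```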

13.- 2D object detection: Provides a bounding box and category label, but is inadequate for reasoning about how objects are used.

14.- 3D object detection: Outputs 3D location, dimensions and orientation - key for understanding object interactions.
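
A 3D detection can be parameterized as a gravity-aligned box: a 3D center, three dimensions, and a yaw angle about the vertical axis. The corner computation below is a sketch under assumed coordinate conventions (z up), not the dataset toolbox's exact code.

```python
import numpy as np

def box_corners_3d(center, size, yaw):
    """Eight corners of a gravity-aligned 3D box.
    center: (3,) x, y, z; size: (3,) length, width, height; yaw about the z (up) axis."""
    l, w, h = size
    # Corner offsets in the object's local frame, centered at the origin.
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    z = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])   # rotation about the up axis
    return (rot @ np.stack([x, y, z])).T + np.asarray(center)   # (8, 3) corners
```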

15.- Room layout estimation: Infers 3D geometry of walls, floor, ceiling. Challenging due to complex real-world room shapes.

16.- Layout baselines: Convex-hull and Manhattan-box heuristics compared against a single-view geometry approach.

17.- Layout evaluation: 3D free-space IoU is used instead of treating layout as a 2D segmentation problem.
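
One way to realize this metric is to voxelize the 3D view volume, mark which voxels are free (inside the predicted or ground-truth room but outside all object boxes), and take the intersection-over-union of the two boolean grids; a minimal sketch assuming those grids are already computed:

```python
import numpy as np

def free_space_iou(pred_free, gt_free):
    """IoU between two boolean voxel grids marking free space in the view volume."""
    inter = np.logical_and(pred_free, gt_free).sum()
    union = np.logical_or(pred_free, gt_free).sum()
    return inter / union if union > 0 else 0.0
```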

18.- Holistic scene understanding: Joint prediction of object bounding boxes and room layout.

19.- Sensor details: RealSense has low raw depth quality, which is improved by averaging multiple frames. Kinect v2 is more accurate but has regions of missing depth.

20.- Additional data: Hand-selected distinct frames from Berkeley 3D Objects and SUN3D datasets added and re-annotated.

21.- Object orientation: Estimate 3D object pose, important for understanding how to interact with objects.
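
For gravity-aligned objects the orientation reduces to a yaw angle, and prediction quality can be summarized by the smallest angular difference to the ground truth; a small sketch of that measure (the paper's exact evaluation protocol may differ):

```python
import numpy as np

def yaw_error_deg(pred_yaw, gt_yaw):
    """Smallest angular difference between two yaw angles (radians), in degrees."""
    diff = np.abs(pred_yaw - gt_yaw) % (2 * np.pi)
    return np.degrees(np.minimum(diff, 2 * np.pi - diff))
```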

22.- Object distribution: The dataset has a naturalistic, long-tailed category distribution, with many examples of chairs, sofas, tables, etc.

23.- Detection metrics: Standard precision-recall for 2D and 3D bounding boxes. 3D free space IoU also proposed.
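
A simplified sketch of the precision-recall computation for one category: rank detections by confidence, greedily match each to an unmatched ground-truth box above an IoU threshold (a looser threshold such as 0.25 is common for 3D boxes), and integrate precision over recall. Interpolation and accumulation across images are omitted here.

```python
import numpy as np

def average_precision(scores, ious, iou_thresh=0.25):
    """AP for one category on one image.
    scores: (N,) detection confidences; ious: (N, M) IoU of each detection vs. each GT box."""
    order = np.argsort(-scores)                    # highest-confidence detections first
    matched = np.zeros(ious.shape[1], dtype=bool)
    tp, fp = np.zeros(len(scores)), np.zeros(len(scores))
    for rank, i in enumerate(order):
        if ious.shape[1] and ious[i].max() >= iou_thresh and not matched[ious[i].argmax()]:
            matched[ious[i].argmax()] = True       # best-overlapping GT box is consumed
            tp[rank] = 1
        else:
            fp[rank] = 1                           # duplicate or low-overlap detection
    recall = np.cumsum(tp) / max(ious.shape[1], 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # Step-wise area under the precision-recall curve.
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))
```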

24.- Free space evaluation: Considers objects and the room together: the space inside the room but outside any object.

25.- Holistic understanding approaches: Four simple methods to combine 3D object detections and room layout. Details in paper.

26.- Limitations: Each scene represented by 2-3 images without overlap. Exploring multi-view RGB-D is future work.

27.- Funding: Project supported by Intel gift funds. Data and code released to public.

28.- Data gathering interfaces: A laptop on a cart, sensors on stabilizers, and batteries in a backpack formed the portable capture rig.

29.- Labeling effort: Amazon Mechanical Turk workers provided the initial 3D annotations, which were later verified by the researchers.

30.- Impact: Provides data to fuel advances in RGB-D scene understanding algorithms; can also aid interior design applications.

Knowledge Vault built by David Vivancos 2024