Knowledge Vault 5 /40 - CVPR 2018
DensePose: Dense Human Pose Estimation in the Wild
Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef pose fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef dataset fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef architecture fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef applications fill:#f9d4f9, font-weight:bold, font-size:14px
  A[DensePose: Dense Human Pose Estimation in the Wild] --> B[DensePose: human pose estimation, mapping pixels to 3D model. 1]
  B --> C[Beyond keypoints, provides dense pixel-mesh correspondences. 2]
  B --> D[Body surface partitioned into patches with local coordinates. 3]
  A --> E[Constructed large-scale dataset with manual image-surface annotations. 4]
  E --> F[Efficient two-stage annotation: segmentation, point mapping to 3D. 5]
  E --> G[Evaluated annotation accuracy using synthetic data, visual cues. 6]
  A --> H[Discriminative training learns dense pose from annotated dataset. 7]
  A --> I[DensePose R-CNN: real-time estimation, multi-frame per second. 8]
  I --> J[Predicts part classification, U/V coordinate regression within parts. 9]
  I --> K[Evaluated using geodesic distance metrics for correspondence accuracy. 10]
  I --> L[Outperforms model-fitting like SMPLify, much faster. 11]
  I --> M[Real annotated images superior to fitted/synthetic data. 12]
  I --> N[Multi-task learning, cross-task connections boost performance substantially. 13]
  I --> O[Robust to scale, occlusion, appearance, smooth over video. 14]
  I --> P[Handles multiple people simultaneously, runs real-time on GPU. 15]
  A --> Q[Applications: transferring textures densely from 3D model to images. 16]
  A --> R[Code, dataset publicly available for dense pose research. 17]
  A --> S[Challenges announced for ECCV 2018. 18]
  B --> T[Focuses on template correspondences, not specific 3D pose/shape. 19]
  I --> U[Keypoint detection auxiliary task boosts dense pose performance. 20]
  I --> V[Cross-talk between network heads, keypoints to dense pose, helps. 21]
  I --> W[Hands, face, feet most accurate. Torso less distinctive, higher errors. 22]
  B --> X[Trained to correspond pixels to body, even when obscured. 23]
  I --> Y[Introduced Geodesic Point Similarity GPS, extends OKS to dense. 24]
  I --> Z[Larger backbone ResNet-101 vs 50 diminishing accuracy-speed trade-off. 25]
  B --> AA[Two-step correspondence: part labels, then U-V within parts. 26]
  I --> AB[End-to-end training with dense supervision, no test-time fitting. 27]
  I --> AC[Single system: bounding box, keypoint, masking, dense pose estimation. 28]
  I --> AD[Visualizes part segmentations, U-V fields to assess performance, failure modes. 29]
  B --> AE[Opens new possibilities for detailed human understanding. 30]
  class A main
  class B,C,D,T,X,AA,AE pose
  class E,F,G,R dataset
  class H,I,J,K,L,M,N,O,P,U,V,W,Y,Z,AB,AC,AD architecture
  class Q,S applications

Resume:

1.- DensePose estimates dense human pose by mapping image pixels to a 3D surface model of the body.

2.- It extends beyond keypoint-based pose estimation to provide correspondences between all human pixels and thousands of mesh points.

3.- The human body surface is partitioned into patches, each associated with local UV coordinates.
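
As a minimal sketch of this representation, assuming the 24-part surface partition used by the paper, each pixel can be described by a part index I and local (U, V) coordinates; the array names below are illustrative, not the authors' code.

```python
import numpy as np

# Hypothetical IUV maps for one person crop of size H x W.
H, W = 256, 256
NUM_PARTS = 24                          # surface patches; index 0 is background

I = np.zeros((H, W), dtype=np.uint8)    # part index per pixel (0..NUM_PARTS)
U = np.zeros((H, W), dtype=np.float32)  # local U coordinate in [0, 1] within the part
V = np.zeros((H, W), dtype=np.float32)  # local V coordinate in [0, 1] within the part

# A pixel assigned to, say, patch 15 at surface location (0.37, 0.62):
I[100, 120], U[100, 120], V[100, 120] = 15, 0.37, 0.62
```

Together, the (I, U, V) triplet pins every foreground pixel to a unique point on the template surface.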

4.- A large-scale dataset was constructed with manual annotations of image-to-surface correspondences on 50,000 COCO images.

5.- An efficient two-stage annotation pipeline was used, first segmenting parts then mapping sampled points to the 3D surface.
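
A hypothetical sketch of what one resulting annotation record might contain, assuming it stores the stage-one part segmentation and the stage-two point-to-surface correspondences; all field names are invented for illustration.

```python
# One image's annotation after the two-stage pipeline (structure is assumed).
annotation = {
    "image_id": 123456,
    "part_segmentation": None,   # stage 1: per-part masks for the person
    "sampled_points": [
        # stage 2: each sampled pixel mapped to a point on the 3D surface
        {"x": 210, "y": 148, "part": 2, "u": 0.41, "v": 0.77},
        {"x": 305, "y": 220, "part": 9, "u": 0.12, "v": 0.58},
    ],
}
```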

6.- Annotation accuracy was evaluated using synthetic data, showing that prominent visual cues enable precise labeling.

7.- Discriminative training is used to learn dense pose estimation from the large annotated dataset.

8.- The DensePose R-CNN architecture performs real-time dense pose estimation, processing video at multiple frames per second.

9.- Three outputs are predicted per pixel: a part classification label, plus regressed U and V coordinates within the assigned part.
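
A minimal PyTorch-style sketch of such a head, assuming ROI-aligned input features and illustrative channel sizes rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DensePoseHead(nn.Module):
    """Illustrative head operating on ROI-aligned features (assumed 256 channels)."""
    def __init__(self, in_channels=256, num_parts=25):  # 24 parts + background
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.part_logits = nn.Conv2d(512, num_parts, 1)  # per-pixel part classification
        self.u_coords = nn.Conv2d(512, num_parts, 1)     # U regressed per candidate part
        self.v_coords = nn.Conv2d(512, num_parts, 1)     # V regressed per candidate part

    def forward(self, roi_features):
        x = self.trunk(roi_features)
        return self.part_logits(x), self.u_coords(x), self.v_coords(x)

# Usage on a batch of ROI-aligned feature maps (e.g. 14 x 14):
head = DensePoseHead()
parts, u, v = head(torch.randn(2, 256, 14, 14))
```

Predicting a separate U/V channel per part lets the regression specialize to each patch's local chart.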

10.- Evaluation is done using geodesic distance metrics measuring correspondence accuracy between image points and the surface.
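
One such pointwise criterion can be sketched as the fraction of annotated points whose geodesic error on the surface falls below a threshold; the helper below is a hypothetical illustration that assumes the geodesic distances have already been computed on the mesh.

```python
import numpy as np

def ratio_of_correct_points(geodesic_errors, threshold=10.0):
    """Fraction of annotated points whose predicted surface location lies
    within a geodesic distance threshold of the ground-truth location."""
    errors = np.asarray(geodesic_errors, dtype=np.float32)
    return float(np.mean(errors < threshold))

# Example: geodesic errors for a handful of annotated points.
print(ratio_of_correct_points([2.1, 7.4, 15.0, 30.2, 4.8]))  # -> 0.6
```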

11.- DensePose shows large improvements over model-fitting approaches like SMPLify while being much faster.

12.- Training on real annotated images (DensePose-COCO) gives superior results compared to training on fitted or synthetic data.

13.- Architectural choices were analyzed, finding that multi-task learning and cross-task connections substantially boost performance.

14.- Qualitative results demonstrate robustness to scale, occlusion, appearance variation, and smooth predictions over video sequences.

15.- The system handles multiple people simultaneously and runs in real-time on a single GPU.

16.- Potential applications are shown, like transferring textures densely from the 3D model to images.
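
A rough numpy sketch of the texture-transfer idea, assuming a per-part texture atlas and a predicted IUV map; the atlas layout and names are assumptions for illustration only.

```python
import numpy as np

def transfer_texture(I, U, V, atlas):
    """Paint each foreground pixel with the atlas texel addressed by its
    predicted (part, u, v) surface coordinates.
    atlas: (num_parts, T, T, 3) array, one T x T texture per body part."""
    H, W = I.shape
    num_parts, T, _, _ = atlas.shape
    out = np.zeros((H, W, 3), dtype=atlas.dtype)
    fg = I > 0                                   # 0 is background
    tu = np.clip((U[fg] * (T - 1)).astype(int), 0, T - 1)
    tv = np.clip((V[fg] * (T - 1)).astype(int), 0, T - 1)
    out[fg] = atlas[I[fg] - 1, tv, tu]           # part indices run 1..num_parts
    return out

# Usage with random placeholder inputs:
I = np.random.randint(0, 25, (64, 64))
U = np.random.rand(64, 64).astype(np.float32)
V = np.random.rand(64, 64).astype(np.float32)
atlas = np.random.rand(24, 128, 128, 3).astype(np.float32)
rendered = transfer_texture(I, U, V, atlas)
```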

17.- Code and dataset are made publicly available to encourage further research on the dense pose estimation problem.

18.- DensePose-COCO and DensePose-PoseTrack challenges are announced for ECCV 2018.

19.- The approach focuses on correspondences to a template shape, not estimating a specific 3D pose and shape for each image.

20.- Keypoint detection as an auxiliary task provides the largest boost to dense pose estimation performance.

21.- Cross-talk between different network heads, especially from keypoints to dense pose, helps the model significantly.
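
One way such a cross-task connection could look, sketched in PyTorch under the assumption that keypoint-branch features are resampled and fused into the dense-pose branch; channel sizes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossTalkFusion(nn.Module):
    """Illustrative cross-task connection: keypoint-head activations are
    re-injected into the dense-pose branch before its final predictors."""
    def __init__(self, dp_channels=512, kp_channels=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(dp_channels + kp_channels, dp_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, densepose_feats, keypoint_feats):
        # Resize keypoint features to the dense-pose grid and concatenate.
        kp = nn.functional.interpolate(
            keypoint_feats, size=densepose_feats.shape[-2:],
            mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([densepose_feats, kp], dim=1))
```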

22.- Hands, face, and feet have the most accurate correspondences, while less visually distinctive areas like the torso have higher errors.

23.- The system is trained to correspond pixels to the underlying body even when obscured by clothes and accessories.

24.- A per-instance evaluation measure called Geodesic Point Similarity (GPS) is introduced, extending OKS from keypoints to dense correspondence.
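
The per-point score is a Gaussian of the geodesic distance between predicted and ground-truth surface points, averaged over an instance's annotated points; a minimal sketch follows, where the normalization constant kappa is an assumed value to be checked against the official evaluation code.

```python
import numpy as np

def geodesic_point_similarity(geodesic_errors, kappa=0.255):
    """GPS for one person instance: mean of exp(-g^2 / (2 * kappa^2)) over the
    geodesic errors g of its annotated points. kappa = 0.255 is an assumed
    normalization constant; verify against the official evaluation toolkit."""
    g = np.asarray(geodesic_errors, dtype=np.float64)
    return float(np.mean(np.exp(-(g ** 2) / (2.0 * kappa ** 2))))

# Example: a perfect point contributes 1.0, distant points decay toward 0.
print(geodesic_point_similarity([0.0, 0.1, 0.5]))
```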

25.- Using a larger backbone network (ResNet-101 vs. ResNet-50) gives diminishing returns in the accuracy-speed trade-off.

26.- Image-to-surface correspondence is established in two steps: assigning part labels, then regressing U-V coordinates within parts.
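
At inference time the two steps can be combined as below: pick the most likely part per pixel, then read U and V from that part's channels. This is a PyTorch sketch under assumed tensor shapes, not the released implementation.

```python
import torch

def decode_iuv(part_logits, u_maps, v_maps):
    """Shapes assumed: (num_parts, H, W) for each input, channel 0 = background."""
    I = part_logits.argmax(dim=0)                             # step 1: part labels
    U = torch.gather(u_maps, 0, I.unsqueeze(0)).squeeze(0)    # step 2: U of chosen part
    V = torch.gather(v_maps, 0, I.unsqueeze(0)).squeeze(0)    # step 2: V of chosen part
    fg = (I > 0).float()                                      # zero out background pixels
    return I, U.clamp(0, 1) * fg, V.clamp(0, 1) * fg
```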

27.- The model is trained end-to-end using dense correspondence as supervision, without any model fitting at test time.
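
A simplified sketch of such dense supervision, assuming the losses are evaluated at the annotated points only: cross-entropy for the part label and smooth-L1 for U/V read from the ground-truth part's channel. The exact masking and weighting are assumptions; see the paper for the full recipe.

```python
import torch
import torch.nn.functional as F

def densepose_point_losses(part_logits, u_maps, v_maps,
                           xs, ys, gt_parts, gt_u, gt_v):
    """Shapes assumed: part_logits / u_maps / v_maps are (C, H, W);
    xs, ys, gt_parts are integer tensors of length N; gt_u, gt_v are floats in [0, 1]."""
    logits_at_pts = part_logits[:, ys, xs].t()                # (N, C)
    cls_loss = F.cross_entropy(logits_at_pts, gt_parts)       # part label at each point
    u_at_pts = u_maps[gt_parts, ys, xs]                       # U from the GT part channel
    v_at_pts = v_maps[gt_parts, ys, xs]                       # V from the GT part channel
    uv_loss = F.smooth_l1_loss(u_at_pts, gt_u) + F.smooth_l1_loss(v_at_pts, gt_v)
    return cls_loss, uv_loss
```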

28.- A single system can perform multiple tasks, including bounding-box and keypoint detection, instance mask prediction, and dense pose estimation.

29.- Part segmentations and U-V fields predicted by the system are visualized to qualitatively assess performance and failure modes.

30.- Dense pose estimation opens up new possibilities for detailed human understanding beyond sparse keypoints.

Knowledge Vault built by David Vivancos 2024