The End Of Knowledge - Vault 5/55 - CVPR - 2020 - Weakly-supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects

graph LR
classDef pose fill:#f9d4d4, font-weight:bold, font-size:14px
classDef data fill:#d4f9d4, font-weight:bold, font-size:14px
classDef method fill:#d4d4f9, font-weight:bold, font-size:14px
classDef results fill:#f9f9d4, font-weight:bold, font-size:14px
A[Weakly-supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects] --> B[Hand pose prediction successful hand-only, not hand-object. 1]
A --> C[Goal: hand pose prediction with object interactions. 2]
A --> D[Previous work limited by dataset issues. 3]
A --> E[Comparing real datasets: limited quantity, annotations. 4]
A --> F[Synthetic datasets differ from real appearance. 5]
F --> G[Synthetic data causes real prediction failure. 6]
A --> H[Marker-based mocap sets have gaps. 7]
A --> I[Small real datasets require manual effort. 8]
A --> J[Video game datasets lack 3D annotations. 9]
A --> K[Combining real and synthetic data. 10]
K --> L[Bridge limited real, abundant synthetic data. 11]
A --> M[Pipeline: image supervision, propagation, 3D skeleton supervision. 12]
M --> N[Cycle-consistency maps to original image space. 13]
M --> O[3D pose supervision via differentiable rendering. 14]
M --> P[Generator, discriminator, differentiable renderer. 15]
P --> Q[Generate synthetic hands-only image. 16]
P --> R[Map synthetic to real using generator network. 17]
P --> S[GAN synthesizes mixed image, preserves hand structure. 18]
P --> T[Predict 3D pose from mixed image. 19]
P --> U[Discriminator enforces GAN objective at image level. 20]
A --> V[Leverages existing RGB, synthetic hand-object datasets. 21]
A --> W[Fine-tunes on small real 3D datasets. 22]
A --> X[Optional: utilize 3D annotated hand-object interaction datasets. 23]
A --> Y[Weakly-supervised domain adaptation bridges synthetic-real gap. 24]
A --> Z[Maintains hands-only performance, generalizes to hand-object. 25]
A --> AA[Qualitative results visualize inputs, predictions. 26]
A --> AB[Tests demonstrate methods effectiveness on datasets. 27]
AC --> AD[Trains pose estimator without large real datasets. 29]
A --> AC[Combines 2D estimation, GANs, differentiable rendering advances. 28]
A --> AE[Enables 3D hand pose progress under object interactions. 30]
class A,B,C,Z,AE pose
class D,E,F,G,H,I,J,K,L,V,W,X,AA data
class M,N,O,P,Q,R,S,T,U,Y,AC,AD method
class AB results

Summary:

1.- Hand pose prediction successful in hand-only scenarios, but not yet with hand-object interactions.

2.- Goal is to enable successful hand pose prediction with object interactions.

3.- Previous work was limited by the difficulty of building new hand-object datasets.

4.- Comparing the Dexter+Object and EgoDexter datasets: real sequences, but limited in quantity and in 3D annotations.

5.- HO-3D dataset has fuller 3D annotations but synthetic appearance differs from real.

6.- Using synthetic data causes hand pose prediction to fail on real sequences.

7.- Marker-based mocap datasets include RGB images when sensors are used, but gaps remain.

8.- Small real datasets built via 3D model fitting and manual refinement have better quality but require manual effort.

9.- Large datasets from video games remain problematic due to lack of 3D annotations.

10.- Combining small real datasets with larger synthetic ones through 3D model fitting.

11.- Need to bridge gap between limited real data and abundant synthetic data.

12.- Main idea: Utilize image-level supervision in RGB images, propagate to hands-only images, then to 3D skeleton supervision.

13.- Use cycle-consistency to map back to original image space.

14.- Obtain 3D pose supervision via differentiable rendering.

15.- Pipeline involves generator, discriminator, and differentiable renderer.

16.- Generate synthetic hands-only image X' by rendering 3D hand mesh.

17.- Map synthetic X' to real image X using generator network.

18.- Train GAN to synthesize new mixed image X'' preserving hand structure.

19.- Predict 3D hand mesh and pose from X'' using 2D pose estimator and differentiable renderer.

20.- Train discriminator network to enforce GAN objective at image level.

21.- Leverage existing RGB hand pose datasets and synthetic hand-object images to train full pipeline.

22.- Fine-tune on small datasets with real 3D pose annotations.

23.- Optionally utilize datasets with 3D annotations and hand-object interactions if available.

24.- Weakly-supervised domain adaptation helps bridge gap between synthetic and real data.

25.- Maintain performance on hands-only benchmark while enabling generalization to hand-object scenarios.

26.- Qualitative results visualize inputs, initial mesh prediction, translated full image, and final mesh prediction.

27.- Tests on the HO-3D, EgoDexter, and Dexter+Object datasets demonstrate the method's effectiveness.

28.- Combines advances in 2D pose estimation, GANs, and differentiable rendering.

29.- Allows training pose estimator without requiring large annotated real datasets.

30.- Enables progress on challenging problem of 3D hand pose estimation under object interactions.
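
The image-level objectives in points 13, 18 and 20 can be illustrated with a small sketch. It is not the authors' implementation: the summary does not say which GAN objective or loss weights were used, so the binary cross-entropy form, the L1 cycle penalty, and the `cycle_weight` default are assumptions borrowed from common image-translation practice. The toy mappings stand in for learned generator networks, and images are plain lists of floats rather than tensors.

```python
import math
from typing import Callable, List, Sequence

Image = List[float]  # flattened toy image; a real pipeline would use tensors
Mapping = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute pixel difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(x: Image, forward: Mapping, backward: Mapping) -> float:
    """Penalise the difference between x and its round trip back to the original space."""
    return l1_distance(x, backward(forward(x)))


def binary_cross_entropy(prob: float, label: float) -> float:
    """Cross-entropy for one discriminator output, clamped for numerical safety."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(scores_real: Sequence[float], scores_fake: Sequence[float]) -> float:
    """Discriminator should output 1 on real images and 0 on generated ones."""
    real = sum(binary_cross_entropy(s, 1.0) for s in scores_real) / len(scores_real)
    fake = sum(binary_cross_entropy(s, 0.0) for s in scores_fake) / len(scores_fake)
    return real + fake


def generator_adversarial_loss(scores_fake: Sequence[float]) -> float:
    """Generator is rewarded when the discriminator scores its images as real."""
    return sum(binary_cross_entropy(s, 1.0) for s in scores_fake) / len(scores_fake)


def generator_objective(x: Image, forward: Mapping, backward: Mapping,
                        scores_fake: Sequence[float], cycle_weight: float = 10.0) -> float:
    """Adversarial term plus weighted cycle term (the weight is an assumed hyperparameter)."""
    return (generator_adversarial_loss(scores_fake)
            + cycle_weight * cycle_consistency_loss(x, forward, backward))


# Hypothetical stand-ins for the two learned image-to-image mappings.
def toy_forward(img: Image) -> Image:
    return [2.0 * p for p in img]


def toy_inverse(img: Image) -> Image:
    return [0.5 * p for p in img]


def toy_leaky_inverse(img: Image) -> Image:
    return [0.5 * p + 0.1 for p in img]


x = [0.0, 0.25, 0.5, 1.0]
print(cycle_consistency_loss(x, toy_forward, toy_inverse))        # exact round trip
print(cycle_consistency_loss(x, toy_forward, toy_leaky_inverse))  # imperfect round trip
print(generator_objective(x, toy_forward, toy_leaky_inverse, scores_fake=[0.5]))
```

The 3D supervision in point 14 would add a further term that compares a rendered hand mesh with the image. It is left out here because it depends on a differentiable renderer that the summary does not specify.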

Knowledge Vault built by David Vivancos 2024