Knowledge Vault 5/48 - CVPR 2019
Learning Video Representations From Correspondence Proposals
Xingyu Liu; Joon-Young Lee; Hailin Jin
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

```mermaid
graph LR
classDef video fill:#f9d4d4, font-weight:bold, font-size:14px
classDef correspondence fill:#d4f9d4, font-weight:bold, font-size:14px
classDef cp fill:#d4d4f9, font-weight:bold, font-size:14px
classDef performance fill:#f9f9d4, font-weight:bold, font-size:14px
A[Learning Video Representations From Correspondence Proposals] --> B[Video: 2D field, objects, correspondences. 1]
A --> C[Correspondence: similar features, arbitrary ranges. 2]
C --> D[Potential correspondence: sparsity, irregularity. 3]
A --> E[Novel neural net proposed. 4]
E --> F[Video tensor: point cloud. 5]
F --> G[k-NN considered potential correspondences. 6]
E --> H[CP computes k-NN indices. 7]
H --> I[Correspondence embedding: concatenation, processing. 8]
I --> J[Output encodes dynamic information. 9]
E --> K[CP integrated into ResNet. 10]
E --> L[Ablation: CP modules, k. 11]
E --> M[Better performance, fewer parameters. 12]
M --> N[SOTA on motion-centric datasets. 13]
C --> O[CP proposes reasonable correspondences. 14]
O --> P[CP filters, keeps correct. 15]
O --> Q[CP changes moving areas. 16]
A --> R[Code open-sourced. 17]
class A,B video
class C,D,O,P,Q correspondence
class E,F,G,H,I,J,K,L cp
class M,N performance
```

Resume:

1.- A video is a 2D field that changes over time; objects in it give rise to correspondences across frames.

2.- Corresponding positions have similar visual/semantic features and can span arbitrary spatial and temporal ranges.

3.- Given a position, only a small fraction of positions in other frames can plausibly be its correspondence, and those candidates are irregularly scattered (sparsity and irregularity).

4.- A novel neural network architecture is proposed to exploit these correspondence properties in videos.

5.- The representation tensor of a video is treated as a point cloud in semantic feature space.

6.- For each point, the k nearest neighbors from other frames are found and treated as potential correspondences (the Correspondence Proposal, or CP, module).

7.- The CP module takes the video representation tensor as input and computes a pairwise feature-distance matrix to obtain the k-nearest-neighbor indices.
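To make points 5-7 concrete, below is a minimal sketch of the k-NN proposal step, assuming PyTorch and a video representation flattened to a (T*HW, C) point cloud; the function name `propose_correspondences` and all shapes are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the CP module's k-NN proposal step.
import torch

def propose_correspondences(features, hw_per_frame, k=8):
    """features: (T*HW, C) semantic feature vectors; hw_per_frame = H*W."""
    n = features.shape[0]
    # Pairwise L2 feature distances: (n, n)
    dist = torch.cdist(features, features)
    # Frame index of each point, used to exclude same-frame pairs so
    # that neighbors are proposed only from *other* frames
    frame_id = torch.arange(n, device=features.device) // hw_per_frame
    same_frame = frame_id[:, None] == frame_id[None, :]
    dist = dist.masked_fill(same_frame, float('inf'))
    # Indices of the k nearest neighbors in semantic feature space
    _, knn_idx = dist.topk(k, dim=1, largest=False)
    return knn_idx  # (n, k)
```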

8.- The correspondence embedding layer concatenates the semantic feature vectors of each proposed pair with their relative spatio-temporal location, processes each pair independently, and applies max pooling over the k proposals.

9.- The resulting output tensor encodes the dynamic information of the video; max pooling retains the most salient correspondence response in each channel.
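A hedged sketch of the embedding layer from points 8-9 follows. A shared two-layer MLP over each concatenated pair is assumed for illustration; the paper's exact layer sizes and composition may differ.

```python
# Illustrative correspondence embedding: concatenate pair features with
# the relative (t, h, w) offset, embed each pair independently, then
# max-pool over the k proposals per point.
import torch
import torch.nn as nn

class CorrespondenceEmbedding(nn.Module):
    def __init__(self, feat_dim, out_dim):
        super().__init__()
        # Input per pair: own feature + neighbor feature + (dt, dh, dw)
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 3, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, feats, knn_idx, locs):
        """feats: (n, C); knn_idx: (n, k); locs: (n, 3) as (t, h, w)."""
        n, k = knn_idx.shape
        neigh_feats = feats[knn_idx]                 # (n, k, C)
        rel_loc = locs[knn_idx] - locs[:, None, :]   # (n, k, 3)
        own = feats[:, None, :].expand(-1, k, -1)    # (n, k, C)
        pair = torch.cat([own, neigh_feats, rel_loc], dim=-1)
        emb = self.mlp(pair)                         # (n, k, out_dim)
        # Max pooling keeps the strongest response per channel,
        # effectively selecting among the k proposals
        return emb.max(dim=1).values                 # (n, out_dim)
```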

10.- The CP module is integrated into a C2D ResNet architecture.
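Building on the two helpers sketched above, here is an illustrative (not the paper's exact) way a CP block could be dropped between C2D ResNet stages: the stage activation is flattened to a point cloud, correspondences are proposed and embedded, and the result is added back residually so the 2D appearance stream is preserved.

```python
# Hypothetical CP block wrapping the helpers above for a (B, C, T, H, W)
# activation from a C2D ResNet stage.
import torch
import torch.nn as nn

class CPBlock(nn.Module):
    def __init__(self, channels, k=8):
        super().__init__()
        self.k = k
        self.embed = CorrespondenceEmbedding(channels, channels)

    def forward(self, x):
        B, C, T, H, W = x.shape
        out = []
        for b in range(B):  # one clip at a time, for clarity
            feats = x[b].permute(1, 2, 3, 0).reshape(T * H * W, C)
            t, h, w = torch.meshgrid(
                torch.arange(T), torch.arange(H), torch.arange(W),
                indexing='ij')
            locs = torch.stack([t, h, w], dim=-1).reshape(-1, 3).float()
            knn_idx = propose_correspondences(feats, H * W, self.k)
            dyn = self.embed(feats, knn_idx, locs.to(feats.device))
            # Residual connection: dynamic features rejoin the backbone
            out.append((feats + dyn).reshape(T, H, W, C)
                       .permute(3, 0, 1, 2))
        return torch.stack(out)  # (B, C, T, H, W)
```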

11.- Ablation studies were conducted on the number and position of CP modules and the value of k.

12.- The proposed method achieves better performance with fewer parameters than previous works on the Kinetics dataset.

13.- State-of-the-art results among published works are achieved on the motion-centric Something-Something and Jester (gesture) datasets, again with fewer parameters.

14.- Visualizations show that the CP module proposes reasonable correspondences, e.g. for a basketball, a metal can, and a thumb.

15.- During max pooling, the CP module filters out wrong correspondence proposals and keeps the correct ones.

16.- The CP module modifies moving areas of the feature map more strongly than static ones.

17.- The code for the proposed method is open-sourced.

Knowledge Vault built by David Vivancos 2024