Knowledge Vault 5/47 - CVPR 2019
Video Action Transformer Network
Rohit Girdhar; João Carreira; Carl Doersch; Andrew Zisserman

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:

```mermaid
graph LR
  classDef main fill:#f9d4d4,stroke:#333,stroke-width:2px,font-weight:bold,font-size:14px
  classDef video fill:#d4f9d4,stroke:#333,stroke-width:2px,font-weight:bold,font-size:14px
  classDef actor fill:#d4d4f9,stroke:#333,stroke-width:2px,font-weight:bold,font-size:14px
  classDef transformer fill:#f9f9d4,stroke:#333,stroke-width:2px,font-weight:bold,font-size:14px
  classDef results fill:#f9d4f9,stroke:#333,stroke-width:2px,font-weight:bold,font-size:14px
  A[Video Action Transformer Network] --> B[Localizes actors, recognizes actions, video clips. 1]
  A --> C[Spatio-temporal action detection, experiments on AVA dataset. 2]
  A --> D[Extracts 3D convolution, center frame features. 3]
  B --> E[Recognizing actions requires person, scene context. 4]
  B --> F[Initial actor representation extracts video regions. 6]
  E --> G[Self-attention transformer encodes actor representation context. 5]
  G --> H[Video projected key-value, actor dot-product attention. 7]
  H --> I[Values summed, added to actor features. 8]
  F --> J[Action transformer: initial actor, video context. 9]
  J --> K[Action transformer after actor, video features. 10]
  K --> L[Multiple action transformer layers, arbitrary organization. 11]
  L --> M[Classification regression loss, action transformer head. 12]
  M --> N[I3D, action transformer together best results. 13]
  N --> O[State-of-the-art performance at publication time. 14]
  G --> P[Key-value embeddings visualized PCA, color-coding. 15]
  G --> Q[Implicitly learns track people, semantic, instance. 16]
  Q --> R[Action transformer heads track semantic, instance. 17]
  E --> S[Attention maps: faces, hands, objects, scene. 18]
  S --> T[Performs well common action classes. 19]
  O --> U[Results: semantic, instance embeddings, attention. 20]
  class A main
  class B,C,D,F video
  class E,G,H,I,J,K,L,P,Q,R,S actor
  class M,N transformer
  class O,T,U results
```


1.- Video Action Transformer Network aims to localize actors and recognize their actions in video clips.

2.- Spatio-temporal action detection is the technical term for this task, with experiments on the AVA dataset.

3.- The standard pipeline extracts 3D convolutional features from the clip, keeps the center-frame features, and uses a region proposal network (RPN) to locate actors.
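The pipeline above can be sketched as a shape walk in NumPy. All sizes and the box here are illustrative placeholders, not the paper's exact configuration, and the crude average-pooling stands in for real RoIPool/RoIAlign:

```python
import numpy as np

# Hypothetical sizes for illustration only.
Tp, Hp, Wp, C = 16, 25, 25, 1024        # trunk output: time x height x width x channels
trunk_features = np.random.randn(Tp, Hp, Wp, C)

# Keep only the temporally central feature map; the RPN proposes
# actor boxes on this center frame.
center_frame = trunk_features[Tp // 2]  # (Hp, Wp, C)

def roi_pool(feature_map, box, out_size=7):
    """Crude RoI pooling sketch: crop the box, then average-pool each
    cell of a fixed out_size x out_size grid."""
    y0, x0, y1, x1 = box
    crop = feature_map[y0:y1, x0:x1]
    ys = np.array_split(np.arange(crop.shape[0]), out_size)
    xs = np.array_split(np.arange(crop.shape[1]), out_size)
    return np.stack([[crop[np.ix_(r, c)].mean(axis=(0, 1)) for c in xs]
                     for r in ys])

actor_feature = roi_pool(center_frame, box=(2, 3, 20, 22))
print(actor_feature.shape)  # (7, 7, 1024)
```

The pooled actor feature is what the later action transformer blocks refine.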

4.- Recognizing actions often requires looking beyond just the person, focusing on other people and objects in the scene.

5.- A self-attention solution based on the Transformer architecture is proposed to encode scene context into the actor representation.

6.- Initial actor representation is used to extract relevant regions from the full video representation.

7.- The video representation is projected into key and value embeddings, while the actor representation serves as the query for dot-product attention.

8.- Weighted sum of values is added back to original actor features, creating an updated actor representation.

9.- Action transformer block takes initial actor representation, encodes video context, and outputs updated actor representation.
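Points 6-9 can be condensed into one NumPy sketch of the attention update: query from the actor, keys and values from the video, dot-product attention, and a residual add. Dimensions and the identity projections are illustrative only, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def action_transformer_unit(actor, video, Wq, Wk, Wv):
    """One attention update, as a sketch.

    actor : (D,)    person feature from RoI pooling
    video : (N, D)  flattened spatio-temporal trunk features
    """
    q = actor @ Wq                                  # query from the actor
    k = video @ Wk                                  # keys from the video
    v = video @ Wv                                  # values from the video
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # dot-product attention weights
    context = att @ v                               # weighted sum of values
    return actor + context                          # residual add -> updated actor

rng = np.random.default_rng(0)
D = 128
actor = rng.standard_normal(D)
video = rng.standard_normal((16 * 25 * 25, D))      # one cell per T' x H' x W' location
Wq = Wk = Wv = np.eye(D)                            # identity projections for the demo
updated = action_transformer_unit(actor, video, Wq, Wk, Wv)
print(updated.shape)  # (128,)
```

The residual connection is what makes this an *update* to the actor representation rather than a replacement.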

10.- Action transformer blocks are plugged in after the initial actor representation is computed, taking the video features as additional input.

11.- Multiple layers of action transformer blocks can be stacked and organized arbitrarily, e.g., a 2x3 configuration of units.
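One way such a stack could look, as a self-contained sketch: several units per layer, several layers in sequence. Averaging the per-layer units and the random projections are simplifying assumptions for the demo, not the paper's exact scheme:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tx_unit(actor, video, rng):
    """Single attention unit; random projections stand in for learned ones."""
    D = actor.shape[0]
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    att = softmax((actor @ Wq) @ (video @ Wk).T / np.sqrt(D))
    return actor + att @ (video @ Wv)

def action_transformer_head(actor, video, units=2, layers=3, seed=0):
    """Sketch of a 2x3-style stack: each layer averages its units'
    outputs and feeds the result to the next layer."""
    rng = np.random.default_rng(seed)
    for _ in range(layers):
        actor = np.mean([tx_unit(actor, video, rng) for _ in range(units)],
                        axis=0)
    return actor

rng = np.random.default_rng(1)
out = action_transformer_head(rng.standard_normal(64),
                              rng.standard_normal((200, 64)))
print(out.shape)  # (64,)
```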

12.- The final feature is trained with a classification and box-regression loss, similar to Faster R-CNN, using an action transformer head.
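A minimal sketch of such a Faster R-CNN-style objective, assuming per-class sigmoid classification (AVA actors can perform several actions at once) plus smooth-L1 box regression; the exact losses and weighting are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_l1(pred, target):
    """Smooth-L1 (Huber) loss, as used for box regression in Faster R-CNN."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def detection_loss(cls_logits, cls_targets, box_pred, box_target):
    """Multi-label action classification plus box regression."""
    p = sigmoid(cls_logits)
    bce = -(cls_targets * np.log(p) + (1 - cls_targets) * np.log(1 - p)).sum()
    return bce + smooth_l1(box_pred, box_target)

loss = detection_loss(
    cls_logits=np.array([2.0, -1.5, 0.3]),
    cls_targets=np.array([1.0, 0.0, 1.0]),   # e.g. two actions active at once
    box_pred=np.array([0.1, 0.0, 0.9, 1.1]),
    box_target=np.array([0.0, 0.0, 1.0, 1.0]),
)
print(float(loss))
```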

13.- Replacing the I3D head with the action transformer head gave a 4% performance improvement; using both together yielded the best results.

14.- The model achieved state-of-the-art performance at the time of publication.

15.- Key and value embeddings in action transformer blocks can be visualized using PCA and color-coding.
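One common way to realize such a visualization (an assumption, not necessarily the authors' exact procedure) is to project each embedding onto its top-3 principal components and map those coordinates to RGB:

```python
import numpy as np

def pca_colors(embeddings):
    """Project D-dim embeddings onto their top-3 principal components
    and rescale to [0, 1], so each embedding can be drawn as an RGB color."""
    X = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:3].T                      # (N, 3) PCA coordinates
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)    # map each channel to [0, 1]

rng = np.random.default_rng(0)
keys = rng.standard_normal((500, 128))       # e.g. one key embedding per cell
colors = pca_colors(keys)
print(colors.shape)  # (500, 3)
```

Cells whose embeddings are similar then receive similar colors, which is what makes the implicit tracking visible.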

16.- The model implicitly learns to track people in the video, both at a semantic and instance level.

17.- One action transformer head tracks people semantically by projecting them to the same embedding, while another tracks at an instance level.

18.- Attention maps show the model focusing on other people's faces, hands, and objects in the scene.

19.- The model performs well for most common action classes.

20.- Additional results demonstrate semantic and instance level embeddings, and attention focusing on relevant people and objects.

Knowledge Vault built by David Vivancos 2024