Knowledge Vault 2/85 - ICLR 2014-2023
Cordelia Schmid ICLR 2022 - Invited Talk - Do you see what I see? Large-scale learning from multimodal videos

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef video fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef tasks fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef representation fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef videobert fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef vqa fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef dataset fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Cordelia Schmid ICLR 2022] --> B[Growing online video data. 1]
A --> C[Key video tasks. 2]
C --> D[Classify, retrieve, describe videos. 3]
A --> E[Multi-modal video representations needed. 4]
E --> F[Audio-visual data for understanding. 5]
A --> G[Large-scale cross-modal supervision. 6]
G --> H[HowTo100M: learning without annotation. 7]
A --> I[VideoBERT: video-speech correspondences. 8]
I --> J[BERT-like architecture, instructional pretraining. 9]
I --> K[Zero-shot prediction on new videos. 10]
I --> L[Near supervised action recognition. 11]
I --> M[Improves SOTA video captioning. 12]
I --> N[Future: harder tasks, non-instructional. 13]
A --> O[Cross-modal zero-shot video QA. 14]
O --> P[QA from text and video. 15]
O --> Q[Instructional videos with transcripts. 16]
Q --> R[QA extraction from transcripts. 17]
Q --> S[69M video QA dataset generated. 18]
S --> T[30% QA pairs correct, well-matched. 19]
O --> U[Multimodal transformer for zero-shot QA. 20]
S --> V[Enables strong zero-shot on benchmarks. 21]
S --> W[Pretraining boosts vs. scratch. 22]
S --> X[First zero-shot video QA. 23]
A --> Y[Leveraging image captioning datasets. 24]
Y --> Z[Video-text datasets noisy or small. 25]
Y --> AA[Transfer image captions to videos. 26]
AA --> AB[VideoCC3M dataset constructed. 27]
AB --> AC[VideoCC3M more balanced. 28]
AB --> AD[91% video-caption pairs relevant. 29]
AB --> AE[Improves zero-shot retrieval vs. HowTo100M. 30]
AB --> AF[Audio features further improve SOTA. 31]
AB --> AG[First zero-shot video captioning. 32]
A --> AH[Key takeaways. 33]
AH --> AI[Cross-modal learning from clean, diverse data. 34]
A --> AJ[Open questions and future work. 35]
AJ --> AK[Data cleaning, scale, diversity. 36]
AJ --> AL[Refine temporal video-text alignment. 37]
AJ --> AM[Object-level representations for matching. 38]
class A,B,Y,Z video;
class C,D tasks;
class E,F,G,H representation;
class I,J,K,L,M,N videobert;
class O,P,Q,R,S,T,U,V,W,X vqa;
class AA,AB,AC,AD,AE,AF,AG dataset;
class AH,AI,AJ,AK,AL,AM learning;

Resume:

1.-There is a large and growing amount of video data available online from various sources like YouTube and surveillance cameras.

2.-Key tasks include classifying activities in videos, text-to-video retrieval, and describing the story of a video.

3.-Multi-modal video representations that incorporate audio and visual information are needed to precisely understand video content.

4.-Large-scale cross-modal supervision from datasets like HowTo100M allows learning without manual annotation.

5.-VideoBERT is a model that learns correspondences between video and speech from multi-modal data using a BERT-like architecture.

6.-VideoBERT is pretrained on a large instructional video dataset and can then be applied for zero-shot prediction on new videos.
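
The exact VideoBERT recipe is not given in this summary, but the core idea of points 5-6, a BERT-style masked-token model over speech-transcript tokens and quantized video tokens, can be sketched as below. This is a minimal illustration under assumptions (a shared token table, a generic PyTorch TransformerEncoder, random toy data), not the actual VideoBERT implementation.

```python
import torch
import torch.nn as nn

class MaskedVideoTextModel(nn.Module):
    """BERT-style encoder over concatenated text tokens and quantized video tokens (sketch)."""
    def __init__(self, text_vocab=30522, video_vocab=20736, dim=256, layers=4, heads=4, max_len=512):
        super().__init__()
        self.vocab = text_vocab + video_vocab + 2            # assumed shared table: text ids, video ids, +[SEP]/[MASK]
        self.tok_emb = nn.Embedding(self.vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.lm_head = nn.Linear(dim, self.vocab)

    def forward(self, tokens):                                # tokens: (batch, seq)
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        return self.lm_head(self.encoder(h))                  # (batch, seq, vocab) logits

def masked_lm_step(model, tokens, mask_id, mask_prob=0.15):
    """Mask a random subset of (text or video) tokens and predict the originals."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    logits = model(tokens.masked_fill(mask, mask_id))
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# toy usage: a batch standing in for tokenized ASR text followed by quantized video "visual words"
model = MaskedVideoTextModel()
tokens = torch.randint(0, model.vocab - 2, (2, 64))
loss = masked_lm_step(model, tokens, mask_id=model.vocab - 1)
loss.backward()
```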

7.-VideoBERT's performance on action recognition is close to that of fully supervised models, and more pretraining data improves performance.

8.-VideoBERT can be fine-tuned for downstream tasks like video captioning, where pretraining helps improve the state of the art.

9.-Open questions for VideoBERT include extending it to more difficult tasks and to non-instructional videos.

10.-Cross-modal learning is used for zero-shot video question answering by generating a large QA dataset from text and video.

11.-Start with an instructional video dataset with speech transcripts. Use a trained QA model to extract questions and answers from transcripts.

12.-This process automatically generates HowToVQA69M, a dataset of 69M video QA pairs, with 33 ten-second clips per video and 1.2 QA pairs per clip.
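
A rough sketch of the generation pipeline in points 11-12, assuming transcripts are already cut into timed segments. The qa_generator callable stands in for the trained question/answer-generation models, and the clip-cutting heuristic and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TranscriptSegment:
    video_id: str
    start: float          # seconds
    end: float
    text: str

@dataclass
class VideoQATriplet:
    video_id: str
    clip_start: float
    clip_end: float
    question: str
    answer: str

def build_video_qa(
    segments: List[TranscriptSegment],
    qa_generator: Callable[[str], List[Tuple[str, str]]],
    clip_len: float = 10.0,
) -> List[VideoQATriplet]:
    """Turn narrated-video transcripts into (video clip, question, answer) triplets.

    qa_generator stands in for the trained answer-extraction and question-generation
    models; it maps one transcript sentence to zero or more (question, answer) pairs.
    """
    triplets = []
    for seg in segments:
        mid = 0.5 * (seg.start + seg.end)
        clip_start = max(0.0, mid - clip_len / 2)   # cut a short clip centred on the sentence
        for question, answer in qa_generator(seg.text):
            triplets.append(
                VideoQATriplet(seg.video_id, clip_start, clip_start + clip_len, question, answer)
            )
    return triplets

# toy usage with a dummy generator (the real one is a learned model)
segments = [TranscriptSegment("vid_001", 12.0, 16.5, "Now we add two cups of flour to the bowl.")]
dummy = lambda text: [("What do we add to the bowl?", "two cups of flour")]
print(build_video_qa(segments, dummy))
```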

13.-Around 30% of the automatically generated QA pairs are correct and well-matched to the video, based on manual evaluation.

14.-A multimodal transformer is trained on this dataset for zero-shot video QA, taking video+question as input to predict the answer.
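
One way to realize the setup in point 14 is to score a fused video+question embedding against embeddings of candidate answers and, at inference, pick the best-scoring answer zero-shot. The sketch below assumes generic encoders and a fixed answer vocabulary; module names and dimensions are placeholders, not the model from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoQAScorer(nn.Module):
    """Score (video, question) pairs against candidate answers in a shared space (sketch)."""
    def __init__(self, video_dim=1024, text_vocab=30522, dim=256, heads=4, layers=2):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)                  # per-clip features -> model dim
        self.text_emb = nn.Embedding(text_vocab, dim)
        enc = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc, layers)             # joint video+question encoder
        self.answer_enc = nn.EmbeddingBag(text_vocab, dim)           # crude bag-of-words answer encoder

    def forward(self, video_feats, question_ids, answer_ids):
        # video_feats: (B, T, video_dim); question_ids: (B, Lq); answer_ids: (A, La)
        joint = torch.cat([self.video_proj(video_feats), self.text_emb(question_ids)], dim=1)
        vq = self.fusion(joint).mean(dim=1)                          # (B, dim) fused video-question embedding
        ans = self.answer_enc(answer_ids)                            # (A, dim) one embedding per candidate answer
        return F.normalize(vq, dim=-1) @ F.normalize(ans, dim=-1).T  # (B, A) similarity scores

# zero-shot inference: pick the highest-scoring answer from a fixed answer vocabulary
model = VideoQAScorer()
scores = model(torch.randn(2, 8, 1024), torch.randint(0, 30522, (2, 12)), torch.randint(0, 30522, (50, 3)))
predicted_answer = scores.argmax(dim=-1)                             # index into the 50 candidate answers
```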

15.-The automatically generated HowToVQA69M dataset enables strong zero-shot performance on the iVQA and MSVD-QA benchmarks.

16.-Using HowToVQA69M for pretraining boosts performance significantly compared to training from scratch.

17.-This cross-modal pretraining matches state-of-the-art models that use other pretraining sources, and it is the first to enable zero-shot video QA.

18.-Existing video-text datasets are either semi-automatically collected and noisy, or manually labeled and small-scale.

19.-In contrast, image captioning datasets are cleaner and larger-scale. The idea is to leverage these to automatically annotate videos.

20.-Find video frames that are visually similar to captioned images, then transfer each caption to a short video clip around the matching frame.
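
The mining step in point 20 can be approximated as a nearest-neighbor search between frame embeddings and captioned-image embeddings produced by the same visual encoder. The similarity threshold, clip length, and data layout below are assumptions for illustration, not the parameters used to build the dataset.

```python
import numpy as np

def mine_video_captions(frame_embs, frame_times, image_embs, captions,
                        sim_threshold=0.6, clip_len=10.0):
    """Attach image captions to video clips around visually similar frames (sketch).

    frame_embs : (F, D) L2-normalised embeddings of sampled video frames
    frame_times: (F,)   timestamps in seconds of those frames
    image_embs : (I, D) L2-normalised embeddings of captioned images
    captions   : list of I caption strings
    Returns a list of (clip_start, clip_end, caption) tuples.
    """
    sims = frame_embs @ image_embs.T                 # (F, I) cosine similarities
    best = sims.argmax(axis=1)                       # nearest captioned image per frame
    pairs = []
    for f, i in enumerate(best):
        if sims[f, i] >= sim_threshold:              # only keep confident visual matches
            start = max(0.0, frame_times[f] - clip_len / 2)
            pairs.append((start, start + clip_len, captions[i]))
    return pairs

# toy usage with random unit vectors standing in for visual-encoder features
rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
frames, images = unit(rng.normal(size=(30, 64))), unit(rng.normal(size=(5, 64)))
caps = [f"caption {i}" for i in range(5)]
print(mine_video_captions(frames, np.linspace(0, 290, 30), images, caps, sim_threshold=0.2))
```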

21.-This process constructs the VideoCC3M dataset from the Conceptual Captions 3M image dataset; it contains 10.3M video-caption pairs.

22.-VideoCC3M is more balanced across domains than HowTo100M, which is dominated by cooking/food videos.

23.-Manual evaluation shows that 91% of VideoCC3M's video-caption pairs are relevant, with some noise because visual similarity does not capture objects precisely.

24.-Zero-shot video-text retrieval performance is significantly higher when training on VideoCC3M than on HowTo100M, showing the importance of data quality.
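
Retrieval models of this kind are commonly trained with a symmetric contrastive objective over matched video-caption pairs, so that zero-shot retrieval is simply ranking by the learned similarity. The InfoNCE-style sketch below is generic, not the specific loss or encoders used in the talk.

```python
import torch
import torch.nn.functional as F

def contrastive_video_text_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (video, caption) pairs sit on the diagonal."""
    v = F.normalize(video_emb, dim=-1)               # (B, D)
    t = F.normalize(text_emb, dim=-1)                # (B, D)
    logits = v @ t.T / temperature                   # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# zero-shot retrieval then just ranks captions by the same similarity, with no fine-tuning
video_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
loss = contrastive_video_text_loss(video_emb, text_emb)
ranking = (F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T).argsort(dim=-1, descending=True)
```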

25.-Adding audio features to VideoCC3M training further improves zero-shot retrieval accuracy, outperforming the state of the art.

26.-A model trained on VideoCC3M for zero-shot video captioning generates much more relevant captions than one trained on HowTo100M.
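
For completeness, a generic encoder-decoder captioner with greedy decoding is sketched below; the architecture, vocabulary size, and special-token ids are placeholders rather than the model reported in the talk.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Generic transformer encoder-decoder: video features in, caption tokens out (sketch)."""
    def __init__(self, video_dim=1024, vocab=10000, dim=256, heads=4, layers=2):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.tok_emb = nn.Embedding(vocab, dim)
        self.transformer = nn.Transformer(dim, heads, layers, layers, dim * 4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def greedy_caption(self, video_feats, bos_id=1, eos_id=2, max_len=20):
        memory = self.transformer.encoder(self.video_proj(video_feats))       # encode frame features
        tokens = torch.full((video_feats.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
            out = self.transformer.decoder(self.tok_emb(tokens), memory, tgt_mask=mask)
            next_tok = self.head(out[:, -1]).argmax(dim=-1, keepdim=True)      # most likely next word
            tokens = torch.cat([tokens, next_tok], dim=1)
            if (next_tok == eos_id).all():
                break
        return tokens

model = VideoCaptioner()
caption_ids = model.greedy_caption(torch.randn(1, 8, 1024))                    # map ids to words with a tokenizer
```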

27.-This is the first approach to demonstrate zero-shot video captioning, with promising qualitative results.

28.-The key takeaway is the effectiveness of cross-modal learning from clean, diverse datasets for zero-shot video understanding tasks.

29.-Open questions include further data cleaning, extending data scale and diversity, and refining temporal video-text alignment.

30.-Future work could incorporate object-level representations to improve cross-modal matching beyond global visual similarity.

Knowledge Vault built by David Vivancos 2024