Knowledge Vault 2/9 - ICLR 2014-2023
Mohammad Norouzi; Tomas Mikolov; Samy Bengio; Yoram Singer; Jonathon Shlens; Andrea Frome; Greg S. Corrado; Jeffrey Dean ICLR 2014 - Zero-Shot Learning by Convex Combination of Semantic Embeddings
[Resume Image]

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:

Main
Zero-shot learning
leverages side info 1
Unsupervised embeddings
handle many classes 2
Projects images,
uses nearest neighbor 3
Challenges: embedding,
projection, search 4
ConSE approach 6
Skip-gram learns
unsupervised embeddings 5
Weights from
probabilistic classifier 7
t selects top
label embeddings 8
ConSE needs
no extra training 9
Alternatives map
images to label 10
ImageNet experiments 11
ConSE vs
DeViSE linear mapping 12
Test labels:
1-2, 3, >3 hops 13
Hit@k reported,
training labels in/out 14
ConSE results 15
t=10 beats
t=1, t=1000, DeViSE 15
Farther labels
degrade, training preferred 16
Predicts rare
categories well 17
Failures have
sensible predictions 18
Generalizes better
than DeViSE 19
Deterministically embeds,
outperforms regression 20
Limitations 22
Embeddings limit
visual similarity 22
Visual, textual
similarity may misalign 23
Wikipedia embeddings
convenient despite limitation 24
Classifier confusion
could improve similarity 25
Hierarchical metrics
measure distance 21

Resume:

1.-Zero-shot learning aims to classify images into unseen classes by leveraging side information like semantic embeddings of labels.

2.-Unsupervised semantic embeddings of labels allow working with many classes without manually annotating attributes.

3.-The approach projects images into the semantic label space and uses nearest neighbor search to classify test images.

4.-Key challenges are defining the semantic label embedding, projecting images into that space, and performing nearest neighbor search.

5.-The skip-gram model is used to learn semantic label embeddings from word co-occurrences in an unsupervised way.
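
As a rough illustration of item 5, a skip-gram embedding of this kind can be trained with the gensim library; the toy corpus and parameter choices below are stand-ins, not the paper's actual Wikipedia setup.

```python
# Minimal sketch: learning 500-D skip-gram label embeddings with gensim.
# The corpus here is a toy stand-in; the paper trains on a Wikipedia dump.
from gensim.models import Word2Vec

corpus = [
    ["the", "sea", "lion", "swims", "near", "the", "shore"],
    ["a", "hammer", "drives", "nails", "into", "wood"],
]  # hypothetical sentences; real training uses billions of tokens

model = Word2Vec(
    sentences=corpus,
    vector_size=500,  # embedding dimensionality used in the paper
    sg=1,             # sg=1 selects the skip-gram architecture
    window=5,
    min_count=1,      # keep rare words in this tiny toy corpus
)

lion_vec = model.wv["lion"]  # 500-D semantic embedding for a label word
```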

6.-The ConSE (convex combination of semantic embeddings) model projects images by taking a weighted combination of training label embeddings.

7.-Weights are the conditional probabilities of training labels given the image, obtained from a trained probabilistic classifier.

8.-A parameter t restricts the averaging to the top t most probable label embeddings, reducing noise from labels with tiny probabilities.

9.-ConSE requires no extra training beyond the initial classifier. Output likely stays on the manifold of label embeddings.
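
Items 3 and 6-9 together reduce ConSE inference to a weighted average of label embeddings followed by cosine nearest-neighbor search. A minimal NumPy sketch, assuming a trained classifier's softmax output and precomputed label embeddings (all array names, shapes, and the random stand-in data are illustrative):

```python
# Minimal NumPy sketch of ConSE inference, assuming a trained classifier's
# softmax output `probs` over n training labels, a matrix `train_emb` of
# their embeddings, and `test_emb` for the zero-shot candidate labels.
import numpy as np

def conse_embed(probs, train_emb, t=10):
    """Convex combination of the embeddings of the top-t training labels."""
    top = np.argsort(probs)[::-1][:t]        # indices of the t most probable labels
    weights = probs[top] / probs[top].sum()  # renormalize so weights sum to 1
    return weights @ train_emb[top]          # stays in the convex hull of embeddings

def conse_classify(probs, train_emb, test_emb, t=10, k=5):
    """Rank zero-shot labels by cosine similarity to the projected image."""
    f = conse_embed(probs, train_emb, t)
    sims = (test_emb @ f) / (np.linalg.norm(test_emb, axis=1) * np.linalg.norm(f))
    return np.argsort(sims)[::-1][:k]        # top-k zero-shot label indices

# Toy usage with random stand-ins for real model outputs and embeddings.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000))         # softmax over 1000 training labels
train_emb = rng.normal(size=(1000, 500))     # 500-D skip-gram label embeddings
test_emb = rng.normal(size=(20000, 500))     # embeddings of 20,000 zero-shot labels
print(conse_classify(probs, train_emb, test_emb))
```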

10.-Alternative models learn a regression to map images close to their label embedding and far from incorrect ones.

11.-Experiments are run on ImageNet with 1,000 training labels and 20,000 zero-shot test labels, using 500-D skip-gram embeddings trained on Wikipedia.

12.-ConSE is compared to the DeViSE model, which learns a linear mapping of images into the embedding space via a ranking loss.
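
For contrast with ConSE's training-free projection, here is a hedged sketch of a DeViSE-style pairwise hinge ranking loss; the margin value, the 4096-D feature size, and the function name are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a DeViSE-style ranking loss: push the linearly mapped
# image M @ v closer to its true label embedding than to any other label
# embedding by at least a margin.
import numpy as np

def devise_rank_loss(v, y, M, label_emb, margin=0.1):
    """v: image feature, y: true label index, M: linear map, label_emb: (n, d)."""
    proj = M @ v                              # image mapped into embedding space
    scores = label_emb @ proj                 # similarity to every label embedding
    hinge = np.maximum(0.0, margin - scores[y] + scores)
    hinge[y] = 0.0                            # the true label incurs no loss
    return hinge.sum()

# Toy usage with random stand-ins for image features and embeddings.
rng = np.random.default_rng(0)
M = 0.01 * rng.normal(size=(500, 4096))       # maps assumed 4096-D features to 500-D
loss = devise_rank_loss(rng.normal(size=4096), 3, M, rng.normal(size=(1000, 500)))
```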

13.-Three subsets of test labels used: 1-2 hops, 3 hops, and >3 hops away from training labels in ImageNet hierarchy.

14.-Flat hit@k (the percentage of test images whose true label is in the top k predictions) is reported, both excluding and including training labels.
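
A minimal sketch of the flat hit@k computation from item 14 (function name and input layout are illustrative):

```python
# Flat hit@k: fraction of test images whose true label is in the top-k list.
import numpy as np

def flat_hit_at_k(pred_rankings, true_labels, k):
    """pred_rankings: (n_images, n_labels) label indices sorted by score,
    best first; true_labels: (n_images,) ground-truth label indices."""
    hits = [truth in ranking[:k] for ranking, truth in zip(pred_rankings, true_labels)]
    return float(np.mean(hits))
```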

15.-ConSE with t=10 outperforms t=1 (averaging several labels captures ambiguity) and t=1000 (too many noisy low-probability labels), and beats DeViSE by 5-15%.

16.-Performance degrades as test labels get farther in the hierarchy from the training labels. All methods prefer predicting training labels when those are included.

17.-Qualitative results show ConSE predicts relevant labels for images of rare categories like sea lions and hammers.

18.-Even failure cases have sensible predictions, such as vehicle-related classes for a farm machine image.

19.-ConSE performs worse than DeViSE on training labels, but generalizes better to unseen test labels without overfitting.

20.-In summary, ConSE deterministically embeds images using classifier probabilities and semantic label embeddings, outperforming regression-based approaches.

21.-Hierarchical performance metrics are also reported in the paper to measure taxonomic distance of predictions from ground truth.
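
A hedged sketch of a hierarchical precision@k in this spirit: the true label is expanded to a set of taxonomically nearby labels, and precision is the overlap of the top-k predictions with that set. The `neighbors` helper is hypothetical, standing in for a lookup in the ImageNet hierarchy.

```python
# Hierarchical precision@k sketch; `neighbors` is a hypothetical helper that
# returns the set of labels within a small tree distance of y, expanded until
# it contains at least k labels.
def hierarchical_precision_at_k(ranking, true_label, neighbors, k):
    """ranking: label indices sorted by score, best first."""
    correct_set = neighbors(true_label, k)
    return len(set(ranking[:k]) & correct_set) / k
```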

22.-A questioner points out limitations of using textual embeddings as a proxy for visual similarity.

23.-Visual and textual similarity may not always align well, e.g. Eiffel Tower mapping close to unrelated categories.

24.-Word embeddings are unsupervised and easily trained on Wikipedia, making them convenient despite this limitation.

25.-Future work could explore using classifier confusion matrices to better capture visual similarities between classes.

Knowledge Vault built by David Vivancos 2024