Knowledge Vault 2/9 - ICLR 2014-2023
Mohammad Norouzi; Tomas Mikolov; Samy Bengio; Yoram Singer; Jonathon Shlens; Andrea Frome; Greg S. Corrado; Jeffrey Dean ICLR 2014 - Zero-Shot Learning by Convex Combination of Semantic Embeddings
[Resume Image]

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:

Main
Zero-shot learning
leverages side info 1
Unsupervised embeddings
handle many classes 2
Projects images,
uses nearest neighbor 3
Challenges: embedding,
projection, search 4
ConSE approach 6
Skip-gram learns
unsupervised embeddings 5
Weights from
probabilistic classifier 7
t selects top
label embeddings 8
ConSE needs
no extra training 9
Alternatives map
images to label 10
ImageNet experiments 11
ConSE vs
DeViSE linear mapping 12
Test labels:
1-2, 3, >3 hops 13
Hit@k reported,
training labels in/out 14
ConSE results 15
t=10 beats
t=1, t=1000, DeViSE 15
Farther labels
degrade, training preferred 16
Predicts rare
categories well 17
Failures have
sensible predictions 18
Generalizes better
than DeViSE 19
Deterministically embeds,
outperforms regression 20
Limitations 22
Embeddings limit
visual similarity 22
Visual, textual
similarity may misalign 23
Wikipedia embeddings
convenient despite limitation 24
Classifier confusion
could improve similarity 25
Hierarchical metrics
measure distance 21

Resume:

1.-Zero-shot learning aims to classify images into unseen classes by leveraging side information like semantic embeddings of labels.

2.-Unsupervised semantic embeddings of labels allow working with many classes without manually annotating attributes.

3.-The approach projects images into the semantic label space and uses nearest neighbor search to classify test images.

4.-Key challenges are defining the semantic label embedding, projecting images into that space, and performing nearest neighbor search.

5.-The skip-gram model is used to learn semantic label embeddings from word co-occurrences in an unsupervised way.
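
As a rough illustration of item 5, a skip-gram embedding of this kind can be trained with the gensim library; the toy corpus and parameter choices below are stand-ins, not the paper's actual Wikipedia setup.

```python
# Minimal sketch: learning 500-D skip-gram label embeddings with gensim.
# The corpus here is a toy stand-in; the paper trains on a Wikipedia dump.
from gensim.models import Word2Vec

corpus = [
    ["the", "sea", "lion", "swims", "near", "the", "shore"],
    ["a", "hammer", "drives", "nails", "into", "wood"],
]  # hypothetical sentences; real training uses billions of tokens

model = Word2Vec(
    sentences=corpus,
    vector_size=500,  # embedding dimensionality used in the paper
    sg=1,             # sg=1 selects the skip-gram architecture
    window=5,
    min_count=1,      # keep rare words in this tiny toy corpus
)

lion_vec = model.wv["lion"]  # 500-D semantic embedding for a label word
```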

6.-The ConSE (convex combination of semantic embeddings) model projects images by taking a weighted combination of training label embeddings.

7.-Weights are the conditional probabilities of training labels given the image, obtained from a trained probabilistic classifier.

8.-A parameter t restricts the averaging to the top t most probable label embeddings, reducing noise from labels with tiny probabilities.

9.-ConSE requires no extra training beyond the initial classifier. Output likely stays on the manifold of label embeddings.
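
Items 3 and 6-9 together reduce ConSE inference to a weighted average of label embeddings followed by cosine nearest-neighbor search. A minimal NumPy sketch, assuming a trained classifier's softmax output and precomputed label embeddings (all array names, shapes, and the random stand-in data are illustrative):

```python
# Minimal NumPy sketch of ConSE inference, assuming a trained classifier's
# softmax output `probs` over n training labels, a matrix `train_emb` of
# their embeddings, and `test_emb` for the zero-shot candidate labels.
import numpy as np

def conse_embed(probs, train_emb, t=10):
    """Convex combination of the embeddings of the top-t training labels."""
    top = np.argsort(probs)[::-1][:t]        # indices of the t most probable labels
    weights = probs[top] / probs[top].sum()  # renormalize so weights sum to 1
    return weights @ train_emb[top]          # stays in the convex hull of embeddings

def conse_classify(probs, train_emb, test_emb, t=10, k=5):
    """Rank zero-shot labels by cosine similarity to the projected image."""
    f = conse_embed(probs, train_emb, t)
    sims = (test_emb @ f) / (np.linalg.norm(test_emb, axis=1) * np.linalg.norm(f))
    return np.argsort(sims)[::-1][:k]        # top-k zero-shot label indices

# Toy usage with random stand-ins for real model outputs and embeddings.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000))         # softmax over 1000 training labels
train_emb = rng.normal(size=(1000, 500))     # 500-D skip-gram label embeddings
test_emb = rng.normal(size=(20000, 500))     # embeddings of 20,000 zero-shot labels
print(conse_classify(probs, train_emb, test_emb))
```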

10.-Alternative models learn a regression to map images close to their label embedding and far from incorrect ones.

11.-Experiments are run on ImageNet with 1,000 training labels and 20,000 zero-shot test labels, using 500-D skip-gram embeddings trained on Wikipedia.

12.-ConSE is compared to the DeViSE model, which learns a linear mapping of images into the embedding space via a ranking loss.
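
For contrast with ConSE's training-free projection, here is a hedged sketch of a DeViSE-style pairwise hinge ranking loss; the margin value, the 4096-D feature size, and the function name are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a DeViSE-style ranking loss: push the linearly mapped
# image M @ v closer to its true label embedding than to any other label
# embedding by at least a margin.
import numpy as np

def devise_rank_loss(v, y, M, label_emb, margin=0.1):
    """v: image feature, y: true label index, M: linear map, label_emb: (n, d)."""
    proj = M @ v                              # image mapped into embedding space
    scores = label_emb @ proj                 # similarity to every label embedding
    hinge = np.maximum(0.0, margin - scores[y] + scores)
    hinge[y] = 0.0                            # the true label incurs no loss
    return hinge.sum()

# Toy usage with random stand-ins for image features and embeddings.
rng = np.random.default_rng(0)
M = 0.01 * rng.normal(size=(500, 4096))       # maps assumed 4096-D features to 500-D
loss = devise_rank_loss(rng.normal(size=4096), 3, M, rng.normal(size=(1000, 500)))
```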

13.-Three subsets of test labels used: 1-2 hops, 3 hops, and >3 hops away from training labels in ImageNet hierarchy.

14.-Flat hit@k (the percentage of test images whose true label is in the top k predictions) is reported, both excluding and including training labels.
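
A minimal sketch of the flat hit@k computation from item 14 (function name and input layout are illustrative):

```python
# Flat hit@k: fraction of test images whose true label is in the top-k list.
import numpy as np

def flat_hit_at_k(pred_rankings, true_labels, k):
    """pred_rankings: (n_images, n_labels) label indices sorted by score,
    best first; true_labels: (n_images,) ground-truth label indices."""
    hits = [truth in ranking[:k] for ranking, truth in zip(pred_rankings, true_labels)]
    return float(np.mean(hits))
```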

15.-ConSE with t=10 outperforms t=1 (averaging several labels captures ambiguity) and t=1000 (too many noisy low-probability labels), and beats DeViSE by 5-15%.

16.-Performance degrades as test labels get farther in the hierarchy from the training labels. All methods prefer predicting training labels when those are included.

17.-Qualitative results show ConSE predicts relevant labels for images of rare categories like sea lions and hammers.

18.-Even failure cases have sensible predictions, such as vehicle-related classes for a farm machine image.

19.-ConSE performs worse than DeViSE on training labels, but generalizes better to unseen test labels without overfitting.

20.-In summary, ConSE deterministically embeds images using classifier probabilities and semantic label embeddings, outperforming regression-based approaches.

21.-Hierarchical performance metrics are also reported in the paper to measure taxonomic distance of predictions from ground truth.
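
A hedged sketch of a hierarchical precision@k in this spirit: the true label is expanded to a set of taxonomically nearby labels, and precision is the overlap of the top-k predictions with that set. The `neighbors` helper is hypothetical, standing in for a lookup in the ImageNet hierarchy.

```python
# Hierarchical precision@k sketch; `neighbors` is a hypothetical helper that
# returns the set of labels within a small tree distance of y, expanded until
# it contains at least k labels.
def hierarchical_precision_at_k(ranking, true_label, neighbors, k):
    """ranking: label indices sorted by score, best first."""
    correct_set = neighbors(true_label, k)
    return len(set(ranking[:k]) & correct_set) / k
```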

22.-A questioner points out limitations of using textual embeddings as a proxy for visual similarity.

23.-Visual and textual similarity may not always align well, e.g. Eiffel Tower mapping close to unrelated categories.

24.-Word embeddings are unsupervised and easily trained on Wikipedia, making them convenient despite this limitation.

25.-Future work could explore using classifier confusion matrices to better capture visual similarities between classes.

Knowledge Vault built by David Vivancos 2024