Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Zero-shot learning aims to classify images into unseen classes by leveraging side information like semantic embeddings of labels.
2.-Unsupervised semantic embeddings of labels allow working with many classes without manually annotating attributes.
3.-The approach projects images into the semantic label space and uses nearest neighbor search to classify test images.
4.-Key challenges are defining the semantic label embedding, projecting images into that space, and performing nearest neighbor search.
5.-A skip-gram model is used to learn semantic label embeddings from word co-occurrences in an unsupervised way (see the embedding sketch after the list).
6.-The ConSE (convex combination of semantic embeddings) model projects images by taking a weighted combination of training label embeddings (see the ConSE sketch after the list).
7.-Weights are the conditional probabilities of training labels given the image, obtained from a trained probabilistic classifier.
8.-A parameter t restricts the average to the top t most probable training labels, reducing noise from tiny probabilities.
9.-ConSE requires no extra training beyond the initial classifier; because its output is a convex combination of training label embeddings, it tends to stay on the manifold of label embeddings.
10.-Alternative models learn a regression to map images close to their label embedding and far from incorrect ones.
11.-Experiments are done on ImageNet with 1,000 training labels and about 20,000 zero-shot test labels, using 500-D skip-gram embeddings trained on Wikipedia.
12.-ConSE is compared to the DeViSE model, which learns a linear mapping of image features into the embedding space via a ranking loss (see the ranking-loss sketch after the list).
13.-Three subsets of test labels used: 1-2 hops, 3 hops, and >3 hops away from training labels in ImageNet hierarchy.
14.-Flat hit@k (the % of test images whose true label appears in the top k predictions) is reported, both excluding and including the training labels (see the hit@k sketch after the list).
15.-ConSE with t=10 outperforms both t=1 (a single label cannot capture an image's ambiguity) and t=1000 (tiny probabilities add too much noise), and beats DeViSE by 5-15%.
16.-Performance degrades as test labels get farther in the hierarchy from the training labels. When training labels are included as candidates, all methods tend to predict them instead of the unseen labels.
17.-Qualitative results show ConSE predicts relevant labels for images of rare categories like sea lions and hammers.
18.-Even failure cases have sensible predictions like vehicle-related classes for a farm machine image.
19.-ConSE performs worse than DeViSE on training labels, but generalizes better to unseen test labels without overfitting.
20.-In summary, ConSE deterministically embeds images using classifier probabilities and semantic label embeddings, outperforming regression-based approaches.
21.-Hierarchical performance metrics are also reported in the paper to measure the taxonomic distance of predictions from the ground truth (see the taxonomic-distance sketch after the list).
22.-A questioner points out limitations of using textual embeddings as a proxy for visual similarity.
23.-Visual and textual similarity may not always align well, e.g. Eiffel Tower mapping close to unrelated categories.
24.-Word embeddings are unsupervised and easily trained on Wikipedia, making them convenient despite this limitation.
25.-Future work could explore using classifier confusion matrices to better capture visual similarities between classes.
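Embedding sketch (point 5): a minimal example of training unsupervised skip-gram label embeddings, assuming gensim >= 4 (older versions use size= instead of vector_size=). The tiny toy corpus stands in for a tokenized Wikipedia dump, and mapping multi-word labels to vectors is left out.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for tokenized Wikipedia sentences (assumption: the real
# pipeline would stream a full Wikipedia dump through a tokenizer instead).
sentences = [
    ["the", "sea", "lion", "swims", "in", "the", "ocean"],
    ["a", "hammer", "drives", "nails", "into", "wood"],
    ["the", "tractor", "works", "the", "farm", "field"],
]

model = Word2Vec(
    sentences,
    vector_size=500,  # 500-D embeddings, matching the setup described above
    sg=1,             # sg=1 selects the skip-gram objective (sg=0 would be CBOW)
    window=5,
    min_count=1,      # keep every word in this tiny toy corpus
)

label_vec = model.wv["hammer"]  # embedding of a single-token label
print(label_vec.shape)          # (500,)
```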
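ConSE sketch (points 3 and 6-9): a minimal NumPy version of the convex combination of top-t training label embeddings followed by nearest-neighbor search over unseen label embeddings. The random data, the function names, and the use of cosine similarity as the distance measure are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def conse_embed(probs, train_label_emb, t=10):
    """Project an image into the semantic space as a convex combination of the
    embeddings of its top-t training labels, weighted by classifier probabilities."""
    top = np.argsort(probs)[::-1][:t]      # indices of the t most probable labels
    w = probs[top] / probs[top].sum()      # renormalize so the weights sum to 1
    return w @ train_label_emb[top]        # (d,) convex combination

def classify_zero_shot(img_emb, test_label_emb, k=5):
    """Nearest-neighbor search over candidate (unseen) label embeddings,
    here by cosine similarity (an assumption about the distance measure)."""
    a = img_emb / np.linalg.norm(img_emb)
    b = test_label_emb / np.linalg.norm(test_label_emb, axis=1, keepdims=True)
    sims = b @ a
    return np.argsort(sims)[::-1][:k]      # top-k candidate label indices

# Toy example: 1000 training labels, 20000 unseen labels, 500-D embeddings.
rng = np.random.default_rng(0)
train_label_emb = rng.normal(size=(1000, 500))
test_label_emb = rng.normal(size=(20000, 500))
probs = rng.dirichlet(np.ones(1000))       # stand-in for the classifier's softmax output

img_emb = conse_embed(probs, train_label_emb, t=10)
print(classify_zero_shot(img_emb, test_label_emb, k=5))
```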
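Ranking-loss sketch (points 10 and 12): a hedged PyTorch illustration of the alternative regression/ranking approach, where a linear map pushes an image's projection closer to its own label embedding than to incorrect ones. The feature dimensionality, margin, dot-product scoring, and optimizer are assumptions; DeViSE's actual training differs in scale and details.

```python
import torch

# Placeholder dimensions: 4096-D image features, 500-D label embeddings.
d_img, d_emb, n_train_labels, margin = 4096, 500, 1000, 0.1

W = torch.nn.Linear(d_img, d_emb, bias=False)           # the learned linear map
label_emb = torch.nn.functional.normalize(
    torch.randn(n_train_labels, d_emb), dim=1)          # fixed word embeddings
opt = torch.optim.SGD(W.parameters(), lr=1e-3)

def ranking_loss(img_feats, true_idx):
    """Hinge ranking loss: the projected image should score higher with its own
    label embedding than with any other label embedding, by at least `margin`."""
    proj = W(img_feats)                                   # (B, d_emb)
    scores = proj @ label_emb.T                           # (B, n_train_labels)
    true_scores = scores.gather(1, true_idx[:, None])     # (B, 1)
    hinge = (margin + scores - true_scores).clamp(min=0)  # margin violations
    mask = 1.0 - torch.nn.functional.one_hot(true_idx, n_train_labels).float()
    return (hinge * mask).sum(dim=1).mean()               # ignore the true-label term

# One toy optimization step on random data, just to show the loop shape.
x = torch.randn(32, d_img)
y = torch.randint(0, n_train_labels, (32,))
opt.zero_grad()
loss = ranking_loss(x, y)
loss.backward()
opt.step()
```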
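Hit@k sketch (point 14): a small helper for the flat hit@k metric, assuming each test image contributes a ranked list of predicted label indices. Excluding vs. including training labels only changes the candidate set given to the ranker; the toy numbers below are illustrative.

```python
import numpy as np

def flat_hit_at_k(ranked_preds, true_labels, k):
    """Fraction of test images whose true label appears in the top-k predictions.

    ranked_preds: (n_images, >=k) array of predicted label indices, best first.
    true_labels:  (n_images,) array of ground-truth label indices.
    """
    topk = np.asarray(ranked_preds)[:, :k]
    hits = (topk == np.asarray(true_labels)[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 images, top-5 predictions each.
preds = [[4, 7, 1, 9, 2],
         [3, 3, 8, 0, 5],
         [6, 2, 4, 1, 7]]
truth = [7, 5, 0]
print(flat_hit_at_k(preds, truth, k=1))   # 0.0
print(flat_hit_at_k(preds, truth, k=5))   # ~0.667
```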
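Taxonomic-distance sketch (point 21): a toy illustration of measuring how far a prediction sits from the ground truth in a label hierarchy, using tree hops to the lowest common ancestor. The tiny hand-made taxonomy and the hop-count measure are illustrative assumptions, not the paper's exact hierarchical metric.

```python
# Toy taxonomy as child -> parent edges; distance = hops through the lowest common ancestor.
parent = {"sea_lion": "pinniped", "seal": "pinniped",
          "pinniped": "aquatic_mammal", "whale": "aquatic_mammal",
          "aquatic_mammal": "mammal", "dog": "mammal"}

def ancestors(node):
    """Path from a label up to the root of the toy taxonomy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hop_distance(a, b):
    """Tree distance between two labels: hops from each up to their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    lca = next(x for x in pa if x in pb)   # first shared node on a's path to the root
    return pa.index(lca) + pb.index(lca)

# A prediction of "seal" for a "sea_lion" image is 2 hops away,
# while "dog" is 4 hops away, so a hierarchical metric penalizes it more.
print(hop_distance("sea_lion", "seal"))  # 2
print(hop_distance("sea_lion", "dog"))   # 4
```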
Knowledge Vault built by David Vivancos 2024