Knowledge Vault 6/78 - ICML 2022
Understanding Dataset Difficulty with V-Usable Information
Kawin Ethayarajh · Yejin Choi · Swabha Swayamdipta

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

```mermaid
graph LR
    classDef v_info fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef difficulty fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef analysis fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef models fill:#f9f9d4, font-weight:bold, font-size:14px
    A[Understanding Dataset Difficulty with V-Usable Information] --> B[V-usable Information]
    A --> C[Dataset and Model Comparisons]
    A --> D[Data Analysis]
    A --> E[Model Evaluations]
    B --> B1[Measures dataset difficulty by model. 1]
    B --> B2[Difficulty of individual instances. 2]
    B --> B3[Isolate attributes, measure information. 5]
    B --> B4[Analyze PVI across data subsets. 6]
    B --> B5[Identify influential tokens. 7]
    B --> B6[Uncover dataset biases. 8]
    C --> C1[Compare difficulty across datasets. 3]
    C --> C2[Compare models information extraction. 4]
    C --> C3[PVI stable across architectures. 9]
    C --> C4[PVI stable across epochs. 10]
    C --> C5[High PVI, easier for humans. 11]
    C --> C6[Low PVI, often mislabeled. 12]
    D --> D1[Token identity provides most information. 13]
    D --> D2[Word classes indicate ungrammaticality. 14]
    D --> D3[Potential racial bias in labels. 15]
    D --> D4[Measure specific attribute information. 16]
    D --> D5[PVI correlates with confidence. 17]
    D --> D6[Highest PVI: easy-to-learn instances. 18]
    E --> E1[Lowest PVI: hard-to-learn instances. 19]
    E --> E2[Intermediate PVI: ambiguous instances. 20]
    E --> E3[Plateauing PVI with more data. 21]
    E --> E4[Larger models extract more information. 22]
    E --> E5[V-information sensitive to overfitting. 23]
    E --> E6[Similar PVI threshold for errors. 24]
    class A,B,B1,B2,B3,B4,B5,B6 v_info
    class C,C1,C2,C3,C4,C5,C6 difficulty
    class D,D1,D2,D3,D4,D5,D6 analysis
    class E,E1,E2,E3,E4,E5,E6 models
```

Resume:

1.- V-usable information: Framework for measuring dataset difficulty based on how much information a model family V can extract about labels from inputs.

2.- Pointwise V-information (PVI): Measure of difficulty for individual instances within a dataset, based on V-usable information framework.
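Concretely, PVI(x, y) = −log₂ g′[∅](y) + log₂ g[x](y), where g′ is a model finetuned on null inputs and g on the actual inputs, and the dataset-level V-usable information is the mean PVI over instances. A minimal sketch, assuming the two models' gold-label probabilities have already been extracted (function names are illustrative, not from the paper):

```python
import math

def pvi(p_null, p_cond):
    """Pointwise V-information of one instance, in bits.

    p_null: probability the null model g'[null] assigns to the gold label y.
    p_cond: probability the conditional model g[x] assigns to y.
    In practice both come from two finetuned members of the model family V;
    here they are just numbers you supply.
    """
    return -math.log2(p_null) + math.log2(p_cond)

def v_information(null_probs, cond_probs):
    """Dataset-level V-usable information: the mean PVI over instances."""
    pvis = [pvi(pn, pc) for pn, pc in zip(null_probs, cond_probs)]
    return sum(pvis) / len(pvis)
```

For a balanced binary task (null probability 0.5), an instance the conditional model gets exactly right contributes 1 bit, while an instance where the input misleads the model below chance contributes negative PVI.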

3.- Dataset comparisons: V-usable information allows comparing the difficulty of different datasets with respect to the same model family V.

4.- Model comparisons: V-usable information allows comparing how much information different models can extract from the same dataset.

5.- Input transformations: Technique of applying transformations to isolate input attributes and measure their information content about labels.
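One way to realize this: estimate V-information twice, once on the raw inputs and once on inputs passed through a transformation τ that destroys the attribute of interest (e.g. shuffling tokens destroys word order); the drop is the usable information that attribute carried. A sketch with a hypothetical `v_info` estimator standing in for the two-finetuned-model procedure above:

```python
def attribute_information(inputs, transform, v_info):
    """Usable information carried by an input attribute, via ablation.

    transform destroys the attribute (e.g. token shuffling removes word
    order); v_info(inputs) -> bits is a hypothetical callable wrapping
    the usual V-information estimate. The difference is the information
    attributable to what the transformation destroyed.
    """
    return v_info(inputs) - v_info([transform(x) for x in inputs])
```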

6.- Dataset slicing: Analyzing average PVI across different slices/subsets of data to understand difficulty patterns.
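The slicing itself amounts to grouping instances and averaging their PVI per group; a minimal sketch (the slice names below are hypothetical examples, e.g. SNLI label classes):

```python
from collections import defaultdict

def mean_pvi_by_slice(pairs):
    """Average PVI per slice; pairs is an iterable of (slice_name, pvi)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, v in pairs:
        sums[name] += v
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}
```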

7.- Token-level artefacts: Identifying the individual tokens that contribute most to model predictions by measuring the change in V-information when each token is removed.
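A leave-one-out version of this is straightforward: since the null-model term of PVI does not depend on the input, a token's contribution can be read off as the drop in the conditional model's log₂ gold-label probability when that token is removed. A sketch, with `score` a stand-in for g[x](y):

```python
import math

def token_influence(tokens, score):
    """Leave-one-out token attribution.

    score(tokens) -> probability the conditional model assigns to the gold
    label (a stand-in for g[x](y)). The influence of token i is the drop
    in log2-probability when it is removed, which equals the change in the
    instance's PVI because the null-model term cancels.
    """
    base = math.log2(score(tokens))
    return {
        i: base - math.log2(score(tokens[:i] + tokens[i + 1:]))
        for i in range(len(tokens))
    }
```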

8.- Annotation artefacts: Using V-information framework to uncover spurious correlations and biases in datasets that models exploit.

9.- Cross-model consistency: PVI estimates tend to be highly correlated across different model architectures, especially for higher V-information datasets.

10.- Stability across training: PVI estimates remain relatively stable across training epochs and random initializations.

11.- Correlation with human difficulty: Examples humans find easier (higher annotator agreement) tend to have higher PVI.

12.- Mislabeled examples: Instances with very low or negative PVI are often mislabeled.
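In practice this suggests a simple audit: rank instances by PVI and inspect those below a cutoff. A minimal sketch (0 bits is a natural cutoff, since negative PVI means the input is actively misleading relative to knowing nothing, but it is a heuristic choice here, not a value prescribed by the paper):

```python
def flag_suspect_labels(pvis, threshold=0.0):
    """Return indices of instances whose PVI falls below threshold.

    Instances with very low or negative PVI are candidates for manual
    re-annotation; the default threshold is a heuristic, not canonical.
    """
    return [i for i, v in enumerate(pvis) if v < threshold]
```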

13.- SNLI dataset analysis: Revealed that token identity alone provides most usable information, and hypothesis-only baselines extract substantial information.

14.- CoLA dataset analysis: Showed less usable information overall compared to SNLI, with certain word classes indicative of ungrammaticality.

15.- Hate speech detection bias: Analysis of DWMW17 dataset revealed potential racial bias in labeling of offensive language.

16.- Information isolation: Technique to measure information content of specific attributes beyond what is captured by other variables.

17.- Dataset cartography comparison: PVI shows correlation with confidence measure from dataset cartography, offering complementary dataset analysis.

18.- Easy-to-learn instances: Correspond to highest average PVI, indicating most usable information for model.

19.- Hard-to-learn instances: Correspond to lowest average PVI, often indicative of mislabeled examples.

20.- Ambiguous instances: Show intermediate PVI values, containing some but not maximum usable information.

21.- Training data sufficiency: Plateauing of the V-information estimate with increasing training data indicates there is enough data for a reliable estimate.

22.- Model capacity: Larger models tend to extract more V-usable information from datasets.

23.- Overfit detection: V-information more sensitive to overfitting than held-out accuracy.

24.- Dataset difficulty threshold: Models begin making incorrect predictions below a similar PVI threshold across datasets.

25.- Interpretability advantage: V-information framework offers more principled, interpretable difficulty estimates compared to standard performance metrics.

Knowledge Vault built by David Vivancos 2024