Knowledge Vault 2/97 - ICLR 2014-2023
Ananya Kumar · Tengyu Ma · Tiffany Vlaar · Aditi Raghunathan · Hanie Sedghi · Yamini Bansal · Sang Michael Xie · Percy Liang · Mathilde Caron · ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef foundation fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef scaling fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef overparametrization fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef regularization fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef diffusion fill:#f9d4f9, font-weight:bold, font-size:14px;
  classDef learning fill:#d4f9f9, font-weight:bold, font-size:14px;
  classDef language fill:#f9d4d4, font-weight:bold, font-size:14px;
  A[Workshop ME-FoMo ICLR 2023] --> B[Foundation models workshop: pre-training, adaptation, emergence. 1]
  A --> C[Scaling causes phase transitions, sharp performance changes. 2]
  C --> D[Larger width impacts generalization scaling more. 22]
  A --> E[Overparametrization: suboptimal training algorithms byproduct. 3]
  E --> F[Approximate message passing avoids overparameterization. 3]
  E --> G[Bayesian principles suggest overparametrization unnecessary. 27]
  A --> H[Optimal regularization mitigates overconfidence, overparameterization. 4]
  H --> I[Bayesian neural networks well-calibrated out-of-the-box. 4]
  A --> J[Data-dependent kernel eigenvalue spectrum explains generalization. 5]
  J --> K[Power law decay of kernel eigenvalues. 21]
  A --> L[Diffusion models: effective zero-shot classifiers. 6]
  L --> M[Diffusion models enable few-shot learning. 7]
  L --> N[Diffusion models: minimax optimal non-parametric distribution estimators. 11]
  A --> O[Demonstration ensembling improves in-context few-shot learning. 8]
  A --> P[Fine-tuning localizes skills, enables continual learning. 9]
  A --> Q[SSL objective impacts Vision Transformer representations. 10]
  A --> R[Flipped learning improves zero-shot generalization, robustness. 12]
  A --> S[Prompt-based fine-tuning more kernel-like than standard. 13]
  A --> T[Higher masking rates can improve pre-training. 14]
  T --> U[Optimal masking rate depends on model size, strategy. 25]
  A --> V[Language models learn concepts sequentially, reduce perplexity. 15]
  V --> W[Validation perplexity aligns with downstream performance. 15]
  V --> X[Perplexity doesn't always predict downstream performance. 16]
  A --> Y[Pre-training and adaptation should be studied jointly. 16]
  A --> Z[Medium-scale models frontier for empirical, theoretical research. 17]
  A --> AA[No clear evidence of 'emergent' abilities in LLMs. 18]
  A --> AB[Prompt-based fine-tuning induces kernel-like behavior. 26]
  A --> AC[SSL objectives impact representations more than architecture. 28]
  A --> AD[Language models learn concepts similarly regardless of size. 29]
  A --> AE[Medium-scale models with instruction-tuning, human feedback important. 30]
  class A,B foundation;
  class C,D,J,K scaling;
  class E,F,G,H,I overparametrization;
  class H,I regularization;
  class L,M,N diffusion;
  class O,P,Q,R,S,T,U,AB,AC learning;
  class V,W,X,Y,Z,AA,AD,AE language;

Resume:

1.-Workshop on the Mathematical and Empirical Understanding of Foundation Models (ME-FoMo), focusing on pre-training, adaptation, and emergent phenomena.

2.-Phase transitions and sharp performance changes can emerge in neural networks as data, model size, and complexity parameters are scaled.

3.-Overparameterization may be a byproduct of using suboptimal training algorithms like gradient descent. Approximate message passing avoids needing overparameterization.

4.-Optimal regularization mitigates overconfidence in overparameterized neural networks. Bayesian neural networks are well-calibrated out of the box.

5.-Power law scaling exponents for generalization error can be explained by the eigenvalue spectrum of data-dependent kernels.
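
A hedged schematic of the kernel-spectrum argument in item 5 (the symbols below are generic placeholders, not exponents taken from the workshop paper): when both the kernel eigenvalues and the target's coefficients in the kernel eigenbasis decay as power laws, the learning curve itself becomes a power law in the training-set size n, with an exponent set by those two decays.

$$\lambda_k \propto k^{-a}, \qquad c_k^2 \propto k^{-b} \;\;\Longrightarrow\;\; \mathcal{E}(n) \propto n^{-\beta(a,\,b)}$$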

6.-Text-to-image diffusion models like Imagen serve as effective zero-shot classifiers, outperforming CLIP, especially on challenging tasks requiring compositional generalization.
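
A minimal Python sketch of the zero-shot classification recipe summarized in item 6, assuming the diffusion model is wrapped as a `denoise_error_fn(image, prompt, t, noise)` callable that returns the noise-prediction error under a given text prompt (an assumed interface for illustration, not Imagen's API):

```python
import numpy as np

def diffusion_zero_shot_classify(image, class_prompts, denoise_error_fn,
                                 timesteps=(100, 300, 500, 700, 900), seed=0):
    """Predict the class whose text conditioning best explains the image."""
    rng = np.random.default_rng(seed)
    scores = []
    for prompt in class_prompts:
        errors = []
        for t in timesteps:
            noise = rng.standard_normal(np.shape(image))      # fresh Gaussian noise per timestep
            errors.append(denoise_error_fn(image, prompt, t, noise))
        scores.append(np.mean(errors))                        # lower denoising error = better fit
    return int(np.argmin(scores))
```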

7.-Data augmentation with diffusion models and textual inversion enables few-shot learning that outperforms standard augmentations.

8.-Demonstration ensembling, which weights each demonstration by its similarity to the test input, improves in-context learning over plain concatenation in few-shot settings.
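
A minimal Python sketch of the ensembling step in item 8, assuming the model has already produced one label distribution per demonstration (e.g. by prompting with that demonstration alone plus the test input) and that sentence embeddings are available for the weighting; the names and interface here are illustrative, not the paper's:

```python
import numpy as np

def ensemble_demonstrations(test_emb, demo_embs, demo_label_probs, temperature=1.0):
    """Combine per-demonstration predictions, weighted by similarity to the test input."""
    # Cosine similarity between the test input and each demonstration.
    sims = np.array([
        np.dot(test_emb, d) / (np.linalg.norm(test_emb) * np.linalg.norm(d))
        for d in demo_embs
    ])
    # Softmax over similarities gives the ensembling weights.
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    # Weighted average of the per-demonstration label distributions.
    probs = np.average(np.asarray(demo_label_probs), axis=0, weights=weights)
    return probs / probs.sum()
```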

9.-Fine-tuning localizes task-specific skills to small subnetworks; training on multiple tasks simultaneously embeds largely non-overlapping skills, which enables continual learning by grafting these subnetworks.

10.-The Self-Supervised Learning (SSL) objective strongly impacts the learned representations in Vision Transformers, more so than architecture.

11.-Diffusion models are minimax optimal non-parametric distribution estimators. Sampling and likelihood evaluation have computational-statistical gaps. Manifold structure helps avoid the curse of dimensionality.

12.-Flipped learning, which trains the model to predict the instruction from the input and label, improves zero-shot generalization and robustness to label choices in instruction tuning.
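
A sketch of how label inference works under the flipped setup in item 12, assuming a helper `instruction_logprob_fn(instruction, x, y)` that returns the model's log-likelihood of the instruction given the input paired with a candidate label (a hypothetical wrapper, not the paper's code):

```python
def flipped_label_inference(instruction, x, candidate_labels, instruction_logprob_fn):
    """Pick the label under which the instruction itself is most likely."""
    scores = {y: instruction_logprob_fn(instruction, x, y) for y in candidate_labels}
    return max(scores, key=scores.get)
```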

13.-A kernel-based analysis shows prompt-based fine-tuning exhibits more kernel-like behavior than standard fine-tuning, explaining its better performance.
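
The kernel view in item 13 compares fine-tuning to learning with the empirical neural tangent kernel; below is a self-contained illustration of that kernel for any scalar-output model `f(params, x)`, using finite-difference gradients (purely illustrative, not the paper's analysis code):

```python
import numpy as np

def numerical_grad(f, params, x, eps=1e-5):
    """Finite-difference gradient of the scalar output f(params, x) w.r.t. params."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        bump = np.zeros_like(params)
        bump.flat[i] = eps
        grad.flat[i] = (f(params + bump, x) - f(params - bump, x)) / (2 * eps)
    return grad

def empirical_ntk(f, params, x1, x2):
    """Empirical NTK entry K(x1, x2) = <grad f(x1), grad f(x2)> at the current params."""
    return float(np.dot(numerical_grad(f, params, x1), numerical_grad(f, params, x2)))
```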

14.-Masking rates up to 40-50% can yield better pre-training than BERT's 15% masking. Optimal masking rate depends on model size and masking strategy.
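
An illustrative masking routine for item 14 with a configurable rate; this is a simplified version of BERT-style masking that always substitutes [MASK] (omitting the 80/10/10 split), not the paper's code:

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_rate=0.4, seed=0):
    """Mask a fraction of tokens for MLM pre-training; BERT's default rate is 0.15."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    labels = np.full_like(token_ids, -100)          # -100 = position ignored by the MLM loss
    mask = rng.random(token_ids.shape) < mask_rate  # choose positions to mask
    labels[mask] = token_ids[mask]                  # loss is computed only on masked positions
    corrupted = np.where(mask, mask_id, token_ids)  # replace chosen tokens with [MASK]
    return corrupted, labels
```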

15.-Large language models first learn the same concepts as smaller models, then reduce perplexity further. Validation perplexity aligns with downstream performance.
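
For reference on items 15 and 24, validation perplexity is simply the exponentiated average per-token negative log-likelihood on held-out text; a minimal helper:

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token) over a held-out set."""
    return float(np.exp(-np.mean(token_log_probs)))
```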

16.-Pre-training and adaptation (e.g. fine-tuning, prompting) should be studied jointly to inform better pre-training. Perplexity doesn't always predict downstream performance.

17.-The "zone of boring" (medium-scale models) is the new frontier for empirical and theoretical ML research, especially with instruction tuning and human feedback.

18.-The study did not find clear evidence of "emergent" abilities in large language models when examining full training trajectories.

19.-Optimal regularization mitigates overconfidence in overparameterized neural networks trained with SGD. Bayesian neural networks are well-calibrated out of the box.

20.-Approximate Message Passing avoids the need for overparameterization in certain settings where SGD requires it for good performance.

21.-Scaling trends for generalization error can be theoretically explained by the power law decay of data-dependent kernel eigenvalue spectra.

22.-Larger width (not depth) in neural networks has a bigger impact on changing the power law exponent of generalization scaling.

23.-Diffusion models with noising/denoising of real data enable flexible data augmentation that significantly improves few-shot learning over standard augmentations.
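
A hedged sketch of the noising/denoising augmentation in item 23, assuming a `denoise_fn(noisy_image, noise_level)` wrapper that runs a diffusion model's reverse process from the given intermediate noise level (an assumed interface in the spirit of partial forward/backward editing, not the paper's code):

```python
import numpy as np

def diffusion_augment(image, denoise_fn, noise_level=0.3, n_aug=4, seed=0):
    """Create augmentations by partially noising a real image, then denoising it back."""
    rng = np.random.default_rng(seed)
    augmented = []
    for _ in range(n_aug):
        # Variance-preserving-style mix of the clean image and Gaussian noise.
        noisy = (np.sqrt(1.0 - noise_level**2) * np.asarray(image)
                 + noise_level * rng.standard_normal(np.shape(image)))
        augmented.append(denoise_fn(noisy, noise_level))
    return augmented
```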

24.-In-context learning performance of large language models aligns much more closely with validation perplexity than with compute or parameters.

25.-Increasing masking rate up to 40-50% yields better representations than standard 15% masking in BERT-style pre-training. Optimal rate depends on model size.

26.-Prompt-based fine-tuning induces more kernel-like behavior in the final layers compared to standard fine-tuning, possibly explaining its better performance.

27.-Bayesian principles suggest overparameterization shouldn't be necessary, but may arise as a crutch due to suboptimal optimization algorithms like SGD.

28.-SSL objectives have a bigger impact on learned representations than model architecture in Vision Transformers. Joint embedding and reconstruction learn very different features.

29.-Large language models learn concepts in a similar order regardless of size, with perplexity levels aligning across models.

30.-Medium-scale models with instruction tuning and human feedback are an important new frontier for empirical and theoretical ML research.

Knowledge Vault built by David Vivancos 2024