Ananya Kumar · Tengyu Ma · Tiffany Vlaar · Aditi Raghunathan · Hanie Sedghi · Yamini Bansal · Sang Michael Xie · Percy Liang · Mathilde Caron

ICLR 2023 · Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

**Concept Graph & Summary using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:**

```mermaid
graph LR
classDef foundation fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef scaling fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef overparametrization fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef regularization fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef diffusion fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef learning fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef language fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Workshop ME-FoMo ICLR 2023] --> B[Foundation models workshop: pre-training, adaptation, emergence. 1]
A --> C[Scaling causes phase transitions, sharp performance changes. 2]
C --> D[Larger width impacts generalization scaling more. 22]
A --> E[Overparametrization: suboptimal training algorithms byproduct. 3]
E --> F[Approximate message passing avoids overparameterization. 3]
E --> G[Bayesian principles suggest overparametrization unnecessary. 27]
A --> H[Optimal regularization mitigates overconfidence, overparameterization. 4]
H --> I[Bayesian neural networks well-calibrated out-of-the-box. 4]
A --> J[Data-dependent kernel eigenvalue spectrum explains generalization. 5]
J --> K[Power law decay of kernel eigenvalues. 21]
A --> L[Diffusion models: effective zero-shot classifiers. 6]
L --> M[Diffusion models enable few-shot learning. 7]
L --> N[Diffusion models: minimax optimal non-parametric distribution estimators. 11]
A --> O[Demonstration ensembling improves in-context few-shot learning. 8]
A --> P[Fine-tuning localizes skills, enables continual learning. 9]
A --> Q[SSL objective impacts Vision Transformer representations. 10]
A --> R[Flipped learning improves zero-shot generalization, robustness. 12]
A --> S[Prompt-based fine-tuning more kernel-like than standard. 13]
A --> T[Higher masking rates can improve pre-training. 14]
T --> U[Optimal masking rate depends on model size, strategy. 25]
A --> V[Language models learn concepts sequentially, reduce perplexity. 15]
V --> W[Validation perplexity aligns with downstream performance. 15]
V --> X[Perplexity doesn't always predict downstream performance. 16]
A --> Y[Pre-training and adaptation should be studied jointly. 16]
A --> Z[Medium-scale models frontier for empirical, theoretical research. 17]
A --> AA[No clear evidence of 'emergent' abilities in LLMs. 18]
A --> AB[Prompt-based fine-tuning induces kernel-like behavior. 26]
A --> AC[SSL objectives impact representations more than architecture. 28]
A --> AD[Language models learn concepts similarly regardless of size. 29]
A --> AE[Medium-scale models with instruction-tuning, human feedback important. 30]
class A,B foundation;
class C,D,J,K scaling;
class E,F,G,H,I overparametrization;
class H,I regularization;
class L,M,N diffusion;
class O,P,Q,R,S,T,U,AB,AC learning;
class V,W,X,Y,Z,AA,AD,AE language;
```


**Summary:**

**1.-**Workshop on mathematical and empirical understanding of foundation models, focusing on pre-training, adaptation, and emergent phenomena.

**2.-**Phase transitions and sharp performance changes can emerge in neural networks as data, model size, and complexity parameters are scaled.

**3.-**Overparameterization may be a byproduct of using suboptimal training algorithms like gradient descent. Approximate message passing avoids needing overparameterization.

**4.-**Optimal regularization mitigates overconfidence in overparameterized neural networks. Bayesian neural networks are well-calibrated out of the box.

**5.-**Power law scaling exponents for generalization error can be explained by the eigenvalue spectrum of data-dependent kernels.
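A minimal numeric sketch of item 5's claim: the eigenvalue spectrum of a data-dependent kernel can be computed and its decay exponent estimated with a log-log fit. The RBF kernel, synthetic Gaussian data, and bandwidth below are illustrative choices, not the paper's setup; only the power-law-fitting idea is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))          # 200 synthetic points in 5 dims

# Gram matrix of an RBF kernel k(x, x') = exp(-||x - x'||^2 / 2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2)

eigvals = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, sorted descending
eigvals = eigvals[eigvals > 1e-10]         # drop numerical zeros

# Fit log(lambda_i) ~ -alpha * log(i): the slope estimates the decay exponent
idx = np.arange(1, len(eigvals) + 1)
alpha = -np.polyfit(np.log(idx), np.log(eigvals), 1)[0]
print(f"estimated decay exponent: {alpha:.2f}")
```

The fitted exponent alpha is the quantity that, per items 5 and 21, governs the power-law scaling of generalization error.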

**6.-**Text-to-image diffusion models like Imagen serve as effective zero-shot classifiers, outperforming CLIP, especially on challenging tasks requiring compositional generalization.
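The zero-shot classification recipe in item 6 can be sketched as follows: score each candidate class by the denoising error the diffusion model incurs when conditioned on that class's text prompt, and predict the class with the lowest error. Here `denoise_error` is a hypothetical stand-in for a real text-to-image diffusion model such as Imagen; only the decision rule is faithful to the summary.

```python
import numpy as np

def denoise_error(image: np.ndarray, prompt: str) -> float:
    # Toy stand-in for a conditional diffusion denoising loss: pretend
    # prompts matching the image's dominant color channel fit best.
    channel = {"a photo of a red thing": 0,
               "a photo of a green thing": 1,
               "a photo of a blue thing": 2}[prompt]
    return float(1.0 - image[..., channel].mean())

def classify(image: np.ndarray, prompts: list[str]) -> str:
    # Lower conditional denoising error => better prompt/image fit.
    errors = {p: denoise_error(image, p) for p in prompts}
    return min(errors, key=errors.get)

prompts = ["a photo of a red thing", "a photo of a green thing",
           "a photo of a blue thing"]
red_image = np.zeros((8, 8, 3))
red_image[..., 0] = 1.0                    # a pure-red toy image
print(classify(red_image, prompts))        # -> "a photo of a red thing"
```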

**7.-**Data augmentation with diffusion models and textual inversion enables few-shot learning that outperforms standard augmentations.

**8.-**Demonstration ensembling, weighting demonstrations by similarity to test input, improves in-context learning over concatenation in few-shot settings.
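Item 8's ensembling step can be sketched numerically: rather than concatenating all demonstrations into one prompt, run the model once per demonstration and combine the per-demonstration label distributions, weighting each by its similarity to the test input. The embeddings and per-demonstration probabilities below are toy stand-ins, not the paper's models.

```python
import numpy as np

def ensemble_predict(test_emb, demo_embs, demo_probs):
    # Cosine similarity between the test input and each demonstration.
    sims = demo_embs @ test_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(test_emb))
    # Softmax over similarities -> weights summing to 1.
    weights = np.exp(sims) / np.exp(sims).sum()
    # Weighted average of the per-demonstration label distributions.
    return weights @ demo_probs

test_emb = np.array([1.0, 0.0])
demo_embs = np.array([[1.0, 0.1], [0.0, 1.0]])    # demo 0 is more similar
demo_probs = np.array([[0.9, 0.1], [0.2, 0.8]])   # per-demo P(label)
probs = ensemble_predict(test_emb, demo_embs, demo_probs)
print(probs)  # the more similar demonstration dominates the ensemble
```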

**9.-**Fine-tuning localizes task-specific skills to small subnetworks, and training on multiple tasks simultaneously embeds non-overlapping skills; grafting these subnetworks enables continual learning.
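The grafting idea in item 9 can be sketched with plain arrays standing in for model parameters: if fine-tuning only moves a small subnetwork, a task's skill can be transplanted by copying just those parameters onto the pre-trained weights. The threshold-based mask below is an illustrative way to identify the subnetwork, not the paper's exact criterion.

```python
import numpy as np

def graft(base, finetuned, threshold=1e-3):
    # Subnetwork = parameters that moved noticeably during fine-tuning.
    mask = np.abs(finetuned - base) > threshold
    grafted = base.copy()
    grafted[mask] = finetuned[mask]        # transplant only those parameters
    return grafted, mask

base = np.zeros(10)
finetuned = base.copy()
finetuned[[2, 7]] += 0.5                   # fine-tuning touched 2 parameters
grafted, mask = graft(base, finetuned)
print(int(mask.sum()))                     # 2 parameters grafted
```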

**10.-**The Self-Supervised Learning (SSL) objective strongly impacts the learned representations in Vision Transformers, more so than architecture.

**11.-**Diffusion models are minimax optimal non-parametric distribution estimators. Sampling and likelihood evaluation have computational-statistical gaps. Manifold structure helps avoid the curse of dimensionality.

**12.-**Flipped learning, predicting instructions from input and label, improves zero-shot generalization and robustness to labels in instruction tuning.

**13.-**A kernel-based analysis shows prompt-based fine-tuning exhibits more kernel-like behavior than standard fine-tuning, explaining its better performance.

**14.-**Masking rates up to 40-50% can yield better pre-training than BERT's 15% masking. Optimal masking rate depends on model size and masking strategy.
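To make the 15% vs. 40-50% comparison in item 14 concrete, here is a sketch of BERT-style token masking at a configurable rate. The token ids and mask id are illustrative, not tied to any specific tokenizer, and the 80/10/10 replacement split BERT also uses is omitted for brevity.

```python
import numpy as np

MASK_ID = 103  # illustrative [MASK] token id

def mask_tokens(token_ids, mask_rate, rng):
    token_ids = np.asarray(token_ids).copy()
    n_mask = max(1, int(round(mask_rate * len(token_ids))))
    positions = rng.choice(len(token_ids), size=n_mask, replace=False)
    labels = np.full(len(token_ids), -100)     # -100 = ignore in the loss
    labels[positions] = token_ids[positions]   # predict only masked tokens
    token_ids[positions] = MASK_ID
    return token_ids, labels

rng = np.random.default_rng(0)
tokens = list(range(1000, 1020))               # a 20-token toy sequence
masked15, _ = mask_tokens(tokens, 0.15, rng)   # BERT's default rate
masked40, _ = mask_tokens(tokens, 0.40, rng)   # a higher rate from item 14
print((masked15 == MASK_ID).sum(), (masked40 == MASK_ID).sum())  # 3 8
```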

**15.-**Large language models first learn the same concepts as smaller models, then reduce perplexity further. Validation perplexity aligns with downstream performance.

**16.-**Pre-training and adaptation (e.g. fine-tuning, prompting) should be studied jointly to inform better pre-training. Perplexity doesn't always predict downstream performance.

**17.-**The "zone of boring" (medium-scale models) is the new frontier for empirical and theoretical ML research, especially with instruction tuning and human feedback.

**18.-**Study did not find clear evidence of "emergent" abilities in large language models when examining full training trajectories.

**19.-**Optimal regularization mitigates overconfidence in overparameterized neural networks trained with SGD. Bayesian neural networks are well-calibrated out of the box.

**20.-**Approximate Message Passing avoids the need for overparameterization in certain settings where SGD requires it for good performance.

**21.-**Scaling trends for generalization error can be theoretically explained by the power law decay of data-dependent kernel eigenvalue spectra.

**22.-**Larger width (not depth) in neural networks has a bigger impact on changing the power law exponent of generalization scaling.

**23.-**Diffusion models with noising/denoising of real data enable flexible data augmentation that significantly improves few-shot learning over standard augmentations.

**24.-**In-context learning performance of large language models aligns much more closely with validation perplexity than with compute or parameters.

**25.-**Increasing masking rate up to 40-50% yields better representations than standard 15% masking in BERT-style pre-training. Optimal rate depends on model size.

**26.-**Prompt-based fine-tuning induces more kernel-like behavior in the final layers compared to standard fine-tuning, possibly explaining its better performance.

**27.-**Bayesian principles suggest overparameterization shouldn't be necessary, but may arise as a crutch due to suboptimal optimization algorithms like SGD.

**28.-**SSL objectives have a bigger impact on learned representations than model architecture in Vision Transformers. Joint embedding and reconstruction learn very different features.

**29.-**Large language models learn concepts in a similar order regardless of size, with perplexity levels aligning across models.

**30.-**Medium-scale models with instruction tuning and human feedback are an important new frontier for empirical and theoretical ML research.

Knowledge Vault built by David Vivancos, 2024