ICLR 2023
Ananya Kumar · Tengyu Ma · Tiffany Vlaar · Aditi Raghunathan · Hanie Sedghi · Yamini Bansal · Sang Michael Xie · Percy Liang · Mathilde Caron
ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

1.-The workshop covers the mathematical and empirical understanding of foundation models, focusing on pre-training, adaptation, and emergent phenomena.

2.-Phase transitions and sharp performance changes can emerge in neural networks as data, model size, and complexity parameters are scaled.

3.-The need for overparameterization may be a byproduct of suboptimal training algorithms like gradient descent. Approximate message passing does not require overparameterization.

4.-Optimal regularization mitigates overconfidence in overparameterized neural networks. Bayesian neural networks are well-calibrated out of the box.

5.-Power law scaling exponents for generalization error can be explained by the eigenvalue spectrum of data-dependent kernels.

6.-Text-to-image diffusion models like Imagen serve as effective zero-shot classifiers, outperforming CLIP, especially on challenging tasks requiring compositional generalization.

7.-Data augmentation with diffusion models and textual inversion enables few-shot learning that outperforms standard augmentations.

8.-Demonstration ensembling, which combines per-demonstration predictions weighted by each demonstration's similarity to the test input, improves in-context learning over concatenating all demonstrations in few-shot settings.

9.-Fine-tuning localizes task-specific skills to small subnetworks. Training on multiple tasks simultaneously embeds the skills in non-overlapping subnetworks. This localization enables continual learning by grafting.

10.-The Self-Supervised Learning (SSL) objective strongly impacts the learned representations in Vision Transformers, more so than architecture.

11.-Diffusion models are minimax optimal non-parametric distribution estimators. Sampling and likelihood evaluation have computational-statistical gaps. Manifold structure helps avoid the curse of dimensionality.

12.-Flipped learning, which trains the model to predict the instruction from the input and label, improves zero-shot generalization and robustness to unseen labels in instruction tuning.

13.-A kernel-based analysis shows that prompt-based fine-tuning exhibits more kernel-like behavior than standard fine-tuning, which may explain its better performance.

14.-Masking rates up to 40-50% can yield better pre-training than BERT's 15% masking. Optimal masking rate depends on model size and masking strategy.

15.-Large language models first learn the same concepts as smaller models, then reduce perplexity further. Validation perplexity aligns with downstream performance.

16.-Pre-training and adaptation (e.g. fine-tuning, prompting) should be studied jointly to inform better pre-training. Perplexity doesn't always predict downstream performance.

17.-The "zone of boring" (medium-scale models) is the new frontier for empirical and theoretical ML research, especially with instruction tuning and human feedback.

18.-One study found no clear evidence of "emergent" abilities in large language models when examining full training trajectories.
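
The demonstration ensembling idea in item 8 can be illustrated with a minimal sketch. It assumes each demonstration has already been run through the model on its own, yielding one label distribution per demonstration, and that embeddings for the test input and the demonstrations are available. The function names, the cosine similarity, and the softmax weighting with a temperature are all hypothetical choices made for illustration; the method presented at the workshop may weight or combine predictions differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predictions(test_emb, demo_embs, demo_label_probs, temperature=1.0):
    """Weighted average of per-demonstration label distributions.

    demo_label_probs[i] stands in for the label distribution the model
    produces when prompted with demonstration i plus the test input
    (an assumption; no model is called here).
    """
    sims = [cosine(test_emb, emb) for emb in demo_embs]
    weights = softmax(sims, temperature)  # more similar demos count more
    n_labels = len(demo_label_probs[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, demo_label_probs))
        for k in range(n_labels)
    ]


if __name__ == "__main__":
    test_emb = [1.0, 0.0]
    demo_embs = [[1.0, 0.0], [0.0, 1.0]]
    demo_label_probs = [[0.9, 0.1], [0.2, 0.8]]
    print(ensemble_predictions(test_emb, demo_embs, demo_label_probs))
```

In contrast to concatenation, where all demonstrations share one prompt, each demonstration here contributes a separate prediction, so a demonstration that resembles the test input can dominate the final distribution.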

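To make the masking rate in item 14 concrete, here is a simplified sketch of masking a token sequence at a configurable rate. It is an illustration only: the `mask_tokens` helper and the `[MASK]` placeholder string are hypothetical, positions are chosen uniformly at random, and BERT's practice of sometimes keeping the original token or substituting a random one is omitted.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed for illustration


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace a fraction `mask_rate` of tokens with MASK.

    Returns the masked sequence and a dict mapping each masked
    position to the original token (the prediction targets).
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


if __name__ == "__main__":
    tokens = [f"tok{i}" for i in range(20)]
    for rate in (0.15, 0.40):
        masked, targets = mask_tokens(tokens, mask_rate=rate, seed=0)
        print(rate, len(targets), masked)
```

A higher rate gives the model more prediction targets per sequence but less surrounding context for each one, which is one way to read the trade-off behind the finding that the optimal rate depends on model size and masking strategy.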
