Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:
Resume:
1.-Workshop on mathematical and empirical understanding of foundation models, focusing on pre-training, adaptation, and emergent phenomena.
2.-Phase transitions and sharp performance changes can emerge in neural networks as data, model size, and complexity parameters are scaled.
3.-Overparameterization may be a byproduct of using suboptimal training algorithms like gradient descent. Approximate message passing avoids needing overparameterization.
4.-Optimal regularization mitigates overconfidence in overparameterized neural networks. Bayesian neural networks are well-calibrated out of the box.
5.-Power law scaling exponents for generalization error can be explained by the eigenvalue spectrum of data-dependent kernels (derivation sketch after the list).
6.-Text-to-image diffusion models like Imagen serve as effective zero-shot classifiers, outperforming CLIP, especially on challenging tasks requiring compositional generalization (classification sketch after the list).
7.-Data augmentation with diffusion models and textual inversion enables few-shot learning that outperforms standard augmentations (augmentation sketch after the list).
8.-Demonstration ensembling, which weights demonstrations by their similarity to the test input, improves in-context learning over simple concatenation in few-shot settings (ensembling sketch after the list).
9.-Fine-tuning localizes task-specific skills to small subnetworks. Training on multiple tasks simultaneously embeds non-overlapping skills, enabling continual learning by grafting (grafting sketch after the list).
10.-The Self-Supervised Learning (SSL) objective strongly impacts the learned representations in Vision Transformers, more so than architecture.
11.-Diffusion models are minimax optimal non-parametric distribution estimators. Sampling and likelihood evaluation have computational-statistical gaps. Manifold structure helps avoid the curse of dimensionality.
12.-Flipped learning, predicting the instruction from the input and label, improves zero-shot generalization and robustness to novel labels in instruction tuning (formatting sketch after the list).
13.-A kernel-based analysis shows prompt-based fine-tuning exhibits more kernel-like behavior than standard fine-tuning, explaining its better performance (kernel sketch after the list).
14.-Masking rates of 40-50% can yield better pre-training than BERT's standard 15% masking. The optimal masking rate depends on model size and masking strategy (masking sketch after the list).
15.-Large language models first learn the same concepts as smaller models, then reduce perplexity further. Validation perplexity aligns with downstream performance.
16.-Pre-training and adaptation (e.g. fine-tuning, prompting) should be studied jointly to inform better pre-training. Perplexity doesn't always predict downstream performance.
17.-The "zone of boring" (medium-scale models) is the new frontier for empirical and theoretical ML research, especially with instruction tuning and human feedback.
18.-Examining full training trajectories, the study did not find clear evidence of "emergent" abilities in large language models.
19.-Optimal regularization mitigates overconfidence in overparameterized neural networks trained with SGD. Bayesian neural networks are well-calibrated out of the box.
20.-Approximate Message Passing avoids the need for overparameterization in certain settings where SGD requires it for good performance.
21.-Scaling trends for generalization error can be theoretically explained by the power law decay of data-dependent kernel eigenvalue spectra.
22.-Larger width (not depth) in neural networks has a bigger impact on changing the power law exponent of generalization scaling.
23.-Diffusion models with noising/denoising of real data enable flexible data augmentation that significantly improves few-shot learning over standard augmentations.
24.-In-context learning performance of large language models aligns much more closely with validation perplexity than with compute or parameters.
25.-Increasing masking rate up to 40-50% yields better representations than standard 15% masking in BERT-style pre-training. Optimal rate depends on model size.
26.-Prompt-based fine-tuning induces more kernel-like behavior in the final layers compared to standard fine-tuning, possibly explaining its better performance.
27.-Bayesian principles suggest overparameterization shouldn't be necessary, but may arise as a crutch due to suboptimal optimization algorithms like SGD.
28.-SSL objectives have a bigger impact on learned representations than model architecture in Vision Transformers. Joint embedding and reconstruction learn very different features.
29.-Large language models learn concepts in a similar order regardless of size, with perplexity levels aligning across models.
30.-Medium-scale models with instruction tuning and human feedback are an important new frontier for empirical and theoretical ML research.
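Derivation sketch for items 5, 21, and 22, assuming kernel regression (e.g. an NTK-style description of the network) with a power-law eigenvalue spectrum; the exponent conventions are illustrative and differ across papers.

\[
K(x,x') = \sum_{k} \lambda_k\, \phi_k(x)\, \phi_k(x'), \qquad
\lambda_k \propto k^{-\alpha}, \qquad
\lambda_k \bar{w}_k^{\,2} \propto k^{-\beta}.
\]
With \(n\) training samples the learner fits roughly the top \(O(n)\) eigenmodes, so the error is dominated by the unfit tail:
\[
E_g(n) \;\approx\; \sum_{k \gtrsim n} \lambda_k \bar{w}_k^{\,2} \;\propto\; n^{-(\beta-1)},
\]
i.e. the scaling exponent is set by how fast the data-dependent kernel spectrum (and the target's alignment with it) decays. Changing width changes the effective kernel, which is one route by which width rather than depth can shift this exponent (item 22).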
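Item 6, sketched: a text-to-image diffusion model can classify zero-shot by scoring each candidate label with its denoising error under that label's text conditioning. The model.add_noise / model.predict_noise interface below is a hypothetical placeholder, not Imagen's actual API.

import torch

def diffusion_zero_shot_classify(model, image, class_prompts, n_samples=32):
    """Pick the class whose text conditioning best explains the image.

    For each candidate prompt, noise the image at random timesteps and measure
    how well the prompt-conditioned denoiser predicts that noise; a lower
    average error means the prompt describes the image better.
    """
    scores = []
    for prompt in class_prompts:
        errs = []
        for _ in range(n_samples):
            t = torch.randint(0, model.num_timesteps, (1,))
            noisy, noise = model.add_noise(image, t)           # forward (noising) process
            pred = model.predict_noise(noisy, t, text=prompt)  # prompt-conditioned denoiser
            errs.append(torch.mean((pred - noise) ** 2))
        scores.append(torch.stack(errs).mean())
    return int(torch.argmin(torch.stack(scores)))              # index of the best-matching class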
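Items 7 and 23, sketched: augment a few-shot class by partially noising a real example and denoising it back under a text prompt. This uses the Hugging Face diffusers img2img pipeline as a stand-in; the method discussed in the talk also learns class tokens via textual inversion, which is omitted here.

import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment(image, prompt, n_aug=4, strength=0.5):
    """Partially noise `image`, then denoise it conditioned on `prompt`.

    `strength` controls how much noise is added: small values keep the
    augmentation close to the original, larger values add semantic variation.
    """
    return [pipe(prompt=prompt, image=image, strength=strength).images[0]
            for _ in range(n_aug)]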
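Item 8, sketched: each demonstration yields its own in-context prediction, and predictions are combined with weights given by similarity to the test input. embed and label_logprobs are hypothetical helpers standing in for a sentence encoder and an LLM label-scoring call.

import numpy as np

def demo_ensemble_predict(test_input, demos, labels, embed, label_logprobs):
    """Weighted ensemble of single-demonstration in-context predictions.

    demos: list of (demo_input, demo_label) pairs.
    embed(text) -> vector; label_logprobs(prompt, labels) -> array of
    log-probabilities, one per candidate label (assumed interfaces).
    """
    q = embed(test_input)
    weights, preds = [], []
    for demo_input, demo_label in demos:
        # One prompt per demonstration instead of concatenating all of them.
        prompt = f"Input: {demo_input}\nLabel: {demo_label}\nInput: {test_input}\nLabel:"
        preds.append(label_logprobs(prompt, labels))
        # Similarity of this demonstration to the test input sets its weight.
        d = embed(demo_input)
        weights.append(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    weights = np.exp(weights) / np.sum(np.exp(weights))   # softmax over similarities
    combined = np.average(np.stack(preds), axis=0, weights=weights)
    return labels[int(np.argmax(combined))]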
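Item 9, sketched: graft a task's subnetwork from a fine-tuned model onto the base model. Selecting the most-changed parameters is the simplest possible localization rule and is an assumption here, not the paper's exact procedure.

import torch

def graft_task(base_model, finetuned_model, keep_fraction=0.01):
    """Copy only the most-changed parameters of a fine-tuned model onto the base model.

    If task skills are localized to a small subnetwork, grafting that subnetwork
    should transfer most of the task performance while leaving the rest of the
    base model (and other tasks' subnetworks) untouched.
    """
    base = dict(base_model.named_parameters())
    tuned = dict(finetuned_model.named_parameters())
    with torch.no_grad():
        # Rank every weight by how far fine-tuning moved it.
        deltas = torch.cat([(tuned[n] - base[n]).abs().flatten() for n in base])
        k = max(int((1.0 - keep_fraction) * deltas.numel()), 1)
        threshold = torch.kthvalue(deltas, k).values
        for name, p in base.items():
            mask = (tuned[name] - p).abs() >= threshold   # this task's subnetwork
            p[mask] = tuned[name][mask]
    return base_model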
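Item 12, sketched: the flipped-learning data format versus standard instruction tuning. The template strings are illustrative, not the exact prompts from the paper.

def standard_example(instruction, text, label):
    """Standard instruction tuning: condition on instruction + input, predict the label."""
    return {"source": f"{instruction}\nInput: {text}", "target": label}

def flipped_example(instruction, text, label):
    """Flipped learning: condition on input + label, predict the instruction.

    At inference, candidate labels are scored by how likely the instruction is
    given (input, label), which reduces dependence on the label's surface form.
    """
    return {"source": f"Input: {text}\nLabel: {label}", "target": instruction}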
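Kernel sketch for items 13 and 26, using the standard linearization around the pre-trained weights \(\theta_0\) (generic notation, not the paper's):

\[
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),
\qquad
K(x,x') = \nabla_\theta f(x;\theta_0)^{\top}\, \nabla_\theta f(x';\theta_0).
\]
Fine-tuning is "kernel-like" to the extent that optimization stays in this linear regime, where the solution is governed by the empirical neural tangent kernel \(K\). One intuition: prompt-based fine-tuning reuses the pre-trained language-model head, so task outputs are already meaningful at initialization and updates stay closer to this regime than with a freshly initialized classification head.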
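Items 14 and 25, sketched: a masked-language-modeling corruption routine with a configurable rate; the 80/10/10 replacement split follows BERT's convention, and the point is simply that mask_rate is a tunable hyperparameter rather than a fixed 15%.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_rate=0.4):
    """BERT-style masking with a configurable masking rate.

    Labels keep the original ids at selected positions and are -100 elsewhere
    (ignored by the loss). Of the selected positions, ~80% become [MASK],
    ~10% a random token, and ~10% are left unchanged.
    """
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_rate
    labels[~selected] = -100

    corrupted = input_ids.clone()
    replaced = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replaced] = mask_token_id

    randomized = selected & ~replaced & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    return corrupted, labels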
Knowledge Vault built by David Vivancos 2024