Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Jascha's research trajectory included a physics undergraduate degree at Cornell, work on Mars rovers at NASA JPL, and graduate school in computational neuroscience at Berkeley.
2.-The talk will cover three main themes: learned optimizers are starting to work well, they will transform how models are trained, and they need more foundational research.
3.-Despite uncertainty, AI is transforming the world rapidly and early decisions by individuals can have huge consequences, for better or worse.
4.-Examples are given of small individual choices that had big impacts, like standardized email protocols and freely sharing the HeLa cell line.
5.-Jascha urges intentionality and thoughtfulness in AI research choices, as individuals have immense leverage to shape the future landscape of AI.
6.-Learned optimizers are introduced: rather than hand-designing the parameter update rule, the update rule itself is learned, with an outer loop optimizing the inner optimization process.
7.-Per-parameter neural networks, which apply a small MLP independently to each parameter's features, are a simple architecture choice for learned optimizers; with purely linear layers they reduce to variants of gradient descent (see the first sketch after this list).
8.-Key research challenges for learned optimizers are outlined - chaos/instability, compute cost, generalization to new tasks.
9.-Chaos results from nesting one dynamical system (outer training) around another (inner training); the best-performing optimizer parameters sit near the edge of inner-loop training instability.
10.-When the optimizer is unrolled for many inner steps, the outer loss landscape becomes extremely chaotic, varying at finer-than-pixel resolution in visualizations, which makes outer optimization difficult (a probing sketch follows the list).
11.-Smoothing the outer loss landscape, by averaging the loss over random perturbations of the optimizer parameters, helps tame the chaos.
12.-Evolution strategies compute gradients of this smoothed loss; antithetic sampling drastically reduces their variance, whereas reparameterization (backprop-through-unroll) gradients have prohibitively high variance (an estimator is sketched after this list).
13.-Outer training is very expensive because each outer step requires many inner steps and ES gradients are high-variance; partial (truncated) inner unrolls help (see the truncated-unroll sketch after this list).
14.-Generalization to new tasks requires large diverse meta-training sets and data augmentation. Generalizing across problem scale is an open challenge.
15.-Useful architecture themes: give the optimizer many input features, trade off compute overhead against performance, use hierarchical computation, normalize features, and build in stability (feature normalization is sketched after this list).
16.-The ideal learned optimizer architecture is still to be determined and an open research problem.
17.-VeLO, the current state-of-the-art learned optimizer, was meta-trained with a large compute budget, aiming to work on any neural network training task with no hyperparameter tuning.
18.-On unseen ML benchmark tasks, VeLO matched or beat tuned baselines on 5 of 6 tasks without any tuning; it struggled on an out-of-distribution GNN task.
19.-VeLO outperforms well-tuned standard optimizers on speedup-versus-fraction-of-tasks metrics, and its worst-case behavior is reasonable rather than catastrophic.
20.-VeLO makes more efficient use of large batch sizes than other first-order methods, and it adapts to training length and parameter type.
21.-VeLO fails on out-of-distribution settings such as very large models, long training runs, and RL tasks not seen during meta-training.
22.-Many other approaches to learned optimizers are outlined - decoupling meta and inner objectives, specializing to narrow tasks, RL, hyperparameter control, etc.
23.-As compute and task data increase, learned optimizers, losses, and architectures are expected to surpass hand-designed ones, causing a "meta-learning revolution."
24.-Code and examples are available for training simple learned optimizers. Scaling them up currently remains a challenge best suited to well-resourced labs.
25.-Learned optimizers implicitly discover techniques used in hand-designed optimizers. More rigorous generalization experiments across problem types are needed.
26.-Integrating curvature information is a promising direction for learned optimizers to improve on historically brittle second-order optimizers.
27.-Simple learned optimizers can be trained in a Colab notebook in minutes. Making the methods more efficient is a key research problem.
28.-AI is progressing rapidly and individual researchers have immense leverage to shape its trajectory for better or worse through their choices.
29.-Learned optimizers are starting to surpass hand-designed ones and will likely transform how models are trained as compute and data increase.
30.-Many open challenges remain in learned optimizers, including stability, generalization, compute efficiency, and ideal architectures, making it a fascinating research area.
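The sketches below illustrate some of the mechanisms summarized above. First, a minimal per-parameter learned optimizer in JAX (points 6-7): the feature choices, network sizes, and function names are illustrative assumptions, not the architecture from the talk.

```python
import jax
import jax.numpy as jnp

def init_meta_params(key, n_features=2, hidden=16):
    """Meta-parameters: a tiny MLP applied independently to every parameter."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": 0.1 * jax.random.normal(k1, (n_features, hidden)),
        "b1": jnp.zeros(hidden),
        "w2": 0.1 * jax.random.normal(k2, (hidden, 1)),
    }

def learned_update(meta_params, grad, momentum):
    """Per-parameter update rule.  Without the tanh nonlinearity this is just a
    linear combination of gradient and momentum, i.e. a gradient-descent /
    momentum variant (point 7)."""
    feats = jnp.stack([grad, momentum], axis=-1)               # (..., n_features)
    h = jnp.tanh(feats @ meta_params["w1"] + meta_params["b1"])
    return 0.01 * (h @ meta_params["w2"])[..., 0]              # small scale for stability

def inner_step(meta_params, params, momentum, grads, decay=0.9):
    """One inner-loop step: refresh momentum, then apply the learned rule."""
    momentum = jax.tree_util.tree_map(lambda m, g: decay * m + (1 - decay) * g,
                                      momentum, grads)
    params = jax.tree_util.tree_map(lambda p, g, m: p - learned_update(meta_params, g, m),
                                    params, grads, momentum)
    return params, momentum

def outer_loss(meta_params, init_params, batches, task_loss, n_inner=10):
    """Outer objective: task loss after unrolling n_inner learned-optimizer steps."""
    params = init_params
    momentum = jax.tree_util.tree_map(jnp.zeros_like, params)
    for batch in batches[:n_inner]:
        grads = jax.grad(task_loss)(params, batch)
        params, momentum = inner_step(meta_params, params, momentum, grads)
    return task_loss(params, batches[-1])
```

The outer loop then adjusts meta_params to reduce outer_loss, for example with the evolution-strategies estimator sketched below.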
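One way to see the chaos described in points 9-10 is to evaluate the unsmoothed outer loss along a very fine one-dimensional slice through the optimizer parameters; for long inner unrolls the values fluctuate at finer-than-pixel resolution. This helper is purely illustrative; the flattened outer_loss_flat and the direction vector are assumptions.

```python
import jax.numpy as jnp

def outer_loss_slice(outer_loss_flat, theta_flat, direction, width=1e-3, n=512):
    """Evaluate the outer loss at theta_flat + t * direction for tiny offsets t.
    For long inner unrolls the resulting curve shows near-fractal variation,
    which is why raw outer gradients are uninformative (point 10)."""
    ts = jnp.linspace(-width, width, n)
    return jnp.array([outer_loss_flat(theta_flat + t * direction) for t in ts])
```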
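A sketch of the smoothed-loss gradient from points 11-12: Gaussian-smooth the outer loss over random perturbations of the optimizer parameters, and estimate its gradient with evolution strategies, pairing each perturbation with its negation (antithetic sampling) to cut variance. Function names and default values are assumptions.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def es_grad(outer_loss, meta_params, key, sigma=0.01, n_pairs=32):
    """Estimate the gradient of E_eps[outer_loss(theta + sigma * eps)], the
    Gaussian-smoothed outer loss, with antithetic evolution strategies."""
    theta, unravel = ravel_pytree(meta_params)
    eps = jax.random.normal(key, (n_pairs, theta.shape[0]))
    contributions = []
    for e in eps:
        l_plus = outer_loss(unravel(theta + sigma * e))
        l_minus = outer_loss(unravel(theta - sigma * e))
        # Antithetic pair: the difference cancels much of the shared noise.
        contributions.append((l_plus - l_minus) / (2.0 * sigma) * e)
    grad_flat = jnp.mean(jnp.stack(contributions), axis=0)
    return unravel(grad_flat)
```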
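The truncated-unroll idea from point 13, sketched as an outer training loop that only unrolls a short inner segment per outer step and carries the inner state forward; es_grad_fn, unroll_segment, and init_inner_state are assumed callables (for example built from the sketches above), not a real API.

```python
import jax

def truncated_outer_training(es_grad_fn, unroll_segment, init_inner_state, meta_params,
                             key, n_outer=1000, outer_lr=3e-4, segment_len=20,
                             max_inner_steps=2000):
    """Outer training with partial inner unrolls.  Each outer step unrolls only
    `segment_len` inner steps from the carried inner state, estimates an ES
    gradient on that segment's loss, updates the optimizer parameters, and
    restarts the inner problem once its full length is exhausted."""
    inner_state, inner_t = init_inner_state(), 0
    for _ in range(n_outer):
        key, k_es = jax.random.split(key)
        # Loss of applying the current learned optimizer for one short segment.
        segment_loss = lambda theta: unroll_segment(theta, inner_state, segment_len)[0]
        grads = es_grad_fn(segment_loss, meta_params, k_es)
        meta_params = jax.tree_util.tree_map(lambda p, g: p - outer_lr * g,
                                             meta_params, grads)
        # Carry the inner state forward (truncation); restart when done.
        _, inner_state = unroll_segment(meta_params, inner_state, segment_len)
        inner_t += segment_len
        if inner_t >= max_inner_steps:
            inner_state, inner_t = init_inner_state(), 0
    return meta_params
```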
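Finally, one of the architecture themes from point 15 (normalize features, build in stability) as a small hedged example: rescale raw per-parameter features by an Adam/RMSProp-style second-moment estimate and clip them so the optimizer sees comparably scaled inputs across layers and tasks. The specific features and clipping range are assumptions.

```python
import jax.numpy as jnp

def normalized_features(grad, momentum, second_moment, eps=1e-8):
    """Rescale raw features by a running second-moment estimate so their
    magnitudes are comparable across parameters, layers, and problems, then
    clip to guard against outlier gradients (a built-in stability measure)."""
    scale = jnp.sqrt(second_moment) + eps
    feats = jnp.stack([grad / scale, momentum / scale], axis=-1)
    return jnp.clip(feats, -1.0, 1.0)
```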
Knowledge Vault built by David Vivancos 2024