Knowledge Vault 2/94 - ICLR 2014-2023
Jascha Sohl-Dickstein ICLR 2023 - Invited Talk - Learned optimizers: why they're the future, why they're hard, and what they can do now
[Resume image]

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef research fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef themes fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef impact fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef optimizers fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef challenges fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef architecture fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef sota fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef approaches fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef future fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Jascha Sohl-Dickstein ICLR 2023] --> B[Jascha's background: physics, NASA, neuroscience 1]
A --> C[Talk themes: optimizers working, transformative, need research 2]
A --> D[AI transforming world, early decisions have consequences 3]
D --> E[Small choices can have big impacts 4]
D --> F[Urges intentionality in AI research 5]
A --> G[Learned optimizers: learning parameter update rule 6]
G --> H[Per-parameter networks reduce to gradient descent variants 7]
G --> I[Key challenges: chaos, cost, generalization 8]
I --> J[Chaos from nested dynamical systems 9]
I --> K[Chaotic loss landscapes difficult optimization 10]
I --> L[Smoothing loss by averaging perturbations 11]
L --> M[Evolution strategies for gradients, antithetic sampling reduces variance 12]
I --> N[Expensive outer training, partial unrolls help 13]
I --> O[Generalization needs diverse meta-training, data augmentation 14]
G --> P[Architecture themes: features, overhead vs performance, normalize, stability 15]
P --> Q[Ideal architecture still open problem 16]
A --> R[VeLO: SOTA optimizer, large compute 17]
R --> S[VeLO matched/beat tuned baselines on unseen tasks 18]
R --> T[VeLO outperforms on speed vs tasks 19]
R --> U[VeLO efficient with batch size, adapts to training 20]
R --> V[VeLO fails on out-of-distribution tasks 21]
A --> W[Other approaches: decoupling objectives, specializing, RL, hyperparameters 22]
A --> X[Increasing compute & data: meta-learning revolution expected 23]
A --> Y[Code available, scaling challenges remain 24]
A --> Z[Learned optimizers discover hand-designed techniques, need generalization tests 25]
A --> AA[Curvature promising for brittle second-order optimizers 26]
A --> AB[Simple optimizers trainable in minutes, efficiency key 27]
A --> AC[AI progressing rapidly, researcher choices shape trajectory 28]
A --> AD[Learned optimizers surpassing hand-designed, will transform training 29]
A --> AE[Many challenges remain: stability, generalization, efficiency, architectures 30]
class A,B,E,F,Y,AB research;
class C,W approaches;
class D,AC impact;
class G,H,Z,AA,AD optimizers;
class I,J,K,L,M,N,O challenges;
class P,Q architecture;
class R,S,T,U,V sota;
class X,AE future;

Resume:

1.-Jascha's research trajectory included a physics undergraduate degree at Cornell, work on Mars rovers at NASA JPL, and graduate school in computational neuroscience at Berkeley.

2.-The talk covers three main themes - learned optimizers are starting to work well, they will transform how models are trained, and they need more foundational research.

3.-Despite uncertainty, AI is transforming the world rapidly and early decisions by individuals can have huge consequences, for better or worse.

4.-Examples are given of small individual choices that had big impacts, like standardized email protocols and freely sharing the HeLa cell line.

5.-Jascha urges intentionality and thoughtfulness in AI research choices, as individuals have immense leverage to shape the future landscape of AI.

6.-Learned optimizers are introduced - rather than hand-designing the parameter update rule, the update rule itself is learned, with an outer (meta-training) loop optimizing how well the inner optimization process trains.
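To make the nesting concrete, here is a minimal toy sketch of the bi-level setup (my own illustrative NumPy example, not the talk's implementation; the two-component update rule and all constants are invented for clarity):

import numpy as np

def inner_loss(w):
    # Toy quadratic task standing in for a real training loss.
    return float(np.sum((w - 3.0) ** 2))

def inner_grad(w):
    return 2.0 * (w - 3.0)

def learned_update(grad, theta):
    # Hypothetical learned update rule: a gradient term and a sign term,
    # both weighted by the outer (meta-learned) parameters theta.
    return -(theta[0] * grad + theta[1] * np.sign(grad))

def meta_loss(theta, inner_steps=20):
    # Outer objective: how well does the inner task train when its
    # parameter updates are produced by the rule defined by theta?
    w = np.zeros(5)
    for _ in range(inner_steps):
        w = w + learned_update(inner_grad(w), theta)
    return inner_loss(w)

# Outer loop: adjust the optimizer's parameters theta to reduce meta_loss
# (naive random search here; the talk uses evolution strategies instead).
rng = np.random.default_rng(0)
theta = np.array([0.01, 0.0])
for _ in range(300):
    candidate = theta + 0.01 * rng.standard_normal(2)
    if meta_loss(candidate) < meta_loss(theta):
        theta = candidate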

7.-Per-parameter neural networks are a simple choice of architecture for learned optimizers, with linear layers reducing to variants of gradient descent.
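As a concrete illustration (a toy example of mine, not code from the talk): if the per-parameter network is a single linear layer with no nonlinearity over the features (gradient, momentum), the learned rule collapses to familiar hand-designed updates.

import numpy as np

def linear_per_parameter_update(grad, momentum, w_g, w_m):
    # One-layer "per-parameter network": each parameter's update is a
    # linear combination of its own gradient and momentum features.
    return -(w_g * grad + w_m * momentum)

grad = np.array([0.5, -1.0])
momentum = np.array([0.1, 0.2])
# With w_m = 0 this is plain SGD with learning rate w_g;
# with w_m > 0 it behaves like SGD with (heavy-ball) momentum.
print(linear_per_parameter_update(grad, momentum, w_g=0.1, w_m=0.0))
print(linear_per_parameter_update(grad, momentum, w_g=0.1, w_m=0.09))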

8.-Key research challenges for learned optimizers are outlined - chaos/instability, compute cost, generalization to new tasks.

9.-Chaos results from the nested dynamical systems: the best-performing optimizer parameters sit near the edge of inner-loop training instability.

10.-The loss landscape for applying the optimizer over many inner steps becomes extremely chaotic, varying at finer-than-pixel scale in visualizations, which makes outer optimization difficult.

11.-Smoothing the outer loss landscape by averaging over random perturbations of optimizer parameters helps tame the chaos.

12.-Evolution strategies allow computing gradients of the smoothed loss. Antithetic sampling drastically reduces variance. Reparameterization gradients have prohibitively high variance.
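A minimal sketch of the antithetic ES estimator (my own NumPy code; sigma and n_pairs are example values, not from the talk):

import numpy as np

def es_grad(loss_fn, theta, sigma=0.01, n_pairs=64, rng=None):
    # Antithetic evolution-strategies estimate of the gradient of the
    # Gaussian-smoothed loss E_eps[loss_fn(theta + eps)], eps ~ N(0, sigma^2 I).
    # Evaluating each perturbation in +/- pairs cancels the zeroth-order
    # term and greatly reduces the variance of the estimate.
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = sigma * rng.standard_normal(theta.shape)
        grad += eps * (loss_fn(theta + eps) - loss_fn(theta - eps))
    return grad / (2 * n_pairs * sigma ** 2)

# Sanity check on a smooth quadratic: the estimate approaches the true gradient 2*theta.
theta = np.array([1.0, -2.0])
print(es_grad(lambda t: np.sum(t ** 2), theta, n_pairs=2000))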

13.-Outer training is very expensive due to the many inner steps per outer step and the high variance of ES gradients. Partial (truncated) inner unrolls help.
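A self-contained toy sketch of partial unrolls (my own example; the one-parameter "learned rule", the sign-based outer step, and all constants are invented to keep the toy stable): the meta-gradient is estimated on short K-step segments while the inner training state is carried forward rather than restarted at every outer step.

import numpy as np

rng = np.random.default_rng(0)

def task_grad(w):
    # Gradient of a toy inner task: a quadratic loss centred at 3.
    return 2.0 * (w - 3.0)

def segment_loss(theta, w0, K):
    # Inner loss after unrolling only K steps of the toy one-parameter
    # learned rule  w <- w - theta[0] * grad  from the current state w0.
    w = w0.copy()
    for _ in range(K):
        w = w - theta[0] * task_grad(w)
    return float(np.sum((w - 3.0) ** 2))

theta = np.array([0.01])   # meta-parameters of the toy update rule
w = np.zeros(4)            # inner state, kept alive across outer steps
K, sigma, meta_lr = 10, 0.01, 0.005

for outer_step in range(200):
    # Antithetic ES meta-gradient estimated on a short K-step segment only.
    eps = sigma * rng.standard_normal(theta.shape)
    g = eps * (segment_loss(theta + eps, w, K) - segment_loss(theta - eps, w, K)) / (2 * sigma ** 2)
    theta = theta - meta_lr * np.sign(g)   # sign step keeps this toy stable
    # Advance the real inner run by the same K steps instead of restarting it.
    for _ in range(K):
        w = w - theta[0] * task_grad(w)
    if outer_step % 50 == 49:              # periodically restart the inner task
        w = np.zeros(4)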

14.-Generalization to new tasks requires large diverse meta-training sets and data augmentation. Generalizing across problem scale is an open challenge.

15.-Useful architecture themes - give the optimizer many input features, trade off overhead against performance, use hierarchical compute, normalize features, and build in stability.
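A small sketch of the "many features, normalized" idea (an illustrative feature set of my own, not VeLO's actual inputs):

import numpy as np

def per_parameter_features(grad, momenta, second_moment, eps=1e-8):
    # Illustrative inputs for a per-parameter learned optimizer: the raw
    # gradient, momenta at several timescales, and an Adam-style
    # normalized gradient. Features are then rescaled to roughly unit
    # magnitude so the learned network sees well-conditioned inputs.
    feats = [grad] + list(momenta) + [grad / (np.sqrt(second_moment) + eps)]
    feats = np.stack(feats, axis=-1)                      # (n_params, n_features)
    rms = np.sqrt(np.mean(feats ** 2, axis=0, keepdims=True)) + eps
    return feats / rms                                    # per-feature RMS normalization

# Example with made-up statistics for a 3-parameter model.
grad = np.array([0.5, -1.0, 0.1])
momenta = [0.9 * grad, 0.99 * grad]     # stand-ins for two momentum timescales
second_moment = grad ** 2
print(per_parameter_features(grad, momenta, second_moment).shape)   # (3, 4)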

16.-The ideal learned optimizer architecture is still to be determined and an open research problem.

17.-VeLO, the current SOTA learned optimizer, was meta-trained with large amounts of compute and aims to work on any neural net training task with no hyperparameter tuning.

18.-On unseen ML benchmark tasks, VeLO matched or beat tuned baselines on 5/6 tasks without any tuning. It struggled on an out-of-distribution GNN task.

19.-VeLO outperforms well-tuned standard optimizers on speed-up versus fraction-of-tasks metrics. Its worst-case behavior is reasonable rather than catastrophic.

20.-VeLO makes more efficient use of large batch sizes than other first-order methods. It adapts to training length and parameter type.

21.-VeLO fails on out-of-distribution settings such as very large models, long training runs, and RL tasks not seen during meta-training.

22.-Many other approaches to learned optimizers are outlined - decoupling meta and inner objectives, specializing to narrow tasks, RL, hyperparameter control, etc.

23.-As compute and task data increase, learned optimizers, losses, and architectures are expected to surpass hand-designed ones, causing a "meta-learning revolution."

24.-Code and examples are available for training simple learned optimizers. Scaling remains a challenge best suited for well-resourced labs currently.

25.-Learned optimizers implicitly discover techniques used in hand-designed optimizers. More rigorous generalization experiments across problem types are needed.

26.-Integrating curvature information is a promising direction for learned optimizers to improve on historically brittle second-order optimizers.
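For background (standard optimization context, not a claim from the talk): a Newton-type second-order step and its common damped fix are

\Delta\theta = -H^{-1}\,\nabla L(\theta), \qquad \Delta\theta = -(H + \lambda I)^{-1}\,\nabla L(\theta),

and the brittleness comes from inverting a noisy or ill-conditioned Hessian H. A learned optimizer could instead consume curvature estimates as additional input features and learn when, and how strongly, to rely on them.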

27.-Simple learned optimizers can be trained in a Colab notebook in minutes. Making the methods more efficient is a key research problem.

28.-AI is progressing rapidly and individual researchers have immense leverage to shape its trajectory for better or worse through their choices.

29.-Learned optimizers are starting to surpass hand-designed ones and will likely transform how models are trained as compute and data increase.

30.-Many open challenges remain in learned optimizers, including stability, generalization, compute efficiency, and ideal architectures, making it a fascinating research area.

Knowledge Vault built by David Vivancos 2024