Moderators: Amir Globerson · Piotr Koniusz ICLR 2021 - Outstanding Paper Session 2

**Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:**

graph LR
classDef nitanda fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef minervini fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef wang fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef summary fill:#f9f9d4, font-weight:bold, font-size:14px;
A[ICLR 2021

Outstanding Paper Session 2] --> B[Optimal convergence rates for

ASGD on overparameterized NNs. 1] B --> C[Rates depend on target

complexity and hypothesis space. 2] B --> D[Generalization bound faster

than 1/sqrt t. 3] B --> E[Approximate Bayes rule

minimizer over functions. 4] B --> F[Integral operator of NTK

describes function smoothness. 5] B --> G[Main assumptions: Bayes rule

smoothness, eigenvalue decay. 6] B --> H[Rate is minimax optimal

for learning problem. 7] A --> I[Answering complex queries over

incomplete KGs using NLPs. 8] I --> J[Previous SOTA: poor generalization,

no explanations. 9] I --> K[New approach: train on

simple queries only. 10] K --> L[Convert complex query into

optimization problem. 11] I --> M[Better generalization than training

on complex queries directly. 12] I --> N[Provides explanations via intermediate

variable assignments. 13] A --> O[Analyzing failure modes of

DARTS, architecture selection issues. 14] O --> P[DARTS: supernet with continuous

weights, magnitude-based selection. 15] O --> Q[Skip connection domination reasonable

for supernet training. 16] Q --> R[Supernet performs unrolled

estimation of feature maps. 17] Q --> S[Optimal alphas: skip connections

larger than convolutions. 18] O --> T[Supernet accuracy benefits more

from convolutions than skips. 19] O --> U[Perturbation-based selection proposed:

measure accuracy drop. 20] U --> V[Discovers better architectures,

alleviates skip domination. 21] O --> W[Failure due to selection

issues, not supernet training. 22] O --> X[Rethinking alpha's role could

improve diffNAS methods. 23] A --> Y[ASGD optimal rates depend

on complexity, eigenvalue decay. 24] A --> Z[Complex query answering as

optimization over KG embeddings. 25] A --> AA[Skip domination in DARTS

reasonable but problematic. 26] A --> AB[Perturbation selection resolves DARTS

failure modes, improves SOTA. 27] AB --> AC[Uniform alphas competitive with

perturbation-based selection. 28] A --> AD[New analysis, formulations, improvements

for NN training and NAS. 29] class B,C,D,E,F,G,H,Y nitanda; class I,J,K,L,M,N,Z minervini; class O,P,Q,R,S,T,U,V,W,X,AA,AB,AC wang; class A,AD summary;

Outstanding Paper Session 2] --> B[Optimal convergence rates for

ASGD on overparameterized NNs. 1] B --> C[Rates depend on target

complexity and hypothesis space. 2] B --> D[Generalization bound faster

than 1/sqrt t. 3] B --> E[Approximate Bayes rule

minimizer over functions. 4] B --> F[Integral operator of NTK

describes function smoothness. 5] B --> G[Main assumptions: Bayes rule

smoothness, eigenvalue decay. 6] B --> H[Rate is minimax optimal

for learning problem. 7] A --> I[Answering complex queries over

incomplete KGs using NLPs. 8] I --> J[Previous SOTA: poor generalization,

no explanations. 9] I --> K[New approach: train on

simple queries only. 10] K --> L[Convert complex query into

optimization problem. 11] I --> M[Better generalization than training

on complex queries directly. 12] I --> N[Provides explanations via intermediate

variable assignments. 13] A --> O[Analyzing failure modes of

DARTS, architecture selection issues. 14] O --> P[DARTS: supernet with continuous

weights, magnitude-based selection. 15] O --> Q[Skip connection domination reasonable

for supernet training. 16] Q --> R[Supernet performs unrolled

estimation of feature maps. 17] Q --> S[Optimal alphas: skip connections

larger than convolutions. 18] O --> T[Supernet accuracy benefits more

from convolutions than skips. 19] O --> U[Perturbation-based selection proposed:

measure accuracy drop. 20] U --> V[Discovers better architectures,

alleviates skip domination. 21] O --> W[Failure due to selection

issues, not supernet training. 22] O --> X[Rethinking alpha's role could

improve diffNAS methods. 23] A --> Y[ASGD optimal rates depend

on complexity, eigenvalue decay. 24] A --> Z[Complex query answering as

optimization over KG embeddings. 25] A --> AA[Skip domination in DARTS

reasonable but problematic. 26] A --> AB[Perturbation selection resolves DARTS

failure modes, improves SOTA. 27] AB --> AC[Uniform alphas competitive with

perturbation-based selection. 28] A --> AD[New analysis, formulations, improvements

for NN training and NAS. 29] class B,C,D,E,F,G,H,Y nitanda; class I,J,K,L,M,N,Z minervini; class O,P,Q,R,S,T,U,V,W,X,AA,AB,AC wang; class A,AD summary;

**Resume: **

**1.-**Nitanda discusses optimal convergence rates for averaged stochastic gradient descent (ASGD) on overparameterized two-layer neural networks under the Neural Tangent Kernel regime.

**2.-**Key assumptions are the complexity of target function and hypothesis space. Optimal rates are derived for the generalization bound on ASGD iterates.

**3.-**The rate provides a generalization bound that is faster than 1/sqrt(t), where t is the number of examples used to obtain the hypothesis.

**4.-**Problem is to approximate Bayes rule minimizer over all measurable functions. Single-path ASGD is run, using a fresh sample each iteration.

**5.-**Integral operator of neural tangent kernel is introduced. Its eigenfunctions/values describe function smoothness. Decay rate of eigenvalues controls the RKHS size.

**6.-**Main assumptions are the smoothness of Bayes rule and eigenvalue decay of the integral operator. Epsilon gap between NN and RKHS dynamics introduced.

**7.-**Rate is minimax optimal for the learning problem. Future work includes investigating the misspecified case without overparameterization.

**8.-**Minervini presents work on answering complex queries over incomplete knowledge graphs using "neural link predictors" - models that predict missing links.

**9.-**Previous SOTA required training on millions of complex generated queries, with poor generalization and no explanations for predicted answers.

**10.-**New approach: train link predictor on simple queries only, then convert complex query into an optimization problem - find optimal variable assignments maximizing query likelihood.

**11.-**Discrete and continuous optimization formulations experimented with, including greedy search and gradient-based embedding optimization to identify top variable-entity assignments.

**12.-**Despite only training on simple queries, new approach generalizes better to complex queries than models trained on complex queries directly.

**13.-**Provides explanations via the intermediate variable assignments used to arrive at the final answers. Enables detecting errors and refining link prediction model.

**14.-**Wang presents work analyzing the failure modes of differentiable neural architecture search (DARTS), specifically issues with its architecture selection phase.

**15.-**DARTS constructs a supernet containing all search space architectures, uses continuous weights (alphas) to combine operations, and selects final architecture based on alphas.

**16.-**Despite supernet accuracy improving, selected architecture accuracy often degrades over time, with skip connections dominating over other operations like convolutions.

**17.-**Wang shows skip connection domination is actually reasonable for supernet training itself, and only problematic with DARTS' magnitude-based architecture selection.

**18.-**By analyzing feature map estimation in resnets vs non-residual networks, the supernet is shown to perform "unrolled estimation" where edges estimate same optimal map.

**19.-**Optimal alphas are derived showing skip alphas should be larger than convolution alphas in a well-trained supernet, explaining the skip domination.

**20.-**However, experiments show supernet accuracy benefits more from convolutions than skips, despite skips having larger alphas. Alpha magnitudes don't represent op strength.

**21.-**Naive approach of using supernet accuracy to measure op strength is expensive. Perturbation-based selection proposed instead - measure accuracy drop when removing op.

**22.-**New perturbation-based selection discovers better architectures and alleviates skip domination vs DARTS' magnitude-based selection, with same supernet.

**23.-**Produces SOTA results on NAS benchmarks. Even fixing alphas to uniform and using perturbation selection is competitive with DARTS.

**24.-**Suggests failure of DARTS is due to architecture selection issues, not supernet training. Rethinking alpha's role could improve diffNAS methods.

**25.-**In summary, averaged SGD analyzed under NTK assumptions, showing optimal generalization rates dependent on target function complexity and kernel eigenvalue decay.

**26.-**Complex query answering reformulated as optimization of query likelihood over KG embeddings, outperforming prior work and providing explanations.

**27.-**Skip connection domination in DARTS shown to be reasonable outcome of supernet training dynamics, but problematic for magnitude-based architecture selection.

**28.-**Perturbation-based architecture selection resolves failure modes of DARTS and improves SOTA diffNAS methods with minimal changes to supernet training.

**29.-**Surprisingly, even training supernet with uniform alphas is competitive when paired with perturbation-based selection, suggesting issues lie in architecture selection stage.

**30.-**Collectively, the works provide new theoretical analysis, problem formulations, and methodological improvements for key neural network training and architecture search challenges.

Knowledge Vault built byDavid Vivancos 2024