The End Of Knowledge - Vault 2 - ICLR (2014-2023) - Outstanding Paper Session 2 ICLR 2021

graph LR classDef nitanda fill:#f9d4d4, font-weight:bold, font-size:14px; classDef minervini fill:#d4f9d4, font-weight:bold, font-size:14px; classDef wang fill:#d4d4f9, font-weight:bold, font-size:14px; classDef summary fill:#f9f9d4, font-weight:bold, font-size:14px; A[ICLR 2021
Outstanding Paper Session 2] --> B[Optimal convergence rates for
ASGD on overparameterized NNs. 1] B --> C[Rates depend on target
complexity and hypothesis space. 2] B --> D[Generalization bound faster
than 1/sqrt t. 3] B --> E[Approximate Bayes rule
minimizer over functions. 4] B --> F[Integral operator of NTK
describes function smoothness. 5] B --> G[Main assumptions: Bayes rule
smoothness, eigenvalue decay. 6] B --> H[Rate is minimax optimal
for learning problem. 7] A --> I[Answering complex queries over
incomplete KGs using NLPs. 8] I --> J[Previous SOTA: poor generalization,
no explanations. 9] I --> K[New approach: train on
simple queries only. 10] K --> L[Convert complex query into
optimization problem. 11] I --> M[Better generalization than training
on complex queries directly. 12] I --> N[Provides explanations via intermediate
variable assignments. 13] A --> O[Analyzing failure modes of
DARTS, architecture selection issues. 14] O --> P[DARTS: supernet with continuous
weights, magnitude-based selection. 15] O --> Q[Skip connection domination reasonable
for supernet training. 16] Q --> R[Supernet performs unrolled
estimation of feature maps. 17] Q --> S[Optimal alphas: skip connections
larger than convolutions. 18] O --> T[Supernet accuracy benefits more
from convolutions than skips. 19] O --> U[Perturbation-based selection proposed:
measure accuracy drop. 20] U --> V[Discovers better architectures,
alleviates skip domination. 21] O --> W[Failure due to selection
issues, not supernet training. 22] O --> X[Rethinking alpha's role could
improve diffNAS methods. 23] A --> Y[ASGD optimal rates depend
on complexity, eigenvalue decay. 24] A --> Z[Complex query answering as
optimization over KG embeddings. 25] A --> AA[Skip domination in DARTS
reasonable but problematic. 26] A --> AB[Perturbation selection resolves DARTS
failure modes, improves SOTA. 27] AB --> AC[Uniform alphas competitive with
perturbation-based selection. 28] A --> AD[New analysis, formulations, improvements
for NN training and NAS. 29] class B,C,D,E,F,G,H,Y nitanda; class I,J,K,L,M,N,Z minervini; class O,P,Q,R,S,T,U,V,W,X,AA,AB,AC wang; class A,AD summary;

Resume:

1.-Nitanda discusses optimal convergence rates for averaged stochastic gradient descent (ASGD) on overparameterized two-layer neural networks under the Neural Tangent Kernel regime.

2.-Key assumptions are the complexity of target function and hypothesis space. Optimal rates are derived for the generalization bound on ASGD iterates.

3.-The rate provides a generalization bound that is faster than 1/sqrt(t), where t is the number of examples used to obtain the hypothesis.

4.-Problem is to approximate Bayes rule minimizer over all measurable functions. Single-path ASGD is run, using a fresh sample each iteration.

5.-Integral operator of neural tangent kernel is introduced. Its eigenfunctions/values describe function smoothness. Decay rate of eigenvalues controls the RKHS size.

6.-Main assumptions are the smoothness of Bayes rule and eigenvalue decay of the integral operator. Epsilon gap between NN and RKHS dynamics introduced.

7.-Rate is minimax optimal for the learning problem. Future work includes investigating the misspecified case without overparameterization.

8.-Minervini presents work on answering complex queries over incomplete knowledge graphs using "neural link predictors" - models that predict missing links.

9.-Previous SOTA required training on millions of complex generated queries, with poor generalization and no explanations for predicted answers.

10.-New approach: train link predictor on simple queries only, then convert complex query into an optimization problem - find optimal variable assignments maximizing query likelihood.

11.-Discrete and continuous optimization formulations experimented with, including greedy search and gradient-based embedding optimization to identify top variable-entity assignments.

12.-Despite only training on simple queries, new approach generalizes better to complex queries than models trained on complex queries directly.

13.-Provides explanations via the intermediate variable assignments used to arrive at the final answers. Enables detecting errors and refining link prediction model.

14.-Wang presents work analyzing the failure modes of differentiable neural architecture search (DARTS), specifically issues with its architecture selection phase.

15.-DARTS constructs a supernet containing all search space architectures, uses continuous weights (alphas) to combine operations, and selects final architecture based on alphas.

16.-Despite supernet accuracy improving, selected architecture accuracy often degrades over time, with skip connections dominating over other operations like convolutions.

17.-Wang shows skip connection domination is actually reasonable for supernet training itself, and only problematic with DARTS' magnitude-based architecture selection.

18.-By analyzing feature map estimation in resnets vs non-residual networks, the supernet is shown to perform "unrolled estimation" where edges estimate same optimal map.

19.-Optimal alphas are derived showing skip alphas should be larger than convolution alphas in a well-trained supernet, explaining the skip domination.

20.-However, experiments show supernet accuracy benefits more from convolutions than skips, despite skips having larger alphas. Alpha magnitudes don't represent op strength.

21.-Naive approach of using supernet accuracy to measure op strength is expensive. Perturbation-based selection proposed instead - measure accuracy drop when removing op.

22.-New perturbation-based selection discovers better architectures and alleviates skip domination vs DARTS' magnitude-based selection, with same supernet.

23.-Produces SOTA results on NAS benchmarks. Even fixing alphas to uniform and using perturbation selection is competitive with DARTS.

24.-Suggests failure of DARTS is due to architecture selection issues, not supernet training. Rethinking alpha's role could improve diffNAS methods.

25.-In summary, averaged SGD analyzed under NTK assumptions, showing optimal generalization rates dependent on target function complexity and kernel eigenvalue decay.

26.-Complex query answering reformulated as optimization of query likelihood over KG embeddings, outperforming prior work and providing explanations.

27.-Skip connection domination in DARTS shown to be reasonable outcome of supernet training dynamics, but problematic for magnitude-based architecture selection.

28.-Perturbation-based architecture selection resolves failure modes of DARTS and improves SOTA diffNAS methods with minimal changes to supernet training.

29.-Surprisingly, even training supernet with uniform alphas is competitive when paired with perturbation-based selection, suggesting issues lie in architecture selection stage.

30.-Collectively, the works provide new theoretical analysis, problem formulations, and methodological improvements for key neural network training and architecture search challenges.

Knowledge Vault built byDavid Vivancos 2024