Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Nitanda discusses optimal convergence rates for averaged stochastic gradient descent (ASGD) on overparameterized two-layer neural networks under the Neural Tangent Kernel (NTK) regime.
2.-Key assumptions concern the complexity of the target function and of the hypothesis space. Optimal rates are derived for the generalization bound on the ASGD iterates.
3.-The resulting generalization bound decays faster than 1/sqrt(t), where t is the number of examples used to obtain the hypothesis.
4.-The problem is to approximate the Bayes rule, i.e., the minimizer of the expected risk over all measurable functions. Single-pass ASGD is run, using a fresh sample at each iteration.
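To make the single-pass setup concrete, here is a minimal sketch of averaged SGD on a toy two-layer ReLU network with a fixed output layer (a common simplification in NTK-style analyses); the model, step size, and data generator are illustrative placeholders rather than the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network f(x) = a^T relu(W x) / sqrt(m); the output layer a is
# kept fixed (a common NTK-style simplification) and only W is trained.
m, d = 512, 10
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m)

def forward(W, x):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def grad_W(W, x, y):
    # Gradient of the squared loss 0.5 * (f(x) - y)^2 with respect to W.
    pre = W @ x
    err = forward(W, x) - y
    return err * ((a * (pre > 0)) / np.sqrt(m))[:, None] * x[None, :]

def sample():
    # Fresh i.i.d. example per call, standing in for the data distribution.
    x = rng.normal(size=d)
    return x, np.sin(x[0]) + 0.1 * rng.normal()

T, eta = 5000, 0.5
W_avg = W.copy()
for t in range(1, T + 1):
    x, y = sample()
    W = W - eta * grad_W(W, x, y)      # one SGD step on a fresh sample (single pass)
    W_avg += (W - W_avg) / (t + 1)     # running average of the iterates (Polyak averaging)

# W_avg parameterizes the averaged hypothesis whose excess risk the analysis bounds.
```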
5.-The integral operator of the neural tangent kernel is introduced. Its eigenfunctions and eigenvalues describe function smoothness, and the decay rate of the eigenvalues controls the size of the RKHS.
6.-The main assumptions are the smoothness of the Bayes rule and the eigenvalue decay of the integral operator. An epsilon gap between the neural network dynamics and the RKHS dynamics is introduced.
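The objects behind these assumptions can be written schematically as follows; the exponents r and beta are generic placeholders for the smoothness and decay parameters, and the paper's exact conditions, notation, and constants may differ.

```latex
% Integral operator of the NTK k_\infty over the input distribution \rho:
(\Sigma f)(x) = \int k_\infty(x, x')\, f(x')\, d\rho(x')

% Smoothness (source condition) and capacity (eigenvalue decay) assumptions, schematically:
f^{*} = \Sigma^{r} g, \quad g \in L^{2}(\rho), \quad r > 0
\qquad\text{and}\qquad
\mu_{i} \asymp i^{-\beta}, \quad \beta > 1

% Standard nonparametric rate such conditions typically yield for the averaged iterate \bar{f}_t:
\mathbb{E}\big[ R(\bar{f}_{t}) - R(f^{*}) \big] = O\!\left( t^{-\frac{2 r \beta}{2 r \beta + 1}} \right),
\quad\text{faster than } t^{-1/2} \text{ whenever } 2 r \beta > 1.
```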
7.-The rate is minimax optimal for the learning problem. Future work includes investigating the misspecified case without overparameterization.
8.-Minervini presents work on answering complex queries over incomplete knowledge graphs using "neural link predictors" - models that predict missing links.
9.-Previous SOTA required training on millions of complex generated queries, with poor generalization and no explanations for predicted answers.
10.-New approach: train the link predictor on simple queries only, then convert each complex query into an optimization problem: find the variable assignments that maximize the query likelihood.
11.-Both discrete and continuous optimization formulations are explored, including greedy search over entities and gradient-based optimization of variable embeddings, to identify the top variable-to-entity assignments.
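A minimal sketch of the discrete (greedy) variant of this idea, assuming a hypothetical pre-trained scorer score(head, relation, tail) in [0, 1] and a product t-norm for conjunction; the entity set, query shape, and function names are illustrative and not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)
ENTITIES = list(range(100))

def score(h, r, t):
    # Stand-in for a pre-trained neural link predictor returning a score in [0, 1].
    return rng.random()

def t_norm(a, b):
    # Product t-norm for conjunction; min(a, b) (Goedel t-norm) is another common choice.
    return a * b

def answer_2hop(anchor, r1, r2, k=10):
    """Answer ?Y : exists X . r1(anchor, X) AND r2(X, Y) by greedy top-k search."""
    # Step 1: keep the k best assignments of the intermediate variable X.
    x_top = sorted(((x, score(anchor, r1, x)) for x in ENTITIES),
                   key=lambda p: p[1], reverse=True)[:k]

    # Step 2: extend each candidate X to every Y and combine atom scores with the t-norm.
    best = {}
    for x, s1 in x_top:
        for y in ENTITIES:
            s = t_norm(s1, score(x, r2, y))
            if s > best.get(y, (None, -1.0))[1]:
                best[y] = (x, s)  # remember which X explains this answer

    # Ranked answers; the stored X per Y is the kind of explanation mentioned in point 13.
    return sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)

print(answer_2hop(anchor=0, r1="r1", r2="r2", k=5)[:3])
```

The continuous variant mentioned in point 11 would instead optimize the embedding of the intermediate variable directly by gradient ascent on the same combined score, then read off nearby entities.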
12.-Despite only training on simple queries, new approach generalizes better to complex queries than models trained on complex queries directly.
13.-Provides explanations via the intermediate variable assignments used to arrive at the final answers. Enables detecting errors and refining the link prediction model.
14.-Wang presents work analyzing the failure modes of differentiable architecture search (DARTS), specifically issues with its architecture selection phase.
15.-DARTS constructs a supernet containing all architectures in the search space, uses continuous weights (alphas) to combine candidate operations, and selects the final architecture based on the alphas.
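A minimal PyTorch sketch of this continuous relaxation; the three candidate operations shown are an abbreviated, illustrative subset of the DARTS search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One supernet edge: a softmax-weighted sum of all candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                           # skip connection
            nn.Conv2d(channels, channels, 3, padding=1, bias=False), # 3x3 convolution
            nn.AvgPool2d(3, stride=1, padding=1),                    # average pooling
        ])
        # Architecture parameters (alphas), one per candidate operation on this edge.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Magnitude-based selection, the step the talk identifies as the failure mode:
# keep the operation with the largest alpha on each edge.
edge = MixedOp(channels=16)
chosen = edge.ops[int(edge.alpha.argmax())]
```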
16.-Despite the supernet's accuracy improving, the selected architecture's accuracy often degrades as search proceeds, with skip connections coming to dominate other operations such as convolutions.
17.-Wang shows that skip-connection domination is actually reasonable for supernet training itself, and only becomes problematic with DARTS' magnitude-based architecture selection.
18.-By analyzing feature-map estimation in ResNets vs. non-residual networks, the supernet is shown to perform "unrolled estimation", where multiple edges estimate the same optimal feature map.
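A schematic view of the unrolled-estimation argument, not the paper's exact derivation: in a residual architecture each block adds a small correction to a shared estimate, and supernet edges behave analogously.

```latex
x_{l+1} = x_{l} + F_{l}(x_{l})
\quad\Longrightarrow\quad
x_{1} \approx x_{2} \approx \cdots \approx x^{*}
```

Successive blocks, and analogously parallel supernet edges, refine the same optimal feature map x*, which is the sense in which large skip weights are natural for a well-trained supernet.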
19.-Optimal alphas are derived, showing that skip alphas should be larger than convolution alphas in a well-trained supernet, which explains the skip domination.
20.-However, experiments show that supernet accuracy benefits more from convolutions than from skips, despite the skips having larger alphas, so alpha magnitudes do not represent operation strength.
21.-The naive approach of using supernet accuracy to measure operation strength is expensive. Perturbation-based selection is proposed instead: measure the accuracy drop when an operation is removed from an edge.
22.-The new perturbation-based selection discovers better architectures and alleviates skip domination compared with DARTS' magnitude-based selection, using the same supernet.
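A minimal sketch of this selection rule, assuming hypothetical helpers mask_op, unmask_op, and validation_accuracy; the paper's full procedure additionally re-tunes the supernet between edge decisions, which is omitted here.

```python
import random

def select_by_perturbation(supernet, edges, candidate_ops,
                           mask_op, unmask_op, validation_accuracy):
    """For each edge, keep the operation whose removal hurts validation accuracy the most."""
    selection = {}
    for edge in edges:
        baseline = validation_accuracy(supernet)
        drops = {}
        for op in candidate_ops:
            mask_op(supernet, edge, op)                   # temporarily remove this op
            drops[op] = baseline - validation_accuracy(supernet)
            unmask_op(supernet, edge, op)                 # restore the edge
        selection[edge] = max(drops, key=drops.get)       # biggest drop = strongest op
    return selection

# Toy demo with stand-in callbacks (a real run would evaluate the trained supernet).
random.seed(0)
sel = select_by_perturbation(
    supernet=None,
    edges=["edge_0", "edge_1"],
    candidate_ops=["skip", "conv_3x3", "avg_pool"],
    mask_op=lambda net, e, o: None,
    unmask_op=lambda net, e, o: None,
    validation_accuracy=lambda net: random.random(),
)
print(sel)
```

Because this only requires forward evaluations of the already-trained supernet, it is far cheaper than retraining candidates, yet it measures each operation's contribution to accuracy directly instead of relying on alpha magnitudes.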
23.-It produces SOTA results on NAS benchmarks. Even fixing the alphas to uniform values and using perturbation-based selection is competitive with DARTS.
24.-This suggests the failure of DARTS is due to architecture-selection issues, not supernet training; rethinking the role of alpha could improve diffNAS methods.
25.-In summary, averaged SGD is analyzed under NTK assumptions, showing optimal generalization rates that depend on target-function complexity and kernel eigenvalue decay.
26.-Complex query answering is reformulated as optimization of the query likelihood over KG embeddings, outperforming prior work and providing explanations.
27.-Skip-connection domination in DARTS is shown to be a reasonable outcome of supernet training dynamics, but problematic for magnitude-based architecture selection.
28.-Perturbation-based architecture selection resolves the failure modes of DARTS and improves SOTA diffNAS methods with minimal changes to supernet training.
29.-Surprisingly, even training the supernet with uniform alphas is competitive when paired with perturbation-based selection, suggesting the issues lie in the architecture-selection stage.
30.-Collectively, the works provide new theoretical analysis, problem formulations, and methodological improvements for key neural network training and architecture search challenges.
Knowledge Vault built by David Vivancos 2024