Concept Graph & Resume using Claude 3.5 Sonnet | Chat GPT4o | Llama 3:
Resume:
1.- Generalization on the Unseen (GOTU): A strong case of out-of-distribution generalization in which part of the input domain is entirely unseen during training, yet the model is evaluated there (sketch after the list).
2.- Boolean functions: Functions mapping binary inputs to real outputs, representing discrete and combinatorial tasks like arithmetic or logic.
3.- Min-degree interpolator: An interpolating function with the minimal degree-profile, favoring lower-degree monomials in its Fourier-Walsh expansion.
4.- Degree-profile: A vector representing the energy distribution across different degrees in a function's Fourier-Walsh expansion.
5.- Random features model: A two-layer approximation in which a frozen random projection is followed by a nonlinear activation, and only the linear readout is trained (sketch after the list).
6.- Diagonal linear neural network: A deep neural network with only diagonal weight matrices and a single bias term.
7.- Transformer: A neural network architecture using self-attention mechanisms, commonly used in natural language processing and computer vision.
8.- Mean-field neural network: A two-layer neural network in the mean-field parametrization, analyzed in the infinite-width limit.
9.- Fourier-Walsh transform: The decomposition of a Boolean function into a weighted sum of monomials (products of input variables); the squared coefficients, grouped by degree, give the degree-profile (sketch after the list).
10.- Implicit bias: The tendency of learning algorithms to favor certain solutions over others, even without explicit regularization.
11.- Length generalization: The ability of models to generalize to input lengths beyond what was seen during training.
12.- Curriculum learning: A training strategy that gradually increases the complexity of training samples.
13.- Degree-Curriculum algorithm: A curriculum learning approach that incrementally increases the Hamming weight of the training samples (sketch after the list).
14.- Leaky min-degree bias: When models learn solutions that mostly follow the min-degree bias but retain some higher-degree terms.
15.- Vanishing ideals: For a given set of points, the set of all polynomials that vanish on those points; used to characterize the unseen part of the domain.
16.- Strongly expressive activation: A property of activation functions that allows for effective representation of low-degree monomials.
17.- Boolean influence: A measure of how much a variable matters to a Boolean function; for a Boolean-valued function it equals the probability that flipping that variable changes the output (sketch after the list).
18.- Spectral bias: The tendency of neural networks to learn lower-frequency components faster in continuous settings.
19.- Parity function: A Boolean function that outputs the product of its ±1-valued input bits (equivalently, the XOR of the bits in 0/1 encoding).
20.- Majority function: A Boolean function that outputs 1 if more than half of its inputs are 1, and 0 otherwise.
21.- Hamming weight: The number of non-zero elements in a binary vector.
22.- Stochastic gradient descent (SGD): An optimization algorithm that updates parameters using estimated gradients from random subsets of data.
23.- Adam optimizer: An adaptive learning rate optimization algorithm commonly used in deep learning.
24.- Interpolating solution: A function that exactly matches the training data.
25.- Out-of-distribution generalization: The ability of models to perform well on data from a different distribution than the training data.
26.- Invariance: When a function's output remains unchanged under certain transformations of its input.
27.- Equivariance: When a function's output transforms predictably under certain transformations of its input.
28.- Sparse Boolean functions: Boolean functions that depend on only a small subset of their input variables.
29.- Neural tangent kernel (NTK): A kernel that describes the training dynamics of very wide neural networks in the lazy (linearized) regime.
30.- Polynomial activation functions: Activation functions in neural networks that are polynomial functions of their input.
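The sketches below illustrate a few of the concepts above. They are minimal Python examples written for this summary (using only itertools and NumPy), not code from the underlying paper, and all function and variable names are chosen here for convenience.

Sketch for concepts 1 and 3 (GOTU and the min-degree interpolator): one coordinate is frozen to +1 at training time, so a degree-2 and a degree-3 interpolator agree on the seen domain, but only the min-degree one matches the target on the unseen half of the cube.

import itertools

n = 3
cube = list(itertools.product([-1, 1], repeat=n))
seen = [x for x in cube if x[2] == +1]     # training (seen) domain: x3 frozen to +1
unseen = [x for x in cube if x[2] == -1]   # evaluation (unseen) domain

target = lambda x: x[0] * x[1]             # ground truth: x1 * x2
min_deg = lambda x: x[0] * x[1]            # degree-2 interpolator (min-degree)
high_deg = lambda x: x[0] * x[1] * x[2]    # degree-3 interpolator (uses the frozen bit)

# Both interpolate the seen data exactly ...
assert all(min_deg(x) == target(x) == high_deg(x) for x in seen)
# ... but only the min-degree interpolator matches the target on the unseen domain.
print([min_deg(x) == target(x) for x in unseen])   # [True, True, True, True]
print([high_deg(x) == target(x) for x in unseen])  # [False, False, False, False]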
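Sketch for concept 5 (random features model): a frozen random projection followed by a ReLU, with only the linear readout fitted; least squares is used here instead of SGD for brevity.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, width = 6, 512
X = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)  # all 2^6 inputs
y = X[:, 0] * X[:, 1]                            # target: the degree-2 parity x1 * x2

W = rng.normal(size=(n, width)) / np.sqrt(n)     # random, frozen first layer
b = rng.normal(size=width)
features = np.maximum(X @ W + b, 0.0)            # ReLU random features

a, *_ = np.linalg.lstsq(features, y, rcond=None) # train only the linear readout
print(np.mean((features @ a - y) ** 2))          # training error close to zero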
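Sketch for concepts 4, 9, 19 and 20 (degree-profile, Fourier-Walsh transform, parity, majority): the coefficients are computed by brute-force averaging over the cube, and the degree-profile collects the squared coefficients by degree.

import itertools
import numpy as np

def fourier_walsh(f, n):
    """Return {S: f_hat(S)} with f_hat(S) = E_x[f(x) * prod_{i in S} x_i]."""
    cube = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            chi = [np.prod([x[i] for i in S]) for x in cube]   # monomial chi_S(x)
            coeffs[S] = np.mean([f(x) * c for x, c in zip(cube, chi)])
    return coeffs

def degree_profile(coeffs, n):
    """Energy (sum of squared coefficients) at each degree 0..n."""
    profile = np.zeros(n + 1)
    for S, c in coeffs.items():
        profile[len(S)] += c ** 2
    return profile

n = 3
parity = lambda x: np.prod(x)          # the degree-n monomial x1 * ... * xn
majority = lambda x: np.sign(sum(x))   # +1 iff more than half the bits are +1 (n odd)
print(degree_profile(fourier_walsh(parity, n), n))    # all energy at degree 3
print(degree_profile(fourier_walsh(majority, n), n))  # energy mostly at degree 1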
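Sketch in the spirit of concepts 13 and 21 (Degree-Curriculum and Hamming weight): training inputs are presented in order of increasing Hamming weight, here taken as the number of -1 coordinates of a ±1 vector (the exact convention is an assumption of this sketch).

import itertools

def hamming_weight(x):
    """Number of -1 coordinates of a +/-1 vector (the convention assumed here)."""
    return sum(1 for v in x if v == -1)

n = 4
cube = list(itertools.product([-1, 1], repeat=n))
curriculum = sorted(cube, key=hamming_weight)   # low-weight samples first

for x in curriculum[:5]:
    print(hamming_weight(x), x)
# A training loop would sweep over `curriculum` in stages, adding higher-weight
# samples only once the model fits the lower-weight ones.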
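Sketch for concept 17 (Boolean influence): for a Boolean-valued function, the influence of a coordinate equals the probability that flipping it changes the output, computed here by exact enumeration.

import itertools
import numpy as np

def influence(f, n, i):
    """Probability over a uniform x in {-1,+1}^n that flipping coordinate i changes f(x)."""
    cube = list(itertools.product([-1, 1], repeat=n))
    flips = 0
    for x in cube:
        y = list(x)
        y[i] = -y[i]                   # flip coordinate i
        if f(x) != f(tuple(y)):
            flips += 1
    return flips / len(cube)

majority = lambda x: np.sign(sum(x))
print([influence(majority, 3, i) for i in range(3)])   # 0.5 for each coordinate of Maj3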
Knowledge Vault built by David Vivancos 2024