Generalization on the Unseen, Logic Reasoning and Degree Curriculum

Emmanuel Abbe · Samy Bengio · Aryo Lotfi · Kevin Rizk

**Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
classDef generalization fill:#f9d4d4, font-weight:bold, font-size:14px
classDef boolean fill:#d4f9d4, font-weight:bold, font-size:14px
classDef neural fill:#d4d4f9, font-weight:bold, font-size:14px
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px
A[Generalization on the Unseen, Logic Reasoning and Degree Curriculum] --> B[Generalization]
A --> C[Boolean Functions]
A --> D[Neural Networks]
A --> E[Learning and Optimization]
B --> B1[Out-of-distribution generalization. 1]
B --> B2[Generalize beyond training lengths. 11]
B --> B3[Function matching training data. 24]
B --> B4[Generalizing to different distributions. 25]
B --> B5[Invariant under input transformations. 26]
B --> B6[Output transforms predictably with input. 27]
C --> C1[Binary inputs to real outputs. 2]
C --> C2[Boolean product function. 19]
C --> C3[Boolean majority function. 20]
C --> C4[Binary vector non-zero elements. 21]
C --> C5[Importance of Boolean variables. 17]
C --> C6[Small input variable subset dependence. 28]
D --> D1[Minimal degree-profile interpolation. 3]
D --> D2[Energy distribution in Fourier-Walsh. 4]
D --> D3[Random projections and activation. 5]
D --> D4[Diagonal neural network. 6]
D --> D5[Self-attention neural network. 7]
D --> D6[Mean-field two-layer network. 8]
E --> E1[Learning algorithms solution bias. 10]
E --> E2[Gradually complex training samples. 12]
E --> E3[Increasing Hamming weight samples. 13]
E --> E4[Min-degree bias with higher terms. 14]
E --> E5[Random gradient subset optimization. 22]
E --> E6[Adaptive learning rate optimizer. 23]
class A,B,B1,B2,B3,B4,B5,B6 generalization
class C,C1,C2,C3,C4,C5,C6 boolean
class D,D1,D2,D3,D4,D5,D6 neural
class E,E1,E2,E3,E4,E5,E6 learning
```


**Resume:**

**1.-** Generalization on the Unseen (GOTU): A strong case of out-of-distribution generalization where part of the distribution domain is unseen during training.
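
To make the setting concrete, here is a minimal sketch (my own example, not from the paper): the target is f(x) = x1·x2 on {-1,+1}², the half-space x1 = -1 is never seen in training, and a minimum-norm interpolator over the monomial basis stands in for a trained network's implicit bias.

```python
# A minimal GOTU sketch (assumed example: target f(x) = x1*x2 on {-1,+1}^2,
# with the half-space x1 = -1 unseen during training).
import numpy as np

def features(X):
    # Fourier-Walsh basis on {-1,+1}^2: [1, x1, x2, x1*x2]
    return np.stack([np.ones(len(X)), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]], axis=1)

# Training data: x1 is frozen to +1 (the GOTU constraint); only x2 varies.
X_train = np.array([[1.0, -1.0], [1.0, 1.0]])
y_train = X_train[:, 0] * X_train[:, 1]              # target f(x) = x1*x2

# Minimum-norm interpolator over the basis, a stand-in for implicit bias.
coef, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)
print(coef.round(2))                                 # [0. 0. 0.5 0.5]

# On the unseen half x1 = -1 this interpolator outputs 0, while the true
# values are +/-1: fitting the seen data pins down nothing out there.
X_test = np.array([[-1.0, -1.0], [-1.0, 1.0]])
print(features(X_test) @ coef, X_test[:, 0] * X_test[:, 1])
```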

**2.-** Boolean functions: Functions mapping binary inputs to real outputs, representing discrete and combinatorial tasks like arithmetic or logic.

**3.-** Min-degree interpolator: An interpolating function with the minimal degree-profile, favoring lower-degree monomials in its Fourier-Walsh expansion.

**4.-** Degree-profile: A vector representing the energy distribution across different degrees in a function's Fourier-Walsh expansion.
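
As a sketch (helper names are my own), the function below aggregates Fourier-Walsh energy per degree by brute force over all 2^n inputs, a simplified version of the paper's degree-profile:

```python
# Energy per degree of a Boolean function f: {-1,+1}^n -> R (brute force
# over all subsets S, so keep n small).
from itertools import combinations, product
import numpy as np

def degree_profile(f, n):
    X = np.array(list(product([-1, 1], repeat=n)), dtype=float)
    y = np.array([f(x) for x in X])
    profile = np.zeros(n + 1)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            chi = np.prod(X[:, list(S)], axis=1)   # chi_S(x) = prod_{i in S} x_i
            profile[k] += (y * chi).mean() ** 2    # energy of \hat{f}(S) at degree k
    return profile

maj3 = lambda x: np.sign(x.sum())
print(degree_profile(maj3, 3))   # [0.  0.75  0.  0.25]: mass at degrees 1 and 3
```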

**5.-** Random features model: A neural network approximation using random projections followed by a nonlinear activation function.
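
A minimal sketch of the idea, assuming Gaussian projections with a ReLU nonlinearity and a least-squares linear readout (only `a` is learned; `W` and `b` stay frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
n, width = 10, 512
W = rng.normal(size=(width, n)) / np.sqrt(n)      # frozen random projections
b = rng.normal(size=width)                        # frozen random biases

def phi(X):
    return np.maximum(X @ W.T + b, 0.0)           # random features ReLU(Wx + b)

X = rng.choice([-1.0, 1.0], size=(200, n))
y = X[:, 0] * X[:, 1]                             # target: a degree-2 monomial
a, *_ = np.linalg.lstsq(phi(X), y, rcond=None)    # train only the readout
print("train MSE:", np.mean((phi(X) @ a - y) ** 2))
```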

**6.-** Diagonal linear neural network: A deep neural network with only diagonal weight matrices and a single bias term.
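
A toy sketch, assuming L elementwise (diagonal) layers so the network computes f(x) = (w_L ⊙ ⋯ ⊙ w_1)·x + b; the depth changes the training dynamics rather than the function class:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 8, 3
layers = [rng.normal(size=n) for _ in range(L)]   # one diagonal per layer
bias = 0.0

def forward(x):
    w_eff = np.prod(np.stack(layers), axis=0)     # diagonals collapse elementwise
    return w_eff @ x + bias

print(forward(rng.choice([-1.0, 1.0], size=n)))
```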

**7.-** Transformer: A neural network architecture using self-attention mechanisms, commonly used in natural language processing and computer vision.

**8.-** Mean-field neural network: A two-layer neural network in the mean-field parametrization, analyzed in the infinite-width limit.

**9.-** Fourier-Walsh transform: A decomposition of Boolean functions into a linear combination of monomials (products of input variables).
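
One way to compute the transform exhaustively is through the Sylvester Hadamard matrix, since H[i, k] = (-1)^popcount(i & k) equals χ_S(x) under the bitmask indexing used below (the indexing convention is my own):

```python
import numpy as np
from scipy.linalg import hadamard

n = 3
H = hadamard(2 ** n)                              # H[i, k] = (-1)^popcount(i & k)
x = lambda i: np.array([(-1) ** ((i >> j) & 1) for j in range(n)])
y = np.array([np.prod(x(i)) for i in range(2 ** n)])  # f = full parity x1*x2*x3
coeffs = H @ y / 2 ** n                           # \hat{f}(S_k) for bitmask k
print(coeffs)                                     # all mass on k = 0b111
```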

**10.-** Implicit bias: The tendency of learning algorithms to favor certain solutions over others, even without explicit regularization.

**11.-** Length generalization: The ability of models to generalize to input lengths beyond what was seen during training.

**12.-** Curriculum learning: A training strategy that gradually increases the complexity of training samples.

**13.-** Degree-Curriculum algorithm: A curriculum learning approach that incrementally increases the Hamming weight of training samples.
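
A sketch of the loop (my paraphrase of the idea; `train_step` and the weight caps are hypothetical placeholders):

```python
import numpy as np

def degree_curriculum(X, y, train_step, caps, steps_per_stage, rng):
    """Unlock training samples in stages of increasing Hamming weight."""
    hw = np.count_nonzero(X, axis=1)           # Hamming weight per sample
    for cap in caps:                           # e.g. caps = [2, 4, ..., n]
        idx = np.flatnonzero(hw <= cap)        # samples allowed at this stage
        for _ in range(steps_per_stage):
            batch = rng.choice(idx, size=min(32, len(idx)))
            train_step(X[batch], y[batch])     # one optimizer step per batch
```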

**14.-** Leaky min-degree bias: When models learn solutions that mostly follow the min-degree bias but retain some higher-degree terms.

**15.-** Vanishing ideals: The set of all polynomials that vanish on a given set of points, used to characterize unseen domains.

**16.-** Strongly expressive activation: A property of activation functions that allows for effective representation of low-degree monomials.

**17.-** Boolean influence: A measure of the importance of a variable in a Boolean function.
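
For ±1-valued f, the influence of coordinate i is Inf_i(f) = Pr_x[f(x) ≠ f(x with bit i flipped)]; a Monte Carlo sketch (helper names are my own):

```python
import numpy as np

def influence(f, n, i, samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(samples, n))
    X_flip = X.copy()
    X_flip[:, i] *= -1                          # flip coordinate i
    return np.mean([f(a) != f(b) for a, b in zip(X, X_flip)])

maj3 = lambda x: np.sign(x.sum())
print(influence(maj3, 3, 0))                    # ~0.5: each bit matters equally
```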

**18.-** Spectral bias: The tendency of neural networks to learn lower-frequency components faster in continuous settings.

**19.-** Parity function: A Boolean function that outputs the product of its input bits in the ±1 encoding (equivalently, their XOR in the 0/1 encoding).

**20.-** Majority function: A Boolean function that outputs 1 if more than half of its inputs are 1, and 0 otherwise.

**21.-** Hamming weight: The number of non-zero elements in a binary vector.
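
Entries 19-21 in code, showing both the 0/1 and the ±1 conventions (my examples):

```python
import numpy as np

b = np.array([1, 0, 1, 1, 0])        # a binary vector in the 0/1 convention
print(b.sum() % 2)                   # 19: parity (XOR of the bits) -> 1
print(int(b.sum() > len(b) / 2))     # 20: majority -> 1 (three of five bits set)
print(np.count_nonzero(b))           # 21: Hamming weight -> 3

x = 1 - 2 * b                        # the same vector mapped to {-1,+1}
print(np.prod(x))                    # parity as a product of +/-1 bits -> -1
```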

**22.-** Stochastic gradient descent (SGD): An optimization algorithm that updates parameters using estimated gradients from random subsets of data.

**23.-** Adam optimizer: An adaptive learning rate optimization algorithm commonly used in deep learning.

**24.-** Interpolating solution: A function that exactly matches the training data.

**25.-** Out-of-distribution generalization: The ability of models to perform well on data from a different distribution than the training data.

**26.-** Invariance: When a function's output remains unchanged under certain transformations of its input.

**27.-** Equivariance: When a function's output transforms predictably under certain transformations of its input.

**28.-** Sparse Boolean functions: Boolean functions that depend on only a small subset of their input variables.

**29.-** Neural tangent kernel (NTK): A kernel that describes the behavior of wide neural networks during training.

**30.-** Polynomial activation functions: Activation functions in neural networks that are polynomial functions of their input.

Knowledge Vault built by David Vivancos 2024