Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Resume:
1.- Machine learning involves solving standard mathematical problems in high dimensions, like function approximation and probability distribution estimation.
2.- Supervised learning aims to approximate a target function using finite training data.
3.- Unsupervised learning, e.g. generating fake faces, approximates an underlying probability distribution from finite samples.
4.- Reinforcement learning solves Bellman equations for Markov decision processes.
5.- Classical approximation theory suffers from the curse of dimensionality: with m parameters the error decays only like m^(-α/d), so it deteriorates rapidly as the dimension d grows (see the rate comparison after this list).
6.- Deep neural networks appear to perform better in high dimensions than classical methods.
7.- Total error can be decomposed into approximation error, estimation error, and optimization error.
8.- Monte Carlo methods can achieve dimension-independent convergence rates for certain problems such as integration, where the error decays like n^(-1/2) in the number of samples n whatever the dimension (a small numerical sketch follows the list).
9.- Two-layer neural networks can be represented as expectations, allowing for Monte Carlo-like approximations (the integral representation is written out after this list).
10.- Random feature models are associated with reproducing kernel Hilbert spaces (RKHS).
11.- Barron spaces are associated with two-layer neural networks and admit integral representations.
12.- Direct and inverse approximation theorems establish relationships between function spaces and neural network approximations.
13.- Rademacher complexity measures a function space's ability to fit random sign noise on the data points (its definition is recalled after this list).
14.- Regularized models can achieve Monte Carlo convergence rates for both approximation and estimation errors.
15.- Gradient-descent training faces challenges in high dimensions because different orthonormal target functions produce nearly identical gradients, making them hard to tell apart.
16.- The neural tangent kernel regime occurs in highly overparameterized networks but may not improve upon random feature models.
17.- The mean-field formulation expresses neural network training as a gradient flow with respect to the Wasserstein metric (written out after this list).
18.- Global minimizers in overparameterized regimes form submanifolds whose dimension is roughly the number of parameters minus the number of training samples.
19.- Stochastic gradient descent (SGD) exhibits an "escape phenomenon," potentially finding better solutions than gradient descent (GD).
20.- SGD stability analysis reveals preferences for more uniform solutions compared to GD.
21.- The "flat minima hypothesis" suggests SGD converges to flatter solutions that generalize better.
22.- SDE analysis of SGD dynamics supports the idea that it drifts towards flatter minima (one standard SDE approximation is written out after this list).
23.- Unsupervised learning faces challenges with memorization phenomena in methods like GANs.
24.- Recurrent neural networks encounter a "curse of memory" when approximating dynamical systems with long-term dependencies.
25.- Reinforcement learning lacks substantial results for high-dimensional state and action spaces.
26.- Understanding high-dimensional functions is a major new problem for mathematics.
27.- Global minima selection in later stages of training is an important aspect of neural network behavior.
28.- Insights can be gained through carefully designed numerical experiments and asymptotic analysis.
29.- Early stopping can sometimes improve generalization, but isn't always effective (e.g., in the NTK regime).
30.- Machine learning theory combines challenges from function approximation, algebra, learning dynamical systems, and probability distributions.
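Notes on a few of the items above (standard forms of the formulas they refer to; the notation and normalizations are my assumptions, not quotes from the lecture):

Items 5, 6 and 8 contrast two convergence rates. A minimal way to state the comparison, assuming the target function has α bounded derivatives on the unit cube and the approximant has m free parameters (a standard setting chosen here only for concreteness):

```latex
% Classical approximation of an \alpha-smooth function on [0,1]^d
% with m parameters (splines, fixed basis expansions, ...):
\inf_{f_m} \| f^* - f_m \| \;\sim\; C\, m^{-\alpha/d}
% so reaching accuracy \varepsilon needs m \sim \varepsilon^{-d/\alpha}
% parameters: the curse of dimensionality.

% Monte Carlo estimation of an integral with n i.i.d. samples x_i \sim \mu:
\mathbb{E}\left| \int g \, d\mu - \frac{1}{n}\sum_{i=1}^{n} g(x_i) \right|^2
  \;=\; \frac{\mathrm{Var}_{\mu}(g)}{n}
% an O(n^{-1/2}) error with no explicit dependence on the dimension d.
```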
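Item 8 can also be checked numerically. The sketch below is an added illustration (not code from the lecture); the integrand, dimensions and sample sizes are arbitrary choices made only to show that quadrupling the sample size roughly halves the error in any dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_rmse(d, n, trials=200):
    """Root-mean-square error of the plain Monte Carlo estimate of the
    integral of prod_i 2*x_i over the unit cube [0,1]^d (exact value 1),
    using n uniform random samples."""
    errs = []
    for _ in range(trials):
        x = rng.random((n, d))                     # n i.i.d. points in [0,1]^d
        est = np.prod(2.0 * x, axis=1).mean()      # sample mean of the integrand
        errs.append(est - 1.0)                     # signed error of this run
    return float(np.sqrt(np.mean(np.square(errs))))

for d in (2, 10):
    for n in (100, 400, 1600, 6400):
        # The n**(-1/2) rate is the same in every dimension; only the
        # constant (the standard deviation of the integrand) changes with d.
        print(f"d={d:2d}  n={n:5d}  rmse={mc_rmse(d, n):.4f}")
```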
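Items 9, 11, 12 and 14 revolve around the integral (expectation) representation behind two-layer networks and Barron spaces. One common way to write it, under a normalization I am assuming here (σ is the activation, ρ a probability measure over the parameters):

```latex
% Two-layer network of width m:
f_m(x) = \frac{1}{m}\sum_{j=1}^{m} a_j\,\sigma(w_j^{\top} x)

% Infinite-width (expectation) form, for a probability measure \rho:
f(x) = \mathbb{E}_{(a,w)\sim\rho}\big[\, a\,\sigma(w^{\top}x) \,\big]
     = \int a\,\sigma(w^{\top}x)\,\rho(da,dw)

% Barron norm (infimum over all measures \rho representing f;
% one common normalization for ReLU activations):
\|f\|_{\mathcal{B}} = \inf_{\rho}\ \mathbb{E}_{\rho}\big[\, |a|\,\|w\|_{1} \,\big]

% Direct approximation theorem: the Monte Carlo rate in the width m
\inf_{f_m} \|f - f_m\|_{L^2(\mu)} \;\lesssim\; \frac{\|f\|_{\mathcal{B}}}{\sqrt{m}}
```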
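Item 13: the standard definition of the empirical Rademacher complexity of a function class F on data points x_1, ..., x_n, with ξ_i i.i.d. random signs:

```latex
\widehat{\mathrm{Rad}}_n(\mathcal{F})
  = \mathbb{E}_{\xi}\left[\, \sup_{f\in\mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n} \xi_i\, f(x_i) \right],
\qquad \xi_i \in \{-1,+1\}\ \text{i.i.d. uniform}
```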
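Item 17: in the mean-field picture (usually stated for two-layer networks), training evolves the distribution ρ_t of the parameters (a, w); writing R(ρ) for the population risk, the dynamics take the form of a continuity equation, which is the Wasserstein-2 gradient flow of R. A sketch of the standard noiseless statement:

```latex
\partial_t \rho_t
  \;=\; \nabla\!\cdot\!\left( \rho_t\, \nabla \frac{\delta R}{\delta \rho}(\rho_t) \right)
% \frac{\delta R}{\delta \rho} is the first variation of the risk;
% equivalently  \dot{\rho}_t = -\,\mathrm{grad}_{W_2} R(\rho_t).
```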
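Item 22 refers to the common modelling step of approximating SGD with learning rate η by a stochastic differential equation. One standard small-learning-rate form (a modelling assumption, not an exact identity) is:

```latex
% Discrete SGD step with minibatch gradient g_k:
\theta_{k+1} = \theta_k - \eta\, g_k(\theta_k)

% SDE approximation, with \Sigma(\theta) the covariance of the
% minibatch gradient noise:
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta\,\Sigma(\theta_t)}\; dW_t
% Noise is larger along sharp directions, which makes flat minima more
% stable under the dynamics; this is the link to the flat-minima
% hypothesis in item 21.
```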
Knowledge Vault built by David Vivancos 2024