Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Resume:
1.- Optimization is key to machine learning, but most objective functions are non-convex, which makes optimization challenging.
2.- Non-convex functions can have multiple local optima. Saddle points pose a further challenge and are typically far more numerous than local optima in high dimensions.
3.- Saddle points still have directions of improvement, but gradient descent can get stuck because gradient values are small near them.
4.- Symmetry and over-specification in models like neural networks create exponentially many equivalent solutions and saddle points.
5.- Escaping saddle points is important to avoid learning plateaus. Pure second-order methods like Newton's method fail here because they are attracted to saddle points.
6.- A combination of first- and second-order methods, checking negative-curvature directions of the Hessian, can efficiently escape non-degenerate saddle points (see the point-6 sketch after the list).
7.- Stochastic gradient descent with sufficient noise can also escape saddle points in polynomial time under smoothness assumptions (see the point-7 sketch after the list).
8.- Degenerate saddle points, whose Hessian is only positive semi-definite, are harder; escaping them requires higher-order derivative information, which is NP-hard in general.
9.- Matrix eigenvector problems are simple non-convex problems: the top eigenvector is the only global optimum (up to sign), and the other eigenvectors are saddle points, as the Hessian reveals (see the point-9 sketch after the list).
10.- Tensor decomposition extends the matrix case but is harder; its components need only be linearly independent rather than orthogonal.
11.- Orthogonal tensor decomposition has exponentially many saddle points, but all are non-degenerate, so gradient descent converges to the global optima (see the point-11 sketch after the list).
12.- A whitening procedure reduces decomposition of non-orthogonal tensors to the orthogonal case when components are linearly independent; the noise analysis is challenging (see the point-12 sketch after the list).
13.- Topic models and other latent variable models can be trained via moment estimation and tensor decomposition under conditional independence assumptions (see the point-13 sketch after the list).
14.- Tensor methods are faster and more accurate than variational inference for topic modeling and community detection in practice.
15.- Convolutional dictionary learning finds compact representations via tensor decomposition, taking advantage of additional structure and symmetry.
16.- Fast text embeddings competitive with RNNs can be learned via tensor decomposition of low-dimensional word co-occurrences.
17.- Neural networks can be provably trained by decomposing a generalized moment involving a non-linear transformation (score function).
18.- The input density is needed to determine the score function for neural network training, connecting discriminative and generative learning.
19.- Reinforcement learning in partially observable domains benefits from tensor decomposition to model the hidden state transitions.
20.- Tensor RL is more sample-efficient than model-free deep RL in Atari games with partial observability.
21.- Non-convex optimization is an active research area. Stochastic gradient descent can escape saddle points, but degenerate saddle points remain challenging.
22.- Tensor decomposition allows reaching global optima in non-convex problems but requires problem-specific moments. Extending to general losses is open.
23.- Tensor methods are currently faster and more accurate than alternatives but lack easy-to-use libraries and robustness to noise.
24.- Open problems include escaping higher-order saddle points, additional conditions for global optimality, and unifying non-convex analysis.
25.- The speaker provided resources including a blog, Facebook group, talk videos, papers and upcoming workshops on non-convex optimization.
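A minimal sketch for point 6, assuming a toy two-dimensional objective with a saddle at the origin (the function, step size, and tolerances are illustrative choices, not the speaker's algorithm): take gradient steps, and once the gradient is nearly zero, check the Hessian for a negative eigenvalue and step along that direction.

import numpy as np

# Toy objective f(x, y) = x^2 - y^2 + y^4/4: saddle point at the origin,
# global minima at (0, +sqrt(2)) and (0, -sqrt(2)).
def grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1] + x[1]**3])

def hess(x):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0 + 3.0 * x[1]**2]])

def saddle_escape_descent(x, lr=0.1, tol=1e-6, steps=200):
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) > tol:
            x = x - lr * g                   # ordinary gradient step
        else:
            w, V = np.linalg.eigh(hess(x))   # gradient ~ 0: inspect curvature
            if w[0] < -tol:                  # negative eigenvalue => non-degenerate saddle
                x = x + lr * V[:, 0]         # move along the negative-curvature direction
            else:
                return x                     # approximate local minimum
    return x

print(saddle_escape_descent(np.array([0.0, 0.0])))   # escapes the saddle at the origin

The eigendecomposition is only computed when the gradient is small, which is the regime where a first-order method alone would stall.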
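A sketch for point 7, assuming the same toy objective: plain gradient descent with injected isotropic Gaussian noise, in the spirit of noisy SGD; the noise scale and step count are arbitrary choices for illustration.

import numpy as np

# Same toy objective as the point-6 sketch: f(x, y) = x^2 - y^2 + y^4/4.
def grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1] + x[1]**3])

def noisy_gd(x, lr=0.05, noise=0.01, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        g = grad(x) + noise * rng.standard_normal(x.shape)  # injected noise mimics SGD
        x = x - lr * g                                       # noise pushes x off the saddle
    return x

print(noisy_gd(np.array([0.0, 0.0])))   # ends near one of the minima at (0, +/-sqrt(2))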
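A sketch for point 9: power iteration for the top eigenvector of a symmetric matrix, which can be read as projected gradient ascent on the Rayleigh quotient x^T M x over the unit sphere; the example matrix is made up.

import numpy as np

def top_eigenvector(M, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x                    # ascent direction for the Rayleigh quotient x^T M x
        x /= np.linalg.norm(x)       # project back onto the unit sphere
    return x

M = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # made-up symmetric matrix
v = top_eigenvector(M)
print(v, v @ M @ v)                  # v @ M @ v approximates the largest eigenvalue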
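A sketch for point 11: the tensor power method recovering one component of a symmetric, orthogonally decomposable third-order tensor; the components and weights below are toy assumptions.

import numpy as np

def tensor_power_method(T, iters=100, seed=0):
    # Recovers one robust eigenpair of a symmetric, orthogonally decomposable 3rd-order tensor.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(T.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = np.einsum('ijk,j,k->i', T, x, x)    # power update x <- T(I, x, x)
        x /= np.linalg.norm(x)
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)  # recovered weight
    return lam, x

# Toy orthogonal tensor T = sum_i w_i a_i^(x)3 with orthonormal components a_i.
A = np.eye(3)                                   # columns are the components (assumption)
w = np.array([2.0, 1.0, 0.5])                   # positive weights (assumption)
T = np.einsum('i,ji,ki,li->jkl', w, A, A, A)
print(tensor_power_method(T))                   # converges to one (w_i, a_i) pair

Which component it converges to depends on the random start; deflating T by the recovered term and repeating would yield the remaining components.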
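A sketch for point 12: whiten the third-order moment with the second moment so it becomes orthogonally decomposable, run the tensor power method, then un-whiten. The toy components and weights are assumptions, and noise handling is omitted.

import numpy as np

def whiten_and_decompose(M2, M3, k, iters=200, seed=0):
    # Whitening: W^T M2 W = I_k, using the top-k eigendecomposition of the 2nd moment.
    D, U = np.linalg.eigh(M2)
    D, U = D[-k:], U[:, -k:]
    W = U / np.sqrt(D)                                   # d x k whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)      # whitened tensor is orthogonal
    # One run of the tensor power method on the whitened tensor (one component only).
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(k)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)           # equals 1/sqrt(w_i)
    a = lam * np.linalg.pinv(W.T) @ v                    # un-whiten to recover a_i
    return 1.0 / lam**2, a                               # (weight w_i, component a_i)

# Toy moments for non-orthogonal but linearly independent components (assumption).
A = np.array([[1.0, 0.5],
              [0.0, 1.0],
              [0.2, 0.1]])                               # columns are a_1, a_2
w = np.array([0.6, 0.4])
M2 = np.einsum('i,ji,ki->jk', w, A, A)
M3 = np.einsum('i,ji,ki,li->jkl', w, A, A, A)
print(whiten_and_decompose(M2, M3, k=2))                 # recovers one (w_i, a_i) pair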
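A sketch for point 13, assuming a single-topic (mixture) model rather than full LDA, which needs extra correction terms: under conditional independence, the symmetrized triple word co-occurrence moment equals sum_k p_k mu_k^(x)3, so the moments below could be fed to the whitening + tensor power routine sketched for point 12. The corpus is made up.

import numpy as np
from itertools import permutations

def empirical_moments(docs, vocab_size):
    # docs: list of word-id lists; each word is treated as a one-hot vector.
    M2 = np.zeros((vocab_size, vocab_size))
    M3 = np.zeros((vocab_size, vocab_size, vocab_size))
    n2 = n3 = 0
    for doc in docs:
        for a, b in permutations(doc[:2], 2):     # symmetrized pair co-occurrences
            M2[a, b] += 1
            n2 += 1
        for a, b, c in permutations(doc[:3], 3):  # symmetrized triple co-occurrences
            M3[a, b, c] += 1
            n3 += 1
    return M2 / n2, M3 / n3

docs = [[0, 1, 0], [2, 2, 1], [0, 0, 1], [1, 2, 2]]   # made-up corpus, vocabulary size 3
M2, M3 = empirical_moments(docs, vocab_size=3)
print(M2)   # these moments feed the whitening + tensor power routine from the point-12 sketch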
Knowledge Vault built by David Vivancos 2024