Recent Advances in Non-Convex Optimization

Anima Anandkumar

**Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
classDef challenges fill:#d4f9d4, font-weight:bold, font-size:14px
classDef methods fill:#d4d4f9, font-weight:bold, font-size:14px
classDef applications fill:#f9f9d4, font-weight:bold, font-size:14px
classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
Main[Recent Advances in Non-Convex Optimization]
Main --> A[Non-convex Optimization Challenges]
A --> A1[Non-convex optimization key for machine learning 1]
A --> A2[Multiple optima, saddle points pose challenges 2]
A --> A3[Gradient descent struggles near saddle points 3]
A --> A4[Model symmetry increases equivalent solutions 4]
A --> A5[Escaping saddles avoids learning plateaus 5]
Main --> B[Optimization Methods]
B --> B1[Combined methods escape non-degenerate saddles efficiently 6]
B --> B2[Stochastic gradient escapes saddles polynomially 7]
B --> B3[Degenerate saddles harder, NP-hard escape 8]
B --> B4[Matrix problems: one optimum, Hessian saddles 9]
B --> B5[Tensor decomposition harder than matrix problems 10]
B --> B6[Orthogonal tensors: exponential non-degenerate saddles 11]
Main --> C[Tensor Methods]
C --> C1[Whitening decomposes non-orthogonal tensors 12]
C --> C2[Latent models trained via tensor decomposition 13]
C --> C3[Tensor methods outperform variational inference 14]
C --> C4[Convolutional dictionary learning via tensors 15]
C --> C5[Fast text embeddings through tensor decomposition 16]
Main --> D[Applications]
D --> D1[Neural networks trained by moment decomposition 17]
D --> D2[Input density needed for score function 18]
D --> D3[Tensor decomposition benefits reinforcement learning 19]
D --> D4[Tensor RL outperforms deep RL 20]
Main --> E[Current State and Future Directions]
E --> E1[Non-convex optimization: active research area 21]
E --> E2[Tensor methods reach global optima 22]
E --> E3[Tensor methods lack robust libraries 23]
E --> E4[Open problems in non-convex optimization 24]
E --> E5[Resources provided for further learning 25]
class Main main
class A,A1,A2,A3,A4,A5 challenges
class B,B1,B2,B3,B4,B5,B6,C,C1,C2,C3,C4,C5 methods
class D,D1,D2,D3,D4 applications
class E,E1,E2,E3,E4,E5 future
```

**Resume:**

**1.-** Optimization is central to machine learning, but most objective functions are non-convex, which makes optimization challenging.

**2.-** Non-convex functions can have multiple local optima. Saddle points also pose challenges and are more numerous than local optima.

**3.-** Saddle points have directions of improvement, but gradient descent can get stuck near them because gradient values are small in their vicinity.
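Point 3 can be seen on a toy function: f(x, y) = x² − y² has a saddle at the origin, and plain gradient descent started on the stable manifold (y = 0) slides into the saddle and stalls there. This is a minimal illustrative sketch, not an example from the talk:

```python
import numpy as np

# Toy saddle: f(x, y) = x^2 - y^2 has a saddle point at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

# Start exactly on the stable manifold (y = 0): plain gradient
# descent slides into the saddle and stalls, since the gradient
# shrinks to zero as the iterate approaches the origin.
p = np.array([1.0, 0.0])
for _ in range(200):
    p = p - 0.1 * grad(p)

stalled_norm = np.linalg.norm(grad(p))  # ~0: stuck at the saddle
```

The iterate converges to the saddle even though moving in the y direction would decrease f without bound.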

**4.-** Symmetry and over-specification in models like neural networks lead to exponentially more equivalent solutions and saddle points.

**5.-** Escaping saddle points is important to avoid learning plateaus. Second-order methods like Newton's method fail near saddle points.

**6.-** A combination of first and second-order methods, checking negative curvature directions, can efficiently escape non-degenerate saddle points.

**7.-** Stochastic gradient descent with sufficient noise can also escape saddle points in polynomial time under smoothness assumptions.
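Point 7 can be illustrated with perturbed gradient descent: starting exactly at the saddle of f(x, y) = x² − y², isotropic gradient noise breaks the symmetry, after which the negative-curvature direction takes over. An illustrative sketch with made-up step size and noise scale, not the algorithm analyzed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(p):  # f(x, y) = x^2 - y^2, saddle at the origin
    x, y = p
    return np.array([2 * x, -2 * y])

# Perturbed gradient descent: noise kicks the iterate off the
# saddle's stable manifold, and the unstable (negative-curvature)
# direction then grows geometrically.
p = np.zeros(2)  # start exactly at the saddle
for _ in range(300):
    p = p - 0.05 * (grad(p) + rng.normal(scale=0.1, size=2))

escaped = abs(p[1]) > 1.0  # |y| grows once noise breaks the symmetry
```

Deterministic gradient descent from the same start would stay at the origin forever; the noise is what guarantees escape.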

**8.-** Degenerate saddle points, where the Hessian is only semi-definite, are harder; escaping them using higher-order derivatives is NP-hard.

**9.-** Matrix eigenvector problems are simple non-convex problems with one global optimum and saddle points determined by the Hessian.
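The top-eigenvector objective from point 9, max over unit vectors v of vᵀMv, is non-convex, yet power iteration reliably reaches its single global optimum; the other eigenvectors are the saddle points. A minimal sketch with an arbitrary diagonal test matrix:

```python
import numpy as np

# Non-convex objective: maximize v^T M v over unit vectors v.
# Global optimum = top eigenvector; other eigenvectors are saddles.
M = np.diag([3.0, 2.0, 1.0])

v = np.array([0.3, 0.8, 0.52])  # generic start (nonzero top component)
for _ in range(100):
    v = M @ v                   # power iteration step
    v /= np.linalg.norm(v)

top_eigenvalue = v @ M @ v      # converges to 3, the largest eigenvalue
```

A start with zero component on the top eigenvector would converge to a saddle instead, but such starts form a measure-zero set.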

**10.-** Tensor decomposition extends matrix problems but is harder: components may be merely linearly independent rather than orthogonal.

**11.-** Orthogonal tensor decomposition has exponentially many saddle points, but all are non-degenerate, so gradient descent converges to the global optima.
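Point 11 can be demonstrated with the tensor power iteration v ← T(I, v, v) / ‖T(I, v, v)‖ on a small orthogonal tensor: a random start converges to one of the true components. An illustrative sketch; the tensor and weights here are made up:

```python
import numpy as np

# Symmetric orthogonal tensor T = sum_i w_i * e_i (x) e_i (x) e_i.
d = 3
weights = np.array([3.0, 2.0, 1.0])
T = np.zeros((d, d, d))
for i in range(d):
    e = np.zeros(d)
    e[i] = 1.0
    T += weights[i] * np.einsum('a,b,c->abc', e, e, e)

# Tensor power iteration: v <- T(I, v, v), then normalize.
rng = np.random.default_rng(1)
v = rng.normal(size=d)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.einsum('abc,b,c->a', T, v, v)
    v /= np.linalg.norm(v)

# v converges to one of the basis vectors e_i (a robust tensor
# eigenvector); which one depends on the random start.
recovered = np.argmax(np.abs(v))
```

After deflating the recovered component from T, repeating the iteration recovers the remaining components one by one.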

**12.-** A whitening procedure allows decomposition of non-orthogonal tensors when components are linearly independent. Noise analysis is challenging.
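A hedged sketch of the whitening idea from point 12: with second moment M2 = A·diag(w)·Aᵀ and a whitening matrix W built from its rank-k eigendecomposition so that WᵀM2W = I, the transformed components become orthonormal. The component matrix A and weights w below are arbitrary illustrative values:

```python
import numpy as np

# Non-orthogonal but linearly independent components A (d x k).
rng = np.random.default_rng(2)
d, k = 4, 3
A = rng.normal(size=(d, k))        # hypothetical component matrix
w = np.array([1.0, 2.0, 3.0])      # hypothetical component weights
M2 = A @ np.diag(w) @ A.T          # second-moment matrix, rank k

# Whitening: W = U S^{-1/2} from the top-k eigenpairs of M2
# satisfies W^T M2 W = I_k.
eigvals, eigvecs = np.linalg.eigh(M2)
U, S = eigvecs[:, -k:], eigvals[-k:]
W = U @ np.diag(1.0 / np.sqrt(S))

# The whitened components sqrt(w_i) * W^T a_i are orthonormal,
# so orthogonal tensor methods apply after whitening.
V = W.T @ A @ np.diag(np.sqrt(w))
orthonormal = np.allclose(V @ V.T, np.eye(k), atol=1e-6)
```

This reduction is exact in the noiseless setting; as the point notes, analyzing how whitening amplifies noise is the hard part.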

**13.-** Topic models and other latent variable models can be trained via moment estimation and tensor decomposition under conditional independence.

**14.-** Tensor methods are faster and more accurate than variational inference for topic modeling and community detection in practice.

**15.-** Convolutional dictionary learning finds compact representations via tensor decomposition, taking advantage of additional structure and symmetry.

**16.-** Fast text embeddings competitive with RNNs can be learned via tensor decomposition of low-dimensional word co-occurrences.

**17.-** Neural networks can be provably trained by decomposing a generalized moment involving a non-linear transformation (score function).

**18.-** The input density is needed to determine the score function for neural network training, connecting discriminative and generative learning.

**19.-** Reinforcement learning in partially observable domains benefits from tensor decomposition to model the hidden state transitions.

**20.-** Tensor RL is more sample-efficient than model-free deep RL in Atari games with partial observability.

**21.-** Non-convex optimization is an active research area. Stochastic gradient descent can escape saddle points but degenerate points are challenging.

**22.-** Tensor decomposition allows reaching global optima in non-convex problems but requires problem-specific moments. Extending to general losses is open.

**23.-** Tensor methods are currently faster and more accurate than alternatives but lack easy-to-use libraries and robustness to noise.

**24.-** Open problems include escaping higher-order saddle points, additional conditions for global optimality, and unifying non-convex analysis.

**25.-** The speaker provided resources including a blog, Facebook group, talk videos, papers and upcoming workshops on non-convex optimization.

Knowledge Vault built by David Vivancos 2024