Understanding self-supervised learning dynamics without contrastive pairs
Yuandong Tian · Xinlei Chen · Surya Ganguli
1.- Self-supervised learning: Learns representations from data augmentation without labeled datasets, useful in NLP, speech, and computer vision.

2.- Non-contrastive SSL: Doesn't require negative pairs, raising questions about why it doesn't collapse to trivial solutions.

3.- Minimum model: Simple linear model to study non-contrastive SSL dynamics, using online and target networks with a predictor.

4.- Stop gradient: Technique preventing gradient flow through the target network.

5.- Predictor importance: Essential component in non-contrastive SSL to prevent collapse.

6.- Isotropic assumptions: Simplifying assumptions about data and augmentation distributions for analysis.

7.- Symmetric predictor: Assumes the predictor weight matrix is symmetric, motivated by the empirical observation that it becomes approximately symmetric during training.

8.- Reduced dynamics: Simplified equations describing the training process under stated assumptions.

9.- Eigenspace alignment: Gradual alignment of predictor and correlation matrix eigenspaces during training.

10.- Decoupled dynamics: Simplified 1D scalar case analysis after eigenspace alignment.

11.- Phase diagram: Visual representation of system dynamics, showing trivial and non-trivial basins.

12.- Trivial basin: Region where initialization leads to collapse (trivial solution).

13.- Non-trivial basin: Region where initialization leads to meaningful representations.

14.- Weight decay effects: Influences trivial basin size and eigenspace alignment.

15.- Relative learning rate: Affects trivial basin size and eigenspace alignment condition.

16.- Exponential moving average: Impacts eigenspace alignment and training speed.

17.- DirectPred: Novel non-contrastive SSL method that sets the linear predictor directly from the eigendecomposition of the correlation matrix instead of training it by gradient descent.

18.- Online estimation: Technique to estimate correlation matrix for nonlinear networks.

19.- Eigendecomposition: Process to obtain eigenvalues and eigenvectors of estimated correlation matrix.

20.- Predictor construction: Setting predictor eigenvalues and eigenvectors based on correlation matrix and discovered invariance.

21.- Empirical performance: DirectPred shows strong results on CIFAR-10, STL-10, and ImageNet.

22.- Hybrid approach: Combining DirectPred with gradient updates for improved performance.

23.- ImageNet results: DirectPred, using only a linear predictor, performs comparably to BYOL's two-layer nonlinear predictor and outperforms a gradient-trained linear predictor.

24.- Systematic analysis: Study of non-contrastive SSL dynamics using minimal linear setting.

25.- Hyperparameter roles: Understanding effects of various hyperparameters on training dynamics.

26.- Code availability: Open-source implementation of DirectPred.

27.- Linear vs. nonlinear predictors: Comparison of performance between simple linear and complex nonlinear predictors.

28.- Theoretical implications: Insights into why non-contrastive SSL works and doesn't collapse.

29.- Practical applications: Potential for improving AI systems using massive unlabeled datasets.

30.- Future research: Opening doors for further investigation into self-supervised learning dynamics.
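
The predictor construction in points 17 to 20 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the correlation matrix of the online network's outputs is tracked with an exponential moving average, and that each predictor eigenvalue is the square root of the matching correlation eigenvalue plus a small regularizer. The function names, the parameters `rho` and `eps`, and their default values are hypothetical choices made for illustration.

```python
import numpy as np


def update_correlation(F_hat, f_batch, rho=0.3):
    """Moving-average estimate of the correlation matrix E[f f^T].

    F_hat:   current (d, d) estimate.
    f_batch: (batch, d) array of online-network outputs.
    rho:     weight kept on the old estimate (illustrative default).
    """
    batch_corr = f_batch.T @ f_batch / f_batch.shape[0]
    return rho * F_hat + (1.0 - rho) * batch_corr


def build_predictor(F_hat, eps=0.1):
    """Build a symmetric linear predictor that shares F_hat's eigenvectors.

    Assumed rule: p_j = sqrt(s_j) + eps * max_j s_j, where s_j are the
    eigenvalues of F_hat. The constant eps is an illustrative choice.
    """
    # Symmetrize to guard against floating-point asymmetry.
    F_sym = 0.5 * (F_hat + F_hat.T)
    s, U = np.linalg.eigh(F_sym)
    # Clip tiny negative eigenvalues caused by rounding.
    s = np.clip(s, 0.0, None)
    p = np.sqrt(s) + eps * s.max()
    # U * p scales column j of U by p[j], so this is U diag(p) U^T.
    return (U * p) @ U.T


# Usage with random stand-in features (not real network outputs).
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8))
F_hat = update_correlation(np.zeros((8, 8)), features, rho=0.0)
W_p = build_predictor(F_hat)
```

Because `W_p` is built from the eigenvectors of `F_hat`, the two matrices commute. That is the eigenspace alignment of point 9, imposed directly instead of being reached gradually through gradient updates.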

