Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Goal is to learn representations of relations between images to enable tasks like stereo depth, motion understanding, analogy making.
2.-Standard neural networks can't effectively learn relations because additive hidden units process the two input images independently, summing their contributions rather than comparing them.
3.-Solution is to use a graphical model with multiplicative interactions between units representing the two images.
4.-Inference in this model involves summing over pairwise products of image features, enabling the model to capture relations.
5.-An alternative motivation is that related images lie on orbits or manifolds parameterized by the transformation relating them.
6.-This suggests making the weights of a model of one image be a function (e.g. linear) of the other image.
7.-Inference then naturally involves multiplicative interactions (summing pairwise products) between the two images.
8.-Training involves a conditional cost function reconstructing one image given the other. Can be trained like an autoencoder or RBM.
9.-To reduce the number of pairwise interactions, the parameter tensor can be factorized by first projecting each image onto lower-dimensional filters.
10.-The model learns phase-shifted, edge-like filters to efficiently encode transformations like translation, rotation, affine transforms.
11.-Commuting transformations share the same eigenspaces, so the model just needs to infer the angle within these eigenspaces.
12.-The angle within an eigenspace is computed from inner products - sums of pairwise coordinate products - which is exactly the computation the model performs.
13.-The model's hidden units encode the transformation invariant to the absolute pose of each image - they respond only to the relative pose.
14.-This invariance of the learned features to pose is automatic - the model gets it for free by training on transformed images.
15.-Applied the model to stereo images to predict depth without requiring camera calibration. Performs decently but not state-of-the-art.
16.-Combined models encoding motion and depth to recognize actions in video. Depth helps slightly for some actions.
17.-The model enables analogy making by inferring the transformation between two inputs and applying it to a third input.
18.-Works on toy rotations of bars and digits, keeps person identity in face analogies, decomposes complex 3D rotations.
19.-Recent work on training the model directly for analogies by reconstructing images several steps apart in a video.
20.-Requires a higher-level model on top to infer second-order "acceleration" from the transformations and apply it recurrently in time.
21.-Pretraining and backpropagating through time are both necessary to make this video prediction model work well.
22.-The model can predict further rotations of a 3D object from three initial views by inferring the rotation speed and acceleration.
23.-It captures the 3D structure, can render unseen views, and degrades gracefully.
24.-Applied the model to predict the continuation of simple melodies represented as piano rolls.
25.-Ongoing work, but shows promise for capturing abstract structure for temporal prediction in various domains.
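Points 6-10, 9 and 17 above can be sketched as a factored gated autoencoder: each image is projected onto filters, the filter responses are multiplied pairwise, and hidden units pool the products; one image is then reconstructed conditioned on the other. The following numpy sketch uses random, untrained weights - all names and dimensions are illustrative assumptions, not the talk's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, K = 64, 32, 16          # image dim, number of filters, hidden units

# Factorization (point 9): instead of a full D x D x K tensor of pairwise
# weights, use per-image filters U, V and pooling weights W.
U = rng.normal(scale=0.1, size=(F, D))   # filters applied to image x
V = rng.normal(scale=0.1, size=(F, D))   # filters applied to image y
W = rng.normal(scale=0.1, size=(K, F))   # hidden units pool filter products

def encode(x, y):
    """Infer the transformation: sum pairwise products of filter responses."""
    return 1.0 / (1.0 + np.exp(-W @ ((U @ x) * (V @ y))))   # sigmoid hiddens

def decode(x, h):
    """Conditionally reconstruct y given x and the inferred transformation h."""
    return V.T @ ((U @ x) * (W.T @ h))

# Conditional cost (point 8): reconstruct one image given the other.
x, y = rng.normal(size=D), rng.normal(size=D)
h = encode(x, y)
cost = np.sum((decode(x, h) - y) ** 2)

# Analogy making (point 17): infer the transformation from the pair (a, b),
# then apply it to a third input c.
a, b, c = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)
d = decode(c, encode(a, b))
```

With trained weights, U and V learn the phase-shifted, edge-like filter pairs of point 10; here the sketch only shows the multiplicative inference structure.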
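Points 11-14 can be checked in the simplest eigenspace there is: 2D rotations all commute and share eigenvectors, and the relative angle between two vectors comes from pairwise coordinate products (dot and cross products) regardless of the absolute pose - which also yields analogy making (point 17). A toy numpy demonstration, assumed for illustration:

```python
import numpy as np

def rot(t):
    """2D rotation matrix; all such rotations commute (point 11)."""
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def relative_angle(x, y):
    """Recover theta relating y = rot(theta) @ x from pairwise coordinate
    products only: cross product (sin) and inner product (cos), point 12."""
    return np.arctan2(x[0] * y[1] - x[1] * y[0], x[0] * y[0] + x[1] * y[1])

theta = 0.7
for pose in (0.0, 1.3, 2.9):             # absolute pose varies ...
    x = rot(pose) @ np.array([1.0, 0.0])
    y = rot(theta) @ x
    assert np.isclose(relative_angle(x, y), theta)   # ... inferred angle doesn't

# Analogy making: a is to b as c is to d.
a = np.array([1.0, 0.0])
b = rot(theta) @ a
c = rot(2.0) @ np.array([1.0, 0.0])
d = rot(relative_angle(a, b)) @ c        # apply inferred transformation to c
```

The pose loop is exactly the invariance of points 13-14: the inferred angle depends only on the relation between the two inputs, never on their absolute pose.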
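Points 20-22 describe a higher-level model that infers how the transformation itself changes ("acceleration") and applies it recurrently in time. With scalar angles standing in for inferred transformations, the recurrence reduces to the sketch below - an illustrative assumption, not the actual architecture:

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def angle_between(x, y):
    """First-order inference: angle relating two consecutive frames."""
    return np.arctan2(x[0] * y[1] - x[1] * y[0], x[0] * y[0] + x[1] * y[1])

# Three initial views of an accelerating rotation (point 22):
# per-step angle grows by a constant acceleration each step.
speed, accel = 0.2, 0.05
angles = [0.0, speed, 2 * speed + accel]
frames = [rot(t) @ np.array([1.0, 0.0]) for t in angles]

# Second-order inference: per-step rotations and their constant change.
v0 = angle_between(frames[0], frames[1])
v1 = angle_between(frames[1], frames[2])
acc = v1 - v0                            # inferred "acceleration"

# Apply recurrently in time to predict further frames (point 20).
preds, x, v = [], frames[-1], v1
for _ in range(3):
    v += acc                             # update the transformation ...
    x = rot(v) @ x                       # ... and apply it to the last frame
    preds.append(x)
```

In the real model the "angle" is a vector of hidden transformation units and the recurrence is learned, with pretraining and backpropagation through time both required (point 21).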
Knowledge Vault built by David Vivancos 2024