Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Goal is to learn representations of relations between images to enable tasks like stereo depth, motion understanding, analogy making.
2.-Standard neural networks can't effectively learn relations because additive hidden units process the two input images independently, summing their contributions rather than comparing them.
3.-Solution is to use a graphical model with multiplicative interactions between units representing the two images.
4.-Inference in this model involves summing over pairwise products of image features, enabling the model to capture relations.
5.-An alternative motivation is that related images lie on orbits or manifolds parameterized by the transformation relating them.
6.-This suggests making the weights of a model of one image be a function (e.g. linear) of the other image.
7.-Inference then naturally involves multiplicative interactions (summing pairwise products) between the two images.
8.-Training involves a conditional cost function reconstructing one image given the other. Can be trained like an autoencoder or RBM.
9.-To reduce the number of pairwise interactions, the parameter tensor can be factorized by first projecting each image onto lower-dimensional filters.
10.-The model learns phase-shifted, edge-like filters to efficiently encode transformations like translation, rotation, affine transforms.
11.-Commuting transformations share the same eigenspaces, so the model just needs to infer the angle within these eigenspaces.
12.-The angle within an eigenspace is computed from inner products - sums of pairwise coordinate products - which is exactly the computation the model performs.
13.-The model's hidden units encode the transformation invariant to the absolute pose of each image - they respond only to the relative pose.
14.-This invariance of the learned features to pose is automatic - the model gets it for free by training on transformed images.
15.-Applied the model to stereo images to predict depth without requiring camera calibration. Performs decently but not state-of-the-art.
16.-Combined models encoding motion and depth to recognize actions in video. Depth helps slightly for some actions.
17.-The model enables analogy making by inferring the transformation between two inputs and applying it to a third input.
18.-Works on toy rotations of bars and digits, keeps person identity in face analogies, decomposes complex 3D rotations.
19.-Recent work on training the model directly for analogies by reconstructing images several steps apart in a video.
20.-Requires a higher-level model on top to infer second-order "acceleration" from the transformations and apply it recurrently in time.
21.-Pretraining and backpropagating through time are both necessary to make this video prediction model work well.
22.-The model can predict further rotations of a 3D object from three initial views by inferring the rotation speed and acceleration.
23.-It captures the 3D structure, can render unseen views, and degrades gracefully.
24.-Applied the model to predict the continuation of simple melodies represented as piano rolls.
25.-Ongoing work, but shows promise for capturing abstract structure for temporal prediction in various domains.
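Points 6-10, 9 and 17 above can be sketched as a factored gated autoencoder: each image is projected onto filters, the filter responses are multiplied pairwise, and hidden units pool the products; one image is then reconstructed conditioned on the other. The following numpy sketch uses random, untrained weights - all names and dimensions are illustrative assumptions, not the talk's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, K = 64, 32, 16          # image dim, number of filters, hidden units

# Factorization (point 9): instead of a full D x D x K tensor of pairwise
# weights, use per-image filters U, V and pooling weights W.
U = rng.normal(scale=0.1, size=(F, D))   # filters applied to image x
V = rng.normal(scale=0.1, size=(F, D))   # filters applied to image y
W = rng.normal(scale=0.1, size=(K, F))   # hidden units pool filter products

def encode(x, y):
    """Infer the transformation: sum pairwise products of filter responses."""
    return 1.0 / (1.0 + np.exp(-W @ ((U @ x) * (V @ y))))   # sigmoid hiddens

def decode(x, h):
    """Conditionally reconstruct y given x and the inferred transformation h."""
    return V.T @ ((U @ x) * (W.T @ h))

# Conditional cost (point 8): reconstruct one image given the other.
x, y = rng.normal(size=D), rng.normal(size=D)
h = encode(x, y)
cost = np.sum((decode(x, h) - y) ** 2)

# Analogy making (point 17): infer the transformation from the pair (a, b),
# then apply it to a third input c.
a, b, c = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)
d = decode(c, encode(a, b))
```

With trained weights, U and V learn the phase-shifted, edge-like filter pairs of point 10; here the sketch only shows the multiplicative inference structure.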
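Points 11-14 can be checked in the simplest eigenspace there is: 2D rotations all commute and share eigenvectors, and the relative angle between two vectors comes from pairwise coordinate products (dot and cross products) regardless of the absolute pose - which also yields analogy making (point 17). A toy numpy demonstration, assumed for illustration:

```python
import numpy as np

def rot(t):
    """2D rotation matrix; all such rotations commute (point 11)."""
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def relative_angle(x, y):
    """Recover theta relating y = rot(theta) @ x from pairwise coordinate
    products only: cross product (sin) and inner product (cos), point 12."""
    return np.arctan2(x[0] * y[1] - x[1] * y[0], x[0] * y[0] + x[1] * y[1])

theta = 0.7
for pose in (0.0, 1.3, 2.9):             # absolute pose varies ...
    x = rot(pose) @ np.array([1.0, 0.0])
    y = rot(theta) @ x
    assert np.isclose(relative_angle(x, y), theta)   # ... inferred angle doesn't

# Analogy making: a is to b as c is to d.
a = np.array([1.0, 0.0])
b = rot(theta) @ a
c = rot(2.0) @ np.array([1.0, 0.0])
d = rot(relative_angle(a, b)) @ c        # apply inferred transformation to c
```

The pose loop is exactly the invariance of points 13-14: the inferred angle depends only on the relation between the two inputs, never on their absolute pose.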
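Points 20-22 describe a higher-level model that infers how the transformation itself changes ("acceleration") and applies it recurrently in time. With scalar angles standing in for inferred transformations, the recurrence reduces to the sketch below - an illustrative assumption, not the actual architecture:

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def angle_between(x, y):
    """First-order inference: angle relating two consecutive frames."""
    return np.arctan2(x[0] * y[1] - x[1] * y[0], x[0] * y[0] + x[1] * y[1])

# Three initial views of an accelerating rotation (point 22):
# per-step angle grows by a constant acceleration each step.
speed, accel = 0.2, 0.05
angles = [0.0, speed, 2 * speed + accel]
frames = [rot(t) @ np.array([1.0, 0.0]) for t in angles]

# Second-order inference: per-step rotations and their constant change.
v0 = angle_between(frames[0], frames[1])
v1 = angle_between(frames[1], frames[2])
acc = v1 - v0                            # inferred "acceleration"

# Apply recurrently in time to predict further frames (point 20).
preds, x, v = [], frames[-1], v1
for _ in range(3):
    v += acc                             # update the transformation ...
    x = rot(v) @ x                       # ... and apply it to the last frame
    preds.append(x)
```

In the real model the "angle" is a vector of hidden transformation units and the recurrence is learned, with pretraining and backpropagation through time both required (point 21).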
Knowledge Vault built by David Vivancos 2024