Roland Memisevic ICLR 2014 - Invited Talk - Representing Relations

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:**

```mermaid
graph LR
classDef models fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef inference fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef applications fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef training fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef extensions fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Roland Memisevic ICLR 2014] --> B[Learn image relation representations for diverse tasks 1]
A --> C[Standard neural nets can't learn image relations effectively 2]
A --> I[Train with conditional cost reconstructing one given other 8]
A --> K[Learns phase-shifted edge-like filters encoding image transformations 10]
A --> N[Hidden units encode transformation invariant to image pose 13]
A --> P[Applied to stereo depth without camera calibration, decent 15]
C --> D[Use graphical model with multiplicative image unit interactions 3]
D --> E[Inference sums pairwise products of image features 4]
D --> F[Related images lie on transformation-parameterized orbits/manifolds 5]
I --> J[Reduce interactions by projecting onto lower-dimensional filters first 9]
K --> L[Commuting transformations share eigenspaces, infer angle within them 11]
K --> M[Angle computed by summing pairwise coordinate products 12]
N --> O[Pose-invariant features emerge automatically from transformed image training 14]
P --> Q[Motion and depth models slightly help action recognition 16]
P --> R[Infers transformation between inputs, applies to third 17]
Q --> T[Train directly for multi-step video frame analogies 19]
Q --> W[Predicts further 3D rotations from initial views 22]
R --> S[Works on toy rotations, faces, complex 3D 18]
T --> U[Higher-level model infers 'acceleration', applies recurrently 20]
T --> V[Pretraining and backpropagation through time necessary 21]
W --> X[Captures 3D structure, renders unseen views, degrades gracefully 23]
P --> Y[Applied to predict continuation of simple melodies 24]
P --> Z[Promising for capturing abstract structure for temporal prediction 25]
class B,C,D,F models;
class E,K,L,M,N inference;
class I,J,O,T,U,V training;
class P,Q,R,S,W,X,Y applications;
class Z extensions;
```


**Resume:**

**1.-**The goal is to learn representations of relations between images, enabling tasks like stereo depth estimation, motion understanding, and analogy making.

**2.-**Standard neural networks can't effectively learn relations because hidden units would decouple the two input images.

**3.-**Solution is to use a graphical model with multiplicative interactions between units representing the two images.

**4.-**Inference in this model involves summing over pairwise products of image features, enabling the model to capture relations.

**5.-**An alternative motivation is that related images lie on orbits or manifolds parameterized by the transformation relating them.

**6.-**This suggests making the weights of a model of one image be a function (e.g. linear) of the other image.

**7.-**Inference then naturally involves multiplicative interactions (summing pairwise products) between the two images.
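Points 3-7 can be sketched numerically. The following is a minimal NumPy illustration (not the talk's actual implementation; all shapes and names are hypothetical) of inference through a three-way interaction tensor, where each hidden "mapping" unit pools the pairwise products of the two images' pixels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_mapping(W, x, y):
    """Infer 'mapping' units from an image pair.

    W: (nx, ny, nh) three-way interaction tensor (hypothetical shapes).
    Each hidden unit k pools the pairwise products x_i * y_j,
    weighted by W[i, j, k] -- the multiplicative interaction.
    """
    return sigmoid(np.einsum('ijk,i,j->k', W, x, y))

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 16, 8))   # toy tensor
x = rng.normal(size=16)                  # flattened image 1
y = rng.normal(size=16)                  # flattened image 2
h = infer_mapping(W, x, y)               # shape (8,), each value in (0, 1)
```

Note how a plain feedforward layer on the concatenation `[x, y]` would only sum the two images' contributions, whereas the products above let hidden units respond to how x and y relate.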

**8.-**Training uses a conditional cost function that reconstructs one image given the other. The model can be trained like an autoencoder or an RBM.
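A hedged sketch of such a conditional cost, assuming a hypothetical unfactored three-way tensor: infer the mapping units from the pair, reconstruct y given x and the inferred mapping, and score the reconstruction with squared error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_cost(W, x, y):
    """Conditional autoencoder-style cost (sketch, hypothetical shapes).

    W: (nx, ny, nh) interaction tensor. Infer the mapping units from
    the pair, then reconstruct y given x and the inferred mapping.
    """
    h = sigmoid(np.einsum('ijk,i,j->k', W, x, y))   # encode the relation
    y_hat = np.einsum('ijk,i,k->j', W, x, h)        # decode y from x and h
    return 0.5 * np.sum((y - y_hat) ** 2)           # squared reconstruction error

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(8, 8, 4))
x, y = rng.normal(size=8), rng.normal(size=8)
cost = reconstruction_cost(W, x, y)      # non-negative scalar
```

Training would then minimize this cost over image pairs by gradient descent on W, with weights tied between encoder and decoder as in a standard autoencoder.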

**9.-**To reduce the number of pairwise interactions, the parameter tensor can be factorized by first projecting the images onto lower-dimensional filters.
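A minimal sketch of this factorization (hypothetical shapes and names): writing W[i,j,k] = Σ_f U[i,f] V[j,f] P[k,f] lets inference project each image onto F filters and multiply filterwise, cutting the parameter count from nx·ny·nh to (nx+ny+nh)·F.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_factored(U, V, P, x, y):
    """Factorized inference: project each image onto filters first,
    then take filterwise products instead of all pairwise products.
    U: (nx, nf), V: (ny, nf), P: (nh, nf) -- hypothetical shapes.
    """
    fx = U.T @ x                       # filter responses of image x
    fy = V.T @ y                       # filter responses of image y
    return sigmoid(P @ (fx * fy))      # pool filterwise products per hidden unit

rng = np.random.default_rng(1)
nx, ny, nf, nh = 16, 16, 12, 8
U = 0.1 * rng.normal(size=(nx, nf))
V = 0.1 * rng.normal(size=(ny, nf))
P = 0.1 * rng.normal(size=(nh, nf))
x, y = rng.normal(size=nx), rng.normal(size=ny)
h = infer_factored(U, V, P, x, y)
```

Rebuilding the implied full tensor with `np.einsum('if,jf,kf->ijk', U, V, P)` and running the unfactored pairwise sum yields the same activations, which makes for an easy consistency check.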

**10.-**The model learns phase-shifted, edge-like filters to efficiently encode transformations like translation, rotation, affine transforms.

**11.-**Commuting transformations share the same eigenspaces, so the model just needs to infer the angle within these eigenspaces.

**12.-**Angle in an eigenspace is computed by the inner product - summing pairwise coordinate products, which the model does.
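Concretely, for a plain 2-D rotation (the simplest such eigenspace) the angle falls out of sums of pairwise coordinate products alone; a small illustrative sketch, not the model itself:

```python
import numpy as np

def angle_in_eigenspace(x, y):
    """Angle between the projections of two images within one 2-D
    invariant subspace, from pairwise coordinate products only."""
    cos_t = x[0] * y[0] + x[1] * y[1]   # inner product  ~ |x||y| cos(theta)
    sin_t = x[0] * y[1] - x[1] * y[0]   # cross term     ~ |x||y| sin(theta)
    return np.arctan2(sin_t, cos_t)     # the common scale |x||y| cancels

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([2.0, 1.0])
y = R @ x                               # y is x rotated by theta
recovered = angle_in_eigenspace(x, y)   # equals theta
```

Because both products scale with |x||y|, the recovered angle is independent of the content within the subspace, which is exactly the pose invariance described in point 13.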

**13.-**The model's hidden units encode the transformation invariantly to the pose of each image: they respond only to the relative pose.

**14.-**This invariance of the learned features to pose is automatic - the model gets it for free by training on transformed images.

**15.-**Applied the model to stereo images to predict depth without requiring camera calibration. Performs decently but not state-of-the-art.

**16.-**Combined models encoding motion and depth to recognize actions in video. Depth helps slightly for some actions.

**17.-**The model enables analogy making by inferring the transformation between two inputs and applying it to a third input.
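Analogy making in a factored gated model of this kind can be sketched in two calls (hypothetical shapes and names, not the talk's exact formulation): encode the transformation from the pair (x, y), then apply that same code to a third image z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer(U, V, P, x, y):
    """Encode the transformation relating x and y as mapping units h."""
    return sigmoid(P @ ((U.T @ x) * (V.T @ y)))

def apply_transform(U, V, P, h, z):
    """Apply the inferred transformation h to a third image z."""
    return V @ ((P.T @ h) * (U.T @ z))

rng = np.random.default_rng(2)
n, nf, nh = 16, 12, 8
U = 0.1 * rng.normal(size=(n, nf))
V = 0.1 * rng.normal(size=(n, nf))
P = 0.1 * rng.normal(size=(nh, nf))
x, y, z = (rng.normal(size=n) for _ in range(3))

h = infer(U, V, P, x, y)               # "x is to y ..."
out = apply_transform(U, V, P, h, z)   # "... as z is to out"
```

Since h encodes only the relative transformation (point 13), the same code transfers across image content, which is what makes the face and 3D-rotation analogies in point 18 possible.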

**18.-**Works on toy rotations of bars and digits, keeps person identity in face analogies, decomposes complex 3D rotations.

**19.-**Recent work on training the model directly for analogies by reconstructing images several steps apart in a video.

**20.-**Requires a higher-level model on top to infer a second-order "acceleration" from the transformations and apply it recurrently in time.

**21.-**Pretraining and backpropagating through time are both necessary to make this video prediction model work well.

**22.-**The model can predict further rotations of a 3D object from three initial views by inferring the rotation speed and acceleration.

**23.-**It captures the 3D structure and can render unseen views, degrades gracefully.

**24.-**Applied the model to predict the continuation of simple melodies represented as piano rolls.

**25.-**Ongoing work, but shows promise for capturing abstract structure for temporal prediction in various domains.

Knowledge Vault built by David Vivancos 2024