Generative Image Dynamics

Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
classDef sutton fill:#f9d4d4, font-weight:bold, font-size:14px
classDef representation fill:#d4f9d4, font-weight:bold, font-size:14px
classDef jeff fill:#d4d4f9, font-weight:bold, font-size:14px
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px
classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
A[Generative Image Dynamics] --> B[Generative Image Dynamics: model scene motion priors 1]
A --> C[Spectral Volume: frequency-domain dense pixel trajectories 2]
A --> D[Image-Based Rendering: animate frames from image 3]
A --> E[Latent Diffusion Model: predicts spectral volumes 4]
E --> F[Frequency-Coordinated Denoising: coherent frequency predictions 5]
E --> G[Adaptive Frequency Normalization: stable spectral coefficients 6]
C --> H[Motion Texture: long-range per-pixel trajectories 7]
B --> I[Seamless Looping: endless videos via self-guidance 8]
B --> J[Interactive Dynamics: simulated responses to forces 9]
C --> K[Modal Analysis: spectral volumes as modal bases 10]
D --> L[Eulerian Motion Fields: dense displacement maps 11]
D --> M[Softmax Splatting: forward-warping for occlusions, multiple pixels 12]
A --> N[FID: evaluate generated image quality 13]
A --> O[FVD: evaluate generated video quality, coherence 14]
A --> P[DTFVD: evaluate oscillatory motions in videos 15]
A --> Q[Sliding Window Metrics: measure quality over time 16]
E --> R[VAE: compresses input, reconstructs output 17]
E --> S[U-Net: iterative denoising diffusion architecture 18]
B --> T[Classifier-Free Guidance: guides diffusion sampling 19]
I --> U[Motion Self-Guidance: enforces looping in sampling 20]
K --> V[Image-Space Modal Basis: simulates interactive dynamics 21]
C --> W[Fourier Domain Representation: models oscillatory motion 22]
D --> X[Multi-Scale Feature Extraction: captures rendering details 23]
D --> Y[Perceptual Loss: trains visually pleasing rendering 24]
D --> Z[Motion Magnitude as Depth: source pixel weights 25]
F --> AA[Frequency Attention Layers: coordinate frequency predictions 26]
T --> AB[Universal Guidance: incorporates sampling constraints 27]
J --> AC[Explicit Euler Method: simulates modal coordinates 28]
D --> AD[Feature Pyramid: multi-scale image features 29]
B --> AE[Motion Amplification/Minification: adjusts spectral amplitudes 30]
class A,B,I,J future
class C,F,G,H,K,W representation
class D,L,M,X,Y,Z,AD learning
class E,N,O,P,Q,R,S,T,U,V,AA,AB,AC jeff
```


**Resume:**

**1.-** Generative Image Dynamics: A method to model image-space priors on scene motion, learned from real video sequences of natural oscillatory dynamics.

**2.-** Spectral Volume: A frequency-domain representation of dense, long-range pixel trajectories, well-suited for prediction with diffusion models.
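
In code terms, a spectral volume is just a truncated temporal FFT. A minimal NumPy sketch (the function name and the 16-band truncation are illustrative; the paper keeps only a small number of low-frequency terms, where natural oscillations concentrate):

```python
import numpy as np

def spectral_volume(trajectories, num_freqs=16):
    """Truncated temporal FFT of dense motion trajectories.
    trajectories: (T, H, W, 2) x/y displacement of each pixel over
    T frames; returns the first num_freqs complex coefficients."""
    coeffs = np.fft.fft(trajectories, axis=0)  # (T, H, W, 2), complex
    return coeffs[:num_freqs]                  # keep low bands only

# Toy check: a scene swaying at 3 cycles per clip peaks in band 3.
T, H, W = 150, 4, 4
t = np.arange(T).reshape(T, 1, 1, 1)
traj = 0.5 * np.sin(2 * np.pi * 3 * t / T) * np.ones((T, H, W, 2))
print(spectral_volume(traj).shape)  # (16, 4, 4, 2)
```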

**3.-** Image-Based Rendering: A module that synthesizes future video frames by warping features of the input image according to the predicted motion.

**4.-** Latent Diffusion Model (LDM): The backbone for predicting spectral volumes from single images.

**5.-** Frequency-Coordinated Denoising: A strategy to predict spectral volumes across multiple frequency bands while maintaining coherence.

**6.-** Adaptive Frequency Normalization: A technique to normalize spectral volume coefficients across frequencies for stable training and accurate predictions.
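
Fourier amplitudes of natural motion decay quickly with frequency, so a single global scale would leave the higher bands numerically near zero. A sketch of one plausible scheme, assuming per-frequency scale factors (e.g. a high percentile of coefficient magnitudes) precomputed over the training set; the paper's exact statistic and power transform may differ:

```python
import numpy as np

def adaptive_freq_normalize(coeffs, scale_per_freq, power=0.5):
    """Normalize spectral-volume coefficients (K, H, W, 2) per band.
    scale_per_freq: (K,) magnitude statistics from training data."""
    scaled = coeffs / scale_per_freq.reshape(-1, 1, 1, 1)
    # Power transform on the magnitude (phase preserved) to further
    # compress the dynamic range before diffusion training.
    mag, phase = np.abs(scaled), np.angle(scaled)
    return (mag ** power) * np.exp(1j * phase)
```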

**7.-** Motion Texture: A set of long-range, per-pixel motion trajectories derived from spectral volumes.
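
This is the decoding direction of the spectral-volume sketch under item 2: zero-pad the missing high frequencies, restore the conjugate symmetry a real signal requires, and inverse-FFT over time:

```python
import numpy as np

def motion_texture(spectral_vol, num_frames):
    """Invert a truncated spectral volume (K, H, W, 2) into per-pixel
    displacement trajectories (num_frames, H, W, 2)."""
    K, H, W, C = spectral_vol.shape
    full = np.zeros((num_frames, H, W, C), dtype=complex)
    full[:K] = spectral_vol
    if K > 1:  # negative-frequency bins mirror the positive ones
        full[-(K - 1):] = np.conj(spectral_vol[1:][::-1])
    return np.fft.ifft(full, axis=0).real
```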

**8.-** Seamless Looping: A technique to create endlessly looping videos using motion self-guidance during the diffusion sampling process.

**9.-** Interactive Dynamics: The ability to simulate object responses to user-defined forces using predicted spectral volumes.

**10.-** Modal Analysis: A method to interpret spectral volumes as image-space modal bases for simulating dynamics.

**11.-** Eulerian Motion Fields: A representation of scene motion as dense displacement maps for each pixel over time.

**12.-** Softmax Splatting: A technique for forward-warping features during image-based rendering to handle occlusions and multiple source pixels.
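
A simplified NumPy sketch of the idea; the actual operator (Niklaus and Liu's softmax splatting) splats with bilinear kernels on the GPU, whereas nearest-neighbor targets are used here for brevity:

```python
import numpy as np

def softmax_splat(feat, flow, weight):
    """Forward-warp features with softmax-weighted splatting.
    feat: (H, W, C) source features; flow: (H, W, 2) displacements;
    weight: (H, W) importance scores (e.g. motion magnitude as a
    depth proxy, cf. item 25). Where several sources land on one
    target pixel, exp(weight) soft-resolves the occlusion."""
    H, W, C = feat.shape
    out = np.zeros((H, W, C))
    norm = np.full((H, W), 1e-8)
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.round(xs + flow[..., 0]).astype(int)
    ty = np.round(ys + flow[..., 1]).astype(int)
    w = np.exp(weight - weight.max())  # stabilized softmax numerator
    valid = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)
    for y, x in zip(*np.nonzero(valid)):
        out[ty[y, x], tx[y, x]] += w[y, x] * feat[y, x]
        norm[ty[y, x], tx[y, x]] += w[y, x]
    return out / norm[..., None]
```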

**13.-** Fréchet Inception Distance (FID): A metric to evaluate the quality of generated images compared to real images.
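
The metric has a closed form once Gaussians are fitted to Inception features of the real and generated sets:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_r, cov_r, mu_g, cov_g):
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)),
    with (mu, C) the feature mean/covariance of each image set."""
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
```

FVD and DTFVD (items 14-15) apply the same distance to spatiotemporal features from video networks rather than per-image Inception features.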

**14.-** Fréchet Video Distance (FVD): A metric to evaluate the quality and temporal coherence of generated videos.

**15.-** Dynamic Texture Fréchet Video Distance (DTFVD): A metric specifically designed for evaluating natural oscillatory motions in videos.

**16.-** Sliding Window Metrics: Techniques to measure how generated video quality changes over time.

**17.-** Variational Autoencoder (VAE): A component of the LDM that compresses input to a latent space and reconstructs output.

**18.-** U-Net: A neural network architecture used in the diffusion model for iterative denoising.

**19.-** Classifier-Free Guidance: A technique to guide the diffusion sampling process without using a separate classifier model.
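
The mechanism fits in one function; the denoiser signature below is illustrative, and in this paper the condition would be the input image:

```python
def cfg_eps(model, x_t, t, cond, scale=7.5):
    """Classifier-free guidance: combine conditional and unconditional
    noise predictions of one model trained with condition dropout:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```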

**20.-** Motion Self-Guidance: A method to enforce looping constraints during the diffusion sampling process for seamless video generation.
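
A sketch of the kind of constraint involved (not the paper's exact guidance energy): penalize start/end disagreement of the sampled motion, and use the gradient of that loss to steer each denoising step:

```python
import torch

def looping_loss(motion):
    """motion: (T, H, W, 2) sampled motion texture. A seamless loop
    needs the first and last frames to agree in both displacement
    and velocity, so penalize the mismatch of each."""
    pos = (motion[0] - motion[-1]).pow(2).mean()
    vel = ((motion[1] - motion[0])
           - (motion[-1] - motion[-2])).pow(2).mean()
    return pos + vel
```

At each sampling step, the current motion estimate is decoded, this loss is evaluated, and the latent is nudged along its negative gradient before denoising continues (universal guidance, item 27).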

**21.-** Image-Space Modal Basis: An interpretation of spectral volumes for simulating interactive dynamics from single images.

**22.-** Fourier Domain Representation: Modeling motion in the frequency domain to capture oscillatory behaviors efficiently.

**23.-** Multi-Scale Feature Extraction: A technique used in the image-based rendering module to capture details at different scales.

**24.-** Perceptual Loss: A loss function used in training the image-based rendering module to produce visually pleasing results.

**25.-** Motion Magnitude as Depth Proxy: Using predicted flow magnitude to determine the contributing weight of source pixels in rendering.

**26.-** Frequency Attention Layers: Neural network layers used to coordinate predictions across different frequency bands in the spectral volume.
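
One plausible PyTorch realization (the class name and tensor layout are assumptions): treat the K per-frequency latent slices at each spatial location as a short token sequence and let the bands attend to one another:

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Self-attention across the frequency axis of per-band latents
    z: (B, K, C, H, W); C must be divisible by num_heads."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)

    def forward(self, z):
        B, K, C, H, W = z.shape
        tokens = z.permute(0, 3, 4, 1, 2).reshape(B * H * W, K, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, H, W, K, C).permute(0, 3, 4, 1, 2)

print(FrequencyAttention(8)(torch.randn(1, 16, 8, 4, 4)).shape)
```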

**27.-** Universal Guidance: A technique to incorporate additional constraints during the diffusion sampling process.

**28.-** Explicit Euler Method: A numerical method used to simulate the state of modal coordinates in interactive dynamics.
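
A generic damped modal oscillator stepped with explicit Euler (shapes and time step are illustrative; the paper applies this style of integrator to image-space modal coordinates driven by user forces):

```python
import numpy as np

def simulate_modes(omega, zeta, force, dt=1.0 / 30):
    """Integrate q'' = f(t) - 2*zeta*omega*q' - omega^2 * q per mode.
    omega, zeta: (K,) modal frequencies and damping ratios;
    force: (steps, K) driving force; dt must be small relative to
    1/omega for explicit Euler to stay stable."""
    q = np.zeros_like(omega)   # modal displacement
    qd = np.zeros_like(omega)  # modal velocity
    history = np.zeros_like(force)
    for i in range(force.shape[0]):
        qdd = force[i] - 2 * zeta * omega * qd - omega**2 * q
        q = q + dt * qd        # explicit (forward) Euler updates
        qd = qd + dt * qdd
        history[i] = q
    return history
```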

**29.-** Feature Pyramid: A multi-scale representation of image features used in the image-based rendering module.

**30.-** Motion Amplification/Minification: Techniques to adjust the amplitude of predicted spectral volume coefficients for exaggerated or subtle animations.
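
Because the representation is spectral, this reduces to scaling the coefficients before decoding; a minimal sketch:

```python
import numpy as np

def adjust_motion(spectral_vol, gain=2.0):
    """Amplify (gain > 1) or minify (gain < 1) an animation by scaling
    spectral-volume coefficients (K, H, W, 2); phases, and hence the
    timing of the motion, are untouched. gain may also be a (K,)
    array to boost or damp individual frequency bands."""
    return spectral_vol * np.asarray(gain).reshape(-1, 1, 1, 1)
```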

Knowledge Vault built by David Vivancos 2024