Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

Will Grathwohl · Kevin Swersky · Milad Hashemi · David Duvenaud · Chris Maddison

**Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
  classDef models fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef sampling fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef optimization fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef applications fill:#f9f9d4, font-weight:bold, font-size:14px
  A[Oops I Took A Gradient: Scalable Sampling for Discrete Distributions] --> B[Energy-based Models]
  A --> C[Sampling Methods]
  A --> D[Optimization Techniques]
  A --> E[Applications and Performance]
  B --> B1[Parameterize probability distributions flexibly. 1]
  B --> B2[Estimate with model samples. 2]
  B --> B3[Text, tabular, proteins, molecular graphs. 4]
  B --> B4[Use deep networks for energy functions. 21]
  B --> B5[Applying to discrete data. 22]
  B --> B6[Discrete distributions as continuous. 13]
  C --> C1[Simple method for discrete sampling. 5]
  C --> C2[High rejection rates. 6]
  C --> C3[Efficient whole-input updates. 7]
  C --> C4[Accept or reject updates. 8]
  C --> C5[New sampler using gradients. 15]
  C --> C6[Faster convergence with gradients. 18]
  D --> D1[High likelihood and entropy balance. 9]
  D --> D2[Controls likelihood-entropy trade-off. 10]
  D --> D3[Achieved at temperature 2. 11]
  D --> D4[Naive optimal proposal evaluation. 12]
  D --> D5[Efficient likelihood estimation. 14]
  D --> D6[O1 evaluations per update. 16]
  E --> E1[Efficient realistic sampling. 17]
  E --> E2[Outperforms pseudo-likelihood and Gibbs. 20]
  E --> E3[Outperforms VAEs and classical models. 24]
  E --> E4[Generates high-quality samples. 25]
  E --> E5[Applies to high-dimensional discrete data. 26]
  E --> E6[Various discrete distributions and models. 27]
  class A,B,B1,B2,B3,B4,B5,B6 models
  class C,C1,C2,C3,C4,C5,C6 sampling
  class D,D1,D2,D3,D4,D5,D6 optimization
  class E,E1,E2,E3,E4,E5,E6 applications
```


**Resume:**

**1.-** Energy-based models: Parameterize probability distributions using an energy function, offering flexibility in model design.

**2.-** Log likelihood gradient: Can be estimated using samples from the model, enabling training of energy-based models.
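
For reference, this rests on the standard maximum-likelihood identity for an unnormalized model $p_\theta(x) = e^{f_\theta(x)} / Z(\theta)$, where the intractable term is an expectation over samples from the model itself:

$$\nabla_\theta \log p_\theta(x) \;=\; \nabla_\theta f_\theta(x) \;-\; \mathbb{E}_{x' \sim p_\theta}\big[\nabla_\theta f_\theta(x')\big]$$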

**3.-** Continuous vs. discrete data: Gradient-based sampling methods work well for continuous data, but are challenging for discrete data.

**4.-** Importance of discrete data: Many data types like text, tabular data, proteins, and molecular graphs are discrete.

**5.-** Gibbs sampling: A simple method for sampling discrete distributions by iteratively updating individual dimensions.
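
As a concrete illustration (not the paper's code), a single-site Gibbs sweep over a binary vector can be written as below; `log_prob` stands in for any unnormalized log-probability $f(x)$:

```python
import numpy as np

def gibbs_sweep(x, log_prob, rng=None):
    """One full sweep of single-site Gibbs sampling over a binary vector x.

    log_prob(x) returns an unnormalized log-probability f(x); only the
    difference f(x_flipped) - f(x) matters, so the normalizer cancels.
    """
    rng = rng or np.random.default_rng()
    x = x.copy()
    for i in range(len(x)):
        x_flip = x.copy()
        x_flip[i] = 1 - x_flip[i]
        # P(dimension i takes the flipped value | all other dimensions)
        p_flip = 1.0 / (1.0 + np.exp(log_prob(x) - log_prob(x_flip)))
        if rng.random() < p_flip:
            x = x_flip
    return x
```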

**6.-** Inefficiency of Gibbs sampling: Most single-dimension updates leave the state unchanged, so much of the computation is wasted.

**7.-** Dimension-wise proposal distribution: A more efficient approach that uses the entire current input to propose which dimension to update.

**8.-** Metropolis-Hastings acceptance probability: Used to accept or reject proposed updates in MCMC sampling.
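
For completeness, the acceptance rule in question is the standard Metropolis-Hastings correction, which needs only unnormalized probabilities because the normalizer cancels in the ratio:

$$\alpha(x \to x') \;=\; \min\!\left(1,\; \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right)$$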

**9.-** Optimal proposal distribution: Balances high likelihood of proposed samples with high entropy of the proposal distribution.

**10.-** Temperature parameter: Controls the trade-off between likelihood and entropy in the proposal distribution.

**11.-** Near-optimal proposal: Achieved when the temperature is set to 2, simplifying the acceptance probability.
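
Points 9-11 describe locally informed proposals of the form below, where $x'$ ranges over the single-flip (Hamming-distance-1) neighbors of $x$ and $\tau$ trades likelihood against entropy:

$$q_\tau(x' \mid x) \;\propto\; \exp\!\left(\frac{f(x') - f(x)}{\tau}\right)$$

Setting $\tau = 2$ weights each neighbor by $\sqrt{p(x')/p(x)}$, the locally balanced choice that simplifies the acceptance probability mentioned in point 11.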

**12.-** Computational challenge: Naive implementation of the optimal proposal requires evaluating all possible dimension flips.

**13.-** Continuous differentiable functions: Many discrete distributions can be expressed as continuous functions restricted to discrete inputs.
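
For example, an Ising-style energy $f(x) = x^\top W x + b^\top x$ is a polynomial in $x$, so it is defined and differentiable for all real-valued inputs even though the model only ever evaluates it on $\{0, 1\}^D$.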

**14.-** Taylor series approximation: Used to efficiently estimate likelihood differences for all dimensions.
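
A minimal sketch of this estimate, assuming PyTorch and binary data (the function name is illustrative, not from the paper's code): flipping bit $i$ changes $x_i$ by $(1 - 2x_i)$, so a single gradient evaluation yields the first-order estimate $(1 - 2x_i)\,\partial f(x)/\partial x_i$ for every dimension at once.

```python
import torch

def flip_differences_approx(x, f):
    """First-order estimate of f(x with bit i flipped) - f(x), for every i.

    x: binary tensor of shape (D,) with entries in {0, 1}.
    f: differentiable function mapping a float (D,) tensor to a scalar.
    One gradient evaluation gives (1 - 2*x_i) * df/dx_i for all D dimensions.
    """
    xg = x.detach().clone().float().requires_grad_(True)
    grad = torch.autograd.grad(f(xg), xg)[0]
    return (1 - 2 * x.float()) * grad
```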

**15.-** Gibbs with gradients: A new MCMC sampler that approximates the optimal proposal using gradient information.

**16.-** Efficiency: Gibbs with gradients requires only O(1) function evaluations (plus one gradient) per update, whereas naively evaluating the optimal proposal requires one evaluation per dimension.
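
Putting points 14-16 together, one update of a Gibbs-with-Gradients style sampler might look like the following sketch, under the same PyTorch and binary-data assumptions as above (a schematic, not the authors' reference implementation):

```python
import torch

def gwg_step(x, f):
    """One Gibbs-with-Gradients style update for a binary vector x.

    Proposes a single bit flip from a softmax over gradient-based estimates of
    the flip differences, then applies a Metropolis-Hastings correction.
    """
    def flip_diffs(z):
        # First-order estimate of f(z with bit i flipped) - f(z) for all i.
        zg = z.detach().clone().float().requires_grad_(True)
        grad = torch.autograd.grad(f(zg), zg)[0]
        return (1 - 2 * z.float()) * grad

    q_fwd = torch.softmax(flip_diffs(x) / 2.0, dim=0)      # temperature-2 proposal
    i = torch.multinomial(q_fwd, 1).item()

    x_new = x.clone()
    x_new[i] = 1 - x_new[i]                                 # flip the chosen bit

    q_rev = torch.softmax(flip_diffs(x_new) / 2.0, dim=0)   # reverse proposal
    log_alpha = (f(x_new.float()) - f(x.float())
                 + q_rev[i].log() - q_fwd[i].log())
    if torch.rand(1).log() < log_alpha:
        return x_new                                        # accept the flip
    return x                                                # reject, keep x
```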

**17.-** RBM sampling experiment: Gibbs with gradients produces realistic samples more efficiently than Gibbs sampling.

**18.-** Image denoising with Ising models: Gibbs with gradients converges faster to reasonable solutions than Gibbs sampling.

**19.-** Protein contact prediction: An important task in computational biology using Potts models.

**20.-** Potts model training: Gibbs with gradients outperforms pseudo-likelihood maximization and Gibbs sampling, especially for large proteins.

**21.-** Deep energy-based models: Recent success in using deep neural networks to parameterize energy functions.

**22.-** Discrete deep energy-based models: Applying deep energy-based models to discrete data, which was previously challenging.

**23.-** Persistent contrastive divergence: A training method for energy-based models, adapted for discrete data using Gibbs with gradients.
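
Schematically, and with hypothetical names rather than the paper's code, one such training step interleaves a few sampler transitions on a persistent set of chains with a gradient step on the parameters:

```python
import torch

def pcd_step(f_net, optimizer, data_batch, chains, sampler_step, n_mcmc=10):
    """One persistent-contrastive-divergence update (schematic sketch).

    f_net(x) returns f_theta(x), the unnormalized log-probability of each row.
    chains holds model samples that persist across training steps; sampler_step
    is any MCMC kernel, e.g. a batched Gibbs-with-Gradients update.
    """
    # Advance the persistent chains with a few MCMC transitions.
    for _ in range(n_mcmc):
        chains = sampler_step(chains, f_net)

    # Maximum-likelihood surrogate: raise f on data, lower it on model samples.
    loss = f_net(chains).mean() - f_net(data_batch).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return chains.detach()
```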

**24.-** Performance comparison: Deep energy-based models trained with Gibbs with gradients outperform VAEs and classical energy-based models.

**25.-** Annealed MCMC: Used to generate high-quality samples from trained energy-based models.

**26.-** Scalability: Gibbs with gradients enables application of energy-based models to high-dimensional discrete data.

**27.-** Versatility: The method can be applied to various types of discrete distributions and energy-based models.

**28.-** Implementation simplicity: Gibbs with gradients is easy to implement in standard machine learning frameworks.

**29.-** Broader impact: Enables energy-based models to be applied to a wider range of data types and problems.

**30.-** Future work: Potential applications in text modeling, structure inference, and other discrete data domains.

Knowledge Vault built by David Vivancos 2024