Proxy objectives in reinforcement learning from human feedback

John Schulman

**Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
classDef fairness fill:#d4f9d4, font-weight:bold, font-size:14px
classDef optimization fill:#f9d4d4, font-weight:bold, font-size:14px
classDef RLHF fill:#d4d4f9, font-weight:bold, font-size:14px
A[Proxy objectives in<br>reinforcement learning from<br>human feedback] --> B[DCCA:<br>pioneered multimodal<br>representation learning. 1]
A --> C[Hyperparameter Optimization:<br>rigorous algorithmic tuning. 2]
A --> D[Learning Fair Representations:<br>established fairness subfield. 3]
D --> E[Key Fairness Notions:<br>individual and group fairness. 4]
D --> F[Fairness Challenges:<br>subset targeting,<br>similarity metrics. 5]
D --> G[Optimization Approach:<br>vendor utility,<br>fairness constraints. 6]
D --> H[NeurIPS Workshop:<br>proposed, rejected<br>for audience size. 7]
D --> I[Extended Fairness:<br>generalize to new data. 8]
A --> J[Goals:<br>preserve, generalize,<br>lose demographic info. 9]
J --> K[Implementation:<br>encoder, decoder,<br>adversarial setup. 10]
J --> L[Experiments:<br>outperformed prior<br>fairness metrics. 11]
J --> M[Societal Harms:<br>beyond privacy,<br>emerged workshops. 12]
J --> N[Improved Representations:<br>MMD, adversarial stability. 13]
J --> O[Fairness Challenges:<br>philosophical,<br>legal definitions. 14]
A --> P[Over-optimization:<br>proxy worsens<br>true objective. 15]
P --> Q[Over-optimization Examples:<br>Cobra effect,<br>Soviet nails. 16]
P --> R[Proxy Incentives:<br>locally correct,<br>globally overdone. 17]
P --> S[General Patterns:<br>quadratic true,<br>infinite proxy. 18]
A --> T[RLHF:<br>human queries<br>train reward model. 19]
T --> U[Policy Optimization:<br>iterate policy,<br>reward model. 20]
T --> V[Model-based RL:<br>efficient hyperparameter tuning. 21]
T --> W[RLHF Issues:<br>repetitive,<br>verbose, refusal. 22]
T --> X[Simulated RLHF:<br>scalable studies,<br>gold reward model. 23]
T --> Y[Over-optimization Frontier:<br>best-of-N,<br>PPO RL. 24]
T --> Z[Larger Models:<br>resistant to<br>over-optimization. 25]
T --> AA[KL Divergence:<br>grows logarithmically<br>with N. 26]
A --> AB[Human Feedback:<br>proxy mismatches,<br>unbiased feedback needed. 27]
AB --> AC[AI Labeling:<br>debates for<br>critiquing models. 28]
AB --> AD[Mechanism Design:<br>weak human,<br>strong AI system. 29]
AB --> AE[Frontiers:<br>better feedback,<br>assist labeling,<br>bridge gap. 30]
class B,C,J,K,L,M,N optimization
class D,E,F,G,H,I,O fairness
class P,Q,R,S,T,U,V,W,X,Y,Z,AA RLHF
class AB,AC,AD,AE RLHF
```


**Resume:**

**1.-** Deep Canonical Correlation Analysis (DCCA) pioneered principled multimodal representation learning with deep neural networks and inspired reconstruction-free self-supervised representation learning.

**2.-** The Hyperparameter Optimization paper demonstrated rigorous, algorithmic hyperparameter tuning, treating it as a scientific and engineering problem rather than a heuristic one.

**3.-** Learning Fair Representations was an influential early paper that helped establish the subfield of fairness in machine learning.

**4.-** The paper introduced key notions of fairness: individual fairness (similar treatment for similar individuals) and group fairness (equality between groups).

**5.-** Challenges included subset targeting, needing task-specific similarity metrics, and assuming access to a good approximation of ground truth fairness.

**6.-** An optimization approach was used with a vendor utility function and fairness constraints like Lipschitz continuity on representations.
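
For concreteness, here is a sketch of that kind of program in the spirit of the individual-fairness literature (the notation is an assumption added for illustration, not copied from the paper): a randomized mapping $M$ from individuals to outcomes minimizes the vendor's expected loss, subject to a Lipschitz constraint tying the statistical distance between output distributions to a task-specific similarity metric $d$:

$$
\min_{M}\; \mathbb{E}_{x}\big[\,L\big(x, M(x)\big)\big]
\quad \text{subject to} \quad
D\big(M(x),\, M(y)\big) \le d(x, y) \;\; \forall\, x, y
$$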

**7.-** In 2012, the authors proposed a NeurIPS workshop on fairness in machine learning that was rejected due to audience size concerns.

**8.-** The Learning Fair Representations paper extended these ideas, casting fairness as a representation learning problem so that it generalizes to new data.

**9.-** The goals were to preserve information, generalize well, and lose demographic information; an information-theoretic approach expressed these goals as mutual information terms.
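
One plausible form of such an objective (the exact terms and weights here are illustrative assumptions): keep the representation $Z$ informative about the input $X$ while penalizing the information it retains about the sensitive attribute $S$,

$$
\max_{\theta}\; I(Z; X) \;-\; \beta\, I(Z; S), \qquad Z \sim q_{\theta}(z \mid x).
$$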

**10.-** Implementation involved an encoder to learn representations and decoders to reconstruct inputs and predict demographics, in an adversarial setup.
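
A minimal runnable sketch of that setup in PyTorch (the architectures, sizes, random stand-in data, and the 1.0 tradeoff weight are all illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))  # x -> z
dec = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 32))  # z -> x_hat
adv = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # z -> demographic logit

opt_main = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(64, 32)                    # toy inputs
s = torch.randint(0, 2, (64, 1)).float()   # binary demographic attribute

for step in range(100):
    # 1) Adversary learns to predict the demographic from the representation.
    opt_adv.zero_grad()
    bce(adv(enc(x).detach()), s).backward()
    opt_adv.step()

    # 2) Encoder/decoder reconstruct x while fooling the adversary,
    #    so z sheds demographic information.
    z = enc(x)
    recon = ((dec(z) - x) ** 2).mean()
    loss = recon - 1.0 * bce(adv(z), s)    # 1.0 = assumed tradeoff weight
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
```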

**11.-** Experiments showed the method outperformed prior approaches on individual fairness metrics. Open problems included richer representations and refining fairness objectives.

**12.-** By 2014 there was growing attention to societal harms of ML beyond just privacy, and workshops and conferences dedicated to fairness, accountability, and transparency emerged.

**13.-** The authors continued improving the representation learning approach, matching distributions with MMD objectives and improving the stability of adversarial training.
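
For reference, a sketch of an RBF-kernel MMD penalty of the kind such objectives use (the bandwidth and the usage pattern in the comment are assumptions):

```python
import torch

def rbf_mmd2(z_a: torch.Tensor, z_b: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between two samples of representations."""
    def k(u, v):
        return torch.exp(-torch.cdist(u, v) ** 2 / (2 * bandwidth ** 2))
    return k(z_a, z_a).mean() + k(z_b, z_b).mean() - 2 * k(z_a, z_b).mean()

# Adding rbf_mmd2(z[s == 0], z[s == 1]) to the training loss pushes the two
# groups' representations toward the same distribution without an adversary.
```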

**14.-** Key challenges remain in defining fairness mathematically so that it matches philosophical and legal notions; ML systems are now widely deployed, making these fairness concerns concrete.

**15.-** Over-optimization is when optimizing a proxy objective too far makes the true objective worse. It happens when the assumptions under which the proxy tracks the true objective break down.

**16.-** In society, over-optimization examples include the Cobra effect (paying for dead cobras caused cobra breeding) and Soviet nail factories making a few giant nails to meet weight-based quotas.

**17.-** One interpretation is that proxy incentives are locally correct but get optimized too far globally, with no regularization to rule out "weird" behaviors.

**18.-** General patterns: the true objective is approximately quadratic near its optimum while the proxy increases without bound, and proxy estimates are poor in low-data regions.
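
A toy numerical illustration of that pattern (the functions are chosen purely for illustration): near the starting point the proxy's gradient agrees with the true objective, but following the proxy far enough drives the true objective back down.

```python
import numpy as np

true_obj = lambda x: -(x - 1.0) ** 2   # true objective: peaks at x = 1
proxy = lambda x: x                    # proxy: right direction at x = 0, unbounded

x = 0.0
for step in range(200):
    x += 0.05  # ascend the proxy
    if step in (10, 40, 199):
        print(f"x={x:.2f}  proxy={proxy(x):+.2f}  true={true_obj(x):+.2f}")
# x=0.55: true objective improving; x=2.05: past the peak; x=10.00: far worse.
```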

**19.-** In reinforcement learning from human feedback (RLHF), queries to a human labeler produce comparison data used to train a reward model.
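
A minimal sketch of the standard pairwise loss for such a reward model (a Bradley-Terry-style objective; the linear scorer and random features below are stand-ins for a language-model-based reward model):

```python
import torch
import torch.nn.functional as F

reward_model = torch.nn.Linear(128, 1)   # stand-in for an LM-based scorer
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

chosen = torch.randn(16, 128)     # features of human-preferred responses
rejected = torch.randn(16, 128)   # features of dispreferred responses

# -log sigmoid(r_chosen - r_rejected): the preferred response should score higher.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```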

**20.-** The policy is then optimized against the reward model, and the process iterates between optimizing the policy and updating the reward model.

**21.-** Model-based RL enables hyperparameter tuning without re-collecting human data, providing a sample-efficiency boost over model-free approaches.

**22.-** Over-optimization in RLHF causes issues like repetitive phrases, excessive verbosity and hedging, and refusing reasonable requests.

**23.-** Simulated RLHF setups enable scalable studies of over-optimization phenomena: a gold reward model trained on the full data acts as the simulated human.
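
A self-contained toy of that synthetic setup (every detail below is an assumption for illustration): a fixed "gold" scorer labels preference pairs in place of a human, and a small proxy reward model is fit to a limited subset of those labels, mimicking scarce human data.

```python
import numpy as np

rng = np.random.default_rng(0)
gold_w = rng.normal(size=8)
gold = lambda x: x @ gold_w                       # gold reward model = simulated human

pairs = rng.normal(size=(200, 2, 8))              # candidate response pairs (as features)
labels = gold(pairs[:, 0]) > gold(pairs[:, 1])    # gold labels every pair

# Fit a proxy reward model on only 30 labeled pairs (logistic regression
# on feature differences, trained by plain gradient ascent).
diffs, y = pairs[:30, 0] - pairs[:30, 1], labels[:30].astype(float)
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-diffs @ w))
    w += 0.1 * diffs.T @ (y - p) / len(y)

held = labels[30:] == ((pairs[30:, 0] - pairs[30:, 1]) @ w > 0)
print(f"proxy agreement with gold on held-out pairs: {held.mean():.0%}")
```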

**24.-** The over-optimization frontier was measured for both "best-of-N" sampling and PPO RL, plotting reward against KL divergence from the original policy.
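
A toy version of the best-of-N half of that measurement (the reward functions and sampling distribution are assumptions chosen to make the effect visible): the proxy score of the selected sample keeps climbing with N while the gold score peaks and then falls.

```python
import numpy as np

rng = np.random.default_rng(0)
proxy = lambda x: x                 # learned proxy: strictly prefers larger x
gold = lambda x: x - 0.5 * x ** 2   # "true" reward: peaks at x = 1

for n in (1, 4, 16, 64, 256, 4096):
    # Best-of-N: draw N samples from the base policy, keep the proxy's favorite.
    picks = np.array([max(rng.normal(size=n), key=proxy) for _ in range(2000)])
    print(f"N={n:5d}  proxy={proxy(picks).mean():+.2f}  gold={gold(picks).mean():+.2f}")
```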

**25.-** Larger reward models are more resistant to over-optimization, and the functional form of the over-optimization frontier differs between best-of-N and RL.
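
The functional forms reported in the reward-model over-optimization scaling-law work (the attribution to Gao, Schulman, and Hilton, 2022 is an addition here; $d$ is the square root of the KL from the initial policy, and the $\alpha$, $\beta$ coefficients depend on reward-model size):

$$
R_{\mathrm{bon}}(d) = d\,(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}}\, d),
\qquad
R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d).
$$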

**26.-** The KL between best-of-N and the original policy can be computed analytically and grows logarithmically with N; RL spends its KL budget less efficiently.
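
The analytic expression in question is a standard result for best-of-$n$ sampling:

$$
\mathrm{KL}\big(\pi_{\mathrm{bo}n} \,\|\, \pi_{\mathrm{init}}\big) = \log n - \frac{n-1}{n},
$$

which grows like $\log n$, so best-of-$n$ moves away from the initial policy slowly in KL terms.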

**27.-** Human feedback and engagement metrics are proxies that can mismatch the actual objectives; getting unbiased, high-quality feedback is an open problem.

**28.-** One approach is to assist human labeling with AI to amplify labelers' abilities, such as debates in which one model critiques another.

**29.-** Mechanism design aims for a weak human to incentivize strong AI systems to be helpful, even if the task is too hard for the human to evaluate directly.

**30.-** Major frontiers are improving current systems with better feedback, new techniques to assist labeling, and bridging the objective-proxy gap.

Knowledge Vault built by David Vivancos 2024