Knowledge Vault 6/84 - ICML 2023
Proxy objectives in reinforcement learning from human feedback
John Schulman

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
  classDef fairness fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef optimization fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef RLHF fill:#d4d4f9, font-weight:bold, font-size:14px
  A[Proxy objectives in reinforcement learning from human feedback] --> B[DCCA: pioneered multimodal representation learning. 1]
  A --> C[Hyperparameter Optimization: rigorous algorithmic tuning. 2]
  A --> D[Learning Fair Representations: established fairness subfield. 3]
  D --> E[Key Fairness Notions: individual and group fairness. 4]
  D --> F[Fairness Challenges: subset targeting, similarity metrics. 5]
  D --> G[Optimization Approach: vendor utility, fairness constraints. 6]
  D --> H[NeurIPS Workshop: proposed, rejected for audience size. 7]
  D --> I[Extended Fairness: generalize to new data. 8]
  A --> J[Goals: preserve, generalize, lose demographic info. 9]
  J --> K[Implementation: encoder, decoder, adversarial setup. 10]
  J --> L[Experiments: outperformed prior fairness metrics. 11]
  J --> M[Societal Harms: beyond privacy, emerged workshops. 12]
  J --> N[Improved Representations: MMD, adversarial stability. 13]
  J --> O[Fairness Challenges: philosophical, legal definitions. 14]
  A --> P[Over-optimization: proxy worsens true objective. 15]
  P --> Q[Over-optimization Examples: Cobra effect, Soviet nails. 16]
  P --> R[Proxy Incentives: locally correct, globally overdone. 17]
  P --> S[General Patterns: quadratic true, infinite proxy. 18]
  A --> T[RLHF: human queries train reward model. 19]
  T --> U[Policy Optimization: iterate policy, reward model. 20]
  T --> V[Model-based RL: efficient hyperparameter tuning. 21]
  T --> W[RLHF Issues: repetitive, verbose, refusal. 22]
  T --> X[Simulated RLHF: scalable studies, gold reward model. 23]
  T --> Y[Over-optimization Frontier: best-of-N, PPO RL. 24]
  T --> Z[Larger Models: resistant to over-optimization. 25]
  T --> AA[KL Divergence: grows logarithmically with N. 26]
  A --> AB[Human Feedback: proxy mismatches, unbiased feedback needed. 27]
  AB --> AC[AI Labeling: debates for critiquing models. 28]
  AB --> AD[Mechanism Design: weak human, strong AI system. 29]
  AB --> AE[Frontiers: better feedback, assist labeling, bridge gap. 30]
  class B,C,J,K,L,M,N optimization
  class D,E,F,G,H,I,O fairness
  class P,Q,R,S,T,U,V,W,X,Y,Z,AA RLHF
  class AB,AC,AD,AE RLHF

Resume:

1.- Deep Canonical Correlation Analysis (DCCA) pioneered principled multimodal representation learning with deep neural networks and inspired reconstruction-free self-supervised representation learning.

2.- The Hyperparameter Optimization paper demonstrated rigorous, algorithmic hyperparameter tuning, treating it as a scientific and engineering problem rather than a matter of heuristics.

3.- Learning Fair Representations was an influential early paper that helped establish the subfield of fairness in machine learning.

4.- The paper introduced key notions of fairness: individual fairness (similar individuals receive similar treatment) and group fairness (statistical parity between demographic groups).
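
As a point of reference, one standard formalization of these two notions; the notation here is a sketch, not quoted from the paper (M maps individuals to distributions over outcomes, D is a distance on those distributions, d is a task-specific similarity metric, Ŷ is the decision, and S is group membership):

```latex
% Individual fairness (Lipschitz condition): similar individuals are treated similarly.
D\bigl(M(x), M(y)\bigr) \le d(x, y) \quad \text{for all individuals } x, y

% Group fairness (statistical parity): the decision is independent of group membership.
\Pr\bigl[\hat{Y} = 1 \mid S = 0\bigr] = \Pr\bigl[\hat{Y} = 1 \mid S = 1\bigr]
```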

5.- Challenges included subset targeting (group parity can be satisfied while a targeted subgroup is still mistreated), the need for task-specific similarity metrics, and the assumption of access to a good approximation of ground-truth fairness.

6.- An optimization formulation was used: a vendor utility function is optimized subject to fairness constraints such as Lipschitz continuity on the representations.
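
A sketch of the resulting program at the level of detail given above, assuming L is the vendor's loss and the constraint is the individual-fairness condition from item 4:

```latex
% Minimize expected vendor loss over the mapping M, subject to the Lipschitz
% (individual fairness) constraint.
\min_{M} \; \mathbb{E}_{x}\, \mathbb{E}_{a \sim M(x)} \bigl[ L(x, a) \bigr]
\quad \text{s.t.} \quad
D\bigl(M(x), M(y)\bigr) \le d(x, y) \;\; \forall x, y
```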

7.- In 2012, the authors proposed a NeurIPS workshop on fairness in machine learning that was rejected due to audience size concerns.

8.- The Learning Fair Representations paper extended these ideas, casting fairness as a representation learning problem so that the approach generalizes to new data.

9.- The goals were to preserve task-relevant information, generalize well, and discard demographic information; an information-theoretic formulation expressed these goals with mutual information terms.

10.- The implementation used an encoder to learn representations and decoders to reconstruct the input and predict demographics, trained in an adversarial setup.
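
A minimal PyTorch-style sketch of this encoder/decoder/adversary setup; the dimensions, loss weights, and architecture choices are illustrative assumptions, not the paper's exact model:

```python
# Sketch of an adversarially trained fair representation model (illustrative only).
import torch
import torch.nn as nn

IN_DIM, REP_DIM = 32, 8  # assumed input and representation sizes

encoder = nn.Sequential(nn.Linear(IN_DIM, 64), nn.ReLU(), nn.Linear(64, REP_DIM))
decoder = nn.Sequential(nn.Linear(REP_DIM, 64), nn.ReLU(), nn.Linear(64, IN_DIM))
adversary = nn.Sequential(nn.Linear(REP_DIM, 32), nn.ReLU(), nn.Linear(32, 1))

enc_dec_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
adv_opt = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x, s, adv_weight=1.0):
    """x: batch of inputs, s: binary demographic labels in {0, 1}."""
    # 1) Update the adversary to predict demographics from the (detached) representation.
    z = encoder(x).detach()
    adv_loss = bce(adversary(z).squeeze(-1), s)
    adv_opt.zero_grad(); adv_loss.backward(); adv_opt.step()

    # 2) Update encoder/decoder: reconstruct the input while fooling the adversary,
    #    i.e. losing demographic information from the representation.
    z = encoder(x)
    recon_loss = ((decoder(z) - x) ** 2).mean()
    fool_loss = -bce(adversary(z).squeeze(-1), s)  # maximize the adversary's error
    loss = recon_loss + adv_weight * fool_loss
    enc_dec_opt.zero_grad(); loss.backward(); enc_dec_opt.step()
    return recon_loss.item(), adv_loss.item()

# Toy usage with random data.
x = torch.randn(128, IN_DIM)
s = torch.randint(0, 2, (128,)).float()
print(train_step(x, s))
```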

11.- Experiments showed the method outperformed prior approaches on individual fairness metrics. Open problems included richer representations and refining fairness objectives.

12.- By 2014, attention to the societal harms of ML had broadened beyond privacy, and workshops and conferences dedicated to fairness, accountability, and transparency emerged.

13.- The authors continued improving the representation learning approach, matching group distributions with maximum mean discrepancy (MMD) objectives and improving the stability of adversarial training.
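
A small sketch of the MMD idea mentioned here: an RBF-kernel estimate of the squared MMD between the representations of two groups, which can be added as a penalty so the two representation distributions match. The bandwidth and shapes are illustrative assumptions:

```python
# Biased RBF-kernel estimator of squared MMD between two sets of representations.
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise squared distances between rows of a and rows of b.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(z0, z1, bandwidth=1.0):
    """Squared MMD between representation samples z0 (group 0) and z1 (group 1)."""
    k00 = rbf_kernel(z0, z0, bandwidth).mean()
    k11 = rbf_kernel(z1, z1, bandwidth).mean()
    k01 = rbf_kernel(z0, z1, bandwidth).mean()
    return k00 + k11 - 2 * k01

# Toy usage: representations of two groups with a small distribution shift.
z0 = torch.randn(100, 8)
z1 = torch.randn(120, 8) + 0.5
print(mmd2(z0, z1).item())
```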

14.- Key challenges remain in defining fairness mathematically in ways that match philosophical and legal notions, and ML systems are now widely deployed in settings where fairness concerns arise.

15.- Over-optimization occurs when optimizing a proxy objective too far makes the true objective worse; it happens when the assumptions linking the proxy to the true objective break down.

16.- Societal examples of over-optimization include the cobra effect (paying for dead cobras led to cobra breeding) and Soviet nail factories making a few giant nails to meet weight-based quotas.

17.- One interpretation is that proxy incentives are locally correct but get optimized too far globally, with no regularization to rule out "weird" behaviors.

18.- General patterns: the true objective is approximately quadratic near the optimum while the proxy keeps growing without bound, and the proxy is poorly estimated in low-data regions.
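
A toy instance of this pattern (the functional forms are illustrative, not from the talk): the true objective is quadratic around its optimum x*, while the proxy is a local linearization that keeps rewarding larger x without bound:

```latex
R_{\text{true}}(x) = -(x - x^{*})^{2},
\qquad
R_{\text{proxy}}(x) = R_{\text{true}}(x_{0}) + R_{\text{true}}'(x_{0})\,(x - x_{0})
```

Following the proxy's gradient is correct near x_0, but maximizing the proxy without constraint drives x off to infinity and sends the true objective to minus infinity.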

19.- In reinforcement learning from human feedback (RLHF), queries to a human labeler produce preference data that is used to train a reward model.

20.- The policy is then optimized against the reward model, iterating between optimizing the policy and updating the reward model.
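
A minimal sketch of the two objectives behind this loop, assuming a pairwise-comparison reward model and a KL penalty toward the original policy; the function names, shapes, and penalty coefficient are illustrative, not the exact production setup:

```python
# Sketch of the two objectives in the RLHF loop (not a full training loop).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: the reward model should score the
    human-preferred completion above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def shaped_reward(proxy_reward, logp_policy, logp_ref, kl_coef=0.1):
    """Reward used when optimizing the policy against the reward model:
    proxy reward minus a KL penalty toward the original policy, which
    limits how far optimization can push away from it."""
    return proxy_reward - kl_coef * (logp_policy - logp_ref)

# Toy usage with random scores and log-probabilities.
r_chosen, r_rejected = torch.randn(16), torch.randn(16)
print(reward_model_loss(r_chosen, r_rejected).item())

logp_policy, logp_ref = torch.randn(16), torch.randn(16)
print(shaped_reward(torch.randn(16), logp_policy, logp_ref).mean().item())
```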

21.- Model-based RL enables hyperparameter tuning without re-collecting human data, providing a sample-efficiency boost over model-free approaches.

22.- Over-optimization in RLHF causes issues like repetitive phrases, excessive verbosity and hedging, and refusing reasonable requests.

23.- Simulated RLHF setups enable scalable studies of over-optimization phenomena: a gold reward model trained on the full data acts as the simulated human labeler.

24.- The over-optimization frontier is measured for both best-of-N sampling and PPO-based RL, plotting reward against KL divergence from the original policy.
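
An illustrative sketch of the best-of-N side of this measurement; `sample`, `proxy_rm`, and `gold_rm` are hypothetical stand-ins for the policy, the learned proxy reward model, and the gold reward model, and the KL value is the analytic best-of-N expression discussed in item 26:

```python
# Sketch: pick the sample the proxy reward model likes best, then score it with
# the gold reward model to see how much true quality the optimization bought.
import math
import random

def best_of_n(prompt, n, sample, proxy_rm, gold_rm):
    candidates = [sample(prompt) for _ in range(n)]
    best = max(candidates, key=proxy_rm)   # optimize the proxy reward
    kl = math.log(n) - (n - 1) / n         # analytic KL from the base policy
    return gold_rm(best), kl               # true quality vs. optimization pressure

# Toy usage with stand-in functions: each "completion" is its latent true quality,
# and the proxy reward is a noisy view of it.
def sample(prompt):
    return random.gauss(0.0, 1.0)

proxy_rm = lambda y: y + random.gauss(0.0, 1.0)  # noisy proxy reward
gold_rm = lambda y: y                            # gold reward = true quality

for n in (1, 4, 16, 64):
    gold, kl = best_of_n("prompt", n, sample, proxy_rm, gold_rm)
    print(n, round(gold, 2), round(kl, 2))
```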

25.- Larger reward models are more resistant to over-optimization, and the functional form of the over-optimization frontier differs between best-of-N and RL.
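
One reported parameterization of these frontiers, in terms of d = √KL with fitted coefficients α and β that depend on reward model size (quoted approximately from the associated scaling-law study of reward model over-optimization):

```latex
R_{\text{BoN}}(d) = d\,\bigl(\alpha_{\text{BoN}} - \beta_{\text{BoN}}\, d\bigr),
\qquad
R_{\text{RL}}(d) = d\,\bigl(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d\bigr)
```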

26.- The KL divergence between the best-of-N policy and the original policy can be computed analytically and grows logarithmically with N; RL is less KL-efficient, requiring more divergence for the same amount of optimization.
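
The analytic expression for best-of-N (π₀ is the original policy, π_BoN the best-of-N policy):

```latex
D_{\mathrm{KL}}\bigl(\pi_{\text{BoN}} \,\|\, \pi_{0}\bigr) = \log N - \frac{N-1}{N} \;\approx\; \log N \quad \text{for large } N
```

So the optimization pressure applied by best-of-N grows only logarithmically with the number of samples N.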

27.- Human feedback and engagement metrics are proxies that can mismatch the actual objectives; obtaining unbiased, high-quality feedback remains an open problem.

28.- One approach is to assist human labelers with AI in order to amplify their abilities, for example via debate, where one model critiques another.

29.- Mechanism design aims to let a weak human incentivize a strong AI system to be helpful, even when the task is too hard for the human to evaluate directly.

30.- Major frontiers are improving current systems with better feedback, new techniques to assist labeling, and bridging the objective-proxy gap.

Knowledge Vault built by David Vivancos 2024