Knowledge Vault 6/98 - ICML 2024
Preference Learning
Dylan Hadfield-Menell
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
  classDef preference fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef safety fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef learning fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef feedback fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef uncertainty fill:#f9d4f9, font-weight:bold, font-size:14px
  Main[Preference Learning] --> P[Preference Systems]
  Main --> S[Safety & Protection]
  Main --> L[Learning Dynamics]
  Main --> F[Feedback Mechanisms]
  Main --> U[Uncertainty Management]
  P --> P1[Feedback details shape preference learning 1]
  P --> P2[Human-robot goal alignment needs work 2]
  P --> P3[RLHF captures preferences through data 12]
  P --> P4[Borda voting underlies preference choices 14]
  P --> P5[Social choice theory guides alignment 29]
  S --> S1[Guardrails prevent prompt engineering misuse 6]
  S --> S2[Chatbots filter unwanted requests 7]
  S --> S3[Preference inference builds resistance 8]
  S --> S4[Trading performance gains safety 28]
  S --> S5[Risk-averse methods aggregate preferences 30]
  L --> L1[Context shapes AI reward design 3]
  L --> L2[Human preferences evolve learning 17]
  L --> L3[Robots adapt to human growth 18]
  L --> L4[Win-lose patterns show rewards 19]
  L --> L5[Team learning needs bounds 20]
  F --> F1[Bayesian inference improves rewards 4]
  F --> F2[Hidden factors affect preferences 13]
  F --> F3[Teaching methods risk errors 22]
  F --> F4[Feedback changes with expectations 23]
  F --> F5[Flexible horizons match results 24]
  U --> U1[Goal uncertainty affects specs 5]
  U --> U2[Missing features change behavior 9]
  U --> U3[Limited views cause overconfidence 10]
  U --> U4[Simple rewards misalign goals 11]
  U --> U5[Distribution learning manages uncertainty 15]
  U5 --> U6[Uncertainty prevents system breaks 16]
  U5 --> U7[Model sensitivity follows density 25]
  U4 --> U8[Uncertainty awareness strengthens alignment 26]
  U4 --> U9[Context management never ends 27]
  class Main,P,P1,P2,P3,P4,P5 preference
  class S,S1,S2,S3,S4,S5 safety
  class L,L1,L2,L3,L4,L5 learning
  class F,F1,F2,F3,F4,F5 feedback
  class U,U1,U2,U3,U4,U5,U6,U7,U8,U9 uncertainty

Resume:

1.- Importance of modeling feedback details in preference learning systems

2.- Robots maximize objectives specified by humans, which can lead to goal misalignment

3.- Context matters when designing AI reward functions

4.- Inverse reward design uses Bayesian inference for better generalization (a toy Bayesian update is sketched after this list)

5.- Incompleteness is a fundamental source of uncertainty in goal specification

6.- Automatic guardrails via inverse prompt engineering prevent misuse

7.- A travel-assistant chatbot example demonstrates filtering of jailbreak attempts

8.- Self-attack evaluations show increased robustness through preference inference

9.- Missing features affect robot obedience and decision-making

10.- Overconfidence occurs when robots reason over a restricted set of world features

11.- Proxy rewards with fewer features lead to utility misalignment

12.- RLHF attempts to capture subjective preferences through data (a minimal reward-model sketch follows this list)

13.- Hidden context affects preference data collection

14.- Borda count voting mechanism underlies RLHF preference aggregation (Borda scoring is sketched after this list)

15.- Distributional preference learning helps manage uncertainty (see the risk-averse scoring sketch after this list)

16.- Jailbreak robustness improves with uncertainty modeling

17.- Learning affects human preferences over time

18.- Robot assistance must account for human learning process

19.- Win-stay-lose-shift strategy reveals previous reward information (see the bandit sketch after this list)

20.- Mutual information bounds team performance in learning

21.- Information-dense preference communication increases brittleness

22.- Pedagogical approaches are more sensitive to errors

23.- Teaching feedback varies based on expected time horizon

24.- Uncertainty about horizons can match known-horizon performance

25.- Information density correlates with model sensitivity

26.- Uncertainty-aware preference learning improves alignment robustness

27.- Unmodeled context requires ongoing management

28.- Information-revealing policies trade optimal performance for robustness

29.- Social choice theory connects to AI alignment

30.- Preference aggregation methods may build in risk aversion
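
The sketches below are illustrative only; all environments, names, and numbers are invented and do not come from the talk. First, a minimal, simplified version of the Bayesian update behind inverse reward design (point 4): the designer's proxy reward is treated as evidence about the true reward, scored by how likely a noisily rational designer using each candidate true reward would be to endorse the proxy-optimal behavior in the training environment.

# Toy inverse reward design (point 4): the proxy reward is evidence about the
# true reward. Trajectories, weights, and the mud example are all made up.
import numpy as np

# Feature counts for three trajectories in a training environment:
# columns = [reach goal, cross grass, drive through mud]
trajectories = np.array([
    [1.0, 0.0, 0.0],   # safe route
    [1.0, 1.0, 0.0],   # cuts across grass
    [1.0, 0.0, 1.0],   # cuts through mud (never considered by the designer)
])

proxy_w = np.array([1.0, -0.5, 0.0])        # designer forgot to penalize mud
candidate_true_w = np.array([
    [1.0, -0.5,  0.0],                      # mud is fine
    [1.0, -0.5, -1.0],                      # mud is bad
    [1.0, -0.5, -2.0],                      # mud is awful
])

beta = 5.0
proxy_best = trajectories[np.argmax(trajectories @ proxy_w)]  # proxy-optimal behavior

def likelihood(w):
    # Boltzmann-rational designer: probability of endorsing the proxy-optimal
    # trajectory if trajectories were actually scored with candidate weights w.
    values = trajectories @ w
    return np.exp(beta * (proxy_best @ w)) / np.exp(beta * values).sum()

posterior = np.array([likelihood(w) for w in candidate_true_w])
posterior /= posterior.sum()
# Mass stays spread over the unseen mud feature, so a cautious planner avoids it.
print(dict(zip(["mud ok", "mud bad", "mud awful"], posterior.round(3))))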
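
For point 12, a minimal sketch of how RLHF-style reward learning turns pairwise preference data into scores: the standard reward-model objective is a Bradley-Terry / logistic loss on comparisons. Here each response gets a single learnable scalar instead of a neural-network score; the comparison data is invented.

# Point 12: fit per-response "rewards" to pairwise comparisons with the
# Bradley-Terry / logistic objective used for RLHF reward models (toy version).
import numpy as np

responses = ["A", "B", "C"]
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1)]   # (winner_index, loser_index)

scores = np.zeros(len(responses))   # learnable scalar reward per response
lr = 0.5
for _ in range(200):
    grad = np.zeros_like(scores)
    for w, l in comparisons:
        p_win = 1.0 / (1.0 + np.exp(scores[l] - scores[w]))  # P(winner preferred)
        grad[w] += (1.0 - p_win)    # gradient ascent on log-likelihood
        grad[l] -= (1.0 - p_win)
    scores += lr * grad / len(comparisons)
    scores -= scores.mean()         # scores are only defined up to a constant

print(dict(zip(responses, scores.round(2))))   # ordering comes out A > B > C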
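
Point 14 refers to the connection between RLHF preference aggregation and Borda-count voting. As a generic illustration of Borda counting (the rankings below are invented), each voter gives an item one point for every item ranked below it, and the totals decide the winner.

# Point 14: Borda count over full rankings.
from collections import defaultdict

rankings = [            # each list is one voter's ranking, best first
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]

scores = defaultdict(int)
for ranking in rankings:
    n = len(ranking)
    for position, item in enumerate(ranking):
        scores[item] += n - 1 - position   # points = items ranked below

print(dict(scores))   # {'A': 5, 'B': 3, 'C': 1}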
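
For points 15, 16, and 26, a minimal sketch of the idea behind distributional, uncertainty-aware preference learning: keep a distribution over reward rather than a point estimate, and rank candidates by a risk-averse lower quantile instead of the mean. The response names and reward samples are invented.

# Points 15/16/26: risk-averse scoring from a distribution over reward.
import numpy as np

rng = np.random.default_rng(0)
# Posterior samples of reward for two candidate responses (e.g. from an ensemble)
reward_samples = {
    "cautious_answer":  rng.normal(0.6, 0.1, size=1000),   # modest, well understood
    "jailbreak_answer": rng.normal(0.8, 0.8, size=1000),   # higher mean, huge variance
}

for name, samples in reward_samples.items():
    mean = samples.mean()
    lcb = np.quantile(samples, 0.05)    # 5th-percentile "risk-averse" score
    print(f"{name:17s} mean={mean:5.2f}  5%-quantile={lcb:5.2f}")

# The mean prefers the risky answer; the lower quantile prefers the safe one,
# which is how uncertainty modeling can harden a system against exploits.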
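
Finally, for point 19, a toy two-armed bandit showing why a human who follows win-stay-lose-shift leaks reward information: an observer who only sees the action sequence can reconstruct which trials were rewarded. The reward probabilities are invented.

# Point 19: win-stay-lose-shift reveals the previous trial's reward.
import random

random.seed(0)
p_reward = {"left": 0.8, "right": 0.2}   # hidden from the observer

def wsls_step(action, rewarded):
    """Win-stay-lose-shift: repeat after a reward, switch after a failure."""
    if rewarded:
        return action
    return "right" if action == "left" else "left"

action, history = "right", []
for _ in range(10):
    rewarded = random.random() < p_reward[action]
    history.append((action, rewarded))
    action = wsls_step(action, rewarded)

# Every repeated action implies the previous trial was rewarded; every switch
# implies it was not, so the reward sequence is recoverable from actions alone.
actions = [a for a, _ in history]
inferred = [actions[i] == actions[i + 1] for i in range(len(actions) - 1)]
print(actions)
print("inferred rewards:", inferred)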

Knowledge Vault built by David Vivancos 2024