The End Of Knowledge - Vault 2 - ICLR (2014-2023)

graph LR
classDef rl fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef concepts fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef examples fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef problem fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef algorithms fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef implementations fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef future fill:#f9d4d4, font-weight:bold, font-size:14px;
A[David Silver - ICLR 2015] --> B[RL: AI agents achieve goals 1]
A --> C[Main concepts: policy, value 4]
C --> D[Policy: action selection 5]
C --> E[Value: state/action goodness 6]
B --> F[Actions influence world, rewards 2]
B --> G[Goal: maximize future rewards 3]
B --> H[RL problems: control, optimization, games 7]
B --> I[Agent-environment interaction 8]
I --> J[Optimal policy maximizes reward 9]
I --> K[Optimal value: max reward 10]
A --> L[Value iteration: Bellman equation 11]
L --> M[Neural nets generalize value 12]
A --> N[Q-learning: approximate action-value 13]
N --> O[Naive Q-learning unstable 14]
N --> P[DQN: replay, target nets, clipping 15]
P --> Q[Replay breaks correlations 16]
P --> R[Target nets stabilize learning 17]
P --> S[DQN: superhuman Atari performance 18]
S --> T[DQN trained from raw pixels 19]
S --> U[Replay, targets crucial for DQN 20]
P --> V[Reward normalization improves DQN 21]
A --> W[Gorila: parallel DQN 22]
W --> X[Gorila: faster, better than DQN 23]
A --> Y[Policy gradients optimize policy 24]
Y --> Z[Deterministic policy gradient 25]
Y --> AA[Actor-critic: policy + value 26]
Y --> AB[Continuous control from pixels 27]
A --> AC[RL: general AI framework 28]
AC --> AD[Single RL agent, various tasks 29]
AC --> AE[Limitations: sparse rewards, reasoning 30]
class A,B,F,G,H,I,J,K rl;
class C,D,E concepts;
class L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,AB algorithms;
class AC,AD,AE future;

Summary:

1.-Reinforcement learning (RL) is a general framework for building AI agents that can act in the world to achieve goals.

2.-RL agents take actions that influence the world, change the state, and affect future rewards the agent receives.

3.-The goal in RL is to select actions over time to maximize the sum of future rewards.

4.-Two main concepts in RL are policy (how the agent selects actions) and value (how good a state/action is).

5.-Policy-based RL directly searches for the optimal policy that achieves maximum reward from every state.

6.-Value-based RL estimates the optimal value function: the maximum future reward achievable from each state under any policy.

7.-Examples of RL problems include robot control, user interaction optimization, games, and sequential decision making in machine learning.

8.-The RL problem is formalized as an agent interacting with an environment, receiving states, taking actions, and getting rewards.

9.-The optimal policy is one that maximizes future reward from every state. Finding this solves the RL problem.

10.-The optimal value function captures the maximum possible reward from each state. Finding this also solves the RL problem.

11.-Value iteration algorithms solve for the optimal value function by iteratively applying the Bellman optimality equation.

12.-Tabular value iteration methods don't scale to large state/action spaces. Neural networks can represent value functions to enable generalization.

13.-Q-learning trains a neural network to approximate the optimal action-value function by minimizing the Bellman error, the gap between the current Q-value and its one-step bootstrapped target.

14.-Naive Q-learning with neural networks is unstable for three reasons: successive samples are correlated, the policy can change sharply with small changes in Q-values, and reward scales vary between problems.

15.-Deep Q-Networks (DQN) provide a stable solution using experience replay, target networks, and clipping rewards.

16.-Experience replay stores past transitions and samples from them randomly to break correlations and learn from varied past policies.

17.-Target networks are copies of the Q-network whose parameters are frozen for a period and then refreshed, which keeps the Q-learning targets stable as the policy changes.

18.-DQN was applied to Atari games, learning to play from raw pixels with the same architecture for every game and only the game score as reward.

19.-On many Atari games, DQN achieved human-level or superhuman performance after 2 weeks of training.

20.-Experience replay and target networks were both crucial for stabilizing learning and achieving good performance with DQN.

21.-Reward clipping in DQN was improved by a normalization technique to preserve reward scale while bounding gradients.

22.-Gorila architecture enables massively parallel DQN training by separating acting from learning and using distributed components.

23.-Gorila outperformed DQN on the majority of Atari games and reached DQN-level performance about 10x faster.

24.-Policy gradient methods directly optimize the policy to maximize rewards, useful for continuous action spaces.

25.-Deterministic policy gradient provides an end-to-end approach to adjust a policy network's parameters to improve expected reward.

26.-Actor-critic methods combine policy gradients with value estimation, using a critic to estimate Q-values and an actor to improve the policy.

27.-Continuous domain control from raw pixels was demonstrated using deterministic policy gradients with an actor-critic architecture.

28.-RL provides a general-purpose framework for AI. Many problems can be solved end-to-end by deep RL.

29.-Single deep RL agents can now solve a variety of challenging tasks specified as reward maximization problems.

30.-Limitations remain for complex problems with sparse rewards requiring long-term reasoning. Ongoing research aims to address these challenges.
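
Point 11 describes value iteration as repeated application of the Bellman optimality equation. A minimal tabular sketch follows, assuming a made-up two-state problem and a discount factor gamma (the summary speaks only of a sum of future rewards, so the discount is an assumption). The names value_iteration, TRANSITIONS and REWARDS are hypothetical and chosen for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    transitions[s][a] is a list of (probability, next_state) pairs.
    rewards[s][a] is the immediate reward for taking action a in state s.
    gamma is an assumed discount factor, not taken from the summary.
    """
    n_states = len(transitions)
    values = [0.0] * n_states
    for _ in range(max_iters):
        new_values = []
        for s in range(n_states):
            # Bellman optimality backup: best one-step lookahead over actions.
            backups = [
                rewards[s][a]
                + gamma * sum(p * values[s2] for p, s2 in transitions[s][a])
                for a in range(len(transitions[s]))
            ]
            new_values.append(max(backups))
        delta = max(abs(new - old) for new, old in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    return values


# Hypothetical toy problem: state 1 is absorbing and pays nothing.
TRANSITIONS = [
    [[(1.0, 0)], [(1.0, 1)]],  # state 0: action 0 stays, action 1 moves to state 1
    [[(1.0, 1)], [(1.0, 1)]],  # state 1: both actions stay in state 1
]
REWARDS = [
    [0.0, 1.0],  # moving out of state 0 pays 1
    [0.0, 0.0],
]

optimal_values = value_iteration(TRANSITIONS, REWARDS)
```

The table of values is what point 12 says stops scaling: it needs one entry per state, which is why the talk moves on to neural networks as function approximators.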
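
Point 13 refers to minimizing the Bellman error. The sketch below shows that error for a single transition, using a plain lookup table in place of the neural network from the talk so that it runs without any deep learning library. The function name, the step size alpha and the discount gamma are assumptions for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.5, gamma=0.9):
    """Apply one Q-learning step in place.

    Returns the Bellman (TD) error measured before the update.
    alpha and gamma are illustrative values, not taken from the summary.
    """
    # Bootstrapped target: immediate reward plus the best next-state value.
    bootstrap = 0.0 if done else max(q[next_state])
    target = reward + gamma * bootstrap
    td_error = target - q[state][action]
    # Move the estimate a fraction alpha of the way toward the target.
    q[state][action] += alpha * td_error
    return td_error


q_table = [[0.0, 0.0], [0.0, 0.0]]  # 2 states x 2 actions, hypothetical
first_error = q_learning_update(q_table, 0, 1, 1.0, 1, done=True)
second_error = q_learning_update(q_table, 0, 1, 1.0, 1, done=True)
```

Repeating the same transition shrinks the error each time, which is the behavior a gradient step on the squared Bellman error aims for when a network replaces the table.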
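
Point 16 describes experience replay as storing past transitions and sampling them at random. One way this could look is sketched below, assuming a fixed-capacity buffer that discards the oldest transitions first. The class name ReplayBuffer, its methods and the capacity are hypothetical, not the DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Because the buffer outlives any single version of the policy, a sampled batch mixes transitions generated by several past policies, as the point notes.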
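
Points 15 and 17 mention reward clipping and target networks. The sketch below illustrates both under stated assumptions: rewards are clipped to the range [-1, 1], and the target parameters are refreshed every sync_every updates. The parameter dictionary stands in for network weights, and QWithTarget, after_update and sync_every are hypothetical names.

```python
import copy


def clip_reward(reward, low=-1.0, high=1.0):
    """Bound a reward so that error magnitudes stay comparable across tasks."""
    return max(low, min(high, reward))


class QWithTarget:
    """Online parameters plus a frozen copy used to compute targets."""

    def __init__(self, params, sync_every):
        self.online = params
        self.target = copy.deepcopy(params)
        self.sync_every = sync_every  # illustrative interval, an assumption
        self.steps = 0

    def after_update(self):
        """Call once per learning step; refreshes the frozen copy on schedule."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)


net = QWithTarget({"w": [0.0]}, sync_every=3)
target_history = []
for _ in range(3):
    net.online["w"][0] += 1.0  # stand-in for a gradient step on the online net
    net.after_update()
    target_history.append(net.target["w"][0])
```

The target copy stays at its old value for the first two steps and only catches up on the third, so the regression targets do not move with every update of the online parameters.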
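
Point 24 says policy gradient methods optimize the policy directly. As a small illustration, the sketch below uses a score-function (REINFORCE-style) gradient with a softmax policy on a hypothetical two-armed bandit. This is a swapped-in, simpler technique than the deterministic policy gradient of point 25, and the learning rate, step count and reward values are assumptions.

```python
import math
import random


def softmax(preferences):
    """Turn action preferences into a probability distribution."""
    peak = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(arm_rewards, steps=500, learning_rate=0.5, seed=0):
    """Adjust policy parameters along reward * grad log pi(action)."""
    rng = random.Random(seed)
    preferences = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(preferences)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action]
        for k in range(len(preferences)):
            # Gradient of log softmax: indicator of the chosen arm minus its probability.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            preferences[k] += learning_rate * reward * grad_log_prob
    return softmax(preferences)


# Hypothetical bandit: arm 1 always pays 1, arm 0 pays nothing.
final_probs = train_bandit_policy(arm_rewards=[0.0, 1.0])
```

No value function is learned here; the policy parameters are moved directly toward actions that earned reward. An actor-critic method (point 26) would replace the raw reward in the update with a critic's estimate.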

Knowledge Vault built by David Vivancos 2024