Knowledge Vault 6/26 - ICML 2017
Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine & Chelsea Finn

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
    classDef main fill:#f9d9c9, font-weight:bold, font-size:14px
    classDef foundations fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef methods fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef applications fill:#f9f9d4, font-weight:bold, font-size:14px
    classDef challenges fill:#f9d4f9, font-weight:bold, font-size:14px
    Main[Deep Reinforcement Learning, Decision Making, and Control]
    Main --> A[Foundations]
    Main --> B[RL Methods]
    Main --> C[Advanced Concepts]
    Main --> D[Applications]
    Main --> E[Challenges and Future Directions]
    A --> A1[Tutorial covers deep RL, decision making, control 1]
    A --> A2[Sequential decision making affects future states 2]
    A --> A3[Deep RL combines deep learning, reinforcement learning 3]
    A --> A4[RL: generate samples, fit model, improve policy 4]
    A --> A5[Policy gradient: differentiate policy for improvement 5]
    A --> A6[Variance reduction: causality, baseline, natural gradients 6]
    B --> B1[Actor-critic: actor predicts, critic evaluates actions 7]
    B --> B2[Q-learning: policy maximizes learned Q-function 8]
    B --> B3[RL as probabilistic inference: graphical model 9]
    B --> B4[Soft optimality: maximize entropy and reward 10]
    B --> B5[Soft Q-learning: soft max for Q-function 11]
    B --> B6[Inverse RL: infer reward from demonstrations 12]
    C --> C1[MaxEnt IRL: probabilistic model, GAN equivalent 13]
    C --> C2[Sampling-based inverse RL algorithms 14]
    C --> C3[Model-based RL: learn dynamics, optimize policy 15]
    C --> C4[Using learned model: gradients, MPC, local models 16]
    C --> C5[Guided policy search: local models to global policy 17]
    C --> C6[High-dimensional observations: learn latent or direct 18]
    D --> D1[Model-based RL: efficient but model-limited 19]
    D --> D2[Sample efficiency: curiosity, hierarchy, stochastic policies 21]
    D --> D3[Safe exploration: uncertainty, off-policy, human oversight 22]
    D --> D4[Reward specification: preferences, IRL, goals, language 23]
    D --> D5[Task transfer: build on previous knowledge 24]
    D --> D6[Automatic task generation and curricula important 25]
    E --> E1[Open challenges: efficiency, safety, rewards, transfer 20]
    E --> E2[Uncertainty in policies for safe exploration 26]
    E --> E3[Off-policy learning from demonstrations 27]
    E --> E4[Human intervention for safe learning boundaries 28]
    E --> E5[Simulation-to-real transfer for safe skill deployment 29]
    E --> E6[Minimal supervision may aid human-level AI pursuit 30]
    class Main main
    class A,A1,A2,A3,A4,A5,A6 foundations
    class B,B1,B2,B3,B4,B5,B6 methods
    class C,C1,C2,C3,C4,C5,C6 methods
    class D,D1,D2,D3,D4,D5,D6 applications
    class E,E1,E2,E3,E4,E5,E6 challenges

Resume:

1.- The tutorial covers deep reinforcement learning, decision making, and control. Slides are available online.

2.- Sequential decision making is needed when an agent's actions affect future states and decisions. Applications include robotics, autonomous driving, and finance.

3.- Deep reinforcement learning combines deep learning for rich sensory inputs with reinforcement learning for actions that affect outcomes.

4.- Reinforcement learning involves generating samples, fitting a model/estimator to evaluate returns, and using it to improve the policy in a cycle.
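
A minimal sketch of this three-part cycle, assuming a Gym-style environment and hypothetical `policy`, `fit_returns`, and `improve` callables (these names are illustrative, not from the tutorial):

```python
def rl_training_loop(env, policy, fit_returns, improve, iterations=100):
    """Generic RL anatomy: generate samples, fit an estimator of returns,
    improve the policy, and repeat (hypothetical interfaces)."""
    for _ in range(iterations):
        # 1) Generate samples by running the current policy in the environment
        trajectories = []
        for _ in range(10):
            obs, done, traj = env.reset(), False, []
            while not done:
                action = policy(obs)
                next_obs, reward, done, _ = env.step(action)
                traj.append((obs, action, reward))
                obs = next_obs
            trajectories.append(traj)
        # 2) Fit a model / return estimator (value function, Q-function, or dynamics)
        estimator = fit_returns(trajectories)
        # 3) Use it to improve the policy, then go back to step 1
        policy = improve(policy, estimator, trajectories)
    return policy
```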

5.- In policy gradient methods, the expected return is differentiated directly with respect to the policy parameters, enabling gradient ascent and formalizing trial-and-error learning.
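
A minimal NumPy sketch of this idea, the REINFORCE estimator for a linear-softmax policy over discrete actions; the parameterization and variable names are assumptions for illustration only:

```python
import numpy as np

def softmax_policy(theta, obs):
    """Action probabilities of an (illustrative) linear-softmax policy."""
    logits = obs @ theta                       # theta: (obs_dim, n_actions)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_gradient(theta, trajectories):
    """Monte Carlo policy gradient: E[ grad log pi(a|s) * R(tau) ]."""
    grad = np.zeros_like(theta)
    for traj in trajectories:                  # traj = [(obs, action, reward), ...]
        total_return = sum(r for _, _, r in traj)
        for obs, action, _ in traj:
            p = softmax_policy(theta, obs)
            one_hot = np.eye(len(p))[action]
            # grad of log softmax w.r.t. theta is outer(obs, one_hot(a) - p)
            grad += np.outer(obs, one_hot - p) * total_return
    return grad / len(trajectories)

# Gradient ascent on the expected return:
# theta += learning_rate * reinforce_gradient(theta, trajectories)
```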

6.- The variance of policy gradient estimates can be reduced by exploiting causality (using the reward-to-go) and subtracting a baseline. Natural gradients improve convergence.
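
A sketch of the two standard tricks, assuming trajectories are given as per-timestep reward lists; the constant mean baseline is an illustrative choice (a learned state-value baseline is also common):

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Causality: credit each action only with rewards that come after it."""
    rtg, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def advantages_with_baseline(all_rewards, gamma=0.99):
    """Subtract a constant baseline (mean reward-to-go across the batch);
    this reduces variance without biasing the gradient."""
    rtgs = [reward_to_go(r, gamma) for r in all_rewards]
    baseline = np.concatenate(rtgs).mean()
    return [rtg - baseline for rtg in rtgs]
```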

7.- Actor-critic algorithms have an actor that predicts actions and a critic that evaluates them. The critic is used to estimate the advantage.
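
A sketch of the advantage estimate the critic supplies to the actor; `value_fn` is a hypothetical learned state-value function:

```python
def td_advantage(value_fn, obs, reward, next_obs, done, gamma=0.99):
    """A(s, a) ~= r + gamma * V(s') - V(s): how much better the taken action
    was than the critic's expectation for this state."""
    bootstrap = 0.0 if done else gamma * value_fn(next_obs)
    return reward + bootstrap - value_fn(obs)

# The actor is updated with grad log pi(a|s) * td_advantage(...),
# while the critic regresses V(s) toward r + gamma * V(s').
```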

8.- In direct value function methods such as Q-learning, the policy is implicit: it maximizes the learned Q-function. These methods can be extended to continuous actions.
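
A minimal tabular sketch of the idea that the policy is implicit in an arg-max over a learned Q-function; the table is an assumption here, and the deep variants in the tutorial replace it with a network:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a (n_states, n_actions) array Q."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_policy(Q, s):
    """The policy is never represented explicitly: just maximize Q."""
    return int(np.argmax(Q[s]))
```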

9.- Reinforcement learning can be viewed as probabilistic inference. Value functions and Q-functions emerge from inference in a graphical model.

10.- Soft optimality emerges from a graphical model of trajectories, values, and rewards. The policy maximizes entropy along with expected reward.
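
For reference, one standard way to write the resulting maximum-entropy objective (the explicit temperature α is a common convention, not notation taken from the slides):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}
\big[\, r(s_t, a_t) + \alpha \,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big]
```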

11.- Soft Q-learning uses a soft max instead of a hard max for the Q-function. This helps with exploration and compositionality.
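
A sketch of the soft backup for discrete actions: the hard max is replaced by a log-sum-exp, and the induced policy is a Boltzmann distribution over Q-values (the explicit temperature is an assumption here):

```python
import numpy as np

def soft_value(q_values, temperature=1.0):
    """V_soft(s) = temperature * log sum_a exp(Q(s, a) / temperature).
    As temperature -> 0 this recovers the hard max."""
    q = np.asarray(q_values) / temperature
    return temperature * (np.max(q) + np.log(np.sum(np.exp(q - np.max(q)))))

def soft_policy(q_values, temperature=1.0):
    """pi(a | s) proportional to exp(Q(s, a) / temperature)."""
    q = np.asarray(q_values) / temperature
    p = np.exp(q - np.max(q))
    return p / p.sum()
```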

12.- Inverse reinforcement learning aims to infer the reward function from expert demonstrations. The problem is inherently ambiguous, and classical formulations require repeatedly solving the forward RL problem.

13.- Maximum entropy inverse reinforcement learning handles this ambiguity with a probabilistic model. It is equivalent to a GAN with a particular form of discriminator.
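
The probabilistic model behind MaxEnt IRL, for reference: trajectories are assumed to be exponentially more likely the higher their reward under the learned reward function r_ψ,

```latex
p(\tau) = \frac{1}{Z}\exp\big(r_\psi(\tau)\big), \qquad
Z = \int \exp\big(r_\psi(\tau)\big)\, d\tau
```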

14.- Guided cost learning and generative adversarial imitation learning are sampling-based inverse RL algorithms that work without fully solving the forward problem.

15.- Model-based RL aims to learn the dynamics model and optimize the policy using that model. It is typically more sample-efficient than model-free RL.
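
A sketch of the first half of that recipe: regress a dynamics model on observed transitions. The deliberately simple linear least-squares fit below is an assumption; the tutorial discusses neural-network and other learned models.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t + c, a simple stand-in
    for a learned dynamics model."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])   # (N, ds+da+1)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)           # (ds+da+1, ds)
    return W

def predict_next_state(W, state, action):
    """Predict the next state with the fitted linear model."""
    return np.concatenate([state, action, [1.0]]) @ W
```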

16.- Ways to use a learned model include back-propagating gradients through it, model-predictive control (MPC), and fitting local models.
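
A sketch of the MPC option as random shooting: sample candidate action sequences, roll each out through the learned model, execute the first action of the best one, then replan. `predict_next_state(s, a)` and `reward_fn(s, a)` are hypothetical callables standing in for the learned model and a known reward.

```python
import numpy as np

def mpc_random_shooting(state, predict_next_state, reward_fn, action_dim,
                        horizon=10, n_candidates=1000, rng=None):
    """Return the first action of the best random action sequence,
    scored by rolling it out through the learned dynamics model."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_return, best_action = -np.inf, None
    for seq in candidates:
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)
            s = predict_next_state(s, a)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action   # execute it, observe the real next state, replan
```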

17.- Guided policy search learns local models and policies for multiple initial states and distills them into a global policy.

18.- With high-dimensional observations, the dynamics model can be learned in a low-dimensional latent space or directly in observation space.
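
A sketch of the latent-space option, with `encoder`, `latent_dynamics`, and `decoder` as hypothetical learned networks:

```python
def latent_rollout(encoder, latent_dynamics, decoder, image, actions):
    """Encode the observation once, step forward in the low-dimensional
    latent space, and decode back to observations when needed."""
    z = encoder(image)
    predictions = []
    for a in actions:
        z = latent_dynamics(z, a)
        predictions.append(decoder(z))
    return predictions
```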

19.- Model-based RL can be more efficient and generalizable than model-free RL, but is limited by model accuracy.

20.- Open challenges in deep RL include improving sample efficiency, safe exploration, reward specification, and transfer learning.

21.- Sample efficiency can potentially be improved through curiosity, hierarchy, stochastic policies, and transfer across tasks.

22.- Safe exploration may involve uncertainty estimation, learning from off-policy data, human oversight, or learning first in simulation.

23.- Reward specification can leverage human preferences, inverse RL, goal images, object motions, or language instructions.

24.- Agents should learn to quickly solve new tasks by building on knowledge from previous tasks, rather than learning tabula rasa.

25.- Automatically generating tasks and curricula is an important problem for building more capable agents.

26.- Incorporating uncertainty into policies can help agents explore safely by avoiding actions with highly uncertain outcomes.
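
A sketch of one way to operationalize this, using disagreement within a hypothetical ensemble of learned dynamics models as an uncertainty proxy (the ensemble and threshold are assumptions, not a method prescribed in the talk):

```python
import numpy as np

def disagreement(models, state, action):
    """Epistemic-uncertainty proxy: spread of an ensemble's predictions."""
    preds = np.stack([m(state, action) for m in models])
    return preds.std(axis=0).mean()

def filter_safe_actions(models, state, candidate_actions, threshold):
    """Keep only actions whose predicted outcome the ensemble agrees on,
    avoiding actions with highly uncertain (potentially unsafe) outcomes."""
    return [a for a in candidate_actions
            if disagreement(models, state, a) < threshold]
```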

27.- Learning from off-policy data, such as human demonstrations, can allow agents to learn without risky trial-and-error.

28.- Human intervention when an agent is about to make an unsafe decision can keep the agent within safe boundaries during learning.

29.- Simulation-to-real transfer allows agents to learn in a safe virtual environment before deploying those skills in the real world.

30.- The paradigm of learning from rewards with minimal supervision may help in the pursuit of human-level artificial intelligence.

Knowledge Vault built by David Vivancos 2024