Knowledge Vault 6/10 - ICML 2015
Policy Search: Methods and Applications
Gerhard Neumann & Jan Peters

Concept Graph & Summary generated using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
    classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef policysearch fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef gradients fill:#f9f9d4, font-weight:bold, font-size:14px
    classDef em fill:#f9d4f9, font-weight:bold, font-size:14px
    classDef advanced fill:#d4f9f9, font-weight:bold, font-size:14px
    Main[Policy Search: Methods and Applications]
    Main --> A[Introduction and Motivation]
    A --> A1[Autonomous robots need complex skill learning 1]
    A --> A2[Challenges: high-dimensional spaces, data costs, safety 2]
    A --> A3[Value-based RL: unstable, extensive exploration 3]
    A --> A4[Policy search: parameterized, correlated, local updates 4]
    A --> A5[Taxonomy: model-free vs model-based methods 5]
    A --> A6[Outline: taxonomy, methods, extensions, model-based 6]
    Main --> B[Policy Search Fundamentals]
    B --> B1[Policy representations: trajectories, controllers, networks 7]
    B --> B2[Model-free vs model-based: samples vs learning 8]
    B --> B3[Step vs episode-based exploration: action/parameter space 9]
    B --> B4[Policy update: direct optimization or EM 10]
    B --> B5[Exploration: balance smoothness and variability 11]
    B --> B6[Correlated parameter exploration yields smoother trajectories 12]
    Main --> C[Policy Gradient Methods]
    C --> C1[Conservative vs greedy updates: exploration-exploitation tradeoff 13]
    C --> C2[Policy gradients: log-likelihood trick estimates gradient 14]
    C --> C3[Baseline subtraction reduces variance without bias 15]
    C --> C4[Step-based gradients use state-action value function 16]
    C --> C5[State-dependent baseline further reduces variance 17]
    C --> C6[Metric choice impacts update step size 18]
    Main --> D[Advanced Policy Gradient Techniques]
    D --> D1[Natural gradients: Fisher information normalizes gradient 19]
    D --> D2[Natural actor-critic: gradients with function approximation 20]
    D --> D3[State-value function reduces advantage function variance 21]
    D --> D4[Policy gradients learn motor skills slowly 22]
    Main --> E[Expectation-Maximization Methods]
    E --> E1[EM-based search: reward-weighted maximum likelihood 23]
    E --> E2[EM works for step/episode-based settings 24]
    E --> E3[Reward weighting: baseline subtraction, rescaling 25]
    E --> E4[Moment projection: KL minimization, closed-form updates 26]
    Main --> F[Advanced Topics and Applications]
    F --> F1[Applications: complex robot skills forthcoming 27]
    F --> F2[Contextual search learns generalizable, adaptable skills 28]
    F --> F3[Hierarchical search: high-level sequencing, low-level primitives 29]
    F --> F4[Model-based search: PILCO, guided policy search 30]
    class Main main
    class A,A1,A2,A3,A4,A5,A6 intro
    class B,B1,B2,B3,B4,B5,B6 policysearch
    class C,C1,C2,C3,C4,C5,C6,D,D1,D2,D3,D4 gradients
    class E,E1,E2,E3,E4 em
    class F,F1,F2,F3,F4 advanced

Summary:

1.- Motivation: Autonomous robots will become more prevalent and require learning methods to acquire complex skills.

2.- Challenges: High-dimensional continuous state/action spaces, real-world data collection costs, safety concerns during exploration.

3.- Value-based RL: Estimates the value function globally; updates can be unstable with function approximation and require extensive exploration.

4.- Policy search principles: Parameterized policies, correlated exploration, local policy updates, fewer assumptions than value-based RL.

5.- Policy search taxonomy: Model-free vs model-based, different exploration and update methods within each category.

6.- Tutorial outline: Taxonomy, model-free methods (policy gradients, EM), extensions (contextual, hierarchical), model-based methods.

7.- Policy representations: Trajectory generators (e.g. dynamic movement primitives, DMPs), linear controllers, radial-basis-function (RBF) networks, neural networks.
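
An illustrative sketch (my own code, not from the tutorial) of one such representation: a policy that is linear in RBF features of the state. The weight vector theta is what policy search optimizes; the centers and bandwidth are hypothetical choices.

```python
import numpy as np

def rbf_features(state, centers, bandwidth=0.5):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 * bandwidth^2))"""
    dists = np.linalg.norm(centers - state, axis=1)
    return np.exp(-0.5 * (dists / bandwidth) ** 2)

def rbf_policy(state, theta, centers, bandwidth=0.5):
    """Deterministic action a = theta^T phi(s); theta are the learnable parameters."""
    return theta @ rbf_features(state, centers, bandwidth)

centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)   # 5 RBF centers for a 1-D state
theta = np.zeros(5)                                   # parameters to be learned
action = rbf_policy(np.array([0.3]), theta, centers)
```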

8.- Model-free: Use state, action, reward samples. Model-based: Learn a model, sample-efficient but sensitive to model errors.

9.- Step-based vs episode-based exploration: Step-based uses action space exploration, episode-based uses correlated parameter space exploration.
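
A hedged sketch of the contrast, using a toy linear policy of my own (policy_mean, the states, and all numbers are illustrative): step-based exploration injects independent noise into every action, while episode-based exploration perturbs the parameter vector once per episode, which also yields the smoother trajectories mentioned in point 12.

```python
import numpy as np

def policy_mean(state, theta):
    return float(theta @ state)                    # deterministic linear policy

def step_based_actions(states, theta, sigma_a=0.1):
    """Step-based: independent Gaussian noise on every action -> jerky commands."""
    return [policy_mean(s, theta) + sigma_a * np.random.randn() for s in states]

def episode_based_actions(states, theta_mean, theta_cov):
    """Episode-based: sample correlated parameters once, then act deterministically."""
    theta = np.random.multivariate_normal(theta_mean, theta_cov)
    return [policy_mean(s, theta) for s in states]

states = [np.array([np.sin(0.1 * t), np.cos(0.1 * t)]) for t in range(100)]
noisy = step_based_actions(states, theta=np.array([1.0, 0.5]))
smooth = episode_based_actions(states, theta_mean=np.array([1.0, 0.5]),
                               theta_cov=0.01 * np.eye(2))
```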

10.- Policy update: Direct policy optimization (gradient-based) or Expectation-Maximization (weighted maximum likelihood).

11.- Exploration must balance smoothness (for safety on the real system) against variability (needed for learning progress); the size of a policy change is best measured by the KL divergence between successive policies.

12.- Correlated exploration in parameter space yields smoother trajectories than independent exploration at each step.

13.- Conservative vs greedy policy updates illustrate exploration-exploitation tradeoff. KL divergence balances smoothness and progress.
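
One common way to formalize this tradeoff, used in KL-regularized methods such as relative entropy policy search that are closely related to this tutorial, is to maximize expected return subject to a bound on the KL divergence to the old policy (a sketch; epsilon plays the role of a step size):

$$\max_{\pi}\; \mathbb{E}_{\theta \sim \pi}\big[R(\theta)\big] \quad \text{s.t.} \quad \mathrm{KL}\big(\pi \,\|\, \pi_{\text{old}}\big) \le \epsilon$$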

14.- Policy gradients use log-likelihood trick to estimate gradient of expected return with respect to policy parameters.
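
The log-likelihood (likelihood-ratio) trick in formulas, with J(θ) the expected return over trajectories τ drawn from p_θ; the dynamics drop out of the gradient because they do not depend on θ:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau^{[i]})\, R(\tau^{[i]}), \qquad \nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$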

15.- Baseline subtraction reduces variance of policy gradient estimates without introducing bias. Optimal baseline approximates value function.
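
In formulas: subtracting a baseline b leaves the estimator unbiased because E[∇_θ log p_θ(τ)] = 0, and the variance-minimizing choice (written component-wise for parameter h) weights returns by the squared score:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log p_\theta(\tau)\,\big(R(\tau) - b\big)\big], \qquad b_h^{*} = \frac{\mathbb{E}\big[(\partial_{\theta_h} \log p_\theta(\tau))^2\, R(\tau)\big]}{\mathbb{E}\big[(\partial_{\theta_h} \log p_\theta(\tau))^2\big]}$$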

16.- Step-based policy gradients (REINFORCE) decompose return into per-step quantities, enabling use of state-action value function.
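
A sketch of the standard step-based decomposition: because an action cannot influence rewards that came before it, the full return can be replaced by the state-action value, which typically lowers the variance of the estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi}(s_t, a_t)\Big]$$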

17.- Introducing a state-dependent baseline (e.g. value function) can further reduce variance of policy gradient estimates.

18.- Choice of metric (e.g. Euclidean vs Fisher information) impacts policy update step size and convergence behavior.

19.- Natural policy gradients use Fisher information matrix to normalize gradient, enabling invariance to reparameterization.
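
In formulas (standard definitions, sketched here): the natural gradient premultiplies the vanilla gradient by the inverse Fisher information matrix, which gives the steepest-ascent direction under a second-order approximation of a KL constraint on the change of the trajectory distribution:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta), \qquad F(\theta) = \mathbb{E}\big[\nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^{\top}\big]$$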

20.- Natural actor-critic combines natural policy gradients with compatible function approximation of state-action value function.
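
The key identity, as I recall it from the natural actor-critic line of work (a sketch, not quoted from the slides): if the advantage is approximated with compatible features, then the least-squares weight vector w is itself the natural gradient:

$$A_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)^{\top} w \quad \Longrightarrow \quad \tilde{\nabla}_\theta J(\theta) = w$$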

21.- Incorporating a state-value function reduces variance of the advantage function (Q-V) used in actor-critic methods.

22.- Policy gradients can effectively learn motor skills like baseball swings but convergence may be slow.

23.- EM-based policy search (reward-weighted ML) updates policy to match previous policy reweighted by exponential reward.
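
A hedged sketch of one episode-based reward-weighted maximum-likelihood update for a Gaussian search distribution over parameters, in the spirit of reward-weighted regression / PoWER; the inverse temperature beta, the covariance regularizer, and the toy objective are my own choices:

```python
import numpy as np

def reward_weighted_update(thetas, returns, beta=5.0):
    """One EM-style update of a Gaussian over policy parameters.

    thetas  : (N, d) sampled parameter vectors
    returns : (N,)   episode returns R_i
    """
    # Exponential reward transformation; subtracting the maximum is the usual
    # rescaling for numerical stability (cf. point 25).
    w = np.exp(beta * (returns - np.max(returns)))
    w /= np.sum(w)

    # Weighted maximum likelihood (M-step) for a Gaussian: closed-form moments.
    mean = w @ thetas
    diff = thetas - mean
    cov = (w[:, None] * diff).T @ diff + 1e-6 * np.eye(thetas.shape[1])
    return mean, cov

# Usage: sample parameters, roll them out, then reweight by return.
rng = np.random.default_rng(0)
thetas = rng.normal(size=(50, 3))
returns = -np.sum((thetas - 1.0) ** 2, axis=1)       # toy objective standing in for rollouts
mean, cov = reward_weighted_update(thetas, returns)  # mean moves toward high-return samples
```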

24.- EM-policy search works for both step-based and episode-based settings, relates to optimal control and information-theoretic methods.

25.- Weighting schemes for rewards include subtracting a baseline and rescaling to improve policy updates.

26.- Moment projection minimizes KL divergence between new and reweighted policies, yielding closed-form updates for Gaussians.
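
In formulas (a sketch): with reward weights w_i applied to sampled parameters θ^[i], the moment projection onto the Gaussian family reduces to matching the weighted mean and covariance:

$$\pi_{\text{new}} = \arg\min_{\pi}\, \mathrm{KL}\big(\tilde{p} \,\|\, \pi\big), \qquad \mu_{\text{new}} = \frac{\sum_i w_i\, \theta^{[i]}}{\sum_i w_i}, \qquad \Sigma_{\text{new}} = \frac{\sum_i w_i\, \big(\theta^{[i]} - \mu_{\text{new}}\big)\big(\theta^{[i]} - \mu_{\text{new}}\big)^{\top}}{\sum_i w_i}$$

where p̃ is the previous policy reweighted by the transformed rewards.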

27.- Applications of policy search to complex robot skills, such as ball-in-a-cup and table tennis, to be covered next.

28.- Contextual policy search learns generalizable skills that adapt to different situations (e.g. goal locations).
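
A minimal sketch of a contextual upper-level policy (my own illustration; the linear-Gaussian form is a common choice, and all names are hypothetical): the mean of the parameter distribution depends linearly on the context, so a single learned policy covers a family of situations such as different goal locations.

```python
import numpy as np

class ContextualGaussianPolicy:
    """pi(theta | context): Gaussian whose mean is linear in the context."""
    def __init__(self, context_dim, param_dim, noise=0.05):
        self.K = np.zeros((param_dim, context_dim))   # context-dependent gain
        self.k = np.zeros(param_dim)                  # context-independent offset
        self.cov = noise * np.eye(param_dim)

    def sample_params(self, context):
        """Draw low-level policy parameters conditioned on the context."""
        mean = self.K @ context + self.k
        return np.random.multivariate_normal(mean, self.cov)

# Usage: each episode observes a context (e.g. target position), samples skill
# parameters for it, executes the low-level policy, and updates K, k from returns.
policy = ContextualGaussianPolicy(context_dim=2, param_dim=4)
theta = policy.sample_params(context=np.array([0.4, -0.1]))
```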

29.- Hierarchical policy search enables learning of high-level skill sequencing on top of low-level motor primitives.

30.- Model-based policy search to be covered, including PILCO and guided policy search via trajectory optimization.

Knowledge Vault built by David Vivancos 2024