Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Resume:
1.- Motivation: Autonomous robots will become more prevalent and require learning methods to acquire complex skills.
2.- Challenges: High-dimensional continuous state/action spaces, real-world data collection costs, safety concerns during exploration.
3.- Value-based RL: Estimate value function globally, updates can be unstable with function approximation, requires extensive exploration.
4.- Policy search principles: Parameterized policies, correlated exploration, local policy updates, fewer assumptions than value-based RL.
5.- Policy search taxonomy: Model-free vs model-based, different exploration and update methods within each category.
6.- Tutorial outline: Taxonomy, model-free methods (policy gradients, EM), extensions (contextual, hierarchical), model-based methods.
7.- Policy representations: Trajectory generators (e.g. DMPs), linear controllers, RBF networks, neural networks.
8.- Model-free: Use state, action, reward samples. Model-based: Learn a model, sample-efficient but sensitive to model errors.
9.- Step-based vs episode-based exploration: Step-based uses action space exploration, episode-based uses correlated parameter space exploration.
10.- Policy update: Direct policy optimization (gradient-based) or Expectation-Maximization (weighted maximum likelihood).
11.- Exploration must balance smoothness for safety against variability for learning; the change between successive policies is naturally measured by KL divergence.
12.- Correlated exploration in parameter space yields smoother trajectories than independent per-step exploration (compare the exploration sketch after this list).
13.- Conservative vs greedy policy updates illustrate the exploration-exploitation tradeoff; bounding the KL divergence balances smoothness and progress (see the KL sketch after this list).
14.- Policy gradients use the log-likelihood trick to estimate the gradient of expected return with respect to policy parameters (see the policy-gradient sketch after this list).
15.- Baseline subtraction reduces variance of policy gradient estimates without introducing bias. Optimal baseline approximates value function.
16.- Step-based policy gradients (REINFORCE) decompose return into per-step quantities, enabling use of state-action value function.
17.- Introducing a state-dependent baseline (e.g. value function) can further reduce variance of policy gradient estimates.
18.- Choice of metric (e.g. Euclidean vs Fisher information) impacts policy update step size and convergence behavior.
19.- Natural policy gradients use the Fisher information matrix to normalize the gradient, making updates invariant to reparameterization (see the natural-gradient sketch after this list).
20.- Natural actor-critic combines natural policy gradients with compatible function approximation of state-action value function.
21.- Incorporating a state-value function reduces variance of the advantage function (Q-V) used in actor-critic methods.
22.- Policy gradients can effectively learn motor skills like baseball swings but convergence may be slow.
23.- EM-based policy search (reward-weighted maximum likelihood) updates the policy to match the previous policy reweighted by the exponentiated reward (see the EM sketch after this list).
24.- EM-policy search works for both step-based and episode-based settings, relates to optimal control and information-theoretic methods.
25.- Weighting schemes for rewards include subtracting a baseline and rescaling to improve policy updates.
26.- Moment projection minimizes KL divergence between new and reweighted policies, yielding closed-form updates for Gaussians.
27.- Applications of policy search to complex robot skills such as ball-in-a-cup and table tennis, to be covered next.
28.- Contextual policy search learns generalizable skills that adapt to different situations, e.g. varying goal locations (see the contextual sketch after this list).
29.- Hierarchical policy search enables learning of high-level skill sequencing on top of low-level motor primitives.
30.- Model-based policy search to be covered, including PILCO and guided policy search via trajectory optimization.
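
Exploration sketch (items 9 and 12). A minimal comparison of step-based versus episode-based exploration for a linear-in-features trajectory generator: independent noise on every action produces a jagged command signal, while a single correlated perturbation of the parameters stays smooth. The RBF features, noise scales, and roughness measure are illustrative assumptions, not taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100
t = np.linspace(0.0, 1.0, T)
centers = np.linspace(0.0, 1.0, 10)
Phi = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.1) ** 2)   # RBF basis, shape (T, 10)
w_mean = np.zeros(10)                                               # nominal policy parameters

# Step-based exploration: independent noise is added to every action.
u_step = Phi @ w_mean + 0.3 * rng.normal(size=T)

# Episode-based exploration: the parameters are perturbed once and held for the whole episode.
w_sample = w_mean + 0.3 * rng.normal(size=10)
u_episode = Phi @ w_sample

# Roughness = mean squared difference between consecutive commands (lower is smoother).
print("step-based roughness:   ", np.mean(np.diff(u_step) ** 2))
print("episode-based roughness:", np.mean(np.diff(u_episode) ** 2))
```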
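
KL sketch (items 11 and 13). A small example of using the closed-form KL divergence between Gaussian (parameter) policies to quantify how aggressive an update is; a conservative, KL-bounded method would keep this quantity below some threshold. The example means and the interpolation factor are assumptions.

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

old_mu, old_cov = np.zeros(2), np.eye(2)
greedy_mu = np.array([3.0, -2.0])                      # jump straight to the current best sample
conservative_mu = old_mu + 0.1 * (greedy_mu - old_mu)  # small interpolated step toward it
print("greedy update KL:      ", kl_gaussians(old_mu, old_cov, greedy_mu, old_cov))
print("conservative update KL:", kl_gaussians(old_mu, old_cov, conservative_mu, old_cov))
```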
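
Policy-gradient sketch (items 14-17). A minimal likelihood-ratio ("REINFORCE"-style) gradient ascent on a hypothetical 1-D linear system with quadratic cost: actions are perturbed at every step (step-based exploration), and a near-optimal scalar baseline is subtracted from the return to reduce the variance of the gradient estimate without biasing it. The dynamics, horizon, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=20, sigma=0.5):
    """One episode with a linear-Gaussian policy a = theta*s + noise on a toy linear system."""
    s, ret, grad_logp = rng.normal(), 0.0, 0.0
    for _ in range(horizon):
        mean = theta * s
        a = mean + sigma * rng.normal()              # step-based (action-space) exploration
        grad_logp += (a - mean) * s / sigma**2       # d/dtheta log N(a | theta*s, sigma^2)
        ret += -(s**2 + 0.1 * a**2)                  # quadratic cost as negative reward
        s = 0.9 * s + a + 0.05 * rng.normal()        # simple linear dynamics
    return ret, grad_logp

theta, alpha, n_episodes = 0.0, 1e-3, 50
for it in range(200):
    samples = np.array([rollout(theta) for _ in range(n_episodes)])
    R, G = samples[:, 0], samples[:, 1]
    b = np.mean(G**2 * R) / np.mean(G**2)            # near-optimal scalar baseline
    theta += alpha * np.mean(G * (R - b))            # likelihood-ratio gradient ascent step
print("learned feedback gain:", theta)
```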
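
Natural-gradient sketch (items 18-19). The vanilla likelihood-ratio gradient of an episode-based Gaussian search distribution is preconditioned with an empirical Fisher information matrix, so the step depends on the change in the distribution rather than on the particular parameterization. The quadratic objective, sample size, and step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def episode_return(theta):
    # Hypothetical episodic objective with its maximum at theta = (1.0, -1.0).
    return -np.sum((theta - np.array([1.0, -1.0])) ** 2)

mu, sigma = np.zeros(2), np.full(2, 0.3)              # search distribution N(mu, diag(sigma^2))
for it in range(100):
    thetas = mu + sigma * rng.normal(size=(200, 2))   # parameter-space exploration
    R = np.array([episode_return(th) for th in thetas])
    scores = (thetas - mu) / sigma**2                 # grad_mu log N(theta | mu, diag(sigma^2))
    grad = scores.T @ (R - R.mean()) / len(R)         # vanilla gradient with a mean baseline
    fisher = scores.T @ scores / len(R)               # empirical Fisher information matrix
    nat_grad = np.linalg.solve(fisher, grad)          # F^{-1} grad: metric-aware update direction
    mu = mu + 0.5 * nat_grad
print("final mean parameters:", mu)
```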
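
EM sketch (items 23-26). Episode-based EM policy search: returns are baseline-shifted and exponentiated to form weights, and the Gaussian search distribution is refit by weighted maximum likelihood, i.e. the closed-form moment projection of item 26. The task, inverse temperature, and regularizer are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def episode_return(theta):
    # Hypothetical episodic task whose reward peaks at theta = (0.5, -1.0).
    return -np.sum((theta - np.array([0.5, -1.0])) ** 2)

mu, cov = np.zeros(2), np.eye(2)                      # Gaussian search distribution over parameters
beta = 3.0                                            # inverse temperature of the reward weighting
for it in range(30):
    thetas = rng.multivariate_normal(mu, cov, size=100)     # correlated parameter-space exploration
    R = np.array([episode_return(th) for th in thetas])
    w = np.exp(beta * (R - R.max()))                  # exponentiated, baseline-shifted returns
    w /= w.sum()
    mu = w @ thetas                                   # weighted ML mean (closed-form M-step)
    diff = thetas - mu
    cov = diff.T @ (diff * w[:, None]) + 1e-6 * np.eye(2)   # weighted ML covariance, regularized
print("learned mean parameters:", mu)
```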
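
Contextual sketch (item 28). A linear-Gaussian upper-level policy whose mean is a linear function of context features; the same reward weighting as in the EM sketch refits it by weighted linear regression, so the learned skill generalizes over contexts such as goal locations. The task family, feature choice, and temperature are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def episode_return(theta, context):
    # Hypothetical task family: the best parameters depend linearly on the context.
    target = np.array([2.0 * context, -context])
    return -np.sum((theta - target) ** 2)

W = np.zeros((2, 2))                                   # upper-level policy: theta ~ N(W @ [1, c], cov)
cov = np.eye(2)
for it in range(50):
    contexts = rng.uniform(-1.0, 1.0, size=100)        # e.g. varying goal locations
    Phi = np.stack([np.ones_like(contexts), contexts], axis=1)   # context features [1, c]
    thetas = Phi @ W.T + rng.multivariate_normal(np.zeros(2), cov, size=100)
    R = np.array([episode_return(th, c) for th, c in zip(thetas, contexts)])
    w = np.exp(3.0 * (R - R.max()))                    # reward weights, as in the EM sketch
    w /= w.sum()
    WPhi = Phi * w[:, None]
    A = Phi.T @ WPhi + 1e-6 * np.eye(2)                # weighted normal equations (ridge-regularized)
    W = np.linalg.solve(A, WPhi.T @ thetas).T          # weighted least squares for the context map
    diff = thetas - Phi @ W.T
    cov = diff.T @ (diff * w[:, None]) + 1e-6 * np.eye(2)
print("learned context-to-parameter map W:\n", W)
```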
Knowledge Vault built by David Vivancos 2024