Model-Free Learning - Estimate the value function of an unknown MDP
Topics:
Monte Carlo
Temporal-Difference Learning
TD(λ)
Assumptions:
The policy π is given
The dynamics of the environment (transition probabilities and reward function) are unknown
The goal is to find out how good the states are under policy π (EVALUATION).

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
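A minimal first-visit Monte Carlo prediction sketch of this idea. The `env` (with `reset()` and `step(action) -> (next_state, reward, done)`) and the `policy(state) -> action` callable are hypothetical placeholders, not a specific library API:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo prediction: estimate V under a fixed policy
    by averaging sample returns. env and policy are assumed interfaces."""
    returns = defaultdict(list)   # state -> list of observed returns
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one full episode by following the given policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk the episode backwards, accumulating the return G.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only record G for the earliest visit to s.
            if s not in [x[0] for x in episode[:t]]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```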
If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning
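A minimal TD(0) prediction sketch illustrating the bootstrapped update, assuming the same hypothetical `env`/`policy` interface as the Monte Carlo sketch:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) prediction: move V(s) toward the bootstrapped target r + gamma * V(s'),
    updating after every step instead of waiting for the episode to finish."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: use the current estimate of the next state's value.
            target = reward + (0.0 if done else gamma * V[next_state])
            # TD error = target - current estimate; step the estimate by alpha.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```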
TD(λ) algorithm:
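A minimal backward-view TD(λ) sketch with accumulating eligibility traces, again assuming the hypothetical `env`/`policy` interface from the sketches above; `lam = 0` reduces to TD(0), while `lam` close to 1 approaches Monte Carlo:

```python
from collections import defaultdict

def td_lambda_policy_evaluation(env, policy, num_episodes=1000,
                                alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): every state carries an eligibility trace e(s),
    and each TD error is broadcast to all states in proportion to their traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        eligibility = defaultdict(float)   # e(s), reset at the start of each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD error with a bootstrapped target.
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            eligibility[state] += 1.0      # accumulating trace for the visited state
            # Credit assignment: every eligible state is updated, then its trace decays.
            for s in list(eligibility):
                V[s] += alpha * delta * eligibility[s]
                eligibility[s] *= gamma * lam
            state = next_state
    return V
```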