Model-Free Learning - Estimate the value function of an unknown MDP
Topics:
Monte Carlo
Temporal-Difference Learning
TD(λ)
Assumptions:
The policy π is given
The dynamics of the environment (transition probabilities and reward function) are unknown
The goal is to find out how good the states are under policy π (EVALUATION).

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
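A minimal first-visit Monte Carlo prediction sketch of this idea. The `env` (with `reset()` and `step(action) -> (next_state, reward, done)`) and the `policy(state) -> action` callable are hypothetical placeholders, not a specific library API:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo prediction: estimate V under a fixed policy
    by averaging sample returns. env and policy are assumed interfaces."""
    returns = defaultdict(list)   # state -> list of observed returns
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one full episode by following the given policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk the episode backwards, accumulating the return G.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only record G for the earliest visit to s.
            if s not in [x[0] for x in episode[:t]]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```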
If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning
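A minimal TD(0) prediction sketch illustrating the bootstrapped update, assuming the same hypothetical `env`/`policy` interface as the Monte Carlo sketch:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) prediction: move V(s) toward the bootstrapped target r + gamma * V(s'),
    updating after every step instead of waiting for the episode to finish."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: use the current estimate of the next state's value.
            target = reward + (0.0 if done else gamma * V[next_state])
            # TD error = target - current estimate; step the estimate by alpha.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```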
TD(λ) algorithm:
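A minimal backward-view TD(λ) sketch with accumulating eligibility traces, again assuming the hypothetical `env`/`policy` interface from the sketches above; `lam = 0` reduces to TD(0), while `lam` close to 1 approaches Monte Carlo:

```python
from collections import defaultdict

def td_lambda_policy_evaluation(env, policy, num_episodes=1000,
                                alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): every state carries an eligibility trace e(s),
    and each TD error is broadcast to all states in proportion to their traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        eligibility = defaultdict(float)   # e(s), reset at the start of each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD error with a bootstrapped target.
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            eligibility[state] += 1.0      # accumulating trace for the visited state
            # Credit assignment: every eligible state is updated, then its trace decays.
            for s in list(eligibility):
                V[s] += alpha * delta * eligibility[s]
                eligibility[s] *= gamma * lam
            state = next_state
    return V
```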