
Model-Free Learning - Estimate the value function of an unknown MDP


Topics:

  1. Monte Carlo

  2. Temporal-Difference Learning

  3. TD(λ)

Assumptions:

  • The policy is given

  • The dynamics of the environment (transition probabilities and reward function) are unknown

The goal is to find out how good the states are under policy π (evaluation). Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
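As a rough illustration, here is a minimal sketch of first-visit Monte Carlo prediction. The Gym-like `env.reset()`/`env.step()` interface (returning a simplified 3-tuple) and the `policy(state)` callable are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte Carlo prediction: estimate V_pi by averaging sample returns.

    Assumes an illustrative Gym-like interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one full episode following the given policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit: only count the first occurrence of s in the episode.
            if s not in {episode[i][0] for i in range(t)}:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

Note that the value estimates only change once an episode has terminated, since the return G is only known at the end.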

Temporal-Difference Learning

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning.
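A minimal sketch of tabular TD(0) prediction under the same illustrative `env`/`policy` assumptions as above: after every step, V(s) is moved toward the TD target r + γ·V(s'), which bootstraps from the current estimate of the next state.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) prediction: update V(s) toward the TD target after every step,
    without waiting for the episode to finish (same illustrative interface as above)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from the current estimate of the next state.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]   # estimate after the step minus estimate before
            V[state] += alpha * td_error
            state = next_state
    return V
```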

TD(λ) algorithm:

  • In first-visit Monte Carlo the state counter is incremented only once per episode; in every-visit Monte Carlo it can be incremented at every time step the state is visited.

  • The TD error is the difference between the TD target (the reward plus the discounted value estimate of the next state, observed after taking a step) and the estimate we had before the step.

  • In Monte Carlo we have to wait until the end of the episode to make updates, whereas in TD we can update at every step.
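Below is a backward-view TD(λ) sketch with accumulating eligibility traces, again under the same illustrative interface assumptions. Each TD error is distributed over all recently visited states in proportion to their trace, so credit propagates further back than one step without waiting for the episode to end; λ = 0 recovers TD(0), while λ = 1 behaves like a Monte Carlo-style update.

```python
from collections import defaultdict

def td_lambda_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda): every state keeps an eligibility trace, and each
    TD error updates all recently visited states in proportion to their trace."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        eligibility = defaultdict(float)   # traces are reset at the start of each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)

            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]

            # Accumulating trace: bump the visited state, then update and decay all traces.
            eligibility[state] += 1.0
            for s in list(eligibility):
                V[s] += alpha * td_error * eligibility[s]
                eligibility[s] *= gamma * lam

            state = next_state
    return V
```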