Solve a known MDP - dynamic programming:

Value and Policy Iteration

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. The methods covered in later sections can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

Policy evaluation refers to the (typically) iterative computation of the value functions for a given policy. Policy improvement refers to the computation of an improved policy given the value function for that policy. Putting these two computations together, we obtain policy iteration and value iteration, the two most popular DP methods.
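
Value iteration is named here but not worked through below, so a minimal sketch may help: it repeatedly applies a Bellman optimality backup (a max over actions) until the value table stops changing, then reads off a greedy policy. The tabular `P[s][a]` layout of (probability, next state, reward) triples is an assumption chosen for this sketch, not an API from the text.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration on a tabular MDP.

    P[s][a] is a list of (prob, next_state, reward) triples -- a hypothetical
    layout chosen for this sketch.
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: take the best one-step lookahead value.
            q = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # value table has stopped changing
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)
        ]))
        for s in range(n_states)
    ])
    return V, policy
```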

These methods are guaranteed to converge in a tabular setting, where all states and actions can be listed in a table. Policy iteration: you maintain an estimate of how good the current policy is (its value function) and use that estimate to improve the policy over time (see the policy iteration sketch at the end of this page).

Policy Evaluation:

The value function is used to evaluate a given policy π. For example, for a uniform random policy π, we compute the value of each state using the Bellman expectation equation, which provides a mechanism for a one-step lookahead.

The value of a state is the immediate reward you get plus the (discounted) value given by the value function applied to the next state s’.
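
In symbols, this is the standard Bellman expectation equation; the discount factor γ and the transition probabilities p(s', r | s, a) come from the usual MDP setup rather than being defined on this page:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]
$$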

Policy iteration is a process of improving a policy (both the input and the output are policies). In iterative policy evaluation, we plug the values from the previous iteration in at the leaves of the lookahead to compute the new values (see the example below).
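
Before the worked example, here is a minimal sketch of iterative policy evaluation on a tabular MDP. The `P[s][a]` list of (probability, next state, reward) triples and the `policy[s][a]` probability table are assumed layouts for illustration, not an API from the text.

```python
import numpy as np

def policy_evaluation(P, policy, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Iteratively evaluate a stochastic policy on a tabular MDP.

    P[s][a] is a list of (prob, next_state, reward) triples and
    policy[s][a] is the probability of taking action a in state s --
    a hypothetical layout chosen for this sketch.
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One-step lookahead: plug the previous values in at the leaves.
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # values have stopped changing
            return V
```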

Example:

The value v(s) is calculated by doing a one-step lookahead; once we have the values, we can act greedily with respect to them (estimate and act greedily).
The value of a state is given by a one-step lookahead: we consider all the actions the agent might take and the probability of those actions leading to particular successor states (the randomness introduced by the environment), look at the values of the successor states, and back them all up, summing them weighted by the probabilities. This is the expectation in the Bellman expectation equation (which can also be written in vector form).
The values given by the value function of the random policy give us a better policy when we take those values and act greedily.
This process of alternating evaluation and improvement eventually leads to the optimal policy.
We define acting greedily as picking actions that give us the maximum action value (a code sketch of this evaluate-and-improve loop follows below).
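
Putting the evaluate-and-act-greedily loop from the example into code, here is a minimal sketch of policy iteration. It reuses the `policy_evaluation` sketch from earlier on this page, and the `P[s][a]` structure is the same assumed (probability, next state, reward) layout, not an API from any particular library.

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.9):
    """Act greedily: for each state, pick the action with the maximum
    one-step-lookahead action value under the current value estimates."""
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = [
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)
        ]
        policy[s][int(np.argmax(q))] = 1.0  # deterministic greedy choice
    return policy

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start from the uniform random policy, as in the example above.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        V = policy_evaluation(P, policy, n_states, n_actions, gamma)  # estimate
        new_policy = greedy_policy(P, V, n_states, n_actions, gamma)  # act greedily
        if np.array_equal(new_policy, policy):  # no change => optimal policy
            return new_policy, V
        policy = new_policy
```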