Solving a known MDP with dynamic programming:

Value and Policy Iteration

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. All of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

Policy evaluation refers to the (typically) iterative computation of the value functions for a given policy. Policy improvement refers to the computation of an improved policy given the value function for that policy. Putting these two computations together, we obtain policy iteration and value iteration, the two most popular DP methods.
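Of the two, value iteration is the more compact to state: it folds the evaluation and improvement steps into a single Bellman optimality backup, V(s) ← max_a Σ_{s'} p(s'|s,a) [r + γ V(s')]. Below is a minimal sketch in Python under assumed conventions (not from the source): the model is a hypothetical dict P[s][a] giving a list of (prob, next_state, reward) tuples, states are numbered 0..n-1, gamma is the discount factor, and theta is a convergence threshold.

```python
# Minimal value iteration sketch for a tabular MDP.
# Assumed (hypothetical) model format: P[s][a] -> list of (prob, next_state, reward).

def value_iteration(P, gamma=0.9, theta=1e-8):
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions of the one-step lookahead.
            q_values = [
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in P[s]
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged values.
    policy = {}
    for s in range(n_states):
        policy[s] = max(
            P[s],
            key=lambda a: sum(prob * (reward + gamma * V[s_next])
                              for prob, s_next, reward in P[s][a]),
        )
    return V, policy
```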

These methods converge in the tabular setting, where all states and actions can be listed in a table. Policy iteration: you maintain an estimate of the current policy's value and use it to improve the policy over time.

Policy Evaluation:

The value function is used to evaluate a given policy pi. For example, for a random policy pi, we compute the value of each state using the Bellman equation, which provides a mechanism for a one-step lookahead.

The value of a state is the immediate reward you get plus the (discounted) value of the next state s', as given by the value function.
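A minimal sketch of iterative policy evaluation in Python, under the same hypothetical model format as above (P[s][a] -> list of (prob, next_state, reward), with policy[s] mapping each action to its probability pi(a|s)): each sweep applies the Bellman expectation backup V(s) ← Σ_a pi(a|s) Σ_{s'} p(s'|s,a) [r + γ V(s')] until the values stop changing.

```python
# Iterative policy evaluation: compute V^pi for a fixed policy pi.
# Assumed (hypothetical) formats: P[s][a] -> list of (prob, next_state, reward),
# policy[s] -> {action: probability}.

def policy_evaluation(policy, P, gamma=0.9, theta=1e-8):
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # One-step lookahead, weighted by the policy's action probabilities.
            v_new = sum(
                pi_a * sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V
```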

Policy iteration is a process of improving a policy (input and output are both policies). We plug in the previous iteration's values at the leaves of the lookahead to compute the new values (see example below).
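A minimal policy iteration sketch, again under the hypothetical model format assumed above: alternate policy evaluation (using the policy_evaluation sketch) with greedy policy improvement until the policy stops changing.

```python
# Policy iteration: alternate evaluation and greedy improvement until stable.
# Reuses policy_evaluation() and the hypothetical P[s][a] model format from above.

def policy_improvement(V, P, gamma=0.9):
    # Make the policy greedy with respect to the current value estimates.
    new_policy = {}
    for s in range(len(P)):
        best_a = max(
            P[s],
            key=lambda a: sum(prob * (reward + gamma * V[s_next])
                              for prob, s_next, reward in P[s][a]),
        )
        new_policy[s] = {best_a: 1.0}  # deterministic greedy policy
    return new_policy

def policy_iteration(P, gamma=0.9):
    # Start from a uniform random policy over each state's available actions.
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in range(len(P))}
    while True:
        V = policy_evaluation(policy, P, gamma)
        new_policy = policy_improvement(V, P, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```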

Example:
