
Open Problems

Challenges: exploration vs exploitation, scalability, convergence guarantees, and settings where the Markov assumption does not hold (e.g. multi-agent systems).

There are two broad approaches to solving an RL problem. The first is to search in the space of behaviours in order to find one that performs well in the environment. The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.
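As a rough illustration of both approaches, here is a minimal sketch assuming a small tabular problem: a naive random search over policies, and a one-step Q-learning update that estimates the utility of actions in states. The table sizes, the learning rate and discount, and the `evaluate_return` helper are hypothetical placeholders, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# Approach 1: search directly in the space of behaviours (policies).
# A policy here is just a table mapping states to actions; we keep the
# candidate with the highest evaluated return (evaluate_return is a
# hypothetical function that rolls the policy out in the environment).
def random_policy_search(evaluate_return, n_candidates=100):
    best_policy, best_return = None, -np.inf
    for _ in range(n_candidates):
        policy = rng.integers(n_actions, size=n_states)
        ret = evaluate_return(policy)
        if ret > best_return:
            best_policy, best_return = policy, ret
    return best_policy

# Approach 2: estimate the utility of taking actions in states
# (one-step tabular Q-learning update).
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```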

  1. Balancing exploration vs exploitation: the agent must exploit actions it already knows to be rewarding while still exploring to discover potentially better ones (a minimal ε-greedy sketch follows this list).

  2. The optimal policy must be inferred by trial-and-error interaction with the environment; the only learning signal the agent receives is the reward.
     • The observations of the agent depend on its actions and can contain strong temporal correlations.
     • Agents must deal with long-range time dependencies: often the consequences of an action only materialise after many transitions of the environment. This is known as the (temporal) credit assignment problem (a discounted-return sketch follows this list).

  3. State space: scalability challenges, since large or continuous state spaces make exact tabular methods intractable.

  4. Theoretical guarantees of convergence, which largely break down once approximation methods are introduced.

  5. The Markov assumption made by the majority of RL algorithms is somewhat unrealistic, as it requires the state to be fully observable. In multi-agent systems the assumption is violated and these algorithms tend to perform poorly; moreover, the environment becomes dynamic from each agent's perspective, i.e. the state transition probabilities can change over time.
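For item 1, here is a minimal sketch of the standard ε-greedy rule, one common way to trade off exploration and exploitation; the action values and ε used below are illustrative, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (explore),
    otherwise take the currently best-valued action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: estimated action values for a single state of a hypothetical 3-action problem.
action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)
```

For item 2, a small sketch of discounted returns, which is one way to see the temporal credit assignment problem: a reward that only arrives after many transitions still assigns (discounted) credit to every earlier step. The reward sequence and discount factor are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each time step,
    working backwards through the episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A sparse, delayed reward: only the final transition pays off,
# yet every earlier step receives some discounted credit.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))
# approximately [0.729, 0.81, 0.9, 1.0]
```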

https://youtu.be/fIKkhoI1kF4