RL World

  • MDP World:
Source: http://incompleteideas.net/book/bookdraft2017nov5.pdf
  • Bandit World
  • Action-value methods and gradient bandit methods
  • Model-Based:
  • Model-Free:
  • Value-based: In this method we estimate the value function, i.e. the expected future reward, because in the end we want to maximize reward. The policy is implicit: the agent acts by picking the action with the highest estimated value.
  • Policy-based: In this method we search for the best policy directly, so the agent simply learns how to behave optimally in the environment. In principle it is the most direct approach, since the policy is what we ultimately want. There is no value function; we parametrize the policy and optimize its parameters.
  • Actor-Critic: This method combines the two methods above. It uses a value function in addition to the policy function, and the value function is used to update the policy. The "Critic" estimates the value function, and the "Actor" updates the policy distribution in the direction suggested by the Critic.
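
As a rough illustration of the actor-critic idea, here is a minimal sketch in Python. Everything specific in it is an assumption made for the example: the task has a single state and ends after one action, the rewards (0.0 and 1.0) are fixed, the actor is a softmax over action preferences, and the names run_actor_critic, prefs and value are hypothetical. With a one-step episode the critic's TD error reduces to reward minus value.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_actor_critic(rewards=(0.0, 1.0), episodes=2000,
                     alpha_actor=0.1, alpha_critic=0.1, seed=0):
    """One-step actor-critic on a hypothetical one-state, one-step task.

    rewards[a] is the fixed reward for action a. The episode ends right
    after the action, so the TD error is simply reward - value.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # actor: policy parameters
    value = 0.0                   # critic: estimated value of the state
    for _ in range(episodes):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        td_error = reward - value         # critic's feedback to the actor
        value += alpha_critic * td_error  # critic update
        for a in range(len(prefs)):       # actor update (softmax policy gradient)
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += alpha_actor * td_error * grad_log
    return softmax(prefs), value

policy, value = run_actor_critic()
print(policy, value)
```

The learned policy should put most of its probability on the action with the higher reward, while the critic's estimate moves toward the reward that policy collects.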

Bellman Equation:
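
In the notation of the Sutton and Barto draft linked above (an assumption, since the post does not fix its own symbols), the Bellman expectation equation for the state value of a policy π is usually written as follows, with p the environment dynamics and γ the discount factor:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r + \gamma \, v_\pi(s') \right]
```

It says that the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over the actions the policy chooses and the transitions the environment produces.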

Deterministic vs Stochastic Policy

  • Deterministic:
    - A deterministic policy maps each state to a single action, and that is the action the agent takes.
    - It is used in deterministic environments, e.g. chess.
  • Stochastic:
    - A stochastic policy gives a probability distribution over the actions in each state, so the agent may take different actions on different visits to the same state.
    - It is used when the environment is uncertain.
  • Behavior policy is the policy the agent uses to choose its actions in a given state.
  • Target policy is the policy the agent learns about: it is evaluated and improved from the rewards received for those actions, and it defines the target used to update the Q-value.
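
The difference between the two kinds of policy can be sketched in a few lines of Python. The action-value table q_values and the function names are hypothetical, and epsilon-greedy is used here only as one common example of a stochastic policy. The same pair also shows a typical behavior/target split: the agent explores with the epsilon-greedy policy while the greedy policy is the one being learned.

```python
import random

# Hypothetical action-value table for one state: action -> estimated value.
q_values = {"left": 0.2, "right": 0.7, "stay": 0.1}

def deterministic_policy(q):
    """Return the single greedy action for these action values."""
    return max(q, key=q.get)

def stochastic_policy(q, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(list(q))
    return deterministic_policy(q)

rng = random.Random(0)
greedy_action = deterministic_policy(q_values)
sampled_actions = [stochastic_policy(q_values, 0.3, rng) for _ in range(1000)]
print(greedy_action, {a: sampled_actions.count(a) for a in q_values})
```

The deterministic policy always returns the same action for this state, while the stochastic one mostly agrees with it but sometimes picks something else.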

Value-based methods are divided into 2 types:

  • On-Policy: The target policy is the same as the behavior policy (e.g. SARSA).
  • Off-Policy: The target policy is different from the behavior policy (e.g. Q-learning).
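
A small Python sketch of how this shows up in the update target, using SARSA and Q-learning as the standard on-policy and off-policy examples. The function names, the next-state table next_q and the numbers are made up for illustration.

```python
def sarsa_target(reward, next_q, next_action, gamma):
    """On-policy target: uses the action the behavior policy actually takes next."""
    return reward + gamma * next_q[next_action]

def q_learning_target(reward, next_q, gamma):
    """Off-policy target: uses the greedy action, whatever the agent does next."""
    return reward + gamma * max(next_q.values())

def td_update(q_sa, target, alpha):
    """Move the current estimate a fraction alpha toward the target."""
    return q_sa + alpha * (target - q_sa)

# Hypothetical action values in the next state.
next_q = {"left": 1.0, "right": 3.0}

# Suppose the behavior policy explores and picks "left" in the next state.
on_policy = sarsa_target(reward=1.0, next_q=next_q, next_action="left", gamma=0.5)
off_policy = q_learning_target(reward=1.0, next_q=next_q, gamma=0.5)
print(on_policy, off_policy)
```

The two targets differ only when the behavior policy does something other than the greedy action, which is exactly the case where the target and behavior policies disagree.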

Policy-based methods are divided into 2 types:

  • Gradient-free: In this technique a population of parameter vectors is mutated randomly, and the reward is calculated for each one. The parameter vectors with the highest reward are recombined to select the next parameter vector.
  • Gradient-based: In this technique we update the parameter vector by searching for a local maximum of the reward, stepping in the direction that improves it. As a heuristic, if the reward increases we can either maintain the step size or decrease it; otherwise we increase the step size until we get higher rewards. This technique relies on optimizing parametrized policies with respect to the expected return (which we want to maximize), so it models and optimizes the policy directly.
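
The two families can be contrasted on a toy problem. The sketch below is only an illustration under loud assumptions: expected_return is a hypothetical stand-in for a policy's expected return (a simple curve that peaks at theta = 3), the policy has a single parameter, recombination is done by averaging the best candidates, and the gradient version uses a fixed step size instead of the step-size heuristic described above.

```python
import random

def expected_return(theta):
    """Hypothetical stand-in for a policy's expected return; peaks at theta = 3."""
    return -(theta - 3.0) ** 2

def gradient_free_search(generations=50, population=20, elite=5, sigma=0.5, seed=0):
    """Mutate a population, keep the best members, recombine them by averaging."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(generations):
        candidates = [theta + rng.gauss(0.0, sigma) for _ in range(population)]
        candidates.sort(key=expected_return, reverse=True)
        theta = sum(candidates[:elite]) / elite
    return theta

def gradient_ascent(steps=100, step_size=0.1):
    """Follow the analytic gradient of the stand-in return with a fixed step size."""
    theta = 0.0
    for _ in range(steps):
        gradient = -2.0 * (theta - 3.0)
        theta += step_size * gradient
    return theta

theta_es = gradient_free_search()
theta_ga = gradient_ascent()
print(theta_es, theta_ga)
```

Both searches should end near theta = 3. The gradient-free one only needs to evaluate the return, while the gradient-based one needs its derivative, which real policy gradient methods estimate from sampled episodes.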

State Value Function
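
A common way to write it, assuming Sutton and Barto's notation with discount factor γ, reward R and return G_t (the symbols are not fixed by the post itself):

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]
```

In words, it is the expected discounted return when the agent starts in state s and follows policy π afterwards.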

State Action Value Function ( Q function)
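
A common way to write it, assuming Sutton and Barto's notation with discount factor γ, reward R and return G_t (the symbols are not fixed by the post itself):

```latex
q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right]
            = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]
```

In words, it is the expected discounted return when the agent starts in state s, takes action a first, and follows policy π afterwards.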





