# On-Policy First-Visit Monte Carlo

Summary and notes on On-Policy First-Visit Monte Carlo method, based on course offered at McGill, in addition to chapter 5 of Richard S. Sutton and Andrew G. Barto, “Reinforcement learning: An introduction”, Second Edition, MIT Press.

One way to get an estimate of value functions to discover optimal policies is to use the so-called Monte Carlo Methods. These methods make the agent learn from the actual experience it encounters. Furthermore, an interesting property of this method is that it doesn’t require any prior knowledge of the environment the agent will be interacting with.

Other variants of Monte Carlo methods such as Every-Visit and off-policy (either First/Every visit) are not explored in this post. Refer to the book for more details.

### General idea

Similarly to the bandits algorithms and problems, the agent will create a value function based on reward from each action generated by its experience. We will be using the concept of general policy iteration (GPI) to update and determine the desired value functions. As mentioned above, the main difference is that we will be estimating the value functions based on the sample average returns, which over time, should converge to the desired expected value.

The main concept of Monte Carlo methods, is that the updates to the value-functions are made after the whole episode has been generated. This can be both an advantage and a disadvantage depending on the problem on hand.

### Computing the state-values $v_\pi(s)$

We know that during an episode of the task, multiple states will be visited. It is possible, depending on the problem on hand, that the same state is visited multiple times. Let us consider the method known as First Visit MC method, where $v_\pi(s)$ is determined as the average reward following the first visit to state $s$.

The following is the algorithm of first-visit MC policy evaluation.

The algorithm in words:

1. Initialize the policy, state-value function
2. Start by generating an episode according to the current policy
1. Keep track of the states encountered through that episode
3. Select a state in 2.1
1. Add to a list the return received after first occurence of this state
2. Average over all returns
3. Set the value of the state as that computed average
4. Repeat step 3
5. Repeat 2-4 until satisfied

Using this algorithm, the estimated value-state function $V(s)$ will converge to the true value-state function $v_\pi(s)$.

### Computing the action-values with Monte Carlo methods

So far, we have explored a way to evaluate the state-value by experience through Monte Carlo method.

An alternative that doesn’t require to deal with the value of the state $V(s)$, is to focus on the value of the (state, action) pair $Q(s, a)$. The subtle difference that yet makes it useful is that it goes through each pair of $latex (s, a)$ that it encountered in an episode, therefore limiting the need for a full model of the state/actions and can draw purely from experience.

In addition, after updating the value of each pair, we can improve the policy for all states encountered in the episode by selecting the most valued action according to our newly updated action-value function.

However, in order to ensure improvement of the policy and convergence towards the optimal policy, consider the explore/exploit dilemma encountered in the bandits algorithms. One way to ensure exploration under this method is to select a state at random as the initial state for an episode. Using such an approach is called using exploring starts. Combining exploring starts with the action-value improvement method detailed, we end up with the Monte Carlo Exploring Starts (MC-ES) method.

### On-policy First-Visit

One approach to ensure exploration without exploring starts is to consider the epsilon-greedy (discussed in bandits algorithms) algorithm combined with the MC method. Under such an approach, rather than arbitrarily initializing the policy, we could initialize it to a non-deterministic policy and maintain its non-deterministic property throughout training. The latter would ensure improvement to the policy, otherwise setting the policy as a purely greedy one would stall the improvements.

This is where the epsilon-greedy comes into play. Here is a figure from the book with the general on-policy MC method.

Figure credit: Richard S. Sutton and Andrew G. Barto, “Reinforcement learning: An introduction”, Second Edition, MIT Press

Conceptually, we do the same as the MC-ES method, but now instead of initiating a random state, we replace that exploration factor with the epsilon-greedy approach.