# Bandit algorithms

Summary and notes of the first class of the Reinforcement Learning (RL) course offered at McGill held on January 6th, in addition to chapter 2 of Richard S. Sutton and Andrew G. Barto, “Reinforcement learning: An introduction”, Second Edition, MIT Press.

### Evaluative vs instructive feedback

Evaluative feedback indicates how good the action taken was, whereas instructive feedback simply states the correct action, independently of what was actually done. This can be compared (in some ways) to feedback on a homework:

• Teacher that simply gives the right answer — instructive
• Teacher that comments on how good your answer was — evaluative

Much of the prior work on evaluative feedback has been done in a non-associative setting. This is a simplified setting that restricts the learning to one situation (i.e. it does not place the agent in multiple situations). This is of course a simplified view of most problems, but it allows us to explore the fundamentals of evaluative feedback.

### K-armed bandits problem

The basis of the k-armed bandit problem is that you have a set of actions and a fixed number of selections to make (e.g. 100 presses of a button). At each time step, one action is selected from the set and executed, and a reward is provided based on that action. The reward is non-deterministic: it is drawn from a probability distribution that depends on the action.

A typical example of this type of problem is having $k$ slot machines, where each of the $k$ machines represents a possible action. Each machine has its own expected payout, which we describe as the expected reward given that that machine is played.
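To make the setup concrete, a minimal testbed can be sketched in Python (the class and parameter names here are illustrative, not from the course or the book):

```python
import random

class KArmedBandit:
    """A k-armed bandit testbed: each arm's reward is drawn from a
    normal distribution around a fixed, hidden mean q*(a)."""

    def __init__(self, k=10, seed=0):
        rng = random.Random(seed)
        # Hidden true action values q*(a), one per arm.
        self.q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]
        self.rng = rng
        self.k = k

    def pull(self, a):
        # Reward is noisy: mean q*(a), unit variance.
        return self.rng.gauss(self.q_star[a], 1.0)
```

Averaging many pulls of one arm recovers its hidden value, which is exactly what the action-value estimates below try to do online.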

More formally, we denote the value of an action $a$ as $q_*(a) = \mathbb{E} [R_t \vert A_t = a]$: the expected reward given that we select that action. We usually do not know the value of an action with certainty, and therefore cannot use $q_*(a)$ directly. We instead use the notation $Q_t(a)$ for the estimate of the value of action $a$ at time $t$.

With regards to the explore/exploit dilemma: in this setting, selecting the action with the highest estimated value is called exploitation, whereas selecting any other action is called exploration.

### Action-value method

The simplest estimate of the value of a given action is the sum of the rewards received when that action was selected, divided by the number of times it was selected. Such a method is often called the sample-average method, as it takes the actual sample mean.

Tracking these values can also be done incrementally, as opposed to fully recomputing them at each step. Instead of the full $Q_{n+1} = \frac{1}{n} \sum\limits_{i=1}^n R_i$, the estimate can be written as $Q_{n+1} = Q_n + \frac{1}{n} \Big [ R_n - Q_n \Big ]$.
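The incremental form can be checked directly against the plain mean; a minimal sketch (function name is my own choice):

```python
def update(q, n, reward):
    """Incremental sample-average update: Q_{n+1} = Q_n + (1/n)(R_n - Q_n).
    `n` is the number of times the action has now been selected."""
    return q + (reward - q) / n

# Running this over a reward sequence reproduces the plain sample mean.
rewards = [1.0, 3.0, 2.0]
q = 0.0
for i, r in enumerate(rewards, start=1):
    q = update(q, i, r)
# q is now (1 + 3 + 2) / 3 = 2.0
```

The advantage is constant memory: only the current estimate and the count need to be stored, not the full reward history.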

A way to see the incremental update to the value of an action is through a step-size parameter $\alpha$, which can also be a function of $t$ and $a$.

Under a stationary problem, $\alpha$ in its form $\frac{1}{n}$ gives each new reward less and less influence as $n$ grows, which is appropriate since all samples are equally informative. Under a non-stationary problem, however, to ensure that new information is given appropriate weight/importance, $\alpha$ is usually held constant. This makes the estimated value of an action a weighted average of all previous rewards, with recent rewards counting more.
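Unrolling the constant-$\alpha$ recursion makes this weighting explicit (a standard expansion of the update rule above):

$$
Q_{n+1} = Q_n + \alpha \big[ R_n - Q_n \big] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
$$

The weights $(1-\alpha)^n$ and $\alpha(1-\alpha)^{n-i}$ sum to one, and the weight on a reward decays exponentially with its age, which is why this is sometimes called an exponential recency-weighted average.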

### $\epsilon$-greedy algorithm

In order to explore and not only exploit actions, one must sometimes choose not to take the action with the highest estimated value. A method that can be applied is the $\epsilon$-greedy method. Under this method, with probability $\epsilon$ an action is selected uniformly at random (possibly the greedy one), and with probability $1 - \epsilon$ the greedy action is taken. This ensures that every action keeps a chance of being explored at each step.
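A minimal sketch of the selection rule (the function name and first-index tie-breaking are my own choices):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon; otherwise pick
    uniformly at random (the random pick may also be the greedy action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy: argmax over estimated values, ties broken by first index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Passing a seeded `random.Random` instance for `rng` makes experiments reproducible.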

If the reward distributions have high variance, the $\epsilon$-greedy method will tend to perform better than pure exploitation, because more exploration is needed before the optimal action can be identified. With such volatile rewards, a pure greedy algorithm (or one with a very low $\epsilon$) could observe a low sample from the optimal action early on and rarely, if ever, come back to it.

In addition, one could let $\epsilon$ diminish over time to reduce exploration. In a sense, at some point the value estimates can be trusted enough to focus on exploitation only.

### Optimistic Initial Values

This method incorporates optimistic initial values, in the sense that it initializes the action-value estimates ($Q_1(a)$) to a high value (as opposed to, say, 0). Conceptually this helps achieve higher reward because it forces the agent to try the actions it has not yet selected, since those still carry their artificially high value.

After an action has been chosen a few times, its estimate drops below the initial optimistic value, and a greedy (or even $\epsilon$-greedy) algorithm will then select the action with the highest estimate, which might be one that is artificially and temporarily high due to this initialization. Using this technique ensures some exploration and can kickstart convergence more rapidly.
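A sketch of greedy selection with optimistic initialization (the initial value of 5.0, the variable names, and the stateful style are illustrative choices, not from the course):

```python
# Greedy agent with optimistic initial estimates.
k = 10
Q = [5.0] * k   # optimistic: assumed well above any true reward
N = [0] * k     # per-action selection counts

def select_and_update(reward_fn):
    """Greedy selection; the inflated estimates force each untried
    action to be picked before any estimate settles near its true value."""
    a = max(range(k), key=lambda i: Q[i])
    N[a] += 1
    Q[a] += (reward_fn(a) - Q[a]) / N[a]   # sample-average update
    return a
```

With true rewards far below 5.0, the first $k$ selections necessarily cover all $k$ actions, since each pull drags that action's estimate down below the untried ones.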

### Upper-Confidence-Bound (UCB) Action Selection

The UCB action selection method builds on much the same principle as optimistic initial values, in that it prefers actions that have not yet been explored much. However, it does so on a continuous basis, based on the number of times a given action has been selected and on the total number of steps so far.

The concept is to add an exploration bonus to each action's estimated value, and to shrink that bonus as time passes and the action is selected more often. The bonus acts as a way to artificially represent the uncertainty in the action's value estimate.

More formally, $A_t = \text{argmax}_a \Big [ Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \Big]$

Where $t$ is the number of steps and $N_t(a)$ is the number of times action $a$ has been selected up until $t$. The constant $c$ is analogous to the critical-value factor from a normal distribution in a confidence interval for a sample mean. The resulting quantity is therefore itself analogous to the upper bound of a confidence interval.
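The selection rule can be sketched as follows, treating $N_t(a) = 0$ as an infinite bonus so untried actions are picked first (the function name and default $c$ are illustrative):

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """UCB selection: argmax of Q_t(a) + c * sqrt(log(t) / N_t(a)).
    An action with N_t(a) = 0 is treated as maximally uncertain."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # unexplored action: bonus considered infinite
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```

Note how two actions with equal estimates are separated purely by their counts: the rarely tried one gets the larger bonus and is selected.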

### Gradient Bandit Algorithm

Instead of estimating action values, one can maintain a numerical preference $H_t(a)$ for each action and select actions according to the softmax distribution $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$. As time (or steps) go by, we need a way to update the preference of each action; this is where the gradient algorithm comes into play. Using the general gradient ascent update $H_{t+1}(a) = H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}$, we obtain the following update rule (refer to the book for details on the derivation):
• $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t) \big( 1 - \pi_t(A_t) \big)$ for the selected action, and
• $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t) \, \pi_t(a)$ for all other actions.

The relative preference resulting from the softmax operation can be interpreted as the probability of selecting each action. From this update rule, we can observe that for the action that was selected, if the reward is higher than the baseline $\bar{R}_t$ (the average of the rewards received so far), the preference for that action is increased; if it is lower, the preference is decreased.
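The two update rules above can be sketched in one step (assuming a numerically stable softmax; function names and the default $\alpha$ are my own choices):

```python
import math

def softmax(h):
    # Numerically stable softmax over the preference vector.
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_update(h, chosen, reward, baseline, alpha=0.1):
    """One gradient bandit step:
    H(A_t) += alpha * (R_t - baseline) * (1 - pi(A_t))
    H(a)   -= alpha * (R_t - baseline) * pi(a)   for a != A_t."""
    pi = softmax(h)
    return [
        h[a] + alpha * (reward - baseline) * ((1.0 if a == chosen else 0.0) - pi[a])
        for a in range(len(h))
    ]
```

With a reward above the baseline, the chosen action's preference rises and every other preference falls, shifting the softmax probabilities toward the rewarded action.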