In this third Part of Applying Temporal Difference Methods to Machine Learning, I will be experimenting with the intra-sequence update variant of TD learning. It is a method where after each time step, the parameters are updated rather than waiting at the end of the sequence.
Intra-sequence parameter updating
The true essence of the TD learning methods evolves around intra-sequence parameters updating. Sutton sums it best in the original TD learning paper.
Since each observation of a sequence generates an increment to the weight vector, in many respects it would be simpler to update the weight vector immediately after each observation.
Indeed, could we find a way not to wait at the end of a sequence to update the parameters? This could potentially help us converge more quickly and require less episodes/examples. Sutton mentions the most intuitive update rule would be the following update rule at each time step ,
Similarly to the method where we update only at the end of the sequence, the change of parameters would be function of the current TD error and the cumulative sum of gradients. In addition, each prediction would be function of the previous set of parameters.
But as the author points out, this has the problem where the observed TD error , is a function of a change in parameters as well as the new observation at . Ideally, we would like to have only the impact of the new observation in the TD error.
Fixing the parameters
One recommendation Sutton proposes to alleviate the dependance of the TD error to the parameter change, is to make the prediction at time step with the current set of parameters before updating it. The full proposed update rule is the following,
Under this approach, we would therefore determine the TD error according to our current set of parameters. On the next time step, the predictions for the TD error are redetermined based on the newly updated parameters.
for t in xrange(784): # apply mask at fixed interval, making sequences of pixels appear if (t + 1) % 28 == 0: nb_seq += 1 seq_x = train_x * mask[t] pred = predict(seq_x)[0, 0] grad = preds_grad(seq_x) if nb_seq == 1: sum_grad = np.copy(grad) else: sum_grad += np.copy(grad) if t < 783: seq_x_prime = train_x * mask[t + 28] pred_y_prime = predict(seq_x_prime)[0, 0] TD_error = (pred_y_prime - pred) error = (true_y - pred) else: TD_error = (true_y - pred) error = (true_y - pred) param_values = lyr.get_all_param_values(network) delta_params = LRN_RATE * TD_error * sum_grad param_values += delta_params lyr.set_all_param_values(network, param_values)
One important component to notice is the fact that the sum of the gradients up to time step is still needed to evaluate the update. This makes it more computationally intensive than the original TD learning method explored in Part 2, since we are computing the same things, but actually doing the update during the sequence.
Another thing to point out is that all the methods so far have been using the full sequence of gradients, which are analogous to Monte Carlo method.
The following graph shows the same metric as before but for the incremental TD rule (including previous methods).
We can notice the performance is once again the same as the previous methods. This is the case because we have the full lengths of the gradient of the sequence to make the update. This is an interesting results as we are using the one-step prediction as the error feedback, but by considering all previous gradients with intra-sequence updates we are having the same updates.
In the following section, we will see how not using the full sequence of gradients affects the performance.
The next improvement that can be thought of is to actually not require all the previous history of gradients and rather only focus on the last one. This methods can be seen analogous to TD(0). The update rule now becomes the following,
Here is an excerpt from my code with this approach implemented,
for t in xrange(784): # apply mask at fixed interval, making sequences of pixels appear if (t + 1) % 28 == 0: nb_seq += 1 seq_x = train_x * mask[t] pred = predict(seq_x)[0, 0] grad = preds_grad(seq_x) if t < 783: seq_x_prime = train_x * mask[t + 28] pred_y_prime = predict(seq_x_prime)[0, 0] TD_error = (pred_y_prime - pred) error = (true_y - pred) else: TD_error = (true_y - pred) error = (true_y - pred) param_values = lyr.get_all_param_values(network) delta_params = LRN_RATE * TD_error * grad param_values += delta_params lyr.set_all_param_values(network, param_values)
This now becomes a compromise of increased efficiency while introducing a bias in our update. In practice, this method is not guaranteed to converge when using non-linear approximation such as a small neural network in this case study.
Below are is the results for one-step updates where only the recent gradient is considered. The other method’s results have been included for comparison.
The thing that can be clearly seen is how this approach underperforms compared to the full updates previously seen. This is quite disappointing because it is the most memory efficient compared to other methods.
It can be explained simply that when we used the full sum of gradients update, we have much more information to provide to our weight updates. Therefore it will be faster to converge. Under the one-step update, we only have the current gradient to propagate. It seems that the single-step updates does not help the performance on this task.
The main takeaway from this case study is that the first TD method explored (the one with an update at the end of the sequence) only offers greater benefits than the traditional machine learning approach at no additional cost. This slight modification to the update rule allows to save on memory while still making the same updates.
The question also can be raised with regard to applying intra-sequence updates for a more complexe model such as the family of Recurrent Neural Networks (RNNs). However, the loss implied by the type of tasks usually associated to RNNs isn’t the MSE and couldn’t benefit from the modification initially proposed by Sutton.
Some possible direction to expand on is to consider the full TD() family of algorithms where the sum of gradients is a geometrically weighted sum of previous gradients. This has proven under traditional reinforcement learning tasks to increase performance.