In this Part 2 of Applying Temporal Difference Methods to Machine Learning, I will show the results of applying what Sutton refers to as the traditional machine learning approach, compared to the Temporal Difference (TD) approach.
For more information on this series, refer to the first part.
An important consideration with regard to the problem I am using to apply TD learning methods is that TD learning uses the mean squared error (MSE) as its loss function. The MNIST dataset is usually treated as a classification task, which implicitly uses cross-entropy as the loss function. Even though using the MSE isn’t optimal here, it can still be applied by treating MNIST as a regression problem on the class number.
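To make this concrete, here is a minimal sketch (with made-up numbers, not outputs from my experiments) of what MSE regression on the class number looks like:

```python
import numpy as np

# Treating MNIST labels as real-valued regression targets: with MSE the
# network emits a single scalar and the loss compares it to the class
# number directly, so predicting 3.2 for a "3" is a small error, whereas
# cross-entropy would instead require a 10-way probability distribution.
labels = np.array([3.0, 7.0, 1.0])   # class numbers used as targets
preds = np.array([3.2, 6.5, 0.9])    # scalar predictions (hypothetical)
mse = np.mean((labels - preds) ** 2)
print(mse)  # ≈ 0.1
```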
After framing the problem as a sequence, the traditional ML implementation is similar to the Monte Carlo approach. In a sense, at each step we make a prediction of the outcome and store it; once at the end of the sequence, we go through all our predictions, compare them to the true outcome, and update our parameters.
To simplify things and speed up training, I consider each row of pixels as a time step rather than each individual pixel. Since each input is originally 28×28 pixels, we therefore have sequences of 28 time steps.
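As a sketch of this masking scheme (the variable names mirror my code, but this construction of `mask` is an assumption, not an excerpt from it), each `mask[t]` reveals all pixels up to pixel `t` of the flattened image, and using it only every 28 pixels reveals the image one row at a time:

```python
import numpy as np

# mask[t] has ones at positions 0..t and zeros after: a lower-triangular
# matrix gives exactly this cumulative-reveal pattern.
n_pixels = 28 * 28
mask = np.tril(np.ones((n_pixels, n_pixels), dtype=np.float32))

# Only the masks at the end of each row are used, i.e. when (t + 1) % 28 == 0,
# giving 28 partial inputs (time steps) per image.
image = np.random.rand(n_pixels).astype(np.float32)
partial_inputs = [image * mask[t] for t in range(n_pixels) if (t + 1) % 28 == 0]

assert len(partial_inputs) == 28                 # 28 time steps
assert np.all(partial_inputs[0][28:] == 0)       # step 1 shows only the first row
assert np.allclose(partial_inputs[-1], image)    # last step sees the full image
```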
Here is an excerpt of my code showing the procedure:

```python
for t in xrange(784):
    # apply mask at fixed interval, making sequences of pixels appear
    if (t + 1) % 28 == 0:
        nb_seq += 1
        seq_x = train_x * mask[t]
        pred_y = predict(seq_x)
        grad = preds_grad(seq_x)
        # store predictions and gradients until the end of the sequence
        batch_preds.append(pred_y)
        batch_grads.append(grad)

# update params based on experience
old_param_values = lyr.get_all_param_values(network)
new_param_values = old_param_values
for pred, grad in zip(batch_preds, batch_grads):
    error = (true_y - pred)
    delta_params = LRN_RATE * error * grad
    new_param_values += delta_params
    batch_error.append(error)
lyr.set_all_param_values(network, new_param_values)
```
To consider this as a multi-step prediction problem, we need to assume the real outcome is only available at the end of the sequence. This is why the true outcome is only used after the sequence has been explored. The important component is the following, which is calculated for each step in our sequence.
```python
error = (true_y - pred)
delta_params = LRN_RATE * error * grad
```
Notice that we always refer to the true outcome when updating our parameters, which makes the updates unbiased, but it also means we never bootstrap the updates with the model's own predictions.
To relate this problem with reinforcement learning, we can consider an example from the dataset as being an episode that the agent experiences.
Below is the plot of the running average of the L2 norm of the error sequence over 4 iterations of the dataset. The error sequence refers to the error that is made at each time step during a sequence (the lower the better).
Temporal Difference approach
Now let’s consider the approach where the updates to our parameters are determined using only the TD errors (the errors between successive predictions). You can refer to Part 1 for more details on this approach.
For this approach, we need to modify the previous code to track the sum of gradients up to the current time step, and to compute the error from the next prediction rather than from the true outcome.
As a reminder, here is the equation for our parameter update coming from time step $t$, where $\alpha$ is the learning rate and $P_t$ is the prediction at time step $t$ (with the true outcome playing the role of $P_{T+1}$ at the last step):

$$\Delta \theta_t = \alpha \left( P_{t+1} - P_t \right) \sum_{k=1}^{t} \nabla_\theta P_k$$
Here is the same part of my code, but under the TD approach:

```python
old_param_values = lyr.get_all_param_values(network)
new_param_values = old_param_values
for t in xrange(784):
    # apply mask at fixed interval, making sequences of pixels appear
    if (t + 1) % 28 == 0:
        nb_seq += 1
        seq_x = train_x * mask[t]
        pred = predict(seq_x)[0, 0]
        grad = preds_grad(seq_x)
        # track the running sum of gradients up to the current time step
        if nb_seq == 1:
            sum_grad = np.copy(grad)
        else:
            sum_grad += np.copy(grad)
        if t < 783:
            # TD error: difference between the next prediction and the current one
            seq_x_prime = train_x * mask[t + 28]
            pred_y_prime = predict(seq_x_prime)[0, 0]
            TD_error = (pred_y_prime - pred)
            error = (true_y - pred)
        else:
            # at the last step, the true outcome plays the role of the next prediction
            TD_error = (true_y - pred)
            error = (true_y - pred)
        delta_params = LRN_RATE * TD_error * sum_grad
        new_param_values += delta_params
        batch_error.append(error)
# update params based on experience
lyr.set_all_param_values(network, new_param_values)
```
You might notice I am still tracking the true error; this is purely for comparison with the previous method, since the TD error itself is not directly comparable to the true error.
We can now see that rather than computing the parameter update at time step t from the current error and the current gradient, we have the following:
```python
TD_error = (pred_y_prime - pred)
# ...
delta_params = LRN_RATE * TD_error * sum_grad
```
Due to this change in procedure, there is a memory saving: we no longer have to keep all the predictions in memory during an episode/sample, only the running sum of gradients.
As described in Part 1, we should expect this method to produce the same results as the previous approach. This is true because the modified update rule was derived algebraically from the original one, with no bias introduced.
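We can verify this equivalence numerically on a toy linear predictor (a standalone sketch, not my MNIST network): summing the per-step Monte Carlo updates gives exactly the same total update as summing the TD updates, as long as the parameters are held fixed during the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 28, 5
alpha = 0.01
w = rng.normal(size=d)
xs = rng.normal(size=(T, d))   # one input per time step
z = 1.0                        # true outcome, revealed only at the end

preds = xs @ w                 # predictions P_1..P_T; for a linear predictor
                               # the gradient at step t is simply xs[t]

# Monte Carlo-style update: error against the true outcome at every step
mc_update = sum(alpha * (z - preds[t]) * xs[t] for t in range(T))

# TD-style update: TD errors times the running sum of gradients,
# with the true outcome standing in for the prediction after the last step
td_update = np.zeros(d)
sum_grad = np.zeros(d)
for t in range(T):
    sum_grad += xs[t]
    next_pred = preds[t + 1] if t < T - 1 else z
    td_update += alpha * (next_pred - preds[t]) * sum_grad

assert np.allclose(mc_update, td_update)  # identical parameter updates
```

The equality follows from the telescoping identity $z - P_t = \sum_{u=t}^{T} (P_{u+1} - P_u)$, which lets the Monte Carlo sum be regrouped into the TD form.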
To compare the two methods, I show the running average of the L2 norm of the error sequence, just as for the previous approach. In addition, the two methods are shown superposed in the right figure (this time the average is over the full length of training).
We can therefore see the Temporal Difference approach proposed by Sutton indeed produces the same results, while having the memory advantage of not storing the ongoing predictions.
This is because both approaches generate the same parameter updates, and both only apply those updates once the sequence is over.
In the following part, I will cover the variant of TD learning where updates to the parameters are made during the sequence. These types of methods are known as bootstrapping methods, which is where the TD learning approach really shines.