
From algorithm to code: TD3

Tags
RL
Published on
January 18, 2021

NOTE 1: This is the second post in a three-part series on the TD3 algorithm.

  • Part 1: theory explaining the different components that build up the algorithm.
  • Part 2: how the algorithm is translated to code.
  • Part 3: how different hyperparameters affect the behaviour of the algorithm.

NOTE 2: The complete code discussed in this post can be found here on Github.

Hello there, dear reader! Welcome back to the next instalment on the TD3 algorithm. I hope you are doing well! Did you attempt to implement the code for the algorithm based on the theory that we discussed in the previous post? If yes, give yourself a big pat on the back. But in case you are still stuck with the implementation, let's work through the code together! 🤓

Recap: TD3 Theory 🔁

The success of TD3 can be attributed to three improvements over the DDPG algorithm, namely:

  1. Clipped Double Q-learning
  2. Delayed policy and target updates
  3. Target policy smoothing

Compared to DDPG, TD3 also does away with the explicit initialisation of layers of the Actor- and Critic-networks and replaces the Ornstein-Uhlenbeck noise with a small zero-mean Gaussian noise for exploration.

Fig 1: The TD3 Algorithm

Structure of the code

Looking at the algorithm, a few things jump right out. We need classes that encapsulate the Actor-network, the Critic-network and the Replay Buffer. Additionally, we need a class, let's call it TD3Agent, that encapsulates all the behavioural information, namely, how to sample from the behaviour policy and how to update the critics, the actor and the targets.
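
As a rough roadmap, here is a skeleton of the classes we will flesh out in the sections below (the method names are the placeholders I use throughout this walkthrough, not necessarily those in the repository):

```python
# High-level skeleton of the building blocks discussed in the rest of this post.

class Actor:          # maps a state to a continuous action
    ...

class Critic:         # maps a (state, action) pair to a Q-value
    ...

class ReplayBuffer:   # stores experiences and samples random batches
    def store(self, *experience): ...
    def sample(self, batch_size): ...

class TD3Agent:       # behaviour policy + update rules for critics, actor and targets
    def select_action(self, state, exploration_noise): ...
    def learn(self, batch_size): ...
    def soft_update_targets(self, tau): ...
```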

Actor

Fig 2: Architecture of the neural network for the Actor

Let's use an actor with 2 hidden layers (as shown in Fig. 2 above), with the first one containing 400 neurons and the second one 300. The input to this network is the state of the environment observed by the agent and the output is the continuous-valued action to be applied. Thus, for the LunarLanderContinuous environment, with an 8-dimensional observation space and a 2-dimensional action space, the input layer would have a size of 8, while the output layer would have 2 neurons.

Additionally, for the output from the actor, we need to ensure that the predicted action is within the valid range. To achieve this, the tanh operation is applied to the output layer to squash the results to the range -1 to +1. This squashed result is then scaled to match the range of valid action values. Do note that this approach assumes that all action dimensions lie within the same symmetric range.
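
A minimal PyTorch sketch of such an actor might look like the following (the layer sizes follow the description above; the variable names and the `max_action` scaling are my own choices):

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a state to a continuous action in [-max_action, +max_action]."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, action_dim),
            nn.Tanh(),  # squash the output to [-1, +1]
        )
        self.max_action = max_action  # assumes a symmetric action range

    def forward(self, state):
        # Scale the squashed output to the environment's action range.
        return self.max_action * self.net(state)
```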

Critic

Fig 3: Architecture of the neural network for the Critic

Let's use a similar 2 hidden layer architecture (as shown in Fig 3 above) with 400 and 300 neurons for the critic as well. The input to this network would be a concatenation of the state of the environment observed by the agent and the action performed in response. The output layer contains only 1 neuron as the critic returns a single Q-value for each state-action pair provided as input.
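
A corresponding sketch of the critic could look like this (again, the exact names are mine):

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Estimates the Q-value of a (state, action) pair."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, 1),  # a single Q-value per (state, action) pair
        )

    def forward(self, state, action):
        # Concatenate state and action along the feature dimension.
        return self.net(torch.cat([state, action], dim=1))
```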

Replay Buffer

This one is a standard component. If you have implemented DQN before, you can just use the same code for this class even with TD3. 👻

This class needs to exhibit two main functionalities:

  • the ability to store new experiences gathered by the agent's interaction with the environment
  • the ability to sample a batch of random experiences from those stored in memory

There are multiple ways this class can be implemented, e.g. using deques or having separate lists for each of the five elements in the experience tuple and so on. In my implementation, I use a single circular list of lists to achieve the desired functionality.
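
Here is a rough sketch of what such a circular-list buffer could look like (the exact code in the repository may differ in details such as the default capacity):

```python
import random

import numpy as np
import torch


class ReplayBuffer:
    """Circular buffer of <state, action, next_state, reward, done> experiences."""

    def __init__(self, max_size=int(1e6)):
        self.max_size = max_size
        self.buffer = []
        self.position = 0  # index at which the next experience is written

    def store(self, state, action, next_state, reward, done):
        experience = [state, action, next_state, reward, done]
        if len(self.buffer) < self.max_size:
            self.buffer.append(experience)
        else:
            self.buffer[self.position] = experience  # overwrite the oldest entry
        self.position = (self.position + 1) % self.max_size

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, next_states, rewards, dones = map(np.array, zip(*batch))
        return (
            torch.FloatTensor(states),
            torch.FloatTensor(actions),
            torch.FloatTensor(next_states),
            torch.FloatTensor(rewards).unsqueeze(1),
            torch.FloatTensor(dones).unsqueeze(1),
        )

    def __len__(self):
        return len(self.buffer)
```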

TD3Agent

This is the most interesting file, as it brings together all the classes we have created till now. This class needs to encapsulate the following functionality:

  • creation of the 6 networks
  • selection of the action that the agent executes in the environment
  • learning from the agent's experiences

Since this is a big class, let's look at the code for each of the functionalities separately.

The __init__ function of the class is used to create state variables for the memory, the actor-network, the two critic-networks and their corresponding target networks. Also, don't forget to initialise the target networks to have the same weights as their corresponding actor- and critic-networks.
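
A sketch of such an `__init__`, reusing the `Actor`, `Critic` and `ReplayBuffer` sketches from above (the hyperparameter names such as `gamma`, `tau` and `policy_delay` and their defaults are my own choices):

```python
import copy

import torch


class TD3Agent:
    def __init__(self, state_dim, action_dim, max_action,
                 gamma=0.99, tau=0.005, policy_delay=2,
                 policy_noise=0.2, noise_clip=0.5, lr=1e-3):
        self.memory = ReplayBuffer()

        # Actor and its target, initialised with identical weights.
        self.actor = Actor(state_dim, action_dim, max_action)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=lr)

        # Two critics and their targets, also initialised with identical weights.
        self.critic1 = Critic(state_dim, action_dim)
        self.critic2 = Critic(state_dim, action_dim)
        self.critic1_target = copy.deepcopy(self.critic1)
        self.critic2_target = copy.deepcopy(self.critic2)
        self.critic1_optimizer = torch.optim.Adam(self.critic1.parameters(), lr=lr)
        self.critic2_optimizer = torch.optim.Adam(self.critic2.parameters(), lr=lr)

        self.max_action = max_action
        self.gamma = gamma
        self.tau = tau
        self.policy_delay = policy_delay
        self.policy_noise = policy_noise
        self.noise_clip = noise_clip
        self.total_updates = 0  # used to delay the actor/target updates
```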

As we saw in the first post in this series, the TD3 agent explores the environment using a stochastic behaviour policy. Thus, we need to add a small Gaussian noise to the action predicted by the actor-network. The perturbed actions then need to be clipped to ensure that they remain within the valid range of values. Do note that this noise is only added during training.
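
A sketch of a `select_action` method implementing this idea (meant to live inside `TD3Agent`; scaling the noise standard deviation by `max_action` is one common convention, not the only option):

```python
import numpy as np
import torch


# Meant as a method of the TD3Agent class sketched above.
def select_action(self, state, exploration_noise=0.1):
    """Return an action for the given state, optionally perturbed by Gaussian noise."""
    state = torch.FloatTensor(state.reshape(1, -1))
    with torch.no_grad():
        action = self.actor(state).numpy().flatten()
    if exploration_noise > 0.0:  # noise is only added during training
        action += np.random.normal(0.0, exploration_noise * self.max_action,
                                   size=action.shape)
    # Clip so the perturbed action stays within the valid range.
    return np.clip(action, -self.max_action, self.max_action)
```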

The paper uses Polyak averaging to slowly update the target networks with a $\tau$ value of 0.005.
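
In code, this amounts to applying $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ to every parameter of every target network. A sketch of such a `soft_update_targets` method (again meant as a method of `TD3Agent`):

```python
# Meant as a method of the TD3Agent class sketched above.
def soft_update_targets(self, tau=0.005):
    """Polyak-average the online network weights into the target networks."""
    pairs = [(self.actor, self.actor_target),
             (self.critic1, self.critic1_target),
             (self.critic2, self.critic2_target)]
    for online, target in pairs:
        for param, target_param in zip(online.parameters(), target.parameters()):
            # theta_target <- tau * theta_online + (1 - tau) * theta_target
            target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```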

We are finally in the learn function, the last and the most involved part of this class. This function contains the logic for updating all 6 neural networks.

We first start by computing the noisy target actions for Target Policy Smoothing. This is done by adding a small Gaussian noise (unrelated to the noise of the behaviour policy) to the action predicted by the target actor-network for the next_state. The noisy target actions and the next_state act as inputs to the two target critic-networks, which produce the target Q-values $Q'_1$ and $Q'_2$. In order to apply Clipped Double Q-learning, the minimum of $Q'_1$ and $Q'_2$ is discounted (don't forget to set the Q-values for terminal states to 0.0!) and then summed with the reward values to compute the TD-target $y$. The loss for each of the critics is computed as the mean-squared error between the TD-target $y$ and the output of the corresponding (non-target) critic-network.
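
A sketch of this critic-update portion of `learn` (the helper `update_actor_and_targets`, called at the end and sketched after the next paragraph, is a name I introduce for clarity):

```python
import torch
import torch.nn.functional as F


# Meant as a method of the TD3Agent class sketched above.
def learn(self, batch_size=100):
    """One TD3 update: critics every call, actor and targets every policy_delay calls."""
    state, action, next_state, reward, done = self.memory.sample(batch_size)

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(action) * self.policy_noise).clamp(
            -self.noise_clip, self.noise_clip)
        next_action = (self.actor_target(next_state) + noise).clamp(
            -self.max_action, self.max_action)

        # Clipped Double Q-learning: take the minimum of the two target critics.
        target_q1 = self.critic1_target(next_state, next_action)
        target_q2 = self.critic2_target(next_state, next_action)
        target_q = torch.min(target_q1, target_q2)
        # Zero out the bootstrap term for terminal states, then add the reward.
        td_target = reward + self.gamma * (1.0 - done) * target_q

    # Mean-squared TD error for each (non-target) critic.
    critic1_loss = F.mse_loss(self.critic1(state, action), td_target)
    critic2_loss = F.mse_loss(self.critic2(state, action), td_target)

    self.critic1_optimizer.zero_grad()
    critic1_loss.backward()
    self.critic1_optimizer.step()

    self.critic2_optimizer.zero_grad()
    critic2_loss.backward()
    self.critic2_optimizer.step()

    # Delayed actor and target updates (see the next snippet).
    self.update_actor_and_targets(state)
```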

Delayed policy and target updates are achieved by updating the actor- and target-networks only on alternate steps. The actor update involves performing gradient ascent on critic1, by computing the Q-value of the state and the associated non-noisy action predicted by the actor-network. To convert the operation to gradient descent, we simply move in the opposite direction, i.e., take the negative of the Q-value obtained. The final step in the 'learn' function involves calling the 'soft_update_targets' function (from earlier) to update all three target-networks with Polyak averaging.
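
A sketch of this delayed part, written as the helper referenced at the end of the previous snippet (the helper's name and argument list are my own):

```python
# Meant as a method of the TD3Agent class, called at the end of learn().
def update_actor_and_targets(self, state):
    """Delayed policy and target updates, executed every `policy_delay` learn() calls."""
    self.total_updates += 1
    if self.total_updates % self.policy_delay != 0:
        return

    # Gradient ascent on critic1's Q-value w.r.t. the actor parameters,
    # implemented as gradient descent on the negated (batch-averaged) Q-value.
    actor_loss = -self.critic1(state, self.actor(state)).mean()

    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # Polyak-average all three target networks.
    self.soft_update_targets(self.tau)
```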

Bringing it all together: main.py 🍻

Now that we have the code for all the classes that we need, the final step is to build the training loop and testing loop.

The training loop involves using the 'select_action' function from the TD3 class to predict the agent's action in the given state. The agent then executes the action in the environment. The interaction of the agent with the environment is saved as a 5-tuple of <state, action, next_state, reward, done> to the replay buffer by calling the 'store' function associated with the class. Finally, the TD3 agent updates its networks using the 'learn' function.
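
A bare-bones version of such a training loop could look like the following, assuming the classic Gym API of that time (reset() returns only the state, step() returns a 4-tuple); the episode count, warm-up threshold and batch size are illustrative:

```python
import gym

env = gym.make("LunarLanderContinuous-v2")
agent = TD3Agent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.shape[0],
                 max_action=float(env.action_space.high[0]))

for episode in range(1000):
    state, episode_reward, done = env.reset(), 0.0, False
    while not done:
        # Behaviour policy: actor output plus exploration noise.
        action = agent.select_action(state, exploration_noise=0.1)
        next_state, reward, done, _ = env.step(action)
        # Store the 5-tuple in the replay buffer.
        agent.memory.store(state, action, next_state, reward, float(done))
        # Update the networks once enough experiences have been collected.
        if len(agent.memory) > 100:
            agent.learn(batch_size=100)
        state = next_state
        episode_reward += reward
    print(f"Episode {episode}: reward = {episode_reward:.1f}")
```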

During testing, by contrast, the agent simply needs to iterate for the specified number of test episodes, calling only the 'select_action' function from the TD3 class with exploration_noise set to 0.0. The selected action is then performed in the environment; however, there is no need to store the experiences in memory or to perform the learning step.
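
The corresponding test loop is therefore a stripped-down version of the training loop (the number of test episodes is illustrative):

```python
for episode in range(10):
    state, episode_reward, done = env.reset(), 0.0, False
    while not done:
        # No exploration noise, no storing of experiences, no learning step.
        action = agent.select_action(state, exploration_noise=0.0)
        state, reward, done, _ = env.step(action)
        episode_reward += reward
    print(f"Test episode {episode}: reward = {episode_reward:.1f}")
```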

Results 📈

Et Voilà! We have successfully converted the TD3 algorithm into working Python code! The full code can be found on Github. The graph below shows the train and test performance of this implementation on OpenAI Gym's LunarLanderContinuous-v2 environment.

Fig 4: Graph showing the train (left) and test (right) performance of the TD3 algorithm. Left: Shows rewards accumulated by the agent per episode (light blue) during training, along with a moving average of the scores of the last 100 episodes (dark blue). Right: Graphs the rewards per episode accumulated by the agent in test mode across 5 seeds. Note: Here a reward of above 200 is considered as acceptably solving the environment (marked by the red dotted line).

A different way of structuring the code

The authors of the TD3 paper have made their code publicly available. It is well written and easy to follow. I would suggest spending some time studying it. However, you might notice a few minor differences between their code and what we worked on above. So, let's spend a few minutes discussing these to make the code associated with the paper easier to assimilate. 🤓

The difference in code can be attributed to the way that the critic is implemented. The critic class in Scott Fujimoto's repository contains two separate neural networks representing the two critics $Q_1$ and $Q_2$, thus leading to a two-headed output.
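
A sketch of what such a class could look like (I call it `TwoHeadCritic` here only to distinguish it from the earlier one-headed version):

```python
import torch
import torch.nn as nn


class TwoHeadCritic(nn.Module):
    """Holds two independent Q-networks and returns both estimates in one call."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        return self.q1(sa), self.q2(sa)  # both Q-value heads in a single forward pass
```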

This also means that we need to update the TD3Agent class. With the new implementation, we would need to create only two instances of the critic class (one for current and one for target).

Finally, the portion of the 'learn' function associated with updating the critic-networks would need to be modified. The target Q-values, $Q'_1$ and $Q'_2$, can now be computed with a single forward pass of the target critic-network owing to the two-headed output of the critic class. Similarly, the current Q-values, $Q_1$ and $Q_2$, are obtained with a single forward pass of the critic-network. The losses of the two critics are computed as the mean-squared error as before, with the only difference being that the two loss values are summed together, as both the critic-networks now share a single optimiser.
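
A sketch of this modified critic update (the helper name `update_critics` and its argument list are my own; it assumes `self.critic` and `self.critic_target` are `TwoHeadCritic` instances sharing the single optimiser `self.critic_optimizer`):

```python
import torch
import torch.nn.functional as F


# Meant as a helper called from within learn() in the two-headed variant.
def update_critics(self, state, action, next_state, next_action, reward, done):
    with torch.no_grad():
        # Both target Q-values from a single forward pass of the target critic.
        target_q1, target_q2 = self.critic_target(next_state, next_action)
        td_target = reward + self.gamma * (1.0 - done) * torch.min(target_q1, target_q2)

    # Both current Q-values from a single forward pass of the critic.
    current_q1, current_q2 = self.critic(state, action)
    # Sum the two mean-squared-error losses, as both critics share one optimiser.
    critic_loss = F.mse_loss(current_q1, td_target) + F.mse_loss(current_q2, td_target)

    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()
```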

I am sure you must be wondering, "But Saasha, does the two-headed implementation of the critic class cause a change in performance?" To answer this, let's compare the plots from both the implementations (the initial one discussed earlier, i.e. one-headed critic, and the modified, i.e. two-headed critic).

Fig 5: Graph comparing the performance of the TD3 algorithm with the one-headed (initial implementation) and the two-headed (author inspired implementation) critic on the LunarLanderContinuous-v2 environment. Left: Shows the moving average of the scores of the last 100 episodes. Right: Graphs the rewards per episode accumulated by the agent in test mode across 5 seeds. Note: Here a reward of above 200 is considered as acceptably solving the environment (marked by the red dotted line).

To ensure that the performance of the implementations is comparable, the same hyperparameter settings have been used for both. As per the graph above (Fig 5), the two-headed variant appears to be more stable, but seems to exhibit a lower test time performance than the one-headed version. However, do note that I have not performed rigorous hyperparameter tuning on the implementations.

So, what's next? 🚀

Dear reader, I now urge you to test the code for yourself and explore the effects of each of the hyperparameters (we will be looking into this in the next post 😉). The complete implementation of both versions of the code discussed in this post can be found on Github. The folder 'critic_one_head' refers to the code from the initial walkthrough, while the folder 'critic_two_head' refers to the version based on the implementation provided by Scott Fujimoto, the author of the TD3 paper (available here).

Hope you found this post helpful. Dear reader, please do let me know if you like this new format, or if you have any suggestions on how to improve it. You can get in touch with me via saasha.allthingsai@gmail.com or via Twitter.

See you soon and wish you a productive week ahead! 💼

Further reading

  1. TD3 Implementation, by Scott Fujimoto -- code by the author of the paper on TD3
  2. Addressing Function Approximation Error in Actor-Critic Methods, Scott Fujimoto et al. -- the paper on TD3
  3. Twin Delayed DDPG, by OpenAI Spinning Up -- breaks down the various parts of the algorithm to support quick and easy implementation
  4. TD3: Learning to Run with AI, by Donal Byrne -- a blogpost that provides a detailed code walkthrough