Twin Delayed DDPG (TD3): Theory

What is Overestimation Bias?

Photo by Wesley Tingey on Unsplash

Let us consider specifically Q-value based methods to understand this problem. Recall, Q-value estimates the "goodness" of executing a certain action given the information about the state (for further details read the post on DQN here). In Q-learning, as with its deep counterpart, DQN, learning occurs using the difference between the target Q-value (referred to as the "TD-Target") and the predicted Q-value. The calculations of the target Q-value exploit the recursive nature of the Bellman Equation. Thus, the target Q is the sum of the immediate reward and the discounted Q-value of the "best" action in the next state, $s'$, computed as the maximum over all possible next actions, $a'$, i.e.,

$$Q(s,a) = r + \gamma \cdot max_{a'} Q(s', a')$$

Though theoretically, this makes sense, however, due to the fact that $Q(s', a')$ is an approximation, taking a max over it can be problematic. With deep alternatives, like DQN and DDPG, that use neural networks to approximate the Q-value the problem is exaggerated.

Fig 1: Neural Networks have a tendency to induce noise in the aprroximated Q-values leading to some points (red crosses) being above or below the true Q-value (blue curve)

If you are thinking "Huh, Saasha what do you mean?", let's build a toy example to understand this problem. Assume an RL agent that is afforded two actions by the environment, namely 'Left' and 'Right'. The true value for $Q(s', \text{'Left'})$ and $Q(s', \text{'Right'})$ are 0.4 and 0.6 respectively. Using the neural network we gather the predicted Q-value of $s'$, however since the neural network represents an approximation, the Q-values are noisy. Thus the neural network might return, 0.55 for the 'Left' action and 0.51 for the 'Right' action. Since we now apply the max operator to these Q values, the agent learns to prefer the 'Left' action. Similar noisy updates when occurring across multiple time steps (as shown in Fig 1 above) result in the RL agent suffering from "Overestimation Bias". Just like the human suffering from overconfidence bias, the RL agent believes it has a better understanding of the actions to be taken than is true. In RL lingo, the phenomenon is also referred to as Maximization Bias (Section 6.7 of Reinforcement Learning: An Introduction, by Sutton and Barto) or Positive Bias.

Why are we concerned about overestimation bias?

At this point you might be thinking, "Alright, I get what you are saying, but DQN does learn, right? So, what is the problem then?". Other than slowing down the learning process, leaving the bias unchecked can lead to the bias accumulating to significant levels over time. Since the Q-value dictates the nature of policy learnt by the agent, one can imagine that the agent learns to take bad actions in certain states, hence learning a suboptimal policy. It can also cause the agents to exhibit extremely brittle behaviours. Additionally, in Actor-Critic settings, similar to DDPG and TD3, where the actor and critic learn from each other, it can lead to a feedback loop causing the performance of both the actor and critic to degrade over time.

How does TD3 deal with overestimation?

Alright, so we agree that noisy updates are a problem for RL. TD3, rather than just focusing on the symptom exhibited as Overestimation Bias, also tackles the root cause of the problem, i.e. the variance/noise in the Q-values. So, how does TD3 do this?

As stated earlier, TD3 specifically focuses on the Actor-Critic setting by applying neat tricks to the DDPG algorithm, namely:

Clipped Double Q-Learning
Delayed Policy and Target updates
Target Policy Smoothing

The author, Scott Fujimoto on an episode of the TalkRL podcast, however, points out that these tricks can, in fact, be applied with minor tweaks and adjustments to other algorithms as well.

Fig 2: TD3 uses 6 neural networks, namely, 2 critics and 1 actor, along with their corresponding targets

Before we dive into the spefics of each of the modification listed above, let's take a minute to observe the network schematics. As indicated in Fig. 2 above, TD3 used a total of six neural networks, namely, two critics $Q1$ and $Q2$ with parameters $\theta_1$ and $\theta_2$ , two critic targets $Q1'$ and $Q2'$ with parameters $\theta_1'$ and $\theta_2'$ and an actor $\pi$ (also represented as $\mu$ sometimes, as in the DDPG paper) and corresponding target $\pi'$ with parameters $\phi$ and $\phi'$ respectively.

Clipped Double Q-Learning

Going back to the initial discussion on Overestimation Bias, this trick deals with updating how the TD-Target (denoted by $y$ to match the standard ML notation) is calculated. In the DDPG setting, the target actor network predicts the action, $a'$ , for the next state, $s'$ . These are then used as input to the target critic network to compute the Q-value of performing $a'$ in state $s'$ . This can be formaluted as:

$y = r + \gamma \cdot Q'(s', \pi'(s'))$

In comparison, TD3 builds on the concepts introduced in Double Q-learning. For context, Double Q-learning uses two separate value estimates, such that each Q-value is updated using the estimate of the other one as shown below:

$y1 = r + \gamma \cdot Q2(s', Q1(s', a))$

$y2 = r + \gamma \cdot Q1(s', Q2(s', a))$

Fig 3: Figure indicating the flow of data through the different networks and operators to calculate the TD-Target using the Clipped Double Q-Learning approach

This formulation of Double Q-Learning is based on the condition that $Q1$ and $Q2$ are completely independent and are updated on separate sets of experiences, thus leading to an unbiased estimate. However, in the actor-critic setting of DDPG, a replay buffer is used to sample the experiences to learn from, hence, the condition cannot be guaranteed. To deal with this, the authors suggest a "clipped" version of Double Q-learning (as summarised in Fig 3), taking the minimum of the two Q-values to compute the target. Additionally, to reduce the computational cost, using two critics $Q1$ and $Q2$ with their corresponding target networks $Q1'$ and $Q2'$ is suggested, but, with only a single actor $\pi$ optimised against $Q1$ . This leads to generating a single TD-Target $y$ that is then used for updating both $Q1$ and $Q2$ . Thus, the final formulation of the learning step for TD3 can be expressed as:

$y = r + \gamma \cdot \text{min} \{Q1'(s', \pi'(s')), Q2'(s', \pi'(s'))\}$

$L_{\text{critic1}} = \text{MSE}(y - Q1(s, a))$

$L_{\text{critic2}} = \text{MSE}(y - Q2(s, a))$

Delayed Policy and Target Updates

Function approximators, such as neural networks, induce noise in the value estimations leading to divergent behaviours while training. Target networks were introduced to deal with this. The target networks are updated at regular intervals thereby providing a stable target for updates at each step.

While target networks are used in DDPG as well, the authors of TD3 note that the interplay between the actor and critic in such settings can also be a reason for failure. As is stated in the paper,

"Value estimates diverge through overestimation when the policy is poor, and the policy will become poor if the value estimate itself is inaccurate."

Fig 4: The two critic networks are more frequently updated than the actor and the three target networks. As per the paper, the critics are updated at every step, while the actor and the targets are updated every second step.

Thus, the authors suggest keeping the actor fixed for a certain number of steps while updating the two critics with each step (as indicated in Fig 4 above). This allows the Q-values estimated by the critic networks to converge, thereby reducing the value error, which, in turn allows for more stable policy updates by the actor.

Additionally, when it comes to the three target networks, similar to DDPG, this paper suggests a slow-updating technique using Polyak averaging, using the formula below with a very small value for $\tau$ :

$\theta_1' \leftarrow \tau \cdot \theta_1 + (1 - \tau) \cdot \theta_1'$

$\theta_2' \leftarrow \tau \cdot \theta_2 + (1 - \tau) \cdot \theta_2'$

$\phi \leftarrow \tau \cdot \phi + (1 - \tau) \cdot \phi'$

Target Policy Smoothing

In the continuous action space, in contrast to its discrete counterpart, the actions have certain implicit meaning and relations. For example, consider an RL agent controlling a vehicle by dictating the steering angle. The actions to control the steering can lie between -1 and 1, here the actions -0.01, 0.0 and +0.01 are all very close and should have similar Q-values for a given state. If not enforced, it can lead to brittle policies that overfit certain actions, espcially in deterministic policies.

Fig 5: Figure indicating how the Clipped Double Q-Learning is modified to add noise to the target policy to allow for regularisation.

To combat this, the authors suggest fitting a small Gaussian around the action, such that all actions within this small area have similar state-action values, thereby reducing the variance in the associated estimations. The paper applies this by inducing a zero-mean Gaussian noise with a small standard deviation on the action generated by the target policy, which is then passed as input to the target critics for calculating the TD-Target value. Additionally, the noise itself, as well as the perturbed action are clipped. The noise is clipped to ensure that it applies to only a small region around the action, while the perturbed action is clipped to ensure that it lies within the range of valid action values. Thus, the calculation of the TD-Target $y$ can be updated as (summarised in Fig 5 above):

$y = r + \gamma \cdot \text{min} \{Q1'(s', \pi'(s') + \epsilon), Q2'(s', \pi'(s') + \epsilon)\}$

$\epsilon \sim clip(\mathcal{N}(0, \widetilde{\sigma}), -c, c)$

The missing pieces of the puzzle

So, till now we saw that noise in the value updates can be detrimental to the performance of RL updates, for which TD3 introduces two critics and one actor, and target networks associated with each of them, thus leading to a total of six networks. We looked at how the critics were updated, how the Q-values were regularised and how frequently the actor and target networks were updated. The natural question in your mind then should be:

How does the actor learn?
How does the agent explore the environment?

So, let's spend the next few minutes uncovering the answers to these.

Updates of the actor

Fig 6: Diagram depicting the flow of data through different networks and operators to obtain the actor.

As with DDPG, the objective of the actor is to maximise the expected return at the current state. Thus, the actor updates (as visualised in Fig 6 above) by taking the negative (because we need to maximise) of the mean of the value returned by the critic on the actions selected deterministically (without noise) by the actor. The only thing to remember when applying this to TD3 is that the critic value is computed via the network $Q1$ , and that critic network $Q2$ is NOT involved in this computation. Thus, the actor loss can be formulated as:

$L_{\text{actor}} = - Q1(s, \pi(s))$

Exploration by the TD3 Agent

Fig 7: Action prediction during Training vs Testing phase in TD3

TD3 aims to learn a deterministic policy, however, the challenge in such a setting is that the agent does not explore the environment adequately. DDPG and TD3 are designed for off-policy learning, i.e., the policy used to generate the behaviour does not need to be the same as the policy learnt by the actor. Thus, DDPG and TD3 capitalise on this using a stochastic behaviour policy (as shown in Fig 7 above) to add transactions to the Replay Buffer, while learning a deterministic policy from it. Stochasticity in the behaviour policy is induced by adding noise to the actions executed by the agents in the environment. While DDPG uses an Ornstein-Uhlenbeck noise, the authors of TD3 note that it is sufficient to, in fact, use an uncorrelated zero-mean Gaussian noise with a small standard deviation.

The TD3 Algorithm: Putting the pieces of the puzzle together

Having spent the entire post looking at each of the individual components that make TD3 work the way that it does, it is finally time for that moment where we see how all the pieces fit together.

Fig 8: The TD3 Algorithm

Next steps

Now that we have an understanding of how TD3 works and how the various components of the algorithm are assembled together, I urge you, dear reader, to spend some time attempting to implement the algorithm. In the next post, we will continue to look at the TD3 algorithm, focusing on how the above algorithm can be translated to code. If you do need help with your implementation though, there is a working but non-documented version of the code already up on Github (accessible here). Additionally, the authors of the paper have also made their code publicly accessible (found here).

Thank you, dear reader, for taking the time to read this post, hope you found it helpful. Looking forward to hearing from you, drop me a message at saasha.allthingsai@gmail.com, or hit me up on Twitter. 💙

See you in the next one! 👋