
Twin Delayed DDPG (TD3): Theory

Tags
RL

Note: Hi, dear reader, if you are familiar with my RL tutorials, you will notice that this post looks different. I am toying with the idea of changing the format and intend to distribute the contents of the TD3 algorithm across three posts:

  • Part 1: the theory explaining the different components that build up the algorithm.
  • Part 2: how the algorithm is translated to code.
  • Part 3: how different hyperparameters affect the behaviour of the algorithm.

Hope you enjoy this new format. πŸ€—

Photo by Ben Mullins on Unsplash

Consider this: you have an exam to write, say on this very topic of TD3. You read the paper and feel confident that you would be able to answer any question asked of you. When you reach the exam hall, however, the question paper leaves you completely flabbergasted. You were so sure you had understood the paper, so what happened? Why could you not answer the questions? Well, it turns out that you had grossly overestimated your skills on the topic, which led you not to put in the time to understand the math behind it or to attempt to translate the algorithm to code.

Overconfidence bias, as it is called in psychology, is caused by a misalignment between your evaluation of your skills and abilities and reality. However, this phenomenon is not unique to humans. It is quite common in RL agents as well, where it is referred to as "Overestimation Bias" in RL jargon! The TD3 algorithm was introduced to curb this issue in the Actor-Critic RL setting, focusing mainly on the shortcomings of the DDPG algorithm.

Before diving into the specifics of TD3, we should spend a few minutes building an intuitive understanding of the problem that the algorithm attempts to resolve.

What is Overestimation Bias?

Photo by Wesley Tingey on Unsplash

Let us consider Q-value based methods specifically to understand this problem. Recall that the Q-value estimates the "goodness" of executing a certain action given the information about the state (for further details, read the post on DQN here). In Q-learning, as with its deep counterpart, DQN, learning occurs using the difference between the target Q-value (referred to as the "TD-Target") and the predicted Q-value. The calculation of the target Q-value exploits the recursive nature of the Bellman Equation. Thus, the target Q is the sum of the immediate reward and the discounted Q-value of the "best" action in the next state, $s'$, computed as the maximum over all possible next actions, $a'$, i.e.,

$$Q(s,a) = r + \gamma \cdot \max_{a'} Q(s', a')$$

Though this makes sense theoretically, $Q(s', a')$ is only an approximation, and taking a max over it can be problematic. With deep variants like DQN and DDPG, which use neural networks to approximate the Q-value, the problem is exacerbated.

Fig 1: Neural networks have a tendency to induce noise in the approximated Q-values, leading to some points (red crosses) being above or below the true Q-value (blue curve)

If you are thinking "Huh, Saasha, what do you mean?", let's build a toy example to understand this problem. Assume an RL agent that is afforded two actions by the environment, namely 'Left' and 'Right'. The true values of $Q(s', \text{'Left'})$ and $Q(s', \text{'Right'})$ are 0.4 and 0.6 respectively. Using the neural network, we gather the predicted Q-values for $s'$; however, since the neural network is only an approximation, the Q-values are noisy. Thus the network might return 0.55 for the 'Left' action and 0.51 for the 'Right' action. Since we now apply the max operator to these Q-values, the agent learns to prefer the 'Left' action. Similar noisy updates, occurring across multiple time steps (as shown in Fig 1 above), result in the RL agent suffering from "Overestimation Bias". Just like the human suffering from overconfidence bias, the RL agent believes it has a better understanding of the actions to be taken than it actually does. In RL lingo, the phenomenon is also referred to as Maximization Bias (Section 6.7 of Reinforcement Learning: An Introduction, by Sutton and Barto) or Positive Bias.
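To make the effect concrete, here is a minimal NumPy sketch of the toy example above; the true Q-values match the numbers in the text, while the noise scale and number of trials are illustrative assumptions. Averaged over many noisy estimates, the max operator consistently returns a value above the true maximum of 0.6.

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-values of the two actions in state s' (the numbers from the text)
true_q = np.array([0.4, 0.6])          # ['Left', 'Right']
true_max = true_q.max()                # 0.6 -- value of the genuinely best action

# Simulate a noisy function approximator: true value + zero-mean noise
n_trials = 10_000
noisy_q = true_q + rng.normal(loc=0.0, scale=0.1, size=(n_trials, 2))

# The max operator is applied to the *noisy* estimates at every update
estimated_max = noisy_q.max(axis=1)

print(f"True max Q-value:                {true_max:.3f}")
print(f"Mean of max over noisy Q-values: {estimated_max.mean():.3f}")  # consistently above 0.6
```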

Why are we concerned about overestimation bias?

At this point you might be thinking, "Alright, I get what you are saying, but DQN does learn, right? So, what is the problem then?". Other than slowing down the learning process, leaving the bias unchecked can lead to the bias accumulating to significant levels over time. Since the Q-value dictates the nature of policy learnt by the agent, one can imagine that the agent learns to take bad actions in certain states, hence learning a suboptimal policy. It can also cause the agents to exhibit extremely brittle behaviours. Additionally, in Actor-Critic settings, similar to DDPG and TD3, where the actor and critic learn from each other, it can lead to a feedback loop causing the performance of both the actor and critic to degrade over time.

How does TD3 deal with overestimation?

Alright, so we agree that noisy updates are a problem for RL. TD3, rather than just focusing on the symptom exhibited as Overestimation Bias, also tackles the root cause of the problem, i.e. the variance/noise in the Q-values. So, how does TD3 do this?

As stated earlier, TD3 specifically focuses on the Actor-Critic setting by applying neat tricks to the DDPG algorithm, namely:

  1. Clipped Double Q-Learning
  2. Delayed Policy and Target updates
  3. Target Policy Smoothing

The author, Scott Fujimoto, points out on an episode of the TalkRL podcast, however, that these tricks can in fact be applied, with minor tweaks and adjustments, to other algorithms as well.

Fig 2: TD3 uses 6 neural networks, namely, 2 critics and 1 actor, along with their corresponding targets

Before we dive into the specifics of each of the modifications listed above, let's take a minute to observe the network schematics. As indicated in Fig 2 above, TD3 uses a total of six neural networks, namely, two critics $Q_1$ and $Q_2$ with parameters $\theta_1$ and $\theta_2$, two critic targets $Q_1'$ and $Q_2'$ with parameters $\theta_1'$ and $\theta_2'$, and an actor $\pi$ (sometimes also represented as $\mu$, as in the DDPG paper) with its corresponding target $\pi'$, with parameters $\phi$ and $\phi'$ respectively.
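The following is a minimal PyTorch sketch of the schematic in Fig 2, instantiating the six networks; the layer sizes, `state_dim`, `action_dim` and `max_action` are placeholder assumptions, not values prescribed by the paper.

```python
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

class Actor(nn.Module):
    """Deterministic policy: maps a state to an action in [-max_action, max_action]."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.max_action * self.net(state)

state_dim, action_dim, max_action = 17, 6, 1.0   # placeholder dimensions

# Two critics Q1, Q2 (parameters theta_1, theta_2) and one actor pi (parameters phi) ...
critic_1, critic_2 = Critic(state_dim, action_dim), Critic(state_dim, action_dim)
actor = Actor(state_dim, action_dim, max_action)

# ... plus a frozen copy of each as the corresponding target network
critic_1_target = copy.deepcopy(critic_1)
critic_2_target = copy.deepcopy(critic_2)
actor_target = copy.deepcopy(actor)
```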

Clipped Double Q-Learning

Going back to the initial discussion on Overestimation Bias, this trick changes how the TD-Target (denoted by $y$ to match standard ML notation) is calculated. In the DDPG setting, the target actor network predicts the action, $a'$, for the next state, $s'$. These are then used as input to the target critic network to compute the Q-value of performing $a'$ in state $s'$. This can be formulated as:

$$y = r + \gamma \cdot Q'(s', \pi'(s'))$$

In comparison, TD3 builds on the concepts introduced in Double Q-learning. For context, Double Q-learning maintains two separate value estimates; each estimate selects the greedy next action, but that action is evaluated using the other estimate, as shown below (a small tabular sketch of this update follows the equations):

$$y_1 = r + \gamma \cdot Q_2(s', \arg\max_{a'} Q_1(s', a'))$$

$$y_2 = r + \gamma \cdot Q_1(s', \arg\max_{a'} Q_2(s', a'))$$
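For intuition, here is a minimal tabular sketch of the Double Q-learning update above, assuming a small discrete environment; the table sizes, learning rate and discount factor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2        # illustrative sizes
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next):
    # Flip a coin: one table selects the greedy action, the other evaluates it
    if rng.random() < 0.5:
        best_a = Q1[s_next].argmax()                # action chosen by Q1 ...
        target = r + gamma * Q2[s_next, best_a]     # ... but evaluated by Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        best_a = Q2[s_next].argmax()
        target = r + gamma * Q1[s_next, best_a]
        Q2[s, a] += alpha * (target - Q2[s, a])
```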

Fig 3: Figure indicating the flow of data through the different networks and operators to calculate the TD-Target using the Clipped Double Q-Learning approach

This formulation of Double Q-Learning rests on the condition that $Q_1$ and $Q_2$ are completely independent and are updated on separate sets of experiences, which leads to an unbiased estimate. However, in the actor-critic setting of DDPG, a replay buffer is used to sample the experiences to learn from, so this condition cannot be guaranteed. To deal with this, the authors suggest a "clipped" version of Double Q-learning (as summarised in Fig 3), taking the minimum of the two Q-values to compute the target. Additionally, to reduce the computational cost, they suggest using two critics $Q_1$ and $Q_2$ with their corresponding target networks $Q_1'$ and $Q_2'$, but only a single actor $\pi$ optimised against $Q_1$. This leads to a single TD-Target $y$ that is then used for updating both $Q_1$ and $Q_2$. Thus, the learning step for TD3 can be expressed as (a code sketch follows the equations below):

$$y = r + \gamma \cdot \min \{Q_1'(s', \pi'(s')), Q_2'(s', \pi'(s'))\}$$

$$L_{\text{critic1}} = \text{MSE}(y - Q_1(s, a))$$

$$L_{\text{critic2}} = \text{MSE}(y - Q_2(s, a))$$
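Below is a minimal PyTorch sketch of the clipped target and the two critic losses, reusing the networks from the earlier sketch; the batch tensors are assumed to come from a replay buffer, and the `not_done` mask (which zeroes the bootstrap term at terminal transitions) is an implementation detail not shown in the equations.

```python
import torch
import torch.nn.functional as F

gamma = 0.99   # discount factor (hyperparameter)

def critic_losses(state, action, reward, next_state, not_done):
    # TD-Target: bootstrap with the minimum of the two target critics' estimates
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q1 = critic_1_target(next_state, next_action)
        target_q2 = critic_2_target(next_state, next_action)
        y = reward + not_done * gamma * torch.min(target_q1, target_q2)

    # Both critics regress towards the same clipped target y
    loss_1 = F.mse_loss(critic_1(state, action), y)
    loss_2 = F.mse_loss(critic_2(state, action), y)
    return loss_1, loss_2
```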

Delayed Policy and Target Updates

Function approximators, such as neural networks, induce noise in the value estimates, which can lead to divergent behaviour during training. Target networks were introduced to deal with this: updated only at regular intervals, they provide a stable target for the updates performed at each step.

While target networks are used in DDPG as well, the authors of TD3 note that the interplay between the actor and critic in such settings can also be a reason for failure. As is stated in the paper,

"Value estimates diverge through overestimation when the policy is poor, and the policy will become poor if the value estimate itself is inaccurate."

Fig 4: The two critic networks are more frequently updated than the actor and the three target networks. As per the paper, the critics are updated at every step, while the actor and the targets are updated every second step.

Thus, the authors suggest keeping the actor fixed for a certain number of steps while updating the two critics at each step (as indicated in Fig 4 above). This allows the Q-values estimated by the critic networks to converge, thereby reducing the value error, which, in turn, allows for more stable policy updates by the actor.

Additionally, when it comes to the three target networks, the paper, similar to DDPG, suggests a slow-updating scheme using Polyak averaging, applying the formulae below with a very small value for $\tau$ (a sketch of both the update schedule and the soft update follows the equations):

ΞΈ1′←τ⋅θ1+(1βˆ’Ο„)β‹…ΞΈ1β€²\theta_1' \leftarrow \tau \cdot \theta_1 + (1 - \tau) \cdot \theta_1'

ΞΈ2′←τ⋅θ2+(1βˆ’Ο„)β‹…ΞΈ2β€²\theta_2' \leftarrow \tau \cdot \theta_2 + (1 - \tau) \cdot \theta_2'

ϕ←τ⋅ϕ+(1βˆ’Ο„)β‹…Ο•β€²\phi \leftarrow \tau \cdot \phi + (1 - \tau) \cdot \phi'
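A minimal sketch of the soft (Polyak) target update and the delayed schedule, assuming the networks from the earlier sketches; the values of `tau` and `policy_delay` follow the spirit of the text (a very small $\tau$, actor and targets updated every second step) but are otherwise assumptions.

```python
import torch

tau = 0.005        # Polyak coefficient: a very small value, as suggested
policy_delay = 2   # critics update every step; actor and targets every second step

@torch.no_grad()
def soft_update(target_net, source_net):
    # theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.copy_(tau * param + (1.0 - tau) * target_param)

# Inside the training loop, after the critics have been updated at the current step:
#
#   if step % policy_delay == 0:
#       ...update the actor (see the actor-loss sketch later in the post)...
#       soft_update(critic_1_target, critic_1)
#       soft_update(critic_2_target, critic_2)
#       soft_update(actor_target, actor)
```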

Target Policy Smoothing

In the continuous action space, in contrast to its discrete counterpart, actions carry implicit meaning and relations. For example, consider an RL agent controlling a vehicle by dictating the steering angle. The actions can lie between -1 and 1, and the actions -0.01, 0.0 and +0.01 are all very close, so they should have similar Q-values for a given state. If this is not enforced, it can lead to brittle policies that overfit to certain actions, especially with deterministic policies.

Fig 5: Figure indicating how the Clipped Double Q-Learning is modified to add noise to the target policy to allow for regularisation.

To combat this, the authors suggest fitting a small Gaussian around the action, such that all actions within this small area have similar state-action values, thereby reducing the variance in the associated estimates. The paper applies this by adding zero-mean Gaussian noise with a small standard deviation to the action generated by the target policy, which is then passed as input to the target critics when calculating the TD-Target value. Additionally, both the noise itself and the perturbed action are clipped: the noise is clipped to ensure that it applies only to a small region around the action, while the perturbed action is clipped to ensure that it lies within the range of valid action values. Thus, the calculation of the TD-Target $y$ can be updated as (summarised in Fig 5 above):

$$y = r + \gamma \cdot \min \{Q_1'(s', \pi'(s') + \epsilon), Q_2'(s', \pi'(s') + \epsilon)\}$$

$$\epsilon \sim \text{clip}(\mathcal{N}(0, \widetilde{\sigma}), -c, c)$$
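Here is a minimal PyTorch sketch of the smoothed TD-Target, extending the earlier clipped-target computation; `sigma_tilde` and `noise_clip` correspond to $\widetilde{\sigma}$ and $c$ in the equations, and their numeric values are assumptions.

```python
import torch

sigma_tilde, noise_clip = 0.2, 0.5   # std of the smoothing noise and its clip range c
max_action, gamma = 1.0, 0.99        # valid action range and discount factor

def smoothed_td_target(reward, next_state, not_done):
    with torch.no_grad():
        next_action = actor_target(next_state)

        # Clipped zero-mean Gaussian noise around the target policy's action
        noise = (torch.randn_like(next_action) * sigma_tilde).clamp(-noise_clip, noise_clip)

        # Perturbed action, clipped back into the valid action range
        next_action = (next_action + noise).clamp(-max_action, max_action)

        target_q = torch.min(
            critic_1_target(next_state, next_action),
            critic_2_target(next_state, next_action),
        )
        return reward + not_done * gamma * target_q
```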

The missing pieces of the puzzle

So far we have seen that noise in the value updates can be detrimental to the performance of RL agents, and that, to deal with it, TD3 introduces two critics and one actor, along with a target network for each, leading to a total of six networks. We looked at how the critics are updated, how the Q-values are regularised and how frequently the actor and target networks are updated. The natural questions in your mind should then be:

  1. How does the actor learn?
  2. How does the agent explore the environment?

So, let's spend the next few minutes uncovering the answers to these.

Updates of the actor

Fig 6: Diagram depicting the flow of data through the different networks and operators to update the actor.

As with DDPG, the objective of the actor is to maximise the expected return from the current state. Thus, the actor is updated (as visualised in Fig 6 above) by taking the negative (because we need to maximise) of the mean of the value returned by the critic for the actions selected deterministically (without noise) by the actor. The only thing to remember when applying this to TD3 is that the critic value is computed via the network $Q_1$; the critic network $Q_2$ is NOT involved in this computation. Thus, the actor loss can be formulated as:

$$L_{\text{actor}} = -Q_1(s, \pi(s))$$
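A minimal sketch of the actor loss, reusing `critic_1` and `actor` from the earlier sketches; the batch of states is assumed to come from the replay buffer.

```python
def actor_loss(state):
    # Maximise Q1 of the actor's own (noise-free) action by minimising its negative mean.
    # Only critic_1 is used here; critic_2 plays no part in the actor update.
    return -critic_1(state, actor(state)).mean()
```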

Exploration by the TD3 Agent

Fig 7: Action prediction during the Training vs Testing phase in TD3

TD3 aims to learn a deterministic policy; the challenge in such a setting, however, is that the agent does not explore the environment adequately. DDPG and TD3 are designed for off-policy learning, i.e., the policy used to generate the behaviour does not need to be the same as the policy learnt by the actor. Thus, DDPG and TD3 capitalise on this by using a stochastic behaviour policy (as shown in Fig 7 above) to add transitions to the Replay Buffer, while learning a deterministic policy from it. Stochasticity in the behaviour policy is induced by adding noise to the actions executed by the agent in the environment. While DDPG uses Ornstein-Uhlenbeck noise, the authors of TD3 note that uncorrelated zero-mean Gaussian noise with a small standard deviation is, in fact, sufficient.
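Below is a minimal sketch of the behaviour policy described above; `expl_noise`, the standard deviation of the exploration noise, is an assumed value. During training the deterministic action is perturbed before being executed; at test time it is used as-is.

```python
import numpy as np
import torch

expl_noise = 0.1   # std of the zero-mean Gaussian exploration noise (assumed value)

def select_action(state, training=True):
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action = actor(state_t).squeeze(0).numpy()
    if training:
        # Stochastic behaviour policy: perturb the deterministic action before executing it
        action = action + np.random.normal(0.0, expl_noise, size=action.shape)
    return np.clip(action, -max_action, max_action)
```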

The TD3 Algorithm: Putting the pieces of the puzzle together

Having spent the entire post looking at each of the individual components that make TD3 work the way that it does, it is finally time for that moment where we see how all the pieces fit together.

Fig 8: The TD3 Algorithm
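To tie the earlier sketches together, here is a compact, assumed training step that strings the pieces into the order suggested by Fig 8; the replay buffer, the single optimiser over both critics (`critic_optimizer`) and the actor optimiser (`actor_optimizer`) are assumed to be created elsewhere.

```python
import torch.nn.functional as F

def train_step(replay_buffer, step, batch_size=256):
    state, action, reward, next_state, not_done = replay_buffer.sample(batch_size)

    # 1. Clipped Double Q-Learning with Target Policy Smoothing
    y = smoothed_td_target(reward, next_state, not_done)
    critic_loss = F.mse_loss(critic_1(state, action), y) + F.mse_loss(critic_2(state, action), y)

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # 2. Delayed Policy and Target Updates (every `policy_delay` steps)
    if step % policy_delay == 0:
        actor_optimizer.zero_grad()
        actor_loss(state).backward()
        actor_optimizer.step()

        soft_update(critic_1_target, critic_1)
        soft_update(critic_2_target, critic_2)
        soft_update(actor_target, actor)
```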

Next steps

Now that we have an understanding of how TD3 works and how the various components of the algorithm fit together, I urge you, dear reader, to spend some time attempting to implement the algorithm yourself. In the next post, we will continue looking at TD3, focusing on how the algorithm above can be translated to code. If you do need help with your implementation, though, there is a working but undocumented version of the code already up on GitHub (accessible here). Additionally, the authors of the paper have made their code publicly accessible as well (found here).

Thank you, dear reader, for taking the time to read this post, hope you found it helpful. Looking forward to hearing from you, drop me a message at saasha.allthingsai@gmail.com, or hit me up on Twitter. πŸ’™

See you in the next one! πŸ‘‹

Further reading

  1. Addressing Function Approximation Error in Actor-Critic Methods, Scott Fujimoto et al. -- the paper on TD3
  2. Continuous Control with Deep Reinforcement Learning, Timothy Lillicrap et al. -- the paper on DDPG
  3. Double Q-learning, Hado van Hasselt
  4. Issues in Using Function Approximation for Reinforcement Learning, Sebastian Thrun et al. -- explains how the use of neural networks leads to overestimation errors in Q-learning
  5. Policy Gradient Algorithms, by Lilian Weng -- a long post on policy gradient methods, with a section on DDPG and TD3
  6. Twin Delayed DDPG, OpenAI Spinning Up -- explains the components that form TD3