
Rainbow DQN — The Best Reinforcement Learning Has to Offer? | by Wouter van Heeswijk, PhD | Dec, 2022



What happens if the most successful techniques in Deep Q-Learning are combined into a single algorithm?

Rainbow DQN performance compared to other DQN techniques. Are these results too good to be true? [source: DeepMind paper]

Reinforcement Learning has come a long way since the era of classical tabular Q-learning. Deep Q-Networks (DQN) drove a revolution in the field, enabling powerful generalization across states.

By replacing a Q-table (which stores the values of all state-action pairs) with a powerful neural network (which, in theory, can map any input state to an output value per action), we could handle problems with massive state spaces, a genuine breakthrough in the field.
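
To make the contrast concrete, here is a minimal sketch of such a Q-network in PyTorch (the state and action dimensions and the layer sizes are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, replacing a row lookup in a Q-table."""
    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: [batch, n_actions]

# Greedy action for a single (made-up) state
q_net = QNetwork()
state = torch.randn(1, 4)            # illustrative 4-dimensional state
action = q_net(state).argmax(dim=1)  # index of the highest Q-value
```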

Working with neural networks brings a myriad of problems of its own, though. In response, a plethora of techniques has been developed to improve the performance of DQN algorithms. Common examples include replay buffers, target networks, and entropy bonuses. Often, incorporating such a technique yields considerable performance improvements.

If one technique can so drastically improve performance, why not use all of them?

This is precisely what the DeepMind team (their 2017 paper has 10 authors, so I’ll just refer to ‘the team’) must have wondered. If you would combine all these individual techniques into a single algorithm, would you get a Dream Team or Frankenstein’s monster?

We’ll suspend the performance question for a bit, and first dive into the six techniques that are deployed and combined in Rainbow DQN.

Photo by Harry Quan on Unsplash

I. Double Q-Learning

Q-learning selects actions that maximize the estimated state-action value. However, this comes with a fundamental flaw. As the Q-values are mere approximations, the algorithm favors actions whose Q-values happen to be inflated. This means that we systematically overestimate Q-values! Naturally, this bias does not help when trying to find the best decision-making policy. Worse, because Q-learning is a bootstrapping method, the overestimation may persist even after many iterations.
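
A tiny numerical illustration of this effect (the true values and the noise level are made up): even when every individual estimate is unbiased, the maximum over several noisy estimates is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(5)                        # five actions, all truly worth 0
noise = rng.normal(0, 1, size=(10_000, 5))  # unbiased estimation noise
estimates = true_q + noise

print(estimates.mean())              # close to 0: each individual estimate is unbiased
print(estimates.max(axis=1).mean())  # clearly positive: the max over actions overestimates
```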

Double Q-learning combats this flaw by deploying one network to select an action and another to evaluate it. Thus, even if an inflated Q-value drives the action selection, the value we update with is (probably) not inflated. At each update, we randomly choose one network for action selection and the other for evaluation, thereby decoupling the selection mechanism from the evaluation mechanism. The update procedure for network A is given below (network B deploys the reverse):

Update mechanism of Double Q-learning. The best known action is selected with one network, and its Q-value is determined with the other network.

Empirically, the Double Q architecture successfully reduces bias and leads to better decision-making policies.
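
As a minimal sketch (assuming two PyTorch Q-networks that map states to per-action Q-values, and a sampled batch of transitions), the target used to update network A could look like this:

```python
import torch
import torch.nn as nn

def double_q_target(q_select: nn.Module, q_eval: nn.Module,
                    reward: torch.Tensor, next_state: torch.Tensor,
                    done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Double Q target: one network selects the next action, the other evaluates it."""
    with torch.no_grad():
        best_action = q_select(next_state).argmax(dim=1, keepdim=True)  # selection
        next_q = q_eval(next_state).gather(1, best_action).squeeze(1)   # evaluation
        return reward + gamma * (1.0 - done) * next_q                   # done masks terminal states
```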

II. Prioritized Experience Replay

Collecting observations is often costly. Q-learning — being an off-policy method — allows us to store observations in a replay buffer, from which we can subsequently sample experiences to update our policy. The buffer also breaks the correlation between states: as subsequent states often exhibit strong correlation, neural networks trained on them directly tend to overfit locally.

However, not every past experience is equally relevant to revisit. To maximize learning, we might like to sample tuples that yielded high (absolute) value function errors in the past, with the idea that we want to evaluate these experiences again and reduce the error.

Priority sampling scheme, where p_i represents the priority of the experience (e.g., its rank, or its absolute TD error plus a small positive constant) and exponent α determines the strength of the prioritization.

Simply selecting experiences with the highest errors would introduce sampling bias. To remedy this, a form of weighted importance sampling is used to perform the weight updates:

Weighted importance sampling scheme, where N is the batch size and β is a scaling parameter.

Prioritized experience replay was shown to consistently outperform uniform replay sampling.
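
A minimal NumPy sketch of the two schemes above (the buffer of TD errors and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
td_errors = rng.uniform(0, 2, size=1000)  # pretend absolute TD errors of stored transitions
alpha, beta, eps = 0.6, 0.4, 1e-6         # prioritization strength, IS correction, small constant

priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()      # P(i) = p_i^alpha / sum_k p_k^alpha

batch_idx = rng.choice(len(td_errors), size=32, p=probs)  # prioritized sample

# Importance-sampling weights correct for the non-uniform sampling;
# normalizing by the maximum keeps all weights <= 1 for stability.
N = len(td_errors)
weights = (N * probs[batch_idx]) ** (-beta)
weights /= weights.max()
```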

III. Dueling Networks

Deep Q-networks typically return a Q-value Q(s,a), serving as an approximation of the value function for a given state-action pair (s,a). In the original Bellman equations this makes perfect sense, as computing the optimal value functions yields the optimal policy. In Q-learning, however, this entanglement of state and action values can be problematic, as it may be hard to distinguish whether a value stems primarily from the state, the action, or both.

The Q-value can be decomposed as follows: Q(s,a)=V(s)+A(s,a). Under the best known policy (i.e., the policy that returns the best action according to its estimates), the advantage of the selected action is 0 and the Q-value equals the state value.

In particular, for some states the action selection simply doesn’t matter that much — e.g., consider a self-driving car in a state without obstacles — and advantage values are negligible. Or, to give another example, a state could be poor regardless of the action we take in it. For such states, we prefer not to waste computational effort exploring all actions.

A dueling architecture allows us to determine good actions more quickly, as poor actions can be discarded more easily and comparable actions can be identified sooner. By decoupling the advantage value from the state value, we gain a more granular perspective on the action value.

Comparison between traditional Deep Q-network (top) and Dueling Deep Q-network (bottom) architectures. The latter has separate streams to output state values and advantage values, which added up return the Q-value. [figure adapted from Wang et al., 2016]

The network has a joint set of layers parameterized by θ, a stream parameterized by α to determine the advantage function, and a stream parameterized by β to determine the value function. Added up, the two streams return the Q-value.

To enforce that the value function and advantage function indeed rely on separately parameterized streams, the mean advantage over all actions is subtracted from the selected action’s advantage:

In dueling networks, one stream (parameterized by β) returns the state-dependent value function, and the other stream (parameterized by α) returns the advantage function.

Empirical results show that the proposed architecture quickly identifies good actions, as identifying action values has become an explicit part of the learning problem.
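
A minimal sketch of the dueling head in PyTorch (the layer sizes are illustrative; the mean-advantage subtraction implements the formula above):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # theta: joint layers
        self.value = nn.Linear(hidden, 1)               # beta-stream: V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # alpha-stream: A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.shared(state)
        v = self.value(h)        # [batch, 1]
        a = self.advantage(h)    # [batch, n_actions]
        # Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')
        return v + a - a.mean(dim=1, keepdim=True)
```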

IV. Multi-step learning

The impact of an action is not always directly visible. Many individual actions may not even yield an immediate reward. For instance, a bad turn in a racing game might result in losing pole position only some moments later. Typical Q-learning updates — or more generally, TD(0) updates — only consider the reward directly following the action, bootstrapping the downstream value of the decision.

At the other end of the spectrum, Monte Carlo / TD(1) learning first observes all rewards before updating Q-values, but this often results in high variance and slow convergence. Multi-step learning — also known as n-step Q-learning — strikes a balance between bias and variance, observing a number of time steps before performing an update. For instance, if n=3, the update combines three observed rewards with a bootstrapped value function for the remainder of the time horizon.

In general form, the multi-step reward can be computed as follows:

In multi-step learning, we track rewards for n time steps, and use a Q-value to estimate effects afterwards. This approach often strikes the right balance between temporal difference- and Monte Carlo learning.

The corresponding update scheme is defined as follows:

Weight update scheme for multi-step Q-learning.

For many problem settings, multi-step learning yields a substantial improvement over TD(0) and TD(1) learning, mitigating bias and variance effects and acknowledging delayed impact of decisions. The main challenge is to appropriately tune n for the problem at hand.
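
As a minimal sketch (assuming we have collected n rewards and a bootstrapped Q-value for the state reached afterwards):

```python
def n_step_target(rewards, bootstrap_q, gamma: float = 0.99) -> float:
    """n-step return: discounted sum of n observed rewards plus a bootstrapped tail value."""
    target = 0.0
    for k, r in enumerate(rewards):      # rewards r_t, ..., r_{t+n-1}
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_q

# e.g. n = 3 observed rewards, then bootstrap with max_a Q(s_{t+3}, a)
print(n_step_target([1.0, 0.0, 2.0], bootstrap_q=5.0))
```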

V. Distributional reinforcement learning

Most reinforcement learning algorithms work with expected return values. Instead, we could attempt to learn the distribution of returns (for which the Q-value would represent the expected value). Although traditionally used mainly to derive insights (e.g., riskiness of returns), it appears this style of learning actually improves performance as well.

The main use case for learning distributions rather than expected values is the application in stochastic environments. Even if rewards are noisy and do not directly aid in a converging expected value, we can still utilize them to get a better grasp on the underlying reward-generating distribution.

Whereas typical Q-values are defined by Q(s,a)=r(s,a)+γQ(s’,a’), the distributional variant is defined as follows:

Definition of the return distribution, replacing the Q-value. In distributional RL, we aim to learn the entire distribution, with its expected value equalling the Q-value.

Here, Z denotes the distribution of returns rather than the expected (Q-)value. Furthermore, R, S’ and A’ are random variables.

Instead of reducing the TD error, we aim to minimize the sampled Kullback-Leibler divergence with respect to the target distribution. The loss we minimize is the following:

Loss function for distributional RL. We construct a sampled target distribution, which we use to update the predicted return distribution.

Here, 𝒯 is a contraction mapping, Φ a distributional projection, and θ̄ a frozen copy of the parameters representing the target network. I fear you really need the full paper to make sense of it.

Perhaps more intuitively: the network outputs a discretized probability mass function, where each output (‘atom’) represents the probability of a small return interval. With each update, we try to bring the predicted return distribution closer to the observed one.

Distributional RL yielded good results, especially in problem settings with rare and noisy rewards.
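
For intuition, a minimal sketch of such a discretized output in PyTorch (the 51 atoms and the value range mirror common C51-style defaults; the projection of the Bellman target onto the atoms, which is the core of the loss above, is omitted here and replaced by a placeholder):

```python
import torch
import torch.nn.functional as F

n_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = torch.linspace(v_min, v_max, n_atoms)   # support of the return distribution

# Pretend network output for one state and 2 actions: logits over the atoms
logits = torch.randn(2, n_atoms)
probs = F.softmax(logits, dim=1)                # p(z_i | s, a): a pmf per action

q_values = (probs * atoms).sum(dim=1)           # the expectation recovers the usual Q(s, a)

# Training minimizes the cross-entropy between the predicted pmf and a target pmf
# (in practice the projected Bellman target; a random placeholder is used here):
target_probs = F.softmax(torch.randn(2, n_atoms), dim=1)
loss = -(target_probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
```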

VI. Noisy exploration layers

Traditional Q-learning uses epsilon-greedy exploration: with probability ϵ (usually 0.05 or 0.10) a random action is selected instead of the one with the highest estimated value. This mechanism allows escaping local optima by ensuring exploration. Entropy bonuses are another common technique to reward deviating from the greedy action.

Sometimes we need much more exploration than that, though. We can achieve this by adding noisy exploration layers to our neural network. We replace a standard linear layer — defined in canonical form by y=Wx+b, where x is the input, W the weights, b the bias, and y the output — with a layer composed of a deterministic stream and a noisy stream:

Noisy linear layer in a neural network. The output y depends on both a deterministic stream and a noisy stream. The weights corresponding to both streams are updated over time.

As you’d imagine, the noise generated by the layer transformation affects the resulting Q-values, so the action with the highest Q-value can vary, driving exploratory behavior. Over time, the algorithm may learn to ignore the noisy stream, as b_noisy and W_noisy are learnable parameters. Moreover, the signal-to-noise ratio can vary from state to state, encouraging exploration where necessary and converging to deterministic actions where appropriate.

Noisy neural networks appear to be a more robust exploration mechanism than, e.g., epsilon-greedy or entropy bonuses. Like the other techniques mentioned here, they have empirically been shown to improve performance on many benchmark problems.
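
A minimal sketch of such a noisy linear layer in PyTorch (a simplified take on the idea: independent Gaussian noise is drawn per forward pass rather than the paper’s factorized noise, and the initialization constants are illustrative):

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """y = (W + W_noisy * eps_w) x + (b + b_noisy * eps_b), with learnable noise scales."""
    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_features)
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.b_mu = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0 * bound))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps_w = torch.randn_like(self.w_sigma)   # fresh noise each forward pass
        eps_b = torch.randn_like(self.b_sigma)
        weight = self.w_mu + self.w_sigma * eps_w
        bias = self.b_mu + self.b_sigma * eps_b
        return nn.functional.linear(x, weight, bias)
```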

The DeepMind team put their Rainbow algorithm — combining the aforementioned six techniques — to the test against DQN algorithms using a single technique at a time. The figure below represents the average performance over 57 Atari games, with 100% indicating human-level performance.

Empirical performance, averaged over 57 Atari games. Rainbow DQN strongly outperforms individual DQN techniques, both in terms of rate of learning and final performance [source: DeepMind paper]

As seen, Rainbow DQN strongly outperforms all benchmarks that use a single DQN component. Also interesting is that it reaches good performance much more quickly than the others. It rivals human-level performance after about 15 million frames; four of the benchmarks never hit that point at all. After 44 of the 200 million frames, Rainbow DQN is already at a level at which it outperforms all competitors, no matter how long they train. After 200 million frames, it is the undisputed winner of the lot.

The DeepMind team also checked whether all components are necessary, by omitting individual techniques from the rainbow. Some techniques contribute more than others, but — simplifying the paper’s conclusions a bit — each technique adds something to the algorithm in terms of performance. See the comparison below.

Although not each technique contributes equally to the performance of Rainbow DQN, the experimental results indicate every one of them adds value [source: DeepMind paper]

Overall, the experimental results look absolutely stunning, so why don’t we all use Rainbow DQN all the time?

First of all, because it deploys so many elaborate techniques, Rainbow DQN is slow. Every technique requires additional computations, and naturally these add up. The algorithm may learn a lot per frame, but processing a frame takes substantially longer. Thus, a wall-clock comparison with the other algorithms would be less favorable.

A second drawback of Rainbow DQN is the extensive tuning. The number of steps in the multi-step procedure, the amount of noise in the noisy layers, the update frequency of the target networks — everything has to be tuned appropriately. Knowing how quickly a grid search explodes, tuning such a heavily parameterized algorithm is not easy, especially factoring in its low speed.

Finally, in part due to the previous points, Rainbow DQN can be unstable and hard to debug. A single coding mistake in one technique can ruin the entire algorithm — good luck finding the bug. Behavior not as expected? It could be anything, even some peculiar interplay between various techniques.

Given the difficulties of the algorithm, Rainbow DQN is presently not included in most popular RL libraries, and therefore cannot easily be used off the shelf. In a sense, the rainbow is indeed a Dream Team of RL techniques, but it seems winning is not everything.

As we saw, Rainbow DQN reports very impressive results with respect to its benchmarks. However, it should be noted that the original paper only compares it to other DQN techniques. In particular, the entire class of policy-based methods — including popular algorithms such as TRPO and PPO — has been left out of the comparison.

In a study by OpenAI (not on the same problems, though), Rainbow DQN’s performance was compared to that of PPO, the current go-to algorithm in continuous control. With standard implementations, Rainbow DQN handily beat PPO. However, when deploying a joint training mechanism — essentially transfer learning while keeping the algorithms themselves the same — joint PPO narrowly beat joint Rainbow DQN. Given that PPO is much simpler and faster than Rainbow DQN, its greater popularity is not surprising.

Ultimately, performance is highly important, but it is not the only relevant aspect of an RL algorithm. We should be able to work with it, modify it, debug it, and most of all — understand it. Rainbow DQN demonstrated that advances in the field can strengthen each other, and the algorithm yielded state-of-the-art results, but it ultimately did not convince practitioners to embrace it wholeheartedly.

Photo by Paola Franco on Unsplash

