
Self-Destructive RL Agents
by Edgar A Aguilar | Jun 2022



Innocent changes to RL reward functions produce surprising behaviors

Image by Stefan Keller from Pixabay.

Problem Setting

Reinforcement Learning (RL) is a popular branch of AI/ML in which an agent learns optimal behavior by interacting with the world and maximizing a reward signal. For example, when playing a game, the agent decides how to act at each step, and the game environment rewards it accordingly.

Atari Breakout. Here the agent’s reward signal is the score of the game. Environment from OpenAI Gym.

Mathematical Optimization is the study of finding “the best element”. This typically means finding the maximum or minimum of a function, possibly subject to constraints. For example, supervised learning is usually written down as the task of minimizing a loss function between the predictions of a neural network and the target values of a dataset, which we then solve with stochastic gradient descent.

When first learning reinforcement learning, viewing it as a simple optimization problem is very tempting but can be misleading. Let us assume a finite-horizon setting, so that the RL goal is to find a policy π: S → A that maximizes the expected undiscounted sum of rewards:

J(π) = E_π[ Σ_{t=0}^{N} r(s_t, a_t) ]

We use a finite horizon and γ=1 to make the discussion easier to follow, but the same lessons apply in the general case. Now consider a new reward function that is just the original reward offset by a constant c:

r̃(s, a) = r(s, a) + c

Naively, we could be tempted to conclude that:

J̃(π) = E_π[ Σ_{t=0}^{N} ( r(s_t, a_t) + c ) ] = J(π) + (N+1)c

If that were true, it would seem that the optimal policy π∗ for the original reward function is the same as the optimal policy for the new, modified reward. The objective that the agent wants to maximize is just shifted by the value (N+1)c, which should be independent of the agent’s behavior, right? Actually, this is not necessarily the case! In fact, very different behavior can be observed for different choices of the constant c.

But why? The short answer is that N is actually a random variable whose distribution also depends on the policy. Most RL environments come with conditions that terminate an episode, and the agent’s actions can influence how long an episode lasts.
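
Writing the offset term with the expectation made explicit (using the notation introduced above) shows where the naive argument breaks down:

J̃(π) = J(π) + c · E_π[ N + 1 ]

The correction term depends on the expected episode length under π. If c is negative, a policy can therefore increase J̃ simply by ending episodes earlier, regardless of how well it does on the original reward.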

Let’s take a look at a concrete instance of this problem to better understand what is going on.

Intuitive Example

Probably the easiest environment for playing around with deep RL and building intuition is the cartpole problem (a minimal interaction sketch follows the figure below). To recap: the original reward assigns r=+1 for every timestep the cartpole is still alive, and the episode ends if:

  1. The angle of the pole exceeds a threshold: |θ| > 12°
  2. The cart goes out of bounds: |x| > 2.4
  3. The time limit (N=500) is reached.
CartPole environment from OpenAI Gym.
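
To make this reward structure concrete, here is a minimal random-policy rollout. This is a sketch assuming the classic gym API, where step returns a 4-tuple; newer gymnasium versions return five values and split done into terminated/truncated.

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()

    done, steps, total_reward = False, 0, 0.0
    while not done:
        action = env.action_space.sample()          # random policy, just to probe the environment
        obs, reward, done, info = env.step(action)  # reward is +1 for every step the pole stays up
        steps += 1
        total_reward += reward

    # with the default reward, the return equals the episode length
    print(f"Episode length: {steps}, return: {total_reward}")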

To establish a baseline, let us look at the training performance of different agents as a function of the constant offset c. For this, we use stable-baselines3 with an MLP Q-network (DQN). Below is the training curve averaged over 16 different agents. Training was stopped after 2,500,000 steps (although better performance can be achieved), since the goal is just to illustrate what is going on.

DQN Cartpole training — base reward (r=1). Image by author.

Nothing surprising here. The agent learns to keep the pole balanced, as expected. The only non-default parameter that I manually tuned was setting exploration_fraction=0.2.
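
For reference, a single baseline run looks roughly like the following sketch. Everything except exploration_fraction uses the library defaults, and the 16-seed averaging and plotting behind the figures are omitted.

    import gym
    from stable_baselines3 import DQN

    env = gym.make("CartPole-v1")

    # DQN with the default MLP Q-network; exploration_fraction=0.2 is the only hand-tuned setting
    model = DQN("MlpPolicy", env, exploration_fraction=0.2, verbose=1)
    model.learn(total_timesteps=2_500_000)
    model.save("dqn_cartpole_baseline")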

Next, we change the reward function and train again. As a warm-up, we set c=−0.5, i.e. at every step the agent receives a reward of r=0.5 until the episode terminates. What do you think will happen? To keep the comparison apples-to-apples, we plot the episode length over training (which happens to equal the original return).
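
A simple way to apply the constant offset is a reward wrapper around the environment. The sketch below uses gym's RewardWrapper; the class name OffsetReward is my own, not from the article.

    import gym

    class OffsetReward(gym.RewardWrapper):
        """Add a constant c to every per-step reward, leaving the dynamics untouched."""

        def __init__(self, env, c):
            super().__init__(env)
            self.c = c

        def reward(self, reward):
            return reward + self.c

    # c = -0.5 turns the usual +1 per step into +0.5 per step
    env = OffsetReward(gym.make("CartPole-v1"), c=-0.5)

The same wrapper with c=−2 reproduces the r=−1 setting used further below.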

DQN Cartpole training, r=0.5 (c=-0.5). Image by author.

As we see, nothing too dramatic happened. The time to reach convergence is slightly different, but overall the agent learns to maximize the new reward, and on average all agents learn to balance the pole after about 1.5 million steps.

So far, so good. But what would happen if we set c=−2? That is, at each step the agent receives a reward of r=−1.

DQN Cartpole training, r=-1 (c=-2). Image by author.
Prime example of reward hacking! Environment from OpenAI Gym.

Completely different behavior! The agent learns to end the episode as soon as possible. It effectively becomes self-destructive and useless for the task we had in mind, even though the reward was only modified by a constant! The explanation, of course, is that episode lengths are variable and can be influenced by the agent’s actions.

Concluding Thoughts

I think this behavior is not the most intuitive at first sight. Why does this agent fail to learn the balancing policy? After all, the reward function was just offset by a constant. The reason is that reinforcement learning is much more subtle than traditional optimization. In the episodic setting, there are termination conditions that can cut an episode short, so there is no guarantee that an episode lasts the full N steps. That is why the naive simplification above does not hold!

In this case, in particular, the agent realizes that if it self-destructs as soon as possible, then the negative rewards will stop accumulating!

If all the agent knows is negative rewards, then it will learn that a shorter episode is the best way to reduce its overall misery.
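
A quick back-of-the-envelope comparison makes this concrete (the episode lengths here are my own illustrative numbers):

    r = −1:   end after 10 steps → return −10;   survive 500 steps → return −500   (shorter is better)
    r = +0.5: end after 10 steps → return +5;    survive 500 steps → return +250   (longer is better)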

The moral of the story is hopefully clear:

Changing the reward function of an RL environment can lead to vastly different optimal policies, even with an inconspicuous change like an additive constant.

Just to clarify: the surprising thing is not that a change in the reward function per se leads to different behavior. After all, if we believe in the reward hypothesis, then the reward function is what encodes the optimal behavior. The surprising part is that something as innocuous as an additive constant can have such a big effect.

As a closing thought, you might have noticed that the training curve for the c=−0.5 case looked different. Agents trained with this modified reward converge to the same optimal solution, but at a different rate. This is related to how gradients propagate when training neural networks via backpropagation, so it is more an artifact of the training procedure than of the theoretical basis of RL. It is also why there is a field of “reward shaping”, where the reward function is tweaked so that training converges to the desired policy faster.

