
When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | Feb, 2023



Rock-paper-scissors would be a boring affair with deterministic policies [Photo by Marcus Wallis on Unsplash]

If you are used to deterministic decision-making policies (e.g., as in Deep Q-learning), the need for and use of stochastic policies might elude you. After all, deterministic policies offer a convenient state-action mapping π:s ↦ a, ideally even the optimal mapping (that is, if all the Bellman equations are learned to perfection).

In contrast, stochastic policies — represented by a conditional probability distribution over the actions in a given state, π:P(a|s) — seem rather inconvenient and imprecise. Why would we allow randomness to direct our actions? Why leave the selection of the best-known decisions to chance?
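To make the distinction concrete, here is a minimal sketch (the Q-values and probabilities are made-up placeholders, not taken from the article) contrasting a deterministic greedy choice with sampling from a stochastic policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up values for a single state, purely for illustration.
q_values = np.array([1.2, 0.7, 0.9])          # learned action values Q(s, a)
action_probs = np.array([0.6, 0.1, 0.3])      # learned pi(a|s), sums to one

deterministic_action = int(np.argmax(q_values))                      # pi: s -> a, always the same
stochastic_action = rng.choice(len(action_probs), p=action_probs)    # a ~ pi(.|s), varies per call

print(deterministic_action, stochastic_action)
```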

In reality, a huge number of Reinforcement Learning (RL) algorithms indeed deploy stochastic policies, judging by the sheer number of actor-critic algorithms out there. Evidently, there must be some benefit to this approach. This article discusses four cases in which stochastic policies are superior to their deterministic counterparts.

Being predictable is not always good.

From a game of rock-paper-scissors, it is perfectly clear that a deterministic policy would fail miserably. The opponent would quickly figure out that you always play rock, and choose the counter-action accordingly. Evidently, the Nash equilibrium here is a uniformly distributed policy that selects each action with probability 1/3. How to learn it? You guessed it: stochastic policies.
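A quick way to see why the uniform mix is the equilibrium is to compute expected payoffs directly. The sketch below (the payoff matrix and helper are my own illustration, not from the article) shows that the uniform policy earns zero in expectation against any opponent strategy, while a pure "always rock" policy is fully exploitable:

```python
import numpy as np

# Payoff matrix from our perspective: rows = our action, columns = opponent action
# (0 = rock, 1 = paper, 2 = scissors).
PAYOFF = np.array([
    [ 0, -1,  1],   # rock:     ties rock, loses to paper, beats scissors
    [ 1,  0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1,  1,  0],   # scissors: loses to rock, beats paper, ties scissors
])

def expected_payoff(our_policy, opponent_policy):
    """Expected reward when both players mix according to the given distributions."""
    return our_policy @ PAYOFF @ opponent_policy

uniform = np.ones(3) / 3                # the mixed Nash equilibrium
pure_rock = np.array([1.0, 0.0, 0.0])   # a deterministic policy

# Against any opponent mix, the uniform policy earns 0 in expectation...
print(expected_payoff(uniform, np.array([0.2, 0.5, 0.3])))    # 0.0
# ...while "always rock" loses every round once the opponent switches to paper.
print(expected_payoff(pure_rock, np.array([0.0, 1.0, 0.0])))  # -1.0
```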

Especially in adversarial environments — in which opponents have diverging objectives and try to anticipate your decisions — it is often beneficial to have a degree of randomness in a policy. After all, game theory dictates that there is often no pure strategy that formulates a single optimal response to an opponent; instead, mixed strategies are the best action selection mechanism for many games.

Unpredictability is a powerful competitive tool, and if this trait is desired, stochastic policies are the clear way to go.

When facing a worthy adversary, always playing the same game will quickly backfire. [Photo by Christian Tenguan on Unsplash]

In many situations, we do not have a perfect picture of the true problem state, but instead try to infer it from imperfect observations. The field of Partially Observable Markov Decision Processes (POMDPs) is built around this discrepancy between state and observation. The same imperfection applies when we represent states by means of features, as is often needed to handle large state spaces.

Consider the famous Aliased GridWorld example by David Silver. Here, states are represented by the observation of the surrounding walls. In the illustration below, in both shaded states the agent observes a wall above and a wall below. Although the true states are distinct and require different actions, they are identical in the observation.

Based on the imperfect observation alone, the agent must make a decision. A value function approximation (e.g., Q-learning) can easily get stuck here, always picking the same action (e.g., always left) and thus never reaching the reward. An ϵ-greedy policy might mitigate the situation, but still gets stuck most of the time.

In contrast, a policy gradient algorithm will learn to go left or right with probability 0.5 for these identical observations, thus finding the treasure much more quickly. Because the agent acknowledges its imperfect perception of the environment, it deliberately takes probabilistic actions to counteract the inherent uncertainty.
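As a rough illustration (the feature encoding and weights below are hypothetical, not Silver's exact setup), any policy conditioned on the wall observations is forced to output the same action distribution in both shaded states; a stochastic policy can at least mix rather than commit to one direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wall-observation features: [wall up, wall down, wall left, wall right].
# Both shaded squares produce the identical observation (wall up, wall down).
obs_shaded_left  = np.array([1.0, 1.0, 0.0, 0.0])
obs_shaded_right = np.array([1.0, 1.0, 0.0, 0.0])

theta = rng.normal(size=(4, 2))          # weights mapping features to logits for {left, right}

def policy(obs, theta):
    logits = obs @ theta
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Identical observations -> identical action distributions, whatever theta is learned.
print(policy(obs_shaded_left, theta))
print(policy(obs_shaded_right, theta))

# A greedy (argmax) policy therefore walks the same way in both shaded states and can get
# trapped; sampling from a roughly 0.5/0.5 distribution eventually escapes from either one.
action = rng.choice(2, p=policy(obs_shaded_left, theta))
```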

Deterministic policy in Aliased GridWorld. The agent can only observe walls and is thus unable to distinguish the shaded states. A deterministic policy will make identical decisions in both states, which will often trap the agent in the upper-left corner. [Image by author, based on example by David Silver, clipart by GDJ [1, 2] via OpenClipArt.org]
Stochastic policy in Aliased GridWorld. The agent can only observe walls and is thus unable to distinguish the shaded states. An optimal stochastic policy will choose left and right with equal probability in the aliased states, making it much more likely to find the treasure. [Image by author, based on example by David Silver, clipart by GDJ [1, 2] via OpenClipArt.org]

Most environments — especially those in real life — exhibit substantial uncertainty. Even if we were to make the exact same decision in the exact same state, the corresponding reward trajectories may vary wildly. Before we have a reasonable estimate of expected downstream values, we may have to perform many, many training iterations.

If we encounter such considerable uncertainty in the environment itself — as reflected in its transition function — stochastic policies often aid in its discovery. Policy gradient methods offer a powerful and inherent exploration mechanism that is not present in vanilla implementations of value-based methods.

In this context, we do not necessarily seek an inherently probabilistic policy as an end goal, but it surely helps while exploring the environment. The combination of probabilistic action selection and policy gradient updates directs our improvement steps in uncertain environments, even if that search ultimately guides us to a near-deterministic policy.
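To illustrate how exploration falls out of the policy itself, here is a toy REINFORCE-style sketch for a stateless softmax policy on a three-armed bandit (the reward means and learning rate are arbitrary choices of mine): sampling keeps every action in play, while the score-function update gradually shifts probability toward the higher-reward action.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros(3)                          # logits of a stateless categorical policy
true_means = [0.0, 0.5, 1.0]                 # unknown to the agent; rewards are noisy

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # stochastic action selection = exploration
    reward = rng.normal(loc=true_means[a], scale=1.0)
    grad_log_pi = -probs                     # gradient of log pi(a) for a softmax policy
    grad_log_pi[a] += 1.0
    theta += 0.05 * reward * grad_log_pi     # REINFORCE update (no baseline, so noisy)

print(softmax(theta))                        # probability mass concentrates on the best arm
```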

When deploying stochastic policies, the interplay between control and environment generates a powerful exploration dynamic [image by author]

Truth be told, if we look past the standard ϵ-greedy algorithm in value function approximation, there are a number of powerful exploration strategies that work perfectly well while learning deterministic policies.

Although there are some workarounds, to apply value-based methods in continuous spaces we are generally required to discretize the action space. The more fine-grained the discretization, the closer the original problem is approximated. However, this comes at the cost of increased computational complexity.

Consider a self-driving car. How hard to press the gas, how hard to hit the brakes, how much to turn the steering wheel — these are all intrinsically continuous actions. In a continuous action space, they can be represented by three variables that can each adopt values within a certain range.

Suppose we define 100 intensity levels for both gas and brake and 360 degrees for the steering wheel. With 100*100*360=3.6M combinations, we have a pretty big action space, while still lacking the fine touch of continuous control. Clearly, the combination of high dimensionality and continuous variables is particularly hard to handle through discretization.
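For reference, the count above, and how quickly it explodes when the grid is refined (a back-of-the-envelope sketch):

```python
# Size of the discretized action space in the driving example above.
gas_levels, brake_levels, steering_levels = 100, 100, 360
print(gas_levels * brake_levels * steering_levels)                     # 3,600,000 joint actions

# Doubling the resolution of every dimension multiplies the count by 2**3 = 8.
print((2 * gas_levels) * (2 * brake_levels) * (2 * steering_levels))   # 28,800,000
```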

In contrast, policy gradient methods are perfectly capable of drawing continuous actions from representative probability distributions, making them the default choice for continuous control problems. For instance, we might represent the policy by three parameterized Gaussian distributions, learning both mean and standard deviation.
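A minimal sketch of such a Gaussian policy head (the means, log-standard-deviations, and action bounds below are placeholder values; in practice they would come from a neural network conditioned on the state):

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder outputs for the three controls: gas, brake, steering angle (degrees).
mean    = np.array([0.3, 0.0, 15.0])      # would be produced by the policy network
log_std = np.array([-1.0, -1.0, 1.5])     # learned jointly with the mean

action_low  = np.array([0.0, 0.0, -180.0])
action_high = np.array([1.0, 1.0,  180.0])

def sample_action(mean, log_std):
    """Draw one continuous action per dimension from the Gaussian policy."""
    std = np.exp(log_std)
    raw = rng.normal(mean, std)
    return np.clip(raw, action_low, action_high)

print(sample_action(mean, log_std))
# As training progresses, the standard deviations typically shrink, so the policy concentrates
# around the best continuous controls without ever becoming fully deterministic.
```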

Driving a car is an example of inherently continuous control. The finer the action space is discretized, the more cumbersome value-based learning becomes [Photo by Nathan Van Egmond on Unsplash]

Before concluding the article, it is important to emphasize that a stochastic policy does not imply that we keep making semi-random decisions until the end of time.

In some cases (e.g., the aforementioned rock-paper-scissors or Aliased GridWorld) the optimal policy requires mixed action selection (with probabilities 1/3 each and 50%/50%, respectively).

In other cases (e.g., identifying the best slot machine), the optimal response may in fact be a deterministic one. In such cases, the stochastic policy will converge to a near-deterministic one, e.g., selecting a particular action with 99.999% probability. For continuous action spaces, the policy will converge to very small standard deviations.
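A small illustration of that convergence (the logits are arbitrary, just to show the shape of the effect): as the logit gap in favour of the best action grows, the softmax probability approaches, but never exactly reaches, one.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Probability of the favoured action as its logit advantage over the others grows.
for gap in [1.0, 5.0, 10.0, 20.0]:
    print(gap, softmax(np.array([gap, 0.0, 0.0]))[0])
# gap=10 already yields ~0.99991; gap=20 yields ~0.999999996 -- close to,
# but never exactly, a deterministic choice.
```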

That said, the policy will never become completely deterministic. For mathematicians who write convergence proofs this is actually a nice property, as it ensures infinite exploration in the limit. Real-world practitioners may have to be a bit pragmatic to avoid the occasional idiotic action.

Depending on the problem, stochastic policies may converge to near-deterministic ones, virtually always picking the action yielding the highest expected reward [image by author]

And there you have it, four cases in which stochastic policies are preferable over deterministic ones:

  • Multi-agent environments: Our predictability gets punished by other agents. Adding randomness to our actions makes it hard for opponents to anticipate our moves.
  • Stochastic environments: Uncertain environments beg a high degree of exploration, which is not inherently provided by algorithms based on deterministic policies. Stochastic policies automatically explore the environment.
  • Partially observable environments: As observations (e.g., feature representations of states) are imperfect representations of the true system states, we struggle to distinguish between states that require different actions. Mixing up our decisions may resolve the problem.
  • Continuous action spaces: We must otherwise finely discretize the action space to learn value functions. In contrast, policy-based methods gracefully explore continuous action spaces by drawing from corresponding probability density functions.

