# Reward Is Not Enough for Risk-Averse Reinforcement Learning | by Ido Greenberg | Nov, 2022

## Why can’t we address risk-sensitivity just by setting the rewards properly?

**TL;DR**: Risk-aversion is essential in many RL applications (e.g., driving, robotic surgery and finance). Some modified RL frameworks consider risk (e.g., by optimizing a risk-measure of the return instead of its expectation), but pose new algorithmic challenges. Instead, it is often suggested to stick with the good old RL framework, and just set the rewards such that negative outcomes are amplified. Unfortunately, as discussed below, modeling risk using expectation over redefined rewards is often unnatural, impractical or even mathematically impossible, and hence cannot replace explicit optimization of risk-measures. This is consistent with similar results from decision theory, where risk optimization is not equivalent to expected utility maximization.

**See also** my related posts about how to actually do risk-averse RL [NeurIPS 2022], and how to know when things go wrong anyway [ICML 2021].

Reinforcement Learning (RL) is a subfield of machine learning that focuses on sequential decision making. In the typical setting, an *agent* is trained to operate in *episodes*; every episode requires a sequence of decisions to be made (*actions*), each leading to a *reward*, and **the cumulative rewards form the episode’s outcome (*return*)**. The standard objective of the agent is to find the decision-making policy that **maximizes the expected return** *E[R]*.

While RL has shown impressive results in a variety of games and simulations, it still struggles to serve real-world applications. One challenge is the high risk-sensitivity of many natural applications of RL (e.g., driving, robotic surgery and finance): while a gaming bot is allowed to occasionally falter, an autonomous driver must perform reliably under any circumstances. The sensitivity to risk also reflects the natural user preference in many applications [1], as risk aversion is a common and universal behavior, believed to be rooted in human evolution [2].

The necessity of risk-averse optimization has motivated research on settings where **a risk measure of the return is to be maximized, rather than its expected value**. Of particular interest is the *Conditional Value at Risk* (**CVaR**, also known as *Expected Shortfall*), which measures the average over the *α* lowest quantiles (*0<α<1*) of the return distribution (in contrast to the expectation, which averages over the whole distribution). It is a coherent risk measure (in fact, any coherent risk measure can be written as a combination of CVaR measures [3]). CVaR is widely used in financial risk management [4,5,6] and even banking regulation [8]; and more recently also in medicine, energy production, supply chain management and scheduling [9]. CVaR optimization has also been studied in the framework of RL [10,11,12], e.g., for driving [13] and robotics [14].
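To make the definition concrete, here is a minimal sketch of an empirical CVaR estimate next to the mean. The function name `cvar` and the ten returns are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def cvar(returns, alpha):
    """Empirical CVaR: mean of the lowest alpha-fraction of the returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))  # number of tail samples
    return returns[:k].mean()

# Hypothetical returns of 10 episodes (illustrative numbers only).
returns = [10, 12, 9, 11, 10, 13, -50, 11, 12, 10]

mean_return = float(np.mean(returns))     # averages over all episodes
cvar_20 = float(cvar(returns, alpha=0.2)) # averages over the 2 worst episodes

print(f"mean = {mean_return:.1f}, CVaR(0.2) = {cvar_20:.1f}")
```

With these numbers, a single bad episode barely moves the mean but dominates the CVaR, which is the behavior the definition is meant to capture. Setting `alpha=1.0` recovers the plain mean.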

Optimizing a risk-measure such as the CVaR turns out to pose significant challenges. In policy-gradient methods, focusing on the worst scenarios causes the agent to overlook most of the data in general (*sample inefficiency*) and successful strategies in particular (*blindness to success*) [12]. In value-based methods, naïve planning wrt a risk measure leads to inconsistent risk-aversion over time (namely, by being reasonably risk-averse every time-step, we end up being too conservative over the whole episode) [11].

So what should we do if we wish to optimize the CVaR? While certain solutions address the limitations above, some propose to simply bypass the complications: instead of maximizing the CVaR of the return, it is suggested to **simply replace the rewards r with some r’ that amplify negative outcomes, and solve the standard RL problem wrt r’**, i.e., to maximize the expected sum of the redefined rewards.

For example, when training a car-driver, instead of optimizing the CVaR — just assign extremely negative rewards to accidents! This follows the spirit of the more general approach of *Reward Is Enough* [15], and may be argued to both simplify the algorithm and reflect our “true” utility function. In decision theory, this approach is known as *expected utility maximization* [16].

Both the mean return and the CVaR return are legitimate objectives in different applications. As discussed above, CVaR optimization has strong motivations in many risk-sensitive problems. In such cases, **unfortunately, it is often infeasible to reduce the CVaR optimization to a mean problem by replacing the rewards**. This result is known in decision theory, where expected utility maximization differs from CVaR optimization (unless there is one policy that strictly dominates all the others) [16: Section IV]. We discuss this discrepancy in the context of RL, where the temporal structure forms an additional factor. Specifically, we present the following claims:

1. **It is difficult to quantify our “true” utility function.**
2. **Quantifying the utility function is not enough: our objective is defined in terms of *returns*, while a reduction to mean-optimization requires us to redefine the *rewards*.** The translation of utility from returns to rewards is often impractical or even mathematically impossible (see the MDP illustrated above).
3. **Even if we can design *rewards* that faithfully reflect our objective, their optimization may still suffer from similar limitations to those of risk-measures optimization.**

For convenience, below we discuss the claims in a different order (1,3,2).

## (1) “True” rewards are unknown

Designing the rewards to correctly reflect our risk-aversion is a difficult task. Consider the driving problem: how many extra minutes would you be willing to spend on the road in order to prevent a single car accident? Note that the answer is not infinite, otherwise cars would be extinct. The “right” answer is unknown, and by guessing we may encourage undesired behaviors, either too bold or too conservative.
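The sensitivity to the guessed accident cost can be sketched with a toy calculation. Everything below is hypothetical: the three policies, their travel times, their accident probabilities and the candidate costs are made-up numbers chosen only to show how the expected-cost-optimal behavior flips with the guess.

```python
# Hypothetical policies: (minutes per trip, accident probability per trip).
policies = {
    "bold": (20.0, 1e-4),
    "cautious": (25.0, 1e-5),
    "stay_home": (60.0, 0.0),  # assumed cost of not travelling, in minutes
}

def best_policy(accident_cost_minutes):
    """Policy minimizing expected cost = travel time + P(accident) * cost."""
    def expected_cost(name):
        minutes, p_accident = policies[name]
        return minutes + p_accident * accident_cost_minutes
    return min(policies, key=expected_cost)

# Three guesses for the cost of an accident, measured in minutes.
for cost in (1e4, 1e5, 1e7):
    print(f"accident cost = {cost:.0e} minutes -> {best_policy(cost)}")
```

Under these assumed numbers, the lowest guess prefers the bold policy, the middle one the cautious policy, and the highest one staying at home.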

Of course, CVaR optimization also requires us to quantify our risk aversion level (*α*). However, this choice is arguably more intuitive and universal: instead of “how many orders-of-magnitude are there between an accident and a minute of traffic?”, we have “shall we make decisions according to the *α=10%* worst scenarios, or the *α=1%* worst scenarios?”.

Indeed, our actual driving practices are arguably more similar to CVaR optimization: there is no internal value that we implicitly assign to accidents and to wasted minutes before we make decisions on the road. Rather, we imagine outcomes and act accordingly. If we are risk-averse, we imagine worse outcomes (e.g., we fear that the car in the other lane will suddenly move into ours), resulting in safer behavior. This is exactly how CVaR optimization works: the agent learns to act conditioned on the *α* worst outcomes.

## (3) Redesigned rewards may just shift the problem to somewhere else

So undershooting the cost of an accident would tolerate too many accidents, whereas overshooting may cause us to just stay at home. But what if we magically knew the “right” cost of an accident, in terms of wasted minutes on the road? Since accidents are inherently rare and costly, naïve optimization for safe driving would still suffer from effectively-sparse rewards and (as a result) sample-inefficiency. Thus, even though we didn’t formulate the problem as CVaR optimization this time, methods from risk-averse RL may still be of interest (e.g., over-sampling certain types of episodes as in [12]).

## (2) Risk-averse returns *≠* risk-averse rewards

Consider a financial portfolio, where the return is the annual profit and the rewards are the daily profits. Consider your risk-averse friend, who hates losing money: an annual loss hurts him three times as much as an equal annual profit pleases him. Instead of bothering with risk-measures, why can’t we simply transform the outcomes using the trivial utility function: *U(profit) = profit if profit≥0, else 3·profit*?

Well, to answer this, we should specify whether we apply this outcome-transformation to the returns or to the rewards. If we transform the return, we no longer follow the standard objective of the expected cumulative rewards, and we do in fact optimize a risk measure of the return. If we transform the rewards, then we do reduce to a standard RL problem, but it no longer respects our actual objective, since the utility is defined over annual profits, whereas the rewards are defined on a daily basis. Losing $10 today and gaining $20 tomorrow still contributes *−10+20=+$10* to the annual profit, not *−3·10+20=−$10*! That is, **risk aversion over returns does not translate directly into rewards, and a naïve translation usually results in over-conservative policies** (e.g., avoiding losses *every single day*).
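The two-day arithmetic above can be checked in a few lines. This is only a sketch of the example in the text; the function name `utility` and the variable names are illustrative.

```python
def utility(profit):
    """Loss-averse utility from the text: losses weigh three times as much."""
    return profit if profit >= 0 else 3 * profit

daily_profits = [-10, 20]  # lose $10 today, gain $20 tomorrow

# Utility applied to the return (total profit over the period).
utility_of_return = utility(sum(daily_profits))

# Utility applied to each reward (daily profit), then summed as in standard RL.
sum_of_reward_utilities = sum(utility(r) for r in daily_profits)

print(utility_of_return)        # 10
print(sum_of_reward_utilities)  # -10
```

The same sequence of daily profits is a gain under the return-level utility and a loss under the reward-level one, so an agent trained on the transformed rewards would avoid a strategy that the friend would actually be happy with.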

In the example above, it is unclear how to define a reward function that respects the risk-aversion wrt the returns. The desired reward function should depend on the structures within the environment (e.g., correlations over days). In that sense, we need to solve the problem in advance, just so we can define the rewards that would let us solve the problem again…

But even this impractical solution is not always possible.

**Claim**: There exists an MDP *(S,A,P,r,H)*, such that for any redefined reward function *r’*, the CVaR-optimal policy wrt *r* differs from the mean-optimal policy wrt *r’*.

**Proof**: The technical proof is provided in a separate document. The main idea is captured by the MDP illustrated above.

**Remark**: If states can be modified in addition to rewards, then in certain cases, CVaR *can* be reduced to mean optimization over an extended state-space [11]. Naturally, this raises new difficulties, and in particular increases sample complexity and thus limits scalability. Alternative methods can often optimize the CVaR without such a significant increase in sample complexity [12]. This can be seen as analogous to non-Markov decision processes: such problems can be mathematically reduced to MDPs (by extending the states to include history); yet, practical approaches address these problems using dedicated algorithms (e.g., by adding a memory unit to the agent).
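One way to build intuition for why extra state can help is the variational form of CVaR due to Rockafellar and Uryasev [4], written here for the lower tail of the returns: *CVaR_α(R) = max_s { s − E[max(s−R, 0)] / α }*. For a fixed *s*, the inner term is a plain expectation of a function of the return, which is the kind of objective that can be handled once the state tracks the accumulated reward. This is meant only as intuition and is not necessarily the construction used in [11]. The sketch below checks the identity numerically; the function names, the synthetic Gaussian returns and *α=0.05* are illustrative assumptions.

```python
import numpy as np

def cvar(returns, alpha):
    """Empirical CVaR: mean of the lowest alpha-fraction of the returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

def cvar_variational(returns, alpha):
    """max over s of  s - mean(max(s - R, 0)) / alpha.

    The objective is concave and piecewise linear in s, with breakpoints at
    the sample values, so it suffices to search over the samples themselves.
    """
    returns = np.asarray(returns, dtype=float)
    candidates = np.unique(returns)
    values = [s - np.maximum(s - returns, 0.0).mean() / alpha for s in candidates]
    return max(values)

rng = np.random.default_rng(0)
returns = rng.normal(size=1000)  # synthetic returns, for illustration only

cvar_tail = float(cvar(returns, alpha=0.05))
cvar_var = float(cvar_variational(returns, alpha=0.05))
print(cvar_tail, cvar_var)  # the two estimates should agree up to rounding
```

The two estimates coincide here because *α* times the number of samples is an integer; otherwise the tail-average estimate above rounds the tail size up and may differ slightly.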

## When *can* we reduce to mean optimization?

Following the discussion above, we can mark certain situations where CVaR optimization can be safely reduced to a standard RL problem. First, of course, we have to trust our utility function — the values that we assign to various outcomes. Second, the inconsistency between high-risk rewards and high-risk returns has to be negligible. This is true, for example, if extremely-negative returns are mostly caused by a single extremely-negative reward (like a car accident) instead of a sequence of negative rewards (like financial losses). Alternatively, in small-scale problems, we may apply the reduction by extending the state-space [11].

In addition, while CVaR optimization is quite useful, not every problem that appears risk-sensitive actually follows a CVaR objective in the first place. For example, if the rewards are truly additive and reliably quantified, and the number of i.i.d. test episodes is sufficiently large, then the expected return is a sensible objective to optimize (thanks to the Central Limit Theorem over episodes).

Writing the problem in a way that fits the standard algorithm is a key skill in the art of applied machine learning, as it allows us to take advantage of the great algorithms in the field. Yet, in certain cases the problem cannot be made to fit, and the algorithm must be modified instead. In this article, we explained why risk-averse RL is one of these cases. Fortunately, recent works propose efficient algorithms for optimization of risk-measures in RL [12,13,14,17].

Finally, I wish to thank Shie Mannor, Uri Gadot and Eli Meirom for their feedback and helpful advice for this essay.

## References

[1] *Prospect Theory: An Analysis of Decision under Risk*, Kahneman and Tversky, *Econometrica 1979*

[2] *Risk sensitivity as an evolutionary adaptation*, Hintze, Olson, Adami and Hertwig, *Nature 2015*

[3] *On the significance of expected shortfall as a coherent risk measure*, Inui and Kijima, *Journal of Banking & Finance 2005*

[4] *Optimization of Conditional Value-at-Risk*, Rockafellar and Uryasev, *Journal of Risk 2000*

[5] *Some Remarks on the Value-at-Risk and the Conditional Value-at-Risk*, Pflug, *2000*

[6] *VaR and expected shortfall in portfolios of dependent credit risks: Conceptual and practical insights*, Frey and McNeil, *Journal of Banking & Finance 2002*

[7] *Expected Shortfall and Beyond*, Tasche, *Journal of Banking & Finance 2002*

[8] *Back-testing expected shortfall*, Acerbi and Szekely, *Risk 2014*

[9] *Conditional value‐at‐risk beyond finance: a survey*, Filippi, Guastaroba and Speranza, *International Transactions in Operational Research 2019*

[10] *Optimizing the CVaR via Sampling*, Tamar, Glassner and Mannor, *AAAI 2015*

[11] *Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach*, Chow, Tamar, Mannor and Pavone, *NIPS 2015*

[12] *Efficient Risk-Averse Reinforcement Learning*, Greenberg, Chow, Ghavamzadeh and Mannor, *NeurIPS 2022*

[13] *Worst Cases Policy Gradients*, Tang, Zhang and Salakhutdinov, *CoRL 2019*

[14] *Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping*, Bodnar et al., *Robotics Science and Systems 2020*

[15] *Reward is enough*, Silver, Singh, Precup and Sutton, *Artificial Intelligence 2021*

[16] *Comparative Analyses of Expected Shortfall and Value-at-Risk (2): Expected Utility Maximization and Tail Risk*, Yamai and Yoshiba, *Monetary and Economic Studies 2002*

[17] *Implicit quantile networks for distributional reinforcement learning*, Dabney, Ostrovski, Silver and Munos, *ICML 2018*

