
How Assistance Games make AI safer | by Felix Hofstätter | Oct, 2022



And how they don’t

AI systems are becoming ever more powerful and are being given more responsibility, from driving cars to trading on the stock market. It is therefore clear that aligning their behavior with human intentions is a critical safety issue. The traditional approach to Reinforcement Learning only allows the user to design the AI’s reward signal once, before training, and getting this right is famously hard. But what if we reframe the relationship between humans and AI as one between a teacher and a student, allowing the AI to ask questions and learn from the human’s actions? In this article, I will write about Assistance Games: an exciting framework in which an AI learns about a human’s intentions by interacting with them. I will explain the formalism of Assistance Games, compare them to other Reward Learning approaches, and discuss some current weaknesses of the framework.

Photo by Jason Leung on Unsplash

Motivated by the problem of AI alignment, Stuart Russell has proposed three principles for building safe AI in his book “Human Compatible: Artificial Intelligence and the Problem of Control” [7]:

  1. The machine’s only objective is the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preference is human behavior.

Following these principles requires a shift in how we think about designing AI. The reason Machine Learning revolutionized the field is that it does not require us to give a program instructions on how to reach its goal. Instead, we define the goal and tell the program how to learn to achieve it. Now that goals have become more complex, Russell’s principles propose taking another step back and defining only how the goal itself can be learned.

Assistance Games (introduced by Hadfield-Menell et al. as Cooperative Inverse Reinforcement Learning [4]) are a framework for training AI according to Russell’s three principles. They are a human-in-the-loop method, meaning that the AI gets feedback from a human about the true reward signal. The AI agent, usually called R or the robot, and the human take turns acting in an environment and receive a reward for their actions, similar to Reinforcement Learning. Critically, both the human and the robot share the same reward, but only the human knows the true reward function. The robot is equipped with some assumptions about how the human behaves with respect to a goal and must infer the reward from the human’s actions. Depending on the environment specification, the robot may also ask clarifying questions.

Motivating Example: The Cooking Game

Shah et al.’s Benefits of Assistance over Reward Learning defines the cooking game to showcase what Assistance Games do better than traditional Reward Learning [9]. In this game, the human wants to cook a particular dish and the robot must learn to help them. The gridworld features several ingredients that can be interacted with to create various dishes. For example, flour can be placed on a plate to make pie dough. Pie dough may be filled with blueberries or cherries, and the resulting pie has to be baked. The agent can ask the human questions about their preferences, such as whether they prefer cherry or blueberry pie.

An illustration of the cooking game from Benefits of Assistance over Reward Learning [9]

Some aspects of this game are challenging for traditional Reward Learning. For example, the human may only be available for questioning after a few timesteps and traditional Reward Learning leaves no room for action before human feedback. I will elaborate on the benefits of Assistance Games, but first, let me formally explain how they work.

Formal Definition Of Assistance Games

Fully formalized, an Assistance Game (AG) is a tuple

⟨S, {A^H, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T, γ, Θ, P_θ, r_θ⟩

This looks vaguely similar to a Partially Observable Markov Decision Process (POMDP), which often appears in Reinforcement Learning. The definition can look a bit intimidating, so let’s go through it step by step. A minimal code sketch after the list shows how these components could fit together.

  • Like in a POMDP, there is a set of possible environment states S.
  • As both the human and the robot act in the environment, there are two sets of actions: A^H for the possible human actions and A^R for the possible robot actions.
  • Similar to a POMDP, the true state of the environment is never revealed directly; instead, the robot and the human receive observations, which are linked to the true state by an observation function O that maps states to a distribution over observations. As with actions, the robot and the human have their own observation sets (Ω^R, Ω^H) and observation functions (O^R, O^H).
  • T describes the environment’s dynamics: a stochastic mapping from the current state and the most recent human and robot actions to a distribution over next states.
  • γ is the discount rate as in vanilla RL.
  • Θ is the set of possible reward function parameters θ. The actual parameter θ is drawn from the prior P_θ and parametrizes the true reward function r_θ. As mentioned above, only the human knows θ. r_θ assigns reward based on the current state and the human’s and robot’s actions.
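
To make the notation more tangible, here is a minimal sketch of how the components of the tuple could be collected in code. This is purely illustrative: the papers do not prescribe any particular implementation, and all names below are my own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int          # index into the state set S
Action = int         # index into A^H or A^R
Observation = int    # index into Ω^H or Ω^R
Theta = int          # index into the parameter set Θ


@dataclass
class AssistanceGame:
    """Illustrative container for the components of an Assistance Game."""
    states: List[State]                        # S
    human_actions: List[Action]                # A^H
    robot_actions: List[Action]                # A^R
    human_observations: List[Observation]      # Ω^H
    robot_observations: List[Observation]      # Ω^R
    human_obs_fn: Callable[[State], Dict[Observation, float]]          # O^H: S -> Δ(Ω^H)
    robot_obs_fn: Callable[[State], Dict[Observation, float]]          # O^R: S -> Δ(Ω^R)
    transition: Callable[[State, Action, Action], Dict[State, float]]  # T(s, a^H, a^R) -> Δ(S)
    discount: float                            # γ
    thetas: List[Theta]                        # Θ
    theta_prior: Dict[Theta, float]            # P_θ
    reward: Callable[[State, Action, Action, Theta], float]            # r_θ(s, a^H, a^R)
```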

Solving Assistance Games

In this theoretical formulation, an AG is solved by a pair of human and robot policies that together achieve the highest possible reward. In practice, however, we may only want to train the robot and treat the human’s policy as a hyperparameter. The AG can then be reduced to a POMDP by treating the human as part of the environment, so the robot can be trained using POMDP-solving algorithms or Deep Reinforcement Learning. Yet there are two drawbacks to using traditional POMDP-solving algorithms such as Value Iteration.

Firstly, the transformation of an AG to a POMDP exponentially increases the state space. For every state in an AG, there must be corresponding POMDP states embedding the possible reward function parametrizations and previous human actions. In the POMDP of the cooking game, there is not merely a single state of the robot standing on a tile. There is one variation of this state where the agent has high confidence that it should make a cherry pie, a corresponding state for chocolate cake, and one for every other degree of belief that is possible according to the AG’s reward function parameter space. Further, the human’s actions need to be embedded in the state so that the robot can learn the correlation between what the human wants and the subsequently received reward.
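
As a rough sketch of why the state space blows up (this is my own simplification, not the construction used in the papers), each state of the reduced POMDP has to carry the hidden reward parameter and the human’s most recent action alongside the environment state:

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class ReducedState:
    """Hypothetical state of the POMDP obtained by folding the human into the environment."""
    env_state: int          # the original AG state s ∈ S
    theta: int              # the hidden reward parameter θ ∈ Θ (never observed by the robot)
    last_human_action: int  # a^H from the previous turn, needed to correlate it with reward


def reduced_state_space(states: List[int], thetas: List[int], human_actions: List[int]) -> List[ReducedState]:
    # The size multiplies out to |S| · |Θ| · |A^H| even when only the last human
    # action is tracked; tracking longer action histories grows it further.
    return [ReducedState(s, th, ah) for s, th, ah in product(states, thetas, human_actions)]


# e.g. 100 environment states × 10 reward parameters × 5 human actions = 5,000 states
print(len(reduced_state_space(list(range(100)), list(range(10)), list(range(5)))))
```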

In addition, traditional POMDP-solving algorithms only return an optimal robot policy if the human’s policy is perfectly rational. Obviously, this does not accurately model human behavior.

Fortunately, thanks to recent research, solving the POMDP corresponding to an Assistance Game has become much more efficient. Malik et al. introduced an adapted version of Value Iteration [6] that solves POMDPs based on Assistance Games with a much smaller state space than the traditional approach. Moreover, it produces optimal robot policies for a broader class of human policies. A better model of human behavior that can now be used is Boltzmann rationality, meaning the human chooses their actions according to a Boltzmann distribution over the actions’ expected values.
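
To make “Boltzmann rationality” concrete, here is a small sketch of a Boltzmann-rational human policy. The softmax over action values is the standard formulation; the function name and the rationality parameter β are illustrative choices of mine.

```python
import numpy as np


def boltzmann_human_policy(q_values: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Return a distribution over human actions proportional to exp(β · Q(s, a)).

    q_values: expected value of each human action in the current state,
              under the human's true reward parameter θ.
    beta:     rationality parameter; β → ∞ approaches a perfectly rational human,
              β → 0 a human who acts uniformly at random.
    """
    logits = beta * q_values
    logits = logits - logits.max()    # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Example: three actions with expected values [1.0, 2.0, 0.5]
print(boltzmann_human_policy(np.array([1.0, 2.0, 0.5]), beta=2.0))
```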

On the Deep Learning side, Woodward et al have had success with a multi-agent approach where the human is replaced with another agent during training [11].

In the vanilla Machine Learning paradigm, the reward is specified exactly once, before the model is trained. In contrast, Reward Learning approaches usually alternate between two phases of training. For example, in Reinforcement Learning from Human Feedback (RLHF) the agent first completes multiple episodes or partial episode trajectories [2]. Then the human is presented with pairs of these trajectories and states their preference between them. This feedback is incorporated into a reward model, which the agent uses during the next few episodes. These two phases of acting in the environment and gathering feedback alternate repeatedly during the training process.
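
Schematically, the RLHF training loop alternates between these two phases. The sketch below is only an outline of that alternation under assumed helper callables (act_phase, feedback_phase, fit_reward_model, improve_policy); it does not correspond to any particular library’s API.

```python
from typing import Callable, List, Tuple


def rlhf_outer_loop(act_phase: Callable[[], List],
                    feedback_phase: Callable[[List], Tuple[List, List]],
                    fit_reward_model: Callable[[List, List], None],
                    improve_policy: Callable[[List], None],
                    n_rounds: int) -> None:
    """Schematic RLHF outer loop: act, gather preference feedback, update, repeat."""
    for _ in range(n_rounds):
        # Phase 1: the agent acts, guided only by the current learned reward model.
        trajectories = act_phase()
        # Phase 2: the human compares pairs of trajectory segments and states preferences.
        pairs, preferences = feedback_phase(trajectories)
        # The comparisons update the reward model, which shapes the next round of acting.
        fit_reward_model(pairs, preferences)
        improve_policy(trajectories)
```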

On the other hand, Assistance Games have a more “fine-grained” interaction pattern between the human and AI because the robot and human take turns acting within the same episode. Therefore, the robot can update its belief about the true reward after every action. In the next section, I will explore the advantages of this fine-grained approach using the example of the cooking game.
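
By contrast, a single Assistance Game episode interleaves robot and human actions at every step, with the robot updating its belief over θ immediately. Again, this is only a sketch with a hypothetical interface for the environment, robot, and human.

```python
def assistance_game_episode(env, robot, human, max_steps: int):
    """Schematic AG episode: the robot and the human act in turns within one episode."""
    robot_obs = env.reset()
    belief = robot.initial_belief()                  # the prior P_θ over reward parameters
    for _ in range(max_steps):
        robot_action = robot.act(robot_obs, belief)  # a physical action or a question
        human_action = human.act(env.state)          # an answer or an ordinary action
        robot_obs, done = env.step(human_action, robot_action)
        # The robot learns from the human's move right away, not after the episode.
        belief = robot.update_belief(belief, human_action, robot_obs)
        if done:
            break
    return belief
```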

The following process diagrams illustrate the difference between vanilla Reinforcement Learning, RLHF, and Assistance Games:

Source: Author generated

Plans Conditional On Future Feedback

For one thing, it is useful for the robot to be able to act before receiving any feedback. In the cooking game, the human may not be available for a few timesteps, perhaps because they are still at work while the cooking robot wonders how to use its time productively. In an AG, the robot can learn that it always makes sense to prepare dough, since any cake will need it. This is called planning conditional on future feedback, since the agent knows that it will ultimately receive instructions on what to do with the dough. Making such a plan is not possible with RLHF, as the robot needs human feedback in the first phase.

Planning conditional on future feedback encourages conservative behavior in an agent, meaning the agent tries to keep open the option of fulfilling alternative reward functions. In traditional Reinforcement Learning, this can be achieved by providing a set of auxiliary reward functions and penalizing the agent if its solution to the main reward function makes the auxiliaries harder to satisfy [10]. Conservative agents try to avoid side effects and preserve the environment as much as possible. This makes them suitable for environments where it is hard or infeasible to specify everything the agent should avoid doing.
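
The penalty from [10] can be written as a small reward correction. The sketch below is a simplified rendition of attainable utility preservation under my own naming; the Q-values for the auxiliary rewards would come from whatever value estimator the agent maintains.

```python
import numpy as np


def aup_reward(r_main: float,
               q_aux_after_action: np.ndarray,
               q_aux_after_noop: np.ndarray,
               lam: float = 0.1) -> float:
    """Penalize the main reward by how much the chosen action changes the agent's
    ability to attain each auxiliary reward, relative to doing nothing.

    q_aux_after_action[i]: attainable value of auxiliary reward i after the chosen action
    q_aux_after_noop[i]:   attainable value of auxiliary reward i after a no-op
    """
    penalty = np.abs(q_aux_after_action - q_aux_after_noop).mean()
    return r_main - lam * penalty


# An action that closes off options (large change in auxiliary values) is penalized.
print(aup_reward(1.0, np.array([0.2, 0.1]), np.array([0.9, 0.8]), lam=0.5))
```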

Relevance-Aware Learning

Further, Assistance Games allow the robot to take relevance into account when asking questions. We can imagine an extension of the cooking game in which the robot sometimes finds worms in apples or similar ingredients. It is unclear to the robot whether the human wants wormy apples to be put in the compost bin or in the trash. Since worms are only found sometimes, this puts RLHF in a dilemma: should the robot ask a question about a situation that might never arise? In the cooking game this may seem harmless, but in more complex environments the number of situations to consider can easily become infeasible. A sufficiently sophisticated robot may wonder not only about worms but also about the size of ingredients, whether they have seeds, or whether the human’s emotional state affects their preferences. In an AG, on the other hand, the robot can learn to ask about particular situations only if they arise.

Learning From Human Actions

A third advantage is that in Assistance Games the human can take actions other than answering questions, and the robot can use these actions to infer the human’s goals. Suppose that in the cooking game chocolate cake can be made using dough, sugar, and chocolate. Dough and sugar are used in other cakes as well, but if the human grabs the chocolate, the robot can easily infer that the goal is to make chocolate cake.
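
This kind of inference is naturally expressed as a Bayesian update of the robot’s belief over θ, assuming the robot has a model of how likely each human action is under each goal. The sketch below uses made-up goals, actions, and probabilities purely for illustration.

```python
def update_belief(prior: dict, human_action: str, likelihood: dict) -> dict:
    """Bayesian update: P(θ | a^H) ∝ P(a^H | θ) · P(θ)."""
    posterior = {theta: likelihood[theta].get(human_action, 0.0) * p
                 for theta, p in prior.items()}
    total = sum(posterior.values())
    return {theta: p / total for theta, p in posterior.items()}


# Prior: the robot is unsure which dish the human wants.
prior = {"chocolate_cake": 1 / 3, "cherry_pie": 1 / 3, "blueberry_pie": 1 / 3}

# Assumed human model: grabbing chocolate is likely only if they want chocolate cake.
likelihood = {
    "chocolate_cake": {"grab_chocolate": 0.9, "grab_flour": 0.1},
    "cherry_pie":     {"grab_chocolate": 0.01, "grab_flour": 0.5},
    "blueberry_pie":  {"grab_chocolate": 0.01, "grab_flour": 0.5},
}

# Posterior mass concentrates on chocolate cake after the human grabs the chocolate.
print(update_belief(prior, "grab_chocolate", likelihood))
```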

Designing The Prior Can Be Hard

Crucially, an Assistance Game requires a reasonable prior over the human’s preferences. In the cooking game, this may be easy. If the robot can only bake apple, cherry, or blueberry pie, it can start with a uniform distribution over reward functions that reward making any one of those pies, or with a prior based on statistics about human preferences. But as tasks become more complex and the robot grows capable of considering more concepts, it becomes less obvious how to design a prior. We may be tempted to make the robot more uncertain about everything as a result. Unfortunately, this also means the robot is more restricted in what it can do without human feedback. If our cooking robot is also a housekeeper, it may be uncertain whether the human wants it to make a mess by preparing dough before they arrive. A bad prior can thus prevent the robot from developing good plans that are conditional on future feedback.

A recently introduced technique called Reward Learning by Simulating the Past (RLSP) may make designing priors somewhat easier [8]. The core idea is that some user preferences can be inferred from the current state of the environment. If there is an intact, fragile vase in a room, the robot can infer that the human wants it intact, since it would have been easy to break had the human desired to do so. However, inferring intentions like this requires simulating the possible actions that could have led to the current state, and trying out all human trajectories for a given environment quickly becomes computationally infeasible. Instead, a Deep Learning version of RLSP simulates the trajectory backwards from the current environment state [5].
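
To convey only the intuition (the actual algorithms in [8] and [5] are far more efficient, working with gradients and a learned backwards model rather than enumeration), one could imagine weighting each reward hypothesis by how plausibly a human pursuing it would have produced the observed state:

```python
from typing import Callable, Dict, List, Tuple


def rlsp_like_update(prior: Dict[str, float],
                     observed_state: int,
                     final_state_of: Callable[[List[int]], int],
                     past_trajectories: Dict[str, List[Tuple[List[int], float]]]) -> Dict[str, float]:
    """Naive 'simulate the past' update: P(θ | s_observed) ∝ P(θ) · P(s_observed | θ).

    past_trajectories[θ] lists (trajectory, probability) pairs describing how a human
    optimizing θ might have acted before the robot was deployed.
    """
    posterior = {}
    for theta, p_theta in prior.items():
        # Likelihood that a human pursuing θ ends up in the state the robot observes.
        likelihood = sum(p for traj, p in past_trajectories[theta]
                         if final_state_of(traj) == observed_state)
        posterior[theta] = p_theta * likelihood
    total = sum(posterior.values()) or 1.0
    return {theta: p / total for theta, p in posterior.items()}
```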

Using RLSP in Assistance Games can help the robot infer which actions are robustly good, even if its prior over reward functions is not sufficient to do so. A robot that is supposed to cook and keep the house clean can learn that it is alright to prepare dough, even though doing so makes a mess, by observing that pie has been baked in this environment before and inferring that this required making dough.

Human Preferences May Change

The model of Assistance Games assumes that there is a constant true reward function that is known to the human and must be inferred by the robot. Since the reward function is supposed to represent the human’s preferences, holding it constant is not entirely accurate. Not only do preferences change naturally over time, but some AI applications are also bound to take actions that influence the preferences of the humans who interact with them. In any process where the AI’s reward is tied to human feedback, this may incentivize the AI to manipulate the human’s preferences to get more reward. An infamous example is recommender systems, which learn to polarize users by showing them ever more extreme content because this makes their clicks easier to predict.

It is easy to model a recommender system as an Assistance Game: the robot’s actions consist of recommending videos to the human, who gives feedback by watching the video that interests them most. In this naive model, we would get exactly the same preference-tampering dynamics as in current recommender systems. It may be possible to avoid manipulative behavior by modelling the problem differently. For example, a model could refrain from inferring a human’s preferences based (only) on which recommendations they consume. Or the problem could be circumvented, rather than solved, by taking additional kinds of preferences into account, such as meta-preferences about how preferences should change.

However, perhaps the only way to make sure we do not train manipulative AI using Assistance Games is to overhaul the formalism so that it accounts for changing preferences. It has been suggested that attempting to align AI with human preferences, as Russell proposes, is futile if we do not acknowledge how AI can change those preferences [3]. There are many attack vectors for an AI aiming to manipulate human preferences, such as the environment or the human’s own behavior. Assistance Games treat behavior and the environment only as indicators of preferences, ignoring that they can in turn shape those preferences [1]. Recently, researchers in AI and other fields have called for a multidisciplinary research effort to better understand how preferences are formed, so that AI that deals with human preferences can be designed more safely [3].

Image from The problem of behaviour and preference manipulation in AI systems [1]

I am very excited about Assistance Games as a framework for training safe AI. It produces conservative, pragmatic agents that display desirable behaviors such as planning conditional on future feedback and relevance-aware learning. The framework also allows an agent to learn from a broader class of human actions than traditional Reward Learning, in which the human is restricted to stating preferences or answering questions. Yet challenges remain, such as picking a good prior over reward functions and an accurate model of the human’s policy. As the framework is the subject of active research, I am confident that technical and algorithmic innovations will keep improving it in these areas. On the other hand, it is unclear how to deal with the risk of preference manipulation in Assistance Games. Presumably, there are many potential applications of AI where this won’t be an issue. Still, any framework that aims to be a general paradigm for training safe AI will need to take AI’s influence on human preferences into account.

[1] Ashton and Franklin, The problem of behaviour and preference manipulation in AI systems, Proceedings of the Workshop on Artificial Intelligence Safety 2022, http://ceur-ws.org/Vol-3087/paper_28.pdf

[2] Christiano et al, Deep Reinforcement Learning from Human Preferences, ArXiv, 12th June 2017, https://arxiv.org/abs/1706.03741

[3] Franklin et al, Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI, ArXiv, https://arxiv.org/abs/2203.10525v2

[4] Hadfield-Menell et al, Cooperative Inverse Reinforcement Learning, ArXiv, 12th November 2016, https://arxiv.org/abs/1606.03137

[5] Lindner et al, Learning What To Do by Simulating the Past, ArXiv, 8th April 2021, https://arxiv.org/abs/2104.03946

[6] Malik et al, An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning, ArXiv, 11th June 2018, https://arxiv.org/abs/1806.03820

[7] Russell S., Human Compatible: Artificial Intelligence and the Problem of Control, Penguin, 2019

[8] Shah et al, Preferences Implicit in the State of the World, ArXiv, 12th February 2019, https://arxiv.org/abs/1902.04198

[9] Shah et al, Benefits of Assistance over Reward Learning, NeurIPS 2020, https://people.eecs.berkeley.edu/~russell/papers/neurips20ws-assistance

[10] Turner et al, Conservative Agency via Attainable Utility Preservation, ArXiv, 26th February 2019, https://arxiv.org/abs/1902.09725

[11] Woodward et al, Learning to Interactively Learn and Assist, ArXiv, 24th June 2019, https://arxiv.org/abs/1906.10187

