
The Alberta Plan: Sutton’s Research Vision for Artificial Intelligence | by Wouter van Heeswijk, PhD | Sep, 2022



Reinforcement Learning expert Richard Sutton outlines research directions for the next five to ten years

Photo by Johny Goerend on Unsplash

For anyone familiar with Reinforcement Learning, it is hard not to know who Richard Sutton is. The Sutton & Barto textbook is considered canonical in the field. I always find it highly inspirational to study the views of genuine thought leaders. Thus, when they present a new research vision, I’m primed to listen.

This summer, Sutton and his colleagues Bowling and Pilarski outlined a research vision for Artificial Intelligence, designing a blueprint for their research commitments in the next 5 to 10 years. The full document is only 13 pages long and accessibly written, so it doesn’t hurt to have a look.

The overarching theme of the research vision is to arrive at a complete understanding of intelligence. Although the vision is not concerned with immediate applications, successful completion of the research plan would be a substantial step towards genuine interaction between Artificial Intelligence and the complex reality we live in.

At the foundation of the outline is the “Common Model of the Intelligent Agent”, which the authors used to derive their so-called “base agent”:

The base agent has four components that each can be learned based on state signals: (i) perception, (ii) transition model, (iii) reactive policies and (iv) value functions. [source: Sutton et al., 2022]

In this representation, there are four components that can be learned (a short code sketch follows the list):

  • Perception: The representation of past experience, shaping the way observations are perceived. In essence, the real world is translated into an abstract state (containing appropriate features) needed for decision-making.
  • Reactive policies: Taking actions based on states. Typically the policy is aimed at maximizing cumulative rewards, but other objectives could be incorporated as well.
  • Value functions: Functions that attach expected future rewards to states and actions. Value functions guide the training of reactive policies, and also may embed multiple sorts of values.
  • Transition model: A function that predicts future states based on present states and actions. The model represents the agent’s understanding of the real-world system and may also reflect uncertainties.
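
To make these components a bit more tangible, here is a minimal Python sketch of my own (the class and attribute names are illustrative, not taken from the paper) showing how the four learned parts could hang together:

```python
import numpy as np

class BaseAgent:
    """Minimal sketch of the four learned components; names are illustrative only."""

    def __init__(self, n_features: int, n_actions: int):
        self.n_actions = n_actions
        self.value_weights = np.zeros(n_features)                            # value function
        self.policy_weights = np.zeros((n_actions, n_features))              # reactive policy
        self.model_weights = np.zeros((n_actions, n_features, n_features))   # transition model

    def perceive(self, observation: np.ndarray) -> np.ndarray:
        # Perception: map a raw observation to an abstract feature vector (the state).
        return np.tanh(observation)  # placeholder featurization

    def act(self, state: np.ndarray) -> int:
        # Reactive policy: pick the action with the highest preference for this state.
        return int(np.argmax(self.policy_weights @ state))

    def value(self, state: np.ndarray) -> float:
        # Value function: estimated cumulative future reward from this state.
        return float(self.value_weights @ state)

    def predict_next(self, state: np.ndarray, action: int) -> np.ndarray:
        # Transition model: predicted next abstract state, usable for planning.
        return self.model_weights[action] @ state
```

Each of these four pieces would be learned from the stream of experience; the point of the sketch is only to show where they sit relative to each other.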

Although the base agent representation is not shocking to those familiar with Reinforcement Learning or adjacent fields, it does have some interesting implications.

For instance, Sutton et al. posit that the transition model could take options rather than actions as input. An option, an action policy executed until a terminating condition is met, is a concept well known to financial practitioners, and it also maps onto many real-world decisions made under uncertainty.
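
As a rough illustration (my own, with hypothetical names, building on the BaseAgent sketch above), an option can be represented as an internal policy bundled with a termination condition, and the transition model can then be rolled forward under that option instead of under a single action:

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class Option:
    """An option: a policy plus a termination condition (hypothetical sketch)."""
    policy: Callable[[np.ndarray], int]             # action to take while the option is running
    should_terminate: Callable[[np.ndarray], bool]  # condition that ends the option

def imagine_option_outcome(agent, state: np.ndarray, option: Option, max_steps: int = 50) -> np.ndarray:
    """Predict where an option ends up by rolling the transition model forward,
    without taking any steps in the real world."""
    for _ in range(max_steps):
        if option.should_terminate(state):
            break
        state = agent.predict_next(state, option.policy(state))
    return state
```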

Another enrichment is the choice to maintain sets of policies and value functions. Traditionally, one would have a single policy mapping states to actions and a single value function measuring one metric. In contrast, the base agent representation allows multiple policies and value functions to exist in parallel. Such a representation more closely reflects the complexities and paradoxes of real life.

The research vision also deepens the concept of planning, utilizing the transition model to test and evaluate various scenarios. The authors state that this process should run in the background, without hampering the other elements. Simultaneously, learning processes operating in the foreground should be updated based on recent events, possibly supported by a short-term memory.
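
A very rough sketch of what such background planning could look like, in the Dyna spirit (my own simplification, continuing the hypothetical BaseAgent; a learned reward model is omitted): sample states from a short-term memory, imagine a transition with the model, and nudge the value function accordingly while the foreground loop keeps running.

```python
import numpy as np

def background_planning_step(agent, recent_states, step_size=0.01, gamma=0.99, n_updates=10):
    """Dyna-style background planning sketch: refine the value function from imagined transitions."""
    rng = np.random.default_rng()
    for _ in range(n_updates):
        state = recent_states[rng.integers(len(recent_states))]   # sample from short-term memory
        action = agent.act(state)
        next_state = agent.predict_next(state, action)            # imagined next state
        imagined_reward = 0.0                                      # a learned reward model would go here
        td_error = imagined_reward + gamma * agent.value(next_state) - agent.value(state)
        agent.value_weights += step_size * td_error * state       # semi-gradient TD update
```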

In summary, Sutton et al. advocate richer representations of agents in AI, ones that sit closer to the finer details and complexities of the real world. As research on such topics is sparse, there is plenty of inspiration to be drawn from the base agent.

Photo by Jorge Ramirez on Unsplash

The research vision defines Artificial Intelligence as signal processing over time. The underlying premise is that our base agent continuously interacts with a vastly complicated real world. To meaningfully predict and control input signals, the agent must continue to learn based on abstractions of reality.

Within this context, the authors distill four distinctive points that shape the research vision:

I. Using true observations

Many AI models rely on heavily curated observations for training. Dedicated training sets, human domain knowledge and hand-coded knowledge of reality are commonplace. However, such support may hamper generalizability and long-term scalability. The authors therefore prefer to simply use observations from reality as they arrive. This makes learning more challenging, but ultimately more robust across a wide array of applications.
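
In code, this preference boils down to a plain online loop that consumes each observation the moment it arrives, with no curated dataset in between (a sketch of my own, assuming the hypothetical BaseAgent above and a minimal environment interface):

```python
def online_learning_loop(agent, env, n_steps=1000, step_size=0.01, gamma=0.99):
    """Foreground learning: process raw observations one by one, as they arrive."""
    observation, reward = env.reset(), 0.0       # hypothetical environment interface
    state = agent.perceive(observation)
    for _ in range(n_steps):
        action = agent.act(state)
        observation, reward = env.step(action)   # hypothetical: returns (observation, reward)
        next_state = agent.perceive(observation)
        td_error = reward + gamma * agent.value(next_state) - agent.value(state)
        agent.value_weights += step_size * td_error * state   # learn from this event immediately
        state = next_state
```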

II. Temporal uniformity

Algorithms often play tricks with time to learn more efficiently. For instance, we collect batches of observations before updating, reduce learning rates once value functions appear to converge, or place different weights on certain types of rewards. Sutton et al. encourage treating every point in time the same, typically processing new events directly as they arrive.

Consistency is key. If we decrease learning rates in stable environments, for instance, we should also be willing to increase them when destabilization is observed. Especially in environments that change over time, or in meta-learning applications, this more even-handed attitude towards time is expected to bear fruit.
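
A toy illustration of such two-way adaptation (my own heuristic, not the meta-learning step-size methods the authors have in mind):

```python
def adapt_step_size(step_size, recent_errors, grow=1.05, shrink=0.95, low=1e-4, high=1.0):
    """Shrink the step size while prediction errors keep falling (stable environment),
    grow it again when errors start rising (possible destabilization or drift)."""
    if len(recent_errors) < 2:
        return step_size
    if abs(recent_errors[-1]) > abs(recent_errors[-2]):
        step_size *= grow     # the world may be changing: learn faster again
    else:
        step_size *= shrink   # estimates look stable: settle down
    return float(min(max(step_size, low), high))
```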

III. Leveraging computing power

Moore’s law is still in full swing, with computing power roughly doubling every two years. The consequence, however, is that sheer computational power itself becomes an increasingly strong determinant for agent performance. Think of the algorithms used to learn computer games, taking millions and millions of simulation runs to train massive neural networks.

Perhaps paradoxically, the exponential increase in computing power implies we should leverage it more carefully, focusing on methods that scale with computing power. The flipside is that non-scalable methods, such as domain expertise and human labeling, should receive less research attention.

IV. Multi-agent environments

Many settings include more than a single intelligent agent. Placing multiple agents within an environment can have a substantial impact, as agents respond to each other’s decisions and to signals stemming from communication, cooperation and competition. Thus, the agents themselves help shape their environment. In multi-player games this notion is highly visible, yet the resulting non-stationarity is also of great relevance in less stylized problem settings.
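
The non-stationarity becomes apparent when writing down a single joint step (a bare-bones sketch of my own, with a hypothetical joint-action environment interface): from the perspective of any one agent, the “environment” includes all the other agents, which are themselves still learning.

```python
def joint_step(env, agents, states):
    """One synchronous step in a shared environment: the joint transition depends on every agent's action."""
    actions = [agent.act(state) for agent, state in zip(agents, states)]
    observations, rewards = env.step(actions)   # hypothetical multi-agent interface
    return observations, rewards
```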

A captivating subfield of multi-agent environments is that of Intelligence Amplification, in which human performance is enhanced by a software agent and vice versa. In this symbiosis, learnings from one agent are used to improve the actions and understanding of another. Sutton et al. state that this research avenue is a crucial one to unlock the full power of AI.

For those genuinely interested in the concrete research activities that are mapped out, the original article (see below) provides a detailed roadmap. The topics are dense in technical detail and not easy to condense, so I decided to omit a summary in this article.

The research vision provides an interesting peek into the AI developments we may expect in the next decade, at least those coming out of Alberta, Canada. Although some research exists on all of the discussed topics, it is clear there are major gaps to fill in the years to come.

It is tempting to tailor AI algorithms to specific problems in a way that maximizes performance and crushes the benchmarks. From a research perspective, however, it makes sense to impose some simplifications (e.g., with respect to time and observations) to drive the design of robust algorithms that can be deployed on a wide variety of complex problems. This research vision appears to be a thoughtful step in that direction.

My personal takeaways from the research vision are that AI algorithms should:

  • Better align to the complexities and characteristics of real-world environments, and
  • Provide a more general and robust methodology to tackle a variety of problems without excessive design effort.

The concrete ways to achieve these objectives include finding appropriate abstractions in both state (i.e., features) and time (i.e., options). At the same time, planning must remain computationally efficient. Combined with richer representations of policies and values, the gap between AI and reality should become smaller and smaller.

