
Reinforcement Learning Basics: Understanding Stochastic Theory Underlying a Markov Decision Process | by Shailey Dash | Feb, 2023

Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (Image source: Wikipedia)

Reinforcement learning (RL) is a type of machine learning that enables an agent to learn to achieve a goal in an uncertain environment by taking actions. An important aspect of reinforcement learning is that it evaluates the actions taken rather than instructing by giving the correct actions. Each action has a reward associated with it, which signals to the agent how successful the action was in progressing towards the goal. The agent navigates the environment repeatedly, learning how to optimize its actions to reach its goal.

Until very recently, most of reinforcement learning’s successes were related to game-playing agents such as AlphaGo. However, reinforcement learning is increasingly being applied in real-world applications such as self-driving cars and robotic automation. The recently launched ChatGPT also has an RL component in its architecture for fine-tuning its answers. Given this, it becomes important to understand RL, which is acknowledged to have a steep learning curve, both from the theoretical perspective and in practical implementation.

This article focuses on the Markov Decision Process (MDP), which is often used to formalize the reinforcement learning agent’s problem. However, a neglected aspect of this conceptualization of the RL problem is that the MDP framework is an important application of stochastic theory.

There is a non-trivial amount of stochastic and probability theory that underlies the basic Markov Decision Process equations. Without understanding these equations and how they are derived, it is difficult to make progress with RL. This article aims to clarify the stochastic and probability concepts that underlie MDPs, with a focus on understanding the equations presented in Sutton (2017), Ch 3.

The basic MDP framework as presented in Sutton and Barto (2017), Ch 3, actually appears to be quite simple and intuitive — a sleight of hand achieved by a master! However, there is a huge body of statistical theory behind this framework. We don’t need to know the full depths of stochastic processes, because, believe me, it’s a lot! But it is important to be aware of some of the subtleties underlying the concepts used in MDPs. Hence it is important to understand what stochastic processes are, as their theory lies at the core of MDPs.

This article is the first in a sequence of articles that will aim to make the fundamentals of RL clearer both at the theoretical and practical application level.

Who is the intended audience?

Well, of course, it looks like it is for beginners, but I would say that this post is actually more useful to people who have some knowledge of RL and are trying to understand the equations presented in Sutton (2017) Ch 3 more intuitively. This article is the result of my learning journey into RL. I started quite happily with Sutton and Barto (2017) and thought I had understood most of the concepts in Ch 3 for the MDP framework. However, once I tried to actually derive some of the equations presented in Ch 3, I found myself unable to figure them out. So, it became clear to me that possibly more was going on than I seemed to know about. There is no real resource that I could find that relates MDPs to their statistical underpinnings. I had to use a variety of material ranging from Wikipedia to various university-level stochastic theory courses. Given how scattered the underlying material is, I decided to document it in this post.

Approach — Decoding Sutton and Barto (2017)

The other thing is the format of this post. Sutton (2017) represents the bible of RL theory. The current edition is an updated version available online, but much of the theoretical material was actually published 20 years ago in the first edition of the book. Also, the approach taken in Ch 3 of the book is the approach found across the internet. Given this, my approach in this post is a bit like a ‘key’, where I present material from the book and then explain it using the lens of stochastic theory.

A slight digression in defense of ‘keys’. Legions of students, especially in school, use them extensively; whereas teachers actually frown on them for spoon feeding. However, when you are learning online and typically on your own, then a detailed guide becomes important, as you can’t just put up your hand and ask a question.

Enough on the preamble! Let’s move to the actual topic….

Role of MDP

Understanding how the reinforcement learning problem works in the theoretical MDP environment is critical to getting the fundamentals of RL. Many important RL algorithms are either based on Markov equations and models or represent key departures from the Markov model. MDPs are a mathematically idealized form of the reinforcement learning process and provide a theoretical framework for an idealized type of reinforcement learning problem. Hence, there is no getting away from it: understanding MDPs is foundational to understanding the RL problem and how it is being addressed.

Be warned, stochastic theory is a particularly tough branch of probability theory, and at times the notation used can be very intimidating. My attempt is to bring out the stochastic concepts and ideas underlying MDPs more explicitly. Hopefully, the end result brings to the fore the complexities that underlie the Markov model.

In this article I split the content into three sections:

Section 1: Key Components of an MDP

– RL problem

– State, agent, environment, reward, etc.

Section 2: Probability and Stochastic Concepts Used in MDPs

– Random variables, state or sample space, stochastic processes, realization or trajectory

– Probability distributions, joint distributions, marginal distributions, conditional distributions

– Deriving the Markov property

– Probability transition Matrix

Section 3: Understanding Sutton Ch3 equations Using Probability and Stochastic concepts

– The MDP Model

– Formal presentation of MDP model

– Equations 3.2, 3.3, 3.4, 3.5, 3.6 from Ch3, Sutton (2017)

This probably seems like just the introductory part of an MDP. You may well ask: ‘what about value functions, policy, the Q value function, Bellman equations?’. Well, that comes in the next article; once I have gone through the above contents, you will agree that this by itself is a lot.

Section 1: Key Components of an MDP

First, to set context about what we are about to explain, I shall provide a brief summary of the basic MDP model and definitions used by Sutton (2017), Ch3.

Why MDP?

The major reason for the popularity of the MDP framework is the Markov property, whereby future states depend only on the current state and not on the history of states. Or, to put it another way, the current state encapsulates all the relevant information required for making decisions about future states. By the way, this seemingly innocuous assumption actually has a wealth of probability theory behind it, and hence also implications.

Because of this property, an MDP represents a tractable way to formalize sequential decision making. It provides a probabilistic framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. In this scenario, actions don’t just have immediate payoffs in terms of rewards, but also affect subsequent time periods or states, and through them future rewards. Of course, the way reinforcement learning is done in practice is different from the MDP.

Some Definitions to start off with

Agent: This is the learner or decision maker who makes sequential decisions to achieve an end goal in an environment.

Environment: This is defined as everything that is outside the agent’s locus of control. Hence the environment can be the external environment for a robot, it can be roads and pedestrians for a self-driving car, etc. In a video game, the environment is the game. The following points are important for the environment:

  • Observations are what the agent receives as input — for example a self-driving car can only receive inputs from the environment in its vicinity and will not be aware of a road block significantly ahead
  • The agent only has full knowledge of the environment if it knows the model of the environment, for example the structure and rules of a video game which govern the environment
  • The environment is changed by the agent’s actions (sudden braking causes a small jam), but can also change on its own (a road diversion)

What is the RL problem?

The typical RL problem involves a learning agent interacting over time with its environment to achieve a goal. To do this, a learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The fundamental problem is presented in the diagram below.

Agent environment interaction in a Markov decision process (Image source: Created by author, inspired by Sutton (2017) Ch3)

The measurement of time

Time is typically measured in discrete steps: t = 0, 1, 2, …, T, where T represents the final or terminal time step. Hence it is a finite, discrete-time problem. There are also continuous-time problems and infinite-horizon problems, but these are a lot tougher to handle mathematically. As a result the focus is on the finite discrete-time model.

State

This represents a snapshot of the environment and contains all the information pertaining to the environment. This includes information on rewards, how the environment will change with respect to actions, etc. In the case of a self-driving car, a state may represent a region around the car. Similarly, in the case of a robot moving forward, it would represent a snapshot of the environment once an action such as stepping forward has been taken.

For Markov purposes, the agent and environment interact in a series of discrete time steps: t = 0, 1, 2, …, T, where T represents the final or terminal time step. At each time step t, the agent receives some representation of the environment encapsulated as Sₜ ∈ S, where S is the set of all states. In the context of MDPs, the state is a discrete random variable.

Rewards

The agent receives a signal or reward Rₜ₊₁ from the environment in the next time period, which indicates how well the action is doing in terms of achieving the overall goal. The reward is typically a scalar, Rₜ ∈ R ⊂ ℝ, where ℝ is the set of real numbers (an example of complicated notation for something simple! But that’s the way things happen in statsworld 😊).

Rewards can be either deterministic or random. Sutton (2017) takes Rₜ to be a random variable, though most of the examples use a deterministic reward.

Now, Rₜ depends on the current state of the world, the action just taken, and the next state of the world, i.e., Rₜ = R(Sₜ, Aₜ, Sₜ₊₁). However, it is frequently simplified to depend only on the current state–action pair, Rₜ = R(Sₜ, Aₜ).

Now that we have the basic components of the MDP in place, i.e., state, reward, agent and environment, we need to better understand the nature of these variables in the context of probability and stochastic theory.

So, let’s now jump to Probability and Stochastic theory…

Section 2: Probability and Stochastic Basics

To define stochastic processes, we need to first understand what random variables, probability distributions, joint distributions, marginal distributions, conditional distributions and the chain rule of probability are.

Let us now spend some time understanding these concepts. Once we understand them, we will come back to MDPs and understand how they work using these concepts.

Since statistics and probability are vast, I would like to keep the focus on the applicability of these concepts to MDPs. So, I present the statistical concepts related to an MDP concept in italics and then explain their relevance to the MDP. Hopefully, this will make the MDP concept clearer.

Random variable

In probability, a random variable is a way of capturing the fact that a variable can take multiple values or outcomes. It represents a mapping or function from outcomes to values. For example, a toss of a coin has outcomes {H, T}, and a random variable can map these outcomes to numbers (say H → 1, T → 0). See this Wikipedia page for a more formal definition. As an aside, I have found the Wikipedia pages for probability and statistics to be an excellent, if slightly advanced, resource.

When the range of the random variable is countable, it is termed discrete, and its distribution is described by a discrete probability mass function (a concept that I will define later as it will be used a lot). If the variable is continuous, then its distribution can be described by a probability density function.

A random variable is traditionally defined for real-valued cases, denoted by ℝ. The definition can also be extended to any measurable set E, which may contain Boolean or categorical values, vectors, matrices or functions.

This is important to note because in MDPs we have real-valued random variables as well as categorical ones. Stochastic processes (to be defined a little later) are also random functions of time.

In the context of the MDP defined above, we have State and Reward as random variables. Both variables are discrete and have a finite range.

State space or sample space:

When we talk of a random variable, the set of values that it can take is termed the sample space — or the state space in the case of an MDP. The simplest example of a sample space is that of a coin toss: it can take one of two values, heads or tails.

In Sutton (2017), Ch 3, this is denoted by the set S, which contains the different states, where each state can represent a different situation. For example, we can define a stochastic process for the commute to the office. The states that S could potentially contain are:

1. s’: meets a jam on the way to the office

2. s: meets clear roads to the office

3. s’’: meets a road block due to an accident

These states are what constitute the sample space or range of a random variable, i.e., the values it can take.

Stochastic Process:

A stochastic process is a collection or ensemble of random variables indexed by a variable t, usually representing time. So, to be clear, a stochastic process comprises the same random variable at multiple points of time. The stochastic process is typically indexed by the time variable (it can also be indexed by a vector space, but let’s not confuse things). Thus, each index or point of time is associated with a specific random variable. Read more about stochastic processes on this Wikipedia page.

This means that at every point of time, the stochastic random variable can realize one of the values in S. A stochastic process can also be written as S(t, i), t ∈ T, where the state is a function of both time, as indexed by t, and i, the specific state from the state space S.

The state space is defined using elements that reflect the different values that the stochastic process can take. So, in the case of MDPs, a state Sₜ is a stochastic random variable over time, indexed by t, where for each t, Sₜ can take values in the finite set S = {s, s’, s’’}.

The point of confusion regarding stochastic processes is that the second index is often skipped to avoid messy presentation. However, without it, the clarity that there are actually two things going on is lost:

1. Movement over time, indexed by t, as the variable evolves

2. At each point of time, a choice of state from the set of states

Let us continue with our example of the road to illustrate the distinction between the stochastic process evolving over time and the different states it can take. There is a finite number of states which the stochastic variable can take at a point of time. So, for example, when we head out to the office we can potentially meet clear roads or a jam in the first 15 minutes. In the next 15 minutes, the situation may change. Suppose that it takes 45 minutes to get to the office; then we can break the commute up into 3 different time slots of 15 minutes. So, our time periods can be t = 1, 2, 3.

There is a probability p of a clear road, q of meeting a jam and r of meeting a road block in the first 15 minutes. In the next 15 minutes, one of these 3 outcomes can again occur, and similarly for the last 15 minutes. This is evolution over time. This means that, at each observation at a certain time, there is a certain probability of getting a certain outcome. This can be illustrated as below.

Office drive MDP (Image source: author)

The image shows a sample trajectory. At time t=1, i.e., first 15 minutes, the road can be clear, jammed or blocked with respective probabilities. Similarly for t=2 or second 15 minutes and so on. At a point of time, say, first 15 minutes, we can see the road state can vary randomly across 3 possible states. This represents the randomness in state at a point of time. However, if we look across time t, then we have a trajectory. So, if we have 3 time points, we will also have 3 random variables, one associated with each point of time.

We can also understand the reward set in this context. The reward set R in this case would be the time taken to cover the specific segment of the road at a particular time. There is a defined distribution of time (for simplicity we can keep it as discrete minutes) for each state. What does this mean? Basically, each state of the road would be associated with a distribution of travel times. Sometimes we cover the clear road (s) in 15 minutes, sometimes in 16 or 17 minutes due to random variations, and at other times in 18 minutes. So, we can define R(s) = {15, 16, 17, 18}. There is a probability distribution defined for the time taken to complete the segment under clear road conditions. This is an intuitive example of how the reward can also be random — something which most of the examples in Sutton (2017), Ch 3 do not cover. Each state of the road — for example, the jammed state — would have its own distribution over the ETA.

Realization or trajectory of a stochastic process

A stochastic process can have many outcomes, due to its randomness. A single outcome of a stochastic process is variously called a realization, an episode or a trajectory. It is formed by taking a single possible value of the random variable at each time point of the stochastic process as it evolves over time.

So, to continue with our traffic example: a possible trajectory is shown in the image by the arrows. In the first 15 minutes, the motorist meets a clear road, in the next 15 minutes hits a jam and in the final 15 minutes hits a blocked road. So, a potential sample path would be: (s, s’, s’’). This is just one possible realization. The motorist traverses this path every day and is likely to meet a slightly different set of states every day; another day it can be (s’’, s’, s).
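
To make the idea of a realization concrete, here is a minimal Python sketch of the road process. The state names and probabilities are hypothetical, and each 15-minute slot is drawn independently here for simplicity, ignoring any dependence between slots:

```python
import random

# Hypothetical per-slot probabilities for the road example
states = ["clear", "jammed", "blocked"]   # s, s', s''
probs = [0.5, 0.4, 0.1]                   # p, q, r -- illustrative values only

def sample_trajectory(n_slots=3):
    """Draw one realization: the road state observed in each 15-minute slot."""
    return [random.choices(states, weights=probs, k=1)[0] for _ in range(n_slots)]

# Each call produces one possible realization (episode/trajectory) of the process
for day in range(3):
    print(f"Day {day + 1}:", sample_trajectory())
```

Running this repeatedly gives different trajectories: one stochastic process, many possible realizations.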

Stochastic processes are widely used for probabilistic representations of systems and phenomena that appear to vary in a random manner. Or, to put it more simply, stochastic processes or models are used to estimate the probability of various outcomes of a random variable which also changes over time. Examples include the growth of a bacterial population, or the movement of a gas molecule. They are also used extensively in financial analysis, where stochastic models can be used to estimate situations involving uncertainty, such as investment returns, volatile markets, or inflation rates.

Further there are many different types of stochastic processes of which the Markov process is one type. Others, for example, include random walks, Gaussian processes, etc. Each stochastic process has its own specific assumptions and features which it is important to understand.

Probability Distribution

Now we turn to the question of how we get the probabilities of different states. Recall that in the above example we defined the probabilities of meeting a jam and having a clear road as q and p. Now, let us elaborate a bit on what this means in terms of probability distributions and how they come about.

The probability of a random outcome (the road is jammed) is assessed as the proportion of times the outcome would occur in a very long series of repetitions. So, this is how we would assess p, q and r: by noting the state-of-road variable at the specified time intervals across multiple days.

Recording all these probabilities of outcomes of a random variable, Sₜ, gives us the probability distribution of the random variable. In the discrete case it is termed the probability mass function (pmf). This is a function that provides the probability that a discrete random variable equals a particular value. Read more about this on Wikipedia.

To continue with our office commute example: the state variable can therefore take any of these values with a specific probability. For example, our state set for a road consists of 3 possible states: S = {s = clear, s’ = jammed, s’’ = blocked}.

The road can be in any of these 3 states at a specific point of time (say, morning) with a certain probability. For example, it can be clear with probability 0.5, jammed with probability 0.4, and blocked with probability 0.1 (blocked is a relatively rare state!). This set of probabilities is termed the probability distribution of the variable, which is the state of the road.
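
As a sketch of the long-run-frequency idea above, the distribution can be recovered (approximately) by counting how often each state is observed over many days. The observations here are simulated, so the “true” probabilities are only used to generate illustrative data:

```python
import random
from collections import Counter

states = ["clear", "jammed", "blocked"]
true_probs = [0.5, 0.4, 0.1]  # unknown in practice; used here only to simulate data

# Simulate the road state observed at the same time slot over many days
observations = random.choices(states, weights=true_probs, k=10_000)

# Empirical probability mass function: relative frequency of each state
counts = Counter(observations)
pmf = {s: counts[s] / len(observations) for s in states}
print(pmf)  # roughly {'clear': 0.5, 'jammed': 0.4, 'blocked': 0.1}
```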

Joint Distribution vs Marginal Distribution vs Conditional Distribution

In the case of stochastic processes, since the random variable is the same but measured over different points of time, it is likely that the state at time t, Sₜ, is correlated with the states in previous periods, Sₜ₋₁, Sₜ₋₂. In this case, if we want to understand the probability of Sₜ, it is the joint probability mass function (pmf) that is relevant. The joint pmf allows us to compute probabilities of events involving multiple random variables, taking into account the relationship between the variables.

Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution over all possible pairs of outputs.

This can be defined more generally in the Markov context as follows.

Suppose we consider a discrete-time, two-period stochastic process Sₙ, where n = 1, 2. So, S₁ and S₂ are two discrete random variables which can take values from the set i ∈ S = {1, 2}. The joint distribution is given, for every time period and finite sequence of states (i₁, i₂), by P(S₁ = i₁, S₂ = i₂).

Let us look at our simple road example:

For simplicity assume that there are only 2 time periods — i.e., Sₜ and Sₜ₋₁. The state space is still 3: {clear (C), jammed (J), blocked (B)}.

The joint probability table would be as follows:

2 period office drive joint probability distribution ( source:author)

The joint probability is the probability of two events occurring together. The table above provides the different combinations of instances of both events. Notice, since this is a stochastic process, time t-1 always comes first and then t.

Marginal probability distribution

The marginal distribution of a variable gives the probabilities of the various values of that variable without reference to the values of the other variables. The marginal mass function of Sₜ is P(Sₜ = j) = Σᵢ P(Sₜ₋₁ = i, Sₜ = j).

The marginal probability is the probability of a single event occurring. Essentially, the marginal distribution is the column-wise sum of probabilities: the equation requires you to sum the column for a given j over all values i. The joint probability for any combination is given by the cell probabilities. In our example’s context, the marginal is the probability of meeting a jam in Sₜ irrespective of what the road conditions were in Sₜ₋₁, i.e., we sum the probabilities down the column J.

Conditional probability

A conditional probability is the probability that a specific event occurs given that another specific event has already occurred. In our example, this means the probability of a jam in Sₜ given that the road was clear in Sₜ₋₁.

Now in our case, because Sₜ is a stochastic process and we are looking at a two-period example, the joint and conditional probabilities are closely linked: the joint distribution is simply the conditional distribution times the marginal distribution of Sₜ₋₁, as defined next.

Mathematically, the conditional distribution of a variable given another variable is the joint distribution of both variables divided by the marginal distribution of the other variable.

Defining this generally, it is: P(X₂ = x₂ | X₁ = x₁) = P(X₁ = x₁, X₂ = x₂) / P(X₁ = x₁).

This generalizes to the case where we have n random variables, X₁, …, Xₙ.

The joint probability distribution of these n variables is P(X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ).

This can be written as a conditional function times a marginal function, based on the chain rule of probability. In terms of notation, Pₓ(x₁) = P(X = x₁). For the 2 random variables X₁, X₂, the joint distribution can be written as P(X₁ = x₁, X₂ = x₂) = P(X₂ = x₂ | X₁ = x₁) · P(X₁ = x₁).

Stated in words: the joint probability distribution of 2 variables X₁, X₂ taking specific values x₁, x₂ is given by the conditional probability of X₂ = x₂ given that X₁ = x₁, times the marginal probability of getting X₁ = x₁.
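
To tie the joint, marginal and conditional distributions together numerically, here is a small sketch using a made-up joint table for (Sₜ₋₁, Sₜ) over the road states. The numbers are purely hypothetical; they only need to sum to 1:

```python
import numpy as np

states = ["C", "J", "B"]  # clear, jammed, blocked

# Hypothetical joint pmf P(S_{t-1} = i, S_t = j); rows = S_{t-1}, columns = S_t
joint = np.array([
    [0.30, 0.15, 0.05],   # S_{t-1} = C
    [0.10, 0.20, 0.05],   # S_{t-1} = J
    [0.02, 0.05, 0.08],   # S_{t-1} = B
])
assert np.isclose(joint.sum(), 1.0)

# Marginal of S_{t-1}: sum each row over j; marginal of S_t: sum each column over i
marginal_prev = joint.sum(axis=1)
marginal_curr = joint.sum(axis=0)

# Conditional P(S_t = j | S_{t-1} = i) = joint / marginal of S_{t-1}
conditional = joint / marginal_prev[:, None]

print("P(S_t):", dict(zip(states, marginal_curr.round(3))))
print("P(S_t | S_{t-1} = C):", dict(zip(states, conditional[0].round(3))))
```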

Why do we need these distributions — deriving the need for the Markov Property

Joint, conditional and marginal probability distributions become important in the context of MDPs when we want to determine the probabilities of the state variable taking specific values as it transitions across different time periods. Our example is simple and consists of 3 time periods. But what if there were 100 time periods, and we wanted to know the probability of the state taking a specific value in the 100th period? This is the underlying probability problem. Let’s look at how the general case is solved and then at what the Markov property is.

For the more general case, the joint probability of the random variables X₁, …, Xₙ taking specific values x₁, …, xₙ can be found as the conditional probability of Xₙ given the entire history of previous random variables taking specific values, times the joint probability of the first n−1 variables: P(Xₙ = xₙ, …, X₁ = x₁) = P(Xₙ = xₙ | Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁) · P(Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁).

We can expand the joint probability P(Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁) again into the conditional probability of Xₙ₋₁ given the history (Xₙ₋₂, …, X₁) and the joint probability of (X₁, …, Xₙ₋₂).

This can be further unrolled backwards. We finally get an expression consisting of a sequence of conditional probabilities and the marginal probability of the first variable, P(X₁ = x₁).
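
Written out in full (a standard identity, shown here in LaTeX for clarity), the unrolled chain rule is:

```latex
P(X_n = x_n, \ldots, X_1 = x_1)
  = P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1)
    \, P(X_{n-1} = x_{n-1} \mid X_{n-2} = x_{n-2}, \ldots, X_1 = x_1)
    \cdots P(X_2 = x_2 \mid X_1 = x_1) \, P(X_1 = x_1)
```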

Now let us translate this to the MDP context. So far, this has been couched in terms of an arbitrary sequence of random variables. However, this result translates directly to stochastic processes and Markov processes. We just need to replace Xᵢ by Sₜ, so that the indexing is in terms of time, t = 1, 2, …, n, and the values xᵢ are now states.

Why am I belaboring the obvious? Because there is a lot of material out there that uses different variables and indices for Markov processes and other sequences of random variables and this can get hugely confusing. I tried to unravel this for myself and hence am writing it down, so that it can help other non-statisticians who are trying to figure out the probability theory behind Markov results.

For analyzing the MDP, t=1,..,n and states xᵢ ∈ S. Just to be clear xₙ is a specific value of a state at time period t=n.

Let’s write the joint probability of getting S₁, …, Sₙ as the state variable evolves over t = 1, …, n. Substituting into the general case discussed above, we get the joint probability as a sequence of conditional probabilities and the marginal distribution of S₁: P(S₁ = x₁, …, Sₙ = xₙ) = P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁, …, S₁ = x₁) ⋯ P(S₂ = x₂ | S₁ = x₁) · P(S₁ = x₁).

To put it in words: the probability of getting states S₁, …, Sₙ with specific values is equal to the conditional probability of the state at time t = n having the value xₙ given the historical values of the previous states, times the joint distribution of the first n−1 periods. This unrolls into a sequence of conditional probabilities and the marginal or initial distribution of the state S₁.

Now, this is just a stochastic process where the value of a state in time period n depends on the entire past history of state values. This is clearly intractable to compute — hence the Markov property.

Markov Property:

The Markov property makes a state Sₙ depend only on the immediately preceding state, Sₙ₋₁, rather than the entire historical backlog of states. In this case our conditional probability for state Sₙ reduces to: P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁, …, S₁ = x₁) = P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁).
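
Under the Markov property, the unrolled joint distribution therefore collapses to a product of one-step conditionals and the initial distribution:

```latex
P(S_1 = x_1, \ldots, S_n = x_n)
  = P(S_1 = x_1) \prod_{k=2}^{n} P(S_k = x_k \mid S_{k-1} = x_{k-1})
```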

However, we still have n−1 one-step conditional probabilities to be estimated, along with the initial probability distribution P(S₁ = x₁). This is still somewhat complex, as there is a large number of conditional probabilities to be calculated. A further assumption that is made to simplify and make the calculations more tractable is time homogeneity of the transition probabilities.

Time homogeneity of transition probabilities

This basically says that the probability of transitioning from state i in time period n to state j in period n+1 is the same irrespective of the time period: P(Sₙ₊₁ = j | Sₙ = i) = pᵢⱼ for all n.

What this means is that the probabilities of transitioning from one state to the next are fixed, irrespective of time.

To put it in the context of our road example, the probability of going from a clear road to a jammed one is fixed, irrespective of whether we meet these 2 states in the first 2 time periods or the last 2 time periods. So, the transition probabilities reduce to the following 9 cases, irrespective of time, where C = clear, J = jammed and B = blocked: C→C, C→J, C→B, J→C, J→J, J→B, B→C, B→J and B→B.

As you can see, if the number of states is smaller than the number of time periods, as is usually the case, this significantly reduces the number of probabilities to be calculated.

Transition probabilities

The state transition probability matrix of a Markov chain gives the probabilities of transitioning from one state to another in a single time step.

For the two-period case, t = 1, 2, this reduces the dimensions of the problem to P(S₁ = x₁, S₂ = x₂) = P(S₂ = x₂ | S₁ = x₁) · P(S₁ = x₁).

This breaks up into the following components:

The conditional (one-step transition) probability: P(S₂ = x₂ | S₁ = x₁)

The initial distribution or marginal distribution: P(S₁ = x₁)

In our road example, these were the initial probabilities for the 3 states: p, q, r.

So, all we need is the one step conditional probabilities and the initial distribution of states to get the joint probability of transitioning to a state.

The transition matrix can be represented as below:

Source: Wikipedia Stochastic Matrix
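
In generic form, the transition matrix of a three-state chain (such as our road example) looks like this, where pᵢⱼ is the probability of moving from state i to state j in one step:

```latex
P = \begin{pmatrix}
  p_{11} & p_{12} & p_{13} \\
  p_{21} & p_{22} & p_{23} \\
  p_{31} & p_{32} & p_{33}
\end{pmatrix}
```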

In the transition matrix the rows represent the current state and the columns the future state. Thus p₁₂ is the probability of transitioning from state 1 in time t to state 2 in t+1.

Also note that the sum of all branch probabilities out of a state has to be one. This is logical, as you have to transition from a state to some state (possibly the same one). Mathematically this is written as: Σⱼ pᵢⱼ = 1 for every state i.

Why do we sum over ‘j’? Because j indexes the possible states that you can transition to from ‘i’.

Markov Chain dynamics

For a Markov chain, we use the word dynamics to describe the variation of the state over a short time interval starting from a given initial state. The evolution of a Markov chain is a random process, so we cannot say exactly what sequence of states will follow the initial state. In reinforcement learning we wish to predict future states given the current or initial state. A prediction of the future state Sₙ given S₁ can be calculated easily once we have the probability transition matrix and the initial distribution. The n-step transition probability matrix can be found by multiplying the single-step probability matrix by itself n times.

This is the beauty of the Markov system and why it is used in reinforcement learning. Once we have a tractable set of transition probabilities, we can calculate the probability of transitioning from any initial state to state j in n steps.
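
Here is a minimal numerical sketch of these Markov chain dynamics, using a hypothetical one-step transition matrix for the road example (rows are the current state C, J, B; columns are the next state):

```python
import numpy as np

states = ["C", "J", "B"]

# Hypothetical one-step transition matrix P(S_{t+1} = j | S_t = i)
P = np.array([
    [0.70, 0.25, 0.05],  # from clear
    [0.30, 0.60, 0.10],  # from jammed
    [0.20, 0.50, 0.30],  # from blocked
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row must sum to 1

# Initial distribution over states (p, q, r from the example)
initial = np.array([0.5, 0.4, 0.1])

# n-step transition matrix = the single-step matrix multiplied by itself n times
n = 3
P_n = np.linalg.matrix_power(P, n)

# Distribution over states after n steps, starting from `initial`
dist_n = initial @ P_n
print(dict(zip(states, dist_n.round(3))))
```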

Now, after this very long digression into the stochastic and probability theory behind MDPs, I will briefly review the MDP equations in Sutton (2017), Ch 3.

Section 3: Understanding Sutton Ch3 MDP Equations Using Probability and Stochastic Concepts

The MDP Model

In this section I briefly define the MDP giving what is essentially the textbook definition. In the next section we will look at the statistical theory underlying MDPs.

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It proposes that whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.

Sutton (2017), Ch 3, p. 50

Most of the MDP framework can actually be understood at a high level without detailed knowledge of probability and stochastic theory. However, now that we have exposure to those concepts we can actually appreciate what is going on and its complexity. Let us now look at some of the equations related to the Markov process that are presented in Sutton (2017), ch3.

Markov Property

This is essentially the idea that the future is independent of the past, given the present. It is conceptualized probabilistically as: Pr{Sₜ₊₁ = s’ | Sₜ, Sₜ₋₁, …, S₀} = Pr{Sₜ₊₁ = s’ | Sₜ}.

This has already been discussed at length above and enables us to understand how the basic Markov problem can be made more tractable. Stochastic processes that satisfy the Markov property are typically much simpler to analyze than models that incorporate the full history of the process. In practice, it is important to consider whether the Markov property will hold. For many scenarios — for example, the position of a car — the present location determines where it can go in future states. But in some instances history may be important: for example, a trading agent may have to consider past trends in a share as well as the present price to determine whether to buy or sell.

A Markov process is a stochastic process with the following properties:

(a.) The number of possible outcomes or states is finite

  • This is there for analytical convenience. Infinite processes can also be handled with some minor manipulation. However, for time dependent processes, we assume time is finite t=0,1,2…,T.

(b.) The outcome at any stage depends only on the outcome of the previous stage — given by Markov property discussed above.

(c.) The transition probabilities are constant over time.

(d.) The system is stationary — this is not always called out explicitly, but property (c) implies it.

Formal representation of MDP

Formally, given the Markov property, an MDP is defined as a tuple which typically contains 5 elements:

S: Set of states in the environment: this is actually a stochastic process

A: Set of actions possible in each state

R: Reward function in response to actions: This is normally deterministic, but sometimes can be stochastic as well

p : Probability matrix of transitioning from one state to the next — This is a critical part of an MDP. It is defined by the joint and conditional distribution of the state variable.

γ : Discount factor to be applied for rewards in the distant future

The MDP and agent together thereby give rise to a sequence or trajectory that begins like this: S₀, A₀, R₁, S₁, A₁, R₂, S₂, A₂, R₃, …

This process is a stochastic process and is finite, i.e., has T time steps with a clear terminal state.

Understanding equations

Equation 3.2
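
For reference, equation 3.2 in Sutton and Barto defines the four-argument dynamics function p:

```latex
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
```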

This equation tells us that Rₜ and Sₜ are random variables with discrete probability mass distributions which depend on the previous state and action. Equation 3.2 defines the dynamics function p. This is a function that takes in 4 arguments. It also specifies a conditional distribution for each choice of state and action. For a specific value of state s’ and reward r, we can estimate the probability of these values occurring at time t, given the preceding state and action values at t-1.

Equation 3.3
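
Equation 3.3 states that, for every state–action pair, these probabilities sum to one:

```latex
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1,
\quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
```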

Equation 3.3 basically tells us that for each value of ‘s’ and ‘a’, there is a joint distribution of states and rewards. This is basically a joint conditional distribution, hence it must sum to 1. The joint distribution will have the probabilities of different combinations of s’ and r. The sum of all probabilities of all combinations must sum to 1 (refer to Wikipedia).

Let’s go back to our road example. At each state of the road, we have two actions: drive or stop. Suppose we are in the state ‘jammed’ and we choose to drive. Then, conditional on these two, there is a joint distribution of next states and rewards.

State transition probabilities — Equation 3.4
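
Equation 3.4 obtains the state-transition probabilities from the dynamics function:

```latex
p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\}
  = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)
```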

The state transition probabilities are obtained by summing equation 3.2 over the rewards (note for the probability-challenged: this is like a marginal probability distribution of states — refer to Wikipedia).

Expected Rewards — Equation 3.5
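
Equation 3.5 defines the expected reward for a state–action pair:

```latex
r(s, a) \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right]
  = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
```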

Here we are again working with equation 3.2. This time we sum over the probabilities of next states (note for the probability-challenged: like a marginal probability distribution of rewards — refer to Wikipedia). This gives us the marginal probability distribution of rewards (the expression on the extreme RHS of 3.5). We then multiply each probability by the actual reward r and sum to get the expected reward.

Expected rewards for state–action–next-state triples — Equation 3.6
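
Equation 3.6 defines the expected reward for a state–action–next-state triple:

```latex
r(s, a, s') \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'\right]
  = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
```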

This one is a little trickier. The joint probability of reward and next state, conditional on ‘s’ and ‘a’, is divided by the state-transition probability, giving the conditional distribution of the reward given (s, a, s’). This is an application of the chain rule for joint distributions (refer to Wikipedia).

Let’s substitute the terms from our road example: the current state s (say, jammed), the action a (drive) and the next state s’. This gives the conditional distribution of the reward (the travel time) for that particular transition.

We can then take the expectation over Rₜ by multiplying each reward value by its conditional probability and summing.

So, this completes the basic equations derived for the set-up of the MDP. It is important to understand each of 3.2–3.6, as Sutton (2017) mentions that each of these is a way to represent the MDP and will be used again in subsequent chapters.

To Conclude..

Well friends, this has been quite a journey through probability and stochastic theory. However, I hope what you have realized is the complexity and beauty of the underlying Markov system. Without understanding these concepts, it is difficult to figure out the basic MDP equations and how they are derived, and it is likely that one will become totally lost when it comes to policies, value functions and Q values. These will be picked up in the next installment of this tutorial.

I hope you liked my writing. Please consider following me for more such articles as I proceed with reinforcement learning: https://medium.com/@shaileydash

Also, do let me know your inputs and suggestions in the comments, which will help me revise this.

You can follow me on Linkedin for more corporate oriented articles on AI and DS: www.linkedin.com/in/shailey-dash-phd-8b4a286

References

I am listing quite a long list of stats and probability theory resources. They are all excellent, if difficult.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction; 2nd Edition. 2017.

https://en.wikipedia.org/wiki/Markov_decision_process

https://en.wikipedia.org/wiki/Stochastic_process

https://en.wikipedia.org/wiki/Joint_probability_distribution

https://en.wikipedia.org/wiki/Conditional_probability_distribution

https://www.kent.ac.uk/smsas/personal/lb209/files/notes1.pdf

Lawler, G.F. (2006). Introduction to Stochastic Processes (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781315273600

Yates, R.D. and Goodman, D.J., 2014. Probability and stochastic processes: a friendly introduction for electrical and computer engineers. John Wiley & Sons.

Unnikrishna Pillai, Probability and Stochastic Processes, Lecture 8, NYU, Tandon school of Engineering, https://www.youtube.com/watch?v=nFF6eRQT2-c&t=455s


Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (Image source: Wikipedia)

Reinforcement learning (RL) is a type of machine learning that enables an agent to learn to achieve a goal in an uncertain environment by taking actions. An important aspect of reinforcement learning is that it evaluates the actions taken rather than instructs by giving correct actions. Each action has a reward associated with it which signals to the agent the success of the action in progressing towards the goal. The agent navigates the environment repetitively learning how to optimize to reach its goal.

Till very recently most of reinforcement learning’s successes where related to game playing agents such as Alpha Go. However increasingly reinforcement learning is being applied in real world applications such as self-driving cars, and robotic automation. The recently launched ChatGPT also has a RL component in its architecture for finetuning of answers. Given this, it becomes important to understand RL which has been acknowledged to have a steep learning curve both from the theoretical perspective and also the practical implementation.

This article focuses on the Markov Decision model (MDP) which is often used to formalize the reinforcement agent’s problem. However, a neglected aspect of this conceptualization of the RL problem is that the MDP framework is an important application of Stochastic theory.

There is a non-trivial amount of stochastic and probability theory that underlies the basic Markov Decision Process equations. Without understanding these equations and how they are derived, it is difficult to make progress with RL. This article aims to clarify the stochastic and probability concepts that underlie MDPs with a focus on understanding equations presented in Sutton(2017) Ch 3.

The basic MDP framework as presented in Sutton and Barto (2017), Ch3, actually appears to be quite simple and intuitive — a sleight of hand achieved by a master! However, there is a huge backlog of statistical theory behind this framework. We don’t need to know the depths of Stochastic Processes, because, believe me, it’s a lot! But it is important to be aware of some of the subtleties underlying some of the concepts used in MDPs. Hence it is important to understand what are stochastic processes as the theory of these lies at the core of MDPs.

This article is the first in a sequence of articles that will aim to make the fundamentals of RL clearer both at the theoretical and practical application level.

Who is the intended audience?

Well, of course, it looks like it is for beginners, but I would say that this post is actually more useful to people who have some knowledge of RL and then try to understand the equations presented in Sutton (2017) Ch 3 more intuitively. This article is the result of my learning journey into RL. I started quite happily with Sutton and Barto (2017) and thought I had understood most of the concepts in Ch 3 for the MDP framework. However, once I tried to actually derive some of the equations presented in ch3, I found myself unable to figure them out. So, it became clear to me that possibly more was going on than I seemed to know about. There is no real resource available that I could find that relates MDPs to their underlying statistical underpinnings. I had to use a variety of material ranging from Wikipedia to various university level Stochastic theory courses. Given how scattered the underlying material is, I decided to document it in this post.

Approach — Decoding Sutton and Barto (2017)

The other thing is the format of this post. Sutton (2017) represents the bible of RL theory. The current edition is an updated version available online. But, much of theoretical material was actually published 20 years ago in the first edition of the book. Also, the approach taken in Ch3 of the book is the approach found across the internet. Given this, my approach in this post, is a bit like a ‘key’ where I present material from the book and then explain it using the lens of stochastic theory.

A slight digression in defense of ‘keys’. Legions of students, especially in school, use them extensively; whereas teachers actually frown on them for spoon feeding. However, when you are learning online and typically on your own, then a detailed guide becomes important, as you can’t just put up your hand and ask a question.

Enough on the preamble! Let’s move to the actual topic….

Role of MDP

Understanding how the reinforcement learning problem works in the theoretical MDP environment is critical to getting the fundamentals of RL. Many important RL algorithms either are based on Markov based equations or models or represent some key departures from the Markov model. MDPs are a mathematically idealized form of the reinforcement learning process and provide a theoretical framework for an idealized type of reinforcement learning problem. Hence, no getting away from it: understanding MDPs is foundational to understanding the RL problem and how it is being addressed.

Be warned, Stochastic Theory is a particularly tough branch of probability theory and at times the notation used can be very intimidating. It shall be my attempt to bring the underlying stochastic concepts and ideas underlying MDPs more explicitly. Hopefully, the end result brings to the front the complexities that underlie the Markov model.

In this article I split up the content in 3 sections:

Section 1: Key Components of an MDP

– RL problem

– State, agent, environment, reward, etc.

Section 2: Probability and Stochastic Concepts Used in MDPs

– Random variables, state or sample space, stochastic processes, realization or trajectory

– Probability distributions, joint distributions, marginal distributions, conditional distributions

– Deriving the Markov property

– Probability transition Matrix

Section 3: Understanding Sutton Ch3 equations Using Probability and Stochastic concepts

– The MDP Model

– Formal presentation of MDP model

– Equations 3.2, 3.3, 3.4, 3.5, 3.6 from Ch3, Sutton (2017)

This probably seems like just the introductory part of an MDP. You may well ask: ‘what about value functions, policy, Q value function, Bellman equations?’. Well, that comes in the next article, as once I have gone through with the above contents, you will agree that this by itself it is a lot.

Section 1: Key Components of an MDP

First, to set context about what we are about to explain, I shall provide a brief summary of the basic MDP model and definitions used by Sutton (2017), Ch3.

Why MDP?

The major reason for the popularity of the MDP framework is due to the Markov property where future states depend only on the current state and not the history of states. Or, to put it another way, the current state encapsulates all the relevant information required for making decisions about the future states. By the way, this seemingly innocuous assumption actually has a wealth of probability theory behind it and hence also implications.

Because of this property, an MDP represents a tractable way to formalize sequential decision making. It provides a probabilistic framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. In this scenario, actions don’t just have immediate payoffs in terms of rewards, but also affect subsequent time periods or states, and through them future rewards. Of course, the way reinforcement learning is done in practice is different from the MDP.

Some Definitions to start off with

Agent: This is the learner or decision maker who makes sequential decisions to achieve an end goal in an environment.

Environment: This is defined as everything that is outside the agent’s locus of control. Hence the environment can be the external environment for a robot, it can be roads and pedestrians for a self -driving car, etc. In a video game, the environment is the game. The following points are important for the environment:

  • Observations are what the agent receives as input — for example a self-driving car can only receive inputs from the environment in its vicinity and will not be aware of a road block significantly ahead
  • The agent only has full knowledge of the environment if it knows the model of the environment, example the structure of rules of a video game which govern the environment
  • The environment is changed by agent’s actions (sudden breaking causes a small jam), but can also change on its own (a road diversion)

What is the RL problem?

The typical RL problem involves a learning agent interacting over time with its environment to achieve a goal. To do this, a learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The fundamental problem is presented in this diagram in below.

Agent environment interaction in a Markov decision process (Image source: Created by author, inspired by Sutton (2017) Ch3)

The measurement of time

Time is typically measured in discrete steps: t=0,1,2,..T. Where, T represents the final or terminal state. Hence it is a finite discrete time problem. There are also continuous time problems and also infinite horizon problems, but these are a lot tougher to handle mathematically. As a result the focus is on the finite discrete time model.

State

This represents a snapshot of the environment and contains all the information pertaining to the environment. This includes information on rewards, how the environment will change with respect to actions, etc. In the case of a self-driving car, a state may represent a region around the car. Similarly, in the case of a robot moving forward, it would represent a snapshot of the environment once an action such as stepping forward has been taken.

For Markov purposes, the agent and environment interact in a series of discrete time steps: t=0,1,2…T. Where, T represents the final or terminal state. At each time step, t, the agent receives some representation of the environment encapsulated as Sₜ ∈ S, where S is the set of all states. In the context of MDPs, State is a discrete random variable.

Rewards

The agent receives a signal or reward Rₜ₊₁ from the environment in the next time period which is associated with how well the action is doing in terms of achieving the overall goal. The reward is typically defined as a scalar where Rₜ ∈ R ⊂ℝ is just the set of real numbers (example of complicated notation for something simple! But that’s the way things happen in statsworld😊).

Rewards can be both deterministic or a random variable. Sutton (2017) takes Rₜ to be a random variable, though most of the examples take a deterministic reward.

Now, Rₜ depends on the current state of the world, the action just taken, and the next state of the world, i.e., Rₜ = R(Sₜ, Aₜ, Sₜ₊₁). However, its frequently simplified to depend on the current state and action pair, Rₜ = R(Sₜ, Aₜ).

Now that we have the basic components of the MDP in place, i.e., State, Reward, Agent, Environment, we need to better understand the nature of the MDP variables State, Reward, Agent, Environment in the context of probability and stochastic theory.

So, let’s now jump to Probability and Stochastic theory…

Section 2: Probability and Stochastic Basics

To define stochastic processes, we need to first understand what are random variables, probability distributions, joint distributions, marginal distribution, conditional distributions and the chain rule of probability.

Let us now spend some time understanding the above concepts. Once we understand these, we are going to come back to MDPs and understand how it works using these concepts.

Since stats and probability is vast, I would like to keep the focus on the applicability of these concepts to MDPs. So, I present the statistical concepts related to an MDP concept in italics and then explain its relevance to the MDP. Hopefully, this will make the MDP concept clearer.

Random variable

In probability, a random variable is a way of capturing the fact that a variable can take multiple values or occurrences. It represents a mapping or function that captures the fact that the variable can be mapped to many values. For example a toss of a coin has outcomes: {H,T} and we can map this to a number lying between 0 and 1 signifying probability. See this Wikpedia page for a more formal definition. As an aside, I have found the Wikipedia pages for probability and statistics to be an excellent, if slightly advanced, resource.

When the range of the random variable is countable, it is termed discrete and its distribution is a discrete probability mass function ( a concept that I will define later as it will be used a lot). If the variable is continuous, then its distribution can be defined by a probability distribution function.

The definition of a random variable is traditionally defined for real valued cases, denoted by R. The definition can also be extended to any measurable set E which contains random Boolean values or categorical values, vectors, matrices or functions.

This becomes important to note because in MDPs we have real valued random variables as well as categorical types. Also stochastic processes (to be defined a little later) are also a random function of time.

In the context of the MDP defined above, we have State and Reward as random variables. Both variables are discrete and have a finite range.

State space or sample space:

When we talk of a random number, then the no. of finite number of states that a random variable can take is termed as the sample state or state space in the case of an MDP. The simplest example of a sample space is that for a throw of coin. It can take one of two values: heads or tails.

In Sutton ch3, this is denoted by the set S which contains different states, where each state can be representative of a different situation. For example, we can have a stochastic process for going to office. The list of states that S could potentially contain:

1. s’: meets a jam on the way to commuting to office

2. s: Meets clear roads to office

3. s”: Meets a road block due to an accident

These states are what constitute the sample space or range of a random variable, i.e., the values it can take.

Stochastic Process:

A stochastic process is a collection or ensemble of random variables indexed by a variable t, usually representing time. So, to be clear, a stochastic process comprises the same random variable at multiple points of time. The stochastic process is typically indexed by the time variable (it can also be vector space, but let’s not confuse things). Thus, each index or point of time is associated by a specific random variable. Read more about stochastic processes on this Wikipedia page.

This means that at every point of time, the stochastic random variable can realize one of the values in S. A stochastic process can also be written as: S(t,i), t∈T. Where the State is a function of both, time as indexed by t, and i representing the specific state from the state space S.

The state space is defined using elements that reflect the different values that the stochastic process can take. So, in the case of MDPs, a state, Sₜ , is a stochastic random variable over time, indexed by t, where for each t, Sₜ can take values in the finite set S ={s,s’,s’’’}.

The point of confusion regarding stochastic processes is that often the second index is skipped to avoid messy presentations. However, without it, the clarity that there are actually 2 things going on is lost:

1. Movement overtime indexed by t as the variable evolves

2. At each point of time a choice of state from the set of states

Let us continue with our example of the road to illustrate the distinction between the stochastic process evolving over time and different state spaces. There are a finite number of state spaces which the stochastic variable can take at a point of time. So, for example, when we head out to office we can potentially meet with clear roads or a jam on the way for the first 15 minutes. In the next 15 minutes, the situation may change. Suppose that it requires 45 minutes to get to office, then we can break it up into 3 different time slots of 15 minutes. So, our time periods can be t=1,2,3.

There is a probability p of clear road, q of meeting a jam and r of meeting a road block in the first 15 minutes. In the next 15 minutes, one of these 3 outcomes can again occur and similarly for the last 15 minutes. This is evolution over time. This means that, at each observation at a certain time, there is a certain probability to get a certain outcome. This can be illustrated as below.

Office drive MDP (Image source: author)

The image shows a sample trajectory. At time t=1, i.e., first 15 minutes, the road can be clear, jammed or blocked with respective probabilities. Similarly for t=2 or second 15 minutes and so on. At a point of time, say, first 15 minutes, we can see the road state can vary randomly across 3 possible states. This represents the randomness in state at a point of time. However, if we look across time t, then we have a trajectory. So, if we have 3 time points, we will also have 3 random variables, one associated with each point of time.

We can also understand the reward set in this context. The reward set R would here be the time taken to cover a specific segment of road at a particular time. There is a distribution of times (for simplicity, in discrete minutes) for each state. What does this mean? Basically, each state of the road is associated with a distribution over travel time: sometimes we cover the clear-road segment (s) in 15 minutes, sometimes in 16 or 17, and occasionally in 18 minutes, due to random variation. So we can define R(s) = {15, 16, 17, 18}, with a probability distribution over the time taken to complete the segment under clear road conditions. This is an intuitive example of how the reward can also be random, something most of the examples in Sutton (2017), Ch 3 do not cover. Each state of the road (for example, the jammed state) would have its own distribution over the travel time.
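To make this concrete, here is a minimal Python sketch of the office-drive process. The state probabilities and the travel-time (reward) distributions are illustrative numbers I have made up, not values from the figure above, and for simplicity each 15-minute slot is drawn independently; the dependence between consecutive slots is exactly what the joint, conditional and transition probabilities discussed below capture.

```python
import random

# Hypothetical state probabilities for a 15-minute segment (the p, q, r of the example)
STATES = ["clear", "jammed", "blocked"]
STATE_PROBS = [0.5, 0.4, 0.1]

# Hypothetical reward model: a discrete distribution over travel time (minutes) per state.
# The reward itself is random, as discussed above.
TRAVEL_TIMES = {
    "clear":   ([15, 16, 17, 18], [0.4, 0.3, 0.2, 0.1]),
    "jammed":  ([20, 25, 30],     [0.5, 0.3, 0.2]),
    "blocked": ([35, 45],         [0.6, 0.4]),
}

def sample_trajectory(n_steps=3):
    """Sample one realization (trajectory) of the office-drive process.

    Note: each 15-minute slot is drawn independently here, which ignores the
    dependence between consecutive slots that the Markov machinery formalizes.
    """
    trajectory = []
    for t in range(1, n_steps + 1):
        state = random.choices(STATES, weights=STATE_PROBS)[0]
        times, probs = TRAVEL_TIMES[state]
        minutes = random.choices(times, weights=probs)[0]
        trajectory.append((t, state, minutes))
    return trajectory

if __name__ == "__main__":
    random.seed(0)
    for t, state, minutes in sample_trajectory():
        print(f"t={t}: road is {state}, segment took {minutes} minutes")
```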

Realization or trajectory of a stochastic process

A stochastic process can have many outcomes due to its randomness. A single outcome of a stochastic process is variously called a realization, an episode or a trajectory. It is formed by taking a single possible value of the random variable at each time point as the process evolves over time.

So, to continue with our traffic example: a possible trajectory is shown in the image by the arrows. In the first 15 minutes, the motorist meets a clear road, in the next 15 minutes hits a jam and in the final 15 minutes hits a blocked road. So, a potential sample path would be:(s,s’,s’’). This is just one possible realization. The motorist traverses this path every day and is likely to meet a slightly different set of states every day; another day it can be (s’’,s’, s).

Stochastic processes are widely used for probabilistic representations of systems and phenomena that appear to vary in a random manner. To put it more simply, stochastic models are used to estimate the probability of various outcomes of a random variable that also changes over time. Examples include the growth of a bacterial population or the movement of a gas molecule. They are also used extensively in finance, where stochastic models can be used to analyze situations involving uncertainty, such as investment returns, volatile markets, or inflation rates.

Further, there are many different types of stochastic processes, of which the Markov process is one. Others include random walks, Gaussian processes, etc. Each type of stochastic process has its own specific assumptions and features, which are important to understand.

Probability Distribution

Now we turn to the question of how we get the probabilities of the different states. Recall that in the example above, we defined the probabilities of meeting a jam and having a clear road as q and p. Let us elaborate a bit on what this means in terms of probability distributions and how they come about.

The probability of a random outcome (say, the road being jammed) is assessed as the proportion of times the outcome would occur in a very long series of repetitions. So this is how we would estimate p, q and r: by noting the state of the road at the specified time intervals across many days.

Recording all these probabilities of the outcomes of a random variable Sₜ gives us its probability distribution. In the discrete case this is termed the probability mass function (pmf): a function that provides the probability that a discrete random variable equals a particular value. Read more about this here.

To continue with our office commute example: the state variable can take any of these values with a specific probability. For example, our state set for the road consists of 3 possible states: S = {s = clear, s’ = jammed, s’’ = blocked}.

The road can be in any of these 3 states at a specific point of time (say, morning) with some probability: for example, clear = 0.5, jammed = 0.4, and blocked = 0.1 (blocked is a relatively rarer state!). This set of probabilities is the probability distribution of the variable, which here is the state of the road.
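As a rough sketch of how p, q and r could be estimated in practice, we can log the state of the road at the same time slot over many days and use relative frequencies; the day records below are made up for illustration.

```python
from collections import Counter

# Hypothetical log of the road state observed at the same 15-minute slot over 10 days
observations = ["clear", "clear", "jammed", "clear", "blocked",
                "jammed", "clear", "clear", "jammed", "clear"]

counts = Counter(observations)
total = len(observations)

# Empirical probability mass function: relative frequency of each state
pmf = {state: count / total for state, count in counts.items()}
print(pmf)   # {'clear': 0.6, 'jammed': 0.3, 'blocked': 0.1}
```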

Joint Distribution vs Marginal Distribution vs Conditional Distribution

In the case of stochastic processes, since the same random variable is measured over different points of time, it is likely that the state at t, Sₜ, is correlated with the states in previous periods, Sₜ₋₁, Sₜ₋₂. In that case, if we want to understand the probability of Sₜ, it is the joint probability mass function (pmf) that is relevant. The joint pmf allows us to compute probabilities of events involving multiple random variables, taking into account the relationship between them.

Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution over all possible pairs of outputs.

This can be defined more generally in the Markov context as follows.

Suppose we consider a discrete-time, 2-period stochastic process Sₙ, where n = 1, 2. So S₁ and S₂ are two discrete random variables which can take values from the set i ∈ S = {1, 2}. The joint distribution of the process is given, for the time periods and the finite sequence of states (i₁, i₂), by P(S₁ = i₁, S₂ = i₂).

Let us look at our simple road example:

For simplicity, assume that there are only 2 time periods, i.e., Sₜ₋₁ and Sₜ. The state space still has 3 states: {clear (C), jammed (J), blocked (B)}.

The joint probability table would be as follows:

2 period office drive joint probability distribution (Image source: author)

The joint probability is the probability of two events occurring together. The table above provides the different combinations of instances of both events. Notice, since this is a stochastic process, time t-1 always comes first and then t.
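Since the table itself is an image in the original post, here is a hypothetical joint table with made-up numbers, just to show the shape of the object: rows index the state at t-1, columns the state at t, and the nine cells sum to 1.

```python
import numpy as np

states = ["clear", "jammed", "blocked"]

# Hypothetical joint pmf P(S_{t-1} = i, S_t = j); rows = state at t-1, columns = state at t
joint = np.array([
    [0.30, 0.15, 0.05],   # S_{t-1} = clear
    [0.10, 0.20, 0.05],   # S_{t-1} = jammed
    [0.02, 0.05, 0.08],   # S_{t-1} = blocked
])

# A joint pmf must sum to 1 over all cells
assert np.isclose(joint.sum(), 1.0)
print(joint)
```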

Marginal probability distribution

The marginal distribution of a subset of variables gives the probabilities of the various values of those variables without reference to the values of the other variables. The marginal mass function of Sₜ is P(Sₜ = j) = ∑ᵢ P(Sₜ₋₁ = i, Sₜ = j).

The marginal probability is the probability of a single event occurring. In the table, the marginal distribution of Sₜ is the column-wise sum of probabilities: for a given column j, we sum over all the rows i. The joint probability of any particular combination is given by the corresponding cell. In our example, the marginal is the probability of meeting a jam at time t irrespective of the road conditions at t-1, i.e., we sum the probabilities down the column J.
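A small sketch of this marginal computation, reusing the same hypothetical joint table:

```python
import numpy as np

# The same hypothetical joint table: rows = S_{t-1}, columns = S_t
joint = np.array([
    [0.30, 0.15, 0.05],
    [0.10, 0.20, 0.05],
    [0.02, 0.05, 0.08],
])

# Marginal of S_t: for each column j, sum over all values i of S_{t-1}
marginal_t = joint.sum(axis=0)
print(marginal_t)   # [0.42 0.40 0.18] -> P(clear), P(jammed), P(blocked) at time t
```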

Conditional probability

A conditional probability is the probability that a specific event occurs given that another specific event has already occurred: for example, the probability of a jam at time t given that the road was clear at t-1.

Now, in our two-period example, because Sₜ is a stochastic variable that depends on Sₜ₋₁, the joint and conditional probabilities are closely related: by the chain rule, the joint probability is the conditional probability of Sₜ given Sₜ₋₁ multiplied by the marginal probability of Sₜ₋₁.

Mathematically, the conditional distribution of a variable given another variable is the joint distribution of both variables divided by the marginal distribution of the other variable.

Defining this generally: P(X₂ = x₂ | X₁ = x₁) = P(X₁ = x₁, X₂ = x₂) / P(X₁ = x₁).

Now let us generalize this to the case where we have n random variables, X₁, …, Xₙ.

The joint probability distribution of these n variables is P(X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ).

This can be written as a conditional function times a marginal function, based on the chain rule of probability. In terms of notation, Pₓ(x₁) = P(X = x₁). For two random variables X₁, X₂, the joint distribution can be written as P(X₁ = x₁, X₂ = x₂) = P(X₂ = x₂ | X₁ = x₁) · P(X₁ = x₁).

Stated in words: the joint probability distribution of 2 variables X₁, X₂ taking specific values x₁, x₂ can be given by the conditional probability of X₂=x₂ given that X₁=x₁, times the marginal probability of getting X₁=x₁.
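A quick sketch of the conditional distribution and the chain rule on the same hypothetical joint table: dividing the joint by the marginal of Sₜ₋₁ gives P(Sₜ | Sₜ₋₁), and multiplying back recovers the joint.

```python
import numpy as np

# The same hypothetical joint table: rows = S_{t-1} (clear, jammed, blocked), columns = S_t
joint = np.array([
    [0.30, 0.15, 0.05],
    [0.10, 0.20, 0.05],
    [0.02, 0.05, 0.08],
])

marginal_prev = joint.sum(axis=1)             # P(S_{t-1} = i), the row sums
conditional = joint / marginal_prev[:, None]  # P(S_t = j | S_{t-1} = i)

# Chain rule check: conditional times the marginal of the conditioning variable gives back the joint
assert np.allclose(conditional * marginal_prev[:, None], joint)

# For example, P(jammed at t | clear at t-1) = 0.15 / 0.50 = 0.3
print(conditional[0, 1])
```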

Why do we need these distributions — deriving the need for the Markov Property

Joint, conditional and marginal distributions become important in the context of MDPs when we want to determine the probabilities of the state variable taking specific values as it transitions across time periods. Our example is simple and has 3 time periods. But what if there were 100 time periods, and we wanted the probability of the state taking a specific value in the 100th period? This is the underlying probability problem. Let's look at how the general case is solved and then at what the Markov property buys us.

For the more general case, the joint probability of the random variables X₁, …, Xₙ taking specific values x₁, …, xₙ can be written as the conditional probability of Xₙ given the entire history of previous variables, times the joint probability of the first n-1 variables: P(X₁ = x₁, …, Xₙ = xₙ) = P(Xₙ = xₙ | Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁) · P(Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁).

We can open out the expression for the joint probability P(Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁) again into the conditional probability of Xₙ₋₁ given the history (Xₙ₋₂, …, X₁) and the joint probability of (X₁, …, Xₙ₋₂).

This can be unrolled further backwards. We finally get an expression consisting of a sequence of conditional probabilities and the marginal probability of the first variable: P(X₁ = x₁, …, Xₙ = xₙ) = P(X₁ = x₁) · P(X₂ = x₂ | X₁ = x₁) · … · P(Xₙ = xₙ | Xₙ₋₁ = xₙ₋₁, …, X₁ = x₁).

Now let us translate this to the MDP context. So far, this has been couched in terms of any sequence of random variables. However, this result translates easily to stochastic processes and Markov processes. We just need to replace Xᵢ by Sₜ, so that the indexing is in terms of time, t = 1, 2, …, n.

Why am I belaboring the obvious? Because there is a lot of material out there that uses different variables and indices for Markov processes and other sequences of random variables and this can get hugely confusing. I tried to unravel this for myself and hence am writing it down, so that it can help other non-statisticians who are trying to figure out the probability theory behind Markov results.

For analyzing the MDP, t = 1, …, n and the states take values xᵢ ∈ S. Just to be clear, xₙ is the specific value of the state at time period t = n.

Let’s write the joint probability of getting S₁, …, Sₙ as the state variable evolves over t = 1, …, n. Substituting into the general case discussed above, we get the joint probability as a sequence of conditional probabilities and the marginal distribution of S₁: P(S₁ = x₁, …, Sₙ = xₙ) = P(S₁ = x₁) · P(S₂ = x₂ | S₁ = x₁) · … · P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁, …, S₁ = x₁).

To put it in words: the probability of getting states S₁,..,Sₙ with specific values is equal to the conditional probability of the state at time t=n having a value xₙ given the historical values of previous states times the joint distribution of the n-1th period. This unrolls into a sequence of conditional probabilities and the marginal or initial distribution of state at S₁.

Now this is just a stochastic process where the value of the state at time n depends on the entire past history of state values. This is clearly intractable to compute, hence the Markov property.

Markov Property:

The Markov property makes the state Sₙ depend only on the immediately preceding state Sₙ₋₁, rather than on the entire historical backlog of states. In this case our conditional probability for the state Sₙ reduces to P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁, …, S₁ = x₁) = P(Sₙ = xₙ | Sₙ₋₁ = xₙ₋₁).

However, we still have n-1 one-step conditional probabilities to estimate, along with the initial probability distribution, P(S₁ = x₁). This is still somewhat complex, as there are a large number of conditional probabilities to be calculated. A further assumption made to simplify the calculations and make them more tractable is time homogeneity of the transition probabilities.

Time homogeneity of transition probabilities

This basically says that the probability of transitioning from state i in time period n to state j in period n+1 is the same irrespective of the time period: P(Sₙ₊₁ = j | Sₙ = i) = P(S₂ = j | S₁ = i) = pᵢⱼ for all n.

What this means is that the probability of transitioning from one state to the next is fixed, irrespective of time.

To put it in the context of our road example, the probability of going from a clear road to a jammed one is fixed, irrespective of whether we meet these 2 states in the first 2 time periods or the last 2. So the transition probabilities reduce to the following 9 cases, irrespective of time, where C = clear, J = jammed and B = blocked: P(C|C), P(J|C), P(B|C), P(C|J), P(J|J), P(B|J), P(C|B), P(J|B), P(B|B).

As you can see, if the number of states is smaller than the number of time periods, as is usually the case, this significantly reduces the number of probabilities to be calculated.

Transition probabilities

The state transition probability matrix of a Markov chain gives the probabilities of transitioning from one state to another in a single time step.

For the two-period case (t = 1, 2) this reduces the problem to P(S₁ = x₁, S₂ = x₂) = P(S₂ = x₂ | S₁ = x₁) · P(S₁ = x₁).

This breaks up into the following components:

The conditional probability: P(S₂ = x₂ | S₁ = x₁)

The initial distribution or marginal distribution: P(S₁=x₁)

In our road example, these are the initial probabilities of the 3 states: p, q and r.

So, all we need is the one step conditional probabilities and the initial distribution of states to get the joint probability of transitioning to a state.

The transition matrix can be represented as below:

Source: Wikipedia Stochastic Matrix

In the transition matrix the rows represent the current state and the columns the future state. Thus p₁₂ is the probability of transitioning from state 1 in time t to state 2 in t+1.

Also note that the sum of all transition probabilities out of a state has to be one. This is logical, as from any state you must transition to some state (possibly staying in the same one). Mathematically this is written as: ∑ⱼ pᵢⱼ = 1 for every state i.

Why do we sum over ‘j’? Because j indexes the possible states that you can transition to from state ‘i’.
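A minimal sketch of a transition matrix for the road example, with made-up probabilities; the only structural requirement is that each row sums to 1.

```python
import numpy as np

states = ["clear", "jammed", "blocked"]

# Hypothetical one-step transition matrix: rows = current state i, columns = next state j
P = np.array([
    [0.7, 0.2, 0.1],   # from clear
    [0.3, 0.6, 0.1],   # from jammed
    [0.2, 0.3, 0.5],   # from blocked
])

# Every row must sum to 1: from any state you must transition to some state
assert np.allclose(P.sum(axis=1), 1.0)

# p_12 in the article's notation: the probability of moving from state 1 (clear) to state 2 (jammed)
print(P[0, 1])   # 0.2
```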

Markov Chain dynamics

For a Markov chain, we use the word dynamics to describe how the state varies over a short time interval starting from a given initial state. The evolution of a Markov chain is a random process, so we cannot say exactly what sequence of states will follow the initial state. In reinforcement learning we wish to predict future states given the current or initial state. A prediction of the future state Sₙ given S₁ can be calculated easily once we have the transition probability matrix and the initial distribution: the n-step transition probability matrix is found by multiplying the single-step transition matrix by itself n times.

This is the beauty of the Markov system and why it is used in reinforcement learning. Once we have a tractable set of transition probabilities, we can calculate the probability of transitioning from any initial state to state j in n steps.
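A short sketch of these dynamics, assuming the same made-up transition matrix and an initial distribution (p, q, r): the n-step transition matrix is the single-step matrix raised to the n-th power, and multiplying the initial distribution by it gives the distribution over states after n steps.

```python
import numpy as np

# The same hypothetical one-step transition matrix
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.3, 0.5],
])

# Initial distribution over (clear, jammed, blocked), i.e. the p, q, r of the example
initial = np.array([0.5, 0.4, 0.1])

n = 3
P_n = np.linalg.matrix_power(P, n)   # n-step transition matrix: P multiplied by itself n times
dist_after_n = initial @ P_n         # distribution over states after n steps

print(P_n)
print(dist_after_n, dist_after_n.sum())   # the propagated distribution still sums to 1
```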

Now, after this very long digression into the stochastic and probability theory behind MDPs, I will briefly review the MDP equations in Sutton (2017), Ch 3.

Section 3: Understanding Sutton Ch3 MDP Equations Using Probability and Stochastic Concepts

The MDP Model

In this section I briefly define the MDP, giving what is essentially the textbook definition, and then relate the equations of Sutton (2017), Ch 3 to the statistical concepts covered above.

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It proposes that whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.

Sutton ch2, p50

Most of the MDP framework can actually be understood at a high level without detailed knowledge of probability and stochastic theory. However, now that we have exposure to those concepts we can actually appreciate what is going on and its complexity. Let us now look at some of the equations related to the Markov process that are presented in Sutton (2017), ch3.

Markov Property

This is essentially the idea that the future is independent of the past, given the present. It is conceptualized probabilistically as: P(Sₜ₊₁ | Sₜ, Sₜ₋₁, …, S₁) = P(Sₜ₊₁ | Sₜ).

This has already been discussed at length above and is what makes the basic Markov problem tractable. Stochastic processes that satisfy the Markov property are typically much simpler to analyze than models that incorporate the whole history of the process. In practice, it is important to consider whether the Markov property holds. In many scenarios, for example the position of a car, the present location determines where the car can go in future states. But in some cases history may be important: a trading agent, for instance, may have to consider the past trend of a share as well as its present price to decide whether to buy or sell.

A Markov process is a stochastic process with the following properties:

(a.) The number of possible outcomes or states is finite

  • This is there for analytical convenience. Infinite processes can also be handled with some minor manipulation. However, for time dependent processes, we assume time is finite t=0,1,2…,T.

(b.) The outcome at any stage depends only on the outcome of the previous stage — given by Markov property discussed above.

(c.) The transition probabilities are constant over time.

(d.) The system is time-homogeneous (often loosely called stationary): this is not called out separately, because property (c) implies it.

Formal representation of MDP

Formally, given the Markov property, an MDP is defined as a tuple that typically contains 5 elements:

S: Set of states in the environment: this is actually a stochastic process

A: Set of actions possible in each state

R: Reward function in response to actions: This is normally deterministic, but sometimes can be stochastic as well

p : Probability matrix of transitioning from one state to the next — This is a critical part of an MDP. It is defined by the joint and conditional distribution of the state variable.

γ : Discount factor to be applied for rewards in the distant future
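As a rough sketch, the five-element tuple for the road example could be written down in code as below; the action set (drive, stop), the reward values and all probabilities are illustrative assumptions, not something specified in the article.

```python
# A minimal sketch of the 5-element tuple for the road example.
# The action set, reward values and probabilities below are illustrative assumptions.

S = ["clear", "jammed", "blocked"]      # set of states
A = ["drive", "stop"]                   # actions available in each state
R = [15, 16, 17, 18, 20, 25, 30, 45]    # possible rewards (here: travel time in minutes;
                                        # a real formulation would likely use negative time,
                                        # since the agent maximizes reward)
gamma = 0.9                             # discount factor for future rewards

# p(s', r | s, a): the dynamics, shown here for a single (state, action) pair
p = {
    ("jammed", "drive"): {
        ("clear", 15): 0.2,
        ("jammed", 25): 0.5,
        ("jammed", 30): 0.2,
        ("blocked", 45): 0.1,
    },
    # ... entries for the other (state, action) pairs would go here
}
```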

The MDP and agent together thereby give rise to a sequence or trajectory that begins like this: S₀, A₀, R₁, S₁, A₁, R₂, S₂, A₂, R₃, …

This process is a stochastic process and is finite, i.e., has T time steps with a clear terminal state.

Understanding equations

Equation 3.2

Equation 3.2 defines the dynamics function p, a function that takes 4 arguments: p(s′, r | s, a) = Pr{Sₜ = s′, Rₜ = r | Sₜ₋₁ = s, Aₜ₋₁ = a}. It tells us that Rₜ and Sₜ are random variables with discrete probability mass functions that depend on the preceding state and action. For each choice of state and action it specifies a conditional distribution: for specific values of the next state s′ and reward r, it gives the probability of those values occurring at time t, given the state and action at t-1.

Equation 3.3

Equation 3.3 basically tells us that for each value of ‘s’ and ‘a’ there is a joint conditional distribution over next states and rewards, and, being a probability distribution, it must sum to 1: ∑ₛ′ ∑ᵣ p(s′, r | s, a) = 1 for all s ∈ S, a ∈ A(s). The joint distribution gives the probabilities of the different combinations of s′ and r, and the probabilities of all combinations must sum to 1 (refer to Wikipedia).

Let’s go back to our road example. At each state of the road, we have two actions: drive or stop. If we are in the state ‘jammed’ and we choose to drive, then, conditional on this state and action, there is a joint distribution over the next state and reward.
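Here is a hedged sketch of what the dynamics function might look like for the ‘jammed’ state with the two actions; all probabilities are invented, but the sketch shows the property in Equation 3.3: for each fixed (s, a), the probabilities over all (s′, r) pairs sum to 1.

```python
# Hypothetical dynamics p(s', r | s, a) for the 'jammed' state (illustrative numbers).
# Keys are (s, a); values map (s', r) -> probability.
dynamics = {
    ("jammed", "drive"): {
        ("clear", 15): 0.2,
        ("jammed", 25): 0.5,
        ("jammed", 30): 0.2,
        ("blocked", 45): 0.1,
    },
    ("jammed", "stop"): {
        ("jammed", 20): 0.7,
        ("clear", 25): 0.3,
    },
}

# Equation 3.3: for every (s, a), the joint conditional distribution over (s', r) sums to 1
for (s, a), dist in dynamics.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-9
    print(f"p(., . | {s}, {a}) sums to {total}")
```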

State transition probabilities — Equation 3.4

The state transition probabilities are obtained by summing the dynamics function over the possible rewards (note for the probability-challenged: this is like taking a marginal probability distribution over states; refer to Wikipedia): p(s′ | s, a) = Pr{Sₜ = s′ | Sₜ₋₁ = s, Aₜ₋₁ = a} = ∑ᵣ p(s′, r | s, a).

Expected Rewards — Equation 3.5

Here we are again working with Equation 3.2. This time we sum over the next states (note for the probability-challenged: like taking a marginal probability distribution of rewards; refer to Wikipedia). This gives us the marginal probability of each reward (the expression on the extreme right-hand side of 3.5). We then multiply these probabilities by the actual rewards r and sum, to get the expected reward: r(s, a) = E[Rₜ | Sₜ₋₁ = s, Aₜ₋₁ = a] = ∑ᵣ r ∑ₛ′ p(s′, r | s, a).

Expected rewards for state–action–next-state triples — Equation 3.6

This one is a little trickier. It is the joint probability of reward and next state, conditional on ‘s’ and ‘a’, divided by the state transition probability. This is an application of the chain rule of joint distributions (refer to Wikipedia): r(s, a, s′) = E[Rₜ | Sₜ₋₁ = s, Aₜ₋₁ = a, Sₜ = s′] = ∑ᵣ r · p(s′, r | s, a) / p(s′ | s, a).

Let’s substitute our road-example terms and take the expectation over Rₜ; a small numerical sketch follows below.
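This is a small numerical sketch of Equations 3.4 to 3.6, using the same invented dynamics entry for the (‘jammed’, ‘drive’) pair: marginalizing over rewards gives the state-transition probabilities, weighting rewards by their probabilities gives the expected reward, and dividing by the transition probability gives the expected reward for each (s, a, s′) triple.

```python
# Computing Equations 3.4-3.6 from the same invented dynamics entry for ('jammed', 'drive').
dist = {
    ("clear", 15): 0.2,
    ("jammed", 25): 0.5,
    ("jammed", 30): 0.2,
    ("blocked", 45): 0.1,
}

# Eq 3.4: p(s' | s, a) = sum over r of p(s', r | s, a)
p_next = {}
for (s_next, r), prob in dist.items():
    p_next[s_next] = p_next.get(s_next, 0.0) + prob

# Eq 3.5: r(s, a) = sum over r and s' of r * p(s', r | s, a)
expected_reward = sum(r * prob for (_, r), prob in dist.items())

# Eq 3.6: r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)
expected_reward_given_next = {
    s_next: sum(r * prob for (sn, r), prob in dist.items() if sn == s_next) / p_next[s_next]
    for s_next in p_next
}

print(p_next)                      # {'clear': 0.2, 'jammed': 0.7, 'blocked': 0.1}
print(expected_reward)             # 0.2*15 + 0.5*25 + 0.2*30 + 0.1*45 = 26.0
print(expected_reward_given_next)  # e.g. r(s, a, 'jammed') = (0.5*25 + 0.2*30) / 0.7 ≈ 26.43
```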

So, this completes the basic equations for the setup of the MDP. It is important to understand each of 3.2–3.6, as Sutton (2017) mentions that each of these is a way to represent the MDP and will be used again in subsequent chapters.

To Conclude..

Well friends, this has been quite a journey through probability and stochastic theory. I hope what you have come away with is an appreciation of the complexity and beauty of the underlying Markov machinery. Without understanding these concepts, it is difficult to see where the basic MDP equations come from, and without understanding those equations and how they are derived, one is likely to get lost when it comes to policies, value functions and Q-values. These will be picked up in the next installment of this tutorial.

I hope you liked my writing. Please consider following me for more such articles as I proceed with reinforcement learning: https://medium.com/@shaileydash

Also, do let me know your thoughts and suggestions in the comments; they will help me revise this.

You can follow me on Linkedin for more corporate oriented articles on AI and DS: www.linkedin.com/in/shailey-dash-phd-8b4a286

References

I am listing quite a long list of stats and probability theory resources. They are all excellent, if difficult.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction; 2nd Edition. 2017.

https://en.wikipedia.org/wiki/Markov_decision_process

https://en.wikipedia.org/wiki/Stochastic_process

https://en.wikipedia.org/wiki/Joint_probability_distribution

https://en.wikipedia.org/wiki/Conditional_probability_distribution

https://www.kent.ac.uk/smsas/personal/lb209/files/notes1.pdf

Lawler, G.F. (2006). Introduction to Stochastic Processes (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781315273600

Yates, R.D. and Goodman, D.J., 2014. Probability and stochastic processes: a friendly introduction for electrical and computer engineers. John Wiley & Sons.

Unnikrishna Pillai, Probability and Stochastic Processes, Lecture 8, NYU, Tandon school of Engineering, https://www.youtube.com/watch?v=nFF6eRQT2-c&t=455s
