Foundational RL: Solving Markov Decision Process | by Rahul Bhadani | Dec, 2022


Cover photo generated by the author using the AI tool Dreamstudio (licensed under CC0: https://creativecommons.org/publicdomain/zero/1.0/)

In the first part, I discussed some basic concepts to establish a foundation for reinforcement learning (RL), such as Markov states, the Markov chain, and the Markov decision process (MDP). Reinforcement learning problems are built on top of MDPs.

An MDP is a 4-tuple model (𝓢, 𝓐, 𝓟, 𝓡) where s ∈ 𝓢 is a state, a ∈ 𝓐 is an action taken while the agent is in state s, 𝓟(s’ | s, a) is the transition probability of moving to state s’ from s under the influence of action a (or, more generally, a conditional probability density function), and r(s, a) ∈ 𝓡 is the reward function.

Policy function: The policy function, usually denoted by π in the RL literature, specifies the mapping from the state space 𝓢 to the action space 𝓐.

The goal of solving an MDP is to find an optimal policy that maximizes the long-term reward.

An MDP requires a notion of discrete time t; thus, an MDP is defined as a discrete-time stochastic control process. In the context of RL, each MDP is parametrized by a discount factor γ (gamma), which determines the importance of future rewards relative to current rewards. In other words, it is a measure of how much value future rewards have for the agent compared to rewards it receives in the present. The discount factor is a value between 0 and 1: a value of 0 means the agent cares only about immediate rewards and completely ignores future rewards, while a value of 1 means the agent weighs future rewards equally with those it receives in the present.

For instance, in a situation where the agent is deciding between two actions, A and B, that will lead to different sequences of future rewards, the discount factor can be used to determine the relative value of these different sequences of rewards. If the discount factor is low, then the agent will be more likely to choose the action that leads to immediate rewards, while if the discount factor is high, the agent will be more likely to choose the action that leads to a larger sum of future rewards.

In general, the discount factor is an important consideration in MDPs because it allows the agent to trade off short-term rewards for long-term rewards, and to balance the need for immediate gratification with the potential benefits of delayed gratification.

Thus, an MDP in the context of RL is the 5-tuple (𝓢, 𝓐, 𝓟, 𝓡, γ).
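To make the notation concrete, one way to hold such a 5-tuple in code is a simple container. This is only an illustrative sketch of mine; the names MDPSpec, transition, and so on are not from the article:

from collections import namedtuple

# A lightweight container for the 5-tuple (S, A, P, R, gamma); illustrative only
MDPSpec = namedtuple("MDPSpec", ["states", "actions", "transition", "reward", "gamma"])

spec = MDPSpec(
    states=["s0", "s1"],
    actions=["a0", "a1"],
    transition={("s0", "a0"): {"s0": 0.2, "s1": 0.8}},  # P(s' | s, a)
    reward={("s0", "a0"): 1.0},                         # r(s, a)
    gamma=0.9,
)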

Considering the discount factor γ, the reward at time step t can be written as

r(t) = γ^t · r(s(t), a(t))

Equation 1. The reward at time step t.

and the total reward, i.e., the cumulative reward over all time, is

R = Σ γ^t · r(s(t), a(t)), with the sum running over t = 0, 1, 2, …

Equation 2. Cumulative Reward.

The policy π that maximizes the cumulative reward is the solution to our MDP problem.
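As a quick sketch (my own, not part of the original article) of how Equations 1 and 2 play out numerically, the discounted cumulative reward of a finite reward sequence can be computed as:

def discounted_return(rewards, gamma):
    # Equation 2 truncated to a finite horizon: sum of gamma**t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A delayed reward of 10 is worth little to a short-sighted agent (gamma = 0.1) ...
print(discounted_return([0, 0, 10], gamma=0.1))  # ≈ 0.1
# ... but much more to a far-sighted agent (gamma = 0.9)
print(discounted_return([0, 0, 10], gamma=0.9))  # ≈ 8.1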

Let’s see an example (autonomous driving agent):

In the context of autonomous driving, a Markov decision process can be used to model the decision-making process of the vehicle in order to minimize fuel consumption over the long run. In this scenario, the state of the system can be represented by the current velocity and acceleration of the vehicle, and the objective is to find the optimal sequence of actions that will minimize fuel consumption.

The instantaneous fuel consumption of the vehicle can be modeled as a function g(v, a) where v is the current velocity and a is the current acceleration. This function can be used to evaluate the cost of each possible action at a given state, and the Markov decision process can be used to find the optimal sequence of actions that will minimize the long-term fuel consumption of the vehicle.

For example, at a given state where the current velocity is v and the current acceleration is a, the MDP can consider a range of possible actions such as accelerating, decelerating, or maintaining the current velocity. The cost of each of these actions can be evaluated using the function g(v, a) and the optimal action can be selected based on which action will result in the lowest fuel consumption over the long run. This process can then be repeated at each subsequent state in order to find the optimal sequence of actions that will minimize fuel consumption.

One important point to highlight is that I am not considering the transition probability 𝓟 in the above example, to keep the example digestible yet somewhat realistic.

Some Python coding:

Consider an MDP whose states are velocity and acceleration, where velocity ranges from 0 m/s to 50 m/s and acceleration ranges from -4.5 m/s² to 3.0 m/s², each quantized in steps of 0.1. We can define an MDP class with the following constructor:

def __init__(self, velocity_min, velocity_max, acceleration_min, acceleration_max, velocity_step, acceleration_step, acceleration_min_accelerate, acceleration_max_accelerate):
    # Define minimum and maximum values for velocity and acceleration
    self.VELOCITY_MIN = velocity_min
    self.VELOCITY_MAX = velocity_max
    self.ACCELERATION_MIN = acceleration_min
    self.ACCELERATION_MAX = acceleration_max

    # Define quantization step for velocity and acceleration
    self.VELOCITY_STEP = velocity_step
    self.ACCELERATION_STEP = acceleration_step

    # Define minimum and maximum values for acceleration when accelerating or decelerating
    self.ACCELERATION_MIN_ACCELERATE = acceleration_min_accelerate
    self.ACCELERATION_MAX_ACCELERATE = acceleration_max_accelerate

    # Calculate number of possible values for velocity and acceleration
    self.num_velocity_values = int((self.VELOCITY_MAX - self.VELOCITY_MIN) / self.VELOCITY_STEP) + 1
    self.num_acceleration_values = int((self.ACCELERATION_MAX - self.ACCELERATION_MIN) / self.ACCELERATION_STEP) + 1

    # Create list of possible values for velocity and acceleration
    self.velocity_values = [self.VELOCITY_MIN + i * self.VELOCITY_STEP for i in range(self.num_velocity_values)]
    self.acceleration_values = [self.ACCELERATION_MIN + i * self.ACCELERATION_STEP for i in range(self.num_acceleration_values)]
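With the ranges mentioned above (0 to 50 m/s and -4.5 to 3.0 m/s², both in steps of 0.1), this constructor yields 501 velocity values and 76 acceleration values. A quick sanity check; note that the last two constructor arguments, the accelerate/decelerate magnitude limits, are assumed values for illustration:

mdp = MDP(0.0, 50.0, -4.5, 3.0, 0.1, 0.1, 0.5, 3.0)
print(len(mdp.velocity_values))      # 501
print(len(mdp.acceleration_values))  # 76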

The available actions are Accelerate, Decelerate, or Maintain the current speed of the vehicle.

# Function to calculate available actions in a given state
def calculate_actions(self, v, a):
    # Initialize list of available actions
    actions = []

    # If current velocity is less than maximum, add option to accelerate
    if v < self.VELOCITY_MAX:
        for a_new in self.acceleration_values:
            if self.ACCELERATION_MIN_ACCELERATE <= a_new <= self.ACCELERATION_MAX_ACCELERATE:
                actions.append((v, a_new))

    # If current velocity is greater than minimum, add option to decelerate
    if v > self.VELOCITY_MIN:
        for a_new in self.acceleration_values:
            if -self.ACCELERATION_MAX_ACCELERATE <= a_new <= -self.ACCELERATION_MIN_ACCELERATE:
                actions.append((v, a_new))

    # Add option to maintain current velocity and acceleration
    actions.append((v, a))

    return actions

Next, we can define a function to calculate expected fuel consumption:

# Function to evaluate the expected fuel consumption for a given state and action
def evaluate_fuel_consumption(self, v, a, v_new, a_new):
    # Calculate expected fuel consumption for current state and action
    fuel_current = self.fuel_consumption(v, a)
    fuel_new = self.fuel_consumption(v_new, a_new)
    return fuel_current + fuel_new
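The method above relies on self.fuel_consumption, which is not shown in this excerpt (it lives in the full gist). A minimal placeholder for the instantaneous fuel model g(v, a), with made-up coefficients purely for illustration, could be:

# Placeholder for the instantaneous fuel model g(v, a); coefficients are illustrative
def fuel_consumption(self, v, a):
    idle = 0.5                                  # baseline fuel use at rest
    speed_term = 0.01 * v + 0.0005 * v ** 2     # fuel use growing with speed
    accel_term = 0.2 * max(a, 0.0) ** 2         # extra fuel for positive acceleration
    return idle + speed_term + accel_term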

A naive way of calculating optimal policy can be sweeping through the entire state space:

# Function to find the optimal action in a given state, based on minimizing expected fuel consumption
def find_optimal_action(self, v, a):
    # Calculate available actions in current state
    actions = self.calculate_actions(v, a)

    # Initialize minimum expected fuel consumption
    min_fuel = float("inf")

    # Initialize optimal action
    optimal_action = None

    # Iterate over available actions and find action with minimum expected fuel consumption
    for v_new, a_new in actions:
        fuel = self.evaluate_fuel_consumption(v, a, v_new, a_new)
        if fuel < min_fuel:
            min_fuel = fuel
            optimal_action = (v_new, a_new)

    return optimal_action

# Function to calculate the optimal policy for the MDP
def calculate_optimal_policy(self):
    # Initialize dictionary to store optimal policy
    optimal_policy = {}

    # Iterate over all possible states and calculate optimal action for each state
    for v in self.velocity_values:
        for a in self.acceleration_values:
            optimal_policy[(v, a)] = self.find_optimal_action(v, a)

    return optimal_policy

A test implementation of the above code snippets, taken together, is:

# Create MDP instance
mdp = MDP(VELOCITY_MIN, VELOCITY_MAX, ACCELERATION_MIN, ACCELERATION_MAX, VELOCITY_STEP, ACCELERATION_STEP, ACCELERATION_MIN_ACCELERATE, ACCELERATION_MAX_ACCELERATE)

# Calculate optimal policy for the MDP
optimal_policy = mdp.calculate_optimal_policy()

# Print optimal policy for the first few states
for i in range(10):
    for j in range(10):
        print(optimal_policy[(mdp.velocity_values[i], mdp.acceleration_values[j])])
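The constants passed to the constructor are assumed to be defined beforehand; for the ranges described earlier they might look like the following (the accelerate/decelerate magnitude limits of 0.5 and 3.0 m/s² are my own illustrative choices, not values stated in the article):

# Velocity range and quantization step (m/s)
VELOCITY_MIN = 0.0
VELOCITY_MAX = 50.0
VELOCITY_STEP = 0.1

# Acceleration range and quantization step (m/s^2)
ACCELERATION_MIN = -4.5
ACCELERATION_MAX = 3.0
ACCELERATION_STEP = 0.1

# Magnitude limits used when accelerating or decelerating (assumed values)
ACCELERATION_MIN_ACCELERATE = 0.5
ACCELERATION_MAX_ACCELERATE = 3.0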

A complete code of the above example can be downloaded from https://gist.github.com/rahulbhadani/92d3be52529a64372c796ca5e7cb3770.

Now, we may ask a question: is the above implementation efficient?

We clearly see that the above implementation sweeps through the entire state space and is not very efficient. To improve the efficiency of this implementation, we can use dynamic programming to store the optimal action for each state, and then use the stored values to calculate the optimal policy for the entire MDP without needing to iterate over all possible states. This would allow the optimal policy to be calculated more efficiently, by taking advantage of the fact that the optimal policy for any given state depends only on the optimal policies for the states that can be reached from the current state.

Another option could be to use function approximation to approximate the optimal policy, which can be more efficient than explicitly calculating the optimal policy for all possible states. This can be done by training a model such as a neural network (deep RL) on a representative sample of the state space and then using the trained model to predict the optimal action for any given state.
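As a rough sketch of the function-approximation idea (my own illustration using scikit-learn, not something from the article or the gist), one could fit a small regressor that maps a state to the acceleration chosen by the tabular policy:

import random
from sklearn.neural_network import MLPRegressor

# Sample a subset of states and record the acceleration picked by the tabular policy
# (assumes `mdp` is the MDP instance created earlier, with a fuel model defined)
states, targets = [], []
for _ in range(2000):
    v = random.choice(mdp.velocity_values)
    a = random.choice(mdp.acceleration_values)
    _, a_opt = mdp.find_optimal_action(v, a)
    states.append([v, a])
    targets.append(a_opt)

# Fit a small neural network and use it to predict an action for an unseen state
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500)
model.fit(states, targets)
print(model.predict([[12.34, 0.05]]))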

At this point, we should be comfortable discussing Bellman Equation, Dynamic Programming, and Q-function.

In the context of the MDP defined above, where the objective is to minimize fuel consumption over the long run, the Bellman equation plays an important role in determining the optimal policy. The Bellman equation provides a recursive relationship between the value of a state and the values of the states that can be reached from it, which can be used to determine the optimal policy by finding the action that leads to the state with the highest value.

This recursive relationship is solved using iterative algorithms such as dynamic programming and linear programming. Their variations have led to many kinds of RL training algorithms.

The Bellman equations for state values and state–action values are called the value function and the Q-function, respectively.

Value Function

In this MDP, the value of a state is defined as the expected fuel consumption over the long run, starting from the current state. We call this the value function.

Mathematically, the value function can be written as

V(s(t)) = max over a(t) of [ r(s(t), a(t)) + γ · Σ over s(t+1) of P(s(t+1) | s(t), a(t)) · V(s(t+1)) ]

Equation 3. Value Function.

where P(s(t+1) | s(t), a(t)) is the transition probability of going from state s(t) to s(t+1) under the influence of action a(t), and the sum runs over all possible next states s(t+1). Equation 3 is a modified form of the value function given in [1].

The value function can be used to calculate the value of a state by summing the expected fuel consumption in the current state with the expected fuel consumption in the next state, where the expected fuel consumption in the next state is calculated by taking the maximum over all possible actions that can be taken in the current state. This process can be repeated recursively to calculate the value of each state, starting from the initial state and working backward to the final state.

Once the values of all states have been calculated using the Bellman equation, the optimal policy can be determined by finding the action that leads to the state with the highest value for each state. This optimal policy can then be used to determine the optimal actions to take in each state, in order to minimize fuel consumption over the long run.

Q-Function

In the context of the MDP defined above, where the objective is to minimize fuel consumption over the long run, the Q-function is a function that maps each state and action pair to a real number, representing the expected fuel consumption over the long run starting from that state and taking that action.

Mathematically, the Q-function can be written as

Q(s(t), a(t)) = r(s(t), a(t)) + γ · Σ over s(t+1) of P(s(t+1) | s(t), a(t)) · max over a(t+1) of Q(s(t+1), a(t+1))

Equation 4. Q-function

which is a modified form of a Q-function used in [1].

The Q-function can be calculated using the Bellman equation, which provides a recursive relationship between the value of a state-action pair and the values of the state-action pairs that can be reached from it. The Q-function is calculated by summing the expected fuel consumption in the current state with the expected fuel consumption in the next state, where the latter is obtained by taking the maximum over all possible actions that can be taken in the next state.

Once the values of all state-action pairs have been calculated using the Bellman equation, the optimal policy can be determined by finding the action that maximizes the Q function for each state. This optimal policy can then be used to determine the optimal actions to take in each state, in order to minimize fuel consumption over the long run.

Let’s see how they look for the example presented in this article.


class MDP:
    # ...

    # Function to calculate the value function for the MDP
    def calculate_value_function(self):
        # Initialize dictionary to store values of each state
        values = {}

        # Iterate over all possible states and calculate value of each state
        for v in self.velocity_values:
            for a in self.acceleration_values:
                values[(v, a)] = self.evaluate_value(v, a, values)

        return values

    # Function to evaluate the value of a state using the Bellman equation
    def evaluate_value(self, v, a, values):
        # Check if value of current state has already been calculated
        if (v, a) in values:
            return values[(v, a)]

        # Calculate available actions in current state
        actions = self.calculate_actions(v, a)

        # Initialize maximum expected fuel consumption
        max_fuel = float("-inf")

        # Iterate over available actions and find action with maximum expected fuel consumption
        for v_new, a_new in actions:
            fuel = self.evaluate_fuel_consumption(v, a, v_new, a_new)
            if fuel > max_fuel:
                max_fuel = fuel

        # Return maximum expected fuel consumption (note: as written, this is a
        # one-step lookahead; it does not recurse into the values of next states)
        return max_fuel
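A brief usage sketch, assuming the class also contains the methods and the placeholder fuel model shown earlier, so the printed number is purely illustrative:

# Compute the value of every state and inspect one entry
values = mdp.calculate_value_function()
print(values[(mdp.velocity_values[0], mdp.acceleration_values[0])])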

class MDP:
    # ...

    # Function to calculate the Q-function for the MDP
    def calculate_q_function(self):
        # Initialize dictionary to store values of each state-action pair
        q_values = {}

        # Iterate over all possible states and actions
        for v in self.velocity_values:
            for a in self.acceleration_values:
                for v_new, a_new in self.calculate_actions(v, a):
                    q_values[((v, a), (v_new, a_new))] = self.evaluate_q_value(v, a, v_new, a_new, q_values)

        return q_values

    # Function to evaluate the Q-value of a state-action pair using the Bellman equation
    def evaluate_q_value(self, v, a, v_new, a_new, q_values):
        # Check if Q-value of current state-action pair has already been calculated
        if ((v, a), (v_new, a_new)) in q_values:
            return q_values[((v, a), (v_new, a_new))]

        # Calculate expected fuel consumption in current state
        fuel = self.evaluate_fuel_consumption(v, a, v_new, a_new)

        # Store a provisional value before recursing so that the recursion through the
        # "maintain" action (which revisits the same state-action pair) terminates;
        # for large state spaces the recursion can still be deep, so an iterative
        # implementation is preferable in practice
        q_values[((v, a), (v_new, a_new))] = fuel

        # Calculate expected fuel consumption in next state by taking maximum over all possible actions
        max_fuel = float("-inf")
        for v_next, a_next in self.calculate_actions(v_new, a_new):
            fuel_next = self.evaluate_q_value(v_new, a_new, v_next, a_next, q_values)
            if fuel_next > max_fuel:
                max_fuel = fuel_next

        # Return expected fuel consumption in current state plus expected fuel consumption in next state
        q_values[((v, a), (v_new, a_new))] = fuel + max_fuel
        return q_values[((v, a), (v_new, a_new))]

Of course, the example misses the notion of transition probability, and hence the value function and Q-function used in this fuel-optimization example are a lot simpler than those needed for real-world practical problems.
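For completeness, here is a rough sketch (my own, not part of the article or the gist) of how a value-function computation looks once a transition model P(s' | s, a) and a discount factor γ are included, using generic array-based states and actions:

import numpy as np

def value_iteration(transitions, rewards, gamma=0.95, tol=1e-6):
    # transitions: array of shape (S, A, S) holding P(s' | s, a)
    # rewards:     array of shape (S, A) holding r(s, a)
    num_states, num_actions, _ = transitions.shape
    values = np.zeros(num_states)
    while True:
        # Bellman backup: Q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        q = rewards + gamma * (transitions @ values)
        new_values = q.max(axis=1)   # greedy over actions (Equation 3)
        if np.max(np.abs(new_values - values)) < tol:
            break
        values = new_values
    policy = q.argmax(axis=1)        # action achieving the maximum Q (Equation 4)
    return values, policy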

