
Foundational RL: Markov States, Markov Chain, and Markov Decision Process | by Rahul Bhadani | Dec, 2022



Cover photo generated by the author using the AI tool Dreamstudio (licensed under https://creativecommons.org/publicdomain/zero/1.0/)

Reinforcement learning (RL) is a type of machine learning in which an agent learns to interact with its environment by trial and error in order to maximize a reward. It is different from supervised learning, in which an agent is trained on labeled examples, and unsupervised learning, in which an agent learns to identify patterns in unlabeled data. In reinforcement learning, the agent learns to take actions in an environment in order to maximize a reward, such as earning points or winning a game.

Reinforcement learning is useful for a wide range of applications, including robotics, natural language processing, and gaming.

In this article, I will build some foundational concepts to understand reinforcement learning.

In RL, we have an agent that we train using some algorithm to take certain actions that maximize reward in order to reach the end goal. The end goal might be very far in the future or continuously change (as in the case of autonomous navigation).

In reinforcement learning, a state refers to the current situation or environment that the agent is in. It is a representation of the information that the agent has about its environment at a given point in time. For example, the position and velocity of an autonomous vehicle can be a state in an RL problem. The agent uses state information to decide what action to take at the next time step in order to maximize the reward.
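For instance, a minimal encoding of such a state for an autonomous vehicle might look like the sketch below; the class and field names are my own illustrative choices, not taken from any particular library.

from dataclasses import dataclass

# a simple state for an autonomous vehicle: everything the agent
# knows about its situation at one point in time
@dataclass
class VehicleState:
    position: float  # position along the road, in meters
    velocity: float  # velocity, in meters per second

state = VehicleState(position=12.5, velocity=3.0)
print(state)  # prints VehicleState(position=12.5, velocity=3.0)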

In RL, we care about Markov states: a state is Markov if the distribution of future states depends only on the current state, not on the history that led to it. This means the agent does not need to remember its entire history of interactions with the environment in order to make decisions; it can simply look at the current state and act on that. This makes learning more efficient, because the agent does not have to store and process a large amount of information, and it makes the agent's behavior more predictable, because the behavior is determined solely by the current state. This is useful in many applications, such as robotics and control systems.

We can encode the Markov state of a vehicle as follows:

# define the states of the vehicle
STOPPED = 0
MOVING_FORWARD = 1
MOVING_BACKWARD = 2

# define the actions of the vehicle
STOP = 0
MOVE_FORWARD = 1
MOVE_BACKWARD = 2

# define the Markov state of the vehicle
class VehicleMarkovState:
    def __init__(self, state, action):
        self.state = state
        self.action = action

# define a function to encode the Markov state of the vehicle
def encode_markov_state(vehicle_state, vehicle_action):
    return VehicleMarkovState(vehicle_state, vehicle_action)

# example: encode the Markov state of a vehicle that is moving forward
markov_state = encode_markov_state(MOVING_FORWARD, MOVE_FORWARD)
print(markov_state.state) # prints 1 (MOVING_FORWARD)
print(markov_state.action) # prints 1 (MOVE_FORWARD)

A Markov chain is a finite state machine in which each state is a Markov state. It consists of a set of states together with transition probabilities for moving from one state to another. In a Markov chain, the probability of transitioning to a particular state depends only on the current state (and the time elapsed), not on the states visited before it.

A Markov chain differs from a general stochastic process in that, in a general stochastic process, what happens next can depend on the entire earlier history, not just on the immediate past (the current state).

Let’s consider an example:

Figure 1. A Markov Chain, Picture by the Author

We have two Markov states, A and B. The transition probability from A to B is 0.7, from B to A is 0.9, from B to B is 0.1, and from A to A is 0.3. The idea is depicted in Figure 1. We can encode this in Python as follows:

# define the states of the Markov chain
A = 0
B = 1

# define the transition probabilities
transition_probs = [[0.3, 0.7],  # transition probabilities from A
                    [0.9, 0.1]]  # transition probabilities from B

# define a class to represent the Markov chain
class MarkovChain:
    def __init__(self, states, transition_probs):
        self.states = states
        self.transition_probs = transition_probs

# define a function to encode the Markov chain
def encode_markov_chain(markov_states, markov_transition_probs):
    return MarkovChain(markov_states, markov_transition_probs)

# example: encode the Markov chain
markov_chain = encode_markov_chain([A, B], transition_probs)
print(markov_chain.states) # prints [0, 1]
print(markov_chain.transition_probs) # prints [[0.3, 0.7], [0.9, 0.1]]
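To see the Markov property at work, the following sketch samples a short trajectory from the chain above: the next state is drawn using only the current state's row of the transition matrix. The function name simulate_chain and the use of random.choices are my own illustrative choices.

import random

# sample a trajectory: each step looks only at the current state's row
def simulate_chain(transition_probs, start_state, num_steps):
    state = start_state
    trajectory = [state]
    for _ in range(num_steps):
        probs = transition_probs[state]
        state = random.choices(range(len(probs)), weights=probs)[0]
        trajectory.append(state)
    return trajectory

# example: a 10-step trajectory starting from state A
print(simulate_chain(transition_probs, A, 10))  # e.g. [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]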

A Markov Decision Process, or MDP, is an extension of the Markov chain. In an MDP, the transition from one Markov state to another depends on an action a, and each transition yields a corresponding reward. An MDP is a 4-tuple (𝓢, 𝓐, 𝓟, 𝓡), where s ∈ 𝓢 is a state, a ∈ 𝓐 is an action taken while the agent is in state s, 𝓟(s’ | s, a) is the transition probability of moving to state s’ from s under action a (or, more generally, a conditional probability density function), similar to transition_probs in the code snippet above, and r(s, a) ∈ 𝓡 is the reward function.
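To make the 4-tuple concrete, here is a minimal sketch of how a small MDP could be represented in Python, in the same spirit as the snippets above. The two-state, two-action setup and the names P and R are my own illustrative choices, not a standard API.

# states
S0, S1 = 0, 1
# actions
STAY, GO = 0, 1

# P[(s, a)] is the distribution over next states s' given state s and action a
P = {
    (S0, STAY): [1.0, 0.0],
    (S0, GO):   [0.2, 0.8],
    (S1, STAY): [0.0, 1.0],
    (S1, GO):   [0.9, 0.1],
}

# R[(s, a)] is the immediate reward r(s, a)
R = {
    (S0, STAY): 0.0,
    (S0, GO):   1.0,
    (S1, STAY): 0.5,
    (S1, GO):   -1.0,
}

# the 4-tuple (S, A, P, R)
mdp = ([S0, S1], [STAY, GO], P, R)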

Policy function: The policy function, usually denoted by π in the RL literature, specifies the mapping from the state space 𝓢 to the action space 𝓐.
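Continuing the sketch above, a simple deterministic policy can be encoded as a mapping from each state to an action; the dictionary representation is just one illustrative choice.

# a deterministic policy π: state -> action
policy = {
    S0: GO,
    S1: STAY,
}

print(policy[S0])  # prints 1 (GO)
print(policy[S1])  # prints 0 (STAY)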

An MDP can be used to model the decision-making process of a self-driving car. In this scenario, the states of the MDP might represent the positions and velocities of the car and of other objects in the environment, such as other cars and obstacles. The actions might represent the maneuvers the car can take, such as accelerating, braking, or turning. The rewards might represent the value or utility of different outcomes, such as avoiding collisions or arriving at the destination quickly. Using the MDP, the self-driving car can learn to take actions that maximize its cumulative reward.
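Tying the pieces together, the loop below sketches how an agent might step through the toy MDP defined above while following the policy, accumulating reward along the way. This is an illustrative rollout only, not a self-driving implementation.

import random

# roll out the policy in the MDP for a fixed number of steps
def rollout(mdp, policy, start_state, num_steps):
    states, actions, P, R = mdp
    state = start_state
    total_reward = 0.0
    for _ in range(num_steps):
        action = policy[state]              # choose action from the policy
        total_reward += R[(state, action)]  # collect the reward r(s, a)
        probs = P[(state, action)]          # transition distribution P(s' | s, a)
        state = random.choices(states, weights=probs)[0]
    return total_reward

print(rollout(mdp, policy, S0, 20))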

