
## Implementation of the Q-Learning algorithm, and application to OpenAI Gym’s Taxi-v3 environment

The first article in this series introduced the basic concepts and components of any Reinforcement Learning system and explained the theory behind the Q-Learning algorithm. The goal of this article is to implement that algorithm in Python 3 and apply it to a real training environment. Everything covered in the first article (Applied Reinforcement Learning I: Q-Learning) is assumed knowledge here, so readers who are not familiar with those concepts are encouraged to read it first.

To keep this article didactic, a simple environment has been chosen that does not add too much complexity to the training, so that the behavior of the Q-Learning algorithm can be fully appreciated. The environment is OpenAI Gym’s Taxi-v3, a grid world in which the agent is a taxi driver who must pick up a passenger and drop him off at his destination.

## Actions

As for the action space, six discrete actions are available for the agent to interact with the environment: move south, move north, move east, move west, pick up the passenger, and drop the passenger off. These actions are encoded as the integers 0 to 5 for ease of programming. The correspondences between actions and numbers are shown in Figure 1.
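As a sketch, this action encoding can be captured in a plain dictionary. The action names below follow the conventional Taxi-v3 ordering and should be double-checked against Figure 1:

```python
# Hypothetical mapping of Taxi-v3's six discrete actions to the
# integer codes that env.step() expects.
ACTIONS = {
    0: "move south",
    1: "move north",
    2: "move east",
    3: "move west",
    4: "pick up passenger",
    5: "drop off passenger",
}
```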

## States

The discrete state space is much larger, as each state is represented as a tuple containing the agent/taxi driver’s position on the grid, the location of the passenger to be picked up, and the passenger’s destination. Since the map is a 2-dimensional grid, the agent’s position can be fully represented by its x and y coordinates, so the tuple representing the agent’s state is: (pos_agent_x, pos_agent_y, passenger_location, destination). As in the case of actions, the tuples representing states are encoded to integers between 0 and 499, for a total of 500 possible states (5 × 5 taxi positions × 5 passenger locations, counting “in the taxi”, × 4 destinations). Figure 3 shows the correspondence between the location/destination of the agent and the integer representing that position/destination, which is used in the tuple describing the state.
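The tuple-to-integer encoding can be sketched as a simple mixed-radix computation. The function below mirrors the scheme Taxi-v3 uses internally; treat it as an assumption to verify against the Gym source:

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Encode a Taxi-v3 state tuple as an integer in [0, 499].

    5 rows x 5 cols x 5 passenger locations (4 depots + "in taxi")
    x 4 destinations = 500 states.
    """
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
```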

To visualize how the possible states are represented as tuples, two example states for the agent are shown in Figure 3.

## Rewards

As for the rewards received for each step performed by the agent, these will be:

• +20 for successfully delivering the passenger (terminal state)
• -10 for executing pickup or drop-off actions illegally
• -1 for every step that does not trigger another reward
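The reward scheme above can be summarized in a small helper. This is a hypothetical illustration only; the real environment computes the reward internally and returns it from env.step():

```python
def step_reward(delivered: bool, illegal_pickup_or_dropoff: bool) -> int:
    # Reproduces Taxi-v3's reward scheme for a single step.
    if delivered:
        return 20   # passenger dropped off at the destination (terminal state)
    if illegal_pickup_or_dropoff:
        return -10  # pickup/drop-off attempted at the wrong location
    return -1       # default time penalty for every other step
```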

Now that the environment is well understood, the implementation of the Q-Learning algorithm can proceed. As in the first article, the pseudocode from Sutton and Barto’s book will be used as a reference to support the implementation.

Q-Learning pseudocode. Extracted from Sutton and Barto, “Reinforcement Learning: An Introduction”
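The core of that pseudocode is the one-step temporal-difference update, Q(s,a) ← Q(s,a) + α[r + γ maxₐ′ Q(s′,a′) − Q(s,a)]. A minimal sketch in Python, assuming a NumPy Q-Table indexed by integer states and actions (the alpha and gamma values are illustrative, not prescribed by the book):

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # One step of the tabular Q-Learning update from Sutton & Barto:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```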

## 1) Initialize Q-Table

The Q-Table is initialized as an m×n matrix with all its values set to zero, where m is the size of the state space and n is the size of the action space.

Rewards-Episode plots for two trainings with different hyperparameters. Image by author
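For Taxi-v3, with its 500 states and 6 actions, this initialization can be sketched with NumPy as:

```python
import numpy as np

n_states, n_actions = 500, 6     # Taxi-v3's state and action space sizes
Q = np.zeros((n_states, n_actions))  # one row per state, one column per action
```

In practice these sizes would be read from the environment itself (env.observation_space.n and env.action_space.n) rather than hard-coded.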
