
Applied Reinforcement Learning II: Implementation of Q-Learning | by Javier Martínez Ojeda | Oct, 2022



Implementation of the Q-Learning algorithm, and application to OpenAI Gym’s Taxi-v3 environment

Photo by Richard Bell on Unsplash

The first article in this series introduced the basic concepts and components of any Reinforcement Learning system, and explained the theory behind the Q-Learning algorithm. In this article the goal is to implement the algorithm in Python 3 and apply it to a real training environment. All the concepts covered in the first article (Applied Reinforcement Learning I: Q-Learning) are used here and assumed to be understood, so if you are not familiar with them, or have not read that article, it is recommended to read it first.

To keep this article didactic, a simple environment has been chosen that does not add unnecessary complexity to the training, so that the behavior of the Q-Learning algorithm can be fully appreciated. The environment is OpenAI Gym’s Taxi-v3 [1], a grid world in which the agent is a taxi driver who must pick up a passenger and drop them off at their destination.

Actions

As for the action space, the agent can interact with the environment through the following discrete actions: move south, move north, move east, move west, pick up the passenger and drop off the passenger. This makes a total of 6 possible actions, which are encoded as integers from 0 to 5 for ease of programming. The correspondences between actions and numbers are shown in Figure 1.

Figure 1. Action — Number mappings. Image by author

States

The discrete state space is much larger, as each state is represented as a tuple containing the agent/taxi driver’s position on the grid, the location of the passenger to be picked up, and the passenger’s destination. Since the map is a 2-dimensional grid, the agent’s position can be fully represented by its x and y coordinates, so the tuple representing the state is: (pos_agent_x, pos_agent_y, passenger_location, destination). As with the actions, the tuples representing states are encoded as integers between 0 and 499 (500 possible states). Figure 2 shows the correspondence between the passenger’s location/destination and the integer representing that location/destination, which is used in the tuple describing the state.

Figure 2. Passenger location/destination — Number mappings. Image by author

To visualize how the possible states are represented as tuples, two example states are shown in Figure 3.

Figure 3. Visual examples of states. Image by author
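To make these encodings concrete, the snippet below is a minimal sketch (assuming the gym package and its Taxi-v3 implementation) that creates the environment, checks the sizes of the action and state spaces, and uses the Taxi environment’s encode() helper to turn a (row, column, passenger_location, destination) tuple into its integer index:

import gym

env = gym.make("Taxi-v3")

print(env.action_space.n)       # 6 discrete actions
print(env.observation_space.n)  # 500 discrete states

# The underlying Taxi environment provides an encode() helper that maps
# (taxi_row, taxi_col, passenger_location, destination) to an integer in [0, 499]
state = env.unwrapped.encode(3, 1, 2, 0)
print(state)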

Rewards

As for the rewards received for each step performed by the agent, these will be:

  • +20 for successfully delivering the passenger (terminal state)
  • -10 for executing pickup or drop-off actions illegally
  • -1 per step unless another reward is triggered

Now that the environment is well understood, the implementation of the Q-Learning algorithm can proceed. As in the first article, the pseudocode extracted from Sutton and Barto’s book [2] will be used as a reference to support the implementation of the algorithm.

Q-Learning pseudocode. Extracted from Sutton and Barto: “Reinforcement Learning: An Introduction” [2]

1) Initialize Q-Table

The Q-Table is initialized as an m×n matrix with all its values set to zero, where m is the size of the state space and n is the size of the action space.

Since both actions and states are encoded to integers, the construction of the Q-Table can be done using these integers as indices.
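As a minimal sketch of this step (assuming NumPy and the Taxi-v3 environment created with gym.make), the Q-Table can be built as follows:

import gym
import numpy as np

env = gym.make("Taxi-v3")

n_states = env.observation_space.n   # m = 500 states
n_actions = env.action_space.n       # n = 6 actions

# Q-Table: one row per state, one column per action, initialized to zero
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (500, 6)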

2) Define Epsilon-Greedy Policy

The epsilon-greedy policy selects either the action with the highest Q-Value for the given state or a random action, depending on the selected epsilon parameter. For example, an epsilon of 0.15 means that 15% of the time an action will be chosen randomly, while an epsilon of 1 means that the action will always be chosen randomly (100% of the time).

A very interesting property of this policy is that the epsilon value can vary throughout the training, allowing the agent to take more random actions at the beginning with high epsilon values (exploration phase), and finally to take the actions with the highest Q-Value using a very low epsilon (exploitation phase). For this environment, however, training is carried out with a constant epsilon value.
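The author’s helper function is not reproduced in this copy of the article; the function below is a minimal sketch of how such an epsilon-greedy selection could be written, reusing the q_table array from the previous step and the get_epsilon_greedy_action name mentioned later in the text:

import numpy as np

def get_epsilon_greedy_action(q_table, state, epsilon):
    """Return a random action with probability epsilon, otherwise
    the action with the highest Q-Value for the given state."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])   # explore: random action
    return int(np.argmax(q_table[state]))            # exploit: greedy action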

3) Define the execution of an Episode

For each episode the agent will perform as many timesteps as necessary to reach a terminal state. In each of these timesteps, the agent will 1) choose an action following the epsilon-greedy policy, and execute that action. After executing it, the agent will 2) observe the new state reached and the reward obtained, information that will be used to 3) update the Q-Values of its Q-Table.

This process is repeated timestep after timestep, and episode after episode, until the optimal Q-Values are obtained.
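The original implementation is not shown in this copy of the article, but the loop just described can be sketched as follows, assuming the q_table and get_epsilon_greedy_action() defined above and the classic gym reset()/step() API (newer gym/gymnasium releases return an extra value from both calls):

import numpy as np

def execute_episode(env, q_table, epsilon, alpha, gamma):
    """Run one episode, updating the Q-Table at every timestep,
    and return the total reward accumulated during the episode."""
    state = env.reset()   # classic gym API; newer versions return (state, info)
    total_reward = 0
    done = False
    while not done:
        # 1) Choose an action following the epsilon-greedy policy and execute it
        action = get_epsilon_greedy_action(q_table, state, epsilon)
        next_state, reward, done, info = env.step(action)
        # 2) Observe the new state and reward, and 3) update the Q-Value with
        #    the Q-Learning adaptation of Bellman's optimality equation
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        total_reward += reward
    return total_reward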

The logic is very simple: the get_epsilon_greedy_action() function defined before is used to select the action, the agent performs the selected action through the environment’s step() method, and finally the Q-Value is updated by applying the adaptation of Bellman’s optimality equation described and explained in the previous article.

4) Training the Agent

At this point, only the hyperparameters of the algorithm need to be defined, which are: learning rate α, discount factor γ and epsilon. In addition to this, it will also be necessary to specify the number of episodes that the agent must perform in order to consider the training as completed.

After defining all these variables, the execution of the training will consist of running the execute_episode() function defined above for each training episode. Each episode (and every timestep within each episode) will update the Q-Table until optimal values are reached.

It is important to note that the training loop keeps a record of the total reward obtained in each episode in the rewards_history variable, so that the results can be shown for evaluation after the training.
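A minimal sketch of that training loop, reusing the execute_episode() sketch above; the hyperparameter values here are purely illustrative and not necessarily those used by the author:

# Illustrative hyperparameters (assumed values, not the author's)
alpha = 0.1         # learning rate
gamma = 0.99        # discount factor
epsilon = 0.15      # constant exploration rate
n_episodes = 10000  # number of training episodes

rewards_history = []
for episode in range(n_episodes):
    episode_reward = execute_episode(env, q_table, epsilon, alpha, gamma)
    rewards_history.append(episode_reward)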

5) Evaluate the Agent

The agent will be evaluated using three elements: the rewards obtained in each training episode, a visualization of the trained agent carrying out an episode, and the metrics of several executions of the trained agent.

The rewards obtained during the training process are a very important metric, as they should show the convergence of the rewards towards their optimal values. In this case, as the reward for each timestep is negative except at the terminal state, the algorithm should bring the episode rewards as close as possible to 0, or even above it. Plots of the rewards per episode obtained during two training sessions with different hyperparameters are shown below. As can be seen, the first plot shows how the rewards quickly reach their maximum values, approaching an asymptote around 0, which means that the agent has learned to obtain the best possible rewards in each state (optimal Q-Values). The second plot, on the other hand, shows that the rewards neither improve nor worsen, ranging between 0 and -20000, which implies that the agent is not learning the task.
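The plotting code is not included in this copy of the article; a reward-per-episode curve like the ones below could be drawn from the recorded rewards_history with matplotlib, for example:

import matplotlib.pyplot as plt

plt.plot(rewards_history)          # total reward obtained in each episode
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Rewards per episode during training")
plt.show()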

Rewards-Episode Plots for two trainings with different hyperparameters. Image by author

In this case, the second training performs so poorly because of the excessively high value of epsilon: an epsilon of 0.99 causes most actions to be chosen randomly, completely ignoring the exploitation phase.

Regarding the visualization of the agent, the environment’s render mode will be used, which shows the agent while it interacts with the environment. To see what the agent has learned, an episode is executed with epsilon set to 0, that is: the agent always takes the action with the highest Q-Value, thus following the optimal policy.
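The visualization code is not reproduced here; a minimal sketch of the idea, using the classic gym render() call (newer versions require render_mode="human" to be passed to gym.make) and the greedy policy, could look like this:

import time
import numpy as np

state = env.reset()   # classic gym API
done = False
while not done:
    env.render()                             # draw the current state of the grid
    action = int(np.argmax(q_table[state]))  # epsilon = 0: always the greedy action
    state, reward, done, info = env.step(action)
    time.sleep(0.5)                          # slow the loop down for visualization
env.render()                                 # show the final (terminal) state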

Trained Q-Learning Agent executing an episode. GIF by author

As the animation shows, the trained agent picks up the passenger and takes them to their destination, unequivocally showing that the agent has learned how to perform its task correctly.

Finally, the behavior of the trained agent will also be evaluated over several different episodes, to assess the agent’s ability to successfully complete the task from different starting points and with different passenger locations and destinations. As in the visualization, the agent will always choose the action with the highest Q-Value, thus showing what it has learned, rather than behaving stochastically.
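The author’s execute_episodes_on_trained_agent() method is not included in this copy of the article; the function below is a hypothetical sketch with that name, running several greedy episodes and reporting the average number of timesteps and the average reward:

import numpy as np

def execute_episodes_on_trained_agent(env, q_table, n_episodes=10):
    """Run n_episodes with the greedy policy and print the average
    number of timesteps and the average total reward per episode."""
    timesteps, rewards = [], []
    for _ in range(n_episodes):
        state = env.reset()   # classic gym API
        done, steps, total_reward = False, 0, 0
        while not done:
            action = int(np.argmax(q_table[state]))   # always the greedy action
            state, reward, done, info = env.step(action)
            steps += 1
            total_reward += reward
        timesteps.append(steps)
        rewards.append(total_reward)
    print(f"Average timesteps per episode: {np.mean(timesteps):.1f}")
    print(f"Average reward per episode: {np.mean(rewards):.1f}")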

Execution logs of the execute_episodes_on_trained_agent() method. Image by author

As the execution logs show, the average number of timesteps needed to complete an episode is 13.8, and the average reward received per episode is 7.2. Both metrics show that the agent has learned to perform the task perfectly, since it needs very few timesteps to complete it and achieves a reward greater than 0 in every execution.

