
Applied Reinforcement Learning II: Implementation of Q-Learning | by Javier Martínez Ojeda | Oct, 2022



Implementation of the Q-Learning algorithm, and application to OpenAI Gym’s Taxi-v3 environment

Photo by Richard Bell on Unsplash

The first article in this series introduced the basic concepts and components of any Reinforcement Learning system, and explained the theory behind the Q-Learning algorithm. In this article the goal is to implement the algorithm in Python 3 and apply it to a real training environment. All the concepts covered in the first article (Applied Reinforcement Learning I: Q-Learning) are used here and assumed to be understood, so if you are not familiar with them, or have not read that article, it is recommended to read it first.

To keep this article didactic, a simple environment has been chosen that does not add unnecessary complexity to the training, so that the behavior of the Q-Learning algorithm can be fully appreciated. The environment is OpenAI Gym’s Taxi-v3 [1], a grid world in which the agent is a taxi driver who must pick up a passenger and drop them off at their destination.

Actions

As for the action space, the agent can interact with the environment through the following discrete actions: move south, move north, move east, move west, pick up the passenger and drop off the passenger. This makes a total of 6 possible actions, which are encoded as integers from 0 to 5 for ease of programming. The correspondences between actions and numbers are shown in Figure 1.

Figure 1. Action — Number mappings. Image by author

States

The discrete state space is much larger, as each state is represented as a tuple containing the agent/taxi driver’s position on the grid, the location of the passenger to be picked up, and the passenger’s destination. Since the map is a 2-dimensional grid, the agent’s position can be fully represented by its x and y coordinates, so the tuple representing the state is: (pos_agent_x, pos_agent_y, passenger_location, destination). As with the actions, the tuples representing states are encoded as integers between 0 and 499 (500 possible states). Figure 2 shows the correspondence between the passenger’s location/destination and the integer representing that location/destination, which is used in the tuple describing the state.

Figure 2. Passenger location/destination — Number mappings. Image by author

To visualize how the possible states are represented as tuples, two example states are shown in Figure 3.

Figure 3. Visual examples of states. Image by author
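To make these encodings concrete, the snippet below is a minimal sketch (assuming the gym package and its Taxi-v3 implementation) that creates the environment, checks the sizes of the action and state spaces, and uses the Taxi environment’s encode() helper to turn a (row, column, passenger_location, destination) tuple into its integer index:

import gym

env = gym.make("Taxi-v3")

print(env.action_space.n)       # 6 discrete actions
print(env.observation_space.n)  # 500 discrete states

# The underlying Taxi environment provides an encode() helper that maps
# (taxi_row, taxi_col, passenger_location, destination) to an integer in [0, 499]
state = env.unwrapped.encode(3, 1, 2, 0)
print(state)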

Rewards

As for the rewards received for each step performed by the agent, these will be:

  • +20 for successfully delivering the passenger (terminal state)
  • -10 for executing pickup or drop-off actions illegally
  • -1 per step unless another reward is triggered

Now that the environment is well understood, the implementation of the Q-Learning algorithm can proceed. As in the first article, the pseudocode extracted from Sutton and Barto’s book [2] will be used as a reference to support the implementation of the algorithm.

Q-Learning pseudocode. Extracted from Sutton and Barto: “Reinforcement Learning: An Introduction” [2]

1) Initialize Q-Table

The Q-Table is initialized as an m×n matrix with all its values set to zero, where m is the size of the state space and n is the size of the action space.

Since both actions and states are encoded to integers, the construction of the Q-Table can be done using these integers as indices.
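As a minimal sketch of this step (assuming NumPy and the Taxi-v3 environment created with gym.make), the Q-Table can be built as follows:

import gym
import numpy as np

env = gym.make("Taxi-v3")

n_states = env.observation_space.n   # m = 500 states
n_actions = env.action_space.n       # n = 6 actions

# Q-Table: one row per state, one column per action, initialized to zero
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (500, 6)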

2) Define Epsilon-Greedy Policy

The epsilon-greedy policy selects either the action with the highest Q-Value for the given state or a random action, depending on the selected epsilon parameter. For example, an epsilon of 0.15 means that 15% of the time an action will be chosen randomly, while an epsilon of 1 means that the action will always be chosen randomly (100% of the time).

A very interesting property of this policy is that the epsilon value can vary throughout the training, allowing the agent to take more random actions at the beginning with high epsilon values (exploration phase), and finally to take the actions with the highest Q-Value using a very low epsilon (exploitation phase). For this environment, however, training is carried out with a constant epsilon value.
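The author’s helper function is not reproduced in this copy of the article; the function below is a minimal sketch of how such an epsilon-greedy selection could be written, reusing the q_table array from the previous step and the get_epsilon_greedy_action name mentioned later in the text:

import numpy as np

def get_epsilon_greedy_action(q_table, state, epsilon):
    """Return a random action with probability epsilon, otherwise
    the action with the highest Q-Value for the given state."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])   # explore: random action
    return int(np.argmax(q_table[state]))            # exploit: greedy action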

3) Define the execution of an Episode

For each episode the agent will perform as many timesteps as necessary to reach a terminal state. In each of these timesteps, the agent will 1) choose an action following the epsilon-greedy policy, and execute that action. After executing it, the agent will 2) observe the new state reached and the reward obtained, information that will be used to 3) update the Q-Values of its Q-Table.

This process is repeated timestep after timestep, and episode after episode, until the optimal Q-Values are obtained.
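The original implementation is not shown in this copy of the article, but the loop just described can be sketched as follows, assuming the q_table and get_epsilon_greedy_action() defined above and the classic gym reset()/step() API (newer gym/gymnasium releases return an extra value from both calls):

import numpy as np

def execute_episode(env, q_table, epsilon, alpha, gamma):
    """Run one episode, updating the Q-Table at every timestep,
    and return the total reward accumulated during the episode."""
    state = env.reset()   # classic gym API; newer versions return (state, info)
    total_reward = 0
    done = False
    while not done:
        # 1) Choose an action following the epsilon-greedy policy and execute it
        action = get_epsilon_greedy_action(q_table, state, epsilon)
        next_state, reward, done, info = env.step(action)
        # 2) Observe the new state and reward, and 3) update the Q-Value with
        #    the Q-Learning adaptation of Bellman's optimality equation
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        total_reward += reward
    return total_reward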

The logic is very simple: the get_epsilon_greedy_action() function defined before is used to select the action, the agent performs the selected action through the environment’s step() method, and finally the Q-Value is updated by applying the adaptation of Bellman’s optimality equation described and explained in the previous article.

4) Training the Agent

At this point, only the hyperparameters of the algorithm need to be defined, which are: learning rate α, discount factor γ and epsilon. In addition to this, it will also be necessary to specify the number of episodes that the agent must perform in order to consider the training as completed.

After defining all these variables, the execution of the training will consist of running the execute_episode() function defined above for each training episode. Each episode (and every timestep within each episode) will update the Q-Table until optimal values are reached.

It is important to note that the training loop keeps a record of the total reward obtained in each episode in the rewards_history variable, so that the results can be shown for evaluation after the training.
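A minimal sketch of that training loop, reusing the execute_episode() sketch above; the hyperparameter values here are purely illustrative and not necessarily those used by the author:

# Illustrative hyperparameters (assumed values, not the author's)
alpha = 0.1         # learning rate
gamma = 0.99        # discount factor
epsilon = 0.15      # constant exploration rate
n_episodes = 10000  # number of training episodes

rewards_history = []
for episode in range(n_episodes):
    episode_reward = execute_episode(env, q_table, epsilon, alpha, gamma)
    rewards_history.append(episode_reward)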

5) Evaluate the Agent

The agent will be evaluated using three elements: the rewards obtained in each training episode, a visualization of the trained agent carrying out an episode, and the metrics of several executions of the trained agent.

The rewards obtained during the training process are a very important metric, as they should show the convergence of the rewards towards their optimal values. In this case, as the reward for each timestep is negative except at the terminal state, the algorithm should bring the episode rewards as close as possible to 0, or even above it. Plots of the rewards per episode obtained during two training sessions with different hyperparameters are shown below. As can be seen, the first plot shows how the rewards quickly reach their maximum values, approaching an asymptote around 0, which means that the agent has learned to obtain the best possible rewards in each state (optimal Q-Values). The second plot, on the other hand, shows that the rewards neither improve nor worsen, ranging between 0 and -20000, which implies that the agent is not learning the task.
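The plotting code is not included in this copy of the article; a reward-per-episode curve like the ones below could be drawn from the recorded rewards_history with matplotlib, for example:

import matplotlib.pyplot as plt

plt.plot(rewards_history)          # total reward obtained in each episode
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Rewards per episode during training")
plt.show()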

Rewards-Episode Plots for two trainings with different hyperparameters. Image by author

In this case, the second training performs so poorly because of the excessively high value of epsilon: an epsilon of 0.99 causes most actions to be chosen randomly, completely ignoring the exploitation phase.

Regarding the visualization of the agent, the environment’s render mode will be used, which shows the agent while it interacts with the environment. To see what the agent has learned, an episode is executed with epsilon set to 0, that is: the agent always takes the action with the highest Q-Value, thus following the optimal policy.
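The visualization code is not reproduced here; a minimal sketch of the idea, using the classic gym render() call (newer versions require render_mode="human" to be passed to gym.make) and the greedy policy, could look like this:

import time
import numpy as np

state = env.reset()   # classic gym API
done = False
while not done:
    env.render()                             # draw the current state of the grid
    action = int(np.argmax(q_table[state]))  # epsilon = 0: always the greedy action
    state, reward, done, info = env.step(action)
    time.sleep(0.5)                          # slow the loop down for visualization
env.render()                                 # show the final (terminal) state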

Trained Q-Learning Agent executing an episode. GIF by author

As the animation shows, the trained agent picks up the passenger and takes them to their destination, unequivocally showing that the agent has learned how to perform its task correctly.

Finally, the behavior of the trained agent will also be evaluated over several different episodes, to assess the agent’s ability to successfully complete the task from different starting points and with different passenger locations and destinations. As in the visualization, the agent will always choose the action with the highest Q-Value, thus showing what it has learned, rather than behaving stochastically.
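The author’s execute_episodes_on_trained_agent() method is not included in this copy of the article; the function below is a hypothetical sketch with that name, running several greedy episodes and reporting the average number of timesteps and the average reward:

import numpy as np

def execute_episodes_on_trained_agent(env, q_table, n_episodes=10):
    """Run n_episodes with the greedy policy and print the average
    number of timesteps and the average total reward per episode."""
    timesteps, rewards = [], []
    for _ in range(n_episodes):
        state = env.reset()   # classic gym API
        done, steps, total_reward = False, 0, 0
        while not done:
            action = int(np.argmax(q_table[state]))   # always the greedy action
            state, reward, done, info = env.step(action)
            steps += 1
            total_reward += reward
        timesteps.append(steps)
        rewards.append(total_reward)
    print(f"Average timesteps per episode: {np.mean(timesteps):.1f}")
    print(f"Average reward per episode: {np.mean(rewards):.1f}")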

Execution logs of the execute_episodes_on_trained_agent() method. Image by author

As the execution logs show, the average number of timesteps needed to complete an episode is 13.8, and the average reward received per episode is 7.2. Both metrics show that the agent has learned to perform the task perfectly, since it needs very few timesteps to complete it and achieves a reward greater than 0 in every execution.

