Solving The Taxi Environment With Q-Learning — A Tutorial | by Wouter van Heeswijk, PhD | Mar, 2023


A Python implementation of Q-learning to solve the Taxi-v3 environment from OpenAI Gym in an animated Jupyter Notebook

Photo by Alexander Redl on Unsplash

The goal of the Taxi Environment in OpenAI’s Gym — yes, from the company behind ChatGPT and Dall⋅E — is simple and straightforward, making for an excellent introduction to the field of Reinforcement Learning (RL).

This article provides a step-by-step guide to implementing the environment, learning a policy with tabular Q-learning, and visualizing the learned behavior in animations. If you are looking to make your first strides in RL, this tutorial might be just for you.

In the Taxi environment, a taxi should pick up passengers and drop them at their destination on a small parking lot, driving the shortest path possible. How hard can it be to learn such a task with RL? Let’s find out.

Before diving into the implementation, it is good practice to first formulate the setting as a Markov Decision Process (MDP), getting a grasp on the mathematical structure of the problem at hand.

State space

An MDP state contains the information needed to (i) determine actions, (ii) compute the action reward, and (iii) compute the transition to the next state.

The most obvious state property is the location of the taxi; the 5*5 grid gives us 25 options. Furthermore, a passenger waiting for pickup can be at one of 4 points (labeled Y, R, G, B) or they can be in the taxi, giving us (4+1) options. Finally, the destination can also be Y, R, G or B. We denote the state by the following vector:

State = [x_pos_taxi, y_pos_taxi, pos_passenger, dest_passenger]

Although a very simple problem setting, we thus already deal with 5*5*(4+1)*4=500 states! Scale the problem to thousands of locations, and it is easy to see how quickly the size of MDPs explodes (the so-called curse of dimensionality).

The number of reachable states is slightly smaller than 500; for instance, a passenger will never have the same pickup point and destination. To avoid additional modelling complexity, however, we typically work with the full state space, which represents the worst case.
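Under the hood, Gym flattens this vector into a single integer between 0 and 499. As a sketch (the encode/decode helpers below are accessed on the unwrapped environment; check your Gym version), the conversion between the two representations looks like this:

```python
import gym

# Sketch: converting between the state vector and Gym's integer state.
env = gym.make("Taxi-v3")

# encode(taxi_row, taxi_col, passenger_location, destination) -> integer in [0, 499]
state = env.unwrapped.encode(2, 3, 1, 0)
print("integer state:", state)

# decode() recovers (taxi_row, taxi_col, passenger_location, destination)
print("state vector:", list(env.unwrapped.decode(state)))
```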

Action space

The state space may be bigger than you’d expect at first glance, but the action space contains only six decisions:

  • Down
  • Up
  • Left
  • Right
  • Drop passenger
  • Pick up passenger

In case the action is not allowed (e.g., trying to drive through a wall, attempting to pick up a customer when there is none at the present location), the agent will remain still.

The action_mask (found in env.action_mask(state)) informs us which actions are feasible, but this feasibility check is not automatically enforced.
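A minimal sketch of applying the mask manually, assuming a Gym version whose reset() also returns the mask in the info dictionary:

```python
import gym
import numpy as np

# Sketch: sampling only feasible actions using the action mask.
env = gym.make("Taxi-v3")
state, info = env.reset(seed=42)

mask = info["action_mask"]                 # 0/1 vector of length 6
feasible_actions = np.where(mask == 1)[0]  # indices of allowed actions

action = np.random.choice(feasible_actions)
next_state, reward, terminated, truncated, info = env.step(action)
```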

Reward function

Three types of reward can be distinguished in this environment:

  • Move: -1. Every move comes with a small penalty, to encourage following the shortest path from origin to destination.
  • Failed drop-off: -10. A passenger will naturally be unhappy when dropped at the wrong location, so a large penalty is appropriate.
  • Successful drop-off: 20. To encourage the desired behavior, a large bonus is awarded.

Designing rewards in a way that encourages desired behavior is known as reward shaping. For instance, we might simply end the game without a reward upon successful drop-off (thus ending the stream of costs), but a clear reward signal often eases learning considerably.

Transition function

Unlike most realistic environments, the taxi environment exhibits no randomness. This feature considerably simplifies learning, as the next state is dictated solely by our action. In case an infeasible action is selected — such as trying to drive through a wall or picking up a non-existent passenger — the agent simply remains in place.

In an abstract sense, we could denote the transition function as follows:

new_state = f(state,action)

An explicit definition would consider all grid moves, (in)feasible drop-offs and (in)feasible pickups, which is not necessarily hard but requires some focus.

The game ends upon successfully dropping off a passenger at their destination, denoting the terminating state.
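If you want to verify this determinism yourself, here is a sketch assuming the toy-text convention that the unwrapped environment exposes a transition table P: each state-action pair maps to exactly one successor with probability 1.0.

```python
import gym

# Sketch: inspecting the deterministic transition table of Taxi-v3.
env = gym.make("Taxi-v3")
state, info = env.reset(seed=42)

for action in range(env.action_space.n):
    # Each entry is a list of (probability, next_state, reward, done) tuples;
    # in a deterministic environment it holds a single tuple with prob 1.0.
    prob, next_state, reward, done = env.unwrapped.P[state][action][0]
    print(f"action {action}: next_state={next_state}, reward={reward}, done={done}")
```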

Having mathematically defined the problem (although not in excruciating detail), we will move towards implementation.

First, we install the necessary libraries and subsequently import them. Obviously we need to install the gym environment; other than that, we just require some visualization packages and your typical data science libraries.
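A minimal sketch of the installs and imports (the exact package list is an assumption; adjust it to your setup):

```python
# pip install gym pygame numpy matplotlib imageio

import gym                       # the Taxi-v3 environment
import numpy as np               # Q-table and array operations
import matplotlib.pyplot as plt  # rendering frames and convergence plots
import imageio                   # storing animations as GIFs
from IPython.display import clear_output  # refreshing Jupyter output between frames
```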

Assuming all imports went correctly, it is time for a sanity check. Let's create and render a Taxi-v3 environment.
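A sketch of this sanity check, continuing from the imports above and assuming a Gym version with the render_mode argument and the (state, info) reset API:

```python
# Sketch: create the environment, sample an action, and render a still frame.
env = gym.make("Taxi-v3", render_mode="rgb_array")
state, info = env.reset(seed=42)

print("sampled action:", env.action_space.sample())  # integer between 0 and 5
print("initial state:", state)                       # integer between 0 and 499
print("action mask:", info["action_mask"])           # 0/1 vector of length 6

plt.imshow(env.render())  # plot the RGB frame
plt.axis("off")
plt.show()
```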

We should see an action between 0 and 5, a state between 0 and 499, and an action mask to block infeasible actions. Most importantly, we should have rendered a still frame to affirm we are truly up and running!

Screenshot of initialization output [image by author, using OpenAI Gym]

Finally, let’s set up two functions: one for the console output and animation, and one for storing GIFs. Note that the animation is simply a repeated plot of the RGB frame. It is also very time-consuming, and therefore not recommended to run during training.
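These helpers could look roughly as follows (the names, signatures, and imageio-based GIF writer are illustrative choices, not the author's exact notebook code):

```python
def show_frame(env, episode, step, cumulative_reward):
    """Plot the current RGB frame and print basic progress information."""
    clear_output(wait=True)  # refresh the Jupyter output cell
    plt.imshow(env.render())
    plt.axis("off")
    plt.show()
    print(f"episode {episode} | step {step} | cumulative reward {cumulative_reward}")


def save_gif(frames, path="taxi.gif", seconds_per_frame=0.25):
    """Store a list of RGB frames as an animated GIF."""
    imageio.mimsave(path, frames, duration=seconds_per_frame)
```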

Having asserted that the environment works as expected, it is time to let a random agent run wild. As you guessed, this agent takes a random action at every time step. This is essentially how the learning agent sets out, having no reliable Q-values to work with.
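A minimal sketch of such a random-agent episode; note that the registered environment applies a step limit by default, so the max_episode_steps argument below (an assumption about your Gym version) raises it to let the agent wander:

```python
# Sketch: a fully random agent, ignoring the action mask entirely.
env = gym.make("Taxi-v3", render_mode="rgb_array", max_episode_steps=10_000)
state, info = env.reset(seed=42)

done, steps, total_reward = False, 0, 0
while not done:
    action = env.action_space.sample()  # random action, feasible or not
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
    done = terminated or truncated

print(f"episode finished after {steps} steps with total reward {total_reward}")
```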

You will see the agent trying to drive through walls, pick up passengers at deserted spots, or drop off passengers at the wrong location. It might take thousands of actions before — strictly by accident — dropping the passenger off at the right location.

As you can imagine, the random agent frustrates the passenger to no end:

Animation of an untrained agent. The taxi selects random actions at each time step, including infeasible ones, until eventually dropping off the passenger at the correct stop by accident [image by author, using OpenAI Gym]

Why sit through this lengthy animation? Well, it really gives an impression of how an untrained RL agent behaves, and of how long it takes to get a meaningful reward signal.

Note: by applying the action mask — a Boolean vector that essentially assigns a probability of 0 to selecting infeasible actions — we restrict ourselves to only sensible actions, typically substantially shortening the episode’s length.

Time to learn something useful.

We will deploy a basic Q-learning algorithm with ϵ-greedy exploration, selecting random actions with probability ϵ and using Q-values to select actions the remainder of the time.

Q-values are updated using the following equation after each observation. Note that with 500 states and 6 actions, we must fill a Q-table of size 500*6=3000, with each state-action pair requiring multiple observations to learn its value with any degree of accuracy.
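For reference, the standard tabular Q-learning update reads:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where α is the learning rate and γ the discount factor.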

In Q-learning, we update values of state-action pairs after each observation, using off-policy learning.

The code looks as follows. You can tinker with the hyper-parameters; in my experience this problem is fairly insensitive to them.
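A minimal sketch of such a training loop (the hyper-parameter values and variable names are illustrative, not necessarily those used in the notebook):

```python
# Sketch: tabular Q-learning with epsilon-greedy exploration.
env = gym.make("Taxi-v3")

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate
num_episodes = 2000

# Q-table of size 500 x 6, initialized at zero
q_table = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))  # ties resolve to the first maximum

        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # off-policy Q-learning update: bootstrap from the greedy next action
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
```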

Convergence of training is visualized below. After 2000 episodes, it seems we have learned a pretty good and consistent policy.

Initially, the agent requires many steps to successfully complete an episode. Ultimately, positive rewards are recognized and the agent starts taking increasingly efficient actions, culminating in shortest-path solutions [image by author]

Note: the code ignores the action mask and selects the first Q-value in case of a tie. Although alternatives for both issues are provided, they substantially increase the computational effort per episode, while providing no clear performance benefit in this setting.

Let’s see what we learned. Depending on the state we are in, we look up the corresponding Q-values in the Q-table (i.e., six values for each state, corresponding to the actions) and select the action with the highest associated Q-value.

Effectively, the Q-value captures the expected cumulative reward associated with the action. Note that we no longer select random actions with probability ϵ — this mechanism was required strictly for learning.
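A minimal sketch of this greedy rollout, reusing the q_table and env from the training sketch above:

```python
# Sketch: act greedily with respect to the learned Q-table.
state, info = env.reset()
done, total_reward = False, 0

while not done:
    action = int(np.argmax(q_table[state]))  # greedy action: no more exploration
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"total reward of the learned policy: {total_reward}")
```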

If you performed sufficiently many iterations, you should see the taxi always drive directly to the passenger, take the shortest path to the destination, and successfully drop off the passenger. Indeed, we’ve come a long way since driving around at random.

Animation of a successfully learned policy. The taxi will follow the shortest path both to pick up and drop off the passenger, maximizing the reward [image by author, using OpenAI Gym]

Although it requires many observations to learn this policy in a seemingly trivial environment, you might appreciate the result more if you realize the algorithm learned all of this without knowing anything at all about the environment it operates in!

The Taxi environment is a nice one to get started with Reinforcement Learning. The problem setting is simple and intuitive, yet could easily be extended towards something more realistic (think realistic urban networks, managing taxi fleets, uncertain travel times). It is computationally manageable, but also illustrates the computational effort required to learn policies even for trivial problem settings.

If you are new to Reinforcement Learning, playing around with the OpenAI Gym environments is highly recommended before moving to designing and solving your own problems.

The full Jupyter Notebook can be found on my GitHub:

https://github.com/woutervanheeswijk/taxi_environment

