Creating a Custom Gym Environment for Jupyter Notebooks

by Steve Roberts


Part 1: Creating the framework

[All images by author]

This article (split over two parts) describes the creation of a custom OpenAI Gym environment for Reinforcement Learning (RL) problems.

Quite a few tutorials already exist that show how to create a custom Gym environment (see the References section for a few good links). In all of these examples, and indeed in the most common Gym environments, the output is either text-based (e.g. FrozenLake) or image-based, appearing in a separate graphical window (e.g. Lunar Lander).

Instead we’ll create a custom environment that produces its output in a Jupyter notebook. The graphical representation of the environment will be written directly into the notebook cell and updated in real time. Additionally, it can be used in any test framework, and with any RL algorithm, that also implements the Gym interface.

By the end of the article we will have created a custom Gym environment, that can be tailored to produce a range of different Grid Worlds for Baby Robot to explore, and that renders an output similar to the cover image shown above.

The associated Jupyter Notebook for this article can be found on Github. This contains all of the code required to set up and run the Baby Robot custom Gym environment described below.

Up until now, in our series on Reinforcement Learning (RL), we’ve used bespoke environments to represent the locations where Baby Robot finds himself. Starting from a simple grid world we added components, such as walls and puddles, to increase the complexities of the challenges that Baby Robot faced.

Now that we know the basics of RL, and before we move on to more complex problems and algorithms, it seems like a good time to formalise Baby Robot’s environment. If we give this environment a fixed, well-defined interface then we can re-use the same environment in all of our problems and with multiple RL algorithms. This will make things a lot simpler as we move forward to look at different RL methods.

By adopting a common interface we can then drop this environment into any existing systems that also implement the same interface. All we need to do is decide what interface we should use. Luckily for us this has already been done, and it’s called the OpenAI Gym interface.

Introduction to OpenAI Gym

OpenAI Gym is a set of Reinforcement Learning (RL) environments, with problems ranging from simple grid worlds up to complex physics engines.

Each of these environments implements the same interface, making it easy to test a single environment using a range of different RL algorithms. Similarly, it makes it straightforward to evaluate a single RL algorithm on a range of different problems.

As a result, OpenAI Gym has become the de-facto standard for learning about and bench-marking RL algorithms.

The OpenAI Gym Interface

The interface for all OpenAI Gym environments can be divided into 3 parts:

1. Initialisation: Create and initialise the environment.

2. Execution: Take repeated actions in the environment. At each step the environment provides information to describe its new state and the reward received as a consequence of taking the specified action. This continues until the environment signals that the episode is complete.

3. Termination: Cleanup and destroy the environment.

Example: The CartPole Environment

One of the simpler problems in Gym is the CartPole environment. In this problem the goal is to move a cart left or right so that the pole balanced on the cart remains upright.

Figure 1: Output of the CartPole Environment — the aim is to balance the pole by moving the cart left or right.

The code to set up and run this Gym environment is shown below. Here we’re just choosing left or right actions randomly, so the pole isn’t going to stay up for very long!
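
A minimal sketch of this loop, written against the classic Gym API (where ‘step’ returns the observation, reward, done flag and info dictionary), looks something like the following:

import gym

# 1. Initialisation: create the environment and get its initial observation
env = gym.make('CartPole-v0')
obs = env.reset()

# 2. Execution: take random actions until the episode completes
done = False
while not done:
    env.render()
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)

# 3. Termination: tidy up and close any render window
env.close()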

Listing 1: The 3 stages of running a Gym environment.

In Listing 1, shown above, we’ve labelled the 3 stages of a Gym environment. In more detail, each of these does the following:

1. Initialisation

env = gym.make('CartPole-v0')
  • Create the required environment, in this case the version ‘0’ of CartPole. The returned environment object ‘env’ can then be used to call the functions in the common Gym environment interface.
obs = env.reset()
  • Called at the start of each episode, this puts the environment into its starting state and returns the initial observation of the environment.

2. Execution

Here we run until the environment’s ‘done’ flag is set to indicate that the episode is complete. This can occur when the agent has reached a terminal state or when a fixed number of steps has been executed.

env.render()
  • Draw the current state of the environment. In the case of CartPole this will result in a new window being opened to display a graphical view of the cart and its pole. In simpler environments, such as the FrozenLake simple grid-world, a textual representation is shown.
action = env.action_space.sample()
  • Choose a random action from the environment’s set of possible actions.
obs, reward, done, info = env.step(action)
  • Take the action and get back information from the environment about the outcome of this action. This includes 4 pieces of information:
  • ‘obs’: Defines the new state of the environment. In the case of CartPole this is information about the position and velocity of the pole. In a grid-world environment it would be information about the next state, where we end up after taking the action.
  • ‘reward’: The amount of reward, if any, received as a result of taking the action.
  • ‘done’: A flag to indicate if we’ve reached the end of the episode.
  • ‘info’: Any additional information. In general this isn’t set.

3. Termination

env.close()
  • Terminate the environment. This will also close any graphical window that may have been created by the render function.

As described previously, the major advantage of using OpenAI Gym is that every environment uses exactly the same interface. We can just replace the environment name string ‘CartPole-v0’ in the ‘gym.make’ line above with the name of any other environment and the rest of the code can stay exactly the same.

This is also true for any custom environment that implements the Gym interface. All that’s required is a class that inherits from the Gym environment and that adds the set of functions described above.

This is shown below for the initial framework of the custom ‘BabyRobotEnv’ that we’re going to create (the ‘_v0’ appended to the class name indicates that this is version zero of our environment. We’ll update this as we add functionality):
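
A minimal sketch of this framework class (the code in the accompanying notebook may differ slightly in detail) looks like the following:

import gym

class BabyRobotEnv_v0(gym.Env):

    def __init__(self):
        super().__init__()

    def step(self, action):
        # take a step in the environment - no functionality yet
        pass

    def reset(self):
        # put the environment back into its initial state
        pass

    def render(self, mode='human'):
        # show the current state of the environment
        pass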

In this basic framework for our custom environment we’ve inherited our class from the base ‘gym.Env’ class, which gives us all of the main functionality required to create the environment. To this we’ve then added the 4 functions that are required to turn the class into our own, custom, environment:

  • ‘__init__’: the class initialisation, where we can set up anything required by the class.
  • ‘step’: implements what happens when Baby Robot takes a step in the environment and returns information describing the results of taking that step.
  • ‘reset’: called at the start of every episode to put the environment back into its initial state.
  • ‘render’: provides a graphical or text-based representation of the environment, to allow the user to see how things are progressing.

We haven’t implemented a ‘close’ function, since there’s currently nothing to close, so we can just rely on the base class to do any required clean up. Additionally, we haven’t yet added any functionality. Our class satisfies the requirements of the Gym interface, and could be used within a Gym test harness, but it currently won’t do much!

The code above defines the framework for a custom environment, however it can’t yet be run since it currently has no ‘action_space’ from which to sample random actions. The ‘action_space’ defines the set of actions that an agent may take in the environment. These can be discrete, continuous or a combination of both.

  • Discrete actions represent a mutually-exclusive set of possible actions, such as the left and right actions in the CartPole environment. At any time-step you can either choose left or right but not both.
  • Continuous actions are actions that have an associated value, which represents the amount of that action to take. For example, when turning a steering wheel an angle could be specified to represent how far the wheel should be turned.

The Baby Robot environment that we’re creating is what’s referred to as a Grid World. In other words, it’s a grid of squares where Baby Robot may move around, from square to square, to explore and navigate the environment. The default level in this environment will be a 3 x 3 grid, with a starting point at the top left-hand corner, and an exit at the bottom right-hand corner, as shown in Figure 2:

Figure 2: The default level in the Baby Robot environment.

Therefore, for the custom BabyRobotEnv that we’re creating, there are only 4 possible movement actions: North, South, East or West. Additionally, we’ll add a ‘Stay’ action, where Baby Robot remains in the current position. So, in total, we have 5 mutually-exclusive actions and we therefore set the action space to define 5 discrete values:

self.action_space = gym.spaces.Discrete(5)

In addition to an action_space, all environments need to specify an ‘observation_space’. This defines the information supplied to the agent when it receives an observation about the environment.

When Baby Robot takes a step in the environment we want to return his new position. Therefore we’ll define an observation space that specifies a grid position as an ‘x’ and ‘y’ coordinate.

The Gym interface defines a couple of different ‘spaces’ that could be used to specify our coordinates. For example, if our coordinates were continuous, floating-point values we could use the Box space. This would also let us set a limit on the possible range of values that can be used for the ‘x’ and ‘y’ coordinates. Additionally, we could then combine these to form a single expression of the environment’s observation space using Gym’s Dict space.

However, since we’re only going to allow whole moves from one square to the next (as opposed to being half-way between squares), we’ll specify the grid coordinates as integers. Therefore, as with the action space, we’ll be using a discrete set of values. But now, instead of there being only a single discrete value, we have two: one for each of the ‘x’ and ‘y’ coordinates. Luckily for us, the Gym interface has just the thing: the MultiDiscrete space.

In the horizontal direction the maximum ‘x’ position is bounded by the width of the grid and in the vertical ‘y’ direction by the height of the grid. Therefore, the observation space can be defined as follows:

self.observation_space = MultiDiscrete([ self.width, self.height ])

Discrete spaces are zero based, so our coordinate values will be from zero up to one less than the defined maximum value.

With these changes the new version of the BabyRobotEnv class is as shown below:
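
A sketch of this updated class, matching the points described below, might look as follows (the exact code is in the accompanying notebook):

import numpy as np
import gym
from gym.spaces import MultiDiscrete

class BabyRobotEnv_v1(gym.Env):

    def __init__(self, **kwargs):
        super().__init__()

        # dimensions of the grid (defaults to a 3x3 grid)
        self.width = kwargs.get('width', 3)
        self.height = kwargs.get('height', 3)

        # there are 5 discrete actions: North, South, East, West and Stay
        self.action_space = gym.spaces.Discrete(5)

        # the observation is Baby Robot's grid position as (x,y) coordinates
        self.observation_space = MultiDiscrete([self.width, self.height])

        # Baby Robot's position in the grid
        self.x = 0
        self.y = 0

    def step(self, action):
        # no actions are applied yet - just return the current position
        obs = np.array([self.x, self.y])
        return obs, 0, True, {}

    def reset(self):
        # put Baby Robot back at the start of the grid
        self.x = 0
        self.y = 0
        return np.array([self.x, self.y])

    def render(self, mode='human'):
        pass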

There are a couple of points to note about the new version of the BabyRobotEnv class:

  • We’re supplying a kwargs argument to the init function, letting us create our instance with a dictionary of parameters. Here we’re just going to supply the width and height of the grid we want to make, but going forward we can use this to pass other parameters and by using kwargs we can avoid changing the interface of the class.
  • When we take the width and height from the kwargs, in both cases we default to values of 3 if the parameter hasn’t been supplied. So we get a grid of size 3×3 if no arguments are supplied during the creation of the environment.
  • We’ve now defined Baby Robot’s position in the grid using ‘self.x’ and ‘self.y’, which we now return as the observation from the ‘reset’ and ‘step’ functions. In both cases we’ve converted these values into numpy arrays which, although not required by the Gym interface, is required by the Stable Baselines environment checker that will be introduced in the next section.

Before we start adding any real functionality, it’s worth confirming that our new environment conforms to the Gym interface. To test this we can validate our class using the Stable Baselines Environment Checker.

Not only does this test that we’ve implemented the functions required for the Gym interface, but it also checks that the action and observation spaces are set up correctly and that the function responses match the associated observation space.

One point to note about the environment checker is that, as well as validating that an environment conforms to the Gym standard, it also checks that the environment is suitable to be run with the Stable Baselines set of RL algorithms. As part of this it expects the observations to be returned as numpy arrays, which is why they’ve been added in the ‘reset’ and ‘step’ functions shown above.

To run the check it’s simply a case of creating an instance of the environment and supplying this to the ‘check_env’ function. If there’s anything wrong then warning messages will be shown. If there’s no output then it’s all good.
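
For example, with the class defined above, the check might look like this (‘check_env’ comes from the stable_baselines3 package):

from stable_baselines3.common.env_checker import check_env

# create an instance of our custom environment and validate it
env = BabyRobotEnv_v1()
check_env(env, warn=True)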

We can also take a look at the environment’s action and observation spaces, to make sure they’re returning the expected values:

print(f"Action Space: {env.action_space}")
print(f"Action Space Sample: {env.action_space.sample()}")

Should give an output similar to:

Action Space: Discrete(5)
Action Space Sample: 3

  • the action space, as expected, is a Discrete space with 5 possible values.
  • the value sampled from the action space will be a random value between 0 and 4.

And for the observation space:

print(f"Observation Space: {env.observation_space}")
print(f"Observation Space Sample: {env.observation_space.sample()}")

Should give an output similar to:

Observation Space: MultiDiscrete([3 3])
Observation Space Sample: [0 2]

  • the observation space has a MultiDiscrete type and its two components each have 3 possible values (since we created a default 3×3 grid).
  • when sampling from the observation space for this grid, both ‘x’ and ‘y’ can take the values 0, 1 or 2.

You may have noticed that in the test above, rather than creating the environment using ‘gym.make’, as we did for CartPole, we instead simply created an instance of it by doing:

env = BabyRobotEnv()

This is absolutely fine when working with the environment ourselves, but if we want to have our custom environment registered as a proper Gym environment, that can be created using ‘gym.make’, then there are a couple of further steps we need to take.

Firstly, following the Gym documentation, we need to set up our files and directories with a structure similar to that shown below:

Figure 3: Directory structure for a custom Gym environment.

So we need 3 directories:

  1. The main directory (in this case ‘BabyRobotGym’) to hold our ‘setup.py’ file. This file defines the name of the project directory and references the required resources, which in this case is just the ‘gym’ library. The contents of this file are as shown below:
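
A minimal ‘setup.py’ along these lines might look something like this (the version number here is arbitrary):

from setuptools import setup

setup(
    name='babyrobot',
    version='0.0.1',
    install_requires=['gym']
)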

2. The project directory, which has the same name as the setup file’s ‘name’ parameter. So in this case the directory is called ‘babyrobot’. This contains a single file ‘__init__.py’ which defines the available versions of the environment:
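
Assuming the two environment versions described in the next point, this registration file might look something like the following:

from gym.envs.registration import register

register(
    id='BabyRobotEnv-v0',
    entry_point='babyrobot.envs:BabyRobotEnv_v0'
)

register(
    id='BabyRobotEnv-v1',
    entry_point='babyrobot.envs:BabyRobotEnv_v1'
)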

3. The ‘envs’ directory where the main functionality lives. In our case this contains the two versions of the Baby Robot environment that we’ve defined above (‘baby_robot_env_v0.py’ and ‘baby_robot_env_v1.py’). These define the two classes that are referenced in the ‘babyrobot/__init__.py’ file.

Additionally this directory contains its own ‘__init__.py’ file that references both of the files contained in the directory:
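
In this case it simply needs to import the two environment classes, along these lines:

from babyrobot.envs.baby_robot_env_v0 import BabyRobotEnv_v0
from babyrobot.envs.baby_robot_env_v1 import BabyRobotEnv_v1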

We’ve now defined a Python package that can be uploaded to a repository, such as PyPi, to allow easy sharing of your new creation. Additionally, with this structure in place, we’re now able to import our new environment and create it using the ‘gym.make’ method, as we did previously for CartPole:

import gym
import babyrobot

# create an instance of our custom environment
env = gym.make('BabyRobotEnv-v1')

Note that the name used to specify the environment is the one that was used to register it, not the class name. So, in this case, although the class is called ‘BabyRobotEnv_v1’, the registered name is actually ‘BabyRobotEnv-v1’.

Cloning the Github repository

To make it easier to examine the directory structure described above, it can be recreated by cloning the Github repository. The steps to do this are as follows:

1. Get the code and move to the newly created directory:

git clone https://github.com/WhatIThinkAbout/BabyRobotGym.git
cd BabyRobotGym
  • this directory contains the files and folder structure that we’ve defined above (plus a few extra ones that we’ll look at in part 2).

2. Create a Conda environment and install the required packages:

To be able to run our environment we need to have a few other packages installed, most notably ‘Gym’ itself. To make it easy to set up the environment, the GitHub repo contains a couple of ‘.yml’ files that list the required packages.

To use these to create a Conda environment and install the packages, do the following (choose the one appropriate for your operating system):

On Unix:

conda env create -f environment_unix.yml

On Windows:

conda env create -f environment_windows.yml

3. Activate the environment:

We’ve created the environment with all our required packages, so now it’s just a case of activating it, as follows:

conda activate BabyRobotGym

(when you’re finished playing with this environment run “conda deactivate” to get back out)

4. Run the notebook

Everything should now be in place to run our custom Gym environment. To test this we can run the sample Jupyter Notebook ‘baby_robot_gym_test.ipynb’ that’s included in the repository. This will load the ‘BabyRobotEnv-v1’ environment and test it using the Stable Baselines environment checker.

To start this in a browser, just type:

jupyter notebook baby_robot_gym_test.ipynb

Or else just open this file in VS Code and make sure ‘BabyRobotGym’ is selected as the kernel. This should create the ‘BabyRobotEnv-v1’ environment, test it with the Stable Baselines environment checker and then run the environment until it completes, which happens in a single step, since we haven’t yet written the ‘step’ function!

Although the current version of the custom environment satisfies the requirements of the Gym interface, and has the required functions to pass the environment checker tests, it doesn’t yet do anything. We want Baby Robot to be able to move around in his environment and for this we’re going to need him to be able to take some actions.

Since Baby Robot will be operating in a simple Grid World environment (see figure 2, above) the actions he can take will be limited to moving North, South, East or West. Additionally we want him to be able to stay in the same place, if this would be the optimal action. So in total we have 5 possible actions (as we’ve already seen in the action space).

This can be described using a Python integer enumeration:
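
A sketch of such an enumeration is shown below (the exact names and ordering of the actions are a choice; only the total of 5 values needs to match the action space):

from enum import IntEnum

class Actions(IntEnum):
    Stay  = 0
    North = 1
    East  = 2
    South = 3
    West  = 4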

To simplify the code we can inherit from our previous ‘BabyRobotEnv_v1’ class. This gives us all of the previous functionality and behaviour, which we can then extend to add the new parts that relate to actions. This is shown below:
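
A sketch of this extended class, using the ‘Actions’ enumeration above and matching the behaviour described in the points below, might look as follows:

import numpy as np

class BabyRobotEnv_v2(BabyRobotEnv_v1):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # the start and end positions in the grid
        self.start = kwargs.get('start', [0, 0])
        self.end = kwargs.get('end', [self.width - 1, self.height - 1])

        # Baby Robot's initial position defaults to the grid's start position
        self.initial_pos = kwargs.get('initial_pos', self.start)
        self.x, self.y = self.initial_pos

    def take_action(self, action):
        ''' apply the supplied action and keep Baby Robot on the grid '''
        if   action == Actions.North: self.y -= 1
        elif action == Actions.South: self.y += 1
        elif action == Actions.West:  self.x -= 1
        elif action == Actions.East:  self.x += 1

        # make sure the new position is still inside the grid
        self.x = np.clip(self.x, 0, self.width - 1)
        self.y = np.clip(self.y, 0, self.height - 1)

    def step(self, action):
        # apply the action to get Baby Robot's new position
        self.take_action(action)
        obs = np.array([self.x, self.y])

        # a reward of -1 for each move, unless the end position has been reached
        done = (self.x == self.end[0]) and (self.y == self.end[1])
        reward = 0 if done else -1

        return obs, reward, done, {}

    def reset(self):
        # put Baby Robot back at his initial position
        self.x, self.y = self.initial_pos
        return np.array([self.x, self.y])

    def render(self, mode='human', action=0, reward=0):
        # simple text render: show the action taken, the new position and the reward
        print(f"{Actions(action).name}: ({self.x},{self.y}) reward = {reward}")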

The new functionality, that’s been added to the class, does the following:

  • in the ‘__init__’ function keyword arguments can be supplied that specify the start and end positions in the environment and Baby Robot’s starting position (which by default is set to the grid’s start position).
  • the ‘take_action’ function simply updates Baby Robot’s current position by applying the supplied action and then checks that the new position is valid (to stop him going off the grid).
  • the ‘step’ function applies the current action and then gets the new observation and reward, which are then returned to the caller. By default a reward of -1 is returned for each move, unless Baby Robot has reached the end position, in which case the reward is set to zero and the ‘done’ flag is set to true.
  • the ‘render’ function prints out the current position and reward.

So, finally, we can now take actions and move around from one cell to the next. We can then use a modified version of Listing 1 above (changing from CartPole to our latest BabyRobotEnv_v2 environment) to select random actions and move around the grid until Baby Robot reaches the cell that has been specified as the exit of the grid (which by default is cell (2,2)).

The test framework for our new environment is shown below:
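
A sketch of this test loop, creating the environment directly and choosing random actions until the ‘done’ flag is set, is shown below:

# create an instance of the new environment and put it into its starting state
env = BabyRobotEnv_v2()
obs = env.reset()

done = False
while not done:
    # choose a random action and take a step in the environment
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)

    # show Baby Robot's new position and the reward received
    env.render(action=action, reward=reward)

env.close()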

When we run this, we get an output similar to the one shown below.

Figure 4: A sample path through the grid, from the start cell (0,0) to the exit (2,2).

In this case, the path through the grid is very short, and moves from the start square (0,0) to the exit (2,2) in only a few steps. Since actions are chosen at random the path will typically be much longer. Note also that each step receives a reward of -1, until the exit is reached. So the longer it takes Baby Robot to reach the exit, the more negative the return value.

Technically we’ve already created the render function; it’s just not very exciting! As shown in Figure 4, all we’re getting are simple text messages that describe the action, position and reward. What we really want is a graphical representation of the environment, showing Baby Robot moving around the grid world.

As described above, the collection of environments in the Gym library perform their rendering, to show the current state of the environment, either by generating a text based representation or by creating an array containing an image.

Text based representations provide a quick way to render the environment in terminal based applications. They’re ideal when you only need a simple overview of the current state.

Images on the other hand give a very detailed picture of the current state and are perfect for creating videos of an episode, to display after the episode has completed.

While both of these representations are useful, neither is particularly suited to creating real-time, detailed, views of the environment’s state when working in Jupyter Notebooks. When Baby Robot moves around a grid level we want to actually see him moving, rather than just getting a text message describing his position, or watching a simple text drawing, with an ‘X’ moving over a grid of dots.

Additionally we want to watch this happening as the episode unfolds, rather than only being able to watch it back afterwards, or see it in a flickering display in real-time. In short, we want to render using a different method to text characters or image arrays. We can achieve this by drawing to an HTML5 Canvas, using the excellent ipycanvas library, and we’ll cover this fully in Part 2.

OpenAI Gym environments are the standard method for testing Reinforcement Learning algorithms. The base collection comes with a large set of varied and challenging problems. However, in many cases you may want to define your own, custom, environment. By implementing the structure and interface of the Gym environment it’s easy to create such an environment, that will slot seamlessly into any application that also uses the Gym interface.

In summary, the main steps to create a custom Gym environment are as follows:

  • Create a class that inherits from the gym.Env base class.
  • Implement the ‘reset’, ‘step’ and ‘render’ functions (and possibly the ‘close’ function if resources need to be tidied up).
  • Define the action space, to specify the number and type of actions that the environment allows.
  • Define the observation space, to describe the information that is supplied to the agent on each step and that sets the boundaries for movement within the environment.
  • Organise the directory structure and add ‘__init__.py’ and ‘setup.py’ files to match the Gym specification and to make the environment compatible with the Gym framework.

Following these steps will give you a bare-bones framework, from which you can start adding your own custom features, to tailor the environment to your own specific problem.

In our case, we want to create a Grid World environment that Baby Robot can explore. Additionally, we want to be able to graphically view this environment and watch Baby Robot as he moves around it. In part 2 we’ll see how this can be achieved.

References

1. The Gym library

2. Stable Baselines Environment Checker

3. A good YouTube video on custom Gym environments with Stable Baselines

And the complete series of Baby Robot’s guide to Reinforcement Learning can be found here

