
Applied Reinforcement Learning VI: Deep Deterministic Policy Gradients (DDPG) for Continuous Control | by Javier Martínez Ojeda | Mar, 2023



Photo by Eyosias G on Unsplash

The DDPG algorithm, first presented at ICLR 2016 by Lillicrap et al. [1], was a significant breakthrough among Deep Reinforcement Learning algorithms for continuous control: it extends the ideas of DQN [2] (which only works with discrete actions) to continuous action spaces, achieves very good results, and is easy to implement (see [1]).

Like the NAF algorithm [3] presented in the previous article, DDPG works with continuous action spaces and continuous state spaces, making it an equally valid choice for continuous control tasks in fields such as Robotics or Autonomous Driving.

The DDPG algorithm is an Actor-Critic algorithm which, as its name suggests, is composed of two neural networks: the Actor and the Critic. The Actor is in charge of choosing the best action, while the Critic evaluates how good the chosen action was and informs the Actor how to improve it.

The Actor is trained by applying policy gradient, while the Critic is trained by calculating the Q-Function. For this reason, the DDPG algorithm tries to learn an approximator to the optimal Q-function (Critic) and an approximator to the optimal policy (Actor) at the same time.

Actor-Critic schema. Image extracted from [4]

The optimal Q-Function Q*(s, a) gives the expected return for being in state s, taking action a, and then acting according to the optimal policy. On the other hand, the optimal policy 𝜇*(s) returns the action that maximizes the expected return starting from state s. According to these two definitions, the optimal action in a given state (i.e. the output of the optimal policy for that state) can be obtained by taking the argmax of Q*(s, a) over the actions, as shown below.

Q-Function — Policy relation. Image by author

The problem is that, for continuous action spaces, obtaining the action a that maximizes Q is not easy: it is impossible to evaluate Q for every possible value of a and keep the highest result (which is the solution for discrete action spaces), since a continuous action space contains infinitely many values.

As a solution to this, and assuming that the action space is continuous and that the Q-Function is differentiable with respect to the action, the DDPG algorithm approximates max_a Q(s, a) by Q(s, 𝜇(s)), where 𝜇(s) (a deterministic policy) can be optimized by gradient ascent.
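Written out in the notation of [5], the approximation described above is

\max_{a} Q(s, a) \approx Q\big(s, \mu(s)\big),

so that the maximization over a continuous action is replaced by a single evaluation of the deterministic policy 𝜇(s).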

In simple terms, DDPG learns an approximator to the optimal Q-Function in order to obtain the action that maximizes it. Since the action space is continuous, the Q-Function cannot be evaluated for every possible action, so DDPG also learns an approximator to the optimal policy in order to directly obtain the action that maximizes the Q-Function.

The following sections explain how the algorithm learns both the approximator to the optimal Q-Function and the approximator to the optimal policy.

Mean-Squared Bellman Error Function

The Q-Function is learned starting from the Bellman equation, introduced in the first article of this series. Since in the DDPG algorithm the Q-Function is not computed directly but approximated by a neural network denoted Qϕ(s, a), instead of the Bellman equation a loss function called the Mean-Squared Bellman Error (MSBE) is used. This function, shown in Figure 1, indicates how well the approximator Qϕ(s, a) satisfies the Bellman equation.

Figure 1. Mean-Squared Bellman Error (MSBE). Image extracted from [5]
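Following the notation of [5], where D is the Replay Buffer, γ the discount factor and d the done flag, the MSBE of Figure 1 can be written as

L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \left[ \Big( Q_\phi(s,a) - \big( r + \gamma (1-d) \max_{a'} Q_\phi(s',a') \big) \Big)^2 \right]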

The goal of DDPG is to minimize this error function, so that the approximator to the Q-Function satisfies the Bellman equation and therefore approximates the optimal Q-Function.

Replay Buffer

The data required to minimize the MSBE function (i.e. to train the neural network that approximates Q*(s, a)) is drawn from a Replay Buffer, where the experiences collected during training are stored. This Replay Buffer is represented in Figure 1 as D, and it supplies the data needed to compute the loss: state s, action a, reward r, next state s’ and done flag d. If you are not familiar with the Replay Buffer, it was explained in the articles on the DQN and NAF algorithms, and implemented and applied in the article on the implementation of DQN.
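As an illustration, a minimal version of such a buffer could look like the following sketch in Python. This is a generic, hypothetical helper (the capacity and the uniform sampling are illustrative choices), not the implementation used in the earlier articles.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO experience replay buffer (illustrative sketch)."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Each experience is the tuple (s, a, r, s', d) needed by the MSBE loss.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniformly sample a batch of stored experiences for training.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```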

Target Neural Network

Minimizing the MSBE function consists of making the approximator to the Q-Function, Qϕ(s, a), as close as possible to the other term of the equation, the target, which originally has the following form:

Figure 2. Target. Extracted from Figure 1 [5]

As can be seen, the target depends on the same parameters ϕ that are being optimized, which makes the minimization unstable. As a solution, another neural network is used that holds the parameters of the main network but with a certain delay. This second network is called the target network, Qϕtarg(s, a) (see Figure 3), and its parameters are denoted ϕtarg.

However, as can be seen in Figure 2, when Qϕ(s, a) is replaced by Qϕtarg(s, a) it is still necessary to obtain the action that maximizes the output of this target network, which, as explained above, is difficult in continuous action space environments. This is solved by making use of a target policy network, 𝜇ϕtarg(s) (see Figure 3), which approximates the action that maximizes the output of the target network. In other words, the target policy network 𝜇ϕtarg(s) solves the problem of continuous actions for Qϕtarg(s, a), just as 𝜇ϕ(s) does for Qϕ(s, a).

Minimize the modified MSBE Function

With all this, the DDPG algorithm learns the optimal Q-Function by minimizing the modified MSBE function in Figure 3 via gradient descent.

Figure 3. Modified Mean-Squared Bellman Error. Extracted from [5] and edited by author
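In the notation of [5], and writing θtarg for the parameters of the target policy to distinguish them from the Critic target's ϕtarg, the modified MSBE of Figure 3 reads

L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \left[ \Big( Q_\phi(s,a) - \big( r + \gamma (1-d)\, Q_{\phi_{\text{targ}}}\big(s', \mu_{\theta_{\text{targ}}}(s')\big) \big) \Big)^2 \right]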

Given that the action space is continuous, and that the Q-Function is differentiable with respect to the action, DDPG learns the deterministic policy 𝜇ϕ(s) that maximizes Qϕ(s, a) by applying gradient ascent on the function below with respect to the deterministic policy’s parameters:

Figure 4. Function to optimize for the deterministic policy learning. Extracted from [5]
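Writing θ for the Actor's parameters, the function in Figure 4 is, in the notation of [5],

\max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D}} \left[ Q_\phi\big(s, \mu_\theta(s)\big) \right]

In practice, this gradient ascent is usually implemented by minimizing the negative of this expectation with a standard gradient-based optimizer.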

The flow of the DDPG algorithm is presented below, following the pseudocode extracted from [1]. DDPG follows the same steps as other Q-Learning algorithms with function approximation, such as the DQN or NAF algorithms.

DDPG Algorithm Pseudocode. Extracted from [1]

1. Initialize Critic, Critic Target, Actor and Actor Target networks

Initialize the Actor and Critic neural networks to be used during training.

  • The Critic network, Qϕ(s, a), acts as an approximator to the Q-Function.
  • The Actor network, 𝜇ϕ(s), acts as an approximator to the deterministic policy and is used to obtain the action that maximizes the output of the Critic network.

Once these are initialized, the target networks are created with the same architecture as their corresponding main networks, and the weights of the main networks are copied into the target networks, as in the sketch after the list below.

  • The Critic Target network, Qϕtarg(s, a), acts as a delayed Critic network, so that the target does not depend on the same parameters to be optimized, as explained before.
  • The Actor Target network, 𝜇ϕtarg(s), acts as a delayed Actor network, and it is used to obtain the action that maximizes the output of the Critic Target network.
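A minimal PyTorch sketch of this initialization is shown below. The layer sizes, activations and dimensions are illustrative assumptions, not the architectures used in [1] or in the previous articles.

```python
import copy
import torch.nn as nn

# Illustrative dimensions; a real environment defines its own state/action sizes.
state_dim, action_dim, hidden = 8, 2, 256

# Critic Q_phi(s, a): takes a concatenated state-action pair, outputs a scalar value.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, 1),
)

# Actor mu_phi(s): takes a state, outputs an action bounded in [-1, 1] by tanh.
actor = nn.Sequential(
    nn.Linear(state_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, action_dim), nn.Tanh(),
)

# Target networks start as exact copies of the main networks.
critic_target = copy.deepcopy(critic)
actor_target = copy.deepcopy(actor)
```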

2. Initialize Replay Buffer

The Replay Buffer to be used for the training is initialized empty.

For each timestep in an episode, the agent performs the following steps:

3. Select action and apply noise

The best action for the current state is obtained from the output of the Actor neural network, which approximates the deterministic policy 𝜇ϕ(s). Noise drawn from an Ornstein-Uhlenbeck process [6], or from an uncorrelated, mean-zero Gaussian distribution [7], is then added to the selected action to encourage exploration.
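A minimal sketch of this step for the Gaussian-noise variant is shown below, assuming a PyTorch Actor like the one sketched in step 1; noise_std and act_limit are illustrative values, and the Ornstein-Uhlenbeck variant used in [1] would additionally require keeping the noise state between calls.

```python
import torch

def select_action(actor, state, noise_std=0.1, act_limit=1.0):
    """Deterministic action from the Actor plus mean-zero Gaussian exploration noise."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32))
    # Add uncorrelated Gaussian noise and clip to the valid action range.
    action += noise_std * torch.randn_like(action)
    return torch.clamp(action, -act_limit, act_limit).numpy()
```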

4. Perform the action and store observations in Replay Buffer

The noisy action is performed in the environment. After that, the environment returns a reward indicating how good the action taken was, the new state reached after performing the action, and a boolean indicating whether a terminal state has been reached.

This information, together with the current state and the action taken, is stored in the Replay Buffer, to be used later to optimize the Critic and Actor neural networks.

5. Sample batch of experiences and train Actor and Critic networks

This step is only performed when the Replay Buffer has enough experiences to fill a batch. Once this requirement is met, a batch of experiences is extracted from the Replay Buffer for use in training.

With this batch of experiences (see the sketch after this list):

  • The target is calculated and the output of the Critic network (the approximator of the Q-Function) is obtained, in order to then apply gradient descent on the MSBE loss, as shown in Figure 3. This step optimizes the approximator to the Q-Function, Qϕ(s, a).
  • Gradient ascent is performed on the function shown in Figure 4, thus optimizing the approximator to the deterministic policy, 𝜇ϕ(s).
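A condensed PyTorch sketch of both updates is given below. It assumes the networks from step 1, a batch of tensors from the buffer in which rewards and dones are column tensors of shape (batch, 1), and standard optimizers (e.g. Adam) created beforehand; the discount factor is an illustrative value.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (illustrative value)

def update(batch, actor, critic, actor_target, critic_target, actor_opt, critic_opt):
    """One DDPG training step on a sampled batch (sketch of the losses above)."""
    states, actions, rewards, next_states, dones = batch

    # Critic update: gradient descent on the modified MSBE (Figure 3).
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_next = critic_target(torch.cat([next_states, next_actions], dim=-1))
        target = rewards + GAMMA * (1.0 - dones) * q_next
    q = critic(torch.cat([states, actions], dim=-1))
    critic_loss = F.mse_loss(q, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: gradient ascent on Q(s, mu(s)) (Figure 4), performed by
    # minimizing the negative of the Critic's output.
    actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```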

6. Softly update the Target networks

Both the Actor Target and the Critic Target networks are updated every time the Actor and Critic networks are updated, by polyak averaging, as shown in the figure below.

Figure 5. Polyak Averaging. Extracted from [1]

Tau τ, the parameter that weights each term in the polyak averaging, is a hyperparameter of the algorithm. With the update rule used in [1] (shown in Figure 5), it usually takes values close to 0 (for example, τ = 0.001), so that the target networks change slowly and training remains stable.
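In code, this soft update can be sketched as follows, applying the rule from Figure 5 to every pair of parameters; τ = 0.001 is the illustrative value mentioned above.

```python
import torch

def soft_update(target_net, main_net, tau=0.001):
    """Polyak averaging: target <- tau * main + (1 - tau) * target."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), main_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```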

The DDPG algorithm proposed by Lillicrap et al. achieves very good results in most of the continuous control environments in which it was evaluated, as shown in the paper that presented it [1], demonstrating its ability to learn tasks of varying complexity in a continuous setting.

Therefore, this algorithm is still used today to enable an agent to learn an optimal policy for a complex task in a continuous environment, such as control tasks for manipulator robots, or obstacle avoidance for autonomous vehicles.

[1] LILLICRAP, Timothy P., et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[2] MNIH, Volodymyr, et al. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[3] GU, Shixiang, et al. Continuous deep Q-learning with model-based acceleration. In: International Conference on Machine Learning. PMLR, 2016. pp. 2829–2838.

[4] SUTTON, Richard S.; BARTO, Andrew G. Reinforcement learning: An introduction. MIT press, 2018.

[5] OpenAI Spinning Up — Deep Deterministic Policy Gradient
https://spinningup.openai.com/en/latest/algorithms/ddpg.html

[6] Ornstein–Uhlenbeck process
https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process

[7] Normal / Gaussian Distribution
https://en.wikipedia.org/wiki/Normal_distribution

