Deep Deterministic Policy Gradients Explained
By Wouter van Heeswijk, PhD


Photo by Jonathan Ford on Unsplash

This article introduces Deep Deterministic Policy Gradient (DDPG) — a Reinforcement Learning algorithm suitable for deterministic policies applied in continuous action spaces. By combining the actor-critic paradigm with deep neural networks, continuous action spaces can be tackled without resorting to stochastic policies.

Especially for continuous control tasks in which randomness in actions is undesirable — e.g., robotics or navigation — DDPG might be just the algorithm you need.

DDPG displays elements of policy-based methods as well as value-based methods, placing it in a hybrid class of algorithms.

Policy gradient methods like REINFORCE, TRPO, and PPO use stochastic policies π:a~P(a|s) to explore and compare actions. These methods draw actions from a differentiable distribution P_θ(a|s), which enables the computation of gradients with respect to θ. The inherent randomness in these decisions could be undesirable in real-world applications. DDPG eliminates this randomness, yielding simpler and more predictable policies.

Value-based methods like SARSA, Monte Carlo Learning, and Deep Q-Learning are based on deterministic policies that always return a single action given an input state. However, these methods assume a finite number of actions, which makes evaluating their value functions and selecting the most rewarding actions difficult for continuous action spaces with infinitely many actions.

As you may have guessed, Deep Deterministic Policy Gradients fill the gap, incorporating elements of both Deep Q-Learning and policy gradient methods. DDPG effectively handles continuous action spaces and has been successfully applied to robotic control and game-playing tasks.

If you’re unfamiliar with policy gradient algorithms (in particular REINFORCE) or value-based methods (in particular DQN), it’s recommended to learn about them before exploring DDPG.

DDPG is remarkably close to Deep Q-Learning, sharing both notation and concepts. Let’s take a quick tour.

DQN for continuous action spaces?

In vanilla (i.e., tabular) Q-learning, we use Q-values to approximate the Bellman value function V. Q-values are defined for every state-action pair and thus denoted by Q(s,a). Tabular Q-learning requires a lookup table containing a Q-value for each pair, thus necessitating a discrete state space and a discrete action space.

Time to put the ‘deep’ in Deep Reinforcement Learning. Compared to lookup tables, introducing a neural network has two advantages: (i) it provides a single general expression for the entire state space and (ii) by extension, it can also handle continuous state spaces.

Of course, we need to deal with continuous action spaces; we thus cannot output Q-values for every action. Instead, we provide a single action as input and compute the Q-value for the state-action pair (a procedure also known as naive DQN). In mathematical terms, we can represent the network as Q:(s,a)→Q(s,a), i.e., outputting a single Q-value for a given state-action pair.

The corresponding critic network looks as follows.

Example of a critic network Q:(s,a) in DDPG. The network takes both the state vector and action vector as input, and outputs a single Q-value [image by author]
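As an illustration, a critic with this structure could be defined in Keras roughly as follows (layer sizes and activations are arbitrary choices, not prescribed by DDPG):

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_critic(state_dim: int, action_dim: int) -> tf.keras.Model:
    # Two separate inputs: the state vector and the action vector
    state_in = layers.Input(shape=(state_dim,))
    action_in = layers.Input(shape=(action_dim,))

    # Concatenate state and action and pass them through hidden layers
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)

    # Single linear output: the Q-value for this (s, a) pair
    q_value = layers.Dense(1)(x)
    return tf.keras.Model(inputs=[state_in, action_in], outputs=q_value)
```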

Although offering generalization, the neural network introduces some stability issues. Because the network is a single representation for all states, each update also affects all Q-values. As observation tuples (s,a,r,s’) are collected sequentially, there tends to be a high temporal correlation between them that makes overfitting all too likely. Without going into too much detail here, the following three techniques are needed to properly train the value network (a minimal replay buffer sketch follows the list):

  • Experience replay: Sample observations (s,a,r,s’) from the experience buffer, breaking the correlation between subsequently collected tuples.
  • Batch learning: Train value network with batches of observations, making for more reliable and impactful updates.
  • Target networks: Use a different network to compute Q(s’,a’) than Q(s,a), reducing correlation between expectation and observation.
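A minimal replay buffer might look like the sketch below (capacity and batch size are illustrative):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer storing (s, a, r, s', done) transitions."""

    def __init__(self, capacity: int = 100_000):
        # Oldest transitions are discarded once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks the temporal correlation
        # between consecutively collected transitions
        return random.sample(self.buffer, batch_size)
```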

Critic network updates

Now that the basics are refreshed, let’s tie the aforementioned concepts to DDPG. We define a critic network Q_ϕ as detailed before, parameterized by ϕ (representing the network weights).

We set out with the loss function that we aim to minimize, which should look familiar to those experienced with Q-learning:
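In the notation used here, with γ the discount factor and d indicating a terminal state, the loss can be written as

L(ϕ) = E_{(s,a,r,s’,d)~D} [ ( Q_ϕ(s,a) − ( r + γ(1−d)·Q_{ϕ_targ}( s’, μ_{θ_targ}(s’) ) ) )² ]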

Compared to DQN, the critical distinction is that for the action corresponding to s’ — instead of maximizing over the action space — we determine the action a’ through a target actor network μ_{θ_targ} (more on that later). After that sidestep, we update the critic network as usual.

Aside from updating the main critic network, we must also update the target critic network. In Deep Q-Learning this is often a periodic copy of the main value network (e.g., copied once every 100 episodes). In DDPG, it is common instead to use a lagging target network updated with polyak averaging, making the target network trail behind the main value network:
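In formula form, with ρ the polyak coefficient:

ϕ_targ ← ρ·ϕ_targ + (1−ρ)·ϕ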

With ρ typically being close to 1, the target network adapts very slowly and gradually over time, which improves training stability.

In DDPG, actor and critic are closely intertwined in an off-policy fashion. We first explore the off-policy nature of the algorithm, before moving to action generation and actor network updates.

Off-policy training

In pure policy gradient methods, we directly update a policy μ_θ (parameterized by θ) to maximize expected rewards, without resorting to explicit value functions to capture these rewards. DDPG is a hybrid that also uses Q-values, but from an actor perspective, maximizing the objective J(θ) looks similar at face value:
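Concretely, the actor seeks to maximize the expected Q-value of its own (deterministic) actions:

J(θ) = E_{s~D} [ Q_ϕ( s, μ_θ(s) ) ]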

However, a closer look at the expectation reveals that DDPG is an off-policy method, whereas typical actor-critic methods are on-policy. Most actor-critic models maximize an expectation E_{τ~π_θ}, with τ being the state-action trajectory generated by policy π_θ. In contrast, DDPG takes an expectation over sample states drawn from the experience buffer (E_{s~D}). As DDPG optimizes a policy using experiences generated with different policies, it is an off-policy algorithm.

The role of the replay buffer in this off-policy context needs some attention. Why can we re-use old experiences, and why should we?

First, let’s explore why the buffer can include experiences obtained with policies other than the present one. As the policy is updated over time, the replay buffer holds experiences stemming from outdated policies. Because the Q-learning target depends only on the transition (s,a,r,s’) itself, it does not matter which policy generated it.

Second, the reason the replay buffer should contain a diverse range of experiences is that we deploy a deterministic policy. If the algorithm were on-policy, we would likely have limited exploration. By drawing upon past experiences, we also train on observations that are unlikely to be encountered under the present policy.

Example of an actor network μ_θ(s) in DDPG. The network takes the state vector as input, and outputs a deterministic action μ(s). During training, separate random noise ϵ is typically added [image by author]
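For illustration, an actor of this shape could be built in Keras as follows (the tanh output and the scaling to a symmetric action bound are common conventions, assumed here):

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_actor(state_dim: int, action_dim: int, action_bound: float) -> tf.keras.Model:
    # The actor maps a state vector to a single deterministic action vector
    state_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, activation="relu")(state_in)
    x = layers.Dense(256, activation="relu")(x)

    # tanh squashes the raw output to [-1, 1]; scale it to the action range
    raw_action = layers.Dense(action_dim, activation="tanh")(x)
    action_out = layers.Lambda(lambda t: action_bound * t)(raw_action)
    return tf.keras.Model(inputs=state_in, outputs=action_out)
```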

Action exploration

What about the exploration mechanism that policy gradient methods are so famous for? After all, we now deploy a deterministic policy rather than a stochastic one, right? DDPG resolves this by simply adding some noise ϵ during training, which is removed when deploying the policy.

Early implementations of DDPG used rather complicated noise constructs (e.g., time-correlated Ornstein-Uhlenbeck noise), but later empirical results suggested that plain Gaussian noise ϵ~N(0,σ^2) works equally well. The noise may be gradually reduced over time, but is not a trainable component σ_θ such as in stochastic policies. A final note is that we may clip the action range. Evidently, some tuning effort is involved in the exploration.

In summary, the actor generates actions as follows. It takes the state as input, outputs a deterministic value, and adds some random noise:
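In formula form, with [a_low, a_high] the admissible action range:

a = clip( μ_θ(s) + ε, a_low, a_high ),   ε ~ N(0, σ²)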

Actor network updates

A final note on the policy update, which is not necessarily trivial. We update the actor network parameters θ based on the Q-values returned by the critic network (parameterized by ϕ). Thus, we keep Q-values constant — that is, we don’t update ϕ in this step — and try to maximize the expected reward by changing the action. This means we assume that the critic network is differentiable w.r.t. the action, such that we can update the action in a direction that maximizes the Q-value:
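This yields the sampled deterministic policy gradient:

∇_θ J(θ) ≈ E_{s~D} [ ∇_a Q_ϕ(s,a) |_{a=μ_θ(s)} · ∇_θ μ_θ(s) ]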

Although the second gradient ∇_θ is often omitted for readability, it offers some clarification. We train the actor network to output better actions, which in turn improves the obtained Q-values. If you wanted, you could detail the procedure by applying the chain rule.

The actor target network is updated with polyak averaging, in the same vein as the critic target network.

We have an actor, we have a critic, so nothing stops us from completing the algorithm now!

Outline of the DDPG algorithm [image by author, initial outline generated with ChatGPT]

Let’s go through the procedure step by step.

Initialization [line 1–4]

DDPG initialization [image by author]

We set out with four networks as detailed below:

Actor network μ_θ

  • Parameterized by θ
  • Outputs deterministic action based on input s

Actor target network μ_{θ_targ}

  • Parameterized by θ_targ
  • Provides action for s’ when training critic network

Critic network Q_ϕ(s,a)

  • Parameterized by ϕ
  • Outputs Q-value Q(s,a) (expectation) based on input (s,a)

Critic target network Q_{ϕ_targ}

  • Parameterized by ϕ_targ
  • Outputs Q-value Q(s’,a’) (target) when training the critic network

We start with an empty replay buffer D. Unlike on-policy methods, we do not empty the buffer after updating the policy, as we re-use older transitions.

Finally, we set the polyak coefficient ρ used to update the target networks. For simplicity, we assume it is identical for both target networks. Recall that ρ should be set close to 1 (e.g., something like 0.995), such that the networks are updated slowly and targets remain rather stable over time.
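Putting the pieces together, initialization could look like this, re-using the builder functions and buffer class sketched earlier (dimensions and values are illustrative):

```python
state_dim, action_dim, action_bound = 8, 2, 1.0  # illustrative dimensions

# Main networks
actor = build_actor(state_dim, action_dim, action_bound)
critic = build_critic(state_dim, action_dim)

# Target networks start as exact copies of the main networks
target_actor = build_actor(state_dim, action_dim, action_bound)
target_critic = build_critic(state_dim, action_dim)
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())

# Empty replay buffer and polyak coefficient
buffer = ReplayBuffer()
rho = 0.995
```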

Data collection [line 9–11]

DDPG data collection [image by author]

Actions are generated using the actor network, which outputs deterministic actions. To increase exploration, noise is added to these actions. The resulting observation tuples are stored in the replay buffer.
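A data collection loop along these lines might look as follows, assuming a classic Gym-style environment `env` plus the `actor` and `buffer` from before (noise scale and step count are illustrative):

```python
import numpy as np

sigma = 0.1         # exploration noise scale (illustrative)
num_steps = 10_000  # number of environment steps to collect (illustrative)

state = env.reset()  # classic Gym API: reset() returns the initial observation
for _ in range(num_steps):
    # Deterministic action from the actor, plus Gaussian exploration noise,
    # clipped to the environment's valid action range
    action = actor(state[np.newaxis, :]).numpy()[0]
    action = np.clip(action + np.random.normal(0.0, sigma, size=action.shape),
                     env.action_space.low, env.action_space.high)

    next_state, reward, done, _ = env.step(action)
    buffer.store(state, action, reward, next_state, done)

    state = env.reset() if done else next_state
```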

Updating the actor and critic network [line 12–17]

DDPG main network updates [image by author]

A random mini-batch B⊆D is sampled from the replay buffer (including observations stemming from older policies).

To update the critic, we minimize the squared error between the target value (the observed reward plus the discounted Q-value obtained with the target networks) and the current estimate (obtained with the main critic network).

To update the actor, we compute the sample policy gradient, keeping the critic parameters ϕ fixed. In a neural network setting, this is typically implemented through a pseudo-loss: the negative mean Q-value of the generated actions, which the optimizer then minimizes.

The training procedure might be clarified by a Keras-style snippet.
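The sketch below shows one possible rendering of both updates, assuming `actor`, `critic`, `target_actor`, and `target_critic` are the Keras models defined earlier, and that the mini-batch tensors `rewards` and `dones` have shape (batch_size, 1); hyperparameters are illustrative:

```python
import tensorflow as tf

gamma = 0.99  # discount factor (illustrative)
critic_optimizer = tf.keras.optimizers.Adam(1e-3)
actor_optimizer = tf.keras.optimizers.Adam(1e-4)


def train_step(states, actions, rewards, next_states, dones):
    # --- Critic update: squared error between target and current estimate ---
    # Target actions come from the target actor; target Q-values from the target critic
    target_actions = target_actor(next_states)
    targets = rewards + gamma * (1.0 - dones) * target_critic([next_states, target_actions])

    with tf.GradientTape() as tape:
        q_values = critic([states, actions])
        critic_loss = tf.reduce_mean(tf.square(targets - q_values))
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))

    # --- Actor update: maximize Q(s, mu(s)) by minimizing its negative mean ---
    # Gradients flow through the (fixed) critic into the actor's parameters only
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))
```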

Updating the target networks [line 18–19]

DDPG target network updates [image by author]

Both the actor target network and the critic target network are updated using polyak averaging, with their weights moving a bit closer to the updated main networks.
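As a sketch, this update can be implemented with the Keras weight accessors, assuming the model objects from before and ρ = 0.995:

```python
def polyak_update(target_model, main_model, rho=0.995):
    # Move each target weight a small step towards the corresponding main weight
    new_weights = [rho * t + (1.0 - rho) * m
                   for t, m in zip(target_model.get_weights(), main_model.get_weights())]
    target_model.set_weights(new_weights)


# Applied to both target networks after each training step
polyak_update(target_actor, actor)
polyak_update(target_critic, critic)
```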

Returning the trained network [line 23]

DDPG trained actor network [image by author]

Although we had to go through some trouble, the resulting policy looks very clean. Unlike in DQN, we don’t perform an explicit maximization over the action space, so we have no need for Q-values anymore [note that we never used Q-values to select actions, only to improve the actor].

We also have no need for the target networks anymore, which were only required to stabilize training and prevent oscillations. Ideally, the main and target networks will have converged, such that μ_θ=μ_{θ_targ} (and Q_ϕ=Q_{ϕ_targ}). That way, we know our policy has truly converged.

Finally, we drop the exploration noise ϵ, which was never an integral aspect of the policy. We are left with an actor network that takes the state as input and outputs a deterministic action, which for many applications is exactly the simplicity we want.

We established that DDPG is a hybrid method, incorporating elements from both policy gradient methods and value-based methods. This goes for all actor-critic methods, though, so what precisely makes DDPG unique?

  • DDPG handles continuous action spaces: The algorithm is specifically designed to handle continuous action spaces, without relying on stochastic policies. Deterministic policies may be easier to learn, and policies without inherent randomness are often preferable in real-world applications.
  • DDPG is off-policy. Unlike common actor-critic algorithms, experiences are drawn from a replay buffer that includes observations from older policies. The off-policy nature is necessary to sufficiently explore (as actions are generated deterministically). It also offers benefits such as greater sample efficiency and enhanced stability.
  • DDPG is conceptually very close to DQN: In essence, DDPG is a variant of DQN that works for continuous action spaces. To circumvent the need to explicitly maximize over all actions — DQN enumerates over the full action space to identify the highest Q(s,a) value — actions are provided by an actor network which is optimized separately.
  • DDPG outputs an actor network: Although close to DQN in terms of training, during deployment we only need the trained actor network. This network takes the state as input and deterministically outputs an action.

Although it might not seem so at first glance, the deterministic nature of DDPG tends to simplify training, being more stable and sample-efficient than its on-policy counterparts. The output is a standalone actor network that deterministically generates actions. Because of these properties, DDPG has become a staple in continuous control tasks.

