
Extending Q-Learning With Dyna-Q for Enhanced Decision-Making


Introduction To Q-Learning

Q-Learning is a crucial model-free algorithm in reinforcement learning, focusing on learning the value, or ‘Q-value’, of actions in specific states. This method excels in environments with unpredictability, as it doesn’t need a predefined model of its surroundings. It adapts to stochastic transitions and varied rewards effectively, making it versatile for scenarios where outcomes are uncertain. This flexibility allows Q-Learning to be a powerful tool in scenarios requiring adaptive decision-making without prior knowledge of the environment’s dynamics.

Learning Process:

 Q-learning works by updating a table of Q-values for each action in each state. It uses the Bellman equation to iteratively update these values based on the observed rewards and its estimation of future rewards. The policy – the strategy of choosing actions – is derived from these Q-values.

  • Q-Value: Represents the expected future rewards that can be obtained by taking a certain action in a given state.
  • Update Rule: Q-values are updated as follows:
    • Q(state, action) ← Q(state, action) + α · [reward + γ · maxₐ Q(next state, a) − Q(state, action)]
    • The learning rate α indicates the importance of new information, and the discount factor γ indicates the importance of future rewards. A short worked example of a single update follows.
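
To make the update concrete, here is a minimal worked example of one update on a tiny Q-Table. The values of α, γ, the table entries, and the reward are illustrative assumptions, not taken from the learner shown later in this article.

import numpy as np

alpha, gamma = 0.1, 0.9            # illustrative learning rate and discount factor
Q = np.zeros((3, 2))               # tiny Q-Table: 3 states x 2 actions
Q[1] = [0.5, 0.2]                  # pretend some values were already learned for state 1

s, a, r, s_prime = 0, 1, 1.0, 1    # one observed transition (state, action, reward, next state)
Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
print(Q[s, a])                     # 0.1 * (1.0 + 0.9 * 0.5 - 0.0) = 0.145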

The code below serves as the training function for the Q-Learner: after each observed transition it applies the Bellman update to the Q-Table, stores the experience, and selects the next action.

import numpy as np
import random as rand

def train_Q(self, s_prime, r):
    # Bellman update for the observed transition (self.s, self.action, s_prime, r)
    self.QTable[self.s, self.action] = (1 - self.alpha) * self.QTable[self.s, self.action] + \
        self.alpha * (r + self.gamma * self.QTable[s_prime, np.argmax(self.QTable[s_prime])])
    self.experiences.append((self.s, self.action, s_prime, r))
    self.num_experiences += 1
    # Choose the next action with the ε-greedy rule explained in the next section
    if rand.random() >= self.random_action_rate:
        action = np.argmax(self.QTable[s_prime, :])      # exploit
    else:
        action = rand.randint(0, self.num_actions - 1)   # explore
    self.s, self.action = s_prime, action
    return action

Exploration vs. Exploitation

 A key aspect of Q-learning is balancing exploration (trying new actions to discover their rewards) and exploitation (using known information to maximize rewards). Algorithms often use strategies like ε-greedy to maintain this balance.

Start by setting a rate for random actions to balance exploration and exploitation, then apply a decay rate to gradually reduce the randomness as the Q-Table accumulates more data. Over time, as more evidence accumulates, the algorithm shifts increasingly towards exploitation.

if rand.random() >= self.random_action_rate:
    action = np.argmax(self.QTable[s_prime, :])       # Exploit: pick the action with the highest Q-value in the new state
else:
    action = rand.randint(0, self.num_actions - 1)    # Explore: pick a random action

# Decay the random-action rate to reduce exploration as the Q-Table gathers more evidence
self.random_action_rate = self.random_action_rate * self.random_action_decay_rate

Introducing Dyna-Q

Dyna-Q, an innovative extension of the traditional Q-Learning algorithm, stands at the forefront of blending real experience with simulated planning. This approach significantly enhances the learning process by integrating actual interactions and simulated experiences, enabling agents to rapidly adapt and make informed decisions in complex environments. By leveraging both direct learning from environmental feedback and insights gained through simulation, Dyna-Q offers a comprehensive and efficient strategy for navigating challenges where real-world data is scarce or expensive to obtain.

Components of Dyna-Q

  1. Q-Learning: Learns directly from real experience.
  2. Model Learning: Learns a model of the environment.
  3. Planning: Uses the model to generate simulated experiences.

Model Learning

  • The model keeps track of the transitions and rewards. For each state-action pair (s, a), the model stores the next state s′ and reward r.
  • When the agent observes a transition (s, a, r, s′), it updates the model, as sketched below.
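
As a minimal illustration, assuming a deterministic environment, such a model can be a dictionary keyed by state-action pairs. This is only a sketch, not the implementation used by the learner later in this article.

# Minimal deterministic model sketch: maps (s, a) -> (next state, reward)
model = {}

def update_model(s, a, r, s_prime):
    model[(s, a)] = (s_prime, r)   # remember the most recent outcome for this pair

update_model(0, 1, 1.0, 2)         # after observing the transition (s=0, a=1, r=1.0, s'=2)
print(model[(0, 1)])               # (2, 1.0)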

Planning with Simulated Experience

  • In each step, after the agent updates its Q-Value from real experience, it also updates Q-Values based on simulated experiences.
  • These experiences are generated using the learned model: for a selected state-action pair (s, a), it predicts the next state and reward, and the Q-value is updated as if this transition had been experienced.
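
The planning loop can then be sketched as replaying random entries from such a model and applying the same update rule used for real transitions. The table size, α, γ, and the number of planning steps below are illustrative assumptions; the learner later in this article achieves the same effect by replaying randomly sampled stored experiences instead of querying a separate model.

import random
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 2))                                  # illustrative 5-state, 2-action Q-Table
model = {(0, 1): (2, 1.0), (2, 0): (3, 0.0)}          # a few previously observed transitions

for _ in range(10):                                   # n planning steps per real step
    (s, a), (s_prime, r) = random.choice(list(model.items()))
    # Same Bellman update as for a real transition, applied to a simulated one
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])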

Algorithm Dyna-Q

  1. Initialize Q-values Q(s, a) and the model Model(s, a) for all state-action pairs.
  2. Loop (for each episode):
    • Initialize state s.
    • Loop (for each step of the episode):
      • Choose action a from state s using a policy derived from Q (e.g., ε-greedy).
      • Take action a; observe reward r and next state s′.
      • Direct Learning: Update the Q-value using the observed transition (s, a, r, s′).
      • Model Learning: Update the model with the transition (s, a, r, s′).
      • Planning: Repeat n times:
        • Randomly select a previously experienced state-action pair (s, a).
        • Use the model to generate the predicted next state s′ and reward r.
        • Update the Q-value using the simulated transition (s, a, r, s′).
      • s ← s′.
  3. End loop.

The following function merges a Dyna-Q planning phase into the Q-Learner above. It lets you specify how many simulated updates to run in each training step, sampling previously observed experiences at random and replaying them. This enhances the overall capability and versatility of the Q-Learner.
def train_DynaQ(self, s_prime, r):
    # Bellman update for the observed (real) transition
    self.QTable[self.s, self.action] = (1 - self.alpha) * self.QTable[self.s, self.action] + \
        self.alpha * (r + self.gamma * self.QTable[s_prime, np.argmax(self.QTable[s_prime])])
    self.experiences.append((self.s, self.action, s_prime, r))
    self.num_experiences += 1

    # Dyna-Q Planning - Start
    if self.dyna_planning_steps > 0:  # number of simulated updates to perform
        idx_array = np.random.randint(0, self.num_experiences, self.dyna_planning_steps)
        for idx in idx_array:  # replay randomly chosen past experiences and update the Q-Table
            s, a, s_p, r_p = self.experiences[idx]
            self.QTable[s, a] = (1 - self.alpha) * self.QTable[s, a] + \
                self.alpha * (r_p + self.gamma * self.QTable[s_p, np.argmax(self.QTable[s_p, :])])
    # Dyna-Q Planning - End

    if rand.random() >= self.random_action_rate:
        action = np.argmax(self.QTable[s_prime, :])      # Exploit: pick the action with the highest Q-value in the new state
    else:
        action = rand.randint(0, self.num_actions - 1)   # Explore: pick a random action

    # Decay the random-action rate to reduce exploration as the Q-Table gathers more evidence
    self.random_action_rate = self.random_action_rate * self.random_action_decay_rate

    self.s, self.action = s_prime, action
    return action
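
For completeness, here is a hedged sketch of how a learner exposing train_DynaQ might be set up and driven on a toy one-dimensional environment. The article does not show the learner's constructor, so the QLearner class, its hyperparameter values, and the toy environment below are illustrative assumptions; the attribute names simply mirror the ones used by train_DynaQ above.

import numpy as np
import random as rand

class QLearner:
    # Hypothetical constructor: attribute names mirror those used by train_DynaQ above
    def __init__(self, num_states, num_actions, alpha=0.2, gamma=0.9,
                 random_action_rate=0.9, random_action_decay_rate=0.999,
                 dyna_planning_steps=20):
        self.QTable = np.zeros((num_states, num_actions))
        self.num_actions = num_actions
        self.alpha, self.gamma = alpha, gamma
        self.random_action_rate = random_action_rate
        self.random_action_decay_rate = random_action_decay_rate
        self.dyna_planning_steps = dyna_planning_steps
        self.experiences, self.num_experiences = [], 0
        self.s, self.action = 0, 0

    train_DynaQ = train_DynaQ  # attach the function defined above as a method

# Toy 1-D corridor: 10 states, move left (0) or right (1), reward only at the right edge
learner = QLearner(num_states=10, num_actions=2)
s = 5
learner.s, learner.action = s, 1
for step in range(200):
    s_prime = max(0, min(9, s + (1 if learner.action == 1 else -1)))
    r = 1.0 if s_prime == 9 else 0.0
    learner.train_DynaQ(s_prime, r)   # updates the Q-Table, plans, and picks the next action
    s = s_prime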

Conclusion

Dyna-Q represents an advancement in our pursuit of designing agents that can learn and adapt in intricate and uncertain environments. By understanding and implementing Dyna-Q, practitioners and enthusiasts in AI and machine learning can devise resilient solutions to a diverse range of practical problems. The purpose of this tutorial was not only to introduce the concepts and algorithms but also to spark ideas for inventive applications and future progress in this captivating area of research.

