
A Reinforcement Learning-Based Inventory Control Policy for Retailers
by Guangrui Xie, Dec 2022



Build a Deep Q Network (DQN) model to optimize the inventory operations for a single retailer

Photo by Don Daskalo on Unsplash

Inventory optimization is an important aspect of supply chain management, concerned with optimizing the inventory operations of a business. It uses mathematical models to answer key questions such as when to place a replenishment order to fulfill customers’ demand for a product and how much quantity to order. The major inventory control policies adopted in the supply chain industry today are classic static policies, in the sense that the decision of either when to order or how much to order is fixed throughout the planning horizon unless the policy is updated. However, such static policies may fall short when demand is highly variable. A dynamic policy that can adaptively adjust the decisions of when and how much to order based not only on the inventory position but also on information related to future demand would be more advantageous. In this article, I will use a small retail store that sells Coke as an example to illustrate how we can utilize a reinforcement learning (RL) technique, Deep Q Network (DQN), to build an inventory control policy that optimizes inventory operations and achieves higher profits than the classic static inventory control policies. The most widely used classic static policies are the following:

  1. (R,Q) policy: when the inventory position drops below R units, order a fixed quantity of Q units. Here, R is referred to as the reorder point and Q is the order quantity. In practice, the inventory position is usually checked at the beginning or end of every day.
  2. (T,S) policy: every T days, place an order to replenish the inventory up to S units. Here, T is the review period, which determines how often we review the inventory level, and S is referred to as the order-up-to level.
  3. (s,S) policy: when the inventory position drops below s units, place an order to replenish the inventory up to S units. Here, s can be considered the reorder point and S the order-up-to level (a small code sketch of this rule is given right after this list).
  4. Base stock policy: equivalent to an (S-1,S) policy, meaning we place an order to replenish the inventory up to S units immediately whenever demand consumes any inventory on a particular day.
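
To make these rules concrete, here is a minimal sketch of the (R,Q) and (s,S) ordering decisions. The function names and example numbers are my own illustration, and the trigger condition follows the "at or below the reorder point" convention used by the (s,S) code later in this article.

def order_quantity_RQ(inv_pos, R, Q):
    # (R,Q): order a fixed quantity Q once the inventory position is at or below the reorder point R
    return Q if inv_pos <= R else 0

def order_quantity_sS(inv_pos, s, S):
    # (s,S): order enough to bring the inventory position back up to S once it is at or below s
    return S - inv_pos if inv_pos <= s else 0

# Example: with an inventory position of 12 cases, a reorder point of 15 and an
# order-up-to level of 32, the (s,S) policy would order 32 - 12 = 20 cases.
print(order_quantity_sS(12, s=15, S=32))  # 20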

These policies suit different demand patterns, but they share a commonality: each assumes a fixed reorder point, a fixed order quantity, a fixed order-up-to level, or a fixed time interval between two orders. Moreover, most of these policies rely only on the current inventory position to make ordering decisions, and do not utilize other available information related to future demand to make more informed decisions. This limits the flexibility of the policy, which can undermine its responsiveness to high demand (causing lost sales) or result in excessive inventory when demand is low (causing inventory holding costs). Can we do better if we remove this limitation? How can we build a model to obtain an inventory control policy without it? A possible way to do this is reinforcement learning (RL).

RL is a subfield of machine learning concerned with decision making. It enables an intelligent agent to learn how to make good decisions from past experience by interacting with an environment. RL has gained popularity through a wide range of applications, including self-driving cars, robotic control, and gaming.

Formulating the Markov Decision Process

To utilize RL, one has to first formulate the decision making problem as a Markov Decision Process (MDP). An MDP is a mathematical framework for modeling decision making problems in which decisions are made sequentially at discrete time steps. An MDP has four core elements: state, action, reward, and transition probability. The state s_t represents the situation of the agent at time t. The action a_t is the decision the agent takes at time t. The reward r_t is the feedback from the environment that tells the agent whether a certain action was good or bad. The transition probability P(s_(t+1)|s_t,a_t) gives the probability that the agent lands in state s_(t+1) when it takes action a_t in state s_t. In most real-world environments, the transition probabilities are unlikely to be known.
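
The agent's goal is to find a policy that maximizes the expected discounted sum of rewards over the planning horizon. This is the standard RL objective, stated here for reference; the discount factor gamma corresponds to the GAMMA = 0.99 hyperparameter used in the DQN code later:

\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]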

The inventory optimization problem naturally fits the framework of an MDP due to its sequential decision making structure. There may be multiple ways to define the state, action, and reward for inventory optimization problems. In theory, the state should include all relevant information that could be useful for taking a reasonable action, the action should be flexible enough to represent all possible options for a decision, and the reward should reflect the objective of the problem (e.g., minimizing cost or maximizing profit). Hence, the state, action, and reward definitions may vary from instance to instance.

In this article, we assume that the customer demand follows a special structure: a mixture of normal distributions, where the demand from Monday to Thursday follows a normal distribution with the lowest mean, the demand on Friday follows a normal distribution with a medium mean, and the demand on Saturday and Sunday follows a normal distribution with the highest mean. This assumption reflects the fact that people tend to buy groceries more often on weekends than on weekdays (and more often on Fridays than on other weekdays). Let’s further assume that, as the owner of the retail store, we want to maximize the profit from selling Coke over a period of time. The costs considered are the inventory holding cost, a fixed ordering cost (e.g., shipping cost), and a variable ordering cost (e.g., the unit cost of buying Coke from suppliers). Backorder cost is not considered here, as we assume that if customers don’t see any Coke left in the store, they will go to another store to buy it rather than place an order and wait for it to be fulfilled later.

With the assumptions above, here are the definitions of the state, action, and reward.

  1. State: (i_pt, dow_t), where i_pt is the inventory position (inventory on hand + upcoming orders) at the end of the tth day, and dow_t is a 6-dimensional one-hot-encoded representation of the day of the week of the tth day. We expect ordering decisions to be made based not only on the inventory position but also on day-of-week information.
  2. Action: a_t, where a_t denotes the order quantity at the end of the tth day. If a_t is a positive number, we place an order of a_t units; if a_t = 0, we don’t place an order. The action space is limited by the maximum order quantity, which is determined by suppliers or the capacity of transportation vehicles.
  3. Reward: r_t = min(d_t,i_t)*p - i_t*h - I(a_t>0)*f - a_t*v, where d_t is the demand that occurs during the daytime of the (t+1)th day, i_t is the inventory on hand at the end of the tth day, p is the unit selling price of the product, h is the holding cost per unit per night, I(a_t>0) is an indicator function that takes the value 1 if a_t>0 and 0 otherwise, f is the fixed ordering cost incurred per order, and v is the variable ordering cost per unit. The reward r_t is simply the profit obtained in the tth decision epoch (a short code sketch of this calculation follows the list).
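
As a quick illustration, the reward for one decision epoch can be computed as below. The function name and example numbers are my own; the default cost parameters are the ones used in the numerical experiment later in the article.

def daily_reward(demand, inv_on_hand, order_qty, p=30, h=3, f=50, v=10):
    """Profit for one decision epoch: r_t = min(d_t, i_t)*p - i_t*h - I(a_t>0)*f - a_t*v."""
    revenue = min(demand, inv_on_hand) * p   # sales revenue, capped by inventory on hand
    holding = inv_on_hand * h                # holding cost for the night
    fixed = f if order_qty > 0 else 0        # fixed ordering cost if an order is placed
    variable = order_qty * v                 # variable ordering cost
    return revenue - holding - fixed - variable

# e.g., demand of 12 cases, 10 cases on hand, ordering 20 cases:
# min(12,10)*30 - 10*3 - 50 - 20*10 = 300 - 30 - 50 - 200 = 20
print(daily_reward(12, 10, 20))  # 20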

Solving the Markov Decision Process

One notable property of the MDP formulated above is that the transition probabilities are unknown. At a particular time t, dow_(t+1) is known for sure, but i_p(t+1) is not uniquely determined by a_t. One could fit a demand distribution to historical demand data, infer the transition probabilities, and then use a model-based RL technique to solve the problem. However, this could result in a large gap between the simulated environment and the real world, as fitting an accurate demand distribution is very challenging (especially in this case, where demand follows a mixture of distributions). Hence, it is better to adopt model-free RL techniques, which deal with unknown transition probabilities inherently.

There are multiple model-free RL techniques for solving this MDP. In this article, as a first attempt, I adopt Deep Q Network (DQN) as the solution tool. DQN is a variant of Q-learning that uses a deep neural network to approximate the Q function. To save space, I omit a detailed introduction to DQN, as it is not the focus of this article; interested readers are referred to this article.
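
In brief, DQN trains a network Q(s,a;θ) to minimize the squared temporal-difference error below. This is the standard DQN loss, which the learn() method of the agent code later implements; θ⁻ denotes the weights of the slowly updated target network:

L(\theta) = \mathbb{E}\left[ \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^2 \right]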

To compare the performance of the inventory control policy learned by DQN and the classic inventory control policies, let’s consider a numerical experiment as follows.

Assume there is a small retail store that sells Coke to its customers. Every time the store wants to replenish its inventory to fulfill customer demand, it has to place an order for an integer number of cases of Coke (one case contains 24 cans). Suppose the unit selling price of Coke is $30 per case, the holding cost is $3 per case per night, the fixed ordering cost is $50 per order, the variable ordering cost is $10 per case, the inventory capacity of the store is 50 cases, the maximum order quantity is 20 cases per order, the initial inventory is 25 cases at the end of a Sunday, and the lead time (the interval between placing an order and its arrival) is 2 days. Here, we assume the demand from Monday to Thursday follows a normal distribution N(3,1.5), the demand on Friday follows a normal distribution N(6,1), and the demand on Saturday and Sunday follows a normal distribution N(12,2), where the two parameters are the mean and standard deviation in cases. We generate 52 weeks of historical demand samples from this mixture of distributions and use it as the training dataset for the DQN model.
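
For readability, these experiment parameters could be gathered in a single dictionary like the one below. This is just a convenience sketch of my own; the code in the rest of the article hard-codes the same values.

PARAMS = {
    "unit_price": 30,           # $ per case
    "holding_cost": 3,          # $ per case per night
    "fixed_order_cost": 50,     # $ per order
    "variable_order_cost": 10,  # $ per case
    "capacity": 50,             # cases
    "max_order_qty": 20,        # cases per order
    "initial_inventory": 25,    # cases
    "lead_time": 2,             # days
}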

As a benchmark, we will optimize a classic (s,S) inventory control policy using the same dataset which was used for training the DQN model, and compare its performance with DQN in a test set.

Code for training the DQN model

First, we generate the training dataset and view the histogram of the historical demands. Note that non-integer demand samples are rounded to the nearest integer, and negative samples are truncated to 0.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
demand_hist = []
for i in range(52):
    # Monday to Thursday: low-demand days
    for j in range(4):
        random_demand = np.random.normal(3, 1.5)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_hist.append(random_demand)
    # Friday: medium demand
    random_demand = np.random.normal(6, 1)
    if random_demand < 0:
        random_demand = 0
    random_demand = np.round(random_demand)
    demand_hist.append(random_demand)
    # Saturday and Sunday: high demand
    for j in range(2):
        random_demand = np.random.normal(12, 2)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_hist.append(random_demand)
plt.hist(demand_hist)
plt.show()
Histogram of the historical demand data (Image by author)

Then we define the environment of the inventory optimization problem for the DQN agent to interact with.

class InvOptEnv():
    def __init__(self, demand_records):
        self.n_period = len(demand_records)
        self.current_period = 1
        self.day_of_week = 0
        self.inv_level = 25  # inventory on hand
        self.inv_pos = 25    # inventory position = on hand + upcoming orders
        self.capacity = 50
        self.holding_cost = 3
        self.unit_price = 30
        self.fixed_order_cost = 50
        self.variable_order_cost = 10
        self.lead_time = 2
        self.order_arrival_list = []
        self.demand_list = demand_records
        self.state = np.array([self.inv_pos] + self.convert_day_of_week(self.day_of_week))
        self.state_list = []
        self.state_list.append(self.state)
        self.action_list = []
        self.reward_list = []

    def reset(self):
        self.state_list = []
        self.action_list = []
        self.reward_list = []
        self.inv_level = 25
        self.inv_pos = 25
        self.current_period = 1
        self.day_of_week = 0
        self.state = np.array([self.inv_pos] + self.convert_day_of_week(self.day_of_week))
        self.state_list.append(self.state)
        self.order_arrival_list = []
        return self.state

    def step(self, action):
        # Place an order (if any); it will arrive after the lead time
        if action > 0:
            y = 1
            self.order_arrival_list.append([self.current_period + self.lead_time, action])
        else:
            y = 0
        # Receive an order scheduled to arrive in the current period
        if len(self.order_arrival_list) > 0:
            if self.current_period == self.order_arrival_list[0][0]:
                self.inv_level = min(self.capacity, self.inv_level + self.order_arrival_list[0][1])
                self.order_arrival_list.pop(0)
        # Serve demand and compute the profit of this decision epoch
        demand = self.demand_list[self.current_period - 1]
        units_sold = demand if demand <= self.inv_level else self.inv_level
        reward = units_sold*self.unit_price - self.holding_cost*self.inv_level - y*self.fixed_order_cost \
                 - action*self.variable_order_cost
        self.inv_level = max(0, self.inv_level - demand)
        # Update the inventory position with all outstanding orders
        self.inv_pos = self.inv_level
        if len(self.order_arrival_list) > 0:
            for i in range(len(self.order_arrival_list)):
                self.inv_pos += self.order_arrival_list[i][1]
        self.day_of_week = (self.day_of_week + 1) % 7
        self.state = np.array([self.inv_pos] + self.convert_day_of_week(self.day_of_week))
        self.current_period += 1
        self.state_list.append(self.state)
        self.action_list.append(action)
        self.reward_list.append(reward)
        terminate = self.current_period > self.n_period
        return self.state, reward, terminate

    def convert_day_of_week(self, d):
        # One-hot encode the day of week into a 6-dimensional vector (all zeros for d == 0)
        code = [0, 0, 0, 0, 0, 0]
        if d > 0:
            code[d - 1] = 1
        return code

Now we begin building the DQN model with PyTorch. The code implementation of DQN in this part is based on this article.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Actor (Policy) Model."""
    def __init__(self, state_size, action_size, seed, fc1_unit=128, fc2_unit=128):
        """
        Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_unit (int): Number of nodes in first hidden layer
            fc2_unit (int): Number of nodes in second hidden layer
        """
        super(QNetwork, self).__init__()  # calls __init__ method of nn.Module class
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_unit)
        self.fc2 = nn.Linear(fc1_unit, fc2_unit)
        self.fc3 = nn.Linear(fc2_unit, action_size)

    def forward(self, x):
        """Build a network that maps state -> action values."""
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

import random
from collections import namedtuple, deque

import torch
import torch.nn.functional as F
import torch.optim as optim

BUFFER_SIZE = int(5*1e5)  # replay buffer size
BATCH_SIZE = 128          # minibatch size
GAMMA = 0.99              # discount factor
TAU = 1e-3                # for soft update of target parameters
LR = 1e-4                 # learning rate
UPDATE_EVERY = 4          # how often to update the network

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Agent():
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed):
        """Initialize an Agent object.

        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)

        # Q-Networks (local and target)
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)

        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get a random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experience = self.memory.sample()
                self.learn(experience, GAMMA)

    def act(self, state, eps=0, evaluation_episode=False):
        """Returns action for given state as per current policy.

        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences
        criterion = torch.nn.MSELoss()
        # The local model is the one we train, so keep it in training mode
        self.qnetwork_local.train()
        # The target model provides the TD targets, so keep it in evaluation mode;
        # its weights are updated via the soft_update function below
        self.qnetwork_target.eval()

        # Q values of the actions actually taken, shape (batch_size, 1)
        predicted_targets = self.qnetwork_local(states).gather(1, actions)
        with torch.no_grad():
            labels_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)

        # TD targets: r + gamma * max_a' Q_target(s', a'), zeroed out for terminal states
        labels = rewards + (gamma * labels_next * (1 - dones))
        loss = criterion(predicted_targets, labels).to(device)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1-tau)*target_param.data)

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experiences = namedtuple("Experience",
                                      field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experiences(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

Finally, we can train the DQN model. Note that the size of the action space is 21, since there are 21 possible order quantities ranging from 0 to the maximum order quantity of 20, and the state size is 7 (the inventory position plus the 6-dimensional day-of-week encoding).

agent = Agent(state_size=7, action_size=21, seed=0)
TRAINING_EVALUATION_RATIO = 4

def dqn(env, n_episodes=1000, max_t=10000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.

    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []  # list containing the score from each episode
    eps = eps_start
    for i_episode in range(1, n_episodes+1):
        evaluation_episode = i_episode % TRAINING_EVALUATION_RATIO == 0
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done = env.step(action)
            # agent.step stores the experience tuple in the replay buffer and
            # trains the local Q network once enough samples have been collected
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                print('episode ' + str(i_episode) + ':', score)
                scores.append(score)
                break
        eps = max(eps*eps_decay, eps_end)  # decrease epsilon
    return scores

env = InvOptEnv(demand_hist)
scores = dqn(env)

plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Reward')
plt.xlabel('Episode #')
plt.show()

torch.save(agent.qnetwork_local.state_dict(), desired_path)  # desired_path: your own file path for saving the model

The figure below shows the total reward obtained in each episode over 1000 training episodes. The reward improves gradually and eventually converges.

Reward obtained in each training episode (Image by author)

Code for optimizing the (s,S) policy

As both s and S are discrete, there is a limited number of possible (s,S) combinations in this problem. We will not consider setting s lower than 0, since it doesn’t make sense to reorder only when we are already out of stock, so s can range from 0 to S-1. For S, we give a little extra room and allow it to exceed the capacity: since orders do not arrive immediately and demand may arrive during the lead time, the capacity need not be an upper bound for S. Here, we let S range from 1 to 60.

We can actually evaluate all the possible combinations on the historical demand dataset and pick the combination that gives the highest profit. The best (s,S) combination obtained is (15,32).

def profit_calculation_sS(s, S, demand_records):
    total_profit = 0
    inv_level = 25  # inventory on hand, use this to calculate inventory costs
    lead_time = 2
    capacity = 50
    holding_cost = 3
    fixed_order_cost = 50
    variable_order_cost = 10
    unit_price = 30
    order_arrival_list = []
    for current_period in range(len(demand_records)):
        inv_pos = inv_level
        if len(order_arrival_list) > 0:
            for i in range(len(order_arrival_list)):
                inv_pos += order_arrival_list[i][1]
        if inv_pos <= s:
            order_quantity = min(20, S-inv_pos)
            order_arrival_list.append([current_period+lead_time, order_quantity])
            y = 1
        else:
            order_quantity = 0
            y = 0
        if len(order_arrival_list) > 0:
            if current_period == order_arrival_list[0][0]:
                inv_level = min(capacity, inv_level + order_arrival_list[0][1])
                order_arrival_list.pop(0)
        demand = demand_records[current_period]
        units_sold = demand if demand <= inv_level else inv_level
        profit = units_sold*unit_price - holding_cost*inv_level - y*fixed_order_cost - order_quantity*variable_order_cost
        inv_level = max(0, inv_level-demand)
        total_profit += profit
    return total_profit

s_S_list = []
for S in range(1, 61):  # give a little room to allow S to exceed the capacity
    for s in range(0, S):
        s_S_list.append([s, S])

profit_sS_list = []
for sS in s_S_list:
    profit_sS_list.append(profit_calculation_sS(sS[0], sS[1], demand_hist))

best_sS_profit = np.max(profit_sS_list)
best_sS = s_S_list[np.argmax(profit_sS_list)]

Code for testing the DQN policy

We first create 100 customer demand datasets for testing. Each of the 100 datasets contains 52 weeks of demand data. We can think of each dataset as a possible scenario of the demands in the next 1 year. Then we evaluate the DQN policy on each demand dataset and collect the total reward for each dataset.

demand_test = []
for k in range(100, 200):
    np.random.seed(k)
    demand_future = []
    for i in range(52):
        for j in range(4):
            random_demand = np.random.normal(3, 1.5)
            if random_demand < 0:
                random_demand = 0
            random_demand = np.round(random_demand)
            demand_future.append(random_demand)
        random_demand = np.random.normal(6, 1)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_future.append(random_demand)
        for j in range(2):
            random_demand = np.random.normal(12, 2)
            if random_demand < 0:
                random_demand = 0
            random_demand = np.round(random_demand)
            demand_future.append(random_demand)
    demand_test.append(demand_future)

model = QNetwork(state_size=7, action_size=21, seed=0)
model.load_state_dict(torch.load(desired_path))
model.to(device)  # move the network to the same device used for the state tensors below
model.eval()

profit_RL = []
actions_list = []
invs_list = []

for demand in demand_test:
    env = InvOptEnv(demand)
    env.reset()
    profit = 0
    actions = []
    invs = []
    done = False
    state = env.state
    while not done:
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        with torch.no_grad():
            action_values = model(state)
        action = np.argmax(action_values.cpu().data.numpy())
        actions.append(action)
        next_state, reward, done = env.step(action)
        state = next_state
        invs.append(env.inv_level)
        profit += reward
    actions_list.append(actions)
    invs_list.append(invs)
    profit_RL.append(profit)
RL_mean = np.mean(profit_RL)

Code for testing the (s,S) policy

We evaluate the (s,S) policy on the same test set.

profit_sS = []
for demand in demand_test:
    profit_sS.append(profit_calculation_sS(15, 32, demand))
sS_mean = np.mean(profit_sS)

Discussion on the numerical results

The average profit of the DQN policy across the 100 demand datasets is $20,314.53, while the average profit of the (s,S) policy is $17,202.08, an 18.09% increase in profit. The boxplot of the profits obtained by the DQN and (s,S) policies across the 100 demand datasets is shown below.

Boxplot of the profits obtained by DQN policy and (s,S) policy in the test set (Image by author)
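
The plotting code is not shown in the article, but the boxplot can be reproduced from the two profit lists along these lines (a sketch of my own):

plt.boxplot([profit_RL, profit_sS], labels=['DQN policy', '(s,S) policy'])
plt.ylabel('Total profit over 52 weeks ($)')
plt.show()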

To further understand the difference between the DQN and (s,S) policies, we pick one demand dataset in the test set and take a closer look at the actions taken by the DQN policy and (s,S) policy respectively for the first 2 weeks. See the table below.

Comparison between the actions taken by the DQN and (s,S) policies (Image by author)

We see that the DQN policy is more responsive to customer demand and tends to place more orders to reduce potential lost sales. The DQN policy does incur higher ordering costs; however, the increase in ordering cost is much smaller than the increase in sales revenue.
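
One quick way to quantify this on the test runs (my own addition, using the actions_list collected in the DQN test loop and the fixed $50 and variable $10 ordering costs defined earlier) is to look at how often the DQN policy places an order and how much ordering cost it incurs on average per 52-week scenario:

orders_per_scenario = [np.sum(np.array(a) > 0) for a in actions_list]          # number of orders placed
ordering_cost_per_scenario = [np.sum(np.array(a) > 0)*50 + np.sum(a)*10 for a in actions_list]
print(np.mean(orders_per_scenario), np.mean(ordering_cost_per_scenario))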

In this article, I present a DQN model to optimize the inventory operations of a retail store. The DQN policy outperforms the classic (s,S) policy as it provides more flexibility in making ordering decisions and thus is more responsive to customer demand.

One more observation on the comparison between the DQN policy and the (s,S) policy: DQN tends to be advantageous over the (s,S) policy only when the demand structure is complex enough that the state definition can include extra information useful for inferring the lead time demand distribution. For example, here we assume customer demand follows different distributions on different days of the week, so DQN can use the day-of-week information to infer what demand will look like over the next two days of lead time. This extra information helps us make more informed decisions than the (s,S) policy, which bases its decisions only on the inventory position. However, if there is no such useful extra information to include in the state definition, DQN can barely beat the (s,S) policy. I tried training the DQN model with a slightly different state definition, assuming the demand on every day follows the same negative binomial distribution, and the DQN policy actually underperformed the (s,S) policy.

Here are a few topics that my future articles may explore. First, I adopted DQN to solve this problem; it would be interesting to see whether other RL frameworks, such as those in the policy optimization class, can achieve better performance, as they can output stochastic policies. Second, this article focused on a very simple supply chain model containing a single retailer. It would also be interesting to see how RL techniques can be leveraged to optimize more complex supply chain models, such as multi-echelon networks, where RL may show even more significant advantages.

Thanks for reading!

