This repository contains the code for the paper A Deep Q-Network for the Beer Game: Deep Reinforcement Learning for Inventory Optimization. The paper is available online at https://pubsonline.informs.org/doi/abs/10.1287/msom.2020.0939. The code works with Python 2.7 and Python 3.4 to 3.7. For more information, see the list of requirements, which you can install with `pip install -r requirements.txt`.
`main.py` is the file to call to start the training. `BGAgent.py` provides the beer-game agent, which holds all the properties and functionality of an agent. `clBeergame.py` instantiates the agents and runs the beer-game simulation. Once the number of observations in the replay buffer reaches the minimum requirement, it also calls the train step of the SRDQN algorithm. The DNN approximator and the SRDQN algorithm are implemented in `SRDQN.py`. `config.py` introduces all arguments and their default values, as well as some functions that build the simulation scenarios for different instances of the game. The rest of this document describes how to run the training and how to set different values for the arguments.
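The training trigger described above can be pictured with a small sketch. The names `ReplayBuffer`, `min_size`, and `ready_to_train` are hypothetical and are not taken from `clBeergame.py` or `SRDQN.py`; the sketch only illustrates the idea of collecting observations and starting train steps once a minimum number of them is stored.

```python
import random
from collections import deque


class ReplayBuffer(object):
    """Hypothetical fixed-capacity buffer; not the repository's implementation."""

    def __init__(self, capacity, min_size):
        self.storage = deque(maxlen=capacity)  # oldest observations are dropped first
        self.min_size = min_size

    def add(self, observation):
        self.storage.append(observation)

    def ready_to_train(self):
        # Train steps start only once enough observations are stored.
        return len(self.storage) >= self.min_size

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.storage), batch_size)


buffer = ReplayBuffer(capacity=1000, min_size=5)
train_steps = 0
for step in range(8):
    buffer.add((step, "state", "action", "reward"))
    if buffer.ready_to_train():
        batch = buffer.sample(batch_size=4)
        train_steps += 1  # a real agent would run one SRDQN train step here
```

In this toy loop the buffer becomes ready after the fifth observation, so only the last four iterations would trigger a train step.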
### Play beer-game and compare your result with AI!

You can play the beer game and compare your result with the result that our RL algorithm achieves on the same game. See https://beergame.opexanalytics.com/
Note that this code does not work with TensorFlow 2+.
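A guard like the following could make the TensorFlow restriction explicit before training starts. This is only a sketch: the helper name `is_supported_tensorflow` is hypothetical, and it assumes that any 1.x release is acceptable, which the repository does not state.

```python
def is_supported_tensorflow(version):
    """Return True for TensorFlow 1.x version strings such as '1.15.0'.

    Assumption: every release with a major version below 2 is acceptable.
    """
    major = int(version.split(".")[0])
    return major < 2


# Typical use (requires TensorFlow to be installed):
#   import tensorflow as tf
#   if not is_supported_tensorflow(tf.__version__):
#       raise RuntimeError("This code does not work with TensorFlow 2+.")
```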
Each agent can use one of the `srdqn`, `bs`, `Strm`, or `rnd` algorithms to decide on its action (the order quantity). With four agents and four algorithms, there are 4^4 = 256 combinations of agent types, of which we consider 23 cases in this study. To select one of these cases, set `config.gameConfig`, which chooses one of the pre-defined type assignments for the four agents in the game. For example, `config.gameConfig=3` sets `config.agentTypes = ["srdqn", "bs","bs","bs"]`, in which the retailer follows the `srdqn` algorithm and the rest of the agents use the base-stock policy to decide the order quantity. The main `gameConfig` values are listed below:
Base-stock co-players
if config.gameConfig == 3:
    config.agentTypes = ["srdqn", "bs","bs","bs"]
if config.gameConfig == 4:
    config.agentTypes = ["bs", "srdqn","bs","bs"]
if config.gameConfig == 5:
    config.agentTypes = ["bs", "bs","srdqn","bs"]
if config.gameConfig == 6:
    config.agentTypes = ["bs", "bs","bs","srdqn"]
Sterman co-players
if config.gameConfig == 7:
    config.agentTypes = ["srdqn", "Strm","Strm","Strm"]
if config.gameConfig == 8:
    config.agentTypes = ["Strm", "srdqn","Strm","Strm"]
if config.gameConfig == 9:
    config.agentTypes = ["Strm", "Strm","srdqn","Strm"]
if config.gameConfig == 10:
    config.agentTypes = ["Strm", "Strm","Strm","srdqn"]
Random co-players
if config.gameConfig == 11:
    config.agentTypes = ["srdqn", "rnd","rnd","rnd"]
if config.gameConfig == 12:
    config.agentTypes = ["rnd", "srdqn","rnd","rnd"]
if config.gameConfig == 13:
    config.agentTypes = ["rnd", "rnd","srdqn","rnd"]
if config.gameConfig == 14:
    config.agentTypes = ["rnd", "rnd","rnd","srdqn"]
The full list of `gameConfig` values is defined in the `setAgentType()` function in `config.py`.
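The twelve cases listed above follow a regular pattern: each block of four values fixes the co-player type, and the position inside the block fixes which agent is the `srdqn` agent. The sketch below reproduces only that pattern. The function name `agent_types_for` is hypothetical, and `setAgentType()` in `config.py` remains the authoritative definition, including the cases not listed here.

```python
def agent_types_for(game_config):
    """Rebuild agentTypes for gameConfig 3..14 from the pattern listed above."""
    co_player_by_block = ["bs", "Strm", "rnd"]  # blocks 3-6, 7-10, 11-14
    if not 3 <= game_config <= 14:
        raise ValueError("this sketch only covers gameConfig values 3 to 14")
    offset = game_config - 3
    co_player = co_player_by_block[offset // 4]
    srdqn_position = offset % 4  # 0 is the retailer, the first agent
    agent_types = [co_player] * 4
    agent_types[srdqn_position] = "srdqn"
    return agent_types


print(agent_types_for(3))   # srdqn retailer with base-stock co-players
print(agent_types_for(8))   # srdqn as second agent with Sterman co-players
```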
Since the `d+x` rule is used to train the SRDQN model, the value of `x` needs a lower and an upper limit. `config.actionLow` and `config.actionUp` set these limits.
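As a rough illustration of how these two limits define the action space, consider the sketch below. It assumes, following our reading of the paper, that under the `d+x` rule an agent orders `d + x`, where `d` is the order it just received from its downstream neighbor. The function names and the clipping of the order at zero are assumptions of this sketch, not code from the repository.

```python
def action_space(action_low, action_up):
    """All integer values of x between the two limits, inclusive."""
    return list(range(action_low, action_up + 1))


def order_quantity(d, x):
    """Order placed under the d+x rule; clipping at zero is an assumption."""
    return max(0, d + x)


actions = action_space(-2, 2)
print(actions)
print(order_quantity(1, -2))
```

With `actionLow=-2` and `actionUp=2` this yields the five actions of the basic model.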
In addition, for each of the four agents one can set the lead time for receiving shipments via `config.leadRecItem1`, `config.leadRecItem2`, `config.leadRecItem3`, `config.leadRecItem4` and the lead time for receiving orders via `config.leadRecOrder1`, `config.leadRecOrder2`, `config.leadRecOrder3`, `config.leadRecOrder4`. Similarly, the initial inventory level, initial arriving order, and initial arriving shipment of the four agents can be set by `config.ILInit1`, `config.ILInit2`, `config.ILInit3`, `config.ILInit4`; `config.AOInit1`, `config.AOInit2`, `config.AOInit3`, `config.AOInit4`; and `config.ASInit1`, `config.ASInit2`, `config.ASInit3`, `config.ASInit4`, respectively.
`config.maxEpisodesTrain` determines the number of episodes used to train the `srdqn` agent.
To run the base-stock policy (`bs`), you need to set the base-stock level of each agent with `config.f1`, `config.f2`, `config.f3`, and `config.f4`. We obtained those values by running the Clark-Scarf algorithm for each instance.
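For readers unfamiliar with base-stock policies, the sketch below shows the textbook rule: order enough to raise the inventory position (inventory level plus units already on order) to the base-stock level. The function and argument names are hypothetical, and the way `BGAgent.py` tracks the inventory position may differ.

```python
def base_stock_order(base_stock_level, inventory_level, on_order):
    """Textbook base-stock rule: order up to the base-stock level."""
    inventory_position = inventory_level + on_order
    return max(0, base_stock_level - inventory_position)


# Example with a base-stock level of 19, 12 units on hand, and 4 on order.
print(base_stock_order(19.0, 12, 4))
```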
`data.zip` includes all the datasets required to train the model on the basic case, the literature cases, the basket dataset, and the forecasting dataset. Unzipping this file creates the `data` directory, which contains a Python file (`createDemand.py`) as well as the mentioned datasets. `createDemand.py` can be used to create datasets of any size for the literature cases.
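To give a feel for what demand data for the literature cases looks like, here is a minimal sketch that draws one episode of demand for each of the three distributions used in this README. It is not `createDemand.py`: the function name, the rounding and clipping of normal samples, and the reading of `C(4,8)` as "4 for the first four periods, 8 afterwards" are assumptions, and the on-disk format of the real datasets is not shown.

```python
import random


def sample_demand(distribution, periods, rng, low=0, high=8, mu=10.0, sigma=2.0):
    """Draw one episode of integer demand (illustrative only)."""
    if distribution == "uniform":  # U[low, high], both ends included
        return [rng.randint(low, high) for _ in range(periods)]
    if distribution == "normal":  # N(mu, sigma), rounded and clipped at zero
        return [max(0, int(round(rng.gauss(mu, sigma)))) for _ in range(periods)]
    if distribution == "classic":  # assumed meaning of C(4,8)
        return [4 if t < 4 else 8 for t in range(periods)]
    raise ValueError("unknown distribution: {}".format(distribution))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
uniform_demand = sample_demand("uniform", 10, rng)
normal_demand = sample_demand("normal", 10, rng)
classic_demand = sample_demand("classic", 6, rng)
```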
The basic model uses the uniform demand distribution `U[0,2]` with the action space `{-2, -1, 0, 1, 2}`. All the default values are set to run this experiment for the case in which `srdqn` plays the retailer and the other agents follow the base-stock policy. For any other case, the training can be started by setting the corresponding arguments. For example, to train an `srdqn` warehouse that starts with an initial inventory of 10 units and plays with Sterman co-players, the following line runs the training for 50000 episodes:
python main.py --gameConfig=8 --maxEpisodesTrain=50000 --ILInit2=10 --batchSize=128
To train each of the literature cases, first set `config.demandDistribution`, `config.actionUp`, and `config.actionLow`, as well as the other parameters for the agents, as follows:
For U[0,8]:
python main.py --demandDistribution=0 --demandUp=9 --actionUp=8 --actionLow=-8 --ch1=0.5 --ch2=0.5 --ch3=0.5 --ch4=0.5 --cp1=1.0 --cp2=1.0 --cp3=1.0 --cp4=1.0 --f1=19.0 --f2=20.0 --f3=20.0 --f4=14.0 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --ILInit1=12 --ILInit2=12 --ILInit3=12 --ILInit4=12 --AOInit1=4 --AOInit2=4 --AOInit3=4 --AOInit4=4 --ASInit1=4 --ASInit2=4 --ASInit3=4 --ASInit4=4 --gameConfig=6
For N(10,2):
python main.py --demandDistribution=1 --demandMu=10 --demandSigma=2 --actionUp=5 --actionLow=-5 --ch1=1 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0 --cp3=0 --cp4=0 --f1=48.0 --f2=43.0 --f3=41.0 --f4=30.0 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --ILInit1=10 --ILInit2=10 --ILInit3=10 --ILInit4=10 --AOInit1=10 --AOInit2=10 --AOInit3=10 --AOInit4=10 --ASInit1=10 --ASInit2=10 --ASInit3=10 --ASInit4=10 --gameConfig=6
For C(4,8):
python main.py --demandDistribution=2 --actionUp=8 --actionLow=-8 --ch1=0.5 --ch2=0.5 --ch3=0.5 --ch4=0.5 --cp1=1.0 --cp2=1.0 --cp3=1.0 --cp4=1.0 --demandUp=9 --f1=32.0 --f2=32.0 --f3=32.0 --f4=24.0 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --ILInit1=12 --ILInit2=12 --ILInit3=12 --ILInit4=12 --AOInit1=4 --AOInit2=4 --AOInit3=4 --AOInit4=4 --ASInit1=4 --ASInit2=4 --ASInit3=4 --ASInit4=4 --gameConfig=6
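Because these commands repeat the same flag for each of the four agents, it can be less error-prone to generate them. The helper below is a hypothetical convenience, not part of the repository. It only assembles flags of the form `--name=value` and `--nameK=value` for agents 1 to 4, using the argument names that appear in the commands above.

```python
def build_command(shared, per_agent):
    """Assemble a main.py command line.

    shared: flag name -> value
    per_agent: flag prefix -> list of four values (agents 1 to 4)
    """
    args = ["python", "main.py"]
    for name in sorted(shared):
        args.append("--{}={}".format(name, shared[name]))
    for prefix in sorted(per_agent):
        values = per_agent[prefix]
        if len(values) != 4:
            raise ValueError("{} needs one value per agent".format(prefix))
        for agent, value in enumerate(values, start=1):
            args.append("--{}{}={}".format(prefix, agent, value))
    return args


# Flags of the U[0,8] case shown above.
command = build_command(
    shared={"demandDistribution": 0, "demandUp": 9, "actionUp": 8,
            "actionLow": -8, "gameConfig": 6},
    per_agent={"ch": [0.5] * 4, "cp": [1.0] * 4,
               "f": [19.0, 20.0, 20.0, 14.0],
               "leadRecItem": [2, 2, 2, 2], "leadRecOrder": [2, 2, 2, 1],
               "ILInit": [12] * 4, "AOInit": [4] * 4, "ASInit": [4] * 4},
)
print(" ".join(command))
```

The flags come out in alphabetical order rather than in the order used above; this sketch assumes that flag order does not matter to `main.py`.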
For the basket dataset, set `config.demandDistribution=3`; `config.data_id` can then be `6`, `13`, or `22`. Training with the scaled dataset, which is the setting reported in the paper, also requires `config.scaled=True`. The following commands cover the three cases:
python main.py --demandDistribution=3 --data_id=6 --demandMu=3 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=19.0 --f2=12.0 --f3=12.0 --f4=8.0 --ILInit1=3 --ILInit2=3 --ILInit3=3 --ILInit4=3 --AOInit1=3 --AOInit2=3 --AOInit3=3 --AOInit4=3 --ASInit1=3 --ASInit2=3 --ASInit3=3 --ASInit4=3
python main.py --demandDistribution=3 --data_id=13 --demandMu=3 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=19.0 --f2=13.0 --f3=11.0 --f4=8.0 --ILInit1=3 --ILInit2=3 --ILInit3=3 --ILInit4=3 --AOInit1=3 --AOInit2=3 --AOInit3=3 --AOInit4=3 --ASInit1=3 --ASInit2=3 --ASInit3=3 --ASInit4=3
python main.py --demandDistribution=3 --data_id=22 --demandMu=2 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=14.0 --f2=9.0 --f3=9.0 --f4=5.0 --ILInit1=2 --ILInit2=2 --ILInit3=2 --ILInit4=2 --AOInit1=2 --AOInit2=2 --AOInit3=2 --AOInit4=2 --ASInit1=2 --ASInit2=2 --ASInit3=2 --ASInit4=2
For the forecasting dataset, set `config.demandDistribution=4`; `config.data_id` can then be `5`, `34`, or `46`. Training with the scaled dataset, which is the setting reported in the paper, also requires `config.scaled=True`. The following commands cover the three cases:
python main.py --demandDistribution=4 --data_id=5 --demandMu=4 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=21.0 --f2=16.0 --f3=16.0 --f4=11.0 --ILInit1=4 --ILInit2=4 --ILInit3=4 --ILInit4=4 --AOInit1=4 --AOInit2=4 --AOInit3=4 --AOInit4=4 --ASInit1=4 --ASInit2=4 --ASInit3=4 --ASInit4=4
python main.py --demandDistribution=4 --data_id=34 --demandMu=4 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=18.0 --f2=15.0 --f3=14.0 --f4=10.0 --ILInit1=4 --ILInit2=4 --ILInit3=4 --ILInit4=4 --AOInit1=4 --AOInit2=4 --AOInit3=4 --AOInit4=4 --ASInit1=4 --ASInit2=4 --ASInit3=4 --ASInit4=4
python main.py --demandDistribution=4 --data_id=46 --demandMu=4 --demandSigma=2 --demandUp=3 --actionUp=5 --actionLow=-5 --leadRecItem1=2 --leadRecItem2=2 --leadRecItem3=2 --leadRecItem4=2 --leadRecOrder1=2 --leadRecOrder2=2 --leadRecOrder3=2 --leadRecOrder4=1 --scaled=True --ch1=1.0 --ch2=0.75 --ch3=0.5 --ch4=0.25 --cp1=10.0 --cp2=0.0 --cp3=0.0 --cp4=0.0 --f1=21.0 --f2=16.0 --f3=18.0 --f4=12.0 --ILInit1=4 --ILInit2=4 --ILInit3=4 --ILInit4=4 --AOInit1=4 --AOInit2=4 --AOInit3=4 --AOInit4=4 --ASInit1=4 --ASInit2=4 --ASInit3=4 --ASInit4=4
We provide the trained models of the basic model, which are used in the transfer learning section. The saved models are available in `pre_model\uniform\0-3\brainX`, in which `X` is in `{3, 4, 5, 6}`. The value of `X` follows the same pattern as `config.gameConfig`. To train a new model starting from one of these trained models, set `config.tlBaseBrain`, which determines which trained model is used as the base model. For example:
python main.py --gameConfig=3 --iftl=True --ifUsePreviousModel=True --tlBaseBrain=3 --baseDemandDistribution=0
In addition, if you trained a model with another demand distribution, e.g., `N(10,2)`, you need to move the saved models into `pre_model\normal\10-2\brainX` and then set `config.baseDemandDistribution=1` for the new training. `config.baseDemandDistribution` follows the same pattern as `config.demandDistribution`.
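The two directory names mentioned above (`pre_model\uniform\0-3\brainX` and `pre_model\normal\10-2\brainX`) suggest a naming scheme of distribution name, two distribution parameters joined by a hyphen, and the brain index. The helper below is a sketch built on that assumption. Its name and arguments are hypothetical, and the code that actually resolves this path in the repository should be checked before relying on it.

```python
import os


def pre_model_dir(distribution_name, first, second, brain_index, root="pre_model"):
    """Build a directory such as pre_model/normal/10-2/brain3 (assumed scheme)."""
    return os.path.join(
        root,
        distribution_name,
        "{}-{}".format(first, second),
        "brain{}".format(brain_index),
    )


# Where saved N(10,2) models for brain 3 would go under this scheme.
print(pre_model_dir("normal", 10, 2, 3))
```

`os.path.join` uses the separator of the current platform, so the result contains backslashes on Windows and forward slashes elsewhere.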
If you set `config.ifSaveFigure=True`, the code saves the trajectories of the inventory level, reward, action, open order, and order-up-to level for each agent in an episode. `config.saveFigIntLow` and `config.saveFigIntUp` determine the range of episodes for which the figures are saved.
Setting `config.ifsaveHistInterval=True` activates saving the trajectories of the received order, received shipment, inventory level, reward, action, open order, and order-up-to level for each agent in an episode. With this argument, you also need to set the interval between two saved episodes with `config.saveHistInterval`.
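The two saving options can be read as simple per-episode checks, sketched below. The function names are hypothetical, and two details are assumptions rather than facts about the repository: that the figure range includes both bounds, and that the history is saved whenever the episode number is a multiple of the interval.

```python
def should_save_figure(episode, if_save_figure, fig_int_low, fig_int_up):
    """True if figures are enabled and the episode lies in [low, up]."""
    return bool(if_save_figure and fig_int_low <= episode <= fig_int_up)


def should_save_history(episode, if_save_hist_interval, save_hist_interval):
    """True if history saving is enabled and the episode is on the interval."""
    return bool(if_save_hist_interval and episode % save_hist_interval == 0)


# Episodes out of the first 20 for which each option would fire.
figure_episodes = [e for e in range(20) if should_save_figure(e, True, 5, 8)]
history_episodes = [e for e in range(20) if should_save_history(e, True, 5)]
```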
If you used this code for your experiments or found it helpful, consider citing the following paper:
@article{oroojlooyjadid2017deep,
  title={A Deep Q-Network for the Beer Game: Deep Reinforcement Learning for Inventory Optimization},
  author={Oroojlooyjadid, Afshin and Nazari, MohammadReza and Snyder, Lawrence and Tak{\'a}{\v{c}}, Martin},
  journal={Manufacturing \& Service Operations Management},
  year={2021},
  doi={10.1287/msom.2020.0939},
  url={https://doi.org/10.1287/msom.2020.0939},
  eprint={https://doi.org/10.1287/msom.2020.0939}
}