diff --git a/README.MD b/README.MD old mode 100644 new mode 100755 index d6fb10e..d3362b2 --- a/README.MD +++ b/README.MD @@ -1,7 +1,7 @@ -[23]: https://github.com/dgriff777/a3c_continuous +*Update: Minor updates to code. Added distributed step size training functionality. Added tensorboard integration so you can log and create graphs of training, see a graph of the model, and visualize your weight and bias distributions as they update during training.* -## NEWLY ADDED A3G A NEW GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING!! +# A3G A GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING # RL A3C Pytorch @@ -9,12 +9,12 @@ ![A3C LSTM playing Breakout-v0](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/Breakout.gif) ![A3C LSTM playing SpaceInvadersDeterministic-v3](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/SpaceInvaders.gif) ![A3C LSTM playing MsPacman-v0](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/MsPacman.gif) ![A3C LSTM\ playing BeamRider-v0](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/BeamRider.gif) ![A3C LSTM playing Seaquest-v0](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/Seaquest.gif) -# NEWLY ADDED A3G!! -New implementation of A3C that utilizes GPU for speed increase in training. Which we can call **A3G**. A3G as opposed to other versions that try to utilize GPU with A3C algorithm, with A3G each agent has its own network maintained on GPU but shared model is on CPU and agent models are quickly converted to CPU to update shared model which allows updates to be frequent and fast by utilizing Hogwild Training and make updates to shared model asynchronously and without locks. This new method greatly increase training speed and models that use to take days to train can be trained in as fast as 10minutes for some Atari games! 10-15minutes for Breakout to start to score over 400! And 10mins to solve Pong! +# A3G +Implementation of A3C that utilizes the GPU to substantially speed up training, which we call **A3G**. Unlike other versions that try to pair the A3C algorithm with a GPU, in A3G each agent maintains its own network on the GPU while the shared model stays on the CPU; the agent models are quickly converted to CPU to update the shared model, which keeps updates frequent and fast, using Hogwild training to update the shared model asynchronously and without locks. This greatly increases training speed: models that used to take days to train can now be trained in as little as 10 minutes for some Atari games! 10-15 minutes for Breakout to start scoring over 400, and about 10 minutes to solve Pong! This repository includes my implementation with reinforcement learning using Asynchronous Advantage Actor-Critic (A3C) in Pytorch an algorithm from Google Deep Mind's paper "Asynchronous Methods for Deep Reinforcement Learning."
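The gradient hand-off described above works roughly as sketched below: each worker backpropagates on its own GPU copy of the model, then writes the resulting gradients into the CPU-resident shared model with no locking, Hogwild style, right before stepping the shared optimizer. `ensure_shared_grads` is the helper that train.py calls for this step; its definition is not shown in this diff, so the body here is an illustrative assumption rather than the repo's exact code.
```
def ensure_shared_grads(model, shared_model, gpu=False):
    # Illustrative sketch only: copy each worker's gradients into the
    # CPU-resident shared model. No locks are taken, so concurrent workers
    # may occasionally overwrite each other (Hogwild-style updates).
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if param.grad is None:
            continue
        if gpu:
            # Worker model lives on a GPU: bring its gradient back to the CPU.
            shared_param._grad = param.grad.cpu()
        else:
            shared_param._grad = param.grad
```
In train.py this runs right after `backward()` and right before `optimizer.step()`, so whichever worker finishes its rollout first advances the shared Adam/RMSprop statistics.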
-*See [a3c_continuous][23] a newly added repo of my A3C LSTM implementation for continuous action spaces which was able to solve BipedWalkerHardcore-v2 environment (average 300+ for 100 consecutive episodes)* +*See [a3c_continuous][23], a newly added repo of my A3C LSTM implementation for continuous action spaces, which was able to solve the BipedalWalkerHardcore-v3 environment (average 300+ over 100 consecutive episodes)* ### A3C LSTM @@ -61,23 +61,23 @@ link to the Gym environment evaluations below - Python 2.7+ - Openai Gym and Universe -- Pytorch +- Pytorch (Pytorch 2.0 has a bug where it incorrectly occupies memory on all GPUs in use when backward() is called in the training processes. This does not slow down training, but it does unnecessarily take up a lot of GPU memory. If this is a problem for you and you are running out of GPU memory, downgrade Pytorch) ## Training *When training model it is important to limit number of worker processes to number of cpu cores available as too many processes (e.g. more than one process per cpu core available) will actually be detrimental in training speed and effectiveness* -To train agent in Pong-v0 environment with 32 different worker processes: +To train an agent in the PongNoFrameskip-v4 environment with 32 different worker processes: ``` -python main.py --env Pong-v0 --workers 32 +python main.py --env PongNoFrameskip-v4 --workers 32 ``` -#A3C-GPU -*training using machine with 4 V100 GPUs and 20core CPU for PongDeterministic-v4 took 10 minutes to converge* +# A3G Training +*Training on a machine with 4 V100 GPUs and a 20-core CPU, PongNoFrameskip-v4 took 10 minutes to converge* -To train agent in PongDeterministic-v4 environment with 32 different worker processes on 4 GPUs with new A3G: +To train an agent in the PongNoFrameskip-v4 environment with 32 different worker processes on 4 GPUs with the new A3G: ``` -python main.py --env PongDeterministic-v4 --workers 32 --gpu-ids 0 1 2 3 --amsgrad True +python main.py --env PongNoFrameskip-v4 --workers 32 --gpu-ids 0 1 2 3 --amsgrad ``` @@ -88,8 +88,18 @@ Hit Ctrl C to end training session properly ## Evaluation To run a 100 episode gym evaluation with trained model ``` -python gym_eval.py --env Pong-v0 --num-episodes 100 +python gym_eval.py --env PongNoFrameskip-v4 --num-episodes 100 --new-gym-eval ``` + +## Distributed Step Size Training +Example of training an agent with different step sizes distributed across the training processes from a provided list of step sizes: +``` +python main.py --env PongNoFrameskip-v4 --workers 18 --gpu-ids 0 1 2 --amsgrad --distributed-step-size 16 32 64 --tau 0.92 +``` +Below is a graph from running the distributed step size training command above +![PongNoFrameskip DSS Training](https://github.com/dgriff777/rl_a3c_pytorch/blob/master/demo/Pong_dss_training.png) + + *Notice BeamRiderNoFrameskip-v4 reaches scores over 50,000 in less than 2hrs of training compared to the gym v0 version this shows the difficulty of those versions but also the timelimit being a major factor in score level* *These training charts were done on a DGX Station using 4GPUs and 20core Cpu.
I used 36 worker agents and a tau of 0.92 which is the lambda in Generalized Advantage Estimation equation to introduce more variance due to the more deterministic nature of using just a 4 frame skip environment and a 0-30 NoOp start* diff --git a/demo/Pong_dss_training.png b/demo/Pong_dss_training.png new file mode 100755 index 0000000..99a7d49 Binary files /dev/null and b/demo/Pong_dss_training.png differ diff --git a/environment.py b/environment.py index e54ab51..a06589d 100755 --- a/environment.py +++ b/environment.py @@ -4,7 +4,7 @@ from collections import deque from gym.spaces.box import Box #from skimage.color import rgb2gray -from cv2 import resize +from cv2 import resize, INTER_AREA #from skimage.transform import resize #from scipy.misc import imresize as resize import random @@ -30,12 +30,13 @@ def atari_env(env_id, env_conf, args): def process_frame(frame, conf): frame = frame[conf["crop1"]:conf["crop2"] + 160, :160] - frame = frame.mean(2) - frame = frame.astype(np.float32) - frame *= (1.0 / 255.0) - frame = resize(frame, (80, conf["dimension2"])) - frame = resize(frame, (80, 80)) - frame = np.reshape(frame, [1, 80, 80]) +# frame = frame.mean(2) +# frame = frame.astype(np.float32) +# frame *= (1.0 / 255.0) +# frame = resize(frame, (80, conf["dimension2"]), interpolation=INTER_AREA) + frame = resize(frame, (80, 80), interpolation=INTER_AREA) + frame = (0.2989 * frame[:,:,0] + 0.587 * frame[:,:,1] + 0.114 * frame[:,:,2]) + frame = np.reshape(frame, [1, 80, 80]).astype(np.float32) return frame @@ -64,8 +65,8 @@ def observation(self, observation): self.state_std = self.state_std * self.alpha + \ observation.std() * (1 - self.alpha) - unbiased_mean = self.state_mean / (1 - pow(self.alpha, self.num_steps)) - unbiased_std = self.state_std / (1 - pow(self.alpha, self.num_steps)) + unbiased_mean = self.state_mean / (1 - (self.alpha**self.num_steps)) + unbiased_std = self.state_std / (1 - (self.alpha**self.num_steps)) return (observation - unbiased_mean) / (unbiased_std + 1e-8) @@ -142,7 +143,7 @@ def step(self, action): # the environment advertises done. done = True self.lives = lives - return obs, reward, done, self.was_real_done + return obs, reward, done, info def reset(self, **kwargs): """Reset only when lives are exhausted. 
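A note on the `--distributed-step-size` option used in the training example earlier in the README: each training process picks its rollout length from the supplied list based on its rank, and falls back to `--num-steps` when the list is empty. The helper function below is only for illustration; train.py later in this diff inlines the same logic.
```
def rollout_length(rank, args):
    # Mirrors the selection added to train.py in this diff: worker `rank`
    # takes its number of forward steps from --distributed-step-size,
    # cycling through the list, or uses --num-steps if no list was given.
    if len(args.distributed_step_size) > 0:
        return args.distributed_step_size[rank % len(args.distributed_step_size)]
    return args.num_steps
```
With `--workers 18 --distributed-step-size 16 32 64`, workers 0, 3, 6, ... therefore roll out 16 steps per update, workers 1, 4, 7, ... roll out 32, and workers 2, 5, 8, ... roll out 64 before each update.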
diff --git a/gym_eval.py b/gym_eval.py index 19d99a7..1591376 100755 --- a/gym_eval.py +++ b/gym_eval.py @@ -10,72 +10,79 @@ import gym import logging import time -#from gym.configuration import undo_logger_setup -#undo_logger_setup() -parser = argparse.ArgumentParser(description='A3C_EVAL') +gym.logger.set_level(40) + +parser = argparse.ArgumentParser(description="A3C_EVAL") parser.add_argument( - '--env', - default='Pong-v0', - metavar='ENV', - help='environment to train on (default: Pong-v0)') + "-ev", + "--env", + default="PongNoFrameskip-v4", + help="environment to train on (default: PongNoFrameskip-v4)", +) parser.add_argument( - '--env-config', - default='config.json', - metavar='EC', - help='environment to crop and resize info (default: config.json)') + "-evc", "--env-config", + default="config.json", + help="environment to crop and resize info (default: config.json)") parser.add_argument( - '--num-episodes', + "-ne", + "--num-episodes", type=int, default=100, - metavar='NE', - help='how many episodes in evaluation (default: 100)') -parser.add_argument( - '--load-model-dir', - default='trained_models/', - metavar='LMD', - help='folder to load trained models from') + help="how many episodes in evaluation (default: 100)", +) parser.add_argument( - '--log-dir', default='logs/', metavar='LG', help='folder to save logs') + "-lmd", + "--load-model-dir", + default="trained_models/", + help="folder to load trained models from", +) +parser.add_argument("-lgd", "--log-dir", default="logs/", help="folder to save logs") parser.add_argument( - '--render', - default=False, - metavar='R', - help='Watch game as it being played') + "-r", "--render", action="store_true", help="Watch game as it being played" +) parser.add_argument( - '--render-freq', + "-rf", + "--render-freq", type=int, default=1, - metavar='RF', - help='Frequency to watch rendered game play') + help="Frequency to watch rendered game play", +) parser.add_argument( - '--max-episode-length', + "-mel", + "--max-episode-length", type=int, default=10000, - metavar='M', - help='maximum length of an episode (default: 100000)') + help="maximum length of an episode (default: 100000)", +) +parser.add_argument( + "-nge", + "--new-gym-eval", + action="store_true", + help="Create a gym evaluation for upload", +) +parser.add_argument( + "-s", "--seed", type=int, default=1, help="random seed (default: 1)" +) parser.add_argument( - '--gpu-id', + "-gid", + "--gpu-id", type=int, default=-1, - help='GPU to use [-1 CPU only] (default: -1)') + help="GPU to use [-1 CPU only] (default: -1)", +) parser.add_argument( - '--skip-rate', + "-hs", + "--hidden-size", type=int, - default=4, - metavar='SR', - help='frame skip rate (default: 4)') + default=512, + help="LSTM Cell number of features in the hidden state h", +) parser.add_argument( - '--seed', + "-sk", "--skip-rate", type=int, - default=1, - metavar='S', - help='random seed (default: 1)') -parser.add_argument( - '--new-gym-eval', - default=False, - metavar='NGE', - help='Create a gym evaluation for upload') + default=4, + help="frame skip rate (default: 4)") args = parser.parse_args() setup_json = read_config(args.env_config) @@ -84,40 +91,38 @@ if i in args.env: env_conf = setup_json[i] +saved_state = torch.load( + f"{args.load_model_dir}{args.env}.dat", map_location=lambda storage, loc: storage +) + + +setup_logger(f"{args.env}_mon_log", rf"{args.log_dir}{args.env}_mon_log") +log = logging.getLogger(f"{args.env}_mon_log") + gpu_id = args.gpu_id torch.manual_seed(args.seed) if gpu_id >= 0: 
torch.cuda.manual_seed(args.seed) -saved_state = torch.load( - '{0}{1}.dat'.format(args.load_model_dir, args.env), - map_location=lambda storage, loc: storage) - -log = {} -setup_logger('{}_mon_log'.format(args.env), r'{0}{1}_mon_log'.format( - args.log_dir, args.env)) -log['{}_mon_log'.format(args.env)] = logging.getLogger('{}_mon_log'.format( - args.env)) d_args = vars(args) for k in d_args.keys(): - log['{}_mon_log'.format(args.env)].info('{0}: {1}'.format(k, d_args[k])) + log.info(f"{k}: {d_args[k]}") -env = atari_env("{}".format(args.env), env_conf, args) +env = atari_env(f"{args.env}", env_conf, args) num_tests = 0 start_time = time.time() reward_total_sum = 0 player = Agent(None, env, args, None) -player.model = A3Clstm(player.env.observation_space.shape[0], - player.env.action_space) +player.model = A3Clstm(player.env.observation_space.shape[0], player.env.action_space, args) player.gpu_id = gpu_id if gpu_id >= 0: with torch.cuda.device(gpu_id): player.model = player.model.cuda() if args.new_gym_eval: player.env = gym.wrappers.Monitor( - player.env, "{}_monitor".format(args.env), force=True) + player.env, f"{args.env}_monitor", force=True) if gpu_id >= 0: with torch.cuda.device(gpu_id): @@ -126,38 +131,39 @@ player.model.load_state_dict(saved_state) player.model.eval() -for i_episode in range(args.num_episodes): - player.state = player.env.reset() - player.state = torch.from_numpy(player.state).float() - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - player.state = player.state.cuda() - player.eps_len += 2 - reward_sum = 0 - while True: - if args.render: - if i_episode % args.render_freq == 0: - player.env.render() - - player.action_test() - reward_sum += player.reward +try: + for i_episode in range(args.num_episodes): + player.state = player.env.reset() + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.state = torch.from_numpy(player.state).float().cuda() + else: + player.state = torch.from_numpy(player.state).float() + player.eps_len = 0 + reward_sum = 0 + while 1: + if args.render: + if i_episode % args.render_freq == 0: + player.env.render() + player.action_test() + reward_sum += player.reward + if player.done and not player.env.was_real_done: + state = player.env.reset() + player.state = torch.from_numpy(state).float() + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.state = player.state.cuda() + elif player.env.was_real_done: + num_tests += 1 + reward_total_sum += reward_sum + reward_mean = reward_total_sum / num_tests + log.info( + f"Time {time.strftime('%Hh %Mm %Ss', time.gmtime(time.time() - start_time))}, episode reward {reward_sum}, episode length {player.eps_len}, reward mean {reward_mean:.4f}" + ) + break +except KeyboardInterrupt: + print("KeyboardInterrupt exception is caught") +finally: + print("gym evalualtion process finished") - if player.done and not player.info: - state = player.env.reset() - player.eps_len += 2 - player.state = torch.from_numpy(state).float() - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - player.state = player.state.cuda() - elif player.info: - num_tests += 1 - reward_total_sum += reward_sum - reward_mean = reward_total_sum / num_tests - log['{}_mon_log'.format(args.env)].info( - "Time {0}, episode reward {1}, episode length {2}, reward mean {3:.4f}". 
- format( - time.strftime("%Hh %Mm %Ss", - time.gmtime(time.time() - start_time)), - reward_sum, player.eps_len, reward_mean)) - player.eps_len = 0 - break +player.env.close() diff --git a/main.py b/main.py index f08022d..01fb3e7 100755 --- a/main.py +++ b/main.py @@ -13,106 +13,140 @@ #from gym.configuration import undo_logger_setup import time -#undo_logger_setup() -parser = argparse.ArgumentParser(description='A3C') +#undo_logger_setup()" +parser = argparse.ArgumentParser(description="A3C") parser.add_argument( - '--lr', + "-l", "--lr", type=float, default=0.0001, help="learning rate (default: 0.0001)" +) +parser.add_argument( + "-ec", + "--entropy-coef", type=float, - default=0.0001, - metavar='LR', - help='learning rate (default: 0.0001)') + default=0.01, + help="entropy loss coefficient (default: 0.01)", +) parser.add_argument( - '--gamma', + "-vc", + "--value-coef", type=float, - default=0.99, - metavar='G', - help='discount factor for rewards (default: 0.99)') + default=0.5, + help="value coefficient (default: 0.5)", +) parser.add_argument( - '--tau', + "-g", + "--gamma", type=float, - default=1.00, - metavar='T', - help='parameter for GAE (default: 1.00)') + default=0.99, + help="discount factor for rewards (default: 0.99)", +) parser.add_argument( - '--seed', - type=int, - default=1, - metavar='S', - help='random seed (default: 1)') + "-t", "--tau", type=float, default=1.00, help="parameter for GAE (default: 1.00)" +) +parser.add_argument( + "-s", "--seed", type=int, default=1, help="random seed (default: 1)" +) parser.add_argument( - '--workers', + "-w", + "--workers", type=int, default=32, - metavar='W', - help='how many training processes to use (default: 32)') + help="how many training processes to use (default: 32)", +) parser.add_argument( - '--num-steps', + "-ns", + "--num-steps", type=int, default=20, - metavar='NS', - help='number of forward steps in A3C (default: 20)') + help="number of forward steps in A3C (default: 20)", +) parser.add_argument( - '--max-episode-length', + "-mel", + "--max-episode-length", type=int, default=10000, - metavar='M', - help='maximum length of an episode (default: 10000)') + help="maximum length of an episode (default: 10000)", +) parser.add_argument( - '--env', - default='Pong-v0', - metavar='ENV', - help='environment to train on (default: Pong-v0)') + "-ev", + "--env", + default="PongNoFrameSkip-v4", + help="environment to train on (default: PongNoFrameSkip-v4)", +) parser.add_argument( - '--env-config', - default='config.json', - metavar='EC', - help='environment to crop and resize info (default: config.json)') -parser.add_argument( - '--shared-optimizer', + "-so", + "--shared-optimizer", default=True, - metavar='SO', - help='use an optimizer without shared statistics.') -parser.add_argument( - '--load', default=False, metavar='L', help='load a trained model') -parser.add_argument( - '--save-max', - default=True, - metavar='SM', - help='Save model on every test run high score matched or bested') -parser.add_argument( - '--optimizer', - default='Adam', - metavar='OPT', - help='shares optimizer choice of Adam or RMSprop') -parser.add_argument( - '--load-model-dir', - default='trained_models/', - metavar='LMD', - help='folder to load trained models from') -parser.add_argument( - '--save-model-dir', - default='trained_models/', - metavar='SMD', - help='folder to save trained models') -parser.add_argument( - '--log-dir', default='logs/', metavar='LG', help='folder to save logs') -parser.add_argument( - '--gpu-ids', + help="use an optimizer 
with shared statistics.", +) +parser.add_argument("-ld", "--load", action="store_true", help="load a trained model") +parser.add_argument( + "-sm", + "--save-max", + action="store_true", + help="Save model on every test run high score matched or bested", +) +parser.add_argument( + "-o", + "--optimizer", + default="Adam", + choices=["Adam", "RMSprop"], + help="optimizer choice of Adam or RMSprop", +) +parser.add_argument( + "-lmd", + "--load-model-dir", + default="trained_models/", + help="folder to load trained models from", +) +parser.add_argument( + "-smd", + "--save-model-dir", + default="trained_models/", + help="folder to save trained models", +) +parser.add_argument("-lg", "--log-dir", default="logs/", help="folder to save logs") +parser.add_argument( + "-gp", + "--gpu-ids", type=int, - default=-1, - nargs='+', - help='GPUs to use [-1 CPU only] (default: -1)') + default=[-1], + nargs="+", + help="GPUs to use [-1 CPU only] (default: -1)", +) parser.add_argument( - '--amsgrad', - default=True, - metavar='AM', - help='Adam optimizer amsgrad parameter') + "-a", "--amsgrad", action="store_true", help="Adam optimizer amsgrad parameter" +) parser.add_argument( - '--skip-rate', + "--skip-rate", type=int, default=4, metavar='SR', - help='frame skip rate (default: 4)') + help="frame skip rate (default: 4)") +parser.add_argument( + "-hs", + "--hidden-size", + type=int, + default=512, + help="LSTM Cell number of features in the hidden state h", +) +parser.add_argument( + "-tl", + "--tensorboard-logger", + action="store_true", + help="Creates tensorboard logger to monitor value and policy loss", +) +parser.add_argument( + "-evc", "--env-config", + default="config.json", + help="environment to crop and resize info (default: config.json)") +parser.add_argument( + "-dss", + "--distributed-step-size", + type=int, + default=[], + nargs="+", + help="use different step size among workers by using a list of step sizes to distributed among workers to use (default: [])", +) # Based on # https://github.com/pytorch/examples/tree/master/mnist_hogwild @@ -123,22 +157,21 @@ if __name__ == '__main__': args = parser.parse_args() torch.manual_seed(args.seed) - if args.gpu_ids == -1: - args.gpu_ids = [-1] - else: + if args.gpu_ids != [-1]: torch.cuda.manual_seed(args.seed) - mp.set_start_method('spawn') + mp.set_start_method("spawn") setup_json = read_config(args.env_config) env_conf = setup_json["Default"] for i in setup_json.keys(): if i in args.env: env_conf = setup_json[i] env = atari_env(args.env, env_conf, args) - shared_model = A3Clstm(env.observation_space.shape[0], env.action_space) + shared_model = A3Clstm(env.observation_space.shape[0], env.action_space, args) if args.load: saved_state = torch.load( - '{0}{1}.dat'.format(args.load_model_dir, args.env), - map_location=lambda storage, loc: storage) + f"{args.load_model_dir}{args.env}.dat", + map_location=lambda storage, loc: storage, + ) shared_model.load_state_dict(saved_state) shared_model.share_memory() @@ -157,13 +190,13 @@ p = mp.Process(target=test, args=(args, shared_model, env_conf)) p.start() processes.append(p) - time.sleep(0.1) + time.sleep(0.001) for rank in range(0, args.workers): p = mp.Process( target=train, args=(rank, args, shared_model, optimizer, env_conf)) p.start() processes.append(p) - time.sleep(0.1) + time.sleep(0.001) for p in processes: - time.sleep(0.1) + time.sleep(0.001) p.join() diff --git a/model.py b/model.py index e5a5362..412694e 100755 --- a/model.py +++ b/model.py @@ -6,8 +6,9 @@ class A3Clstm(torch.nn.Module): - def 
__init__(self, num_inputs, action_space): + def __init__(self, num_inputs, action_space, args): super(A3Clstm, self).__init__() + self.hidden_size = args.hidden_size self.conv1 = nn.Conv2d(num_inputs, 32, 5, stride=1, padding=2) self.maxp1 = nn.MaxPool2d(2, 2) self.conv2 = nn.Conv2d(32, 32, 5, stride=1, padding=1) @@ -17,7 +18,7 @@ def __init__(self, num_inputs, action_space): self.conv4 = nn.Conv2d(64, 64, 3, stride=1, padding=1) self.maxp4 = nn.MaxPool2d(2, 2) - self.lstm = nn.LSTMCell(1024, 512) + self.lstm = nn.LSTMCell(1024, self.hidden_size) num_outputs = action_space.n self.critic_linear = nn.Linear(512, 1) self.actor_linear = nn.Linear(512, num_outputs) @@ -35,13 +36,23 @@ def __init__(self, num_inputs, action_space): self.critic_linear.weight.data, 1.0) self.critic_linear.bias.data.fill_(0) - self.lstm.bias_ih.data.fill_(0) - self.lstm.bias_hh.data.fill_(0) + for name, p in self.named_parameters(): + if "lstm" in name: + if "weight_ih" in name: + nn.init.xavier_uniform_(p.data) + elif "weight_hh" in name: + nn.init.orthogonal_(p.data) + elif "bias_ih" in name: + p.data.fill_(0) + # Set forget-gate bias to 1 + n = p.size(0) + p.data[(n // 4) : (n // 2)].fill_(1) + elif "bias_hh" in name: + p.data.fill_(0) self.train() - def forward(self, inputs): - inputs, (hx, cx) = inputs + def forward(self, inputs, hx, cx): x = F.relu(self.maxp1(self.conv1(inputs))) x = F.relu(self.maxp2(self.conv2(x))) x = F.relu(self.maxp3(self.conv3(x))) @@ -53,4 +64,4 @@ def forward(self, inputs): x = hx - return self.critic_linear(x), self.actor_linear(x), (hx, cx) + return self.critic_linear(x), self.actor_linear(x), hx, cx diff --git a/player_util.py b/player_util.py index 449ce2f..82351e8 100755 --- a/player_util.py +++ b/player_util.py @@ -1,4 +1,6 @@ from __future__ import division +import os +os.environ["OMP_NUM_THREADS"] = "1" import torch import torch.nn.functional as F from torch.autograd import Variable @@ -21,22 +23,26 @@ def __init__(self, model, env, args, state): self.info = None self.reward = 0 self.gpu_id = -1 + self.hidden_size = args.hidden_size def action_train(self): - value, logit, (self.hx, self.cx) = self.model((Variable( - self.state.unsqueeze(0)), (self.hx, self.cx))) + value, logit, self.hx, self.cx = self.model( + self.state.unsqueeze(0), self.hx, self.cx + ) prob = F.softmax(logit, dim=1) log_prob = F.log_softmax(logit, dim=1) entropy = -(log_prob * prob).sum(1) self.entropies.append(entropy) action = prob.multinomial(1).data - log_prob = log_prob.gather(1, Variable(action)) + log_prob = log_prob.gather(1, action) state, self.reward, self.done, self.info = self.env.step( - action.cpu().numpy()) - self.state = torch.from_numpy(state).float() + action.item()) if self.gpu_id >= 0: with torch.cuda.device(self.gpu_id): - self.state = self.state.cuda() + self.state = torch.from_numpy(state).float().cuda() + else: + self.state = torch.from_numpy(state).float() + self.eps_len += 1 self.reward = max(min(self.reward, 1), -1) self.values.append(value) self.log_probs.append(log_prob) @@ -48,25 +54,24 @@ def action_test(self): if self.done: if self.gpu_id >= 0: with torch.cuda.device(self.gpu_id): - self.cx = Variable( - torch.zeros(1, 512).cuda()) - self.hx = Variable( - torch.zeros(1, 512).cuda()) + self.cx = torch.zeros(1, self.hidden_size).cuda() + self.hx = torch.zeros(1, self.hidden_size).cuda() else: - self.cx = Variable(torch.zeros(1, 512)) - self.hx = Variable(torch.zeros(1, 512)) - else: - self.cx = Variable(self.cx.data) - self.hx = Variable(self.hx.data) - value, logit, (self.hx, 
self.cx) = self.model((Variable( - self.state.unsqueeze(0)), (self.hx, self.cx))) - prob = F.softmax(logit, dim=1) - action = prob.max(1)[1].data.cpu().numpy() - state, self.reward, self.done, self.info = self.env.step(action[0]) - self.state = torch.from_numpy(state).float() + self.cx = torch.zeros(1, self.hidden_size) + self.hx = torch.zeros(1, self.hidden_size) + + value, logit, self.hx, self.cx = self.model( + self.state.unsqueeze(0), self.hx, self.cx + ) + prob = F.softmax(logit, dim=1) + action = prob.cpu().numpy().argmax() + state, self.reward, self.done, self.info = self.env.step(action) if self.gpu_id >= 0: with torch.cuda.device(self.gpu_id): - self.state = self.state.cuda() + self.state = torch.from_numpy(state).float().cuda() + else: + self.state = torch.from_numpy(state).float() + self.eps_len += 1 return self diff --git a/shared_optim.py b/shared_optim.py index f69fa6c..4cb0ab2 100755 --- a/shared_optim.py +++ b/shared_optim.py @@ -3,46 +3,48 @@ import torch import torch.optim as optim from collections import defaultdict +from math import sqrt class SharedRMSprop(optim.Optimizer): - """Implements RMSprop algorithm with shared states. - """ - - def __init__(self, - params, - lr=7e-4, - alpha=0.99, - eps=0.1, - weight_decay=0, - momentum=0, - centered=False): + """Implements RMSprop algorithm with shared states.""" + + def __init__( + self, + params, + lr=7e-4, + alpha=0.99, + eps=0.1, + weight_decay=0, + momentum=0, + centered=False, + ): defaults = defaultdict( lr=lr, alpha=alpha, eps=eps, weight_decay=weight_decay, momentum=momentum, - centered=centered) + centered=centered, + ) super(SharedRMSprop, self).__init__(params, defaults) for group in self.param_groups: - for p in group['params']: + for p in group["params"]: state = self.state[p] - state['step'] = torch.zeros(1) - state['grad_avg'] = p.data.new().resize_as_(p.data).zero_() - state['square_avg'] = p.data.new().resize_as_(p.data).zero_() - state['momentum_buffer'] = p.data.new().resize_as_( - p.data).zero_() + state["step"] = torch.zeros(1) + state["grad_avg"] = p.data.new().resize_as_(p.data).zero_() + state["square_avg"] = p.data.new().resize_as_(p.data).zero_() + state["momentum_buffer"] = p.data.new().resize_as_(p.data).zero_() def share_memory(self): for group in self.param_groups: - for p in group['params']: + for p in group["params"]: state = self.state[p] - state['square_avg'].share_memory_() - state['step'].share_memory_() - state['grad_avg'].share_memory_() - state['momentum_buffer'].share_memory_() + state["square_avg"].share_memory_() + state["step"].share_memory_() + state["grad_avg"].share_memory_() + state["momentum_buffer"].share_memory_() def step(self, closure=None): """Performs a single optimization step. 
@@ -55,80 +57,80 @@ def step(self, closure=None): loss = closure() for group in self.param_groups: - for p in group['params']: + for p in group["params"]: if p.grad is None: continue grad = p.grad.data if grad.is_sparse: - raise RuntimeError( - 'RMSprop does not support sparse gradients') + raise RuntimeError("RMSprop does not support sparse gradients") state = self.state[p] - square_avg = state['square_avg'] - alpha = group['alpha'] + square_avg = state["square_avg"] + alpha = group["alpha"] - state['step'] += 1 + state["step"] += 1 - if group['weight_decay'] != 0: - grad = grad.add(group['weight_decay'], p.data) + if group["weight_decay"] != 0: + grad = grad.add(p, alpha=group["weight_decay"]) - square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad) + square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha) - if group['centered']: - grad_avg = state['grad_avg'] - grad_avg.mul_(alpha).add_(1 - alpha, grad) - avg = square_avg.addcmul(-1, grad_avg, - grad_avg).sqrt().add_( - group['eps']) + if group["centered"]: + grad_avg = state["grad_avg"] + grad_avg.mul_(alpha).add_(grad, alpha=1 - alpha) + avg = ( + square_avg.addcmul(grad_avg, grad_avg, value=-1) + .sqrt_() + .add_(group["eps"]) + ) else: - avg = square_avg.sqrt().add_(group['eps']) + avg = square_avg.sqrt().add_(group["eps"]) - if group['momentum'] > 0: - buf = state['momentum_buffer'] - buf.mul_(group['momentum']).addcdiv_(grad, avg) - p.data.add_(-group['lr'], buf) + if group["momentum"] > 0: + buf = state["momentum_buffer"] + buf.mul_(group["momentum"]).addcdiv_(grad, avg) + # Need to avoid version tracking for parameter. + p.data.add_(buf, alpha=-group["lr"]) else: - p.data.addcdiv_(-group['lr'], grad, avg) + # Need to avoid version tracking for parameter. + p.data.addcdiv_(grad, avg, value=-group["lr"]) return loss class SharedAdam(optim.Optimizer): - """Implements Adam algorithm with shared states. - """ - - def __init__(self, - params, - lr=1e-3, - betas=(0.9, 0.999), - eps=1e-3, - weight_decay=0, - amsgrad=False): + """Implements Adam algorithm with shared states.""" + + def __init__( + self, + params, + lr=1e-3, + betas=(0.9, 0.999), + eps=1e-3, + weight_decay=0, + amsgrad=False, + ): defaults = defaultdict( - lr=lr, - betas=betas, - eps=eps, - weight_decay=weight_decay, - amsgrad=amsgrad) + lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, amsgrad=amsgrad + ) super(SharedAdam, self).__init__(params, defaults) for group in self.param_groups: - for p in group['params']: + for p in group["params"]: state = self.state[p] - state['step'] = torch.zeros(1) - state['exp_avg'] = p.data.new().resize_as_(p.data).zero_() - state['exp_avg_sq'] = p.data.new().resize_as_(p.data).zero_() - state['max_exp_avg_sq'] = p.data.new().resize_as_( - p.data).zero_() + state["step"] = torch.zeros(1) + state["exp_avg"] = p.data.new().resize_as_(p.data).zero_() + state["exp_avg_sq"] = p.data.new().resize_as_(p.data).zero_() + state["max_exp_avg_sq"] = p.data.new().resize_as_(p.data).zero_() def share_memory(self): for group in self.param_groups: - for p in group['params']: + for p in group["params"]: state = self.state[p] - state['step'].share_memory_() - state['exp_avg'].share_memory_() - state['exp_avg_sq'].share_memory_() - state['max_exp_avg_sq'].share_memory_() + state["step"].share_memory_() + state["exp_avg"].share_memory_() + state["exp_avg_sq"].share_memory_() + state["max_exp_avg_sq"].share_memory_() def step(self, closure=None): """Performs a single optimization step. 
@@ -141,46 +143,51 @@ def step(self, closure=None): loss = closure() for group in self.param_groups: - for p in group['params']: + for p in group["params"]: if p.grad is None: continue grad = p.grad.data if grad.is_sparse: raise RuntimeError( - 'Adam does not support sparse gradients, please consider SparseAdam instead' + "Adam does not support sparse gradients, please consider SparseAdam instead" ) - amsgrad = group['amsgrad'] + amsgrad = group["amsgrad"] state = self.state[p] - exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] + exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"] if amsgrad: - max_exp_avg_sq = state['max_exp_avg_sq'] - beta1, beta2 = group['betas'] + max_exp_avg_sq = state["max_exp_avg_sq"] + beta1, beta2 = group["betas"] - state['step'] += 1 + state["step"] += 1 - if group['weight_decay'] != 0: - grad = grad.add(group['weight_decay'], p.data) + if group["weight_decay"] != 0: + grad = grad.add(group["weight_decay"], p.data) # Decay the first and second moment running average coefficient - exp_avg.mul_(beta1).add_(1 - beta1, grad) - exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) + exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) + exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) + step_t = state["step"].item() + bias_correction1 = 1 - beta1**step_t + bias_correction2 = 1 - beta2**step_t + + step_size = group["lr"] / bias_correction1 + + bias_correction2_sqrt = sqrt(bias_correction2) if amsgrad: - # Maintains the maximum of all 2nd moment running avg. till - # now - torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq) + # Maintains the maximum of all 2nd moment running avg. till now + torch.maximum(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq) # Use the max. for normalizing running avg. 
of gradient - denom = max_exp_avg_sq.sqrt().add_(group['eps']) + denom = (max_exp_avg_sq.sqrt() / bias_correction2_sqrt).add_( + group["eps"] + ) else: - denom = exp_avg_sq.sqrt().add_(group['eps']) - - bias_correction1 = 1 - beta1**state['step'].item() - bias_correction2 = 1 - beta2**state['step'].item() - step_size = group['lr'] * \ - math.sqrt(bias_correction2) / bias_correction1 + denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_( + group["eps"] + ) - p.data.addcdiv_(-step_size, exp_avg, denom) + p.data.addcdiv_(exp_avg, denom, value=-step_size) return loss diff --git a/test.py b/test.py index 1559bda..f8822d2 100755 --- a/test.py +++ b/test.py @@ -1,4 +1,6 @@ from __future__ import division +import os +os.environ["OMP_NUM_THREADS"] = "1" from setproctitle import setproctitle as ptitle import torch from environment import atari_env @@ -11,16 +13,13 @@ def test(args, shared_model, env_conf): - ptitle('Test Agent') + ptitle("Test Agent") gpu_id = args.gpu_ids[-1] - log = {} - setup_logger('{}_log'.format(args.env), r'{0}{1}_log'.format( - args.log_dir, args.env)) - log['{}_log'.format(args.env)] = logging.getLogger('{}_log'.format( - args.env)) + setup_logger(f"{args.env}_log", rf"{args.log_dir}{args.env}_log") + log = logging.getLogger(f"{args.env}_log") d_args = vars(args) for k in d_args.keys(): - log['{}_log'.format(args.env)].info('{0}: {1}'.format(k, d_args[k])) + log.info(f"{k}: {d_args[k]}") torch.manual_seed(args.seed) if gpu_id >= 0: @@ -32,68 +31,85 @@ def test(args, shared_model, env_conf): reward_total_sum = 0 player = Agent(None, env, args, None) player.gpu_id = gpu_id - player.model = A3Clstm(player.env.observation_space.shape[0], - player.env.action_space) + player.model = A3Clstm(player.env.observation_space.shape[0], player.env.action_space, args) + + if args.tensorboard_logger: + from torch.utils.tensorboard import SummaryWriter + dummy_input = (torch.zeros(1, player.env.observation_space.shape[0], 80, 80), torch.zeros(1, args.hidden_size), torch.zeros(1, args.hidden_size), ) + writer = SummaryWriter(f"runs/{args.env}_training") + writer.add_graph(player.model, dummy_input, False) + writer.close() player.state = player.env.reset() - player.eps_len += 2 - player.state = torch.from_numpy(player.state).float() if gpu_id >= 0: with torch.cuda.device(gpu_id): player.model = player.model.cuda() - player.state = player.state.cuda() + player.state = torch.from_numpy(player.state).float().cuda() + else: + player.state = torch.from_numpy(player.state).float() + flag = True max_score = 0 - while True: - if flag: - if gpu_id >= 0: - with torch.cuda.device(gpu_id): + try: + while 1: + if player.done: + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.model.load_state_dict(shared_model.state_dict()) + else: player.model.load_state_dict(shared_model.state_dict()) - else: - player.model.load_state_dict(shared_model.state_dict()) - player.model.eval() - flag = False - player.action_test() - reward_sum += player.reward + player.action_test() + reward_sum += player.reward - if player.done and not player.info: - state = player.env.reset() - player.eps_len += 2 - player.state = torch.from_numpy(state).float() - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - player.state = player.state.cuda() - elif player.info: - flag = True - num_tests += 1 - reward_total_sum += reward_sum - reward_mean = reward_total_sum / num_tests - log['{}_log'.format(args.env)].info( - "Time {0}, episode reward {1}, episode length {2}, reward mean {3:.4f}". 
- format( - time.strftime("%Hh %Mm %Ss", - time.gmtime(time.time() - start_time)), - reward_sum, player.eps_len, reward_mean)) - - if args.save_max and reward_sum >= max_score: - max_score = reward_sum + if player.done and not player.env.was_real_done: + state = player.env.reset() + player.state = torch.from_numpy(state).float() if gpu_id >= 0: with torch.cuda.device(gpu_id): + player.state = player.state.cuda() + elif player.done and player.env.was_real_done: + num_tests += 1 + reward_total_sum += reward_sum + reward_mean = reward_total_sum / num_tests + log.info( + f'Time {time.strftime("%Hh %Mm %Ss", time.gmtime(time.time() - start_time))}, episode reward {reward_sum}, episode length {player.eps_len}, reward mean {reward_mean:.4f}' + ) + if args.tensorboard_logger: + writer.add_scalar( + f"{args.env}_Episode_Rewards", reward_sum, num_tests + ) + for name, weight in player.model.named_parameters(): + writer.add_histogram(name, weight, num_tests) + if (args.save_max and reward_sum >= max_score) or not args.save_max: + if reward_sum >= max_score: + max_score = reward_sum + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + state_to_save = player.model.state_dict() + torch.save( + state_to_save, f"{args.save_model_dir}{args.env}.dat" + ) + else: state_to_save = player.model.state_dict() - torch.save(state_to_save, '{0}{1}.dat'.format( - args.save_model_dir, args.env)) + torch.save( + state_to_save, f"{args.save_model_dir}{args.env}.dat" + ) + + reward_sum = 0 + player.eps_len = 0 + state = player.env.reset() + time.sleep(60) + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.state = torch.from_numpy(state).float().cuda() else: - state_to_save = player.model.state_dict() - torch.save(state_to_save, '{0}{1}.dat'.format( - args.save_model_dir, args.env)) + player.state = torch.from_numpy(state).float() - reward_sum = 0 - player.eps_len = 0 - state = player.env.reset() - player.eps_len += 2 - time.sleep(10) - player.state = torch.from_numpy(state).float() - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - player.state = player.state.cuda() + except KeyboardInterrupt: + time.sleep(0.01) + print("KeyboardInterrupt exception is caught") + finally: + print("test agent process finished") + if args.tensorboard_logger: + writer.close() diff --git a/train.py b/train.py index 916d95f..c9ea696 100755 --- a/train.py +++ b/train.py @@ -1,4 +1,6 @@ from __future__ import division +import os +os.environ["OMP_NUM_THREADS"] = "1" from setproctitle import setproctitle as ptitle import torch import torch.optim as optim @@ -7,14 +9,15 @@ from model import A3Clstm from player_util import Agent from torch.autograd import Variable - +import time def train(rank, args, shared_model, optimizer, env_conf): - ptitle('Training Agent: {}'.format(rank)) + ptitle(f"Train Agent: {rank}") gpu_id = args.gpu_ids[rank % len(args.gpu_ids)] torch.manual_seed(args.seed + rank) if gpu_id >= 0: torch.cuda.manual_seed(args.seed + rank) + hidden_size = args.hidden_size env = atari_env(args.env, env_conf, args) if optimizer is None: if args.optimizer == 'RMSprop': @@ -25,82 +28,95 @@ def train(rank, args, shared_model, optimizer, env_conf): env.seed(args.seed + rank) player = Agent(None, env, args, None) player.gpu_id = gpu_id - player.model = A3Clstm(player.env.observation_space.shape[0], - player.env.action_space) + player.model = A3Clstm(player.env.observation_space.shape[0], player.env.action_space, args) player.state = player.env.reset() - player.state = torch.from_numpy(player.state).float() if gpu_id >= 0: with 
torch.cuda.device(gpu_id): - player.state = player.state.cuda() + player.state = torch.from_numpy(player.state).float().cuda() player.model = player.model.cuda() + else: + player.state = torch.from_numpy(player.state).float() player.model.train() - player.eps_len += 2 - while True: - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - player.model.load_state_dict(shared_model.state_dict()) - else: - player.model.load_state_dict(shared_model.state_dict()) - if player.done: + if len(args.distributed_step_size) > 0: + num_steps = args.distributed_step_size[rank%len(args.distributed_step_size)] + else: + num_steps = args.num_steps + try: + while 1: if gpu_id >= 0: with torch.cuda.device(gpu_id): - player.cx = Variable(torch.zeros(1, 512).cuda()) - player.hx = Variable(torch.zeros(1, 512).cuda()) + player.model.load_state_dict(shared_model.state_dict()) + else: + player.model.load_state_dict(shared_model.state_dict()) + if player.done: + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.cx = torch.zeros(1, hidden_size).cuda() + player.hx = torch.zeros(1, hidden_size).cuda() + else: + player.cx = torch.zeros(1, hidden_size) + player.hx = torch.zeros(1, hidden_size) else: - player.cx = Variable(torch.zeros(1, 512)) - player.hx = Variable(torch.zeros(1, 512)) - else: - player.cx = Variable(player.cx.data) - player.hx = Variable(player.hx.data) + player.cx = player.cx.data + player.hx = player.hx.data + for step in range(num_steps): + player.action_train() + + if player.done: + break - for step in range(args.num_steps): - player.action_train() if player.done: - break + player.eps_len = 0 + state = player.env.reset() + if gpu_id >= 0: + with torch.cuda.device(gpu_id): + player.state = torch.from_numpy(state).float().cuda() + else: + player.state = torch.from_numpy(state).float() - if player.done: - state = player.env.reset() - player.state = torch.from_numpy(state).float() if gpu_id >= 0: with torch.cuda.device(gpu_id): - player.state = player.state.cuda() - - R = torch.zeros(1, 1) - if not player.done: - value, _, _ = player.model((Variable(player.state.unsqueeze(0)), - (player.hx, player.cx))) - R = value.data - - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - R = R.cuda() - - player.values.append(Variable(R)) - policy_loss = 0 - value_loss = 0 - gae = torch.zeros(1, 1) - if gpu_id >= 0: - with torch.cuda.device(gpu_id): - gae = gae.cuda() - R = Variable(R) - for i in reversed(range(len(player.rewards))): - R = args.gamma * R + player.rewards[i] - advantage = R - player.values[i] - value_loss = value_loss + 0.5 * advantage.pow(2) - - # Generalized Advantage Estimataion - delta_t = player.rewards[i] + args.gamma * \ - player.values[i + 1].data - player.values[i].data + R = torch.zeros(1, 1).cuda() + gae = torch.zeros(1, 1).cuda() + else: + R = torch.zeros(1, 1) + gae = torch.zeros(1, 1) + if not player.done: + state = player.state + value, _, _, _ = player.model( + state.unsqueeze(0), player.hx, player.cx + ) + R = value.detach() + player.values.append(R) + policy_loss = 0 + value_loss = 0 + for i in reversed(range(len(player.rewards))): + R = args.gamma * R + player.rewards[i] + advantage = R - player.values[i] + value_loss = value_loss + 0.5 * advantage.pow(2) - gae = gae * args.gamma * args.tau + delta_t + # Generalized Advantage Estimataion + delta_t = ( + player.rewards[i] + + args.gamma * player.values[i + 1].data + - player.values[i].data + ) - policy_loss = policy_loss - \ - player.log_probs[i] * \ - Variable(gae) - 0.01 * player.entropies[i] + gae = gae * args.gamma * 
args.tau + delta_t + policy_loss = ( + policy_loss + - (player.log_probs[i] * gae) + - (args.entropy_coef * player.entropies[i]) + ) - player.model.zero_grad() - (policy_loss + 0.5 * value_loss).backward() - ensure_shared_grads(player.model, shared_model, gpu=gpu_id >= 0) - optimizer.step() - player.clear_actions() + player.model.zero_grad() + (policy_loss + 0.5 * value_loss).backward() + ensure_shared_grads(player.model, shared_model, gpu=gpu_id >= 0) + optimizer.step() + player.clear_actions() + except KeyboardInterrupt: + time.sleep(0.01) + print("KeyboardInterrupt exception is caught") + finally: + print(f"train agent {rank} process finished")
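To view what the new `--tensorboard-logger` option records (episode rewards and per-parameter weight histograms from test.py, written by `SummaryWriter` to `runs/<env>_training`), point TensorBoard at that directory. This assumes the `tensorboard` package is installed; it is not added as a requirement by this diff.
```
tensorboard --logdir runs
```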