[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all #23582
Comments
Update: installed the latest version of TensorBoardX, and the problem persists.
Hi @Arcadianlee, can you try running your training using tune.run()?
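For later readers, here is a minimal sketch of what running the same training through Tune could look like with the Ray 1.11-era API (FdtdEnv is the reporter's custom env; the stopping condition and paths are illustrative, not the commenter's exact suggestion):

import ray
from ray import tune
from ray.rllib.agents.dqn import DEFAULT_CONFIG

ray.init()

# Let Tune drive the training loop so its loggers populate result.json,
# progress.csv and the TF event files under the trial directory.
config = DEFAULT_CONFIG.copy()
config["env"] = FdtdEnv        # custom env from the report (assumed importable here)
config["framework"] = "torch"

tune.run(
    "DQN",
    config=config,
    stop={"training_iteration": 500},  # illustrative stopping condition
    local_dir="~/ray_results",         # default location; point TensorBoard here
)

Once the trial directory is populated, TensorBoard can be started with tensorboard --logdir ~/ray_results.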
You and I may be having similar issues. After following @krfricke's comment, please also make sure … Unfortunately the dependencies for tensorflow work out in a way such that we are stuck with a problematic release of …

If you find you're still having issues after that, then I think you and I may be experiencing the same or related issues. In my case, however, I get some initial data in the TF event files and progress.csv, but after a seemingly arbitrary number of iterations all of the files under the trial directory are present but completely empty. I've attached some screenshots to demonstrate how data was being recorded in TensorBoard and in the TF event files/progress.csv up to a point, but now all the files are mysteriously empty. I will try to get a minimum working example script attached to this thread soon - FWIW I always use …

This was not an issue I experienced with Ray 2.6 or below. Please LMK if this seems similar to your issue, otherwise I will open a separate issue for the problems I'm having.
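Since the comment above points at a dependency problem, it can help to record the exact versions in play. A generic diagnostic sketch (not part of the original thread; the package list is just a guess at the relevant ones):

# Print the versions of packages most relevant to RLlib's TensorBoard logging.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("ray", "tensorboard", "tensorboardX", "tensorflow", "protobuf", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")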
Also here are segments of the tracebacks from tune - all for the same experiment. This trial errored initially for other reasons, but when …

2023-10-19 23:38:55,946 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::VFDPPO.restore() (pid=2211586, ip=192.168.86.56, actor_id=f26abfbcb38b2bc4534378d401000000, repr=VFDPPO)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 976, in restore
self.load_checkpoint(checkpoint_dir)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/corl/experiments/rllib_experiment.py", line 528, in load_checkpoint
super(trainer_class, cls).load_checkpoint(checkpoint_path) # type: ignore
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2152, in load_checkpoint
self.__setstate__(checkpoint_data)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2595, in __setstate__
self.workers.local_worker().set_state(state["worker"])
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1454, in set_state
self.policy_map[pid].set_state(policy_state)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_mixins.py", line 114, in set_state
super().set_state(state)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1091, in set_state
super().set_state(state)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 1059, in set_state
policy_spec = PolicySpec.deserialize(state["policy_spec"])
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 161, in deserialize
policy_class = get_policy_class(spec["policy_class"])
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/registry.py", line 451, in get_policy_class
module = importlib.import_module("ray.rllib.algorithms." + path)
TypeError: can only concatenate str (not "ABCMeta") to str
Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:38:55. Total running time: 26min 52s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt

2023-10-19 23:39:34,036 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::VFDPPO.restore() (pid=2213824, ip=192.168.86.56, actor_id=06d75ce2034489b567e20a3d01000000, repr=VFDPPO)
File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 954, in restore
if not _exists_at_fs_path(checkpoint.filesystem, checkpoint.path):
AttributeError: 'NoneType' object has no attribute 'filesystem'
Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:39:34. Total running time: 27min 30s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt

Be aware that the …
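For readers skimming the first traceback: the final TypeError comes from get_policy_class concatenating a module path string with whatever is stored in spec["policy_class"], and here that value is evidently a class object rather than a string. A tiny standalone sketch of that failure mode (the class name is hypothetical):

import abc
import importlib

class VFDPPOTorchPolicy(metaclass=abc.ABCMeta):
    """Stand-in for a custom policy class that ended up in the checkpoint's policy_spec."""

def load_policy(path):
    # Mirrors the string concatenation in ray.rllib.algorithms.registry.get_policy_class;
    # it only works when `path` is a dotted module-path string.
    return importlib.import_module("ray.rllib.algorithms." + path)

try:
    load_policy(VFDPPOTorchPolicy)  # a class object sneaks in where a string is expected
except TypeError as exc:
    print(exc)  # -> can only concatenate str (not "ABCMeta") to str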
I am beginning to suspect that tune.run() isn't compatible with TensorBoard.
Search before asking
Ray Component
RLlib
Issue Severity
High: It blocks me to complete my task.
What happened + What you expected to happen
Hi, my issue is that the result files (e.g. result.json, progress.csv) inside the ray_results folder are all empty, meaning no training data was saved for TensorBoard during the training process. As a result, TensorBoard plots nothing at all. I'm using RLlib's DQN trainer (the torch version) with TensorBoard (not TensorBoardX) on a Linux CentOS machine. Any ideas why this happens?
code:
import ray.rllib.agents.dqn as dqn  # Ray 1.11-style import (assumed; not shown in the report)

trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)
num_episodes = 500
reward_threshold = 1000.0
for i_episode in range(num_episodes):
    result = trainer.train()  # loop body omitted in the report; a training step presumably goes here
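If the manual loop above is kept, one way to narrow things down is to check whether train() actually returns results and where the trainer says it is logging; a small diagnostic sketch, assuming the same trainer object as above:

# Diagnostic sketch: confirm results are produced and locate the log directory
# that TensorBoard should be pointed at.
result = trainer.train()
print("logdir:", trainer.logdir)
print("training_iteration:", result.get("training_iteration"))
print("episode_reward_mean:", result.get("episode_reward_mean"))

If these prints show data while result.json and progress.csv stay empty, the problem is on the logging side rather than in training itself.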
Versions / Dependencies
TensorBoard 2.6.0
Python 3.9.7
Ray 1.11.0
CentOS 7.9.2009
Reproduction script
import ray.rllib.agents.dqn as dqn  # Ray 1.11-style import (assumed; not shown in the report)

# FdtdEnv and config come from the reporter's setup and are not included in the issue.
trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)

# main training loop
num_episodes = 500
tempRew = -1000
lastScore = 0
maxScore = []
reward_threshold = 1000.0
for i_episode in range(num_episodes):
    result = trainer.train()  # loop body truncated in the original report
Anything else
No response
Are you willing to submit a PR?