
[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all #23582

Open
2 tasks done
Arcadianlee opened this issue Mar 30, 2022 · 5 comments
Labels
  • bug: Something that is supposed to be working, but isn't
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • P2: Important issue, but not time-critical
  • rllib: RLlib related issues

Comments

@Arcadianlee

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

Issue Severity

High: It blocks me from completing my task.

What happened + What you expected to happen

Hi, my issue is that the result files (e.g. result.json, progress.csv) inside the ray_results folder are all empty, meaning no training data was saved during the training process. This leads to TensorBoard not plotting anything at all. I'm using RLlib's DQN trainer (the torch version) with TensorBoard (not TensorBoardX) on a Linux CentOS machine. Any ideas why this happens?

code:

from ray.rllib.agents import dqn
from ray.tune.logger import pretty_print

# FdtdEnv (custom env) and config are defined elsewhere in the full script.
trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)
num_episodes = 500
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        # next_state is defined elsewhere in the full script
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break

[Two screenshots attached showing the empty result files in the ray_results trial directory]
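
To confirm whether the loggers are writing anything at all, one option is to inspect the newest trial directory directly. Below is a minimal diagnostic sketch (not part of the original report) that assumes the default ~/ray_results output location:

from pathlib import Path

ray_results = Path.home() / "ray_results"
# Find the most recently written result.json anywhere under ray_results.
result_files = list(ray_results.rglob("result.json"))
if not result_files:
    print("No result.json found under", ray_results)
else:
    latest_dir = max(result_files, key=lambda p: p.stat().st_mtime).parent
    print("Latest trial dir:", latest_dir)
    for name in ["result.json", "progress.csv"]:
        f = latest_dir / name
        status = f"{f.stat().st_size} bytes" if f.exists() else "missing"
        print(" ", name + ":", status)
    # TensorBoard reads these event files; zero-byte files mean nothing was logged.
    for ev in latest_dir.glob("events.out.tfevents.*"):
        print(" ", ev.name + ":", ev.stat().st_size, "bytes")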

Versions / Dependencies

TensorBoard 2.6.0
Python 3.9.7
Ray 1.11.0
CentOS 7.9.2009

Reproduction script

# FdtdEnv (custom env) and config are defined elsewhere in the full script.
trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)

# main training loop
num_episodes = 500
tempRew = -1000
lastScore = 0
maxScore = []
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        # next_state is defined elsewhere in the full script
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Arcadianlee added the bug and triage labels on Mar 30, 2022
@Arcadianlee changed the title from "[Bug] Tensorboard w/ RLlib not plotting any data at all" to "[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all" on Mar 30, 2022
@Arcadianlee
Author

Update: I installed the latest version of TensorBoardX, and the problem persists.

@krfricke added the rllib label on Apr 4, 2022
@krfricke
Contributor

krfricke commented Apr 4, 2022

Hi @Arcadianlee, can you try running your training using tune.run()? See e.g. https://docs.ray.io/en/latest/rllib/rllib-training.html#basic-python-api
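
For Ray 1.11, a minimal sketch of what that could look like (assuming FdtdEnv and the DQN config dict from the snippet above are already defined; this is an illustration, not a drop-in replacement):

from ray import tune

tune.run(
    "DQN",                                 # registered RLlib trainable
    config={**config, "env": FdtdEnv, "framework": "torch"},
    stop={"episode_reward_mean": 1000.0},  # same threshold as the manual loop
    checkpoint_freq=25,
    local_dir="~/ray_results",             # where result.json / progress.csv / TF event files land
)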

@gjoliver added the needs-repro-script and P2 labels and removed the triage label on Apr 8, 2022
@gresavage

You and I may be having similar issues.

After following @krfricke's comment, please also make sure builder.py is present in the protobuf site-packages. There is an issue with protobuf<3.20 where builder.py is missing, which ultimately causes the TF event files to be empty. See here for more details.

Unfortunately, the dependencies for tensorflow work out in a way such that we are stuck with a problematic release of protobuf.
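
A quick way to check (a sketch, not part of the original comment; it relies only on the standard google.protobuf package layout):

import importlib.util
from pathlib import Path

import google.protobuf

print("protobuf version:", google.protobuf.__version__)

# builder.py lives at google/protobuf/internal/builder.py in releases that include it.
spec = importlib.util.find_spec("google.protobuf.internal.builder")
if spec is None or spec.origin is None or not Path(spec.origin).exists():
    print("builder.py is missing from this protobuf install")
else:
    print("builder.py found at", spec.origin)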

If you find you're still having issues after that, then I think you and I may be experiencing the same or related issues. In my case, however, I get some initial data in the TF event files and progress.csv, but after a seemingly arbitrary number of iterations all of the files under the trial directory are present but completely empty. I've attached some screenshots to demonstrate how data was being recorded in TensorBoard and the TF event files/progress.csv up to a point, but now all the files are mysteriously empty.

I will try to get a minimal working example script attached to this thread soon. FWIW, I always use tune.run() for my experiments and have recently been doing a lot of testing with the new RL Module and Learner APIs... I cannot remember at this moment whether the issue occurs under the old ModelV2 API.

This was not an issue I experienced with Ray 2.6 or below. Please LMK if this seems similar to your issue; otherwise I will open a separate issue for the problems I'm having.

[Two screenshots attached]

@gresavage

gresavage commented Oct 20, 2023

Also, here are segments of the tracebacks from Tune, all for the same experiment. This trial errored initially for other reasons, but when Tune tried to resume/continue it, these different errors resulted. I think it's important to note that even if one of my Tune trials doesn't error, the aforementioned files are empty, and that these errors refer to the existence of checkpoint files and some strange behavior with importlib, so they may be a useful breadcrumb:

2023-10-19 23:38:55,946 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::VFDPPO.restore() (pid=2211586, ip=192.168.86.56, actor_id=f26abfbcb38b2bc4534378d401000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 976, in restore
    self.load_checkpoint(checkpoint_dir)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/corl/experiments/rllib_experiment.py", line 528, in load_checkpoint
    super(trainer_class, cls).load_checkpoint(checkpoint_path)  # type: ignore
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2152, in load_checkpoint
    self.__setstate__(checkpoint_data)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2595, in __setstate__
    self.workers.local_worker().set_state(state["worker"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1454, in set_state
    self.policy_map[pid].set_state(policy_state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_mixins.py", line 114, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1091, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 1059, in set_state
    policy_spec = PolicySpec.deserialize(state["policy_spec"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 161, in deserialize
    policy_class = get_policy_class(spec["policy_class"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/registry.py", line 451, in get_policy_class
    module = importlib.import_module("ray.rllib.algorithms." + path)
TypeError: can only concatenate str (not "ABCMeta") to str

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:38:55. Total running time: 26min 52s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt
2023-10-19 23:39:34,036 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::VFDPPO.restore() (pid=2213824, ip=192.168.86.56, actor_id=06d75ce2034489b567e20a3d01000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 954, in restore
    if not _exists_at_fs_path(checkpoint.filesystem, checkpoint.path):
AttributeError: 'NoneType' object has no attribute 'filesystem'

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:39:34. Total running time: 27min 30s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt

Be aware that the error.txt file mentioned is also empty.

@Arcadianlee
Author

I am beginning to suspect that tune.run() isn't compatible with tensorboard.
