
[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all #23582

Open
2 tasks done
Arcadianlee opened this issue Mar 30, 2022 · 5 comments
Labels
  • bug: Something that is supposed to be working, but isn't
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • P2: Important issue, but not time-critical
  • rllib: RLlib related issues

Comments

@Arcadianlee

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

Issue Severity

High: It blocks me from completing my task.

What happened + What you expected to happen

Hi, my issue is that the result files (e.g. result.json, progress.csv) inside the ray_results folder are all empty, meaning no training data was saved during the training process. This leads to TensorBoard not plotting anything at all. I'm using RLlib's DQN trainer (the torch version) with TensorBoard (not TensorBoardX) on a Linux CentOS machine. Any ideas why this happens?

code:

from ray.rllib.agents import dqn
from ray.tune.logger import pretty_print

# FdtdEnv (custom env) and config are defined elsewhere in the full script.
trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)
num_episodes = 500
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        # next_state is defined elsewhere in the full script
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break

[Two screenshots attached showing the empty result files in the ray_results trial directory]
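
To confirm whether the loggers are writing anything at all, one option is to inspect the newest trial directory directly. Below is a minimal diagnostic sketch (not part of the original report) that assumes the default ~/ray_results output location:

from pathlib import Path

ray_results = Path.home() / "ray_results"
# Find the most recently written result.json anywhere under ray_results.
result_files = list(ray_results.rglob("result.json"))
if not result_files:
    print("No result.json found under", ray_results)
else:
    latest_dir = max(result_files, key=lambda p: p.stat().st_mtime).parent
    print("Latest trial dir:", latest_dir)
    for name in ["result.json", "progress.csv"]:
        f = latest_dir / name
        status = f"{f.stat().st_size} bytes" if f.exists() else "missing"
        print(" ", name + ":", status)
    # TensorBoard reads these event files; zero-byte files mean nothing was logged.
    for ev in latest_dir.glob("events.out.tfevents.*"):
        print(" ", ev.name + ":", ev.stat().st_size, "bytes")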

Versions / Dependencies

TensorBoard 2.6.0
Python 3.9.7
Ray 1.11.0
CentOS 7.9.2009

Reproduction script

# FdtdEnv (custom env) and config are defined elsewhere in the full script.
trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)

# main training loop
num_episodes = 500
tempRew = -1000
lastScore = 0
maxScore = []
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        # next_state is defined elsewhere in the full script
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Arcadianlee added the bug and triage labels on Mar 30, 2022
@Arcadianlee changed the title from "[Bug] Tensorboard w/ RLlib not plotting any data at all" to "[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all" on Mar 30, 2022
@Arcadianlee
Author

Update: I installed the latest version of TensorBoardX, and the problem persists.

@krfricke added the rllib label on Apr 4, 2022
@krfricke
Contributor

krfricke commented Apr 4, 2022

Hi @Arcadianlee, can you try running your training using tune.run()? See e.g. https://docs.ray.io/en/latest/rllib/rllib-training.html#basic-python-api
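
For Ray 1.11, a minimal sketch of what that could look like (assuming FdtdEnv and the DQN config dict from the snippet above are already defined; this is an illustration, not a drop-in replacement):

from ray import tune

tune.run(
    "DQN",                                 # registered RLlib trainable
    config={**config, "env": FdtdEnv, "framework": "torch"},
    stop={"episode_reward_mean": 1000.0},  # same threshold as the manual loop
    checkpoint_freq=25,
    local_dir="~/ray_results",             # where result.json / progress.csv / TF event files land
)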

@gjoliver added the needs-repro-script and P2 labels and removed the triage label on Apr 8, 2022
@gresavage

You and I may be having similar issues.

After following @krfricke's comment, please also make sure builder.py is present in the protobuf site-packages. There is an issue with protobuf<3.20 where builder.py is missing, which ultimately causes the TF event files to be empty. See here for more details.

Unfortunately, the dependencies for tensorflow work out in a way such that we are stuck with a problematic release of protobuf.
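
A quick way to check (a sketch, not part of the original comment; it relies only on the standard google.protobuf package layout):

import importlib.util
from pathlib import Path

import google.protobuf

print("protobuf version:", google.protobuf.__version__)

# builder.py lives at google/protobuf/internal/builder.py in releases that include it.
spec = importlib.util.find_spec("google.protobuf.internal.builder")
if spec is None or spec.origin is None or not Path(spec.origin).exists():
    print("builder.py is missing from this protobuf install")
else:
    print("builder.py found at", spec.origin)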

If you find you're still having issues after that, then I think you and I may be experiencing the same or related issues. In my case, however, I get some initial data in the TF event files and progress.csv, but after a seemingly arbitrary number of iterations all of the files under the trial directory are present but completely empty. I've attached some screenshots to demonstrate how data was being recorded in TensorBoard and the TF event files/progress.csv up to a point, but now all the files are mysteriously empty.

I will try to get a minimal working example script attached to this thread soon. FWIW, I always use tune.run() for my experiments and have recently been doing a lot of testing with the new RL Module and Learner APIs... I cannot remember at this moment whether the issue occurs under the old ModelV2 API.

This was not an issue I experienced with Ray 2.6 or below. Please LMK if this seems similar to your issue; otherwise I will open a separate issue for the problems I'm having.

[Two screenshots attached]

@gresavage

gresavage commented Oct 20, 2023

Also, here are segments of the tracebacks from Tune, all for the same experiment. This trial errored initially for other reasons, but when Tune tried to resume/continue it, these different errors resulted. I think it's important to note that even if one of my Tune trials doesn't error, the aforementioned files are empty, and that these errors refer to the existence of checkpoint files and some strange behavior with importlib, so they may be a useful breadcrumb:

2023-10-19 23:38:55,946 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::VFDPPO.restore() (pid=2211586, ip=192.168.86.56, actor_id=f26abfbcb38b2bc4534378d401000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 976, in restore
    self.load_checkpoint(checkpoint_dir)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/corl/experiments/rllib_experiment.py", line 528, in load_checkpoint
    super(trainer_class, cls).load_checkpoint(checkpoint_path)  # type: ignore
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2152, in load_checkpoint
    self.__setstate__(checkpoint_data)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2595, in __setstate__
    self.workers.local_worker().set_state(state["worker"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1454, in set_state
    self.policy_map[pid].set_state(policy_state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_mixins.py", line 114, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1091, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 1059, in set_state
    policy_spec = PolicySpec.deserialize(state["policy_spec"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 161, in deserialize
    policy_class = get_policy_class(spec["policy_class"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/registry.py", line 451, in get_policy_class
    module = importlib.import_module("ray.rllib.algorithms." + path)
TypeError: can only concatenate str (not "ABCMeta") to str

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:38:55. Total running time: 26min 52s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt
2023-10-19 23:39:34,036 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::VFDPPO.restore() (pid=2213824, ip=192.168.86.56, actor_id=06d75ce2034489b567e20a3d01000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 954, in restore
    if not _exists_at_fs_path(checkpoint.filesystem, checkpoint.path):
AttributeError: 'NoneType' object has no attribute 'filesystem'

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:39:34. Total running time: 27min 30s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt

Be aware that the error.txt file mentioned is also empty.

@Arcadianlee
Author

I am beginning to suspect that tune.run() isn't compatible with tensorboard.
