Skip to content

Make __del__ call shutdown and close in DataCollectors and Envs #209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 24, 2022

Conversation

OdedKrams
Copy link
Contributor

Rearrange the code and remove unneeded error handling so shutdown and close functions will be called from the del function.
From the unit tests it seems that:

  • Envs don't need to be close nor deleted before getting out of scope. The del and the GC are doing the job well.
  • In DataCollectors it seems there is some race condition in the GC so that the internal Env may be deleted before the DataCollector. So the DataCollector must be deleted before getting out of scope. Its del function delete and remove all objects in the proper way. Without calling del collector in the end to the unit test, the test hangs in the end.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 21, 2022
@OdedKrams OdedKrams marked this pull request as ready for review June 21, 2022 12:08
@vmoens vmoens changed the title T123527109: TorchRL: Make __del__ call shutdown and close in DataCollectors and Envs Make __del__ call shutdown and close in DataCollectors and Envs Jun 21, 2022
@vmoens
Copy link
Collaborator

vmoens commented Jun 21, 2022

In DataCollectors it seems there is some race condition in the GC so that the internal Env may be deleted before the DataCollector. So the DataCollector must be deleted before getting out of scope. Its del function delete and remove all objects in the proper way. Without calling del collector in the end to the unit test, the test hangs in the end.

I noticed that too, thanks for identifying the issue!
So what's your take on this?
It might be the reason why we decided to explicitly ask the user to close data collectors and envs, otherwise there may be a silent problem with their script (ie the script never terminates). This may have really bad consequences (e.g. a training script on a cluster that borrows nodes for ages).

I see basically 2 options: either we make sure that this phenomenon never happens or we revert and ask users to close the collector / env when it's not used, otherwise an error is thrown.

@vmoens vmoens added the enhancement New feature or request label Jun 23, 2022
Copy link
Collaborator

@vmoens vmoens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think about what would be the behaviour of the test without the new feature.

pin_memory=False,
)
for i, d in enumerate(ccollector):
if i == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can just break when i == 2 or something

b2c = d
else:
break
with pytest.raises(AssertionError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tested elsewhere

@@ -107,7 +107,7 @@ def env_fn(seed):
break
with pytest.raises(AssertionError):
assert_allclose_td(b1, b2)
collector.shutdown()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep these, as we test it somewhere else?
Like this if for some reason the new feature breaks only that test will fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

@@ -129,7 +129,6 @@ def test_env_seed(env_name, frame_skip, seed=0):
assert_allclose_td(td0a, td0c.select(*td0a.keys()))
with pytest.raises(AssertionError):
assert_allclose_td(td1a, td1c)
env.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep these though, for the same reason as above.
But if we can test that things work properly in a separate function that's even better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

assert_allclose_td(b1c, b2c)

if should_shutdown:
ccollector.shutdown()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the function hangs forever, the test will not fail (it will not close)
Do you think that without the new feature we'd get a meaningful error message with this?

@@ -87,7 +87,7 @@ def _check_for_faulty_process(processes):
break
if terminate:
raise RuntimeError(
"At least on process failed. Check for more infos in the log."
"At least one process failed. Check for more infos in the log."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make __del__ call shutdown and close in DataCollectors and Envs
3 participants