
Make the inference server exit gracefully in case of errors instead of hanging #33

Merged
kmaziarz merged 4 commits into main from kmaziarz/fix-hanging-on-error on Sep 9, 2022

Conversation

@kmaziarz (Collaborator) commented Sep 7, 2022

Currently, the inference server waits for the results from child processes indefinitely, which leads to hanging if one of these processes dies. There are at least two ways to reliably trigger this:
(1) Pass in an invalid SMILES for encoding (tracked in #15)
(2) Initialize the inference server after having initialized tensorflow (which poisons the context of the forked child processes)

We could handle (1) better, as further described in #15, but I didn't find a way to detect (2) other than letting the child process die and then trying to recover from that.

In this PR, I make the experience a bit smoother by making the parent process exit gracefully in case a child process dies.
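The idea boils down to polling the result queue with a timeout and, whenever a poll times out, checking that the worker processes are still alive. A minimal sketch of that pattern (illustrative only, not the actual implementation; the function and parameter names and the timeout value are assumptions):

```python
import queue  # multiprocessing.Queue.get raises queue.Empty on timeout

def collect_results(output_queue, workers, num_results, poll_timeout=10):
    """Gather num_results items, failing fast if any worker process dies."""
    results = []
    while len(results) < num_results:
        try:
            results.append(output_queue.get(timeout=poll_timeout))
        except queue.Empty:
            # Nothing arrived within the timeout; before waiting again, check
            # that the processes supposed to produce results are still alive.
            if any(not worker.is_alive() for worker in workers):
                raise RuntimeError("Worker process died")
    return results
```

With a check like this, a dead child surfaces as a clear error in the parent instead of an indefinite wait.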

@sarahnlewis (Contributor) left a comment


This will be lifechanging for MoLeR users everywhere. Thank you!

@kmaziarz kmaziarz merged commit c749811 into main Sep 9, 2022
@kmaziarz kmaziarz deleted the kmaziarz/fix-hanging-on-error branch September 9, 2022 13:02
@cankobanz

Hello,

I am currently using the latest version of molecule-generation (0.4.1) and am encountering an issue similar to one that was previously discussed and resolved. Here is the error message I'm receiving:

[08:48:42] Explicit valence for atom # 22 N, 4, is greater than permitted
[08:48:45] Explicit valence for atom # 22 N, 4, is greater than permitted
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 272, in try_collect_results
    result_id, result = self._output_queue.get(timeout=10)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 108, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "notebooks/create_bindingdb_moler.py", line 37, in <module>
    setup(drug_featurizer=featurizer, _device=device)
  File "notebooks/create_bindingdb_moler.py", line 31, in setup
    drug_featurizer.preload(all_drugs)
  File "/ConGen/src/featurizers/molecule.py", line 96, in preload
    self.write_to_disk(seq_list, verbose=verbose)
  File "/ConGen/src/featurizers/molecule.py", line 73, in write_to_disk
    feats_list = self.transform_list(seq_list)
  File "/ConGen/src/featurizers/molecule.py", line 53, in transform_list
    feats_batch = model.encode(batch)
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/wrapper.py", line 113, in encode
    return self._inference_server.encode(
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 310, in encode
    return self.try_collect_results(num_results)
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 281, in try_collect_results
    raise RuntimeError("Worker process died")
RuntimeError: Worker process died

My current environment where the code is running includes:

Python 3.8.10
TensorFlow 2.9.3
CUDA 11.8
Single GPU: NVIDIA GeForce RTX 2060

Despite the fix discussed here, I am still encountering this error. Any help in solving or understanding the problem would be greatly appreciated. Thank you.

@kmaziarz (Collaborator, Author) commented

@cankobanz The PR you posted in made it so that MoLeR no longer hangs on errors and instead exits gracefully. Your logs show it did exactly that, so the error is being handled as expected; the only question is why the error occurs in the first place 🙂

From your logs, I assume this happens during encoding (not decoding) of a potentially long sequence of SMILES. Is that correct? If so, do you see the error always or only at scale (which could suggest it's being triggered by outliers)? Are all of the SMILES you're encoding valid (i.e. can they be parsed by rdkit)?
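For example, a quick pre-filtering pass with RDKit could look like the sketch below (illustrative only, not part of molecule-generation; `Chem.MolFromSmiles` returns `None` for SMILES it cannot parse, emitting valence warnings like the ones in your log):

```python
from rdkit import Chem

def keep_valid_smiles(smiles_list):
    """Return only the SMILES strings that RDKit can parse and sanitize."""
    valid = []
    for smiles in smiles_list:
        if Chem.MolFromSmiles(smiles) is None:
            print(f"Dropping unparsable SMILES: {smiles}")
        else:
            valid.append(smiles)
    return valid
```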

@cankobanz

@kmaziarz, thank you for the quick and detailed response. Now, I understand the fix that was applied here. I can easily identify and remove the invalid SMILES molecules from my dataset.
