
Make the inference server exit gracefully in case of errors instead of hanging #33

Merged
kmaziarz merged 4 commits into main from kmaziarz/fix-hanging-on-error on Sep 9, 2022

Conversation

@kmaziarz (Collaborator) commented Sep 7, 2022

Currently, the inference server waits for the results from child processes indefinitely, which leads to hanging if one of these processes dies. There are at least two ways to reliably trigger this:
(1) Pass in an invalid SMILES for encoding (tracked in #15)
(2) Initialize the inference server after having initialized tensorflow (which poisons the context of the forked child processes)

We could handle (1) better, as further described in #15, but I didn't find a way to detect (2) other than letting the child process die and then trying to recover from that.

In this PR, I make the experience a bit smoother by making the parent process exit gracefully in case a child process dies.
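The idea boils down to polling the result queue with a timeout and, whenever a poll times out, checking that the worker processes are still alive. A minimal sketch of that pattern (illustrative only, not the actual implementation; the function and parameter names and the timeout value are assumptions):

```python
import queue  # multiprocessing.Queue.get raises queue.Empty on timeout

def collect_results(output_queue, workers, num_results, poll_timeout=10):
    """Gather num_results items, failing fast if any worker process dies."""
    results = []
    while len(results) < num_results:
        try:
            results.append(output_queue.get(timeout=poll_timeout))
        except queue.Empty:
            # Nothing arrived within the timeout; before waiting again, check
            # that the processes supposed to produce results are still alive.
            if any(not worker.is_alive() for worker in workers):
                raise RuntimeError("Worker process died")
    return results
```

With a check like this, a dead child surfaces as a clear error in the parent instead of an indefinite wait.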

@sarahnlewis (Contributor) left a comment


This will be lifechanging for MoLeR users everywhere. Thank you!

@kmaziarz kmaziarz merged commit c749811 into main Sep 9, 2022
@kmaziarz kmaziarz deleted the kmaziarz/fix-hanging-on-error branch September 9, 2022 13:02
@cankobanz

Hello,

I am currently using the latest version of molecule-generation (0.4.1) and am encountering an issue similar to one that was previously discussed and resolved. Here is the error message I'm receiving:

[08:48:42] Explicit valence for atom # 22 N, 4, is greater than permitted
[08:48:45] Explicit valence for atom # 22 N, 4, is greater than permitted
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 272, in try_collect_results
    result_id, result = self._output_queue.get(timeout=10)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 108, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "notebooks/create_bindingdb_moler.py", line 37, in <module>
    setup(drug_featurizer=featurizer, _device=device)
  File "notebooks/create_bindingdb_moler.py", line 31, in setup
    drug_featurizer.preload(all_drugs)
  File "/ConGen/src/featurizers/molecule.py", line 96, in preload
    self.write_to_disk(seq_list, verbose=verbose)
  File "/ConGen/src/featurizers/molecule.py", line 73, in write_to_disk
    feats_list = self.transform_list(seq_list)
  File "/ConGen/src/featurizers/molecule.py", line 53, in transform_list
    feats_batch = model.encode(batch)
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/wrapper.py", line 113, in encode
    return self._inference_server.encode(
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 310, in encode
    return self.try_collect_results(num_results)
  File "/usr/local/lib/python3.8/dist-packages/molecule_generation/utils/moler_inference_server.py", line 281, in try_collect_results
    raise RuntimeError("Worker process died")
RuntimeError: Worker process died

My current environment where the code is running includes:

Python 3.8.10
TensorFlow 2.9.3
CUDA 11.8
Single GPU: NVIDIA GeForce RTX 2060

Despite the fix discussed here, I am still encountering this error. Any help in solving or understanding the problem would be greatly appreciated. Thank you.

@kmaziarz (Collaborator, Author) commented

@cankobanz The PR you posted in made it so that MoLeR no longer hangs on errors and instead exits gracefully. Your logs show it did exactly that, so the error is being handled as expected; the only question is why the error occurs in the first place 🙂

From your logs, I assume this happens during encoding (not decoding) of a potentially long sequence of SMILES. Is that correct? If so, do you see the error always or only at scale (which could suggest it's being triggered by outliers)? Are all of the SMILES you're encoding valid (i.e. can they be parsed by rdkit)?
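For example, a quick pre-filtering pass with RDKit could look like the sketch below (illustrative only, not part of molecule-generation; `Chem.MolFromSmiles` returns `None` for SMILES it cannot parse, emitting valence warnings like the ones in your log):

```python
from rdkit import Chem

def keep_valid_smiles(smiles_list):
    """Return only the SMILES strings that RDKit can parse and sanitize."""
    valid = []
    for smiles in smiles_list:
        if Chem.MolFromSmiles(smiles) is None:
            print(f"Dropping unparsable SMILES: {smiles}")
        else:
            valid.append(smiles)
    return valid
```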

@cankobanz

@kmaziarz, thank you for the quick and detailed response. Now, I understand the fix that was applied here. I can easily identify and remove the invalid SMILES molecules from my dataset.
