Describe the bug
It seems like Mesmer opens the TF SavedModel in write mode, which means that multiple processes cannot load Mesmer simultaneously. This results in flaky instantiation when running Mesmer in parallel on an HPC cluster.
To Reproduce
Run the code below on >20 cores. If one core is currently loading Mesmer, the other cores will throw "Read less bytes than requested" or a number of other errors.
Code:
import time

from deepcell.applications import Mesmer

attempts = 10
model = None
for attempt in range(attempts):
    try:
        model = Mesmer()
        break  # If successful, exit the loop
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)
if model is None:
    print("Failed to initialize Mesmer after 10 attempts.")
else:
    print("Model initialized successfully.")
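As a sturdier alternative to the retry loop above, initialization can be serialized across workers on the shared filesystem with an advisory file lock, so only one process extracts and loads the model at a time. This is a workaround sketch using only the standard library; the helper name and lock-file path are illustrative choices, not part of the deepcell API:

```python
import fcntl
import os


def locked_init(factory, lock_path):
    """Call factory() while holding an exclusive advisory lock on
    lock_path, so concurrent workers initialize one at a time."""
    os.makedirs(os.path.dirname(lock_path) or ".", exist_ok=True)
    with open(lock_path, "w") as lock_file:
        # Block until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return factory()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Usage would then be something like `model = locked_init(Mesmer, os.path.expanduser("~/.deepcell/mesmer.lock"))`. Note that `fcntl.flock` is advisory and Unix-only, and on some networked filesystems (e.g. older NFS setups) flock semantics can vary, so this should be verified on the target cluster.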
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:00.092152: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wcolgan/miniconda3/envs/py10-env/lib/python3.10/site-packages/cv2/../../lib64:
2024-09-19 08:27:00.092216: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2024-09-19 08:27:00.092255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c3b7): /proc/driver/nvidia/version does not exist
2024-09-19 08:27:00.092772: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-19 08:27:10.302641: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:31.767108: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:28:14.570058: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Attempt 1 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 2 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 3 failed: Read less bytes than requested
Attempt 4 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 5 failed: Read less bytes than requested
Attempt 6 failed: Read less bytes than requested
Attempt 7 failed: Read less bytes than requested
Model initialized successfully.
Expected behavior
Instantiating Mesmer should be reliable and should not involve any file locks or write operations.
Desktop (please complete the following information):
OS: Linux c4b2 5.4.0-137-generic 154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python Version: 3.10.13
Thanks for reporting - this indeed does look like a real issue in the caching layer. The model downloading component implements a simple cache, but the model extraction component doesn't - so what I think is happening is that the .tar.gz is being extracted on every run, which can definitely cause issues if one process is reading the SavedModel while another is overwriting it with a newly extracted copy.
I suspect the most straightforward fix would be to add caching to the model extraction piece as well.
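Such a guard could look roughly like the sketch below (the helper name and the archive-name convention are assumptions for illustration; the actual deepcell internals may differ): skip extraction entirely when the destination directory already exists.

```python
import logging
import os
import tarfile


def extract_archive(archive_path, cache_dir):
    """Extract archive_path into cache_dir only if it has not already
    been extracted, so concurrent readers never see a half-written
    SavedModel directory."""
    # Derive the expected output directory from the archive name,
    # e.g. MultiplexSegmentation-9.tar.gz -> MultiplexSegmentation
    # (an assumed naming convention for this sketch).
    stem = os.path.basename(archive_path).split(".tar")[0]
    target = os.path.join(cache_dir, stem.rsplit("-", 1)[0])
    if os.path.isdir(target):
        logging.info("Using cached extraction at %s", target)
        return target
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(cache_dir)
    return target
```

A fully race-free version would also extract into a temporary directory and atomically rename it into place, so that a process crashing mid-extraction never leaves a partial directory that later calls mistake for a valid cache.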
Thanks for looking into this. Let me know when you have a patch. For now, I'm able to work around it by not running too many jobs in parallel and by using the try/except retry loop above.