
Mesmer flaky instantiation on HPC cluster #733

Open
colganwi opened this issue Sep 19, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@colganwi

colganwi commented Sep 19, 2024

Describe the bug
It seems like Mesmer opens the TF SavedModel in write mode, which means that multiple processes cannot load Mesmer simultaneously. This results in flaky instantiation when running Mesmer in parallel on an HPC cluster.

To Reproduce
Run the code below with >20 cores. If one core is currently loading Mesmer, the other cores will throw Read less bytes than requested or a number of other errors.

Code:

import time

from deepcell.applications import Mesmer

attempts = 10
model = None
for attempt in range(attempts):
    try:
        model = Mesmer()
        break  # If successful, exit the loop
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)  # Wait before retrying
if model is None:
    print(f"Failed to initialize Mesmer after {attempts} attempts.")
else:
    print("Model initialized successfully.")

Running:

#!/bin/bash
# Configuration values for SLURM job submission.
#SBATCH --job-name=mesmer
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8gb
#SBATCH --array=1-400%50

FOV=$(($SLURM_ARRAY_TASK_ID - 1))
echo "FOV: ${FOV}"

source activate deepcell-env
python run_mesmer.py

Error:

INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:00.092152: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wcolgan/miniconda3/envs/py10-env/lib/python3.10/site-packages/cv2/../../lib64:
2024-09-19 08:27:00.092216: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2024-09-19 08:27:00.092255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c3b7): /proc/driver/nvidia/version does not exist
2024-09-19 08:27:00.092772: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-19 08:27:10.302641: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:31.767108: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:28:14.570058: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Attempt 1 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 2 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 3 failed: Read less bytes than requested
Attempt 4 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 5 failed: Read less bytes than requested
Attempt 6 failed: Read less bytes than requested
Attempt 7 failed: Read less bytes than requested
Model initialized successfully.

Expected behavior
Instantiating Mesmer should be reliable and should not involve any file locks or write operations.

Desktop (please complete the following information):

  • OS: Linux c4b2 5.4.0-137-generic 154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Python Version: 3.10.13


@colganwi colganwi added the bug Something isn't working label Sep 19, 2024
@colganwi colganwi changed the title Mesmer flaky instantiation Mesmer flaky instantiation on HPC cluster Sep 19, 2024
@rossbar
Contributor

rossbar commented Sep 19, 2024

Thanks for reporting - this does indeed look like a real issue with the model caching layer. The model downloading component implements a simple cache, but the model extraction component doesn't - so what I think is happening is that the .tar.gz is re-extracted on every run, which can definitely cause issues if one process is reading the extracted model while another is overwriting it with a newly extracted copy.

I suspect the most straightforward fix would be to add caching to the model extraction piece as well.
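
For illustration only, here is a minimal sketch of what caching the extraction step could look like; the function and directory names are hypothetical, not the actual deepcell internals:

import logging
import os
import tarfile

def extract_model_archive(archive_path, cache_dir):
    """Extract archive_path into cache_dir, skipping the work when an
    extracted copy is already present (illustrative sketch only)."""
    model_name = os.path.basename(archive_path).replace(".tar.gz", "")
    extracted_dir = os.path.join(cache_dir, model_name)

    if os.path.isdir(extracted_dir):
        logging.info("Using cached extraction at %s", extracted_dir)
        return extracted_dir

    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(cache_dir)
    logging.info("Extracted %s into %s", archive_path, cache_dir)
    return extracted_dir

Note that this only avoids re-extracting into an existing cache; two processes that both start with an empty cache could still race, so the extraction itself may also need to happen atomically (e.g. extract into a temporary directory and rename it into place).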

@colganwi
Author

Thanks for looking into this. Let me know when you have a patch. For now, I'm able to work around it by not running too many jobs in parallel and by using the try/except retry loop above.
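
Another possible workaround, sketched here under the assumption that the third-party filelock package is installed and that locks on the shared ~/.deepcell directory behave correctly across nodes, is to serialize model construction so that only one job downloads/extracts/loads the model at a time:

import os

from filelock import FileLock
from deepcell.applications import Mesmer

# Hypothetical lock file next to the deepcell model cache; only one job
# at a time runs the download/extract/load sequence, the others block
# here instead of racing on partially extracted SavedModel files.
LOCK_PATH = os.path.expanduser("~/.deepcell/mesmer.lock")

with FileLock(LOCK_PATH):
    model = Mesmer()

print("Model initialized successfully.")

This trades some startup latency for reliability, since the up to 50 concurrent array tasks will load the model one after another.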
