Description
Version: 1.2.8
We are experiencing an issue on our HPC cluster. When users run CryoSieve, the application appears to hang after writing the output of the first iteration (iter0). Note that both half-map .mrc files in the listing below are 0 bytes:
[z3545907_sa@c03 2026-01-14-cryosieve]$ ls -al /data/ess-test/z3545907_sa/troubleshooting/z5025120/2026-01-22-cryosieve/cryosieve/J421_5217/
total 82209
drwxr-xr-x. 2 z3545907_sa unsw-sa 4096 Feb 9 13:25 .
drwxr-xr-x. 7 z3545907_sa unsw-sa 4096 Feb 9 12:20 ..
-rw-r--r--. 1 z3545907_sa unsw-sa 0 Feb 9 13:22 out.star_iter0_half1.mrc
-rw-r--r--. 1 z3545907_sa unsw-sa 0 Feb 9 13:25 out.star_iter0_half2.mrc
-rw-r--r--. 1 z3545907_sa unsw-sa 10478 Feb 9 13:21 out.star_iter0_reconstruct_half1.txt
-rw-r--r--. 1 z3545907_sa unsw-sa 10272 Feb 9 13:25 out.star_iter0_reconstruct_half2.txt
-rw-r--r--. 1 z3545907_sa unsw-sa 84135802 Feb 9 12:20 out.star_iter0.star
Curiously, this only happens when running on remotely mounted storage; on a local disk, the calculations complete just fine. Our remote storage is mounted over NFS, so I would not expect any strange discrepancy between the two. As we are an HPC facility, running on a local disk is not really an option.
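For anyone wanting to compare the two storage locations, here is a minimal sketch of how the filesystem type and mount options for a path could be inspected. The helper names `fs_type` and `mount_opts` are hypothetical, and the sketch assumes GNU `stat` and util-linux `findmnt` are available on the node.

```shell
#!/bin/bash
# Hypothetical helpers for comparing the storage behind two paths.
# Assumes GNU coreutils (stat) and util-linux (findmnt) are installed.

# fs_type PATH: print the filesystem type that PATH lives on.
fs_type() {
    stat -f -c %T "$1"
}

# mount_opts PATH: print source, type and mount options of PATH's mount.
mount_opts() {
    findmnt -n -o SOURCE,FSTYPE,OPTIONS -T "$1"
}
```

Running `mount_opts` on the data directory and on a local directory (for example `/tmp`, which is an assumption about where local disk lives) shows the NFS options in effect. Options such as `actimeo`, `noac`, `lookupcache` and `nocto` control attribute caching on NFS clients, so they may be worth checking if many small open/stat calls are involved.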
I'm quite stumped by this, so I ran it with strace -f in order to see what the program is doing. It appears to get stuck reading the same file over and over:
[pid 1734032] stat("J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] openat(AT_FDCWD, "J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", O_RDONLY) = 3
[pid 1734032] fstat(3, {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] read(3, "h\1\0\0h\1\0\0\34\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0h\1\0\0"..., 8192) = 8192
[pid 1734032] lseek(3, 4923392, SEEK_SET) = 4923392
[pid 1734032] read(3, "\211<g@^\273\243\274\336\266\36\266\273\276\34\300e9\325\274\323\271_20\271\3634\">\3027"..., 8192) = 8192
[pid 1734032] read(3, "\2409\27\267\2236\316;L44.;\266\2565H\255\202=\226;\2523\353\300f=\2057\2064"..., 245760) = 245760
[pid 1734032] read(3, "q<\221\2646;\245\27601\378\3165\3555\2249\7;F:k:6.\n\267\241\270v5"..., 8192) = 8192
[pid 1734032] close(3) = 0
[pid 1734032] stat("J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] openat(AT_FDCWD, "J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", O_RDONLY) = 3
[pid 1734032] fstat(3, {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] read(3, "h\1\0\0h\1\0\0\34\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0h\1\0\0"..., 8192) = 8192
[pid 1734032] lseek(3, 5177344, SEEK_SET) = 5177344
[pid 1734032] read(3, "q<\221\2646;\245\27601\378\3165\3555\2249\7;F:k:6.\n\267\241\270v5"..., 8192) = 8192
[pid 1734032] read(3, "\2667[-\3324\3415X\273\241<C\267\312\274\255?\30\247\3729\305=90)(\370)\203:"..., 253952) = 253952
[pid 1734032] read(3, "s9n\275\326\301k\265r\274\263\272@\254\226=\10\276\235\36F<\3765\27\270W\262*9\371:"..., 8192) = 8192
[pid 1734032] close(3) = 0
[pid 1734032] brk(0xd29e000) = 0xd29e000
[pid 1734032] stat("J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] openat(AT_FDCWD, "J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", O_RDONLY) = 3
[pid 1734032] fstat(3, {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] read(3, "h\1\0\0h\1\0\0\34\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0h\1\0\0"..., 8192) = 8192
[pid 1734032] lseek(3, 5439488, SEEK_SET) = 5439488
[pid 1734032] read(3, "s9n\275\326\301k\265r\274\263\272@\254\226=\10\276\235\36F<\3765\27\270W\262*9\371:"..., 8192) = 8192
[pid 1734032] read(3, ":1\215\270E;\26)\317&\340\274\335\300\323\267!<\266\263!\27588\331\272N\275\376:X>"..., 253952) = 253952
[pid 1734032] read(3, "\0360\32\301=\300\f\300\3462b<\233\275\263\300\350\266\251\275\210\300I\260\2476\300<\321\267*\266"..., 8192) = 8192
[pid 1734032] close(3) = 0
[pid 1734032] stat("J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] openat(AT_FDCWD, "J361/extract/FoilHole_14475909_Data_13401456_31_20251117_044426_Fractions_patch_aligned_doseweighted_particles.mrcs", O_RDONLY) = 3
[pid 1734032] brk(0xcfa4000) = 0xcfa4000
[pid 1734032] fstat(3, {st_mode=S_IFREG|0755, st_size=7258624, ...}) = 0
[pid 1734032] read(3, "h\1\0\0h\1\0\0\34\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0h\1\0\0"..., 8192) = 8192
[pid 1734032] lseek(3, 5955584, SEEK_SET) = 5955584
[pid 1734032] read(3, "\321\267\213\272\220<\3612\f5\300;\340>\10\272\266\255S\274\304\264\261<\261>\3116\230\272\343%"..., 8192) = 8192
[pid 1734032] read(3, "Y\273\3004\357:.<J\271\241\265\3425\2675z:\3468g8\3426\213\262\343\275a5A;"..., 253952) = 253952
[pid 1734032] read(3, "L\265\353<N@\360\276u\300\374\270\247*O1\222\270\3349\5<\2405z:\320\271\350\252\16="..., 8192) = 8192
[pid 1734032] close(3) = 0
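The `lseek` offsets in the excerpt above differ between cycles (4923392, 5177344, 5439488, 5955584), which suggests the reads may be advancing through the file rather than repeating the same data. Here is a rough sketch for checking this across a whole trace, assuming the output was saved to a file (for instance with `strace -f -o trace.log ...`; the name `trace.log` and the function names are hypothetical):

```shell
#!/bin/bash
# Hypothetical helpers for summarising a saved strace log.

# count_opens LOG: print "<count> <path>" for every path passed to
# openat(AT_FDCWD, ...), most frequently opened first.
count_opens() {
    grep -o 'openat(AT_FDCWD, "[^"]*"' "$1" \
        | sed 's/^openat(AT_FDCWD, "//; s/"$//' \
        | sort | uniq -c | sort -rn
}

# list_seek_offsets LOG: print the target offset of every
# lseek(fd, offset, SEEK_SET) call, in the order they appear in the log.
list_seek_offsets() {
    sed -n 's/.*lseek([0-9]*, \([0-9]*\), SEEK_SET).*/\1/p' "$1"
}
```

If `list_seek_offsets trace.log` prints steadily increasing offsets for a given file, the process is likely making slow progress (one open/seek/read/close cycle per chunk) rather than looping on the same data.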
Below is the script used to run Cryosieve:
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=400:00:00
#SBATCH --job-name="cryosieve"
#SBATCH --gres=gpu:1
# Get the number of available GPUs
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
# cd to the data directory
cd /data/ess-test/z3545907_sa/troubleshooting/z5025120/2026-01-22-cryosieve/
# Load the module
module load sbgrid/cryosieve
module load sbgrid/relion
module load cuda/11.8.0
# Run the command
strace -f cryosieve --reconstruct_software "relion_reconstruct" --postprocess_software relion_postprocess --i J421_particles.star --o cryosieve/J421_$SLURM_JOB_ID/out.star --mask J421_005_volume_mask_fsc_auto.mrc --angpix 0.7736 --num_iters 10 --frequency_start 40 --frequency_end 3 --retention_ratio 0.8 --sym C1 --num_gpus ${GPU_COUNT}
I've also attached the two output txt files.
+ Taking data dimensions from the first optics group: 2
+ Back-projecting all images ...
32.33/32.33 min ............................................................~~(,_,"> yum!
+ Starting the reconstruction ...
+ Taking data dimensions from the first optics group: 2
+ Back-projecting all images ...
32.33/32.33 min ............................................................~~(,_,"> yum!
+ Starting the reconstruction ...
RELION on its own runs just fine, so I don't believe this is an issue with RELION. Any insight would be very helpful: the program itself produces little debug output, so I do not have much to go on.
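In case it helps with reproducing this outside CryoSieve, below is a minimal sketch that times a similar access pattern (a separate open, seek, read and close for each block) so the same file can be compared on NFS and on local disk. The function `time_reopens` is a hypothetical helper, not part of CryoSieve or RELION, and it assumes GNU `dd`, `stat` and `date`. Each `dd` call also starts a new process, so the absolute numbers include that overhead; only the difference between the two filesystems is meaningful.

```shell
#!/bin/bash
# time_reopens FILE COUNT
# Opens FILE COUNT times, reading one 8192-byte block at a different
# offset each time, and prints the elapsed wall-clock time in seconds.
# Hypothetical helper for comparison only.
time_reopens() {
    local file=$1 count=$2 i blocks start end
    # Number of whole 8192-byte blocks in the file (at least 1),
    # so that the skip offset never runs past the end of the file.
    blocks=$(( $(stat -c %s "$file") / 8192 ))
    (( blocks > 0 )) || blocks=1
    start=$(date +%s.%N)
    for ((i = 0; i < count; i++)); do
        # Each dd invocation is its own open/seek/read/close cycle.
        dd if="$file" of=/dev/null bs=8192 skip=$(( i % blocks )) count=1 status=none
    done
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
}
```

For example, `time_reopens some_particles.mrcs 1000` could be run once against a copy on the NFS mount and once against a copy on local disk (the file name here is a placeholder). A large gap between the two would be consistent with per-open overhead on NFS, though it would not prove that this is the cause of the hang.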