Closed
Description
As part of #37, I tested running GROMACS with that setup. It works fine with a small number of MPI tasks per node, but once I go above 4 tasks per node I get errors:
[ocais1@juwels03 test]$ OMP_NUM_THREADS=6 srun --time=00:05:00 --nodes=1 --ntasks-per-node=6 --cpus-per-task=6 singularity exec --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile
srun: job 2622253 queued and waiting for resources
srun: job 2622253 has been allocated resources
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
Failed to initialize loader socket
Failed to initialize loader socket
Failed to initialize loader socket
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL: stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module...
srun: error: jwc04n178: tasks 0-5: Exited with exit code 255
I suspect the alien cache on its own is not enough for this use case, and that we also need a local cache on the node.
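To make the idea concrete, here is a rough sketch of what a client configuration with a node-local cache layered on top of the alien cache might look like, assuming the tiered cache manager in CernVM-FS is usable in this container setup. This is untested here. The instance names (`hpc`, `memory`, `alien`), the alien cache path, and the memory size are placeholders, not values from the current setup.

```shell
# Hypothetical CernVM-FS client config (e.g. in default.local); not tested on JUWELS.
# Use a tiered cache: a small node-local upper layer in front of the alien cache.
CVMFS_CACHE_PRIMARY=hpc
CVMFS_CACHE_hpc_TYPE=tiered
CVMFS_CACHE_hpc_UPPER=memory
CVMFS_CACHE_hpc_LOWER=alien
# Treat the lower (alien) layer as read-only from the client's point of view.
CVMFS_CACHE_hpc_LOWER_READONLY=yes

# Upper layer: in-memory cache on the compute node (size in MB, placeholder value).
CVMFS_CACHE_memory_TYPE=ram
CVMFS_CACHE_memory_SIZE=256

# Lower layer: the existing alien cache on the shared file system (placeholder path).
CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_ALIEN=/path/to/alien/cache
CVMFS_CACHE_alien_SHARED=no
CVMFS_CACHE_alien_QUOTA_LIMIT=-1
```

Whether this actually avoids the "transport endpoint is not connected" failures with more than 4 tasks per node would still need to be checked.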