Using the stack in a container on a parallel FS #38

Closed
@ocaisa

Description

As part of #37, I was testing that setup by using it to run GROMACS. Things work fine as long as I don't use too many MPI tasks per node, but once I go above 4 I get errors:

[ocais1@juwels03 test]$  OMP_NUM_THREADS=6 srun --time=00:05:00 --nodes=1 --ntasks-per-node=6 --cpus-per-task=6 singularity exec --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile
srun: job 2622253 queued and waiting for resources
srun: job 2622253 has been allocated resources
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
Failed to initialize loader socket
Failed to initialize loader socket
Failed to initialize loader socket
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... srun: error: jwc04n178: tasks 0-5: Exited with exit code 255
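
(For context, $EESSI_CONFIG and $EESSI_PILOT above are the fusemount strings from the #37 setup; they look roughly like the following, though the exact values may differ slightly:)

export EESSI_CONFIG="container:cvmfs2 cvmfs-config.eessi-hpc.org /cvmfs/cvmfs-config.eessi-hpc.org"
export EESSI_PILOT="container:cvmfs2 pilot.eessi-hpc.org /cvmfs/pilot.eessi-hpc.org"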

I suspect that the alien cache on its own is not enough here, and that we also need a local cache on the node for this use case.
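
If that's the culprit, CernVM-FS's tiered cache support might cover it: keep the existing alien cache on the parallel FS as a read-only lower layer, with a small node-local posix cache on top. A rough, untested sketch of what the client configuration could look like (all cache names and paths below are placeholders, not our actual setup):

CVMFS_CACHE_PRIMARY=hpc

CVMFS_CACHE_hpc_TYPE=tiered
CVMFS_CACHE_hpc_UPPER=nodelocal        # fast per-node layer
CVMFS_CACHE_hpc_LOWER=alien            # existing shared cache on the parallel FS
CVMFS_CACHE_hpc_LOWER_READONLY=yes     # never write back to the shared layer

CVMFS_CACHE_nodelocal_TYPE=posix
CVMFS_CACHE_nodelocal_BASE=/tmp/cvmfs-cache              # node-local scratch (placeholder)
CVMFS_CACHE_nodelocal_SHARED=no

CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_ALIEN=/p/project/cecam/cvmfs-alien-cache   # placeholder path
CVMFS_CACHE_alien_QUOTA_LIMIT=-1       # alien caches are externally managed

That way, concurrent tasks on a node would mostly hit the local layer instead of all hammering the parallel FS at once.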
