Several Infrastructure-Related Issues Getting cactus-pangenome to work #1878

@bbimber


Hello,

I finally got the cactus-pangenome pipeline to run after encountering a number of different issues, most of which I believe are due to differences in the cluster environment where I'm trying to run it. Notably, our main filesystems are NFS (also Weka/Lustre). I often use docker, though our cluster uses podman for 'rootless' docker execution.

These issues surfaced in several different ways, but they fell into two categories: 1) issues with jemalloc, and 2) what I assume are issues with NFS filesystems. I'm reporting them here in case others hit them; at the least, the NFS piece might be worth a more prominent note in your documentation.

  1. Category 1: I ran into many issues with jemalloc and vg, initially reported in "<jemalloc>: Error in munmap(): Cannot allocate memory" (#1871). This occurred both with and without docker. I ended up creating a new docker container in which vg is installed with "mimalloc=on" (see: https://github.com/bimberlabinternal/DevOps/blob/master/containers/cactus/Dockerfile), based on this thread: "vg giraffe hangs or stalls when reading large gzipped FASTQ files" (vgteam/vg#4645).

  2. Category 2: what I assume are issues with an NFS or Lustre filesystem. Both with and without docker, I initially ran the pipeline with the cactus jobStore and output directories on our default NFS filesystem. The exact error varied from run to run, but one example is below:

[2026-01-15T11:16:29+0000] [MainThread] [I] [toil.lib.history] Workflow a10623d8-fe9f-4af7-9162-ba2eca93bebf stopped. Success: False
Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus-pangenome", line 7, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_pangenome.py", line 269, in main
    toil.start(Job.wrapJobFn(pangenome_end_to_end_workflow, options, config_wrapper, input_seq_id_map, input_path_map, input_seq_order, ref_collapse_paf_id, last_scores_id))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1185, in start
    return self._runMainLoop(rootJobDescription)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1795, in _runMainLoop
    ).run()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 341, in run
    raise FailedJobsException(
toil.exceptions.FailedJobsException: The job store '/home/exacloud/gscratch/prime-seq/Bimber/ONT/cactus/js' contains 12 failed jobs: 'Job' kind-Job/instance-umzrgkrn v2, 'batch_align_jobs' kind-export_split_wrapper/instance-qriu1rae v12, 'Job' kind-export_minigraph_wrapper/instance-finl2_qq v8, 'join_vg' kind-join_vg/instance-jnr68g00 v3, 'Job' kind-Job/instance-a6njlacf v2, 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6, 'sort_minigraph_input_with_mash' kind-minigraph_construct_workflow/instance-q3fvzif2 v7, 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-xj5a6703 v6, 'graphmap_join_workflow' kind-export_align_wrapper/instance-fl2ts3oy v6, 'Job' kind-make_vcf/instance-jttebdk7 v5, 'Job' kind-export_graphmap_wrapper/instance-jcdts05s v11, 'Job' kind-Job/instance-m9sar_yg v2
Log from job "'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6" follows:
=========>
	[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
	[2026-01-15T11:15:56+0000] [MainThread] [I] [toil] Running Toil version 9.1.2-355b72ba0e425619c0367562caf2a337078dba65 on host cbcebd992853.
	[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] Working on job 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
	[2026-01-15T11:16:03+0000] [MainThread] [I] [toil.worker] Loaded body Job('vcf_cat' kind-vcf_cat/instance-msw9hw0m v4) from description 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
	[2026-01-15T11:16:03+0000] [MainThread] [C] [toil.worker] Worker crashed with traceback:
	Traceback (most recent call last):
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 595, in workerScript
	    job._runner(
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3254, in _runner
	    returnValues = self._run(jobGraph=None, fileStore=fileStore)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3137, in _run
	    return self.run(fileStore)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3452, in run
	    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 1467, in vcf_cat
	    job.fileStore.readGlobalFile(vcf_id, vcf_path)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/fileStores/nonCachingFileStore.py", line 162, in readGlobalFile
	    self.jobStore.read_file(fileStoreID, localFilePath, symlink=symlink)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 551, in read_file
	    self._check_job_store_file_id(file_id)
	  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 1051, in _check_job_store_file_id
	    raise NoSuchFileException(jobStoreFileID)
	toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/for-job/kind-deconstruct/instance-4wy55cn1/file-e3f47756c3b041f183faa91a72f72605/rm-pg.chr1.clip.raw.vcf.gz' does not exist.
	
	[2026-01-15T11:16:03+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cbcebd992853
<=========

I also frequently got errors about /var/run/lock* files not existing. Eventually I got cactus to run by pointing TMPDIR, --workDir, and --coordinationDir at node-local storage:

# $WORK_DIR is a local SSD connected to the node:
docker run \
	--rm \
	-v /mnt/scratch:/mnt/scratch \
	-e TMPDIR=$WORK_DIR \
	$CONTAINER_NAME \
	cactus-pangenome \
		$JOB_STORE \
		--mgCores $THREADS \
		--maxMemory ${MEM}G \
		--workDir $WORK_DIR \
		--coordinationDir $WORK_DIR \
		$GENOME_FILE \
		--outDir $OUTPUT \
		--outName $OUTPUT \
		--reference $REF_NAME \
		--vcf \
		--giraffe \
		--gfa \
		--gbz
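
For anyone hitting the same problem, a pre-flight check along these lines might help confirm that the work directory really is on node-local storage before launching. This is only a sketch under stated assumptions: the function names are hypothetical (they are not part of cactus or Toil), it relies on GNU coreutils `stat`, and it only matches the NFS and Lustre type names; other shared filesystems such as Weka would need their own pattern.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the filesystem type name looks like
# a network/shared filesystem (only NFS variants and Lustre are matched).
is_network_fs_type() {
    case "$1" in
        nfs*|lustre) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the filesystem type backing a path (assumes GNU coreutils stat).
fs_type_of() {
    stat -f -c %T "$1"
}

# Return 0 if the directory is not on a matched network filesystem,
# 1 if it is, and 2 if the filesystem type could not be determined.
check_local_dir() {
    local dir="$1" fstype
    fstype="$(fs_type_of "$dir")" || return 2
    if is_network_fs_type "$fstype"; then
        echo "WARNING: $dir is on $fstype; consider node-local storage" >&2
        return 1
    fi
    echo "OK: $dir is on $fstype"
}
```

It could be called as `check_local_dir "$WORK_DIR" || exit 1` just before the `docker run` command above.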

I didn't see 'NFS' mentioned anywhere in your documentation, so I thought I'd report this in case there is user guidance that would make sense to add.
