Cache can have files go missing on UCSC Slurm cluster #5084

Open
adamnovak opened this issue Sep 10, 2024 · 0 comments

adamnovak commented Sep 10, 2024

It looks like, on the GI Slurm cluster, files are able to vanish from the cache.

I ran:

sbatch -c2 --mem 8G --partition long --time 7-00:00:00 --wrap "toil-wdl-runner https://raw.githubusercontent.com/vgteam/vg_wdl/9b3e4016b16d657a0a7c73e01e1b4c4410f5593e/workflows/giraffe.wdl ./inputs-training-HG002.m84005_220827_014912_s1.json --wdlOutputDirectory ./output/training/HG002.m84005_220827_014912_s1 --wdlOutputFile ./output/training/HG002.m84005_220827_014912_s1.json --logFile ./output/training/HG002.m84005_220827_014912_s1.log --writeLogs ./output/training/log-HG002.m84005_220827_014912_s1 --jobStore ./output/training/tree-HG002.m84005_220827_014912_s1 --batchSystem slurm --slurmTime 11:59:59 --disableProgress --caching=True"

Using the inputs file:

{
    "Giraffe.INPUT_READ_FILE_1": "https://storage.googleapis.com/brain-genomics/awcarroll/share/ucsc/pacbio_fastq/HG002.m84005_220827_014912_s1.fastq.gz",
    "Giraffe.SAMPLE_NAME": "HG002.m84005_220827_014912_s1",
    "Giraffe.PAIRED_READS": false,
    "Giraffe.HAPLOTYPE_SAMPLING": false,
    "Giraffe.GBZ_FILE": "/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-grch38.d9.gbz",
    "Giraffe.MIN_FILE": "/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-grch38.d9.k31.w50.W.withzip.min",
    "Giraffe.ZIPCODES_FILE": "/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-grch38.d9.k31.w50.W.zipcodes",
    "Giraffe.DIST_FILE": "/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-grch38.d9.dist",
    "Giraffe.VG_DOCKER": "quay.io/adamnovak/vg:beec239",
    "Giraffe.READS_PER_CHUNK": 150000,
    "Giraffe.GIRAFFE_PRESET": "hifi",
    "Giraffe.PRUNE_LOW_COMPLEXITY": true,
    "Giraffe.LEFTALIGN_BAM": true,
    "Giraffe.REALIGN_INDELS": false,
    "Giraffe.OUTPUT_SINGLE_BAM": true,
    "Giraffe.REFERENCE_PREFIX": "GRCh38#0#",
    "Giraffe.REFERENCE_FILE": "/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
    "Giraffe.CONTIGS": ["GRCh38#0#chr1", "GRCh38#0#chr2", "GRCh38#0#chr3",  "GRCh38#0#chr4", "GRCh38#0#chr5", "GRCh38#0#chr6", "GRCh38#0#chr7", "GRCh38#0#chr8", "GRCh38#0#chr9", "GRCh38#0#chr10", "GRCh38#0#chr11", "GRCh38#0#chr12", "GRCh38#0#chr13", "GRCh38#0#chr14", "GRCh38#0#chr15", "GRCh38#0#chr16", "GRCh38#0#chr17", "GRCh38#0#chr18", "GRCh38#0#chr19", "GRCh38#0#chr20", "GRCh38#0#chr21", "GRCh38#0#chr22", "GRCh38#0#chrX", "GRCh38#0#chrY"]
}

On Toil c8ba20fa7e95714966cbbfd002e46c26fcafcc05.

I got errors like this in the log from some jobs:

Log from job "'WDLTaskJob' Giraffe.14.runVGGIRAFFEse.command kind-WDLTaskJob/instance-yeovm589 v6" follows:
=========>
	[2024-09-10T11:46:36-0700] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
	[2024-09-10T11:46:36-0700] [MainThread] [I] [toil] Running Toil version 6.2.0a1-8281a413cc028d4d448bbba9c344d42dcf55c2b8 on host phoenix-12.prism.
	[2024-09-10T11:46:36-0700] [MainThread] [I] [toil.worker] Working on job 'WDLTaskJob' Giraffe.14.runVGGIRAFFEse.command kind-WDLTaskJob/instance-yeovm589 v4
	[2024-09-10T11:46:36-0700] [MainThread] [I] [toil.worker] Loaded body Job('WDLTaskJob' Giraffe.14.runVGGIRAFFEse.command kind-WDLTaskJob/instance-yeovm589 v4) from description 'WDLTaskJob' Giraffe.14.runVGGIRAFFEse.command kind-WDLTaskJob/instance-yeovm589 v4
	[2024-09-10T11:46:36-0700] [MainThread] [I] [toil.wdl.wdltoil] Running task command for runVGGIRAFFE (['map', 'runVGGIRAFFE']) called as Giraffe.runVGGIRAFFEse
	[2024-09-10T11:46:36-0700] [MainThread] [I] [MiniWDLContainers] no configuration file found
	[2024-09-10T11:46:36-0700] [MainThread] [N] [MiniWDLContainers] Singularity runtime initialized (BETA) :: singularity_version: "singularity-ce version 3.10.3"
	[2024-09-10T11:46:36-0700] [MainThread] [I] [MiniWDLContainers] detected host resources :: cpu: 256, mem_bytes: 2151637909504
	[2024-09-10T11:46:36-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
	[2024-09-10T11:46:36-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-f72e592d1205456e987d787ad09cfc6a/hprc-v1.1-mc-grch38.d9.k31.w50.W.zipcodes' to path '/data/tmp/toilwf-f683e8bc4898542ab64ebee26f3926d8/c935/job/Giraffe/hprc-v1.1-mc-grch38.d9.k31.w50.W.zipcodes'
	[2024-09-10T11:46:36-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-71817319594f4aa99cbcd408d65a2690/hprc-v1.1-mc-grch38.d9.k31.w50.W.withzip.min' to path '/data/tmp/toilwf-f683e8bc4898542ab64ebee26f3926d8/c935/job/Giraffe/hprc-v1.1-mc-grch38.d9.k31.w50.W.withzip.min'
	[2024-09-10T11:46:36-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-fe9289b731b4460f8fcaa8423e41fdf4/hprc-v1.1-mc-grch38.d9.dist' to path '/data/tmp/toilwf-f683e8bc4898542ab64ebee26f3926d8/c935/job/Giraffe/hprc-v1.1-mc-grch38.d9.dist'
	[2024-09-10T11:46:36-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-099bfda4486c46da94e4ec9202d82fb3/hprc-v1.1-mc-grch38.d9.gbz' to path '/data/tmp/toilwf-f683e8bc4898542ab64ebee26f3926d8/c935/job/Giraffe/hprc-v1.1-mc-grch38.d9.gbz'
	[2024-09-10T11:46:36-0700] [MainThread] [C] [toil.worker] Worker crashed with traceback:
	Traceback (most recent call last):
	  File "/private/home/anovak/workspace/toil/src/toil/worker.py", line 439, in workerScript
	    job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
	  File "/private/home/anovak/workspace/toil/src/toil/job.py", line 3008, in _runner
	    returnValues = self._run(jobGraph=None, fileStore=fileStore)
	  File "/private/home/anovak/workspace/toil/src/toil/job.py", line 2919, in _run
	    return self.run(fileStore)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 150, in decorated
	    return decoratee(*args, **kwargs)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 2314, in run
	    bindings = devirtualize_files(bindings, standard_library)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1419, in devirtualize_files
	    return map_over_files_in_bindings(environment, stdlib._devirtualize_filename)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1645, in map_over_files_in_bindings
	    return map_over_typed_files_in_bindings(bindings, lambda _, x: transform(x))
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1635, in map_over_typed_files_in_bindings
	    return environment.map(lambda b: map_over_typed_files_in_binding(b, transform))
	  File "/private/home/anovak/workspace/toil/venv/lib/python3.10/site-packages/WDL/Env.py", line 151, in map
	    fb = f(b)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1635, in <lambda>
	    return environment.map(lambda b: map_over_typed_files_in_binding(b, transform))
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1654, in map_over_typed_files_in_binding
	    return WDL.Env.Binding(binding.name, map_over_typed_files_in_value(binding.value, transform), binding.info)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1679, in map_over_typed_files_in_value
	    new_path = transform(value.type, value.value)
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 1645, in <lambda>
	    return map_over_typed_files_in_bindings(bindings, lambda _, x: transform(x))
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 880, in _devirtualize_filename
	    result = self.devirtualize_to(
	  File "/private/home/anovak/workspace/toil/src/toil/wdl/wdltoil.py", line 972, in devirtualize_to
	    result = file_source.readGlobalFile(file_id, dest_path, mutable=False, symlink=True)
	  File "/private/home/anovak/workspace/toil/src/toil/fileStores/cachingFileStore.py", line 1163, in readGlobalFile
	    finalPath = self._readGlobalFileWithCache(fileStoreID, localFilePath, symlink, readerID)
	  File "/private/home/anovak/workspace/toil/src/toil/fileStores/cachingFileStore.py", line 1611, in _readGlobalFileWithCache
	    if self._createLinkFromCache(cachedPath, localFilePath, symlink):
	  File "/private/home/anovak/workspace/toil/src/toil/fileStores/cachingFileStore.py", line 1507, in _createLinkFromCache
	    assert os.path.exists(cachedPath), "Cannot create link to missing cache file %s" % cachedPath
	AssertionError: Cannot create link to missing cache file /data/tmp/toilwf-f683e8bc4898542ab64ebee26f3926d8/cache-2a428c6e-0fce-48b3-ac46-5ce532ae055a/tmp8i75v67v3c77a1488820623ed088bca6ae103da3bde63510

	[2024-09-10T11:46:36-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-12.prism
<=========
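
To see how many jobs hit this and which cache files were involved, the log can be scanned for the assertion message. This is a minimal sketch, assuming the message format is exactly the one in the traceback above; `missing_cache_files` is a hypothetical helper, not part of Toil.

```python
import re
from collections import Counter

# Matches the assertion message from the traceback above and captures the path.
# The wording is copied from this log; other Toil versions may phrase it differently.
MISSING_CACHE_RE = re.compile(
    r"AssertionError: Cannot create link to missing cache file (\S+)"
)


def missing_cache_files(log_text: str) -> Counter:
    """Count how often each cache path is reported missing in a log.

    Hypothetical helper for triage only; not a Toil API.
    """
    return Counter(MISSING_CACHE_RE.findall(log_text))
```

Calling it on the contents of the file passed to --logFile and iterating over `.most_common()` would show whether one cache file keeps vanishing or many different ones do.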

Something is wrong with the caching logic: files are apparently going missing from the cache while other jobs are still trying to link to them, so the os.path.exists(cachedPath) assertion in _createLinkFromCache fails.
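
For illustration only, here is a small standalone sketch of the failure mode: one party removes a cached file while another is about to link to it. It is not Toil's implementation and does not identify the root cause. `try_link_from_cache` is a hypothetical helper, and the sketch assumes the cache and the job directory are on the same filesystem so that hard links work. The failing call in the traceback used symlink=True, and a symlink can be created even when its target is already gone, which is why a separate existence check is needed there.

```python
import os
import tempfile


def try_link_from_cache(cached_path: str, local_path: str) -> bool:
    """Hard-link local_path to cached_path; return False if the cache file is gone.

    Hypothetical helper, not Toil's _createLinkFromCache.
    """
    try:
        # os.link raises FileNotFoundError if the source has vanished, so there
        # is no window between a separate exists() check and the link itself.
        os.link(cached_path, local_path)
    except FileNotFoundError:
        return False
    return True


with tempfile.TemporaryDirectory() as work_dir:
    cached = os.path.join(work_dir, "cached-file")
    with open(cached, "w") as f:
        f.write("index data")

    # Cache file present: the first job links to it successfully.
    first_copy = os.path.join(work_dir, "job-copy-1")
    linked_ok = try_link_from_cache(cached, first_copy)

    # Simulate something removing the cache entry before a second job links.
    os.unlink(cached)
    second_copy = os.path.join(work_dir, "job-copy-2")
    linked_after_removal = try_link_from_cache(cached, second_copy)

    # The first job's hard link still holds the data after the removal.
    with open(first_copy) as f:
        surviving_data = f.read()

print(linked_ok, linked_after_removal, surviving_data)
```

A caller that gets False back could treat it as a cache miss and fetch the file again instead of crashing the worker. That would only hide the symptom; whatever is deleting the cached file would still need to be found.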

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1640

adamnovak changed the title from "Caching" to "Cache can have files go missing on UCSC Slurm cluster" Sep 10, 2024
unito-bot added the bug label Sep 24, 2024