
Production fixes #87

Merged — 5 commits merged into main from prod_fix on Feb 29, 2024

Conversation

@muffato (Member) commented Feb 28, 2024

TOLIT-1928

This pull request addresses MEMLIMIT errors found on BWAMEM2_MEM and SAMTOOLS_SORMADUP.
For both processes, the jobs were retried 5 times (as per maxRetries) and failed at every attempt, despite the resources (CPUs and memory) increasing.
This is because in both cases the memory is a function of the number of CPUs: as the number of CPUs increases, so does the memory. But the config was such that the memory-per-CPU ratio was not increasing, or not increasing fast enough, so retries could not address the initial MEMLIMIT problem.

The changes I propose are:

  • for SAMTOOLS_SORMADUP: add a per-CPU component to the memory formula.
  • for BWAMEM2_MEM: add the attempt number to the formula so that the memory-per-CPU ratio increases every attempt.

In both cases, I'm not changing the baseline, but rather making sure that the second attempt gives enough memory for the job to succeed.
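As a sketch of the idea for BWAMEM2_MEM (the process name comes from this pull request, but the 1 GB/CPU baseline and the exact closure are illustrative, not the pipeline's actual values):

```
process {
    withName: 'BWAMEM2_MEM' {
        // the memory-per-CPU ratio now grows with the attempt number,
        // so each retry raises memory faster than it raises CPUs
        memory = { 1.GB * task.cpus * task.attempt }
    }
}
```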

bwamem2_mem

Initial failure: /lustre/scratch122/tol/share/weskit/data/prod/03ba/03ba5c76-fccd-49da-a0cc-682362128eb1/exc_priority_1_readmapping/.nextflow.log and /lustre/scratch122/tol/share/weskit/data/prod/03ba/03ba5c76-fccd-49da-a0cc-682362128eb1/exc_priority_1_readmapping/work/bf/2ab57f0e97ef957747d31766cf5cf2/.command.log

Successful run: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2

First attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2/work/86/e2ed519694856ac74b17ba380a5118/.command.log (12 CPUs, 15.4 GB RAM) failed because of MEMLIMIT.
Second attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2/work/ce/664d8f5406e98fb33390c3a966ce0d/.command.log (18 CPUs, 34.1 GB RAM) succeeded (22.8 GB RAM used).

sormadup

Initial failure: /lustre/scratch122/tol/share/weskit/data/prod/ea13/ea1384a3-41d8-43ee-96d8-915731b14164/exc_priority_1_readmapping/.nextflow.log and /lustre/scratch122/tol/share/weskit/data/prod/ea13/ea1384a3-41d8-43ee-96d8-915731b14164/exc_priority_1_readmapping/work/7c/a7e6a393747ba7456d9174450850d3/.command.log

Successful run: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup

First attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup/work/ed/b2b47555f5ad665760931e4ff92a33/.command.log (8 CPUs, 14.8 GB RAM) failed because of MEMLIMIT.
Second attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup/work/33/3f71dacd11dc58cd1605018dc24625/.command.log (14 CPUs, 31.4 GB RAM) succeeded (27.2 GB RAM used).


I will make a release (v1.2.1) right after this pull request is merged.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

We had a bunch of failures where retrying with larger resources didn't help.
They initially needed 16-22 GB of memory (for 12 CPUs) but were only getting 15 GB.
At every attempt they would get more memory (20 GB, 25 GB, etc.), but the
number of CPUs was increasing at the same time, pushing the memory usage
up even faster.

Since it's a relatively small number of failures, I don't want to increase
the base request. Instead, I'm increasing the per-CPU amount of memory
at every attempt so that those species would have succeeded at the second
attempt.

For the largest offender:

- 22 GB required at the first attempt (12 CPUs): 15 GB given -> failure
- 32 GB required at the second attempt (18 CPUs): 34 GB given -> success
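Numerically, the failure mode can be sketched with a toy model (the request formulas and GB-per-CPU rates below are hypothetical; only the two observed data points — 22 GB needed at 12 CPUs, 32 GB needed at 18 CPUs — come from this pull request):

```
def required_gb(cpus):
    # rough line through the two observed points:
    # 12 CPUs -> 22 GB needed, 18 CPUs -> 32 GB needed
    return 2 + 10 * cpus / 6

def old_request_gb(cpus):
    # hypothetical fixed memory-per-CPU ratio: it never catches up
    return 1.25 * cpus

def new_request_gb(cpus, attempt):
    # hypothetical per-attempt increase of the ratio, as this PR proposes
    return 1.25 * cpus * attempt

for attempt, cpus in [(1, 12), (2, 18)]:
    print(attempt, required_gb(cpus), old_request_gb(cpus),
          new_request_gb(cpus, attempt))
```

Under the fixed ratio the request stays below the requirement at every attempt (15 GB vs 22 GB, then 22.5 GB vs 32 GB), while scaling the ratio with the attempt number leaves the first attempt unchanged but clears the requirement on the second try.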
Contrary to my first, limited, assessment of the SORMADUP composite module,
the memory usage does increase with the number of CPUs.

We then found one species for which SORMADUP failed because of MEMLIMIT at every attempt.

I reran the job with all the CPU requirements and checked the memory usage:

```
CPUs  Memory (MB)
  8      22,429
 14      27,303
 20      32,383
 26      37,089
 32      41,990
 38      46,931
```
The least-squares fit is y = 816*x + 15929, where x is the number of CPUs and y the memory in MB.
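That fit can be reproduced with an ordinary least-squares line over the six measurements (plain Python, no external dependencies):

```
cpus = [8, 14, 20, 26, 32, 38]
mem_mb = [22429, 27303, 32383, 37089, 41990, 46931]

n = len(cpus)
mx = sum(cpus) / n
my = sum(mem_mb) / n
# least-squares slope and intercept for memory (MB) vs CPUs
slope = sum((x - mx) * (y - my) for x, y in zip(cpus, mem_mb)) \
        / sum((x - mx) ** 2 for x in cpus)
intercept = my - slope * mx
print(f"y = {slope:.0f}*x + {intercept:.0f}")  # y = 816*x + 15929
```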

I rounded the slope up to 850 MB per CPU and decreased the base offset accordingly,
so that the first-attempt memory assignment remains the same (after all, it's only
1 failure out of many species).
However, I made sure that the per-CPU factor increases at every attempt, so that
this species would have succeeded at the second attempt.
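The resulting SORMADUP rule could look like this (only the 850 MB/CPU slope comes from the fit above; the 8 GB base offset and the selector details are illustrative assumptions, not the pipeline's actual config):

```
withName: 'SAMTOOLS_SORMADUP' {
    // base offset reduced so that the first attempt requests the same
    // total as before; the per-CPU term grows with every attempt
    memory = { 8.GB + 850.MB * task.cpus * task.attempt }
}
```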
@muffato added the bug label on Feb 28, 2024
@muffato self-assigned this on Feb 28, 2024
@muffato merged commit a27e9ce into main on Feb 29, 2024
6 checks passed
@muffato deleted the prod_fix branch on May 22, 2024