
Production fixes #87

Merged — 5 commits merged into main from prod_fix on Feb 29, 2024

Conversation

@muffato (Member) commented Feb 28, 2024

TOLIT-1928

This pull request addresses MEMLIMIT errors found on BWAMEM2_MEM and SAMTOOLS_SORMADUP.
For both processes, the jobs were retried 5 times (as per maxRetries) and failed at every attempt, despite the resources (CPUs and memory) increasing.
This is because in both cases the memory is a function of the number of CPUs: as the number of CPUs increases, so does the memory. But the config was such that the memory-per-CPU ratio was not increasing, or not increasing fast enough, so retries could not address the initial MEMLIMIT problem.

The changes I propose are:

  • for SAMTOOLS_SORMADUP: add a per-CPU component to the memory formula.
  • for BWAMEM2_MEM: add the attempt number to the formula so that the memory-per-CPU ratio increases every attempt.

In both cases, I'm not changing the baseline, but rather making sure that the second attempt gives enough memory for the job to succeed.
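As a sketch of the idea for BWAMEM2_MEM (the process name comes from this pull request, but the 1 GB/CPU baseline and the exact closure are illustrative, not the pipeline's actual values):

```
process {
    withName: 'BWAMEM2_MEM' {
        // the memory-per-CPU ratio now grows with the attempt number,
        // so each retry raises memory faster than it raises CPUs
        memory = { 1.GB * task.cpus * task.attempt }
    }
}
```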

bwamem2_mem

Initial failure: /lustre/scratch122/tol/share/weskit/data/prod/03ba/03ba5c76-fccd-49da-a0cc-682362128eb1/exc_priority_1_readmapping/.nextflow.log and /lustre/scratch122/tol/share/weskit/data/prod/03ba/03ba5c76-fccd-49da-a0cc-682362128eb1/exc_priority_1_readmapping/work/bf/2ab57f0e97ef957747d31766cf5cf2/.command.log

Successful run: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2

First attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2/work/86/e2ed519694856ac74b17ba380a5118/.command.log (12 CPUs, 15.4 GB RAM) failed because of MEMLIMIT.
Second attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/bwamem2/work/ce/664d8f5406e98fb33390c3a966ce0d/.command.log (18 CPUs, 34.1 GB RAM) succeeded (22.8 GB RAM used).

sormadup

Initial failure: /lustre/scratch122/tol/share/weskit/data/prod/ea13/ea1384a3-41d8-43ee-96d8-915731b14164/exc_priority_1_readmapping/.nextflow.log and /lustre/scratch122/tol/share/weskit/data/prod/ea13/ea1384a3-41d8-43ee-96d8-915731b14164/exc_priority_1_readmapping/work/7c/a7e6a393747ba7456d9174450850d3/.command.log

Successful run: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup

First attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup/work/ed/b2b47555f5ad665760931e4ff92a33/.command.log (8 CPUs, 14.8 GB RAM) failed because of MEMLIMIT.
Second attempt: /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/rm/sormadup/work/33/3f71dacd11dc58cd1605018dc24625/.command.log (14 CPUs, 31.4 GB RAM) succeeded (27.2 GB RAM used).


I will make a release (v1.2.1) right after this pull request is merged.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

We had a bunch of failures where retrying with larger resources didn't help.
They initially needed 16-22 GB of memory (for 12 CPUs) but were only getting 15 GB.
At every attempt they would get more memory (20 GB, 25 GB, etc.), but the
number of CPUs was increasing at the same time, pushing the memory usage
up even faster.

Since it's a relatively small number of failures, I don't want to increase
the base request. Instead, I'm increasing the per-CPU amount of memory
at every attempt so that those species would have succeeded at the second
attempt.

For the largest offender:

- 22 GB required at the first attempt (12 CPUs): 15 GB given -> failure
- 32 GB required at the second attempt (18 CPUs): 34 GB given -> success
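Numerically, the failure mode can be sketched with a toy model (the request formulas and GB-per-CPU rates below are hypothetical; only the two observed data points — 22 GB needed at 12 CPUs, 32 GB needed at 18 CPUs — come from this pull request):

```
def required_gb(cpus):
    # rough line through the two observed points:
    # 12 CPUs -> 22 GB needed, 18 CPUs -> 32 GB needed
    return 2 + 10 * cpus / 6

def old_request_gb(cpus):
    # hypothetical fixed memory-per-CPU ratio: it never catches up
    return 1.25 * cpus

def new_request_gb(cpus, attempt):
    # hypothetical per-attempt increase of the ratio, as this PR proposes
    return 1.25 * cpus * attempt

for attempt, cpus in [(1, 12), (2, 18)]:
    print(attempt, required_gb(cpus), old_request_gb(cpus),
          new_request_gb(cpus, attempt))
```

Under the fixed ratio the request stays below the requirement at every attempt (15 GB vs 22 GB, then 22.5 GB vs 32 GB), while scaling the ratio with the attempt number leaves the first attempt unchanged but clears the requirement on the second try.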
Contrary to my first, limited, assessment of the SORMADUP composite module,
the memory usage does increase with the number of CPUs.

We then found one species for which SORMADUP failed because of MEMLIMIT at every attempt.

I reran the job with all the CPU requirements and checked the memory usage:

```
CPUs  Memory (MB)
  8      22,429
 14      27,303
 20      32,383
 26      37,089
 32      41,990
 38      46,931
```
The least-squares fit is y = 816*x + 15929, where x is the number of CPUs and y the memory in MB.
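That fit can be reproduced with an ordinary least-squares line over the six measurements (plain Python, no external dependencies):

```
cpus = [8, 14, 20, 26, 32, 38]
mem_mb = [22429, 27303, 32383, 37089, 41990, 46931]

n = len(cpus)
mx = sum(cpus) / n
my = sum(mem_mb) / n
# least-squares slope and intercept for memory (MB) vs CPUs
slope = sum((x - mx) * (y - my) for x, y in zip(cpus, mem_mb)) \
        / sum((x - mx) ** 2 for x in cpus)
intercept = my - slope * mx
print(f"y = {slope:.0f}*x + {intercept:.0f}")  # y = 816*x + 15929
```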

I rounded the slope up to 850 MB per CPU and decreased the base offset accordingly,
so that the first-attempt memory assignment remains the same (after all, it's only
1 failure out of many species).
However, I made sure that the per-CPU factor increases at every attempt, so that
this species would have succeeded at the second attempt.
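The resulting SORMADUP rule could look like this (only the 850 MB/CPU slope comes from the fit above; the 8 GB base offset and the selector details are illustrative assumptions, not the pipeline's actual config):

```
withName: 'SAMTOOLS_SORMADUP' {
    // base offset reduced so that the first attempt requests the same
    // total as before; the per-CPU term grows with every attempt
    memory = { 8.GB + 850.MB * task.cpus * task.attempt }
}
```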
@muffato added the bug label on Feb 28, 2024
@muffato self-assigned this on Feb 28, 2024
@muffato merged commit a27e9ce into main on Feb 29, 2024
6 checks passed
@muffato deleted the prod_fix branch on May 22, 2024