1357 grouping strategy applied by counting number of FASTQ files generated by FASTP #1364

Merged

Conversation

adamrtalbot
Contributor

@adamrtalbot adamrtalbot commented Dec 20, 2023

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Changes:
 - The grouping strategy for sharded data has been improved
 - The number of BAM files per sample is calculated by grouping the split FASTQ files by sample ID and counting the total number of FASTQ files created.
 - This has to wait for FASTP to produce all FASTQ files, but it is more reliable.
 - After alignment, the number of FASTQ files determines how many BAM files groupBy should expect for each sample.
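
The counting strategy described above can be modelled outside Nextflow. The following is a minimal plain-Python sketch, not the pipeline's code: the function names `count_shards` and `group_bams`, the sample names, and the tuple shapes are all assumptions made for illustration. It counts the FASTQ shards per sample first, then releases a sample's BAM group only once that many BAM files have arrived.

```python
from collections import Counter, defaultdict


def count_shards(fastq_shards):
    """Count FASTQ shards per sample; needs the complete shard list."""
    return Counter(sample for sample, _path in fastq_shards)


def group_bams(bam_stream, expected):
    """Yield (sample, bams) once a sample has all of its expected BAM files."""
    pending = defaultdict(list)
    for sample, bam in bam_stream:
        pending[sample].append(bam)
        if len(pending[sample]) == expected[sample]:
            yield sample, pending.pop(sample)


# Hypothetical shards: sample_a has a large lane (3 shards) and a tiny lane (1 shard).
shards = [
    ("sample_a", "lane1_0001.fastq.gz"),
    ("sample_a", "lane1_0002.fastq.gz"),
    ("sample_a", "lane1_0003.fastq.gz"),
    ("sample_a", "lane2_0001.fastq.gz"),
    ("sample_b", "lane1_0001.fastq.gz"),
]
expected = count_shards(shards)

# One BAM per shard, arriving in arbitrary order after alignment.
bams = [
    ("sample_a", "a1.bam"),
    ("sample_b", "b1.bam"),
    ("sample_a", "a2.bam"),
    ("sample_a", "a3.bam"),
    ("sample_a", "a4.bam"),
]
grouped = list(group_bams(bams, expected))
```

Because the expected group size is counted rather than estimated, a lane with very few reads cannot cause a size mismatch. The cost, as noted above, is that counting cannot finish until every shard exists.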

Fixes #1357
Changes:
 - FASTP uses blocks of 250 reads when splitting a FASTQ file.
 - This update makes 250 reads the minimum block size for splitting a FASTQ file.
 - The help text is updated accordingly.
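
As an illustration of the constraint described above, here is a small Python sketch of a parameter check. It is not the pipeline's own validation, which lives in its schema; the function name `check_split_size` and the constant name are hypothetical, and only the value 250 comes from this PR description.

```python
MIN_SPLIT_READS = 250  # minimum block size stated in the PR description


def check_split_size(reads_per_file):
    """Return the requested split size, or raise if it is below the minimum."""
    if reads_per_file < MIN_SPLIT_READS:
        raise ValueError(
            f"split size {reads_per_file} is below the minimum of "
            f"{MIN_SPLIT_READS} reads"
        )
    return reads_per_file
```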

Fixes #1363

github-actions bot commented Dec 20, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit b815378

  • ✅ 146 tests passed
  • ❔ 10 tests were ignored
  • ❗ 2 tests had warnings

❗ Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • pipeline_todos - TODO string in WorkflowSarek.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: conf/modules.config
  • files_unchanged - File ignored due to lint config: assets/nf-core-sarek_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-sarek_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-sarek_logo_dark.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_ci - actions_ci
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/sarek/sarek/.github/workflows/awstest.yml
  • template_strings - template_strings

✅ Tests passed:

Run details

  • nf-core/tools version 2.11.1
  • Run at 2023-12-21 17:41:30

@adamrtalbot
Contributor Author

@FriederikeHanssen @maxulysse I created a test dataset with only 60 reads in lane 2 to reproduce this problem. The files are attached here; you'll need to modify the paths in the input samplesheet.
fastq_single.csv
test_1_slice60.1.fastq.gz
test_1_slice60.2.fastq.gz

There are no tests in Sarek right now. What shall we do? Adding one would be easy, but we would need to add more data to test-datasets (such as these FASTQ files).

@adamrtalbot adamrtalbot changed the title from "1357 grouping strategy fails with large mismatch in sizes" to "1357 grouping strategy applied by counting number of FASTQ files generated by FASTP" Dec 20, 2023
Member

@maxulysse maxulysse left a comment

LGTM

workflows/sarek.nf (outdated, resolved)
@maxulysse
Member

Can you update changelog too?

@FriederikeHanssen
Contributor

Testing might be good, but that data probably can't be added to the modules repo, right?

nextflow_schema.json (outdated, resolved)
@adamrtalbot
Contributor Author

adamrtalbot commented Dec 21, 2023

> Testing might be good, but that data probably can't be added to the modules repo, right?

I don't see why not. I just sliced 60 reads from the existing data. Alternatively we could generate it on the fly?

Here is the mini workflow to generate a channel with a slice of the reads:

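// Builds a test channel containing the full-size FASTQ pair plus a 60-read slice of it
workflow UNEVEN_FASTQ {
    take:
        csv // path to the input samplesheet

    main:
        // Read the first samplesheet row into [ meta, fastq_1, fastq_2 ]
        ch_csv = Channel.fromPath(csv, checkIfExists: true)
            .splitCsv(header: true)
            .map { row ->
                [
                    [
                        patient: row.patient,
                        sex:     row.sex,
                        status:  row.status,
                        sample:  row.sample,
                        lane:    "small_lane"
                    ],
                    file(row.fastq_1),
                    file(row.fastq_2)
                ]
            }
            .first()
        // Keep only the first 60-read chunk of the pair, then mix it with the full-size entry
        ch_csv
            .splitFastq(by: 60, file: true, pe: true)
            .map { meta, read1, read2 -> [ meta, [ read1, read2 ] ] }
            .first()
            .mix(ch_csv)
            .set { fastq }

    emit:
        fastq
}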
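
For readers less familiar with `splitFastq`, the effect of keeping only the first 60-read chunk of a read pair can be approximated in plain Python. This is a sketch of the idea only, not how Nextflow implements it: `first_reads`, `slice_pair` and `fake_fastq` are hypothetical helpers, the data is synthetic, and well-formed four-line FASTQ records are assumed.

```python
from itertools import islice


def first_reads(lines, n_reads=60):
    """Return the lines of the first n_reads FASTQ records (4 lines per record)."""
    return list(islice(lines, 4 * n_reads))


def slice_pair(read1_lines, read2_lines, n_reads=60):
    """Slice both mates identically so the pairs stay in sync."""
    return first_reads(read1_lines, n_reads), first_reads(read2_lines, n_reads)


def fake_fastq(n_reads, mate):
    """Generate synthetic FASTQ lines, for illustration only."""
    for i in range(n_reads):
        yield f"@read{i}/{mate}\n"
        yield "ACGT\n"
        yield "+\n"
        yield "IIII\n"


# Slice a 100-read synthetic pair down to its first 60 reads.
small_1, small_2 = slice_pair(fake_fastq(100, 1), fake_fastq(100, 2))
```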

Contributor

@FriederikeHanssen FriederikeHanssen left a comment

🚀 😍

// Group
.groupTuple()

bai_mapped = FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON.out.bai
Contributor

BAI is only produced/tested with Sentieon, but since it is the same, it should work.

@adamrtalbot adamrtalbot merged commit 048f06e into dev Dec 21, 2023
23 checks passed
@adamrtalbot adamrtalbot deleted the 1357_grouping_strategy_fails_with_large_mismatch_in_sizes branch December 21, 2023 18:38
3 participants