Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add input type fastq.gz and fq.gz for Illumina and HiC reads #96

Merged
merged 8 commits into from
Jul 22, 2024

Conversation

reichan1998
Copy link
Contributor

@reichan1998 reichan1998 commented Jul 9, 2024

To solve issue #88

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@reichan1998 reichan1998 changed the title add input type fastq.gz and fg.gz for Illumina reads add input type fastq.gz and fq.gz for Illumina reads Jul 9, 2024
@tkchafin
Copy link
Contributor

tkchafin commented Jul 9, 2024

For this one since align_illumina_fastq is just copying all of the steps (except for CRAM -> fastq conversion), it would be simpler to using a branch operation to determine if the SAMTOOLS_FASTQ step is needed (true for CRAM but false if input type is FASTQ).

For example you can see further on in align_short.nf where there is a branch operation to determine whether SAMTOOLS_MERGE will be run:

    // Collect all BWAMEM2 output by sample name
    BWAMEM2_MEM.out.bam
    | map { meta, bam -> [['id': meta.id.split('_')[0..-2].join('_'), 'datatype': meta.datatype], meta.read_count, bam] }
    | groupTuple( by: [0] )
    | map { meta, read_counts, bams -> [meta + [read_count: read_counts.sum()], bams] }
    | branch {
        meta, bams ->
            single_bam: bams.size() == 1
            multi_bams: true
    }
    | set { ch_bams }


    // Merge, but only if there is more than 1 file
    SAMTOOLS_MERGE ( ch_bams.multi_bams, [ [], [] ], [ [], [] ] )
    ch_versions = ch_versions.mix ( SAMTOOLS_MERGE.out.versions.first() )

The branch section of this is creating two branches: single_bam if the sample has a single BAM file, or multi_bams, which catches all cases where there is not a single BAM file:

    | branch {
        meta, bams ->
            single_bam: bams.size() == 1
            multi_bams: true
    }

The next step is we run SAMTOOLS_MERGE only on the multi_bams branch, and then combine the outputs with the single_bam branch. The process that you will need to implement for FASTQ vs CRAM inputs will be similar

@tkchafin
Copy link
Contributor

tkchafin commented Jul 9, 2024

Also note that ALIGN_SHORT is actually used for both Illumina and HiC reads, so any changes in documentation for Illumina also applies to HiC:

include { ALIGN_SHORT as ALIGN_HIC      } from '../subworkflows/local/align_short'
include { ALIGN_SHORT as ALIGN_ILLUMINA } from '../subworkflows/local/align_short'

@reichan1998
Copy link
Contributor Author

Thank you very much for your instructions, @tkchafin. I noticed that ALIGN_SHORT is used for both Illumina and HiC reads. Therefore, I decided to add a condition to branch reads with the fastq.gz format in readmapping.nf, so it will leave the ALIGN_HIC unchanged. Your approach is certainly more efficient and structured, and I will adjust as you described.

@tkchafin
Copy link
Contributor

tkchafin commented Jul 9, 2024

I think we can allow the FASTQ option for both HiC and Illumina, so it might be best to keep the branching inside of align_short (the feature request actually was about HiC #88 )

@reichan1998
Copy link
Contributor Author

reichan1998 commented Jul 9, 2024

I'm so sorry for the oversight and misunderstanding regarding the FASTQ option for HiC. I've made the changes as per your instructions . Thank you for your review and guidance.

@reichan1998 reichan1998 changed the title add input type fastq.gz and fq.gz for Illumina reads add input type fastq.gz and fq.gz for Illumina and HiC reads Jul 9, 2024
@reichan1998
Copy link
Contributor Author

Hi @tkchafin, I've made the changes following your instructions. Could you please help me review them? I really appreciate your guidance. Thank you!

@muffato muffato requested a review from tkchafin July 10, 2024 13:56
@muffato muffato linked an issue Jul 12, 2024 that may be closed by this pull request
@tkchafin
Copy link
Contributor

Hi @reichan1998 thanks for these changes. I've been off sick the past few days, but will have a look soon.

Copy link
Contributor

@tkchafin tkchafin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job on these changes @reichan1998! The next step is we should incorporate the new input type into the test profile. If you look in conf/test.config you can see the settings and sample sheet used when you run with -profile test. This points you to assets/samplesheet_s3.csv which gives paths to files saved remotely in AWS S3 buckets.

Since there are already two, would suggest we try converting one of the Illumina samples from CRAM to FASTQ so tests will cover both branches of your new align_short.nf:

mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4%231.subset.cram,

You can download the file with curl -L -O mMelMel2/illumina/31231_4%231.subset.cram and then convert to fastq using samtools fastq.

First try manually editing the sample sheet to point to your new local file to test, and if this passes you can place the file onto the farm and give me the path and I'll add it to the remotely hosted test files.

@tkchafin
Copy link
Contributor

I have added the file to the hosted test_data, the path is https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4%231.subset.fastq so you can put this into the assets/samplesheet_s3.csv for mMelMel2,illumina to test the pipeline with the fastq input option for that sample

@tkchafin tkchafin merged commit 313a325 into sanger-tol:dev Jul 22, 2024
6 checks passed
@reichan1998 reichan1998 deleted the edit_cram_input branch August 16, 2024 11:10
@tkchafin tkchafin mentioned this pull request Aug 20, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cram input necessary?
2 participants