Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev => Master 2.3.0 #185

Open
wants to merge 266 commits into
base: master
Choose a base branch
from
Open

Dev => Master 2.3.0 #185

wants to merge 266 commits into from

Conversation

edmundmiller
Copy link
Collaborator

@edmundmiller edmundmiller commented Dec 23, 2024

Waiting for #184

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/nascent branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

edmundmiller and others added 30 commits March 20, 2024 12:10
The only thing using more than one thread is samtools
Not sure how that snuck back in there
Co-authored-by: maxulysse <maxulysse@users.noreply.github.com>
@edmundmiller edmundmiller self-assigned this Dec 23, 2024
@edmundmiller edmundmiller added this to the 2.3.0 milestone Dec 23, 2024
Copy link

github-actions bot commented Dec 23, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 7d9ce8a

+| ✅ 223 tests passed       |+
#| ❔   4 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in ro-crate-metadata.json: "description": "

    \n \n <source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-nascent_logo_dark.png">\n <img alt="nf-core/nascent" src="docs/images/nf-core-nascent_logo_light.png">\n \n

    \n\nGitHub Actions CI Status\nGitHub Actions Linting StatusAWS CICite with Zenodo\nnf-test\n\nNextflow\nrun with conda\nrun with docker\nrun with singularity\nLaunch on Seqera Platform\n\nGet help on SlackFollow on TwitterFollow on MastodonWatch on YouTube\n\n## Introduction\n\nnf-core/nascent is a bioinformatics best-practice analysis pipeline for nascent transcript (NT) and Transcriptional Start Site (TSS) assays.\n\nThe pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!\n\nOn release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.The results obtained from the full-sized test can be viewed on the nf-core website.\n\n## Pipeline summary\n\n1. Read QC (FastQC)\n2. Adapter and quality trimming (fastp)\n3. Alignment\n 1. bwa\n 2. bwamem2\n 3. DRAGMAP\n4. Sort and index alignments (SAMtools)\n5. UMI-based deduplication (UMI-tools)\n6. Duplicate read marking (picard MarkDuplicates)\n7. Quality Control\n 1. RSeQC - Various RNA-seq QC metrics\n 2. Preseq - Estimation of library complexity\n 3. BBMap - Analyzes the sequencing coverage\n8. Coverage Graphs\n 1. Create bedGraph coverage files (BEDTools\n 2. Create bigWig coverage files (deeptools)\n9. Transcript identification\n 1. HOMER\n 2. GroHMM\n 3. PINTS\n10. Quantification of Genes and Nascent Transcripts (featureCounts)\n11. Aggregate report describing results and QC from the whole pipeline (MultiQC)\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow.Make sure to test your setup with -profile test before running the workflow on actual data.\n\n Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.\n Explain what rows and columns represent. For instance (please edit as appropriate):\n\nFirst, prepare a samplesheet with your input data that looks as follows:\n\n\ncsv title=\"samplesheet.csv\nsample,fastq_1,fastq_2\nCONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz\n\n\nEach row represents a fastq file (single-end) or a pair of fastq files (paired end).\n\n\n\nNow, you can run the pipeline using:\n\nbash\nnextflow run nf-core/nascent \\\n -profile <docker/singularity/.../institute> \\\n --input samplesheet.csv \\\n --outdir <OUTDIR>\n\n\n> [!WARNING]\n> Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.\n\nFor more details and further functionality, please refer to the usage documentation and the parameter documentation.\n\n## Pipeline output\n\nTo see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page.\nFor more details about the output files and reports, please refer to the\noutput documentation.\n\n## Credits\n\nnf-core/nascent was originally written by Ignacio Tripodi (@ignaciot) and Margaret Gruca (@magruca).\n\nThe pipeline was re-written in Nextflow DSL2 by Edmund Miller (@edmundmiller) and Sruthi Suresh (@sruthipsuresh) from The Functional Genomics Laboratory at The Univeristy of Texas at Dallas\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n- @apeltzer\n- @ewels\n- @drpatelh\n- @pditommaso\n- @FriederikeHanssen\n- Tae Hoon Kim\n- @easterwoods\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the contributing guidelines.\n\nFor further information or help, don't hesitate to get in touch on the Slack #nascent channel (you can join with this invite).\n\n## Citations\n\nIf you use nf-core/nascent for your analysis, please cite it using the following doi: 10.5281/zenodo.7245273\n\nAn extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.\n\nYou can cite the nf-core publication as follows:\n\n> The nf-core framework for community-curated bioinformatics pipelines.\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.\n",
  • pipeline_todos - TODO string in README.md: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • actions_ci - actions_ci

✅ Tests passed:

Run details

  • nf-core/tools version 3.1.1
  • Run at 2024-12-29 04:09:25

Bump version for 2.3.0 release
@edmundmiller edmundmiller marked this pull request as ready for review December 23, 2024 13:47
Copy link

@sateeshperi sateeshperi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those local module & sub-wfs will need nf-test next


When running the pipeline with groHMM as a transcript identification method, the pipeline will automatically perform a parameter tuning process. This process is unique to the groHMM transcript identification method and is designed to select the optimal hold-out parameters for the groHMM algorithm. See [this issue](https://github.com/dankoc/groHMM/issues/4) for more information.

In the groHMM vignette, the code is ran using a single mclapply call, which is a scatter gather approach. This is not ideal for large datasets, because it ends up being bottlenecked by the memory available on your local machine. To improve this, we have written a Nextflow script that runs the pipeline with a scatter gather approach. This is done by running the pipeline with a single hold-out parameter, and then the next parameter, and so on. This is more memory efficient and scales better to larger datasets. The results are then combined then combined in the end as intended and used in the transcript identification process.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the groHMM vignette, the code is ran using a single mclapply call, which is a scatter gather approach. This is not ideal for large datasets, because it ends up being bottlenecked by the memory available on your local machine. To improve this, we have written a Nextflow script that runs the pipeline with a scatter gather approach. This is done by running the pipeline with a single hold-out parameter, and then the next parameter, and so on. This is more memory efficient and scales better to larger datasets. The results are then combined then combined in the end as intended and used in the transcript identification process.
In the groHMM vignette, the code is ran using a single mclapply call, which is a scatter gather approach. This is not ideal for large datasets, because it ends up being bottle-necked by the memory available on your local machine. To improve this, we have written a Nextflow script that runs the pipeline with a scatter gather approach. This is done by running the pipeline with a single hold-out parameter, and then the next parameter, and so on. This is more memory efficient and scales better to larger datasets. The results are then combined in the end as intended and used in the transcript identification process.

- Mouse: mm10
- Fly: dm6

**This setting is off by default**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**This setting is off by default**
:::info
**This setting is off by default**
:::

Comment on lines +29 to +39
params.fasta = getGenomeAttribute('fasta')
params.gtf = getGenomeAttribute('gtf')
params.gff = getGenomeAttribute('gff')
params.gene_bed = getGenomeAttribute('bed12')
params.bwa_index = getGenomeAttribute('bwa')
params.bwamem2_index = getGenomeAttribute('bwamem2')
params.dragmap = getGenomeAttribute('dragmap')
params.bowtie2_index = getGenomeAttribute('bowtie2')
params.hisat2_index = getGenomeAttribute('hisat2')
params.star_index = null
params.homer_uniqmap = getGenomeAttribute('uniqmap')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
params.fasta = getGenomeAttribute('fasta')
params.gtf = getGenomeAttribute('gtf')
params.gff = getGenomeAttribute('gff')
params.gene_bed = getGenomeAttribute('bed12')
params.bwa_index = getGenomeAttribute('bwa')
params.bwamem2_index = getGenomeAttribute('bwamem2')
params.dragmap = getGenomeAttribute('dragmap')
params.bowtie2_index = getGenomeAttribute('bowtie2')
params.hisat2_index = getGenomeAttribute('hisat2')
params.star_index = null
params.homer_uniqmap = getGenomeAttribute('uniqmap')
params.fasta = getGenomeAttribute('fasta')
params.gtf = getGenomeAttribute('gtf')
params.gff = getGenomeAttribute('gff')
params.gene_bed = getGenomeAttribute('bed12')
params.bwa_index = getGenomeAttribute('bwa')
params.bwamem2_index = getGenomeAttribute('bwamem2')
params.dragmap = getGenomeAttribute('dragmap')
params.bowtie2_index = getGenomeAttribute('bowtie2')
params.hisat2_index = getGenomeAttribute('hisat2')
params.star_index = null
params.homer_uniqmap = getGenomeAttribute('uniqmap')

tuning_file = null
grohmm_min_uts = 5
grohmm_max_uts = 45
// Depends on how you look at this one... But I figured most will ignore the negative

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Depends on how you look at this one... But I figured most will ignore the negative
// Depends on how you look at this one... But I figured most will ignore the negative

Could you clarify what you mean here ?

Copy link

@vagkaratzas vagkaratzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Regular test runs smoothly under 10' and seems easy to follow. I left a couple of comments after a first pass. Will come back for more :D

cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
// TODO nf-core: Check the defaults for all processes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All local modules seem to have labels, and test seems to run fine. You can probably remove this TODO comment

@@ -51,7 +53,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow.Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is left for last. Just a reminder for minimal description of pipeline steps + metro map

channels:
- conda-forge
- bioconda
- defaults

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think - defaults should now be removed from every module

@@ -0,0 +1,170 @@
#!/usr/bin/env Rscript

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this: https://nf-co.re/docs/checklists/reviews/pipeline_release_pr#do-local-code-and-modules
which says that bin scripts should have author and MIT license embedded. I don't have them in my pipeline either and not sure how new and mandatory this rule is; just leaving the link and comment here to figure it out

assets/schema_input.json Show resolved Hide resolved
@@ -225,7 +244,7 @@ The [Preseq](http://smithlabresearch.org/software/preseq/) package is aimed at p
<details markdown="1">
<summary>Output files</summary>

- `bbmap/`
- `quality_control/bbmap/`
- `*.coverage.hist.txt`: Histogram of read coverage over each chromosome
- `*.coverage.stats.txt`: Coverage stats broken down by chromosome including %GC, pos/neg read coverage, total coverage, etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `*.coverage.stats.txt`: Coverage stats broken down by chromosome including %GC, pos/neg read coverage, total coverage, etc.
- `<samplename>.coverage.stats.txt`: Coverage stats broken down by chromosome including %GC, pos/neg read coverage, total coverage, etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I would change the asterisk to everywhere else (when viable) in the output.md likewise.

@@ -209,9 +228,9 @@ The majority of RSeQC scripts generate output files which can be plotted and sum
<details markdown="1">
<summary>Output files</summary>

- `<ALIGNER>/preseq/`
- `quality_control/preseq/`
- `*.lc_extrap.txt`: Preseq expected future yield file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see a <samplename>.c_curve.txt

@@ -240,7 +259,7 @@ The [Preseq](http://smithlabresearch.org/software/preseq/) package is aimed at p
<details markdown="1">
<summary>Output files</summary>

- `bedtools/`
- `coverage_graphs/`
- `*.minus.bedGraph`: Sample coverage file (negative strand only) in bedGraph format
- `*.plus.bedGraph`: Sample coverage file (positive strand only) in bedGraph format

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see a <samplename>.dreg.bedGraph file

### PINTS

<details markdown="1">
<summary>Output files</summary>

- `pints/`
- `transcript_identification/pints/`
- `*_bidirectional_peaks.bed`: Bidirectional TREs (divergent + convergent)
- `*_divergent_peaks.bed`: Divergent TREs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add (optional) to the two files above, since you don't always receive those (at least not with the test.config).

@@ -346,7 +371,7 @@ They've also created some bed files that might be useful for analysis.
<details markdown="1">
<summary>Output files</summary>

- `<ALIGNER>/featurecounts/`
- `quantification/featurecounts/`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under quantification, I got two folders instead: gene and nascent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

4 participants