fix: canonical transcript mapped read extraction (#77)
The main aim of this PR is to speed up the extraction of reads mapped to the canonical transcripts in the `rule get_mapped_canonical_transcripts`. Previously, this took about 44GB of memory (regardless of the input BAM file size) and hours of grepping. It now takes seconds to minutes, with almost no memory footprint. This happens here: https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/compare/fix-canonical-transcript-mapped-read-extraction?expand=1#diff-6562a38fb77f8839a8731b8a882bf0d0b683d6268cdffb433dcbe1f360ccedc4R103

This required using a BED file. Thus, I switched to generating a valid BED file in the `rule get_canonical_ids`, and to using that BED file for keeping track of transcript strand information instead of hacking this into the contig names of the reference fasta file. This led to some cleanup in the workflow.

Other things that happened along the way:

* removed an unnecessary `r-dplyr` dependency, as this is pulled in by `r-tidyverse` anyway
* cleaned up file names to more clearly reflect what they contain
* cleaned up variable / column names in the python script to use only lowercase letters
* removed some redundancy in wrapper calling by re-`use`ing the `samtools index` rule, so we only have to update the wrapper version in one place and all instances always stay in sync

For now, this is not yet tested, so I'll mark this as a draft to start with. But I wanted it up here to be able to test it on different setups by checking out the branch.

---------

Co-authored-by: Lähnemann <d189n@odcf-worker02.inet.dkfz-heidelberg.de>
Co-authored-by: Addimator <adpri100@hhu.de>
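The BED-based bookkeeping described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the workflow's actual script: the function names and the `(transcript_id, length, strand)` input shape are hypothetical, the transcript IDs are made up, and it assumes reads were aligned against a transcriptome so that each transcript is its own reference sequence. Each canonical transcript becomes one BED6 line spanning its full length, with the strand kept in column 6 instead of in the contig name.

```python
import io


def canonical_bed_lines(transcripts):
    """Yield one BED6 line per canonical transcript.

    `transcripts` is an iterable of (transcript_id, length, strand) tuples.
    This input shape is an assumption made for illustration only.
    """
    for transcript_id, length, strand in transcripts:
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} for {transcript_id}")
        # BED intervals are 0-based and half-open, so the whole
        # transcript is the interval [0, length).
        fields = [transcript_id, "0", str(length), transcript_id, "0", strand]
        yield "\t".join(fields)


def write_canonical_bed(transcripts, handle):
    """Write the BED6 lines to an open text handle."""
    for line in canonical_bed_lines(transcripts):
        handle.write(line + "\n")


# Hypothetical example input; these IDs are placeholders, not real transcripts.
example = [("TRANSCRIPT_A", 1500, "+"), ("TRANSCRIPT_B", 820, "-")]
buffer = io.StringIO()
write_canonical_bed(example, buffer)
bed_text = buffer.getvalue()
```

A BED file of this shape can be handed to `samtools view -L <bed>`, which keeps only reads overlapping the listed regions. Whether the workflow uses exactly this call is not shown in this excerpt.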
1 parent e7ddff6, commit 52b56b0
Showing 37 changed files with 436 additions and 526 deletions.
@@ -0,0 +1,12 @@
The testing data for this test case is downloaded from this Zenodo dataset:
https://zenodo.org/doi/10.5281/zenodo.10572745

It has been generated by this snakemake workflow:
https://github.com/dlaehnemann/create-quant-seq-testing-dataset

So it is based from data from this publication:

Corley, S.M., Troy, N.M., Bosco, A. et al. QuantSeq. 3′ Sequencing combined with Salmon provides a fast, reliable approach for high throughput RNA expression analysis. Sci Rep 9, 18895 (2019). https://doi.org/10.1038/s41598-019-55434-x

The MSigDB gene sets are used according to their [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/), which is given here:
https://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp
@@ -0,0 +1,2 @@
UROSEVIC_RESPONSE_TO_IMIQUIMOD https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/UROSEVIC_RESPONSE_TO_IMIQUIMOD BRCA1 CCL8 CXCL11 IDO1 IFI35 IFI6 IFITM1 IL6 IRF7 ISG15 ISG20 MICB MX1 OAS2 OASL PLAAT4 SECTM1 STAT1 TRAFD1
KEGG_PROTEASOME https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KEGG_PROTEASOME IFNG POMP PSMA1 PSMA2 PSMA3 PSMA4 PSMA5 PSMA6 PSMA6P4 PSMA7 PSMA8 PSMB1 PSMB10 PSMB11 PSMB2 PSMB3 PSMB4 PSMB5 PSMB6 PSMB7 PSMB8 PSMB9 PSMC1 PSMC1P4 PSMC2 PSMC3 PSMC4 PSMC5 PSMC6 PSMD1 PSMD11 PSMD12 PSMD13 PSMD14 PSMD2 PSMD3 PSMD4 PSMD6 PSMD7 PSMD8 PSME1 PSME2 PSME3 PSME4 PSMF1 SEM1
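The two lines above follow the gene set layout used by MSigDB downloads: a set name, a description field (here a URL), then the member genes. A minimal sketch of reading one such line is given below. The function name `parse_gene_set_line` is hypothetical, and splitting on any whitespace is a simplifying assumption that only holds because none of the fields shown here contain spaces; the example line is shortened to three genes.

```python
def parse_gene_set_line(line):
    """Split one gene set line into (name, description, genes).

    Splitting on arbitrary whitespace is a simplification for this
    sketch; it assumes no field contains embedded spaces.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected a name, a description and at least one gene")
    name, description, *genes = fields
    return name, description, genes


# Shortened example based on the first gene set shown above.
example_line = (
    "UROSEVIC_RESPONSE_TO_IMIQUIMOD "
    "https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/UROSEVIC_RESPONSE_TO_IMIQUIMOD "
    "BRCA1 CCL8 CXCL11"
)
name, description, genes = parse_gene_set_line(example_line)
```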
@@ -0,0 +1,7 @@
sample condition
SRR8309096 Control
SRR8309094 Control
SRR8309095 Treated
SRR8309097 Treated
SRR8309098 Control
SRR8309099 Treated
@@ -0,0 +1,7 @@
sample unit fragment_len_mean fragment_len_sd fq1 fq2
SRR8309096 u1 430 43 quant_seq_test_data/SRR8309096.fastq.gz
SRR8309094 u1 430 43 quant_seq_test_data/SRR8309094.fastq.gz
SRR8309095 u1 430 43 quant_seq_test_data/SRR8309095.fastq.gz
SRR8309097 u1 430 43 quant_seq_test_data/SRR8309097.fastq.gz
SRR8309098 u1 430 43 quant_seq_test_data/SRR8309098.fastq.gz
SRR8309099 u1 430 43 quant_seq_test_data/SRR8309099.fastq.gz
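In the unit sheet above, every row has an empty `fq2` column, so the units are single-end. For single-end input, `kallisto quant` needs `--single` together with a fragment length mean (`-l`) and standard deviation (`-s`), which is what the `fragment_len_mean` and `fragment_len_sd` columns can supply. The sketch below shows one way a row could be turned into the read-related arguments. The function name `kallisto_read_args` is hypothetical, and this is an assumption about how such a sheet might be consumed, not the workflow's own code.

```python
def kallisto_read_args(unit):
    """Build the read-related `kallisto quant` arguments for one unit row.

    `unit` is a dict keyed by the unit sheet's column names. Treating an
    empty or missing `fq2` as "single-end" is an assumption of this sketch.
    """
    fq1 = unit["fq1"]
    fq2 = unit.get("fq2") or ""
    if fq2:
        # Paired-end: kallisto estimates the fragment length itself.
        return [fq1, fq2]
    # Single-end: fragment length mean and sd must be given explicitly.
    return [
        "--single",
        "-l", str(unit["fragment_len_mean"]),
        "-s", str(unit["fragment_len_sd"]),
        fq1,
    ]


# First data row of the unit sheet shown above.
row = {
    "sample": "SRR8309096",
    "unit": "u1",
    "fragment_len_mean": "430",
    "fragment_len_sd": "43",
    "fq1": "quant_seq_test_data/SRR8309096.fastq.gz",
    "fq2": "",
}
single_end_args = kallisto_read_args(row)
```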
@@ -0,0 +1,44 @@
from snakemake.utils import min_version

min_version("7.17.0")


configfile: "config/config.yaml"


# declare main workflow as a module
module rna_seq_kallisto_sleuth:
    snakefile:
        "../../../workflow/Snakefile"
    config:
        config


use rule * from rna_seq_kallisto_sleuth


rule download_quant_seq_testing_data:
    output:
        "quant_seq_test_data.tar.gz",
    log:
        "logs/download_quant_seq_testing_data.log",
    shell:
        "wget https://zenodo.org/records/10572746/files/quant_seq_test_data.tar.gz 2>{log} "


rule extract_quant_seq_testing_data:
    input:
        "quant_seq_test_data.tar.gz",
    output:
        "quant_seq_test_data/README.md",
        "quant_seq_test_data/SRR8309099.fastq.gz",
        "quant_seq_test_data/SRR8309095.fastq.gz",
        "quant_seq_test_data/SRR8309096.fastq.gz",
        "quant_seq_test_data/samples.tsv",
        "quant_seq_test_data/SRR8309097.fastq.gz",
        "quant_seq_test_data/SRR8309094.fastq.gz",
        "quant_seq_test_data/SRR8309098.fastq.gz",
    log:
        "logs/extract_quant_seq_testing_data.log",
    shell:
        "tar xzfv {input} 2>{log}"
@@ -3,4 +3,5 @@ channels:
   - bioconda
   - nodefaults
 dependencies:
-  - biopython =1.79
+  - biopython =1.79
+  - pandas >=1.3,<2
@@ -3,4 +3,5 @@ channels:
   - bioconda
   - nodefaults
 dependencies:
-  - pysam =0.19
+  - pysam =0.21
+  - pandas =2.0