- Make Drosophila public RNA-Seq data easier to access
- Normalize sample annotations for easier discovery
- Create a gene expression atlas across stages and tissues
- Provide necessary data for quick preliminary analysis
- Have all data available through GEO GSE117217
- Have data easier to access through the drosSRA CLI (WORK IN PROGRESS)
- Use SraMongo to download associated metadata for all Drosophila melanogaster samples.
- The Pre-Alignment workflow pre-processes all samples to automatically discover/validate technical metadata associated with each sample.
- The Alignment workflow processes all RNA-Seq samples that pass filtering to generate coverage counts and genome browser tracks.
- The Metadata workflow normalizes biological metadata to FlyBase controlled vocabulary.
- All data is accessible directly from GEO. We are also developing a command line tool, drosSRA, to let users easily access the data and run different types of queries. Finally, we provide genome browser visualization through the FlyBase JBrowse instance.
This project uses a set of snakemake workflows. The snakemake workflows run in a pre-built singularity container which has all of the required software.
To run the singularity container you need to have singularity, mongoDB, and miniconda installed.
To create the running environment, run:
$ conda env create --file environment.yaml
This project makes use of the following environment variables:
export SLACK_SNAKEMAKE_BOT_TOKEN=<secret>
export ENTREZ_API_KEY=<secret>
export PROJECT_PATH=<absolute path to where folder is cloned>
export SINGULARITY_IMG=$PROJECT_PATH/singularity/drosSRA_workflow.sif
export SINGULARITY_BINDPATH=<list of paths that need to be mounted in the singularity container>
export SLURM_JOBID=<optional, used to make temp directories on /lscratch>
export TMPDIR=<optional, used to make temp directories>
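Before launching a workflow it can help to confirm that these variables are actually set. The following is a minimal sketch, not part of the project: the helper name `check_environment` is hypothetical, and it assumes that every variable not marked optional above is required.

```python
import os

# Variables listed above without an "optional" note.
# Treating all of them as required is an assumption of this sketch.
REQUIRED = [
    "SLACK_SNAKEMAKE_BOT_TOKEN",
    "ENTREZ_API_KEY",
    "PROJECT_PATH",
    "SINGULARITY_IMG",
    "SINGULARITY_BINDPATH",
]

# Variables marked as optional above.
OPTIONAL = ["SLURM_JOBID", "TMPDIR"]


def check_environment(env=None):
    """Return (missing_required, unset_optional) for an environment mapping.

    Hypothetical helper; defaults to the current process environment.
    Empty values are treated the same as unset ones.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    unset = [name for name in OPTIONAL if not env.get(name)]
    return missing, unset
```

Calling `check_environment()` with no arguments inspects the current shell environment; a non-empty first list means a required variable still needs to be exported.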
For development I have some extra packages that are useful.
$ conda env update -n drosSRA_workflow --file environment_dev.yaml
It is also helpful to pip install the two Python packages that are distributed as part of this project.
$ conda activate drosSRA_workflow
[drosSRA_workflow] $ pip install -e src/
[drosSRA_workflow] $ pip install -e biometa-app/
This workflow runs SraMongo and builds a list of all SRXs and their SRRs (./output/srx2srr.csv).
NOTE 1: SraMongo requires the environment variable $ENTREZ_API_KEY to be set. This API key can be generated by following these directions.
NOTE 2: The majority of workflows only need ./output/srx2srr.csv; they do not require the local mongoDB. I am not able to have an always-on instance of mongoDB on our cluster, so workflows that are typically run on the cluster (./prealn-wf/Snakefile and ./rnaseq-wf/Snakefile) do not need access to it.
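Because downstream workflows depend only on the SRX-to-SRR table, it can be loaded without any database connection. The sketch below shows one way to do that with the standard library. The function name `load_srx2srr` is hypothetical, and the column names `srx` and `srr` are assumptions about the CSV header that should be checked against the real file.

```python
import csv
from collections import defaultdict


def load_srx2srr(path, srx_col="srx", srr_col="srr"):
    """Map each SRX accession to the list of its SRR accessions.

    The default column names are assumptions, not taken from the project;
    pass different names if the real header differs.
    """
    mapping = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            mapping[row[srx_col]].append(row[srr_col])
    return dict(mapping)
```

For example, `load_srx2srr("./output/srx2srr.csv")` would return a dictionary keyed by SRX, which is a convenient shape for expanding one experiment into its runs.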
This workflow downloads FASTQ data from the SRA and checks that the download was successful. It was initially designed as a subworkflow, but snakemake was not running groups correctly with subworkflows. Currently, I just pull the rules from this workflow into ./prealn-wf/Snakefile and ./rnaseq-wf/Snakefile.
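How the download is verified is not described here. As an illustration only, one plausible check is that a gzipped FASTQ decompresses cleanly and consists of whole four-line records. The function `fastq_looks_complete` below is hypothetical and is not the project's actual rule.

```python
import gzip
import zlib


def fastq_looks_complete(path):
    """Return True if a gzipped FASTQ has at least one record and every
    record is a well-formed 4-line block.

    Illustrative check only; the project's real validation may differ.
    """
    n_records = 0
    try:
        with gzip.open(path, "rt") as handle:
            while True:
                header = handle.readline()
                if not header:
                    break  # clean end of file
                seq = handle.readline().rstrip("\n")
                plus = handle.readline()
                qual = handle.readline().rstrip("\n")
                # Header and separator lines must carry their markers.
                if not header.startswith("@") or not plus.startswith("+"):
                    return False
                # Sequence and quality strings must be the same length.
                if not seq or len(seq) != len(qual):
                    return False
                n_records += 1
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupt gzip streams raise while being read.
        return False
    return n_records > 0
```

A check like this reads the whole file, so it is slow on large runs, but it catches truncated transfers that a simple file-exists test would miss.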
This workflow downloads FASTQs for all samples and generates various QC features describing each sample.
This workflow runs outlier detection on the RNA-Seq samples. It creates a golden set of RNA-Seq samples to proceed with.
This is the meat of the project. It provides all deliverables.
I am currently working on this section. I am hand-normalizing samples, using a web app (./biometa-app) to assist me. Once this is done I will build up the outlier detection for each discrete group.
I am in the middle of a major refactoring to simplify the project. The following workflows are deprecated or broken.
- ./agg-rnaseq-wf
- ./aln-downstream-wf
- ./aln-wf: The old alignment workflow.
- ./geo-wf: This is the workflow that put together the data currently uploaded to GEO. It included biological metadata processing.
- ./stranded-bigwig-wf: This is the workflow used to generate the current aggregated track up on FlyBase.
- ./ovary-rnaseq-wf: Pulled out ovary data for another project.
- ./testis-rnaseq-pe-wf, ./testis-rnaseq-stranded-wf, ./testis-rnaseq-wf: I was working on annotating the testis transcriptome. This part of the project became too large for our current scope, so it has been moved to a separate project.