-
Couldn't load subscription status.
- Fork 1
Bioinformatics Pipeline for detecting novel spliced regions in RNA-Seq Data
License
Couldn't load subscription status.
xuric/read-split-run
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
README
This is the software package for the Read-Split-Run pipeline. Included are the scripts and software necessary to run the entire process from beginning to end.
INSTALLATION
Installing the pipeline:
decompress rsr.tar.gz (tar -zxvf rsr.tar.gz) to the location of your choice; this shall be the INSTALLATION DIRECTORY.
Perl 5.16 (or later):
The converter for the gene reference file (see below) requires Perl to run. We tested it with version 5.16, though it is likely that earlier versions will work. Perl can be downloaded from http://www.perl.org .
To verify your installation of Perl is compatable, check the output of the command; perl --version
Python 2.7 (or later):
The encoding guesser requires that python be installed on the system and its executable in an accessible location. It is likely that earlier versions of python will work, but our tests used version 2.7.5. Python can be downloaded from https://www.python.org .
To verify your installation of python is compatable, check the output of the command; python --version
Bowtie 1.0.1 (or later):
Ensure that you have bowtie (version 1.0.1 or later) install. By default, Read Split Run will look for bowtie in the INSTALLATION DIRECTORY/bowtie, however if you install it in another location, make sure it is in the PATH (see also: CONFIGURATION and DEPENDENCIES, below; it is recommended that the path to bowtie be in the PATH anyway). It can be downloaded from http://bowtie-bio.sourceforge.net.
To verify your installation of bowtie is compatable, check the output of the command: bowtie --version
gcc (g++) version 4.8 or better:
Compiling the associated programs requires gcc version 4.8 or higher (to accommodate features of c++11 that are used in splitpairs).
To verify your installation of bowtie is compatable, check the output of the command; gcc --version
When Gcc is installed, one can easily create the necessary binary programs from the src directory by changing directory to the INSTALLATION DIRECTORY and typing 'make'. This will automatically make each of the programs. Note that binaries are included in the package. These binaries were compiled on linux, and may or may not work on your system. It is recommended to run 'make' to compile binaries for your system before use.
Bowtie Index files:
You will need bowtie index files and knownGene files for the genome(s) of your choice. Put the index files in the indexes subdirectory of your bowtie install directory (if you cannot install there, any directory will work). SET THE THE BOWTIE_INDEXES ENVIRONMENT VARIABLE to the location of the indexes (this is exactly the same as the requirement for running bowtie by itself).
Indexes can be found on the bowtie website (http://bowtie-bio.sourceforge.net). We use the human hg19 genome (ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip) and mouse mm9 (ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9.ebwt.zip).
Note: Normally, bowtie will attempt to guess where the index files are, but because of the necessities below, we cannot do that. you MUST set the variable.
Setting the variable examples:
in bash: export BOWTIE_INDEXES=/usr/local/bowtie/indexes
in csh: setenv BOWTIE_INDEXES /usr/local/bowtie/indexes
refFlat files:
For the genomes to which you plan to align, download and uncompress their refFlat reference file from UCSC (http://hgdownload.cse.ucsc.edu/downloads.html). Place the refFlat.txt file in the same directory as your bowtie indexes, and CHANGE THE NAME OF THE FILE TO HAVE THE FOLLOWING PATTERN: OrganismAssemblyName.refFlat.txt (i.e. If you plan on using Human assembly hg19, its refFlat.txt should be "hg19.refFlat.txt"; if you plan on using Mouse assembly mm9sp35, make its refFlat file "mm9sp35.refFlat.txt"; these files are CASE SENSITIVE.
Human hg19 reference can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
Mouse mm9 reference can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/refFlat.txt.gz
Furthermore, the splitPairs portion of Read-Split-Run requires a special parsed refFlat reference with intron/exon boundaries identified. We provide a script to create this file called "refflat_parse_RSW.pl" in the RSR INSTALLATION DIECTORY. Run this script and supply the refFlat reference file as the input argument...
Ex: perl refflat_parse_RSW.pl /usr/local/bowtie/indexes/hg19.refFlat.txt
It will generate the annotated file "hg19.refFlat.txt.intronBoundary.exonsgaps"
CONFIGURATION (optional):
There is no mandatory configuration that needs be done if you have followed the installation instructions to this point. The following are presented as options for the advanced user.
The configuration file: "rsr_config.sh" contains all the configurable values used by the pipeline. It can be edited with any plain-text editor (nano, vim, etc). You can change the following variables in the USER CONFIGURATION section to suit your needs:
BOWTIE_TEMP_DIR location to store intermediate bowtie files.
Default: INSTALLATION DIRECTORY/tmp/bowtie
SPLIT_TEMP_DIR location to store intermediate split reads files.
Default: INSTALLATION DIRECTORY/tmp/split
RSR_TEMP_DIR location to store intermediate RSR options and output files.
Default: INSTALLATION DIRECTORY/tmp/splitPairs
LOG_DIR location to store diagnostic and operational logs.
Default: INSTALLATION DIRECTORY/logs
BOWTIE_PROGRAM the absolute path to bowtie executable (e.g. /usr/bin/bowtie).
Default: INSTALLATION DIRECTORY/bowtie/bowtie
(if the bowtie executable is not in this directory, RSR will check the PATH).
REFDIR Directory containing refFlat files (and gene Intron/exon boundary files). All files for your available genomes must be in this directory.
Default: BOWTIE_INDEXES
BASES_TO_TRIM How many bases should be trimmed off the right-end of your reads? Set this according ot the quality of your reads.
Default: 0.
All other variables are internal-use and should not be changed. Read the comments in the configuration file for details as to what function they provide.
DEPENDENCIES:
All shell scripts (files ending in .sh) rely on "rsw_config.sh," which holds the configuration data. Other scripts' call upon other execuables as needed, depicted below.
rsw_batch_job.sh
- rsw_pipeline.sh
- bowtie.sh
- guess-encoding.py
- Python 2.7 or newer
- bowtie v1.0.1 or newer
- split.sh
- srr
- sfc
- rsr.sh
- sp4
- compare_sh
- rsw_comparison
refFlat_parse_RSW.pl requires Perl 5.16 or newer
sbc has no external dependencies.
RUNNING:
To run the pipeline, first, set the BOWTIE_INDEXES environment variable to the location of your bowtie indexes directory; according to your shell:
so, if your indexes were in /usr/share/bowtie/indexes, you would type:
bash: export BOWTIE_INDEXES=/usr/share/bowtie/indexes
csh: setenv BOWTIE_INDEXES /usr/share/bowtie/indexes
After that, you can execute rsw_batch_job.sh with the following inputs:
mode Choose "analytic" or "comparison"
analytic jobs only produce RSW output files. This the mode you will use the most.
comparative jobs produce RSW output files for both data sets, AND a file which shows the differences between the two
genome The assembly name of the bowtie index for the genome to which to align reads. Also specifies which refFlat file to use (see INSTALLATION).
readsFile The file(s) with RNA-Seq data in plain-text FASTQ format.
The nature of your run will determine how you should specify your files.
if you have ... you should specify your file(s) as...
Just one file the file name, with full path.
Replicates a double-quoted, comma-separated list
ex: "replicate1.fastq, replicate2.fastq, ..."
** Yes, include the double-quotes.
Paired-end data place all left-pairs in a comma-separated list, as above,
then put a vertical-bar | and a second comma-separated
list with the right-pairs.
ex: "Replicate1_1.fastq,Replicate2_1.fastq|Replicate1_2.fast1,Replicate2_2.fastq"
** Make sure your pairs are ordered correctly:
"REPLICATE 1 LEFT, REPLICATE 2 LEFT | REPLICATE 1 RIGHT, REPLICATE 2 RIGHT"
[readsFile2] If you are using "comparison" mode, a second set of reads-files goes here,
the format is the same as above.
maxGoodAlignments Maximum number of matches allowed in bowtie (see bowtie -k and -m parameters)
minSplitSize Smallest length to split your reads into. If you specify more than half the reads' length,
the pipeline will exchange it with (readlength - minSplitSize).
** The smaller your split, the more memory, disk space, and time will be needed.
minSplitdistance Smallest amount of distance allowed between split-reads to be considered a splice-junction candidate.
maxSplitdistance Largest amount of distance between split-reads to be considered a splice-junction candidate.
regionBuffer By how much candidate junctions are allowed to differ in their start-position in order to "support" one-another.
requiredSupports Report only junctions with this many supporting reads, or more.
pathToSaveFesults Where to put the final output files.
Examples:
Here are presented example command-lines for doing various kinds of runs the assembly names are real but the file names are made-up...
Analytic runs:
Normal Run: ./rsw_batch_job.sh analytic mm9 "mus1.fastq" 11 11 3 30000 4 2 ~/mm9_results
Run with Replicates: ./rsw_batch_job analytic hg19 "hg19-1.fastq,hg19-2.fastq,hg19-3.fastq" 11 33 3 50000 5 2 ~/hg19_results
Paired Run: ./rsw_batch_job.sh analytic hg19sp101 "hg19_1.fastq|hg19_2.fastq" 11 33 3 50000 5 2 ~/hg19_paired_results
Paired, replicated: ./rsw_batch_job analytic hg19sp101 "hg19-1_1.fastq,hg19-2_1.fastq|hg19-1_2.fastq,hg19-2_2.fastq" 11 33 3 50000 5 2 ~/hg19_pair_repl_results
Comparative runs:
Normal Run: ./rsw_batch_job.sh comparative mm9 "set1.fastq" "set2.fastq" 11 11 3 30000 4 2 ~/mm9_compare
Run with Replicates: ./rsw_batch_job analytic hg19 "set1replicate1.fastq,set1replicate2.fastq" "set2replicate1.fastq,set2replicate2.fastq" 11 33 3 50000 5 2 ~/hg19_results
Paired Run: ./rsw_batch_job.sh analytic hg19sp101 "set1_1.fastq|set1_2.fastq" "set2_1.fastq|set2_2.fastq" 11 33 3 50000 5 2 ~/hg19_paired_results
...
KNOWN ISSUES
The Quality-encoding detection portion of bowtie.sh is known to cause a broken pipe with awk. This is acceptible and does not interfere with the performance of the pipeline.
COPYRIGHT
Read-split-run is copyright(c) 2014-2015 Yongsheng Bai, Brandon Donham, Randal J. Kaufman, Jeff Kinne.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
a copy is also provided in the LICENSE file, accompanying this.
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
About
Bioinformatics Pipeline for detecting novel spliced regions in RNA-Seq Data
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published