% electus(1) Electus User Manual
% Bryan Beresford-Smith, Andrew Bromage, Thomas Conway, Jeremy Wazny
% June 22, 2012
electus - a tool for filtering reads.
Version 1.0.0
electus index -T 8 --prefix idx --ref-fasta ref1.fa
electus classify -T 8 --ref-index idx --ref-fasta ref2.fa --ref-threshold 1 --match-prefix match --non-match-prefix non --pairs -i in_1.fastq -i in_2.fastq
electus help
Second Generation Sequencing enables the collection of large quantities of sequence data. While such data sets are incredibly rich, researchers are often interested in answering specific questions about the data. Typically, a researcher with hundreds of gigabytes of RNA-Seq data may be interested in SNPs and relative expression for only a small number of specific genes. In such a scenario, it is unnecessarily time-consuming to process the entire data set against a whole reference set of genes, only to discard most of the analysis products.
Electus is a tool for quickly and sensitively extracting a relevant selection of reads.
Electus works by constructing the set of k-mers belonging to a set of references and then classifying reads according to whether they contain k-mers in common with sufficiently many references. Often, only a single reference is required. Reads which share enough k-mers with references are considered 'matching', while all others are 'non-matching'. The individual reads are partitioned into separate files according to this classification. The ordering of reads in pairs is maintained.
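The classification idea described above can be illustrated with a minimal Python sketch. It is not the electus implementation: the function names (`kmers`, `classify`, `partition`) are hypothetical, a single k is used for every reference, reverse complements are ignored, and plain Python sets stand in for whatever index structure electus uses internally.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, ref_kmer_sets, k, min_refs=None):
    """Return True ('matching') if the read shares at least one k-mer
    with at least min_refs of the references."""
    if min_refs is None:
        # Assumed default: the read must hit every reference.
        min_refs = len(ref_kmer_sets)
    read_kmers = kmers(read, k)
    # Count the references that have any k-mer in common with the read.
    hits = sum(1 for ref in ref_kmer_sets if not read_kmers.isdisjoint(ref))
    return hits >= min_refs


def partition(reads, ref_kmer_sets, k, min_refs=None):
    """Split reads into (matching, non_matching), preserving input order."""
    matching, non_matching = [], []
    for read in reads:
        if classify(read, ref_kmer_sets, k, min_refs):
            matching.append(read)
        else:
            non_matching.append(read)
    return matching, non_matching
```

With two references, a read that shares k-mers with only one of them lands in the non-matching list under the default and in the matching list when `min_refs=1`.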
The following options can be used with all of the electus commands and are therefore not listed separately for each command.
-h, --help : Show a help message.
-l FILE, --log-file FILE : Place to write progress messages. Messages are only written if the -v flag is used. If omitted, messages are written to stderr.
-T INT, --num-threads INT : The maximum number of worker threads to use. The actual number of threads used depends on the implementation of each algorithm. electus may use a small number of additional threads for operations that are not CPU-bound, such as file I/O.
--tmp-dir DIRECTORY : A directory to use for temporary files. This flag may be repeated in order to nominate multiple temporary directories.
-v, --verbose : Show progress messages.
-V, --version : Show the software version.
electus index [-k INT] [-M INT] -P PREFIX --ref-fasta FASTA-filename
Build an electus reference index from a single FASTA file. The file may be gzip compressed, in which case the filename suffix must be .gz.
The k-mer size may be specified using the -k flag. If omitted, electus defaults to k=25.
During index construction, electus maintains a hash table of the k-mers seen so far. When this table fills, its contents are written to disk, and the table is reinitialised. The more memory electus can use, the less often it will need to write to disk, and the faster index construction will run. By default, electus will limit itself to 2 GB during index construction.
The -M, --max-memory flag can be used to explicitly control the amount of memory available to electus (in GB). To improve performance, this should generally be set close to the amount of memory available in the system, after allowing for the operating system and other overhead.
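The trade-off between memory and disk writes described above can be sketched in Python. This only illustrates the general "fill, spill, merge" pattern. The function name `build_kmer_index`, the entry-count limit `max_table_size` (standing in for the memory limit in GB), and the on-disk format of sorted text runs are all assumptions, not the electus file format.

```python
import heapq
import os
import tempfile


def build_kmer_index(kmer_stream, max_table_size, tmp_dir=None):
    """Collect distinct k-mers using a bounded in-memory table.

    When the table reaches max_table_size entries, its contents are
    written to a sorted run file and the table is cleared. The runs are
    merged at the end. Returns (sorted distinct k-mers, runs written).
    """
    table = set()
    run_paths = []

    def spill():
        # Write the current table as one sorted run, then reinitialise it.
        fd, path = tempfile.mkstemp(suffix=".run", dir=tmp_dir, text=True)
        with os.fdopen(fd, "w") as out:
            for kmer in sorted(table):
                out.write(kmer + "\n")
        run_paths.append(path)
        table.clear()

    for kmer in kmer_stream:
        table.add(kmer)
        if len(table) >= max_table_size:
            spill()
    if table:
        spill()  # flush whatever is left

    files = [open(p) for p in run_paths]
    try:
        merged = []
        # heapq.merge streams the sorted runs; duplicates across runs
        # become adjacent and are dropped here.
        for line in heapq.merge(*files):
            kmer = line.rstrip("\n")
            if not merged or merged[-1] != kmer:
                merged.append(kmer)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
    return merged, len(run_paths)
```

A larger table limit yields the same final k-mer set from fewer runs, which is the effect the -M flag is described as having on index construction time.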
OPTIONS
-k INT, --kmer-size INT : The k-mer size to use for building the reference index: in version 1.0.0 this must be an integer strictly less than 63. If not supplied, the default value of 25 is used.
-M INT, --max-memory INT : The maximum amount of memory (in GB) to use. Making more memory available will reduce the number of times electus writes intermediate index data to disk. The default is 2 GB.
-P PREFIX, --prefix PREFIX : The path prefix for all generated reference index files. The prefix may contain directory separators (e.g. '/') in order to have the index files written to another directory.
--ref-fasta FILE : The name of the FASTA file containing the reference sequences. If the filename ends in .gz it will be read as a gzip file.
electus classify [--ref-index PREFIX]* [--ref-fasta FASTA-filename]* {-I FASTA-filename | -i FASTQ-filename | --line-in filename}+ [--pairs] [--match-prefix STRING] [--non-match-prefix STRING] [--ref-threshold INT] [-k INT] [--single-sequence-refs]
Classifies input reads according to the k-mers they share with the given references. References may be specified as either pre-built indices (see the 'index' command), or as sequences within FASTA files. By default, all of the sequences within a single FASTA file are considered part of the same reference, but if --single-sequence-refs is passed, each individual sequence is treated as a separate reference. Electus 1.0.0 supports up to 64 references.
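The two ways of grouping FASTA sequences into references can be illustrated with a short Python sketch. The helper names (`read_fasta`, `group_references`) and the `file:sequence` labels are hypothetical. Only the grouping rule and the limit of 64 references are taken from the description above.

```python
MAX_REFERENCES = 64  # limit stated for Electus 1.0.0


def read_fasta(lines):
    """Yield (name, sequence) for each record in an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_references(fasta_files, single_sequence_refs=False):
    """Group sequences into references.

    fasta_files maps a file name to an iterable of its lines. By default
    each file is one reference. With single_sequence_refs=True, each
    sequence becomes its own reference.
    Returns a list of (label, [sequences]).
    """
    refs = []
    for fname, lines in fasta_files.items():
        records = list(read_fasta(lines))
        if single_sequence_refs:
            refs.extend((f"{fname}:{name}", [seq]) for name, seq in records)
        else:
            refs.append((fname, [seq for _, seq in records]))
    if len(refs) > MAX_REFERENCES:
        raise ValueError(f"{len(refs)} references exceeds {MAX_REFERENCES}")
    return refs
```

Because the default threshold is "all references", treating each sequence as a separate reference also raises the number of references a read must hit, so --single-sequence-refs is likely to be combined with an explicit --ref-threshold.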
All input reads (or pairs) are classified as either 'matching' or 'non-matching' and may be written to corresponding files. The names of these files are specified by supplying prefixes via --match-prefix and --non-match-prefix. If a prefix is omitted, the corresponding reads are not written to any file; for example, if --non-match-prefix is not given, non-matching reads are simply discarded. By default, a read (or pair) is considered 'matching' if it shares at least one k-mer with each of the references. This requirement may be relaxed with the --ref-threshold flag, which sets the minimum number of references with which the read must share k-mers in order to be considered 'matching'. For example, --ref-threshold 1 means that a read needs only a single k-mer in common with any one reference to be 'matching'.
OPTIONS
--ref-index PREFIX : The path prefix for a single reference's index files. The prefix may contain directory separators (e.g. '/') in order to read index files from another directory. Multiple reference indices may be specified, and there is no requirement that they were built using the same k-mer size.
--ref-fasta FASTA-filename : The path to a reference FASTA file.
-i FILE, --fastq-in FILE : Input file in FASTQ format.
-I FILE, --fasta-in FILE : Input file in FASTA format.
--line-in FILE : Input file with one read per line and no other annotation.
--pairs : Treat reads from consecutive input files of the same type as pairs.
--match-prefix STRING : The filename prefix for matching reads. The actual filename(s) will consist of a prefix, then '_1' or '_2' if the file contains reads from a pair, and finally a suffix corresponding to the read format, i.e. '.fastq', '.fasta', or '.txt'. Note that the prefix may contain path separators, e.g. 'out/match', in order to deposit the files in another directory. If this flag is omitted, matching reads are not written to any file.
--non-match-prefix STRING : The filename prefix for non-matching reads. The actual filename(s) are constructed in the same way as for the matching reads. If this flag is omitted, non-matching reads are not written to any file.
-k INT, --kmer-size INT : The k-mer size to use for building k-mer sets from reference FASTA files: in version 1.0.0 this must be an integer strictly less than 63. If not supplied, the default value of 25 is used. Note that this value of k applies only to the k-mers extracted from FASTA files passed in via --ref-fasta. It need not be the same value as the k used for any prebuilt reference index, as specified by --ref-index.
--ref-threshold INT : Specifies the minimum number of references that a read must share k-mers with in order to be considered 'matching'. If this flag is not specified, the default is all of the references.
--single-sequence-refs : Treat each sequence in each reference FASTA file as a separate reference. By default all of the sequences in each file are considered to belong to the same reference.
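The output file naming rule given for --match-prefix and --non-match-prefix can be written out as a small Python sketch. The function name `output_names` and the format keys are hypothetical, and the mapping of --line-in input to the '.txt' suffix is an assumption inferred from the list of suffixes.

```python
# Suffix per input format. 'line' -> '.txt' is an assumption.
SUFFIXES = {"fastq": ".fastq", "fasta": ".fasta", "line": ".txt"}


def output_names(prefix, fmt, paired):
    """Return the output file name(s) for one class of reads.

    prefix may include directories (e.g. 'out/match'). Paired input
    yields one file per mate, tagged '_1' and '_2'.
    """
    suffix = SUFFIXES[fmt]
    if paired:
        return [f"{prefix}_1{suffix}", f"{prefix}_2{suffix}"]
    return [prefix + suffix]
```

For instance, `output_names("match", "fastq", True)` gives `match_1.fastq` and `match_2.fastq`, while `output_names("out/match", "line", False)` gives `out/match.txt`.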
electus help
Prints a summary of all of the electus commands.
Eliminating bacterial contamination by extracting all possibly-human reads.
Accelerating gene expression level analysis by extracting gene-specific reads. (Just the exons? Lots of genes?)
Speeding up the search for gene fusions by extracting reads that contain k-mers belonging to more than one reference.
--
Bzip support will be introduced.