Skip to content

Latest commit

 

History

History
2774 lines (1711 loc) · 151 KB

SEAStAR_User_Guide.md

File metadata and controls

2774 lines (1711 loc) · 151 KB

SEAStAR User Guide, version 0.5.0

Vaughn Iverson and Chris Berthiaume

Armbrust Lab, School of Oceanography, University of Washington

Table of Contents

The SEAStAR tools are a collection of programs which implement many data and processing intensive steps in next-generation sequencing analysis pipelines. The tools quantitatively analyze alignments to reference sequence databases (including de novo assembled contigs), to produce a wide variety of statistics and relationships, including producing binned, assembled genomic scaffolds. At one time, SEAStAR was an acronym for Select and Estimate Abundance from Short Aligned Reads. However, as described above, the tools and their capabilities have expanded well beyond that description, and so now SEAStAR is just our name for the in-house components of our analysis pipeline.

SEAStAR Pipeline

Our philosophy has been to create stable, modular, high-performance tools that fill important gaps we encountered between other established pre-existing tools in our pipeline. As such, you will see references to other packages throughout this document. We recognize that considerable innovation is ongoing in analysis software for next generation sequence, and so we have maintained a modular, pipelined approach to permit experimentation with, and substitution of, new tools as they become available. Due to the wide variety of different computational environments that undoubtedly exist in different labs and institutes, our focus will be on releasing stable, documented, portable and high performance components from our analysis pipeline, but not the pipeline scripts themselves. This is because these scripts will need to be highly customized for both the goals and computational environment of any specific project.

Initially our goal was to release tools to enable others to work directly with datasets generated by the Life Technologies SOLiD™ sequencing platform. We recognized that most of the components we have developed are also valuable for use with Illumina® sequence data, or with a mixture of sequence data types, and so in this release we have generalized most components to support both colorspace and nucleotide reads.

Technical guidance

The next-generation sequence datasets that the SEAStAR tools are designed to work with are typically very large (>> one billion reads). All compiled components of the SEAStAR tools are multithreaded and optimized for modern multi-core processors. Many SEAStAR tools will take advantage of 8 or more cores, particularly when compressed (gzip format) input and output files are being processed. When running on such a system, you will find that these components may run considerably more than 10x faster than equivalent tools written in a scripting language such as Python. For this reason, the operation of such tools will often be "I/O bound"; that is: the performance bottleneck is the speed of the disk(s) and/or network connections connected to the workstation, and not the availability of CPU cycles. The use of compressed files and local, separate independent disks for input and output files can often relieve some of this bottleneck and further improve run times.

Several of the SEAStAR tools (as well as many other tools used by our pipeline) have very high memory requirements when working with typical next-generation sequence datasets. We note these cases in the discussion of each individual SEAStAR tool, and attempt to provide guidance about the impact different parameters will have on memory use. However, in general, doing this kind of work will require a workstation with at least 32 gigabytes of RAM, and depending on the dataset and the types of analysis your project requires, you may require some multiple of that amount (64, 128, 256, or even 512 gigabytes) to successfully implement an analysis pipeline that meets your needs. When running these tools, it is highly advisable to set a session limit on the memory a process can attempt to allocate, to prevent machines from crashing and impacting other users who may be logged in. For example, adding the Unix shell command:

ulimit -v 64000000

to any script that invokes high memory tools will effectively prevent a 64GB machine from being unintentionally brought down (i.e. thrashing the swap file) by an application that attempts to allocate more memory than is physically available.

License and citation

The SEAStAR tools are open source and are currently publicly distributed under the GPL version 3 license, a copy of which is provided in a file named COPYING in the project root directory. By using this software, you are agreeing to be bound by the terms of this license. Please contact the authors if you are interested in discussing an alternative licensing arrangement.

We are in the process of preparing several publications specific to the SEAStAR analysis pipeline and the custom tools it contains. In the interim, we ask that you cite the following paper (which was the first to use and describe these tools):

Iverson, V., Morris, R. M., Frazar, C. D., Berthiaume, C. T., Morales, R. L .and Armbrust, E. V., Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota, Science 335 pp.587-590 (2012)

This section is a quick introduction to using SEAStAR tools for some common use cases. The SEAStAR tools are "command-line" tools that run on UNIX-like operating systems. If you are not familiar with how to use the UNIX command-line shell, you should work your way through a tutorial before proceeding. Here are a few possibilities:

This tutorial assumes that you have already installed the SEAStAR software. If you are working from the SEAStAR Quick Start virtual machine appliance file in VirtualBox, your environment is already set-up and ready to go, so you can skip to the next section: Read Preparation.

For SEAStAR installation instructions see the README file in the SEAStAR root directory. Although the SEAStAR tools can be quite memory and resource intensive, with the data used in this tutorial you should be able to comforably complete all of the steps below on a modern laptop with a dual-core processor and 2GB of RAM.

The data used in these examples are actual raw SOLiD sequence reads corresponding to the Lambda phage genome, and can be found in the test_data subdirectory of the SEAStAR source directory.

Create a new working directory and copy the following files from the test_data directory into that directory.

# Substitute the actual path to your SEAStAR "source directory" for
# [SEASTAR_src] in the copy commands below:

mkdir Quick_start
cd Quick_start
cp [SEASTAR_src]/test_data/lambda_reads_F3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_F3_QV.qual.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3_QV.qual.gz .

And make sure that your PATH environment variable points to the executable files in your SEAStAR binaries directory:

# Where [SEASTAR_dest] below is the fully qualified path to your
# SEAStAR build destination directory.

export PATH=$PATH:[SEASTAR_dest]/bin

If the PATH above is working, you should be able to run ref_select --version and get back something like: ref_select version: v0.4.x. If this doesn't work, you will need to figure out where SEAStAR is installed before proceeding so that the instructions that follow in this tutorial can find the SEAStAR tools on your computer.

NOTE: To work your way through all of the examples in this tutorial, you will need the following external tools (in addition to those that were required to build SEAStAR):

  • Velvet -- You need the colorspace version. Build it with: make color, see the Velvet README file for instructions.
  • BWA -- Version 0.5.10. Important!! Because newer versions do not support the SOLiD colorspace reads used by this tutorial
  • GraphViz

Please ensure that the directories where these tools reside are also in your PATH environment variable before proceeding. You can test this by typing velveth_de, bwa, and dot -V at your command line. Each of these commands should write its version string (and perhaps other help messages) in response to being run.

Convert SOLiD .csfasta and .qual files to gzipped .fastq files. For real projects, users with no colorspace data will omit this step in their analysis pipeline, although the FASTQ file naming conventions still need to be followed for nucleotide (e.g. Illumina) reads; see the FASTQ appendix for details.

solid2fastq -z lambda_reads lambda_conv

Remove presumed PCR duplicate reads, identified by mate-pairs seen more than once.

fastq_nodup -z -l 13 -d 2 -e 3 lambda_conv lambda_dedup

Trim and filter the de-duplicated FASTQ reads based on read quality, information content, and length.

trimfastq -z --mates_file -p 0.9 -l 34 -m 34 --add_len -e 3.0 lambda_dedup lambda_trim

Randomly sample reads from .fastq files, retaining approximately 10% of the original reads. Randomly subsampled read sets are useful for quickly tuning the parameters of trimfastq for a given set of reads against some reference. We won't do that here, but tuning the trimfastq -p setting is important for getting clean assemblies and maximizing the useful information in your sequence data.

samplefastq -z -f 0.1 lambda_trim lambda_samp

De novo contig assembly

If you have installed the colorspace aware build of the de novo assembly tool Velvet, it can be used to assemble lambda-phage colorspace contigs:

velveth_de lambda_asm/ 19 -fastq.gz -shortPaired lambda_trim.mates.fastq.gz -short lambda_trim.single.fastq.gz > lambda_asm.velveth_de.log 2>&1
velvetg_de lambda_asm/ -scaffolding no -read_trkg no -ins_length auto -ins_length_sd auto -exp_cov 50 -cov_cutoff 5 -min_contig_lgth 50 > lambda_asm.velvetg_de.log 2>&1

There will now be a subdirectory called lambda_asm with a file called contigs.fa in it. These are the colorspace contigs we will use in the next section.

Aligning reads to a reference

For this example, you need the short read aligner BWA (version <= 0.5.10). It can be used to align the de-duplicated and trimmed lambda FASTQ reads against the lambda-phage colorspace contigs. IMPORTANT: This quickstart example will not work with versions of BWA version 0.6.0 or newer because colorspace support was removed from BWA at that point. For your own work, if you are using nucleotide reads, you are free to use any alignment software you wish that produces standard SAM files.

# Because these are colorspace contigs and BWA expects nucleotide reference sequences,
# we need to convert the colorspace contigs to nucleotides using a single starting
# nucleotide (this will be wrong 3/4 of the time, but it doesn't matter because
# BWA immediately converts the reference back to colorspace internally...)

csfasta2ntfasta.awk lambda_asm/contigs.fa > lambda_contigs.fna

bwa index -a is -c lambda_contigs.fna
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read1.fastq.gz > lambda_trim.read1.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read1.sai lambda_trim.read1.fastq.gz 2>lambda_trim.read1.samse.log > lambda_trim.read1.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read2.fastq.gz > lambda_trim.read2.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read2.sai lambda_trim.read2.fastq.gz 2> lambda_trim.read2.samse.log > lambda_trim.read2.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.single.fastq.gz > lambda_trim.single.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.single.sai lambda_trim.single.fastq.gz 2>lambda_trim.single.samse.log > lambda_trim.single.sam

The resulting alignment files are in SAM format, which is what the next operation requires.

Constructing an assembly graph

Now we will use the ref_select tool to convert the colorspace contig sequences to nucleotides and output the matepair assembly graph JSON file.

ref_select -q -m --mp_mate_cnt=10 -r lambda_trim.read1.fastq.gz -r lambda_trim.read2.fastq.gz -r lambda_trim.single.fastq.gz lambda_trim.read1.sam lambda_trim.read2.sam lambda_trim.single.sam > lambda_asm.json

Producing scaffolded sequence

# The following chain of graph_ops commands accomplish the following steps:
# MST - Convert the assembly graph into a directed spanning tree
# PLUCK - Remove short (stub) branches from the tree, leaving a mostly linear structure
# PRUNE - Cleave any remaining branches from the tree, leaving linear backbones
# PUSH - Put back contigs removed from the ends of the scaffolds by PLUCK
# INSERT - Put back small contigs that fit between the backbone contigs
# SCAFF - Re-linearize the scaffolds based on constraints added by INSERT
# FASTA - Join all neighboring contigs together (looking for overlaps) and write FASTA to STDOUT

graph_ops lambda_asm.json MST PLUCK PRUNE PUSH INSERT SCAFF FASTA '{"scaff":true}' > lambda_scaffs.fna

Visualizing the sequence graph, and using SCRIPT files with graph_ops.

If you rerun graph_ops using the DOT command at each stage of the pipeline, you can then visualize the assembly graph as it progresses through each stage of processing using Graphviz.

First, create a text file called lambda_viz.go with the following contents:

# Sample script file for SEAStAR graph_ops tool
DOT {"file":"lambda_asm#.dot"}
MST
DOT {"file":"lambda_asm#.dot"}
PLUCK
DOT {"file":"lambda_asm#.dot"}
PRUNE
DOT {"file":"lambda_asm#.dot"}
PUSH
DOT {"file":"lambda_asm#.dot"}
INSERT
DOT {"file":"lambda_asm#.dot"}
SCAFF
DOT {"file":"lambda_asm#.dot"}
FASTA {"scaff":true,"file":"lambda_scaffs.fna"}

Now run the following command:

graph_ops lambda_asm.json lambda_viz.go

This will produce the same scaffolded assembly as the previous example, but in addition, it will write a DOT (GraphViz) format file for each step in the scaffolding pipeline (with an incrementing number at the end of the filename: lambda_asm0.dot, lambda_asm1.dot, etc.) This next command will convert all of these DOT files to corresponding PDF figures using the neato layout engine in Graphviz.

# After this command, load the output PDF files in a viewer to see the scaffold layout
# process, step by step. Circles are contigs with area proportional to sequence length
# and color by %GC. Black arrows are mate-pair connections with thickness indicating
# bitscore. Red arrows are added dependencies to produce a fully ordered layout for SCAFF.

for i in lambda_asm?.dot; do neato -Tpdf $i > ${i%.*}.pdf; done

You should now be able to load the PDF files with your favorite PDF viewer in order from lambda_asm0.pdf through lambda_asm6.pdf to see the effect of each graph_ops command in the script above. Each graph represents the state of the assembly before/after each successive step, showing the operation of graph_ops as it operated on this example dataset.

Estimating 16S sequence abundance

The tools in SEAStAR can be used to create a pipeline for estimating sequence abundance in a short-read data set. This is particularly useful for assessing community composition in a metagenomic sample through alignments to a 16S database. See vignettes/RDP/RDP_vignette.html for a walk-through of an implementation of this pipeline using the RDP 16S database.


Tools described in this user guide:

  • solid2fastq -- Conversion of SOLiD specific CSFASTA/QUAL or FASTQ files to SEAStAR style FASTQ format files
  • fastq_nodup -- Reference-free PCR deduplication of paired reads
  • samplefastq -- Subsamples reads randomly
  • trimfastq -- Tunable, pairing aware read-trimmer
  • ref_select -- Converts SAM reference alignments into JSON "sequence graph" files for assembly and abundance analyses
  • graph_ops -- A toolkit for operating on the ref_select "sequence graph" files to produce assemblies and data tables
  • seq_scaffold -- Converts pre-ordered and oriented contigs into scaffolded sequence, filling gaps where possible
  • tetracalc -- Clusters sequence scaffolds into "species-level" taxonomic bins for final assembly
  • misc_scripts -- Miscellaneous scripts

This tool converts SOLiD colorspace .csfasta and .qual input files (raw reads transferred off the instrument) to output files that are an interoperable variant of the FASTQ file format (see SEAStAR format details in section FASTQ). It can also convert nucleotide FASTQ files produced by runs using Exact Call Chemistry to SEAStAR style FASTQ files. This is conversion necessary to ensure that read pairs are synced in .read1.fastq and .read2.fastq.

Usage: solid2fastq [options] <in_prefix> <out_prefix>

Where <in_prefix> denotes the input prefix which is the part of the input filename shared in common between all .csfasta and .qual or .fastq files generated by the same sequence library. It is typically shown in the Title: comment line near the top of .csfasta format input files. For example:

# Title: GG050409_20090604_matepair_50

is found near the top of input files:

GG050409_20090604_matepair_50_R3.csfasta, GG050409_20090604_matepair_50_R3_QV.qual, GG050409_20090604_matepair_50_F3.csfasta and GG050409_20090604_matepair_50_F3_QV.qual.

So use of GG050409_20090604_matepair_50 for <in_prefix> will specify these four files for input to the solid2fastq tool.

Note that solid2fastq relies on the SOLiD naming conventions for the suffixes of these files (e.g. _R3.csfasta). You are free to rename these files with different prefixes, but the suffixes must remain the same. You may, however, compress these files using gzip and add the customary .gz suffix at the very end of the filename, which tells solid2fastq to decompress the files as it reads them.

The <out_prefix> parameter is similar to the <in_prefix> as it is used to name the output .fastq file(s) generated by this tool. The output prefix may be any text, except you should generally avoid whitespace and punctuation characters other than "_". This is because, by default, the names of all output reads are also prefixed with this string and the variant of the FASTQ format used by the SEAStAR tools depends on certain punctuation characters to quickly parse these read names (see SEAStAR FASTQ format ). This file naming constraint may be relaxed through the use of optional parameters described below.

Note that <in_prefix> and/or <out_prefix> may contain directory path information, but in the case of <out_prefix> this will necessitate the use of the --prefix or --no_prefix options (described below).

solid2fastq writes a single output .fastq file for each matching pair of .csfasta / .qual input files or for each single .fastq input file. For example, given a set of non-barcoded colorspace input files, the names of these output files are constructed as:

<in_prefix>_R3[.csfasta|_QV.qual]<out_prefix>.read1.fastq

<in_prefix>_F3[.csfasta|_QV.qual]<out_prefix>.read2.fastq

singlets from F3/R3 → <out_prefix>.single.fastq

Note that for paired-end libraries, alternate naming suffix conventions for the R3 input are also recognized (substituting F5-BC, F5-P2, F5-DNA, F5-RNA, for R3).

Additionally, for paired libraries (mate-paired or paired-end) should solid2fastq encounter any singlet reads (those with a missing mate), they are by default given the appropriate suffix and then written to a file named: <out_prefix>.single.fastq

The complete list of file name patterns that will be recognized by solid2fastq are:

<in_prefix>_<BC>_R3.fastq<out_prefix>.read1.fastq <in_prefix>_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|_QV.qual|.QV.qual]<out_prefix>.read1.fastq <in_prefix>_<BC>_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|.QV.qual]<out_prefix>.read1.fastq <in_prefix>_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA]_[<BC>.csfasta|_QV_<BC>.qual]<out_prefix>.read1.fastq <in_prefix>_<BC>_F3.fastq<out_prefix>.read2.fastq <in_prefix>_F3[.csfasta|_QV.qual|.QV.qual]<out_prefix>.read2.fastq <in_prefix>_<BC>_F3[.csfasta|.QV.qual]<out_prefix>.read2.fastq <in_prefix>_F3_[<BC>.csfasta|_QV_<BC>.qual]<out_prefix>.read2.fastq

solid2fastq optional parameters [options]:

[-h|--help] [--version] [--prefix=<string>] [-n|--no_prefix] [-x|--no_suffix] [-z|--gzip] [-b <BC>|--bc=<BC>] [-s|--singles]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.

--version

Print the version number of the executing program and then exit.

--prefix=<string>

Specify the prefix string to add to read identifiers. For example, .csfasta read identifier: >1_68_381_R3 will become: @xx+<string>:1_68_381/1 in the .fastq output file. By default this is <out_prefix> described above.

The use of a terse, but unique, prefix string is advisable because it helps identify the source, and processing steps, performed on a set of reads as they flow through an analysis pipeline. Keep in mind that this text will be added to the output file for every read, so short strings are preferable. The output prefix may be any text you would like, except you may not use whitespace or punctuation characters other than "_".

-n / --no_prefix

Overrides the default read prefixing behavior and instead preserves the read identifiers as they are in the input .csfasta file(s).

For example: >1_68_381_R3 simply becomes @xx+:1_68_381/1

-x / --no_suffix

Suppress /1 or /2 suffix additions to read IDs. This ensures matching read IDs for paired reads in FASTQ files and non-BWA aligner generated SAM files.

-z / --gzip

Write output .fastq files compressed in the gzip format, and with filenames suffixed .gz.

-b <BC> / --bc=<BC>

Used only for barcoded SOLiD libraries, in addition to <in_prefix>, to specify input files for a specific barcode (<BC>).

-s / --singles

Write two separate singlet files that segregate the singlets by input files. This is useful for doing strand-specific analyses using paired-end libraries, where the mated reads are sequenced from opposite strands. For example:

<in_prefix>_R3[.csfasta|_QV.qual]<out_prefix>.read1.fastq

singlets from R3<out_prefix>.single1.fastq

<in_prefix>_F3[.csfasta|_QV.qual]<out_prefix>.read2.fastq

singlets from F3<out_prefix>.single2.fastq


This tool reads paired .fastq files (with associated singlets), and removes the lowest quality pair(s) of presumed PCR duplicate reads. This operation assumes that multiple read-pairs sharing substantially the same sequences for both mates are statistically unlikely to occur frequently at random (due to the distribution of insert sizes), and therefore represent likely artifactual duplications resulting from PCR over-amplification during library construction. If this assumption does not hold for a given library, then use of this tool is inappropriate.

Notably, this tool is most appropriate for metagenomes (or any other sample type) where no reference genome is available for performing duplicate removal through reference alignments. This is because fastq_nodup works entirely through error-tolerant internal comparisons of read-pairs to each other within an entire sequenced library.

To accomplish this, fastq_nodup builds a large table of sequence prefixes, requiring a substantial amount of memory. The amount of memory required can be controlled through tuning the various parameters, but a minimum of 32Gb will probably be necessary for most read datasets.

Note that singlet reads are also randomly removed in the same proportion as matching mated reads. This is done assuming that PCR duplicated pairs are as likely to produce singlets as non-duplicated pairs. For example:

S1, S1, M1 -- M2, M1 -- M2, M1 -- M3

In the above example Sn = singlet reads, Mn = mated reads, and the numeric suffixes indicate matching sequences. So the lowest quality pair of M1 -- M2 mates will be removed from the output, and each of the S1 reads will have a 1/3 probability of being removed (disregarding quality). The 1/3 probability is calculated from the fact that 1/3 of the pairs containing sequence M1 were presumed PCR duplicates and therefore removed.

Usage: fastq_nodup [options] <in_prefix> <out_prefix>

where <in_prefix> and <out_prefix> specify the input and output filename prefixes to use for input files and naming output files. For example:

<in_prefix>.read1.fastq<out_prefix>.read1.fastq <in_prefix>.read2.fastq<out_prefix>.read2.fastq <in_prefix>.single.fastq<out_prefix>.single.fastq

The input files may be gzip format compressed files with the customary .gz suffix at the very end of the filename, which tells fastq_nodup to decompress the files as it reads them. Multiple single files (i.e. .single1.fastq and .single2.fastq suffixes) are also recognized.

fastq_nodup optional parameters [options]:

[-h|--help] [--version] [-v|--verbose] [-z|--gzip] [--prefix=<string>] [--no_prefix] [-l <u>|--index_len=<u>] [-m <u>|--match_len=<u>] [-d <u>|--index_err=<u>] [-e <u>|--match_err=<u>] [--seed=<n>]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.

--version

Print the version number of the executing program and then exit.

-v / --verbose

Print additional statistics and diagnostic information about the run.

-z / --gzip

Write output .fastq files compressed in the gzip format, and with filenames suffixed .gz.

--prefix=<string>

Specify the prefix string to add to read identifiers. For example: @<string>:1_68_381

Note, this will replace any existing prefix. By default this is <out_prefix> described above.

--no_prefix

Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).

-l <u> / --index_len=<u>

Controls the read sequence prefix length of the search index. This should be set long enough so that random prefix collisions are infrequent, but short enough so that a large number of sequence errors are unlikely to accumulate. This parameter also impacts the memory use of this tool, with each increment increasing the index size by a factor of 4. This parameter defaults to <u>=14 nucleotides and may not exceed 32 nucleotides.

Both mates of each read-pair are indexed, and this indexed look-up is used to find a list of mate sequences corresponding to each sequence prefix. Given that the mates also must match, it is acceptable for a low-level of prefix collisions to occur in this index. That is, assuming the level of prefix collision is low (only a small number of non-duplicated reads share any given prefix), it will be highly improbable that reads matching a given prefix will share mates with matching sequence at random.

-m <u> / --match_len=<u>

Controls the required match length of the unindexed mate in detecting a duplicate pair. Like the previous parameter (--index_len), this setting has an impact on memory use, although memory requirements only increase linearly with match length. This parameter should be set as high as memory permits to improve specificity.

This parameter defaults to the full read length, although lower settings are acceptable assuming that the --index_len is appropriately set, and the sum of --index_len and --match_len exceeds about 35 nucleotides (that is, random 35 nucleotide joint index-match collisions are extremely unlikely assuming each match is sufficiently long).

-d <u> / --index_err=<u>

Controls the number of mismatches tolerated in the indexed sequence prefix while still permitting a sequence match. This value defaults to <u>= 1 and may be set in the range 0-2. This setting has a major impact on run time, with each increment increasing it by a factor of about 4. Duplicated reads with more than --index_err mismatches in the --index_len prefix may still be found by the mate's indexed prefix, assuming that it does not also have more than --index_err errors.

-e <u> / --match_err=<u>

Controls the number of mismatches tolerated in a matched mate sequence (in the --match_len prefix). This value defaults to 3 and must be > 0. This setting has little impact on run time. Setting --match_err to higher values will reduce the specificity of the mate-matching and should therefore be accompanied by correspondingly larger values of --match_len.

-seed=<n>

Integer seed used by the internal pseudo-random number generator. This only affects the random selection of matching singlets (as described above). Defaults to <n>= 12345.


This tool randomly samples reads from a given set of fastq files, producing a corresponding set of output fastq files. This is useful for more quickly tuning parameters for the trimfastq tool (described below) and for producing simulated read datasets for various other purposes.

Usage: samplefastq -f|--frac_sample=<float>|<int> [options] <in_prefix> <out_prefix>

where the mandatory -f parameter indicates how the input files should be sampled and <in_prefix> and <out_prefix> specify the input and output filename prefixes to use for input files and naming output files. For example:

<in_prefix>.read1.fastq<out_prefix>.read1.fastq <in_prefix>.read2.fastq<out_prefix>.read2.fastq <in_prefix>.single.fastq<out_prefix>.single.fastq

The input files may be gzip format compressed files with the customary .gz suffix at the very end of the filename, which tells samplefastq to decompress the files as it reads them. Multiple single files (i.e. .single1.fastq and .single2.fastq suffixes) are also recognized.

Note that for paired sequences (as shown above) read pairs are always sampled together; that is, each pair of reads counts at a single read unit for sampling purposes, as does each singlet read.

samplefastq mandatory parameter:

[-f|--frac_sample=<float>|<int>]

-f <float> / -f <int> / --frac_sample=<float> / --frac_sample=<int>

The -f parameter tells samplefastq, either directly or indirectly, how many output reads to randomly sample from the input .fastq files. Values < 1.0 are interpreted as a fraction of the input reads to retain, whereas values > 1.0 are rounded to an integer count of reads to retain.

Note that when a read count is provided, the input files must be read twice by samplefastq; once to count the number of input reads, and then again to randomly sample those reads. Once the number of input reads is known, the requested read count is converted into a fraction and operation proceeds as though that fraction had been initially provided to the -f parameter. Therefore, if a given set of inputs is to be sampled repeated, it will be considerably more efficient to obtain the count of reads once and calculate the fraction to retain for each subsequent sampling. The requested (or calculated) fraction of reads to sample is used internally as the probability of retaining each read (or pair) and each is considered an independent trial. Because of this, the precise fraction (or count) of reads requested is unlikely to be generated, although the sampled output will typically be very close to that requested.

samplefastq optional parameters [options]:

[-h|--help] [--version] [--seed=<int>] [-z|--gzip]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.

--version

Print the version number of the executing program and then exit.

-z / --gzip

Write output fastq files compressed in the gzip format, and with filenames suffixed .gz.

--seed=<n>

Integer seed used by the internal pseudo-random number generator. To obtain different samples from the same input file(s), set this parameter to different values. Repeated runs with all parameters the same will produce identical outputs. Defaults to <n>= 12345.


This tool is used to trim or reject individual reads on the basis of quality scores. Reads may also be rejected based on low information content (entropy). When run on paired reads, this operation is performed on mates independently, and when one mate of a pair is rejected, the other becomes a singlet.

Trimming based on quality scores is performed with the goal of producing trimmed reads that have a calculated probability of error below some threshold. That is, bases are trimmed off the 3' end of the read until the remaining bases have a joint probability of error below a threshold. For example, if a threshold of 0.5 is selected, that means the output reads will be expected to have, on average, 0.5 erroneous nucleotides per read.

For colorspace (SOLiD differentially encoded) reads, the probability of error is calculated as the joint probability of error in each of two consecutive color positions. That is, because single color errors will typically be correctable, the probability of a nucleotide error at a given position is the probability that both corresponding color positions are erroneous (i.e. the pairwise product of color error probabilities yields the probability of uncorrectable nucleotide errors).

Minimum read lengths may also specified such that reads trimmed shorter than the specified minimum to meet an error probability threshold will instead be rejected.

Reads may also be independently rejected for failing to have a minimum mean information content. trimfastq optionally calculates the mean entropy of each read (as average bits per dinucleotide) and reject reads that fail to meet a given threshold. A common error for next-generation sequencers is to produce junk reads where all or part of the read contains highly repetitive (low information) sequence. Random DNA sequence contains 4 bits per dinucleotide of entropy, and we have observed that (non-repeat) natural sequences typically contain more than 3 bits per dinucleotide.

Usage: trimfastq [options] <in_prefix> <out_prefix>

where <in_prefix> and <out_prefix> specify the input and output filename prefixes to use for input files and naming output files. For example:

<in_prefix>.read1.fastq<out_prefix>.read1.fastq

<in_prefix>.read2.fastq<out_prefix>.read2.fastq

<in_prefix>.single.fastq<out_prefix>.single.fastq

The input files may be gzip format compressed files with the customary .gz suffix at the very end of the filename, which tells trimfastq to decompress the files as it reads them. Multiple single files (i.e. .single1.fastq and .single2.fastq suffixes) are also recognized. It is also possible to write FASTA format files (i.e. lacking quality scores) and interleaved paired-read files (for the Velvet assembler). See the options below for more information.

trimfastq optional parameters [options]:

[-h|--help] [--version] [-c|--color_space] [-z|--gzip] [-v|--invert_singles] [-s|--singles] [-p <d>|--correct_prob=<d>] [-l <u>|--min_read_len=<u>] [-m <u>|--min_mate_len=<u>] [-f <u>|--fixed_len=<u>] [--prefix=<string>] [--add_len] [--no_prefix] [-e <d>|--entropy_filter=<d>] [--entropy_strict] [--mates_file] [--no_rev] [--force_rev] [--only_mates] [--fasta]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.

--version

Print the version number of the executing program and then exit.

-z / --gzip

Write output fastq files compressed in the gzip format, and with filenames suffixed `.gz+.

--prefix=<string>

Specify the prefix string to add to read identifiers. For example: @<string>:1_68_381

Note, this will replace any existing prefix. By default this is <out_prefix> described above.

-n / --no_prefix

Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).

--add_len

Add the final trimmed length value to the read prefix. For example:

@lambda:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTCCCTCCGCCCCAC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&%$''%&'46-+#`

Trims to: ("37" in the first line indicates the final trimmed length)

@37|lambda_trim:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&

-v / --invert_singles

Causes singlet file(s) output to be the opposite configuration of the input. That is:

<in_prefix>.single.fastq<out_prefix>.single[12].fastq

or

<in_prefix>.single[12].fastq<out_prefix>.single.fastq

Note that this option is only valid when there are input singlet reads.

-s / --singles

Write two singlet files (.single1.fastq and .single2.fastq), one for new singlets generated from each paired input file. Note that this option is only valid when there are no input singlet reads. The default behavior in this case is to write a single combined singlet file (.single.fastq).

-p <d> / --correct_prob=<d>

Minimum mean probability that trimmed output reads are free of uncorrectable errors (or all errors with -c). Default value is <d>= 0.5 Setting -p to 0.0 completely disables quality trimming, which is sometimes appropriate for file format conversion and interleaving operations (e.g. see --fasta and --mates_file).

-l <u> / --min_read_len=<u>

Minimum length of a singlet or longest-mate of a pair, in nucleotides. Default value is <u>= 24 bases.

-m <u> / --min_mate_len=<u>

Minimum length of the shortest mate of a pair, in nucleotides. Default value is the value provided for --min_read_len. This value must be < the value used for --min_read_len. Use of this parameter allows paired reads shorter than --min_read_len to be retained as long as their mate satisfies --min_read_len.

-f <u> / --fixed_len=<u>

Trim all reads to a fixed length, still rejecting reads that don't meet specified quality thresholds at this length. Default is no fixed length.

-e <d> / --entropy_filter=<d>

Reject reads with mean per position measured information (entropy) below the given value (in bits per dinucleotide). The range of valid values is 0.0-4.0 inclusive. By default this filter is off.

--entropy_strict

Reject reads for low entropy overall, not just the retained part after trimming. By default this setting is off. Use of this setting requires use of --entropy_filter.

--mates_file

In addition to other outputs, produce a Velvet compatible interleaved paired output file (named <out_prefix>_mates.fastq) with read2 mates reversed by default (to support SOLiD mate-paired reads.)

--no_rev

By default, for SOLiD colorspace read-pairs, the second read of each pair is reversed in the mates file produced by --mates_file (to put the mates on opposite strands, as Velvet expects). --no_rev disables this reversing, which is useful for SOLiD colorspace paired-end files. Nucleotide inputs are by default not reversed in the output generated by --mates_file, so this option has no effect in that case.

--force_rev

By default, for nucleotide (non-colorspace) read-pairs, the second read of each pair isn't reversed in the mates file produced by --mates_file (to keep the mates on opposite strands, as Velvet expects, assuming Illumina sequence). --force_rev enables this reversing, which is useful for SOLiD 5500+ nucleotide mate-paired files (nucleotide sequences are correctly reverse complemented). Colorspace inputs are reversed by default in the output generated by --mates_file, so this option has no effect in that case.

--only_mates

Surpress writing <out_prefix>.read1.fastq and <out_prefix>.read2.fastq output files. --only_mates may only be used with the --mates_file option. This option is useful when reads are being trimmed exclusively for assembly, to save unnecessary processing for files that will not be used. See also the --fasta option below to save even more space.

--fasta

Write FASTA format files instead of FASTQ files for all outputs. FASTA files are about half the size of FASTQ, so for some purposes (e.g. assembly, alignment) that may not take quality scores into account, FASTA may be a preferable output format to save disk space. When --fasta is used, no .fastq files will be written, and the output files will instead end with .fasta[.gz].


ref_select is a utility for calculating a wide variety of useful information from alignments of next-generation sequence reads to large reference sequence datasets. For example, it can be used to build a meta-genomic assembly graph from paired-reads aligned back to de novo assembled contigs. It can also be used to precisely select (and accurately estimate the relative abundances of) sequences from an alignment with a reference library (e.g. metagenomic reads to 16S rDNA database, or mRNA transcript reads to a database of genes from one or more genomes.)

ref_select processes SAM format alignment files into a JSON format "sequence graph" file ready to be processed by the graph_ops tool. Optionally, ref_select will also read FASTQ files and filter out reads (and mates) which are or aren't present in the provided SAM alignment. The JSON output file is structured as a graph of "nodes" (selected sequences) and "edges" (connections between nodes based on read-pairing) A variety of useful information for the nodes and edges is included in the output JSON file, which can be accessed using the graph_ops tool or custom code written in any language with a JSON library.

Usage: ref_select [options] <sam_file1> [<sam_file2>]... > <outfile.json>

ref_select has a lot of options, and depending on what type of analysis you are doing, you may use quite a few of them. However, at its simplest, ref_select reads one or more SAM alignment files and writes a JSON "sequence graph" to STDOUT:

ref_select alignment.single.sam > seq_graph.json

ref_select requires any SAM (or FASTQ) format input files to be named using the SEAStAR naming convention used for FASTQ files. Crucially, individual SAM alignment files must not mix alignments for reads from different categories (read1, read2, single), and the corresponding filename suffixes of the SAM files must match those of the FASTQ files used to generate them:

sample.read1.fastqalignment.read1.sam sample.read2.fastqalignment.read2.sam sample.single.fastqalignment.single.sam

Note that ref_select may also write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when saving the output to a file, or the resulting saved JSON data file syntax may be corrupted by messages written to STDERR.

Of course, since these types of files may be very large, you may prefer to keep everything compressed:

ref_select alignment.single.sam.gz | gzip -c > seq_graph.json.gz

Because ref_select has many options, and produces many different kinds of outputs, we made the decision to organise all of the output data into a single output stream in a standard text-based format (JSON), that is flexible and straightforward to process in any language. The JSON formatted output produced by ref_select is documented in the appendix: JSON Sequence Graph File Format Reference. However, for most tasks you will not need to look directly into these files because the graph_ops tool will handle that work for you (including the work of extracting data into other standard file formats).

ref_select has four main pieces of functionality:

  1. Quantitatively analyzing reads aligned to a known reference database (e.g. metagenomic reads to a 16S rDNA database, or reads from mRNA transcripts to the gene models of one or more genomes):

     # Select only reference sequences with reads aligned scoring 100 or more bits
     ref_select -t 100.0 alignment.single.sam > ref_stats.json
    
     # Convert ref_select output to a tab separated (TSV) table for use downstream
     graph_ops ref_stats.json TABLE > ref_stats.tsv
    
  2. Analyzing reads mapped back to de novo assembled contigs, to build a "sequence graph" for downstream quality checking, scaffolding, binning, visualization, etc.

     # Construct a sequence graph from the alignment, inserting the assembled contig seqs into the graph
     ref_select --ref=contigs.fna -m alignment.read1.sam alignment.read2.sam > seq_graph.json
    
     # Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
     graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna
    
  3. Converting SOLiD colorspace contigs into nucleotide contigs based on aligned reads

     # Rather than inserting contigs as above, they are reconstructed from the alignment
     # and the provided read FASTQ files
     ref_select -q -m -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam > seq_graph.json
    
     # Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
     graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna
    
  4. Filtering out a sub-set of reads (and their mates) based on the results of an alignment.

     # Write two new read files for all reads (and their mates) that did not map in the alignment
     # NOTE: this is the one exception to the "all data goes into the JSON stream" principle described above.
     # Filtered read output data is written to `FASTQ` format files, with a prefix specified using --read_output
     ref_select --read_output=filtered_ --output_nomatch -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam
    

The approaches used by ref_select to accomplish these tasks are described in more detail in the methods section of (Iverson, et al. 2012).

The [options] reference below organizes the various parameters into general categories related to the above tasks:



[-h|--help] [-v|--version] [--verbose] [-d <seq_id>|--detail=<seq_id>]... [--detail_file=<detail_file>] [-q|--recon_seq] [--ref=<ref_file>] [-m|--mate_pair] [--split=<n>] [--separate_strands] [--rollup] [--seed=<n>] [--num_threads=<n>]

-h / --help

Request help.

-v / --version

Print the build version and exit.

--verbose

Print detailed diagnostic status to stderr during run. Off by default.

-d <seq_id> / --detail=<seq_id>

Only produce per-base statistics and reconstructed sequence for listed seq ids. By default these things are done for all selected sequences. For very large reference sequence databases, ref_select needs to perform a lot of work and can use a lot of memory and produce a very large output. If you want to analyze the entire reference dataset, but only produce detailed results for a small subset of sequences, the --detail option allows you to specify just those sequences (by <seq_id>, one per use of -d / --detail). To specify more than a few sequences, use --detail_file instead.

--detail_file=<detail_file>

Filename of file containing sequence ids to treat as in --detail. Format is a plain text file with one <seq_id> per line.

-q / --recon_seq

Enable reconstruction of contig sequences using alignment data (SAM) and reads (FASTQ). This works for both colorspace and nucleotide data (or a combination of the two), and the resulting sequence is always nucleotide sequence. The reconstructed contigs may contain ambiguity codes (e.g. for SNPs or variable positions) and the sensitivity of this process is controlled with the ambig_tol parameter. NOTES: Requires use of --read_file. If --detail is used, only specified sequences are reconstructed.

--ref=<ref_file>

For assemblies where the contigs and alignments are already nucleotide sequences, this option allows you to specify a FASTA file containing the reference contigs used in the alignment (as an alternative to using --recon_seq). In this case, the reference sequences are simply copies into the output JSON sequence graph for use in downstream analysis (scaffolding, binning, etc).

-m / --mate_pair

This option is required to enable the pairing analysis between mated reads (.read1 and .read2 alignments). When paired reads are available and --mate_pair is used, "edges" will be included in the output JSON data file, connecting the "nodes" into a full sequence graph structure. Enable this option when you want to produce scaffolds for a de novo assembly.

--split=<n>

Split reference sequences into n-base pieces for analysis. The default is no splitting. This is a specialized option that performs all ref_select analyses as though large reference sequences (e.g. whole genomes) are actually split into many separate chunks. This can be useful for testing, but also for performing genome connection and rearrangement analyses when pairing information is available and the sequence read data is from a different organism (e.g. strain or closely related species) than the reference sequence used.

--separate_strands

Top and bottom stands of reference sequences are considered separately. Each reference sequence effectively becomes two references, with reads mapping to each strand contributing to the analysis for that strand's reference only. NOTE: --separate_strands may not be used with: mate-pairing (--mate_pair), sequence reconstruction (--recon_seq) or splitting (--split)

--rollup

When --split is used, or when a --catalog file is specified (e.g. to join reference scaffolds or chromosomes into whole genomes), the output JSON data will include a rollup_stats object with information about each parent reference sequence (e.g. whole genome) calculated from the stats of the individual underlying sequences.

--seed=<n>

SEAStAR uses its own random number generator (for reproducibility among different platforms). You may use a different seed to investigate what difference randomized decisions made in the analysis are having on the results. By default --seed=12345.

--num_threads=<n>

ref_select is highly multithreaded to support modern multi-core processors. Sometimes you will want to restrict the number of cores it uses (e.g. to prevent resource competition on clusters or shared computers, or when you are running more than one instance of ref_select at a time on a given computer.) By default --num_threads is the number of cores on the machine (including "hyper-threads"). The minimum number of threads required for proper operation is 2 or 3 depending on the parmeters being used, and ref_select checks to ensure that the proper minimum number of threads is allocated.

[-t <n>|--bit_thresh=<n>] [-f <n>|--bit_fraction=<n>] [-l <n>|--read_map_limit=<n>] [-s|--second_chance_reads] [--relax_read_sharing] [--all_taxa] [-a|--absolute_bitscores] [--no_rand] [--sim_frac=<n>] [-e <seq_id>|--exclude=<seq_id>]... [--exclude_file=<exclude_file>] [--invert_exclude]

-t <n> / --bit_thresh=<n>

Minimum bitscore value for a ref sequence to be selected for output. Default > 0.0, all sequences with a positive bitscore. NOTE: if both --bit_fraction and --bit_thresh are used, the highest resulting threshold will be used. Use of selection thresholds is useful for selecting a "noise floor" to exclude sequences that have only a small number of reads uniquely aligning, which in a large sequence database (e.g. 16S rDNA databases) may only reflect sequencing errors.

-f <n> / --bit_fraction=<n>

Minimum bitscore value for a ref sequence to be selected for output, as a fraction of the bitscore for the top scoring sequence. NOTE: if both --bit_fraction and --bit_thresh are used, the highest resulting threshold will be used.

-l <n> / --read_map_limit=<n>

Maximum number of reference mappings before a read is rejected from consideration. Default is 50000 reference sequences. Reads that map with a large fraction of the sequences in a reference database (e.g. low complexity reads, or highly conserved sequence) carry very little information but require a lot of computational effort and memory to process. If your sequence database has a lot of conserved or redundant sequence (e.g. 16S rDNA), then this parameter will save a lot of computational expense with little if any effect on the results. If desired, a 2-pass approach can be used where reads are aligned to the full database and run through ref_select first to eliminate the vast majority of the database from consideration. Then in a second pass, reads are re-aligned to just the sequences selected in the first round, so that all aligning reads can be accounted for (e.g. in coverage and relative abundance calculations).

-s / --second_chance_reads

Allow reads to be mapped to edit dist +1 mappings when their best ref is eliminated. By default, if a read maps to a sequence that ultimately doesn't meet the bitscore threshold being used limit selection, that read is removed from further downstream consideration, if it didn't also map at least as well with some other sequence that is ultimately selected for output. When --second_chance_reads is used, such reads gain a "second chance" to be reallocated to reference sequence(s) where they map with edit distance +1 from the eliminated best match(es). This parameter causes ref_select to use significantly more memory. Note: this has no effect when --bit_fraction and --bit_thresh are at their default values, or when --relax_read_sharing is used.

--relax_read_sharing

Allow reads to be mapped to all sequences, regardless of edit distance. By default, reads are only associated with the sequence(s) that they align with at the minimum edit distance. --relax_read_sharing causes reads to be associated with all sequence alignments, even if they are suboptimal relative to the best match.

--all_taxa

Use count of all taxa in catalog to calculate bitscores. By default, any taxa in the reference database that have zero read alignments are excluded from consideration in the information theoretic bitscore calculations (ie. the calculations proceed as though those sequences didn't exist in the reference database at all).

-a / --absolute_bitscores

Use absolute bitscores to score and select reference sequences. By default length normalized scores (bits/nt) are used to control for differences in reference sequence length.

--no_rand

By default, reads that map equally well to multiple locations within a single reference sequence are randomly assigned to one of the possible positions. The --no_rand option changes this behavior so that such a read will be mapped to the first position in the reference seen in the alignment file. Counterintuitively, it is likely that using --no_rand will make the output of ref_select less deterministic. This is because the pseudo-random number generator used by SEAStAR is cross-platform deterministic, whereas the process by which such multiple alignments are ordered in an alignment SAM file may be at least paritally random (say through multi-thread race contingencies in the alignment software.)

--sim_frac=<n>

Fraction of aligned reads to use. Reads (and their mates) are randomly sampled from the input alignment(s) in this proportion. Primarily for use with simulated datasets.

-e <seq_id> / --exclude=<seq_id>

This option in conceptually similar to the --detail option, except that reference sequences identified with --exclude are discarded from the analysis entirely, as though they were not in the reference database to begin with. This is useful if a long, expensive alignment to a large reference sequence database has been completed, and then some downstream analysis calls for processing alignments with only a subset of that database.

--exclude_file=<exclude_file>

Filename of file containing sequence ids to treat as in --exclude. Format is a plain text file with one <seq_id> per line.

--invert_exclude

--exclude or --exclude_file sequences are inverted; that is, exclude all sequences except those specified.

[--per_base] [-w <n>|--cov_window=<n>] [--gc_window=<n>] [--detect_dups]

--per_base

Per base statistics (coverage; and when --mate_pair is enabled, physical coverage and mean insert length) will be generated and output for all selected reference sequences (or only those specified by --detail).

-w <n> / --cov_window=<n>

Fixed read length for use in calculating per position coverage. By default the actual length of each read is used when available in the FASTQ readnames (see the SEAStAR FASTQ format guide for details.) If the readnames do not include length information and --cov_window isn't specified, then 49 nucleotides is the default fixed read length used.

--gc_window=<n>

Enables per position %GC moving average calculation, within the specified window size. The results are output in the per_nt_mean_gc array in the JSON output for each selected node (or only those specified by --detail). Use of this option requires --per_base statistics and either --recon_seq or --ref to provide the sequence data nrcessary for the calculation. If --mp_circular is used, then the %GC calculation wraps around at the sequence ends with no discontinuity, otherwise the N/2 %GC values at each end of the sequence all have the same value as the first (or last) valid window.

--detect_dups

Perform additional analysis to detect likely collapsed duplications. Assumes relatively uniform coverage (i.e. a single isolate genome). Results are reported in the contig_problems section of the JSON sequence graph, a summary of which is available via the graph_ops PROBS command.

[--ambig_tol=<n>] [--read_output=<prefix>] [--output_nomatch] [--read_output_gzip]

--ambig_tol=<n>

Tolerance adjusting the sensitivity for generating ambiguity codes in output sequence reconstructed from read alignments. In the range: 0.0 - 1.0. 1.0 = no ambiguities, majority rules base calling. Default is 0.2

--read_output=<prefix>

Enables output of reads aligning with the refernce database (and their mates, regardless of whether the mates align.) <prefix> indicates the FASTQ output filename prefix to use in writing the new ouput files. Must be used with --read_file and/or rev_read_file.

--output_nomatch

Inverse the meaning --read_output to write only reads that do not align with any reference sequence (whose mates also do not align).

--read_output_gzip

Modifies --read_output to write gzip compressed output files.

[--mp_mate_lim=<n>] [--mp_share_lim=<n>] [--mp_strict] [--mp_inserts] [--mp_circular] [--mp_cutoff=<n>]

--mp_mate_cnt=<n>

Minimum count of mapping read-pairs for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 2, meaning that no edges consisting of a single read-pair will be output.

--mp_mate_lim=<n>

Minimum bitscore threshold for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 0.0 bits, however --mp_mate_count above will put a floor on lowest scoring edges output.

--mp_share_lim=<n>

Minimum bitscore threshold for generation of a shared sequence edge in the output graph. Default is 500.0 bits.

--mp_strict

Do not consider paired reads that span reference sequences when the same pair of reads also map entirely within some single reference sequence.

--mp_inserts

Perform insert size estimation for all reference sequences where at least one pair of reads both map within that sequence.

--mp_circular

Allow circular self-linking mate-pairs to join the opposite ends of a single sequence.

--mp_cutoff=<n>

Insert size cutoff for inclusion of an individual mate-pair in per base statistics. Setting this option for some reasonable upper limit of the length of a valid insert will prevent sequence duplications from improperly skewing estimation of the true mean insert size.

[-r <read_file>|--read_file=<read_file>]... [--rev_read_file=<read_file>]... [--old_bwa_samse] [-c <catalog_file>|--catalog=<catalog_file>] [--rev_align=<sam_file>]...

NOTE! all inputs to ref_select involving per read data (FASTQ and SAM files) must be named using the SEAStAR FASTQ naming conventions, also see the introductory description of ref_select for an example.

-r <read_file> / --read_file=<read_file>

Input filename(s) for FASTQ read files used for sequence reconstruction and/or read filtering.

--rev_read_file=<read_file>

Same as --read_file above, but used for paired reads that are on the opposite strand from each other. By default, ref_select assumes that paired reads are from the same DNA strand. When this is not true, then one of the two readsets (read1/read2) must be specified using --rev_read_file. Note, this must be done consistently with the use of --rev_align.

--old_bwa_samse

Use the old BWA-style "samse" alignment format instead of SAM. This format was used by the BWA samse -n command until BWA version 0.5.6.

-c <catalog_file> / --catalog=<catalog_file>

Filename of a reference catalog file. Required for use --old_bwa_samse. This is optional otherwise, although it is required to associate sequence descriptions with reference sequences in the JSON output, or for use of the --rollup functionality when reference sequences are actually subsequences of parent sequences (e.g. scaffolds or chromosomes of a whole genome). A catalog is a tab separated (TSV) plain text file with the following columns:

  seq_id     seq_len     parent_id   seq_description

For example:

  ABCD123    1102938     ORG1        Organism 1, Scaffold 1
  EFGH987    482273      ORG1        Organism 1, Scaffold 2
  ZYXW765    193475      ORG2        Organism 2, Scaffold 1
  RTSP345    782273      ORG2        Organism 2, Scaffold 2

Note that each seq_id must be unique, but duplicated parent_id values indicate relationships among sequences (a shared parent sequence).

--rev_align=<sam_file>

Same as the main <sam_file> input filenames, but used for paired reads that are on the opposite strand from each other. By default, ref_select assumes that paired reads are from the same DNA strand. When this is not true, then alignments for one of the two readsets must be specified using --rev_align. Note, this must be done consistently, and also with the use of --rev_read_file when input FASTQ files are also used.


graph_ops reads JSON sequence graph files produced by the ref_select program. It implements a variety of commands for manipulating this data for assembly, visualization or output to a variety of file formats (as detailed for each subcommand below). Because the JSON sequence graph objects are often very large, and therefore may take significant time to read and write, we designed graph_ops as a federated suite of tools that can pass the current sequence graph state from one tool to the next in-memory. This saves considerable time and I/O traffic relative to an alternative design that used separate command line tools for each operation performed by graph_ops. A consequence of this design is that graph_ops may be used in either a scripted mode, where all commands are supplied via the command-line or a script (.go) file; or in an interactive mode, where commands are manually entered one at a time.

Usage: graph_ops [-h|--help] [<input.json>] [<script.go>] [<command> ['{params}']]...

where:

  • -h / --help provides command line help regarding program usage and version information

      # Provide help about shell command line usage
      graph_ops --help
      # Provide help about graph_ops commands
      graph_ops HELP   # HELP is actually a graph_ops command
    
  • <command> is some valid command and '{params}' optionally specifies parameters for each command, in the form: '{"parm1":value1,"parm2":value2...}'. Multiple commands with optional parameters may be provided in succession for execution. NOTE: the parameters for each command must be single quoted to ensure that it is passed to graph_ops as a single item, and to prevent interactions with special characters used by the UNIX shell.

      # Read a sequence graph and output all of the sequences to a FASTA file
      graph_ops LOAD '{"file","assembly.json"}' FASTA '{"file":"sequence.fna"}'
    

    Like most of the other tools in SEAtAR, graph_ops can read and write gzip compressed versions of all of its input and output files:

      # Read a compressed sequence graph and output all of the sequences to a compressed FASTA file
      graph_ops LOAD '{"file","assembly.json.gz"}' FASTA '{"file":"sequence.fna.gz"}'
    
  • <input.json> is an optional shortcut specifying a sequence graph file to initially LOAD

      # Load the sequence graph file and go interactive
      graph_ops assembly.json
    
      # Equivalent to
      graph_ops LOAD '{"file":"assembly.json"}' SCRIPT
    
  • <script.go> is an optional SCRIPT to initially execute

      # Load the sequence graph file and run the scaffold script
      graph_ops assembly.json write_fasta.go
    
      # Equivalent to
      graph_ops LOAD '{"file":"assembly.json"}' SCRIPT '{"file":"write_fasta.go"}'
    
      # If the write_fasta script contains the LOAD command, then just this will suffice
      graph_ops write_fasta.go
    

    In this last case, the write_fasta.go script could be (note that single quotes around the parameters in a .go file are not necessary):

      # This is the write_fasta.go file
      LOAD {"file":"assembly.json"}
      FASTA {"file":"sequence.fna"}
    

Running graph_ops without any options will launch in interactive mode, equivalent to: graph_ops SCRIPT

Note that many graph_ops commands may write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when using commands such as DUMP, DOT, TABLE, or FASTA (without specifying an output filename) and redirecting the STDOUT of graph_ops to a file. Likewise, only one such output command may use STDOUT per run of graph_ops (e.g. redirecting the output of both FASTA and DUMP to the same file will result in a file that is both invalid JSON and invalid FASTA format.)

graph_ops <command> summary:

File I/O commands

  • LOAD -- Input current data structure from a file, or by default from STDIN
  • DUMP -- Output current data structure to a file, or by default to STDOUT
  • TABLE -- Write ref_select statistics for selected reference sequences to a TSV file
  • FASTA -- Write sequences contained in the selected graph to the FASTA format
  • DOT -- Write the current assembly graph to the graphviz DOT format

Assembly pipeline commands (in this order)

  • MST -- Calculate the Maximal Spanning Tree of all connected components
  • SST -- Calculate the improved Scaffold Spanning Tree of all connected components. SST is an optional replacement for MST, useful for metagenomes or relatively short contigs.
  • PLUCK -- Remove all leaf contig nodes (in or outdegree == 0) from the graph
  • PRUNE -- Split the assembly graph at all contig nodes with in or out degree > 1
  • SLICE -- Break linear scaffolds at a GC / coverage discontinuity. SLICE is optional, but recommended for metagenomes.
  • PUSH -- Extends scaffold ends (reversing the action of PLUCK at scaffold ends)
  • INSERT -- Reconnect contigs with ambiguous placement within a scaffold using mapped pair position information to resolve ambiguities
  • SCAFF -- Lay out a fully linear scaffold from contigs in unambiguously ordered connected components
  • CLUST -- Cluster scaffolds using the tetracalc tool

Assembly graph filter / edit utilities

  • CIRCLE -- Find circular sequences among scaffolds.
  • CCOMPS -- Calculate the current graph connected components
  • SELCC -- Select specific connected components for further processing
  • SELCLUST -- Select clusters of scaffolds for further processing
  • SELND -- Select contigs from the connected neighborhoods of the given contigs
  • SCAFLNK -- Find edge connections linking scaffold ends
  • RELINK -- Reconnect previously removed edge connections between contigs
  • EDGFLT -- Remove all edges scoring less than some number of bits
  • PROBS -- Generate a report of contigs with likely internal assembly issues
  • EDIT -- Make manual explicit edits to the selected graph structure
  • CUTND -- Create a new node from an existing node using the given coordinates

Information and control

  • GC -- Generate statistics about the assembly graph
  • GCC -- Generate statistics about the assembly graph, with details about each connected component
  • STASH -- Copy (push) the current graph to the top of a stack in memory
  • UNSTASH -- Restore (pop) the current graph from the top of the stack (undo changes)
  • SCRIPT -- Use contents of file as a series of commands, or read from the console if no file provided
  • HELP -- List all valid commands, or provide detailed help for a specific command with HELP <COMMAND>

graph_ops detailed <command> reference:

Input JSON sequence graph data from a file, or by default from STDIN.

On the command line, if the first parameter isn't a valid command string, and it ends in .json[.gz], then it is assumed to be the name of a JSON file, and an implicit LOAD command will be run using that filename.

Parameters:

  • file : "filename.json[.gz]"

    Specify the name of a JSON format sequence graph file.

    Examples:

      # Read data from the file my_assembly.json
      LOAD {"file":"my_assembly.json"}
    
      # Read data piped from STDIN (default)
      LOAD {"file":"-"}
    
  • files : ["filename1.json[.gz]",...]

    Specify the names of multiple JSON format sequence graph files to merge together.

    NOTE: The files must have uniquely named sequence nodes (contigs) with no nodes in common among the multiple files being loaded. Only the processing history from the first JSON file is retained (the other histories may still be found in their original files.) Other processing output such as scaffolds and clusters are lost.

    Example:

      # Read and merge data from the files my_assembly_1.json and my_assembly_2.json
      LOAD {"files":["my_assembly_1.json","my_assembly_2.json"]}
    
  • edges : "edge_file.json[.gz]"

    Specify the name of a JSON format sequence graph file to load edges and mate-pair statistics from.

    This option works with the file option, and requires that the edge file contain a graph with a subset of the nodes of the main graph being loaded with file. Any edges and mate-pair statisics present in the main graph are discarded in favor of those from the edge file.

    Example:

      # Read nodes and sequence data from main_file.json, and edges and connection
      # data from edge_file.json
      LOAD {"file":"main_file.json","edge_file":"edge_file.json"}
    

Output current sequence graph data to a JSON file, or by default to STDOUT

Parameters:

  • file : "filename.json[.gz]"

    Specify a file name to write the JSON format sequence graph to. If the filename contains one or more # characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no # characters are present, then this command overwrites any existing file of the same name.

    Examples:

      # Write data to the file my_assembly.json
      DUMP {"file":"my_assembly.json"}
    
      # Write data piped to STDOUT (default)
      DUMP {"file":"-"}
    
      # Write data to the file: my_assembly_000.json (first time this is run)
      DUMP {"file":"my_assembly_###.json"}
    
      # Run again, write data to the file: my_assembly_001.json
      DUMP {"file":"my_assembly_###.json"}
    

Write ref_select statistics for selected reference sequences to a TSV file. The fields are (left to right by column):

  • bitscore - Information content of reads aligning with the ref sequence
  • read_cnt - Number of (possibly fractional) reads aligning with the ref sequence
  • norm_cnt - Read_cnt normalized to ref sequence length
  • rel_abun - Relative fractional abundance of (copy number of) the ref sequence to all selected sequences (those with bitscores above some thresh)
  • mean_cov - Mean coverage of the reference sequence by (possibly fractional) reads
  • read_len - Mean length of reads aligning with this sequence
  • seq_len - Length of the ref sequence
  • pct_gc - Percent GC content of ref sequence (NA if not calculated)
  • name - Catalog name of the ref sequence
  • desc - Catalog description of the ref sequence

Parameters:

  • file : "filename.tsv[.gz]"

    Specify a file name to write the TSV format stats table to. If the filename contains one or more # characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no # characters are present, then this command overwrites any existing file of the same name. (See DUMP command an example using this behavior)

    Examples:

      # Write stats to the file my_seq.tsv
      TABLE {"file":"my_seq.tsv"}
    
      # Write stats to STDOUT (default)
      TABLE {"file":"-"}
    
  • header : true

    Write a header row in the output table with labels for each column.

    Example:

      # Write a header row (false is default).
      TABLE {"header":true}
    

Write sequences contained in the current graph data to a FASTA format file

Parameters:

  • file : "filename.fasta[.gz]"

    Specify a file name to write the FASTA format sequence data to. If the filename contains one or more # characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no # characters are present, then this command overwrites any existing file of the same name. (See DUMP command an example using this behavior)

    Examples:

      # Write stats to the file my_seq.fasta
      FASTA {"file":"my_seq.fasta"}
    
      # Write sequence to STDOUT (default)
      FASTA {"file":"-"}
    
  • scaff : true or scaff : "[options]"

    Output fully scaffolded sequences (using the seq_scaffold tool). Using this parameter with the value true run seq_scaffold with its default settings. As an advanced option, this parameter can also accept a string argument, which will be passed along to the external seq_scaffold tool as its [options] parameter string. Run with option string "--help" for help with the settings offered by seq_scaffold and its default values.

    Examples:

      # Output scaffolded contig sequences using default settings.
      FASTA {"scaff":true}
    
      # Get seq_scaffold help
      FASTA {"scaff":"--help"}
    
      # Pass seq_scaffold some options
      FASTA {"scaff":"--overlap=7 --heal=othercontigs.fna"}
    
  • no_merge_scaffs : true

    Write contig sequences in scaffold order with a scaffold ID in each header, ready to be provided to the seq_scaffold tool (but do not run it).

    Example:

      # Output ordered contig sequences
      FASTA {"no_merge_scaffs":true}
    
  • abundance : true

    Append relative abundance values to the FASTA sequence IDs. This is for use with downstream classification and visualization tools, such as for the 16S rDNA analysis pipeline vignette.

    Example:

      # Attach abundance values
      FASTA {"abundance":true} -- Attach abundances
    

Write the current graph data to a graphviz format DOT file for visualization or other processing. Some of the parameters listed below allow you to change node and edge parameters that will affect the output style of graphs rendered using the dot or neato layout tools that are part of the graphviz package. The parameters provided here give only very limited control of the most common style choices available. The gvpr language supplied by graphviz is much better suited for more detailed manipulation of the myriad of rendering options available.

Parameters:

  • file : "filename.dot[.gz]"

    Specify a file name to write the DOT format graph data to. If the filename contains one or more # characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no # characters are present, then this command overwrites any existing file of the same name. (See DUMP command an example using this behavior)

    Example:

      # Write stats to the file my_graph.dot
      DOT {"file":"my_graph.dot"}
    
      # Write graph to STDOUT (default)
      DOT {"file":"-"}
    
  • detail : true

    Draw labelled contig nodes

    Example:

      # Draw contigs as ovals containing the node names of each contig.
      # Note that this disables scaling the node size by contig length.
      DOT {"detail":true}
    
  • arrowtype : "arrowtype_string"

    Control the type of arrowheads drawn. See graphviz documentation at: http://www.graphviz.org/doc/info/attrs.html#k:arrowType

    Example:

      # Draw normal arrows (default)
      DOT {"arrowtype":"normal"}
    
  • pen_scale : <float>

    Control the relative thickness of the arrow lines. See graphviz documentation at: http://www.graphviz.org/doc/info/attrs.html#d:penwidth

    Example:

      # Draw normal arrows (default)
      DOT {"pen_scale":0.1}
    
  • const_edge : true

    Draw connection arrow lines at constant width

    Example:

      # Draw edge arrows of constant width regardless of bitscore.
      DOT {"const_edge":true}
    
  • colored_edges : true

    Draw edges colored by the GC% of the connected contigs. Note that this overrides any other edge coloring that would have otherwise have been applied.

    Example:

      # Draw colored arrows
      DOT {"colored_edges":true}
    
  • problems : true

    Draw nodes with potential assembly problems as red double circles with the node label text inside. Note that this overrides the normal node text that would otherwise have been rendered.

    Example:

      # Highlight problem nodes
      DOT {"problems":true}
    

Calculate the Maximal Spanning Tree of all connected components. This command reduces the connection graph to a tree-structure that optimally maximizes the total remaining edge bitscores, preserving all previously connected components of the graph, and directs all remaining edges of the graph based on the pairing information.

Parameters:

  • bits : true

    Use raw connection bitscores instead the default heuristically adjusted scores (based on GC% / coverage differences)

    Example:

      # Use raw connection bitscores from mate-pairing
      MST {"bits":true}
    

As an alternative to the MST command, calculate the Scaffold Spanning Tree of all connected components, producing a sub-optimal tree from the perspective of maximizing edge bitscores, but with important properties for successful contig scaffolding in the following steps. This command is generally preferable to the MST command when median contig length is less than the mean distance between paired reads. That is, when a relatively large mate-pair insert size was selected for the sequencing library, relative to the size of the contigs that could ultimately be assembled. This condition is often the case for metagenomic samples, where inter-strain variability adversely affects the median contig length that can be achieved.

Parameters:

  • bits : true

    Use raw connection bitscores instead the default heuristically adjusted scores (based on GC% / coverage differences)

    Example:

      # Use raw connection bitscores from mate-pairing
      SST {"bits":true}
    

Remove all leaf contig nodes (in- or out-degree == 0) from the (tree-structured) graph. Two (or sometimes three) iterations are usually sufficient to achieve the goal of removing all non-terminal short branches from the tree. These contigs will be added back to the assembly in later operations once the "backbone" scaffolds have been determined. NOTE: PLUCK does not remove two "leaf" contig nodes that form a connected pair; that is, multiple iterations of PLUCK will not completely remove contigs that happen to form short chains. Also, PLUCK preserves entirely unconnected contigs with more than 20000 bases of sequence by default. See the min_len parameter below.

Parameters:

  • iterate : <int>

    Specify the number of iterations of PLUCK to run. Each is equivalent to running PLUCK again as a separate command. By default <int>= 3.

    Examples:

      # Default is {"iterate":2}
      PLUCK
    
      # This is the equivalent of the above command
      PLUCK {"iterate":1}
      PLUCK {"iterate":1}
    
  • min_len : <int>

    Minimum length of isolated nodes to keep. Default <int>= 20000 bases.

    Example:

      # Do not "pluck away" an unconnected contig containing 20000 or more bases of sequence.
      PLUCK  {"min_len":20000}
    

Split the assembly graph at all contig nodes with in or out degree > 1. This command prunes remaining branches from the assembly tree by selectively removing the lowest scoring extra in- or out- edges from a node. After this operation is complete, all remaining connected components will be linear and properly directed.

Parameters:

  • strict : true

    Always cut all but the strongest link when there are more than two in + out links

    Examples:

      # The default case: attempt to determine which branch(es) to remove in order to
      # preserve one high weight path across a multiply linked contig node.
      PRUNE {"strict":false}
    
      # Always strictly prune. Equivalent to {"ratio":0.0} below
      PRUNE {"strict":true}
    
  • ratio : <float>

    Ratio of the strongest to next strongest links to trigger strict pruning

    Examples:

      # Always strictly prune.
      PRUNE {"ratio":0.0}
    
      # Never strictly prune.
      PRUNE {"ratio":1.0}
    
      # Default. Don't strictly filter when the score of the highest scoring edge
      # is >= 5x greater than the next highest scoring link of the same direction
      # (for both in and outlinks).
      PRUNE {"ratio":0.2}
    
  • verbose : true

    Output diagnostics on STDERR

    Example:

      # Generate extra output information
      PRUNE {"verbose":true}
    

Break linear scaffolds at a %GC / Coverage discontinuity. Each run of SLICE will break a given scaffold at no more than one position. SLICE should be run multiple times if multiple misassemblies per scaffold are suspected.

Parameters:

  • thresh : <float>

    Threshold used to determine whether or not to break a connection

    Example:

      # Default. Medium strength heuristic score based on GC% / Coverage statistics of a
      # scaffold on either side of a given contig's connection edge. The lower the
      # threshold, the more sensitive the filter is to such discontinuities.
      SLICE '{"thresh":0.5}'
    

Extends scaffold ends, reversing the (undesired) action of PLUCK at scaffold ends. Because the PLUCK command can't distinguish short-branches from what will ultimately become scaffold ends, it essentially erodes a contig off of the each end with every iteration. To reverse this, PUSH should be run the same number of iterations as was PLUCK previously.

Parameters:

  • iterate : <int>

    Number of iterations of PUSH to run. Each is equivalent to running PUSH again as a separate command. By default <int>= 3.

    Examples:

      # Default case.
      PUSH {"iterate":3}
    
      # Equivalent to above.
      PUSH {"iterate":2}
      PUSH {"iterate":1}
    

Reconnect contigs with ambiguous placement within a scaffold using relative mean read-pair mapping position information to resolve ambiguities. By default this will run on all scaffolds. INSERT works by fully connecting contigs within the scaffold when their relative position can be unambiguously determined, and adding new "constraint" edges (supported only by the relative mapped position information) between such contigs so that they can be ordered correctly by the SCAFF command.

Parameters:

  • ccname : "contig_name"

    Use only the connected component (scaffold) containing the named contig.

    Example:

      # Insert removed contigs within the connected component containing this contig
      INSERT {"ccname":"NODE_1234"}
    
  • ccnames : ["contig_name1","contig_name2",...]

    Insert removed contigs within only the connected components (scaffolds) containing these contigs.

    Example:

      INSERT {"ccnames":["NODE_1234","NODE_567"]}
    
  • thresh : <float>

    Minimum contig connection bitscore to consider

    Example:

      # Default. Only connections scoring at least 250.0 bits
      # will count in the calculations
      INSERT {"thresh":250.0}
    
  • min_pos_diff : <int>

    Minimum mapping position difference (within a neighboring contig) considered reliable enough to use for resolving positional ambiguities.

    Example:

      # Default. Only pairing information yielding a relative position difference of
      # 75 nucleotides (between candidate contigs and a neighboring existing contig)
      # will be considered significant enough to use in determining the relative
      # ordering of the candidate contigs
      INSERT {"min_pos_diff":75}
    
  • dup_kmer : <int>

    Length of k-mers to use for detecting variant duplicate contigs

    Examples:

      # Default. Use 14-mers to detect duplicate variant contigs before inserting
      # them between scaffold backbone contigs.
      INSERT {"dup_kmer":14}
    
      # Disable duplicate variant contig checks.
      INSERT {"dup_kmer":0}
    
  • dup_thresh : <float>

    Fraction of kmers from a contig with hits to scaffold backbone.

    Example:

      # Default. If more than 15% of kmers for a contig hit kmers from the scaffold
      # backbone contigs, then do not insert that contig.
      INSERT {"dup_thresh":0.15}
    
  • external : true

    Controls whether contigs may be inserted outside of an input scaffold.

    Examples:

      # Default.
      # Contigs will not be added outside of the first and last contigs in the input
      # scaffold; that is, the only inserts performed will be between existing connected
      # contigs.
      INSERT {"external":false}
    
      # Contigs may be added to the scaffold before the first, or after the last, contig
      # in an input scaffold.
      INSERT {"external":true}
    
  • verbose : true

    Output diagnostics on STDERR

    Example:

      # Generate extra output information
      INSERT {"verbose":true}
    

Lay out a fully linear scaffold from contigs in unambiguously ordered connected components (as produced by the INSERT command)

Parameters: NONE.

Cluster scaffolds using the external tetracalc tool

Parameters:

  • options : <string>

    Command line options for the tetracalc tool, otherwise tool defaults used. Run with --help for help with the settings offered by tetracalc.

    Examples:

      CLUST {"options":"--merge_tar=0.95 -m 7500"}
    
      CLUST {"options":"--help"}
    
      CLUST {"options":"--fixed -t 0.9 -s 0.8 -r 0.9"}
    

Find circular sequences among scaffolds

Parameters:

  • circular : true

    Return circularized scaffolds.

    Default (false): Produce linear scaffolds placing the ends within the largest contig of each connected component. The node with the longest sequence will be divided equally into two and the mate-pair connections also divided, such that they form two clean ends for the new scaffold.

    Example:

      # Produce circular scaffolds.
      CIRCLE {"circular":true}
    
  • thresh : <float>

    Minimum bitscore of end-connecting pairing information

    Example:

      # Default. Circle completing end connections must have a positive bit score to qualify.
      CIRCLE {"thresh":0.0}
    

Calculate the current assembly graph connected components. Note, this is a low-level operation which is typically done automatically by other commands as required.

Parameters:

  • sortby : <string>

    Specify how to sort the resulting list of CCs. Must be: "nodes" or "sequences".

    Examples:

      # Default. Sort CCs in descending order of number of nodes.
      CCOMPS {"sortby":"nodes"}
    
      # Sort CCs in descending order of amount of sequence.
      CCOMPS {"sortby":"sequences"}
    

Select specific connected components for further processing. "Unselected" connected components are saved in a "removed" data structure and may be recalled with other commands.

NOTE: There are two types of parameters below. "Selection parameters" allow a subset of connected components to be selected by contig names or component number. "Filter parameters" are applied to the set of "selected" connected components (or by default, to the set of all connected components.) These filters may also be used in combination, resulting in a logical "AND" relationship (ie. connected components not satisfying all of the filters are removed).

Selection Parameters:

  • ccname : "contig_name"

    Select the connected component containing the named contig

    Example:

      # Select the connected component containing the contig named NODE_1234
      SELCC {"ccname":"NODE_1234"}
    
  • ccnames : ["contig_name1","contig_name2",...] Select the connected component(s) containing the named contigs

    Example:

      # Select the connected components containing the contigs named NODE_1234 and NODE_5678
      SELCC {"ccnames":["NODE_1234","NODE_5678"]}
    
  • ccnum : <int>

    Select connected component number <int>

    Example:

      # Default. Select connected component number 0 (the first CC).
      SELCC {"ccnum":0}
    
  • ccnums : [<int>,<int>,...]

    Select the connected components from the list of numbers

    Example:

      # Select the second and third connected components (numbering is zero based)
      SELCC {"ccnums":[1,2]}
    
  • ccrange : [<int1>,<int2>]

    Select the connected components numbered in the range <int1>..<int2> (inclusive). <int> may be negative, indicating positions at the end of the list of connected components.

    Examples:

      # Select the first 6 connected components
      SELCC {"ccrange":[0,5]}
    
      # Select the last 5 connected components
      SELCC {"ccrange":[-5,-1]}
    
  • shift : true

    Select all connected components except the first one. This is like SELCC {"ccrange":[1,-1]} except it doesn't generate an error when there is only one remaining connected component, allowing processing to potentially continue in any calling SCRIPT commands.

    Example:

      # Drop the first connected component.
      SELCC {"shift":true}
    

Filter Parameters:

  • min_nodes : <int>

    Select connected components with or more nodes.

    Example:

      # Select connected components with 2 or more nodes.
      SELCC {"min_nodes":2}
    
  • min_seqlen : <int>

    Select connected components with or more sequence within nodes.

    Example:

      # Select connected components containing at least 1000 bases of sequence.
      SELCC {"min_seqlen":1000}
    
  • sequence : <string>

    Select connected components containing the provided DNA sequence. NOTE! This isn't BLAST, the sequence must match exactly. Any differences, including ambiguity codes, will prevent matching. The only extra thing that is done is the reverse complement of the provided sequence is also searched.

    Example:

      # Select connected components containing the provided sequence
      # (or its reverse complement).
      SELCC {"sequence":"AGACTAGCAGATATACGATAACGATACGATACGAT"}
    
  • sequences : [<string>,...]

    Like sequence parameter above, but using a list of sequences

    Example:

      # Select connected components containing the provided sequences
      # (or their reverse complements).
      SELCC {"sequences":["AGACTAGCAGATATACG","ATAACGATACGATACGAT"]}
    
  • soft : true

    Determines whether unselected ccomp nodes are deleted or kept as "removed"

    Examples:

      # The contigs of first connected component will remain selected, all
      # others are retained as unselected.
      SELCC {"soft":true}
    
      # The contigs of first connected component will remain selected,
      # neighboring contigs (one hop away) are unselected and added to the
      # "removed" pool. All others are permanently deleted.
      SELCC {"soft":false}  # Default
    

Select clusters of scaffolds for further processing

Parameters:

  • clustnum : <int>

    Select a specific cluster

    Example:

      # select the first cluster of scaffolds
      SELCLUST {"clustnum":0}
    
  • clustnums : [<int1>, <int2>, ...]

    Select multiple specific clusters

    Example:

      # Select these three clusters
      SELCLUST {"clustnums":[0,3,5]}
    
  • clustrange : [<int1>, <int2>]

    Select a range of clusters

    Examples:

      # Select the first six clusters
      SELCLUST {"clustrange":[0,5]}
    
      # Select the last five clusters
      SELCLUST '{"clustrange":[-5,-1]}'
    
      # select all clusters except the last one
      SELCLUST '{"clustrange":[0,-2]}'
    
  • shift : true

    Select all clusters except the first one. This is like SELCLUST {"clustrange":[1,-1]} except it doesn't generate an error when there is only one remaining cluster, allowing processing to potentially continue in any calling SCRIPT commands.

    Example:

      # Drop the first cluster.
      SELCLUST {"shift":true}
    
  • exclusive : true

    Remove all scaffolds outside of the selected clusters

    Example:

      # Default. Keep all unselected scaffolds for subsequence processing
      SELCLUST '{"exclusive":false}'
    

Select contigs from the connected neighborhood(s) of the given contig(s)

Parameters:

  • name : "contig_name"

    Use a single contig by name.

    Example:

      # Select NODE_1234 (and its neighbors)
      SELND {"name":"NODE_1234"}
    
  • names : ["contig_name1","contig_name2",...]

    Use multiple contigs by name.

    Example:

      # Select these two contigs (and their neighbors...)
      SELND {"names":["NODE_1234","NODE_5678"]}
    
  • radius : <int>

    Size of the neighborhood of contigs to select

    Examples:

      # Select all neighbors and neighbors of neighbors
      SELND {"radius":2} --
    
      # Default. Select only the named contig(s)
      SELND {"radius":0}
    
  • sequence : <string>

    Select contig(s) found to contain the provided DNA sequence.

    NOTE! This isn't BLAST, the sequence must match exactly. Any differences, including ambiguity codes, etc. will prevent matching. The only extra thing that is done is the reverse complement of the provided sequence is also searched.

    Example:

      # Select contig(s) containing the provided sequence (or its reverse complement).
      SELND {"sequence":"AGACTAGCAGATATACGATAACGATACGATACGAT"}
    
  • sequences : [<string>,...]

    Like sequence parameter above, but using a list of sequences

    Example:

      # Select contigs containing the provided sequences (or their reverse complements).
      SELCC {"sequences":["AGACTAGCAGATATACG","ATAACGATACGATACGAT"]}
    

Find connections linking scaffold ends

Parameters:

  • scaff_names : ["Scaffold_name1","Scaffold_name2",...]

    Select a subset of scaffolds to attempt to link to other scaffolds. Default: use all scaffolds.

    Example:

      # Select only the scaffolds named Scaffold_1234 and Scaffold_5678 for processing.
      # Only scaffold links involving one of these scaffolds will be considered.
      SCAFLNK {"scaff_names":["Scaffold_1234","Scaffold_5678"]}
    
  • exclusive : true

    Only look for connections within the scaff_names list. Default: look for connections between those in scaff_names and all other scaffolds.

    Example:

      # Only look for connections between Scaffold_1234 and Scaffold_5678.
      # All other scaffolds are ignored.
      SCAFLNK {"exclusive":true,"scaff_names":["Scaffold_1234","Scaffold_5678"]}
    
  • thresh : <float>

    Bitscore threshold

    Example:

      # Default. Only reconnect scaffold ends with connections scoring
      # at least 1000.0 bits
      SCAFLNK {"thresh":1000.0}
    

Reconnect previously removed connections between contigs. RELINK can move contigs from out of the "unselected" pool.

Parameters:

  • name : "contig_name"

    Use a single contig by name.

    Example:

      # Restore removed edges to this contig (reconnecting with the contigs on the
      # other side of the edges)
      RELINK {"name":"NODE_1234"}
    
  • names : ["contig_name1","contig_name2",...]

    Use multiple contigs by name.

    Example:

      # Restore edges to these two contigs (though not necessarily between them...)
      RELINK {"names":["NODE_1234","NODE_5678"]}
    
  • ccname : "contig_name"

    Use all contigs in the connected component containing the named contig.

    Example:

      # Restore edges to all contigs in the connected component containing this contig
      RELINK {"ccname":"NODE_1234"}
    
  • ccnames : ["contig_name1","contig_name2",...]

    Restore edges to all contigs in the connected components containing these contigs.

    Example:

      # Restore edges to all contigs within the connected compontent(s) containing
      # these two contigs
      RELINK {"ccnames":["NODE_1234","NODE_5678"]}
    
  • radius : <int>

    Expand the sphere of restored connections to neighbors.

    Examples:

      # Default value. Restore edges only to neighboring contigs
      RELINK {"radius":0}
    
      # Recursively restore edges to all contigs within two hops of any selected contigs
      # (including along newly restored edge paths)
      RELINK {"radius":2}
    
  • complete : true

    Restore connections among all types of contigs.

    Examples:

      # Restore connections among contigs which are both removed from, and part of,
      # currently selected connected components. This is like running with both modes of
      # the 'existing' parameter (below) simultaneously.
      #
      RELINK {"complete":true}
    
      # Default case.
      # The 'existing' parameter (below) determines which type of connections are restored.
      RELINK {"complete":false}
    
  • existing : true

    Only restore connections between nodes currently part of selected connected components. This parameter has no effect when the 'complete' parmeter (above) is true.

    Examples:

      # Only restore connections between currently selected contigs
      RELINK {"existing":true}
    
      # Default case.
      # Only restore connections between currently selected and currently unselected
      # contigs (this causes the unselected contigs to become part of the selected set
      # again). That is, no new connections within the contigs of currently selected
      # connected components are restored.
      RELINK {"existing":false}
    
  • problems : true

    Restore all connections to contigs that are marked with potential assembly problems. Uses radius:1 and existing:true

    Example:

      # Reconnect all problem contigs to all of their potential neighbors.
      RELINK {"problems":true}
    

Remove all edges scoring less than thresh bits

Parameters:

  • thresh : <float>

    Bitscore threshold which must be met to keep an edge.

    Example:

      # Default. Remove all edges scoring less than 500.0 bits.
      EDGFLT {"thresh":500.0}
    

Generate a report of contigs with likely internal assembly issues from the contig_problems section of the JSON graph file. This information is optionally generated by using the ref_select --detect_dups parameter.

Parameters:

  • file : "filename.txt[.gz]"

    Name of txt format file to write statistics to

    Examples:

      # Write report to the file problems.txt
      PROBS {"file":"problems.txt"}
    
      # Default. Write report to STDOUT
      PROBS {"file":"-"}
    
  • detail : true

    Provide extra per contig detail

    Example:

      PROBS {"detail":true}
    

Make manual explicit edits to the selected graph structure. Using this function is preferable to directly editing the JSON structure because it provides a direct record of what was done, and makes it easily repeatable in case changes to upstream analysis are made. If more than one node-independent sets of edges are to be added or removed (that is, those not sharing any contig(s) in common), then multiple calls to EDIT are required to accomplish this task.

Parameters:

  • rem_nodes : ["contig1","contig2",...]

    Remove the specified contig nodes by name

    Example:

      # Remove these two contigs from the graph
      EDIT {"rem_nodes":["NODE_1234","NODE_5678"]}
    
  • add_nodes : ["contig1","contig2",...]

    Add back the specified contig nodes by name

    Example:

      # Move these two contigs from the "unselected" pool back to the selected graph
      EDIT {"add_nodes":["NODE_1234","NODE_5678"]}
    
  • rem_edges : ["contig1",["contig2",...]]

    Remove the specified connections between contig1 and any number of other contigs.

    Example:

      # Remove two connection edges from the graph, both connected to NODE_1234
      EDIT {"rem_edges":["NODE_1234",["NODE_5678","NODE_9"]]}
    
  • add_edges : ["contig1",["contig2",...]]

    Add back the specified connections between contig1 and any number of others.

    Example:

      # Add two connection edges from the graph, both connected to NODE_1234
      EDIT {"add_edges":["NODE_1234",["NODE_5678","NODE_9"]]}
    

NOTE: for rem_edges and add_edges above, if only a single edge is involved, then the parameter syntax may optionally be flattened.

For example:

EDIT {"add_edges":["NODE_1234",["NODE_9"]]}  # is equivalent to
EDIT {"add_edges":["NODE_1234","NODE_9"]}

Create a new node from an existing node using the given coordinates. Using this function is preferable to directly editing the JSON structure because it provides a direct record of what was done, and makes it easily repeatable in case changes to upstream analysis are made.

Parameters:

  • name : "contig_name"

    Name of the contig node to be copied

    Example:

      # Use sequence from NODE_1234
      CUTND {"name":"NODE_1234"}
    
  • new_name : "new_contig_name"

    Name for the newly created contig node

    Example:

      # New node will be named NODE_1234a
      CUTND {"new_name":"NODE_1234a"}
    
  • include : ["contig_name1","contig_name2",...]

    Move mate-pair edges from the listed contig(s) to the newly created "cut" version of the named contig.

    Example:

      # Edges between NODE_1234 and the "included" contigs will be moved to NEW_1234.
      CUTND {"name":"NODE_1234","new_name":"NEW_1234","include":["NODE_567","NODE_234"]}
    
  • begin : <int> and end : <int>

    begin is the beginning sequence coordinate for the new node (from within the original). end is the ending sequence coordinate for the new node (from within the original).

    NOTE: To facilitate trimming sequences to a fixed maximum length, it is allowable for negative begin values and positive end values to be longer than a sequence. In such cases, the values are set to the beginning and end of the sequence, respectively. Positive begin and negative end positions must fall within the sequence, otherwise the begin position will be after the end position.

    Deprecated with SEAStAR version v0.4.13: start may be used as a synonym for begin.

    Examples:

      # The new node will include sequence from positions 123 to 456
      CUTND {"begin":123,"end":456}
    
      # The new node will include sequence from position 0 (implied) to 456.
      # end may also be omitted, implying the last position
      CUTND {"end":456}
    
      # The new node will include at most the last 1000 bases of sequence.
      # If the sequence is shorter than 1000 bases, it will be copied without
      # modification.
      CUTND {"begin":-1000}
    

Generate a report of statistics about the current assembly graph

Parameters:

  • file : "filename.txt[.gz]"

    Examples:

      # Write stats to the file my_graph.txt
      GC {"file":"my_graph.txt"}
    
      # Default. Write stats to STDOUT
      GC {"file":"-"}
    
  • ccdetail : true

    Write connected component and cluster details. Note the GCC command is shorthand for this.

    Example:

      GC {"ccdetail":true}   # Equivalent to GCC
    

Generate a report of detailed statistics about the current assembly graph, with details about each connected component and scaffold cluster (if present)

Parameters:

  • file : "filename.txt[.gz]"

    Filename for writing statistics report

    Examples:

      #  Write stats to the file my_graph_ccs.txt
      GCC {"file":"my_graph_ccs.txt"}
    
      # Default. Write stats to STDOUT
      GCC {"file":"-"}
    

Copy (push) the current graph to the top of a stack in memory. This allows you to quickly save the current graph state (without erasing any previous "stashes") so that operations can later be undone with the UNSTASH command. WARNING! Each time the STASH command is run (without a corresponding UNSTASH) a new copy of the current sequence graph data is made and stored in memory. This is convenient, but can add up quickly and rapidly lead to out-of-memory errors and other such problems if not managed carefully.

Parameters: None.

Restore (pop) the current graph from the top of the stack (undo changes since most recent STASH). This command removes (and frees) this state from the stack while restoring it.

Parameters:

  • free : true

    Remove the stashed data from the top of the stack without restoring it (do not replace current state)

    Examples:

      STASH
      # do something destructive here
      UNSTASH
      # undo destructive action...
    
      STASH
      # do something destructive here
      UNSTASH {"free":true}
      # keep effects of destructive action, but free up the STASHed state
    

Use contents of a file as a series of graph_ops commands to run. Commands and their parameters are specified one per line. Lines begging If no file is provided, enter interactive mode, reading commands one at a time from a command prompt. Note that while any extension may be used with the file option below, the extension .go is required if you wish for the script file to be recognized directly by graph_ops at the command line:

graph_ops write_fasta.go   # Execute commands in write_fasta.go

graph_ops SCRIPT '{"file":"anything.txt"}'  # This doesn't require ".go"

graph_ops SCRIPT   # Enter an interactive prompt environment

graph_ops   # Same as above

Parameters:

  • file : "filename.go[.gz]"

    Read commands from the named file. A .go file is a plain text format file with one valid graph_ops command and its parameters per line. Blanks lines and other whitespace are not significant. Comments may be added by prefacing them with #. Note that unlike on the UNIX command line, it is unnecessary to single quote the command parameters in a .go file.

      # This is the write_fasta.go file
      LOAD { "file" : "assembly.json" }
    
      # scripts can use SCRIPT (with a file)
      SCRIPT { "file" : "do_something.go" }
    
      FASTA { "file" : "sequence.fna" }
    
  • self : true

    When used with a SCRIPT command inside a SCRIPT file, causes that script to call itself.

      # When called inside the script file "file.go"
      SCRIPT { "self" : true }
      # Is exactly equivalent to:
      SCRIPT { "file" : "file.go" }
      # The only difference is that if file.go is renamed, the "self" version will
      # continue to work as expected
    
  • tag : <string>

    Substitute the provided string in place of the '@' symbol in any output file names used by commands (e.g. DUMP) within the provided SCRIPT file. NOTE: this tag renaming is not used within interactive SCRIPT command sessions.

      # Add the string "Run1" in place of the '@' character in the filenames of any
      # output files written by commands in the script my_script.go (e.g. "@_output.json"
      # will become "Run1_output.json" in a DUMP command). Works with any command that
      # can write its output to a file.
    
      SCRIPT {"file" : "my_script.go", "tag" : "Run1"}
    

    The tag string will also substitute for runs of consecutive '@' characters, and for multiple distinct positions anywhere in an output file path. Note that as with all graph_ops output filenames, any resulting directory path must already exist. graph_ops will not create new directories in the course of writing output files.

      # String \"Sample_A\" will replace all runs of '@' characters anywhere in the
      # file path of files written by commands in the script my_script.go, for example:
      # ~/data/samples/@@@/@@@_output.json
      # will become:
      # ~/data/samples/Sample_A/Sample_A_output.json
    
      # NOTE! In this case, the ~/data/samples/Sample_A/ directory must already exist.
    
      SCRIPT {"file" : "my_script.go", "tag" : "Sample_A"}
    

    Note that the SCRIPT command and the associated .go scripts are not intended to replace the UNIX shell. Notably, the scripts are entirely declarative, there are no general variables, or ability to branch (if/then/else) or explicitly loop. However, using a combination of graph_ops commands, it is still possible to do some unexpectedly powerful things very efficiently:

      # This script is called: iterate.go
      STASH
      SELCC
      FASTA {"file":"@_ccseq_####.fna"}
      UNSTASH
      SELCC {"shift":true}
      SCRIPT {"file":"iterate.go"}
    

    If the above script is invoked using this command:

      graph_ops assembly.json CCOMPS SCRIPT '{"file":"iterate.go", "tag":"Run1"}'
    

    graph_ops will read in assembly.json, calculate the connected components of the sequence graph once, and then write the sequence from each individual connected component to a separate FASTA file, with output filenames automatically incrementing from Run1_ccseq_0000.fna to Run1_ccseq_0234.fna (assuming there are 235 connected components in the assembly).

List all valid commands, or provide detailed help for a specific command with HELP <COMMAND>. The HELP command when used as the first and only command on the command-line (or within an interactive SCRIPT session) may include the topic command directly as the second command line parameter rather than using the topic : "command" parameter described below. For example:

graph_ops HELP GCC

# Is shorthand for:

graph_ops HELP '{"topic" : "GCC"}'

Parameters:

  • topic : "command"

    Provide detailed help for a specific command. Using the special word ALL (which is not a valid command) in place of the command string will cause detailed help to be printed for all commands.

    Examples:

      # Provide help about the HELP command itself
      HELP {"topic":"HELP"}
    
      # Provide a dump of detailed help about all commands
      HELP {"topic":"ALL"}
    
  • cmd_line : true

    Provide detailed help for UNIX shell command line use. This help can also be obtained from the shell by running with the -h or --help options.

    Example:

      # UNIX command line usage help.
      HELP {"cmd_line":"true"}
    
      # Prints the same information as (at the command line):
      graph_ops -h
      # OR
      graph_ops --help
    

This program reads a nucleotide FASTA file with already properly oriented and ordered contig sequences (in scaffold order, as produced by the graph_ops FASTA command run with the parameter setting "no_merge_scaffs":true), and produces a FASTA format output with the contigs joined into scaffolds. NOTE: this command can be run automatically (with default settings) from within the graph_ops FASTA command, using the parameter setting "scaff":true).

In accomplishing this task, this tool attempts to do three main things:

  1. Look for short sequence overlaps at the ends of neighboring contigs, while possibly disregarding a few potentially erroneous bases at the ends of either contig, in order to close the gap between the contigs. For example:

     AGATACATCTACGCGATCGACGATCGATCCCATCGAcgt
                            aCGATCCCATCGATCGAAGTCGCGCGCATGCAGACTACGC
    
  2. Optionally, use alternatively assembled contigs (e.g. generated using a different de novo assembler, or with different settings) in an attempt to "heal" the remaining gaps between contigs that persist after step 1. For example, the middle sequence below represents part of a contig from an alternate assembly that bridges a gap between neighboring contigs:

     ATCTACGCGATCGACGATCGATCCCATCGA
              ...GACGATCGATCCCATCGAAGACATACCGAAGTCGCGCGCATGC...   <-- Alternative assembly contig
                                           CGAAGTCGCGCGCATGCAGACTACGC
    
  3. Where a gap cannot be bridged, fill it with a sequence of n (unknown bases), with the sequence length possibly adjusted (+1, or +2) to preserve the longest open reading frame that could potentially span the gap (i.e. avoid adding an artifactual frame-shift). Resulting gene models that span gaps should be thoroughly checked for occasional improper gene fusions.

     AGATACATCTACGCGATCGAC
                          nnnnnnnnnnnnnnn  <- Possibly pad with 1 or 2 more n's
                                         TCGAAGTCGCGCGCATGCAGACTACGC
    

Usage: seq_scaffold [options] <infile.fna> > <outfile.fna>

where <infile.fna> is the scaffold ordered contigs and <outfile.fna> is the output scaffolds. All sequence files are in FASTA format, and the input files may be compressed in the gzip format if the file extension ends with .gz. The output is written to STDOUT, and so may be saved to a file via redirection or piped to another command (such as gzip). To further facilitate pipelining, <infile.fna> may instead be supplied from STDIN by substituting "-" for the filename. Some examples:

seq_scaffold contigs.fna > scaffolds.fna   # Simplest case, run with default settings

seq_scaffold contigs.fna.gz alt_contigs.fna.gz | gzip -c > scaffolds.fna.gz   # gzipped inputs and outputs

bunzip2 -c contigs.fna.bz2 | seq_scaffold - > scaffolds.fna   # Piped input from compressed bzip2 file

The FASTA format for the scaffold ordered contigs follows a simple convention:

>contig1 scaffold_name1
ATAGACGA....
>contig2 scaffold_name1
GTCAGCTC....
>contig3 scaffold_name2
CGAGCAAC....
>contig4 scaffold_name2
TTAGCGAC....

The above example shows an input FASTA file with two scaffolds of two contigs each. The output of seq_scaffold for this file will be the two joined scaffolds:

>scaffold_name1
ATAGACGA....GTCAGCTC....
>scaffold_name2
CGAGCAAC....TTAGCGAC....

In <healfile.fna> the sequences do not need to be in any particular order and the FASTA header sequence identifiers do not need to be unique. Note, that it is possible to simultaneously use more than one alternate assembly for "healing" by simply concatenating those assemblies into a single file.

Note that seq_scaffold may also write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when saving the output to a file, or the resulting saved FASTA data may be corrupted by messages written to STDERR.

seq_scaffold optional parameters [options]:

[-h|--help|] [--overlap=<n>] [--trim=<n>] [--max=<n>] [--heal=<healfile.fna>] [--heal_overlap=<n>] [--heal_trim=<n>] [--heal_max=<n>] [--verbose]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit. Also prints the version string for this tool.

--overlap=<n>

<n> is the minimum number of bases of overlap required to join contigs.

--max=<n>

<n> is the maximum number of bases of overlap to look for.

--trim=<n>

<n> is the maximum number of non-ambiguous bases to try trimming from contig ends when looking for overlap.

--heal=<healfile.fna>

Use alternatively assembled contigs in the referenced FASTA file to try to "heal" gaps between neighboring contigs.

--heal_overlap=<n>

<n> is the minimum number of bases of overlap required to join contigs.

--heal_max=<n>

<n> is the maximum number of bases of gap to try to heal.

--heal_trim=<n>

<n> is the maximum number of non-ambiguous bases to try trimming from contig ends during healing.

--gap=<n>

<n> is the bases number of ambiguous 'n' bases to insert between contigs in the scaffold when a gap remains. Note that this base value may be adjusted by +1 or +2 to avoid introducing frame-shifts into long open reading frames that span a particular gap.

--verbose

Output extra diagnostic information to STDERR.


tetracalc is a high-performance tool for taxonomically clustering sequences using tetranucleotide (and trinucleotide) usage statistics. It works by calculating pairwise correlation coefficients of the "Z-statistics" (Teeling et al., 2004) for all input sequences. Sequences are then binned using the pairwise nearest neighbor (Equitz, 1989) greedy clustering approach. As sequences are clustered together, the Z-statistics are intelligently recalculated for the merged bins (Iverson & Riskin, 1993) so that as sequences are clustered they build statistical power. The pairwise clustering of sequence bins proceeds until no remaining pairwise combinations meet a cutoff threshold criteria. While it is possible to use fixed correlation thresholds, tetra_calc contains a large table of optimized sequence-length dependent thresholds, which have been carefully tuned against a large number of finished prokaryotic genomes taken from the RefSeq database, and this is the default (and recommended) way to use the tool in most cases. More information about SEAStAR scaffold clustering may also be found in the methods section of (Iverson, et al., 2012).

In most cases you will run the tetracalc tool indirectly from within graph_ops by using the CLUST command. However, it can be run independently.

Usage: tetracalc [options] <infile1.fna> [<infile2.fna>...] > <outfile.json>

tetracalc input_seqs.fna > output.json  # Cluster sequences in input_seq.fna using default settings

tetracalc produces a simple custom JSON output file format:

{ "clusters" :                                    # An object with a single attribute
       [                                          # Which is a list
             ["NODE_123", "NODE_456", ... ],      # Of lists, of contig node ids
             ["NODE_987", "NODE_654", ... ],      # that clustered together
             ...
       ]
}

Note that tetracalc may also write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when saving the output to a file, or the resulting saved JSON data may be corrupted by messages written to STDERR.

tetracalc optional parameters [options]:

[-h|--help] [--version] [--merge_tar=<n>] [-m <n>|--min_len=<n>] [--num_threads=<n>]

-h / --help

Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.

--version

Print the version number of the executing program and then exit.

--merge_tar=<n>

Target fraction of each genome merging to be merged into a single output bin. The valid of values for <n> is in the range 0.0-1.0 and the default is <n>= 0.95, which is a fairly conservative value, unlikely to lead to much sequence ending up in the wrong main genome bin. The value of --merge_tar represents the fraction of all sequence from a single genome that is likely to end up in the main (ie. largest by sequence length) bin corresponding to that genome in the output. Put more concretely, the default value means that ~95% (on average) of the sequence belonging to each distinct species-level genome in the input will cluster together in a single output bin, on average. Most of the remainder (%5 of the sequence in this case) will cluster into one or more smaller genome-specific bins; which can potentially be reconnected later using other information (e.g. pairing information). Lower values of <n> are more conservative because they are associated with higher correlation cutoff thresholds in the tuning tables. Conversely, values of <n> that are very close to 1.0 increase the likelihood of obtaining nearly complete genomes in each individual cluster bin, but at the expense of an increased chance that some sequences will be misclassified in the main genome bins in the output (as opposed to remaining unclustered, or in much smaller "birds-of-feather" bins of sequences from different genomes that are more similar to one another than they are to any single genome as a whole).

-m <n> / --min_len=<n>

Minimum sequence length for consideration. Default <n>= 5000 nucleotides. Shorter --min_lengths are possible, but very short sequences have poor statistical power for correlation using tetranucleotide statistics (the fall-back use of trinucleotides helps somewhat, but is less specific). There are 136 possible reverse-compliment unique tetranucleotide sequences (not 128, because 16 tetranucleotide sequences are palindromic); so for a pair of sequences to correlate well in their tetranucleotide usage, they must be long enough that the occurrence of relatively rare tetranucleotides can be accurately estimated; and that noise in the statistics due to undersampling is minimized.

Additional experimental / tuning [options]:

[--fixed] [-t <n>|--cor_thresh=<n>] [-s <n>|--sec_thresh=<n>] [-r <n>|--tri_thresh=<n>] [-c <n>|--sec_len=<n>] [-f|--seq_files] [-g|--greedy] [--chunk=<n>] [-z <filename>|--z_stats=<filename>] [--verbose]

--fixed

Disable use of the --merge_tar tuned threshold tables, and instead use default fixed thresholds (perhaps modified by the parameters below).

-t <n> / --cor_thresh=<n>

Minimum tetra-nuc correlation (R value) for a link to be reported. Default <n>= 0.9

-s <n> / --sec_thresh=<n>

Secondary minimum tetra-nuc correlation for a sequence to be linked via a tri-nucleotide correlation. Default <n>= --cor_thresh

-r <n> / --tri_thresh=<n>

Minimum tri-nuc correlation for a sequence to be linked via a secondary correlation. Default <n>= --cor_thresh

-c <n> / --sec_len=<n>

Maximum sequence length for consideration of tri-nuc secondary correlation. Default <n>= --min_len

-f / --seq_files

Treat the sequences in each input file as a single scaffold.

-g / --greedy

Perform a greedy recalculation when merged seqs were the best correlation for other seqs.

--num_threads=<n>

Number of threads to use on multicore systems. Default <n>= NUM_PROCESSORS

--chunk=<n>

Break input sequences into chunks of N bases for analysis. Default: Do not chunk.

-z <filename> / --z_stats=<filename>

Output the Z-statistics for each input sequence/scaffold as a table to <filename>.

The format of the output table is tab separated values, with a header row:

 Sequence AAAA AAAC AAAG AAAT ... TTTA TTTC TTTG TTTT

Where the column of each tetramer is one plus an 8-bit number composed of 2-bit encoded nucleotide values concatenated in sequence order:

 A --> 00b   // 'b' denotes binary
 C --> 01b
 G --> 10b
 T --> 11b

 AAAA --> 00000000b + 1 = 1
 AAAC --> 00000001b + 1 = 2
 ...
 GCTA --> 10011100b + 1 = 157
 ...
 TTTG --> 11111110b + 1 = 255
 TTTT --> 11111111b + 1 = 256

Following the header, there is one row of tab separated values for each input sequence/scaffold, beginning with the sequence identifier and followed by the 256 Z-statistic values corresponding with the tetramers in the header row.

--verbose

Output extra diagnostic information to STDERR.

There are a number of utility GAWK scripts that can be found in the source scripts directory and get copied to the build bin directory. Their names should give a reasonable summary of purpose and further descriptions can be found in header comments of the source.


FASTQ Format Reference

The SEAStAR tools use certain conventions for organizing read data in .fastq files. This section describes these conventions in sufficient detail to understand the operation of the tools described in this document.

.fastq files are named as: <prefix>.[single|single1|single2|read1|read2].fastq

<prefix> is a sample specific string used to identify a set of related read files.

.read1.fastq and .read2.fastq files contain the first and second reads in a pair respectively. Each read pair should be at the same position in both files.

.single.fastq files contain reads without a corresponding mate, either because the mate never existed (fragment run) or the mate was discarded during filtering.

.single1.fastq and .single2.fastq files contain reads without a corresponding mate and for which the original orientation of the read is known. This orientation information is useful for distinguishing sense from anti-sense transcription in strand-specific RNA-Seq data sets.

Each .fastq read ID should be formatted as @<read_prefix>:<read_ID>, where <read_prefix> is an identifier for this set of reads and <read_ID> is a read identifier which in combination with <read_prefix> forms a unique ID for this read. This combination only needs to be unique within reads of a single type, e.g. within all .read1.fastq reads.

.fastq files may be processed though a variety of quality control tools that modify (e.g. trim, error correct) and discard reads. Trimming can result in reads that vary in length, and since variable read length information is not always preserved through some downstream analyses (e.g. some alignment outputs), it is occasionally desirable to (optionally) preserve this information in the read header as well. The read length can be added to the beginning of read header as below, where 49| indicates the actual number of nucleotides:

@49|lambda:1_68_381
CGAAACCGCCGGAACCAGCTTGTGCGGACGCGCCGGCCCCTGGACCGGA
+
9'4,)9,*%4/'#,32)*.1-1(5&3,%,$:*&#'8,+2'40$)05(,&

Quality scores are encoded as ASCII characters starting at a character value offset of 33, following the standard Sanger format for FASTQ files.

SOLiD

SOLiD reads and associated quality statistics from the instrument are packaged in FASTA-like format files called .csfasta and .qual files. Fragment (non-paired) libraries produce a single matched set of files named by convention as: <prefix>_F3.csfasta and <prefix>_F3_QV.qual

Paired libraries (either mate-paired or paired-end) additionally produce a second matched set of files named: <prefix>_R3.csfasta and <prefix>_R3_QV.qual

For some paired-end libraries, the R3 in these (second mate) files may be replaced by F5-BC or F5-P2.

There is no guarantee that every read in an F3 file has a corresponding mate in an R3 file, or vice-versa. These unmated reads are called "singlets".

Most open source tools do not support these off-instrument SOLiD FASTA-like files, and so they must be converted to FASTQ format files for compatibility. This is complicated by the fact that SOLiD reads do not represent nucleotides, but are instead differential encoded colorspace reads, where each numbered color represents a coded di-nucleotide transition (e.g. AA=0, AC=1, etc.). For colorspace reads to be translated into nucleotide space, an initial known base, the primer base, is required. The following example shows the colorspace sequence for one read prefixed with primer base T:

>1_68_381_F3
T01200011211220011021332321220121211221111322011220

One simple approach to using these reads would be to simply convert them to nucleotide space .fastq files and from that point onwards ignore the colorspace source of the reads. Unfortunately, this does not work well because colorspace errors at any position in a read will effectively corrupt all of the resulting converted nucleotide sequence from the erroneous position(s) onwards. In addition, there are distinct potential advantages to processing (aligning and assembling) SOLiD reads in colorspace, delaying conversion to nucleotide space until later in the analysis pipeline when it can be done much more accurately (due to the error correction potential of the di-nucleotide encoding). For these reasons, the SEAStAR tools preserve the colorspace information in the FASTQ representation.

To accomplish this, we follow the convention used by MAQ, BWA and Velvet, mapping colorspace numbers to colorspace letters as [0123.][ACGTN]. Because the initial primer base and its following color value are not from the sampled DNA, and thus will not correspond to the actual sequence ~75% of the time, these two values need to be removed from the colorspace read data that is actually used for alignment or assembly. However, they need to be kept in some form so that the original nucleotide sequence of the read can be reconstructed at a later stage in the pipeline. To handle this, we have developed the following convention:

Original colorspace .csfasta read:

>1_68_381_F3
T012000112112200110213323212201212112211113220..2.0

Convert to colorspace letters:

>1_68_381_F3
TACGAAACCGCCGGAACCAGCTTGTGCGGACGCGCCGGCCCCTGGANNGNA

Convert to FASTQ format (with quality information added), relocating the primer base and following color TA to the header line (discarding the initial quality value).

@TA+lambda:1_68_381/2
CGAAACCGCCGGAACCAGCTTGTGCGGACGCGCCGGCCCCTGGACCGGA
+
9'4,)9,*%4/'#,32)*.1-1(5&3,%,$:*&#'8,+2'40$)05(,&

The read prefix lambda is added as an identifier for this set of reads, by default, from the .csfasta filename prefix. The read name suffix changes from the SOLiD specific F3 to /2‚ designating it as the second mate or the only read in a fragment run. Similarly for mated reads, R3 (or F5-BC/F5-P2) suffixes are changed to /1, designating them as the first mates.

If the resulting .fastq files are then be processed by trimfastq with --add_len, read lengths will be added to the read header after the primer base and first color. Below, the 49| indicates the actual number of color encoded nucleotides:

@TA+49|lambda:1_68_381/2
CGAAACCGCCGGAACCAGCTTGTGCGGACGCGCCGGCCCCTGGACCGGA
+
9'4,)9,*%4/'#,32)*.1-1(5&3,%,$:*&#'8,+2'40$)05(,&

Reads conforming to these FASTQ conventions are used (and assumed) throughout the SEAStAR analysis pipeline, and any additional tools (custom scripts, etc.) that modify FASTQ read data will need to preserve these conventions in their outputs.


JSON Sequence Graph File Format Reference

A JSON format "sequence graph" file is a data representation describing sequences as nodes (or vertices) in a mathematical graph. When pairing information is available, there are also edges. Edges are a summary of many connections between nodes formed by read pairs in which the two mates align with different sequences (connecting those sequence nodes in the graph).

The top level structure of a JSON sequence graph is:

{
    ### Attributes created by ref_select

    "SEASTAR_tool" : "",         #  Path to ref_select executable used
    "SEASTAR_version" : "",      #  Version of ref_select used
    "runtime_parameters" : {
          "parm" : "val", ...    #  Object containing the ref_select parameters used
     },
    "nodes" : {
          <node_id> : { }, ...   #  Object containing sequence node objects
     },
    "edges" : [                  #  List of pairing edges connecting sequence nodes
          { }, ...               #  These edges are from mates spanning two sequences
     ],
    "internal_edges" : [         #  List of pairing edges within nodes. These edges
          { }, ...               #  are from mates aligning to the same sequence
     ],
    "shared_seq_edges" : [       #  List of edges formed by multi-node mapping reads
          { }, ...               #  These edges are from reads mapping to multiple
     ],                          #  sequences, indicating duplication
    "rollup_stats" : {           #  Optional, catalog per parent sequence statistics
          "seq_id" : {},         #  A catalog file may link sequences together, into
             ...                 #  parent seqs, (e.g. scaffolds of a genome) Rollup
     },                          #  calculates summary stats for the parent seq.
    "run_stats" : {
          "stat" : val, ...      #  Run statistics generated by ref_select
     },

     ###  In addition, graph_ops commands add other attributes to the sequence graph
     ###  Some or all of the attributes below may be absent depending on which
     ###  commands have been run

    "processing" : [             #  List of graph_ops commands that have been run
          [ ], ...               #  Each sublist contains details for one command
     ],
    "removed_nodes: : {          #  Same as "nodes" above, but currently unselected
          "node_id" : { }, ...   #  Object containing sequence node objects
     },
    "removed_edges" : [          #  Same as "edges" above, but currently unselected
          { }, ...               #  These edges are from mates spanning two sequences
     ],
    "digraph" : true,            #  Indicates that the graph is now directed
    "connected_comps" : [
          ["node1", ... ], ...   #  List connected components. Sublists of node ids
     ],
    "scaffolds" : {              #  Object containing scaffolds, keyed by scaffold id
          "scaf0" : { }, ...     #  Each scaffold is a sub object
     },
    "clusters" : [               #  Same as the JSON output of tetracalc
          [ ], ...               #  Each cluster is a list of scaffold ids
     ],
    "merged_edges" : <bool>      #  When present (and true) indicates that the edges
                                 #  and mate-pairing statistics are originally from a
                                 #  separate (merged) JSON dataset. See the graph_ops
                                 #  LOAD command "edges" option.
}

The nodes (and removed_nodes) object contains one attribute for each sequence node selected by ref_select. These attributes are keyed by <node_id>, and are themselves objects:

"nodes" : {
     <node_id> : {
          "bits" : <float>,      #  Total bitscore of reads mapping to this sequence
          "rd_cnt" : <float>,    #  Number of (perhaps fractional) reads mapping
          "int_cov" : <float>,   #  Mapped reads per nucleotide (rd_cnt / seq_len)
          "rel_ab" : <float>,    #  Relative abundance of this sequence, fraction of 1
          "cov" : <float>,       #  Coverage, mean reads at each position in sequence
          "rd_len" : <float>,    #  Mean read length of mapping reads
          "seq_len" : <int>,     #  Reference sequence length in nucleotides
          "pct_uncov" : <float>, #  Percent of reference sequence uncovered by reads
          "mp_pairs" : <int>,    #  Number of paired reads mapping within sequence
          "mp_sh" : <int>,       #  Number of shared reads, mapping to other sequences
          "mp_fwd" : <int>,      #  Number of reads with mates off 3' end of sequence
          "mp_bwd" : <int>,      #  Number of reads with mates off 5' end of sequence
          "mp_ins_mean" : <float>,  #  Mean insert length of pairs mapping within
          "mp_ins_stdev" : <float>, #  Standard deviation of above insert lengths
          "name" : "",                  #  Name / unique node_id for this node
          "desc" : "",                  #  Text description of this sequence
          "adj_cov" : <float>,          #  Adjusted coverage, omits near seq ends
          "adj_cov_stddev" : <float>,   #  Standard deviation of adj_cov within seq
          "adj_cov_min" : <float>,      #  Minimum adjusted coverage
          "adj_cov_max" : <float>       #  Maximum adjusted coverage
          "short_seq" : <bool>   #  True if adj_* stats are omitted due to short seq
          "seq" : ""             #  DNA sequence, optional, either reconstructed or provided
          "pct_gc" : <float>,    #  GC composition as %, may be absent or NA
          "circular" : <bool>,   #  True if this node represents circular DNA sequence
          "per_nt" : { ... }     #  Sequence and per nucleotide stats are in this object
          "contig_problems" : [
               { }, ...          #  List of potential problems found with this node
           ]
          "ref_str" : <bool>,       #  Set by graph_ops. True if this sequence is on
                                    #  the "reference strand"
          "merged_stats" : { ... }  # Present when the "merged_edges" attribute is true.
     }, ...
}

The per_nt attribute of a node object is optional. If present it contains the per nucleotide information calculated for the node, including the DNA sequence itself, when present.

"per_nt" : {                    #  Per nucleotide statistics, all of these are optional.
          "mean_gc" : [         #
               <float>, ...     #  Mean %GC content for nucleotides within a window
           ],                   #
          "cov" : [             #
               <float>, ...     #  Read coverage for each nucleotide position
           ],                   #
          "phys_cov" : [        #
               <int>, ...       #  Physical coverage of pairs spanning each position
           ],                   #
          "mp_ins" : [          #
               <float>, ...     #  Mean insert size for pairs spanning each position
           ],                   #
}

The contig_problems attribute of a node object is optional, but if present it flags a number of different types of potential problems and pinpoints their location(s) within the sequence. These may also be viewed in graph_ops using the PROBS command.

"contig_problems" : [            #  A node may have multiple problems, so it's a list
     {                           #  of objects, one per problem region
          "start" : <int>,       #  Staring coordinate of this problem
          "end" : <int>,         #  Ending coordinate of this problem
          "type" :                          #  What kind of problem?
               "Physical coverage break" |  #  Contig should be split here...
               "Insert size anomaly" |      #  Extra or missing sequence detected
               "Collapsed duplication" |    #  There are two/more repeated sequences
               "Uncovered end" |            #  Bad assembly / reconstruction of end
               "Small insert"               #  Sequence doesn't belong here...
     }, ...
]

The following attributes are present when the graph "merged_edges" attribute is true. These attributes preserve the node statistics of the graph that the edges in this graph came from. "merged_stats" : { "bits" : , # Total bitscore of reads mapping to this sequence "rd_cnt" : , # Number of (perhaps fractional) reads mapping "int_cov" : , # Mapped reads per nucleotide (rd_cnt / seq_len) "rel_ab" : , # Relative abundance of this sequence, fraction of 1 "cov" : , # Coverage, mean reads at each position in sequence "pct_uncov : , # Percent of reference sequence uncovered by reads "rd_len" : , # Mean read length of mapping reads "adj_cov" : , # Adjusted coverage, omits near seq ends "adj_cov_stddev" : , # Standard deviation of adj_cov within seq "adj_cov_min" : , # Minimum adjusted coverage "adj_cov_max" : , # Maximum adjusted coverage "cov" : [ # This may be present when "merged_edges" is true , ... # Read coverage for each nucleotide position from ] # an alternative mate-pairing alignment }

The different edges lists (edges, removed_edges, internal_edges, shared_seq_edges) have similar syntax, but with different semantics:

# Note that internal and shared_seq edges will not have dir, org_dir or score attributes

"edges" : [                   #  Each edges list holds multiple edge objects
     {                        #  each of which represents a relationship between:
          "n1" : "",          #  A first node, referenced by name, and a
          "n2" : "",          #  second node. This are the same for internal_edges
          "dir" : "",         #  Direction of this edge: "FB", "FF", "BB" or "forward"
          "num" : <int>,      #  Number of mapped read-pairs represented by this link
          "bits" : <float>,   #  Total bitscore of read-pairs participating in edge
          "p1" : <int>,       #  Mean mapping coordinate in the first node
          "p2" : <int>,       #  Mean mapping coordinate in the second node
          "score" : <float>,  #  Heuristic score for edge, calculated by graph_ops
          "org_dir" : ""      #  Preserves original "dir" when edge becomes "forward"
     }, ...                   #  "dir" codes, where --> is:  5' --> 3'
]                             #  "FB" :      --> edge -->
                              #  "FF" :      --> edge <--
                              #  "BB" :      <-- edge -->
                              #  "forward" : directed (all "forward" edges are "FB"
                              #              and on the "reference strand")

Rollup statistics are generated by ref_select --rollup parameter. They require use of a catalog file with ref_select that connects sub-sequences to their parents. A common use case is aligning reads to genome sequences, where some of the genomes may be unfinished and remain in numerous scaffolds. The rollup statistics summarize the stats by parent (genome) sequence, "rolled-up" from the underlying node sub-sequences.

"rollup_stats" : {
     "seq_id" : {
        "node_list" : [
             "node_id", ...     #  List of node ids of the sub-seqs rolled into this
         ]
        "bits" : <float>,       #  All of these statistics are the same as their
        "rd_count" : <float>,   #  named counterparts in the "node" object described
        "int_cov" : <float>,    #  above, except they represent a proper amalgamation
        "rel_ab" : <float>,     #  of the underlying sequence statistics, as though
        "cov" : <float>,        #  all of the "parent" sequences were simply a
        "rd_len" : <float>,     #  concatenation of their child subsequences. So for
        "seq_len" : <flaot>,    #  example, "rel_ab" is relative abundance of this
        "pct_gc" : <float>,     #  parent sequence compared to all other parent
        "pct_uncov" : <float>,  #  sequences.
        "desc" : ""
     }, ....
}

Run statistics are ref_select overview statistics for the run that created this file

"run_stats" : {
     "num_selected_nodes" : <int>,               #  Number of nodes in output
     "original_nodes" : <int>,                   #  Number of original ref seqs
     "total_node_abundance" : <float>,           #  Should be very close to 1.0!
     "mean_coverage" : <float>,                  #  Optional: mean of all nodes
     "coverage_duplication_threshold" : <float>, #  Thresh used for single genomes
     "reads_assigned_to_selected" : <int>,       #  Reads mapping to output nodes
     "original_reads_aligned" : <int>,           #  Reads mapping in alignment
     "second_chance_assigned_reads" : <int>      #  Reads assigned to suboptimal
}                                                #  (but selected) node

The processing list is added by graph_ops to create a command history of changes made to this sequence graph.

"processing" : [
     [
          "script",      #  Filename of the SCRIPT, '$' for cmd line or '>>' for STDIN
          "version",     #  SEAStAR version number of graph_ops used for this command
          "command",     #  Name of the graph_ops command executed
          { "parm" : val, ... }   #  JSON parameters passed to the command
     ], ...
]

The scaffolds list is added by the graph_ops SCAFF command. It details the proper ordering of nodes in each scaffold, and links them back to connected components

"scaffolds" : {
     "scaffold_id" : {           #  One object per scaffold
          "ccnum" : <int>,       #  Reference to associated connected component
          "nodes" : [
               "node_id", ...    #  This is an *ordered* list, 5' -> 3' of sequence
          ]                      #  nodes, all on the same strand.
     }, ...
}

The clusters object is identical to the contents of the object output from the tetracalc tool:

"clusters" : [                         # A list...
     ["NODE_123", "NODE_456", ... ],   # of lists, of contig node ids
     ["NODE_987", "NODE_654", ... ],   # that were clustered together
     ...
]