Tutorial 3: Assess results

This tutorial is a guide to calculating various summary statistics for 1) the cleaned reads; 2) finished assemblies; 3) sample specificity and 4) sample sensitivity.

Sample read statistics

This function will calculate some basic statistics of your cleaned reads (e.g., total nucleotides, total GigaBases, total read pairs). This function can be run on any directory of reads and the same statistics will be calculated. For the cleaned reads, run the pipeline workflow-1 to completion to obtain directories named for each sample that contain the cleaned reads. You will want to use the "cleaned-reads" directory as these will be free of adaptors, low quality bases/reads, duplicates, and other organism contamination, but not merged or reduced in any other way. If you used another pipeline, what you will need is a main directory that contains sub-directories named for your samples, with the reads in each sub-directory.

Everything is now ready to run the function, which you can do as follows:

readStats(read.directory = NULL,
          output.directory = "read-stats",
          overwrite = FALSE)

Where the parameters for the function are:

read.directory: main directory of reads; organized by sub-directories named for each sample containing reads (read1 and read2) within the main directory. 
output.directory: output directory name for raw statistics for each sample and the summary table.
overwrite = TRUE to overwrite output directories

Sample assembly statistics

This function will calculate some basic statistics of your assembled contigs (e.g., total nucleotides, total MegaBases, number of contigs, mean contig length, median contig length, minimum contig length and maximum contig length). This function can be run on any directory of assemblies and the same statistics will be calculated. To obtain assemblies from PhyloCap, run the pipeline workflow-1 to completion to obtain the directory "data-analysis/draft-assemblies" which contains the draft Spades assembly from each sample. If you used another pipeline, what you will need is a main directory that contains fasta files named for each of your samples (e.g., Genus-species.fa).

Once you have assemblies, the function can be used as follows:

assemblyStats(assembly.directory = "directory/of/your/assembled/contigs",
              output.directory = "assembly-stats",
              overwrite = FALSE)

Where the parameters for the function are:

assembly.directory: your main directory of assemblies which contain the contigs for each sample. 
output.directory: output directory name for raw statistics for each sample and the summary table.
overwrite = TRUE to overwrite output directories

Sample paralog statistics

This function will calculate some basic statistics of the paralogs within your assembled contigs (e.g., total nucleotides, total MegaBases, number of contigs, unique paralogs, mean number of copies, median number of copies, min and max copies, mean contig length, median contig length, minimum contig length and maximum contig length). This function can be run on any directory of assemblies along with the target marker file and the same statistics will be calculated. To obtain assemblies from PhyloCap, run the pipeline workflow-1 to completion to obtain the directory "data-analysis/draft-assemblies" which contains the draft Spades assembly from each sample. If you used another pipeline, what you will need is a main directory that contains fasta files named for each of your samples (e.g., Genus-species.fa).

Once you have assemblies, the function can be used as follows:

paralogStats(assembly.directory = "directory/of/your/assembled/contigs",
             target.file = "path/to/target/marker/file,
             output.directory = "paralog-stats",
             threads = 1,
             memory = 1,
             overwrite = FALSE,
             quiet = FALSE,
             overwrite = FALSE,
             blast.path = NULL)

Where the parameters for the function are:

assembly.directory: your main directory of assemblies which contain the contigs for each sample. 
target.file: the marker/locus file that was used for the sequencing capture probe design (i.e., your targets)
output.directory: output directory name for raw statistics for each sample and the summary table.
threads = number of threads
memory = amount of memory in GigaBytes
overwrite = TRUE to overwrite output directories
quiet = TRUE to display verbose output
blast.path = path to blast, or anaconda environment directory (i.e., usr/local.bin/)

Sample Specificity

This function will calculate the specificity for each sample. The specificity is defined as the percent of cleaned reads that map to the target markers from probe capture.

What is needed to get started:

Run the pipeline workflow-1 to completion to obtain the assembled contigs and directories containing cleaned reads. You will want to use the "cleaned-reads" directory as these will be free of adaptors, low quality bases/reads, duplicates, and other organism contamination, but not merged or reduced in any other way. If you used another pipeline, what you will need is a main directory that contains sub-directories named for your samples, with the reads in each sub-directory.

EXAMPLE

The second component needed is the "draft-assemblies" sub-directory found in the "data-analysis" main directory. These are assembled contigs from sPades from the PhyloCap pipeline or they can be assembled contigs from any other source. The assembled contig files are named after each sample that should correspond exactly in name to the sub-directories of cleaned reads from above (i.e. no "Sample-Name_contigs.fa", should be "Sample-Name.fa". The PhyloCap pipeline names them in this fashion, ready to use for this function.

Everything is now ready to run the function, which you can do as follows:

summary.sampleSpecificity(read.directory = NULL,
                          target.file = NULL,
                          output.name = "sample-specificity",
                          threads = 1,
                          memory = 1,
                          overwrite = FALSE,
                          quiet = FALSE,
                          bwa.path = NULL,
                          picard.path = NULL,
                          samtools.path = NULL)

Where the parameters for the function are:

read.directory: directory of your sequence capture alignments 
target.file: the marker/locus file that was used for the sequencing capture probe design (i.e., your targets)_
output.name: output name for the directory of sample raw data and summarized for each sample
threads = number of threads
memory = amount of memory in GigaBytes
overwrite = TRUE to overwrite output directories
quiet = TRUE to display verbose output
bwa.path = path to boa (i.e., usr/local.bin/)
picard.path = path to picard (i.e., usr/local.bin/)
samtools.path = path to samtools (i.e. usr/local/bin/)

Sample Sensitivity

This function will calculate the sensitivity for each sample. The sensitivity is defined as the percent in basepairs of post assembly contigs that overlap with the target markers. The inverse of this value can also be used to quantify missing data per sample.

What is needed to get started:

Run the pipeline workflow-1 to completion to obtain the assembled contigs. Using the assembled contigs, run workflow-2 to obtain alignments for each marker. From the PhyloCap pipeline you will need the directory "untrimmed_all-markers".
The second necessary component is the target marker fasta file used in probe design. This should contain only the region of the marker used for probes to accurately quantify sequence capture success.

Once alignments are ready, you can calculate sensitivity by:

summary.sampleSensitivity(alignment.directory = NULL,
                          target.file = NULL,
                          output.directory = "sample-specificity",
                          threads = 1,
                          memory = 1,
                          overwrite = FALSE,
                          quiet = FALSE,
                          mafft.path = NULL)

Where the parameters for the function are:

alignment.directory: directory of your sequence capture alignments 
target.file: the marker/locus file that was used for the sequencing capture probe design (i.e., your targets)
output.directory: output name for the directory of sample raw data and summarized for each sample
threads = number of threads
memory = amount of memory in GigaBytes
overwrite = TRUE to overwrite output directories
quiet = TRUE to display verbose output
mafft.path = path to mafft (i.e., usr/local.bin/)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tutorial 3: Assess results

Sample read statistics

Sample assembly statistics

Sample paralog statistics

Sample Specificity

Sample Sensitivity

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally