
Tutorial 1: PhyloProcessR configuration

chutter edited this page Apr 15, 2023 · 2 revisions

This tutorial will guide you through processing your raw data into assembled contigs for each of your samples. PhyloProcessR can be run as a continuous pipeline from the command line, which executes each R function, automatically allocates resources, and generates job scripts. Alternatively, the functions can be run on their own in a normal R environment if you would like to construct your own pipeline or only need a few functions.

Configuration Files

The PhyloProcessR pipeline is separated into "workflows", which are distinct pipelines that accomplish bioinformatic tasks across a whole set of samples. The workflows include: 1) a raw read preprocessing step that readies the reads for de novo assembly; 2) an assembly step, which assembles the reads using the program SPAdes and optionally tries to recover and assemble missing target markers; 3) a variant calling step that refines assemblies and provides SNP data for other types of analyses; and 4) an alignment workflow that aligns and trims target markers, making them ready for phylogenetic analyses. There are also phylogenetic workflows, such as generating gene trees from target markers in different configurations (genes, exons, introns, etc.) for use in species tree analyses.

PhyloProcessR needs only a simple configuration file to function properly. A separate configuration file is needed for each workflow, and its parameters can be modified and customized so that the entire pipeline runs in a single command. An example file ("configuration-file.R") is included in the setup-configuration_files folder in the main branch with the default settings already applied, and it should work well as-is in most cases. Otherwise, the settings can be modified to your project's specifications.

The most important parameters to set in the configuration file are the file paths and directories in the first block.

#Directories and input files
#########################
# *** Full paths should be used whenever possible
#The main working directory
work.dir = "/project/directory"
#The file rename (File, Sample columns) for organizing reads. Set to NULL if not needed
file.rename = "/project/directory/file_rename.csv"
#The file for the contaminant genomes (Genome, Accession columns); 
#use NULL if download.contaminant.genomes = F
contaminant.genome.list = "/project/directory/decontamination_database.csv"
#The sequence capture target marker file for extraction from contigs
target.file = "/project/directory/Ranoidea_All-Markers_Apr21-2019.fa"
#The input raw read directory
read.dir = "/project/directory/raw-reads/"
#The name for the dataset
dataset.name = "dataset-name"
#The name for the processed reads folder
processed.reads = "processed-reads"
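
Because most problems at this stage come from mistyped paths, it can help to check them before launching a workflow. The following is a minimal sketch in base R; `check_config_paths` is a hypothetical helper written for this illustration and is not part of PhyloProcessR. The example values are based on the block above, with `contaminant.genome.list` set to NULL to show that optional entries are skipped.

```r
# Hypothetical helper (not part of PhyloProcessR): report whether each
# configured path exists on disk.
check_config_paths = function(paths) {
  # Parameters set to NULL are optional, so skip them
  paths = paths[!vapply(paths, is.null, logical(1))]
  found = vapply(paths, function(p) file.exists(p), logical(1))
  data.frame(parameter = names(paths),
             path = unlist(paths),
             found = found,
             row.names = NULL,
             stringsAsFactors = FALSE)
}

# Example values based on the configuration block above
config = list(
  work.dir = "/project/directory",
  file.rename = "/project/directory/file_rename.csv",
  contaminant.genome.list = NULL,
  target.file = "/project/directory/Ranoidea_All-Markers_Apr21-2019.fa",
  read.dir = "/project/directory/raw-reads/"
)

print(check_config_paths(config))
```

To check a real configuration file, one option is to load it into its own environment with `cfg = new.env(); source("configuration-file.R", local = cfg)` and then pass `mget(c("work.dir", "target.file", "read.dir"), envir = cfg)` to the helper. On some platforms `file.exists` may report a directory path with a trailing slash as missing, so treat a FALSE for `read.dir` as a prompt to look, not as proof of an error.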

Create renaming file

As is often the case with multiplexed samples in sequence capture projects, the names of the read files are usually not the desired final sample names. PhyloProcessR offers a function to rename and organize all your samples given a spreadsheet of the file names and desired sample names. To create the renaming file, a .csv file is needed with only two columns: "File" and "Sample". An example is included in the setup-configuration_files folder in the main branch ("file_rename.csv").

The "File" column: the unique string that is part of the file name for both reads of a pair, excluding read and lane information. Example:

CRH111_AX1212_L001_R1.fastq.gz

CRH111_AX1212_L001_R2.fastq.gz

These are the two read files for a given sample. Your "File" column value would then be:

CRH111 or CRH111_AX1212

The "Sample" column: what you would like your sample name to be. This name is used all the way through to the alignments and trees. Ensure that all of your samples have unique names and that no name is contained within another (e.g. Genus_species_0 is contained within Genus_species_01). Also exclude special characters and replace spaces with underscores; hyphens are also OK. In this example, the "Sample" column would be:

Spinomantis_elegans_CRH111

All put together:

File Sample
CRH111 Spinomantis_elegans_CRH111
CRH1644 Aglyptodactylus_securifer_CRH1644
CRH0481 Boophis_burgeri_CRH0481
CRH2340 Mantidactylus_femoralis_CRH2340
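
The naming rules above can be checked before the spreadsheet is written. The sketch below builds the example table in base R, validates the "Sample" column, and writes the .csv file. `validate_sample_names` is a hypothetical helper for this illustration, not a PhyloProcessR function, and the allowed character set (letters, digits, underscores, hyphens) is an assumed reading of "exclude special characters".

```r
# Hypothetical helper (not part of PhyloProcessR): return a character vector
# describing any sample names that break the naming rules.
validate_sample_names = function(samples) {
  # Repeated names are allowed: they mark read sets to be combined
  samples = unique(samples)
  problems = character(0)

  # Assumed rule: only letters, digits, underscores, and hyphens
  bad = samples[!grepl("^[A-Za-z0-9_-]+$", samples)]
  if (length(bad) > 0) {
    problems = c(problems, paste("Invalid characters:", bad))
  }

  # Flag any name that is a substring of a different name
  for (s in samples) {
    hits = setdiff(samples[grepl(s, samples, fixed = TRUE)], s)
    if (length(hits) > 0) {
      problems = c(problems, paste0(s, " is contained in ", hits))
    }
  }
  problems
}

# The example table from this tutorial
rename_table = data.frame(
  File = c("CRH111", "CRH1644", "CRH0481", "CRH2340"),
  Sample = c("Spinomantis_elegans_CRH111",
             "Aglyptodactylus_securifer_CRH1644",
             "Boophis_burgeri_CRH0481",
             "Mantidactylus_femoralis_CRH2340"),
  stringsAsFactors = FALSE
)

# An empty result means no problems were found
print(validate_sample_names(rename_table$Sample))

# The problematic pair mentioned above is reported
print(validate_sample_names(c("Genus_species_0", "Genus_species_01")))

# Written to a temporary directory here; swap in your project directory
out_file = file.path(tempdir(), "file_rename.csv")
write.csv(rename_table, out_file, row.names = FALSE, quote = FALSE)
```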

To combine multiple sets of reads for a single sample, ensure that their file names are different (they can be named anything) and list them under the same sample name in the spreadsheet. For example:

File Sample
CRH111_L001 Spinomantis_elegans_CRH111
CRH111_AX1212 Spinomantis_elegans_CRH111
CRH111_LIB03 Spinomantis_elegans_CRH111

Each lane or set of read pairs is processed separately and is only combined with the others at de novo assembly.
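
To see which read files each "File" value would pick up, the matching can be previewed in base R. This is only an illustration that assumes a plain substring match; PhyloProcessR's own matching rules may differ. `match_reads` and the read file names are hypothetical.

```r
# Hypothetical helper (not part of PhyloProcessR): list the read files whose
# names contain each "File" value, assuming a plain substring match.
match_reads = function(rename_table, read_files) {
  rows = lapply(seq_len(nrow(rename_table)), function(i) {
    hits = read_files[grepl(rename_table$File[i], read_files, fixed = TRUE)]
    if (length(hits) == 0) return(NULL)
    data.frame(Sample = rename_table$Sample[i],
               File = rename_table$File[i],
               ReadFile = hits,
               stringsAsFactors = FALSE)
  })
  # Drop "File" values that matched nothing, then stack the rest
  rows = Filter(Negate(is.null), rows)
  do.call(rbind, rows)
}

# The multi-library example table from this tutorial
rename_table = data.frame(
  File = c("CRH111_L001", "CRH111_AX1212", "CRH111_LIB03"),
  Sample = "Spinomantis_elegans_CRH111",
  stringsAsFactors = FALSE
)

# Hypothetical read file names for illustration
read_files = c("CRH111_L001_R1.fastq.gz", "CRH111_L001_R2.fastq.gz",
               "CRH111_AX1212_L001_R1.fastq.gz", "CRH111_AX1212_L001_R2.fastq.gz",
               "CRH111_LIB03_L001_R1.fastq.gz", "CRH111_LIB03_L001_R2.fastq.gz")

matched = match_reads(rename_table, read_files)
print(matched)

# Each "File" value should match exactly one R1/R2 pair
print(table(matched$File))
```

A "File" value that matches more than two files, or none, is a sign that the string is not unique or is misspelled.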

Create decontamination file

If used, the decontamination function will remove any read pairs that match a database of possible contaminants, including bacteria, parasites, human, and other organisms commonly found at sequencing facilities. A spreadsheet of reference genomes ("decontamination-genomes.csv") is provided in the "setup-configuration_files" folder and can be edited by adding the organism and its GenBank accession number. PhyloProcessR will automatically download these genomes from GenBank and create a reference directory for them.

Directory structure

The starting directory structure should be:

     /Project_Name
      ├── /workflows
      ├── target_markers.fa
      ├── file_rename.csv
      ├── decontamination_files.txt
      └── configuration-file.R

A leading / denotes a directory.

The raw-reads directory can be located anywhere. If the file_rename.csv file is used, then the following will be generated in the first step of the pipeline in workflow-1 (see Tutorial 2):

 /Project_Name
  ├── /workflows
  ├── /processed-reads
  │    └── /organized-reads
  │          ├── Spinomantis_elegans_CRH111_R1.fastq.gz
  │          ├── Spinomantis_elegans_CRH111_R2.fastq.gz
  │          ├── Boophis_burgeri_CRH0481_R1.fastq.gz
  │          ├── Boophis_burgeri_CRH0481_R2.fastq.gz
  │          ├── Aglyptodactylus_securifer_CRH1644_R1.fastq.gz
  │          ├── Aglyptodactylus_securifer_CRH1644_R2.fastq.gz
  │          ├── Mantidactylus_femoralis_CRH2340_R1.fastq.gz
  │          └── Mantidactylus_femoralis_CRH2340_R2.fastq.gz
  ├── target_markers.fa
  ├── file_rename.csv
  ├── decontamination_files.txt
  └── configuration-file.R