Tutorial 2: PhyloProcessR pipeline workflows
After PhyloProcessR and its dependencies are installed, you can either use one of the provided R scripts with the functions pre-filled, called "workflows" (with SnakeMake support planned for managing job submissions), or create your own pipeline by selecting different combinations of functions (see Tutorial 3 for advanced usage).
Functions are included to automatically download raw read files stored on Dropbox and process them in one step.
- The first item needed is a generated Dropbox token file, which should be kept on the file system to which you wish to download the raw reads. The R code for this (run outside any PhyloProcessR workflow) is:
library(rdrop2)
# Opens a browser window to authenticate with Dropbox
token <- drop_auth()
# Save the token to a file for later use by the workflow
saveRDS(token, file = "token.rds")
- Next, in the configuration file for workflow-1 ("workflow-1_configuration-file.R"), edit the following section to set up the Dropbox download:
# For downloading reads from dropbox
#########################
#TRUE to download files from personal dropbox folder using token and read path
dropbox.download = TRUE
#the dropbox directory your files are all contained within if dropbox.download = TRUE
dropbox.directory = "/Dropbox/Path/to/Reads"
#token file previously generated by the rdrop2 package
dropbox.token = "/Local/Path/to/token.rds"
- Last, in the "file_rename.csv" file (explained in more detail later in this tutorial), the "File" column should correspond to the filenames in your dropbox.directory. The reads can be in sub-directories, as that Dropbox folder is searched recursively for the files listed in the "File" column.
#Raw read locations
#########################
#The file rename (File, Sample columns) for organizing reads or dropbox download. NA if reads already organized
sample.file = "file_rename.csv"
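As a purely illustrative sketch (the filenames and sample names below are invented, and the exact row format of file_rename.csv is covered later in this tutorial), the file pairs a raw read file name with the sample name you want it organized under:

```
File,Sample
RUN1_L001_ACGTAC_R1.fastq.gz,Spinomantis_elegans_CRH111
RUN1_L001_ACGTAC_R2.fastq.gz,Spinomantis_elegans_CRH111
```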
- For the quick-start R script, first follow Tutorial 1 to set up your working directory and configuration files. Before you begin, you should have the following files:
/Project_Name
├── workflows/
├── your_target_markers.fa
├── file_rename.csv
└── decontamination_database.csv
- Before running the workflows, verify that your files downloaded properly and completely from the sequencing center. Do this check promptly, so that the center still has your data in case something went wrong during the file transfer. PhyloProcessR provides two functions for this check: checkMD5() and checkFileSize().
First, checkMD5() compares the MD5 checksum (a short character string computed from the file's contents, effectively unique to that file) provided by the sequencing center (most provide one) with the checksum calculated from the downloaded file. The function uses the "md5" program included with most Unix-based operating systems; if it is missing, it can be installed separately on any OS. To run the function:
checkMD5(read.directory = NULL,
         md5.file.name = NULL,
         md5.type = c("single-file", "many-files"),
         output.name = "md5-check",
         overwrite = FALSE)
Where the function parameters are as follows:
read.directory = the main directory that contains all your reads
md5.file.name = MD5.txt, or whatever the file name for the MD5 hash sequence is
md5.type = "single-file" is where the MD5.txt is a single file for all the samples
"many-files" is where the MD5.txt is provided separately for each sample
output.name = name of the output .csv file with the data
overwrite = whether to overwrite the output file or not
The output file will have the sample read name in the first column and "PASS" or "FAIL" in the second column, where "FAIL" means the checksum does not match: your file is either corrupted or did not download completely.
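This check can be reproduced by hand with standard Unix tools. The sketch below (illustrative only, not PhyloProcessR code) builds a toy read file, writes an MD5 list the way a sequencing center would, and then verifies it, which is essentially what checkMD5() automates:

```shell
#!/bin/sh
# Illustrative only: mimic a sequencing-center MD5 check on a toy file.
set -e
dir=$(mktemp -d)
# A tiny fake FASTQ file standing in for a real downloaded read file
printf '@read1\nACGT\n+\nIIII\n' > "$dir/sample_R1.fastq"
# The "center-provided" checksum list: one "<hash>  <filename>" line per file
( cd "$dir" && md5sum sample_R1.fastq > MD5.txt )
# Recompute and compare, as checkMD5() does; prints "sample_R1.fastq: OK"
( cd "$dir" && md5sum -c MD5.txt )
rm -r "$dir"
```

On macOS the analogous built-in tool is `md5` (or `shasum`); `md5sum` itself can be installed there via GNU coreutils.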
Second, as an added check, the checkFileSize() function verifies the size of each read file. Sequencing centers often provide the expected file sizes as a single checkSize.txt file.
checkFileSize(read.directory = NULL,
              check.file.name = NULL,
              output.name = "check-fileSize",
              overwrite = FALSE)
Where the function parameters are as follows:
read.directory = the main directory that contains all your reads
check.file.name = checkSize.txt, or whatever the file name for the file sizes is
output.name = name of the output .csv file with the read file sizes
overwrite = whether to overwrite the output file or not
The output file will have the sample read name in the first column and "PASS" or "FAIL" in the second column, where "FAIL" means the file size does not match: your file is either corrupted or did not download completely.
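The same idea can be sketched with standard shell tools (illustrative only, not PhyloProcessR code), assuming a checkSize.txt with "<bytes> <filename>" lines; your center's format may differ:

```shell
#!/bin/sh
# Illustrative only: mimic a file-size check on a toy file.
set -e
dir=$(mktemp -d)
printf 'ACGTACGT' > "$dir/sample_R1.fastq"            # toy read file, 8 bytes
printf '8 sample_R1.fastq\n' > "$dir/checkSize.txt"   # "center-provided" size list
# Compare each expected size against the actual size on disk
while read -r expected file; do
  actual=$(wc -c < "$dir/$file")
  if [ "$actual" -eq "$expected" ]; then
    echo "$file: PASS"
  else
    echo "$file: FAIL"
  fi
done < "$dir/checkSize.txt"
rm -r "$dir"
```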
Once both of these look good, Workflow 1 is ready to begin!
- Once everything is ready, you can run the pipeline with the command-line "Rscript" program after activating the PhyloProcessR anaconda environment. This is especially necessary on cluster environments that lack a GUI, where both commands can be placed in a job script.
conda activate PhyloProcessR
Rscript workflow-1_preprocess.R
Alternatively, you can open "workflow-1_preprocess.R" (and the subsequent workflow scripts) and run the R functions one by one in your favorite R IDE (e.g. stock R, RStudio).
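On a SLURM cluster, for example, the two commands could be wrapped in a job script like the sketch below. The #SBATCH directives and resource values are placeholders to adapt to your scheduler and data; some clusters also require initializing conda before `conda activate` works in batch jobs.

```shell
#!/bin/bash
#SBATCH --job-name=phyloprocessr-wf1   # placeholder job name
#SBATCH --cpus-per-task=16             # match the thread setting in your configuration file
#SBATCH --mem=64G                      # placeholder resource requests
#SBATCH --time=48:00:00

# Some clusters need conda initialized in batch jobs first, e.g.:
# source ~/miniconda3/etc/profile.d/conda.sh
conda activate PhyloProcessR
Rscript workflow-1_preprocess.R
```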
- The results you should expect are as follows:
If "organize.reads" is set to TRUE in the configuration file, and the file_rename.csv file is provided, PhyloProcessR will create the "processed-reads" directory that contain the reads from different processing steps, where which sets of reads that are saved can be selected in the configuration file. If this is not provided, then your samples will be named after your file names, which may not look pretty. This, however, can be changed later with manual rename functions from PhyloProcessR. The following will be generated in the first step of workflow-1:
/Project_Name
├── /workflows
├── /processed-reads
│ └── /organized-reads
│     ├── Spinomantis_elegans_CRH111_R1.fastq.gz
│     ├── Spinomantis_elegans_CRH111_R2.fastq.gz
│     ├── Boophis_burgeri_CRH0481_R1.fastq.gz
│     ├── Boophis_burgeri_CRH0481_R2.fastq.gz
│     ├── Aglyptodactylus_securifer_CRH1644_R1.fastq.gz
│     ├── Aglyptodactylus_securifer_CRH1644_R2.fastq.gz
│     ├── Mantidactylus_femoralis_CRH2340_R1.fastq.gz
│     └── Mantidactylus_femoralis_CRH2340_R2.fastq.gz
├── target_markers.fa
├── file_rename.csv
└── decontamination_database.csv
Next, if the decontamination function is used, PhyloProcessR will download the relevant contaminant reference genomes to the "contaminant-references" folder in the main directory.
/Project_Name
├── /contaminant-references
├── /workflows
├── /processed-reads
│ └── /organized-reads
├── target_markers.fa
├── file_rename.csv
└── decontamination_database.csv
After the reads are organized and the decontamination genomes are downloaded, PhyloProcessR will create additional folders in the "processed-reads" directory containing the reads from each processing step; which sets of reads are saved can be selected in the configuration file.
/Project_Name
├── /contaminant-references
├── /workflows
├── /processed-reads
│ ├── /organized-reads
│ ├── /adaptor-removed-reads
│ ├── /decontaminated-reads
│ ├── /normalized-reads
│ └── /pe-merged-reads
├── target_markers.fa
├── file_rename.csv
└── decontamination_database.csv
Double-check that there are reads in pe-merged-reads, which means preprocessing finished completely and your reads are ready for assembly. At minimum, keep the pe-merged-reads and decontaminated-reads directories: pe-merged-reads is used for assembly, and decontaminated-reads is used for the optional variant-calling step after assembly.
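From the project directory, this check is just `ls processed-reads/pe-merged-reads`. The self-contained sketch below (toy layout and file name invented for illustration) shows the idea:

```shell
#!/bin/sh
# Illustrative only: build a toy processed-reads layout, then count the
# merged read files; a count of zero would mean preprocessing failed.
set -e
dir=$(mktemp -d)
mkdir -p "$dir/processed-reads/pe-merged-reads"
touch "$dir/processed-reads/pe-merged-reads/sample1.fastq.gz"  # hypothetical file
count=$(ls "$dir/processed-reads/pe-merged-reads" | wc -l)
echo "merged read files: $count"
rm -r "$dir"
```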
After assembly, PhyloProcessR will create a "data-analysis" directory that contains the draft assemblies. The "data-analysis" folder holds the main products of the pipeline: the raw assembled contigs ("draft-assemblies"), consensus assemblies ("consensus-assemblies"), IUPAC SNP-called assemblies ("iupac-assemblies"), and haplocontig assemblies ("haplo-assemblies").
/Project_Name
├── /contaminant-references
├── /processed-reads
│ ├── /organized-reads
│ ├── /adaptor-removed-reads
│ ├── /decontaminated-reads
│ ├── /pe-merged-reads
│ └── /spades-assembly
├── /data-analysis
│ └── /draft-assemblies
├── target_markers.fa
├── file_rename.csv
├── decontamination_files.txt
└── configuration-file.txt
The final steps of the pipeline will match your sequence capture targets to your chosen directory of assembled contigs. With the matched targets, PhyloProcessR will align and trim alignments of the targets and finally run gene trees. The directory structure should have the following:
/Project_Name
├── /contaminant-references
├── /processed-reads
├── /data-analysis
│ ├── /matched-targets
│ ├── /alignments
│ ├── /trimmed-alignments
│ ├── /gene-trees
│ └── /concatenation-trees
├── target_markers.fa
├── file_rename.csv
├── decontamination_files.txt
└── configuration-file.txt
For the optional variant-calling step, variants are called directly on the draft assemblies in the "data-analysis" directory.
(... rest coming soon)