Project plan

erik-burger edited this page May 26, 2020 · 14 revisions

erik_burger_genome_analysis

This project is based on the paper "The draft genome of tropical fruit durian (Durio zibethinus)" (aka group 5); scaffold 6 has been chosen for the analysis.

The goal of this project is to assemble and then annotate the genome of the durian fruit (Durio zibethinus). Transcriptomes from different parts of the fruit will also be assembled to study the differences in expression and to find the genes whose expression differs the most between the parts.

This diagram gives an overview of the methods that will be used and how everything is connected in this project:

The raw PacBio reads will be assembled using the program Canu, which also performs quality control, correction and trimming of the data. It outputs a FASTA file with a genome assembly based only on the long reads, along with other files reporting the results of the quality control and other statistics.

The next assembly step needs the short reads aligned to this long-read assembly. The Illumina DNA reads will first be run through FastQC to check their quality. They will then be aligned to the long-read assembly using the program BWA, which results in a BAM file with the aligned sequences. This BAM file, together with the long-read assembly, will be the input to the program Pilon, which will create an improved assembly as a FASTA file.

To go on from here we need the assembled transcriptome. We first check the quality of the untrimmed RNA reads with FastQC and then trim them using Trimmomatic. The quality of all the reads will then be checked again with FastQC. The trimmed RNA reads will then be aligned using the program STAR, which takes them together with the assembled genome as input and outputs the aligned RNA in a BAM file. One of these BAM files will then be run through the program Trinity, which will assemble the RNA into a FASTA file.

This FASTA file, together with the DNA assembly and some protein sequences from closely related species, will be the input to the program MAKER2, which will annotate the genome and produce a GFF file with the annotation. The FASTA file with the assembled RNA will then also be used by the program HTSeq to calculate the expression of the different RNAs in the GFF file. The result of this analysis will then be run through the program DESeq2 to find statistically significant differences in expression. This will be done for multiple different transcriptomes, and the results will be compared between them.
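The data flow described above can be sketched as a list of steps with their input and output files, from which a valid run order follows. This is only a minimal illustration: the file labels (for example `draft_assembly.fasta`) and the step names are hypothetical placeholders, and the inputs listed for HTSeq are one reading of the plan, not a confirmed configuration.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    inputs: tuple
    outputs: tuple


# File labels are hypothetical; only the programs and file formats
# (FASTA, BAM, GFF) come from the plan text.
STEPS = [
    Step("canu", ("pacbio_reads",), ("draft_assembly.fasta",)),
    Step("fastqc_dna", ("illumina_dna_reads",), ("dna_qc_report",)),
    Step("bwa", ("draft_assembly.fasta", "illumina_dna_reads"), ("dna_alignment.bam",)),
    Step("pilon", ("draft_assembly.fasta", "dna_alignment.bam"), ("polished_assembly.fasta",)),
    Step("fastqc_rna_raw", ("rna_reads_raw",), ("rna_raw_qc_report",)),
    Step("trimmomatic", ("rna_reads_raw",), ("rna_reads_trimmed",)),
    Step("fastqc_rna_trimmed", ("rna_reads_trimmed",), ("rna_trimmed_qc_report",)),
    Step("star", ("rna_reads_trimmed", "polished_assembly.fasta"), ("rna_alignment.bam",)),
    Step("trinity", ("rna_alignment.bam",), ("transcripts.fasta",)),
    Step("maker2",
         ("polished_assembly.fasta", "transcripts.fasta", "related_species_proteins.fasta"),
         ("annotation.gff",)),
    # Assumption: HTSeq is given both the assembled RNA and the GFF annotation.
    Step("htseq", ("transcripts.fasta", "annotation.gff"), ("counts",)),
    Step("deseq2", ("counts",), ("differential_expression",)),
]


def run_order(steps):
    """Return step names in an order where every input exists before it is used."""
    produced = {out for s in steps for out in s.outputs}
    # Inputs that no step produces are treated as raw data that already exists.
    available = {inp for s in steps for inp in s.inputs} - produced
    pending = list(steps)
    order = []
    while pending:
        ready = [s for s in pending if set(s.inputs) <= available]
        if not ready:
            raise ValueError("circular or unsatisfiable dependencies")
        for s in ready:
            order.append(s.name)
            available.update(s.outputs)
        pending = [s for s in pending if s not in ready]
    return order


if __name__ == "__main__":
    for number, name in enumerate(run_order(STEPS), start=1):
        print(number, name)
```

Declaring the steps by their files rather than by hand-written ordering means that a missing input shows up as an error instead of a silently wrong schedule.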

One bottleneck that can slow down the project is running Canu late. Because this program takes about 17 hours to run, it is smart to start it early so that other, smaller analyses can be run in the meantime.
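To make "other smaller analyses" concrete, the sketch below lists which steps do not depend on the Canu output, directly or indirectly, and so could run while Canu is busy. The dependency table is an interpretation of the plan, the step names are illustrative labels, and the HTSeq dependencies in particular are an assumption.

```python
# Direct dependencies of each step, as read from the plan
# (labels are illustrative; the htseq entry is an assumption).
DEPENDS_ON = {
    "canu": set(),
    "fastqc_dna": set(),
    "bwa": {"canu", "fastqc_dna"},
    "pilon": {"canu", "bwa"},
    "fastqc_rna_raw": set(),
    "trimmomatic": {"fastqc_rna_raw"},
    "fastqc_rna_trimmed": {"trimmomatic"},
    "star": {"pilon", "fastqc_rna_trimmed"},
    "trinity": {"star"},
    "maker2": {"pilon", "trinity"},
    "htseq": {"trinity", "maker2"},
    "deseq2": {"htseq"},
}


def needs(step, target, deps):
    """Return True if `step` requires `target` directly or indirectly."""
    stack = list(deps[step])
    seen = set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return False


def independent_of(target, deps):
    """Steps that can run without waiting for `target` to finish."""
    return sorted(s for s in deps if s != target and not needs(s, target, deps))


if __name__ == "__main__":
    print(independent_of("canu", DEPENDS_ON))
    # ['fastqc_dna', 'fastqc_rna_raw', 'fastqc_rna_trimmed', 'trimmomatic']
```

Under these assumptions, only the read quality checks and the trimming can be done while waiting; every later step needs the long-read assembly.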

Time schedule 2020:

End date Activity
17/4 Genome assembly: Canu, BWA and Pilon
28/4 Transcriptome assembly: Trinity
5/5 Annotation: MAKER2
8/5 Expression analysis: TopHat and HTSeq

Project Organization

In the root of the GitHub repository there are three folders related to this analysis: data, code and analysis. In the data folder, the raw data and metadata will be stored in separate folders. In the code folder, the code will be stored in folders named after the analysis it belongs to. In the analysis folder, the output files from each program or analysis are stored in folders named after that program or analysis.
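As a rough sketch, the layout described above could be created with a few lines of Python. The three top-level folder names come from the plan; the subfolder names (`raw_data`, `metadata`, `canu`, `pilon`) are hypothetical placeholders, not the names used in the repository.

```python
from pathlib import Path

# Top-level folders come from the plan; subfolder names are placeholders.
LAYOUT = {
    "data": ["raw_data", "metadata"],
    "code": ["canu", "pilon"],
    "analysis": ["canu", "pilon"],
}


def create_layout(root, layout=LAYOUT):
    """Create the folder tree under `root` and return the created paths."""
    root = Path(root)
    created = []
    for top, subfolders in layout.items():
        for sub in subfolders:
            path = root / top / sub
            # exist_ok=True makes the function safe to run more than once.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    for path in create_layout("."):
        print(path)
```

Using the same subfolder name under code and analysis keeps each script next to an output folder with a matching name.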