This project covers the genome assembly and annotation of a single accession of Arabidopsis thaliana. It aims to compare the assembly quality of different assembly tools, identify genomic features, and annotate repetitive elements and gene content. The sequencing data are from Q. Lian et al. (DOI: 10.1038/s41588-024-01715-9).
This repository is part of the final project for the course 473637-HS2024-0 Genome and Transcriptome Assembly at the University of Bern.
The analysis pipeline consists of the following steps:
- Preparation and Quality Control of Input Data
- Genome Assembly
- Assembly Evaluation
- Repetitive Element (Transposable Elements) Annotation and Classification
- Phylogenetic Analysis of TEs
- Gene Annotation
- Comparative Genomics
All .sh scripts were executed on the IBU cluster using the SLURM workload manager (sbatch).
R scripts were executed locally.
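A minimal sketch of how the shell steps could be chained on SLURM, assuming each step only needs the previous step to have finished successfully. `submit_chain` and the `DRY_RUN` variable are hypothetical helpers and not part of this repository; `--parsable` and `--dependency=afterok:` are standard `sbatch` options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit scripts in order, each one waiting for the
# previous job to finish successfully. With DRY_RUN=1 (the default here)
# the sbatch commands are only printed, nothing is submitted.
submit_chain() {
    local prev="" script dep
    for script in "$@"; do
        if [ -n "$prev" ]; then
            dep="--dependency=afterok:${prev}"
        else
            dep=""
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sbatch --parsable ${dep:+$dep }$script"
            prev="JOBID_${script##*/}"   # placeholder job ID for the preview
        else
            # --parsable makes sbatch print the job ID so it can be reused
            prev=$(sbatch --parsable ${dep:+"$dep"} "$script")
        fi
    done
}

# Preview three steps without submitting anything.
DRY_RUN=1 submit_chain ./00_download_reads.sh ./01_run_dna_FASTQC_module.sh ./02_run_fastp.sh
```

Setting `DRY_RUN=0` would submit the jobs for real; whether a strict linear dependency is appropriate for every step is an assumption to check against the individual scripts.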
- Download Reads: Retrieve the sequencing data.
./00_download_reads.sh
- Quality Control:
- DNA FastQC analysis:
./01_run_dna_FASTQC_module.sh
- RNA FastQC analysis:
./01_run_rna_FASTQC_module.sh
- General FASTP quality control:
./02_run_fastp.sh
- Count k-mers for preliminary analysis:
./03_count_kmers_jellyfish.sh
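Once a k-mer histogram is available, a rough genome size estimate can be derived from it. The sketch below assumes a two-column histogram (multiplicity, number of distinct k-mers), which is the layout `jellyfish histo` writes. The function name `estimate_genome_size`, the file name `reads.histo`, and the default error cutoff of 5 are illustrative assumptions; dedicated tools such as GenomeScope fit a proper model instead of this simple ratio.

```shell
#!/usr/bin/env bash
# Rough genome size estimate from a k-mer histogram:
#   size ~ (total k-mers above the error cutoff) / (multiplicity of the main peak)
# $1: histogram file, $2: lowest multiplicity to trust (assumed default: 5)
estimate_genome_size() {
    local histo="$1" min_cov="${2:-5}"
    awk -v min="$min_cov" '
        $1 >= min {
            total += $1 * $2                     # k-mer occurrences in this bin
            if ($2 > peak_count) {               # track the most populated bin
                peak_count = $2
                peak = $1
            }
        }
        END {
            if (peak > 0) printf "%d\n", total / peak
        }' "$histo"
}

# Example call, only run if the (hypothetical) histogram file exists.
if [ -f reads.histo ]; then
    estimate_genome_size reads.histo
fi
```

The cutoff removes low-multiplicity k-mers, which mostly come from sequencing errors; it should be chosen by looking at the histogram rather than taken as a fixed value.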
- Assemble using multiple tools for comparison:
- Flye:
./04_assemble_flye.sh
- HiFiASM:
./05_assembly_hifiasm.sh
- LJA:
./06_assembly_LJA.sh
- Trinity (for RNA assembly):
./07_assembly_trinity.sh
- Assess single-copy ortholog completeness of the assemblies using BUSCO:
./08_evaluate_asm_BUSCO.sh
- Calculate key assembly metrics using QUAST (with and without a reference genome):
./09a_evaluate_asm_nref_QUAST.sh
./09b_evaluate_asm_ref_QUAST.sh
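N50 is one of the metrics QUAST reports. As a sanity check it can be recomputed directly from an assembly FASTA: sort the contig lengths in descending order and report the length at which the running sum reaches half of the total. The function name `assembly_n50` and the file name `assembly.fasta` are hypothetical. QUAST applies a minimum contig length by default, so its value may differ slightly from this unfiltered one.

```shell
#!/usr/bin/env bash
# N50 of a FASTA file: the contig length at which the cumulative sum of
# lengths (largest first) reaches at least half of the total assembly size.
assembly_n50() {
    local fasta="$1"
    # Step 1: print one length per sequence (handles wrapped sequence lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$fasta" |
    # Step 2: largest contigs first.
    sort -rn |
    # Step 3: walk down the list until half of the total is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= total / 2) { print lens[i]; exit }
             }
         }'
}

# Example call, only run if the (hypothetical) assembly file exists.
if [ -f assembly.fasta ]; then
    assembly_n50 assembly.fasta
fi
```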
- Evaluate the consensus quality value (QV) and validate k-mer spectrum completeness using Merqury:
./10_prepare_meryl_db.sh
./11a_merqury_flye.sh # Run for each assembly
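Merqury derives the QV from the k-mers that occur in the assembly but not in the reads. The sketch below follows the formula given in the Merqury publication, with error rate E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 * log10(E). The function name `merqury_qv` and the example counts are made up for illustration; the actual values come from Merqury's own output.

```shell
#!/usr/bin/env bash
# QV from k-mer counts.
# $1: k-mers found only in the assembly (not in the read set)
# $2: total k-mers in the assembly
# $3: k-mer size
merqury_qv() {
    awk -v asm_only="$1" -v total="$2" -v k="$3" 'BEGIN {
        if (asm_only == 0) { print "inf"; exit }   # no erroneous k-mers found
        err = 1 - (1 - asm_only / total) ^ (1 / k) # per-base error estimate
        printf "%.2f\n", -10 * log(err) / log(10)  # Phred-scaled
    }'
}

# Illustrative numbers only, not results from this project.
merqury_qv 1500 120000000 21
```

A QV of 30 corresponds to an estimated one error per 1,000 bases and a QV of 40 to one per 10,000.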
- Create dotplots to compare assemblies to the reference and to each other:
./12_run_mummer.sh
- Extract coords files for analysis of misalignments:
./12_b_mummer_for_coords.sh
- Use EDTA to annotate transposable elements:
./13_run_EDTA.sh
- Classify long terminal repeats (LTRs) using TEsorter:
./14_a_run_tesorter_LTR_classification.sh
- Visualize the LTR clades and families:
./14_b_visualize_clades_and_fams.R
- Index assembled genome to obtain scaffold lengths, and process annotations in R:
./15_a_generate_faidx_samtools.sh
./15_b_visualize_annotations.R
- Classify TEs specific to A. thaliana and Brassicaceae and analyze with SeqKit:
16_classify_TE_TEsorter.sh
- Parse RepeatMasker output (from EDTA) to estimate divergence from consensus sequence:
17_estimate_insertion_age.sh
18_plot_divergence.R
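To illustrate what parsing the RepeatMasker output involves, the sketch below summarizes the mean divergence per repeat class. It assumes the standard RepeatMasker `.out` layout: three header lines, percent divergence in column 2, and repeat class/family in column 11. The function name and the file name `genome.mod.out` are hypothetical, and this is not the logic of the repository's own scripts.

```shell
#!/usr/bin/env bash
# Mean percent divergence per repeat class/family from a RepeatMasker .out file.
# Output columns (tab-separated): class/family, number of copies, mean divergence.
mean_divergence_by_class() {
    local rm_out="$1"
    awk 'NR > 3 && NF >= 11 {          # skip the three header lines
             sum[$11] += $2            # column 2: percent divergence
             n[$11]++                  # column 11: repeat class/family
         }
         END {
             for (c in sum) printf "%s\t%d\t%.2f\n", c, n[c], sum[c] / n[c]
         }' "$rm_out" | sort
}

# Example call, only run if the (hypothetical) file exists.
if [ -f genome.mod.out ]; then
    mean_divergence_by_class genome.mod.out
fi
```

Converting divergence into an insertion age additionally requires a substitution rate, which has to be chosen for the species and is not assumed here.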
- Phylogenetic Analysis using SeqKit, Clustal Omega and FastTree:
19_a_phylogenetic_analysis.sh
- Optionally, use the following scripts to generate datasets that add features to the trees on iTOL:
19_b_add_colors_to_abundance.sh
19_c_extract_TE_abundance.sh
- Annotate genes using MAKER pipeline:
20_a_create_maker_CTLfile.sh
20_b_run_maker.sh
20_c_prepare_maker_output.sh
- Validate completeness using BUSCO:
21_get_longest_prot_transcripts.sh
22_run_BUSCO_transcriptome_proteins.sh
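BUSCO on an annotation is usually run on one sequence per gene, otherwise alternative isoforms inflate the duplicated fraction. A minimal sketch of selecting the longest protein per gene is shown below. It assumes isoform IDs of the form `<gene>-RA`, `<gene>-RB`, and so on, as produced when MAKER IDs are renamed; that naming, the function name `longest_isoforms`, and the file name `proteins.fasta` are assumptions, not details taken from the repository's script.

```shell
#!/usr/bin/env bash
# Keep only the longest protein isoform per gene.
# Assumption: the first word of each header is "<gene>-R<letters>".
longest_isoforms() {
    local fasta="$1"
    awk '
        /^>/ { id = substr($1, 2); ids[++n] = id; next }   # record ID order
        { seq[id] = seq[id] $0 }                           # join wrapped lines
        END {
            for (i = 1; i <= n; i++) {
                id = ids[i]
                gene = id
                sub(/-R[A-Z]+$/, "", gene)                 # strip isoform suffix
                if (!(gene in best)) genes[++m] = gene     # first time we see it
                if (!(gene in best) || length(seq[id]) > bestlen[gene]) {
                    best[gene] = id
                    bestlen[gene] = length(seq[id])
                }
            }
            for (j = 1; j <= m; j++)
                printf ">%s\n%s\n", best[genes[j]], seq[best[genes[j]]]
        }' "$fasta"
}

# Example call, only run if the (hypothetical) file exists.
if [ -f proteins.fasta ]; then
    longest_isoforms proteins.fasta > proteins.longest.fasta
fi
```

Ties are resolved in favor of the isoform that appears first in the file, and sequences are written on a single line.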
- Run BLAST to confirm protein homology:
23_run_BLAST.sh
- Orthology-based gene annotation quality control using OMArk:
24_a_create_OMArk_env.sh
24_b_run_OMArk_QC.SH
- Run GENESPACE for comparative genomics:
./25_create_Genespace_folders.R
./26_Genespace.R
./27_run_Genespace.sh
- Parse OrthoFinder results:
./28-parse_Orthofinder.R
- Contextualize OMA ortholog results:
Contextualize_OMA.ipynb
- Summarize the overall analysis:
./Final_summary.sh