This study focuses on the pan-genome analysis of Plasmodium falciparum and Plasmodium vivax, using complete proteomes of strains from different parts of the world, in order to identify common core virulence genes/proteins, accessory genes and unique strain-specific genes. These genes are then used for downstream analysis.
This reproducible Python code is PART of the main study on a standard pangenome analysis. If you wish to see the complete study, please refer to " ". The initial concept of the code is taken from "https://github.com/microDM/Utility-codes/blob/master/parseOrthoMCLOutput.py" and modified.
Before running the code, the following steps must be performed to obtain the input files for this analysis.
This is done using a custom-made database of the species of interest with the BLAST+ standalone application (version 2.10.0). The final output is passed to step 2 for OrthoMCL analysis. Keep the "goodProteins.fasta" file generated in these steps, as it is needed as input in step 3a.
This is done with standalone OrthoMCL v2.0.9 and MCL. The two resulting output files, "groups.txt" and "Singletons.txt", are used in the next steps.
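As an illustration of how these two files can be read, here is a minimal sketch. It assumes the usual OrthoMCL layout: each line of groups.txt is `groupID: strain|proteinID strain|proteinID ...`, and each line of the singletons file is a single `strain|proteinID`. The function names, strain names and protein IDs below are hypothetical and are not taken from Pangenome-Analysis.py.

```python
from collections import defaultdict


def parse_groups(lines):
    """Return {group_id: {strain: [protein_id, ...]}} from groups.txt lines."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "groupID: strain|protein strain|protein ..."
        group_id, members = line.split(":", 1)
        by_strain = defaultdict(list)
        for member in members.split():
            strain, protein_id = member.split("|", 1)
            by_strain[strain].append(protein_id)
        groups[group_id.strip()] = dict(by_strain)
    return groups


def parse_singletons(lines):
    """Return [(strain, protein_id), ...] from singletons file lines."""
    return [tuple(line.strip().split("|", 1)) for line in lines if line.strip()]


# Hypothetical example lines; real input would come from open("groups.txt").
group_lines = [
    "grp1000: strainA|p1 strainB|p2 strainA|p3",
    "grp1001: strainB|p4 strainC|p5",
]
singleton_lines = ["strainC|p6"]

groups = parse_groups(group_lines)
singletons = parse_singletons(singleton_lines)
print(groups["grp1000"])
print(singletons)
```

Grouping the members by strain at parse time keeps the later steps (core/accessory/unique classification, single-copy checks, matrices) down to simple lookups.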
This script mainly performs the following functions:
3a) Extraction of core, accessory and unique genome/gene families.
3b) Extraction of Sequences for the Core Genome. Prerequisite file: the whole genome of the Plasmodium reference strain 3D7, downloaded from PlasmoDB. The protein IDs from the FASTA headers were scanned against the gene IDs present in the FASTA headers of the 3D7 annotated file. If a protein ID was the same as a gene ID, the corresponding gene sequence was extracted and saved in another file.
3c) Extraction of Single-Copy Orthologues (SCOs) aka 1:1 or true orthologues: This step selects the orthologue groups from the core groups that have precisely one gene per organism.
3d) Removal of Duplicate Proteins: The script was used to calculate and generate a file representing the number of duplicated proteins in one genome, two genomes and three genomes along with the strain names in which these were duplicated. Furthermore, duplicates from the core, accessory and unique genomes were also filtered out into another output file mentioning the names of multiple strains in which the protein had appeared. This resulted in the exclusive list of proteins encoded by the core, accessory, and unique genomes, respectively, to be considered for functional analysis.
3e) Generate count and binary matrices.
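Steps 3a, 3c and 3e can be sketched roughly as follows. This is only an illustration under stated assumptions, not the logic of Pangenome-Analysis.py itself: a group is treated as core when every strain is present, accessory when two or more (but not all) strains are present, and unique when only one strain is present. All function names, strain names and protein IDs are hypothetical.

```python
def classify_groups(groups, strains):
    """Split group IDs into core, accessory and unique lists (assumed rule)."""
    core, accessory, unique = [], [], []
    for group_id, members in groups.items():
        n_present = sum(1 for s in strains if members.get(s))
        if n_present == len(strains):
            core.append(group_id)        # present in every strain
        elif n_present > 1:
            accessory.append(group_id)   # present in some, but not all
        else:
            unique.append(group_id)      # assumed: exactly one strain present
    return core, accessory, unique


def single_copy_orthologues(groups, core, strains):
    """Core groups with precisely one protein per strain."""
    return [g for g in core if all(len(groups[g][s]) == 1 for s in strains)]


def count_matrix(groups, strains):
    """Rows are groups, columns follow the order of `strains`."""
    return {g: [len(m.get(s, [])) for s in strains] for g, m in groups.items()}


def binary_matrix(counts):
    """Presence/absence version of the count matrix."""
    return {g: [1 if c > 0 else 0 for c in row] for g, row in counts.items()}


# Hypothetical toy data in the form {group: {strain: [protein IDs]}}.
strains = ["strainA", "strainB", "strainC"]
groups = {
    "grp1": {"strainA": ["a1"], "strainB": ["b1"], "strainC": ["c1"]},
    "grp2": {"strainA": ["a2", "a3"], "strainB": ["b2"], "strainC": ["c2"]},
    "grp3": {"strainA": ["a4"], "strainB": ["b3"]},
    "grp4": {"strainC": ["c3", "c4"]},
}

core, accessory, unique = classify_groups(groups, strains)
scos = single_copy_orthologues(groups, core, strains)
counts = count_matrix(groups, strains)
binary = binary_matrix(counts)
print(core, accessory, unique, scos)
```

In this toy data, grp2 is core but not single-copy because strainA contributes two proteins, which is the kind of duplication that step 3d reports. Proteins listed in the singletons file would be added to the unique set separately.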
The Pangenome-Analysis script is run in an IDE with the input files as described below:
- python3 Pangenome-Analysis.py -g groups.txt -f goodProteins.fasta -n names.txt -s singletons.txt
where inputs and arguments specify:
• groups.txt, generated from the MCL output, is passed with the -g flag.
• goodProteins.fasta, containing the combined protein sequences of all strains (generated in the BLAST database step), is passed with the -f flag.
• names.txt, a manually created list of the names of the genomes/organisms used, is passed with the -n flag.
• singletons.txt, generated by OrthoMCL Singletons, is passed with the -s flag.
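For readers who want to see how the four flags map onto a command-line parser, the following is a minimal sketch using Python's standard argparse module. Only the short flags (-g, -f, -n, -s) come from the command above; the long option names, help strings and the `build_parser` function are assumptions made for illustration, and the actual script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for the four input files (long names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Pan-genome analysis from OrthoMCL output (sketch)."
    )
    parser.add_argument("-g", "--groups", required=True,
                        help="groups.txt from the MCL/OrthoMCL step")
    parser.add_argument("-f", "--fasta", required=True,
                        help="goodProteins.fasta with all protein sequences")
    parser.add_argument("-n", "--names", required=True,
                        help="names.txt listing the genomes/organisms used")
    parser.add_argument("-s", "--singletons", required=True,
                        help="singletons.txt from OrthoMCL Singletons")
    return parser


# Parse an explicit argument list here so the sketch runs without a shell;
# a real script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(
    ["-g", "groups.txt", "-f", "goodProteins.fasta",
     "-n", "names.txt", "-s", "singletons.txt"]
)
print(args.groups, args.fasta, args.names, args.singletons)
```

Marking all four arguments as required makes the parser stop with a usage message if any input file is left out.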
The output files can be opened in Notepad++ for clear view.
If you use any part of this code, please cite the code as " " and the published paper of the project as " ".
This script is written by Ms. Farhana Riaz, a postgraduate in Bioinformatics as part of her MS thesis.