PalMuc/congeneric_synteny

This is the data repository for the following publication:

Genomic changes are varied across congeneric species pairs

Francis, Warren R. (1); Vargas, Sergio (1); Wörheide, Gert (1,2,3,*)

1 Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany.
2 GeoBio-Center, Ludwig-Maximilians-Universität München, Munich, Germany
3 Staatliche Naturwissenschaftliche Sammlungen Bayerns (SNSB)–Bayerische Staatssammlung für Paläontologie und Geologie, Munich, Germany

*corresponding author

ABSTRACT

Synteny, the shared arrangement of genes on chromosomes between related species, is a marker of shared ancestry, and synteny-breaking events can result in genomic incompatibilities between populations and ultimately lead to speciation events. Despite its pivotal role as a driver of speciation, the role of synteny breaks on speciation is poorly studied due to a lack of chromosome-level genome assemblies for a taxonomically broad sample of organisms. Here, using 22 congeneric animal genome pairs, we find a link between protein identity, microsynteny, and macrosynteny, but no evidence for a universal path of genomic change during speciation. We observed varied trajectories of synteny conservation relative to protein identity in non-model organisms, with many species pairs showing no karyotypic changes and others displaying large genomic rearrangements. This contrasts with previous studies on model organisms and indicates that the genomic changes preceding or resulting from speciation are likely very contextual between clades.

Analytical approach

For each pair of genomes (congeneric species), microsynteny and macrosynteny are both analysed.

The pipeline driver run_synteny_analysis.py is written in Python and is run as:

run_synteny_analysis.py -i species_pair_list.tab

For each species pair (the tuna pair is used as the example here), the analysis starts from the scaffolds, proteins, and GFF annotation downloaded from NCBI:

GCF_910596095.1_fThuMac1.1_genomic.fna.gz
GCF_910596095.1_fThuMac1.1_genomic.gff.gz
GCF_910596095.1_fThuMac1.1_protein.faa.gz
GCF_914725855.1_fThuAlb1.1_genomic.fna.gz
GCF_914725855.1_fThuAlb1.1_genomic.gff.gz
GCF_914725855.1_fThuAlb1.1_protein.faa.gz
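
The input names above follow a regular pattern: accession, assembly name, content type, and extension. As a minimal sketch of how such names could be split into their parts, the helper below uses a regular expression inferred only from the six file names listed here. The names `NcbiFile`, `NCBI_NAME`, and `parse_ncbi_name` are hypothetical and are not part of the repository's scripts.

```python
import re
from typing import NamedTuple


class NcbiFile(NamedTuple):
    accession: str  # e.g. "GCF_910596095.1"
    assembly: str   # e.g. "fThuMac1.1"
    kind: str       # "genomic" or "protein"
    ext: str        # "fna", "gff", or "faa"


# Pattern inferred from the example file names above (an assumption,
# not taken from the pipeline source).
NCBI_NAME = re.compile(
    r"^(?P<accession>GC[AF]_\d+\.\d+)_"
    r"(?P<assembly>.+)_"
    r"(?P<kind>genomic|protein)\."
    r"(?P<ext>fna|gff|faa)\.gz$"
)


def parse_ncbi_name(filename: str) -> NcbiFile:
    """Split an NCBI download name into accession, assembly, kind, and extension."""
    match = NCBI_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised NCBI file name: {filename}")
    return NcbiFile(**match.groupdict())
```

Calling `parse_ncbi_name("GCF_910596095.1_fThuMac1.1_genomic.fna.gz")` separates the accession `GCF_910596095.1` from the assembly name `fThuMac1.1`, which can be handy for checking that all three files of a species are present before a run.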

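The pipeline then generates the following files. The filtered and renamed files are produced per species; the remaining files are produced per species pair: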

  • get_genbank_longest_isoforms.py: filtered proteins with redundant isoforms removed (.x.faa), like GCF_910596095.1_fThuMac1.1_protein.x.faa and GCF_914725855.1_fThuAlb1.1_protein.x.faa
  • get_genbank_longest_isoforms.py: filtered GFFs corresponding to those proteins (.x.gff), like GCF_910596095.1_fThuMac1.1_genomic.x.gff and GCF_914725855.1_fThuAlb1.1_genomic.x.gff
  • DIAMOND: results fThuAlb1_vs_fThuMac1.blastp.tab and fThuAlb1_vs_fThuMac1.renamed.blastp.tab
  • scaffold_synteny.py: results fThuAlb1_vs_fThuMac1.scaffold_synteny.tab and fThuAlb1_vs_fThuMac1.scaffold_synteny.pdf
  • microsynteny.py: results fThuAlb1_vs_fThuMac1.microsynteny.tab and fThuAlb1_vs_fThuMac1.microsynteny.pdf
  • fastarenamer.py: renamed versions of the proteins for clustering (.x.n.faa), like GCF_910596095.1_fThuMac1.1_protein.x.n.faa and GCF_914725855.1_fThuAlb1.1_protein.x.n.faa
  • makehomologs.py: clustering outputs fasta_clusters.H.thunnus_clusters_v1.tab and clusters_thunnus_clusters_v1.tar.gz, and the log thunnus_clusters_v1.2023-08-02-010624.mh.log
  • alignment_conserved_site_to_dots.py: accumulated tabular output fThuAlb1_vs_fThuMac1.homologs_identity.tab
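
The file names listed above can be derived from the accession and assembly name of each species, which makes it easy to check whether a run completed. The sketch below is not part of the pipeline: `short_label`, `expected_outputs`, and `missing_outputs` are hypothetical helpers. It assumes that the pair prefix (for example `fThuAlb1_vs_fThuMac1`) is the assembly name with its trailing version number removed, as the tuna example suggests. The makehomologs.py outputs are left out because their names include a run label and a timestamp.

```python
from pathlib import Path


def short_label(assembly: str) -> str:
    """Drop the trailing version, e.g. 'fThuMac1.1' -> 'fThuMac1' (assumed convention)."""
    return assembly.rsplit(".", 1)[0]


def expected_outputs(acc_a: str, asm_a: str, acc_b: str, asm_b: str) -> list[str]:
    """List the per-species and per-pair output names for species A versus species B."""
    names = []
    for acc, asm in ((acc_a, asm_a), (acc_b, asm_b)):
        names.append(f"{acc}_{asm}_protein.x.faa")    # isoform-filtered proteins
        names.append(f"{acc}_{asm}_genomic.x.gff")    # matching filtered GFF
        names.append(f"{acc}_{asm}_protein.x.n.faa")  # renamed proteins for clustering
    prefix = f"{short_label(asm_a)}_vs_{short_label(asm_b)}"
    for suffix in (
        "blastp.tab",
        "renamed.blastp.tab",
        "scaffold_synteny.tab",
        "scaffold_synteny.pdf",
        "microsynteny.tab",
        "microsynteny.pdf",
        "homologs_identity.tab",
    ):
        names.append(f"{prefix}.{suffix}")
    return names


def missing_outputs(directory: Path, names: list[str]) -> list[str]:
    """Return the expected names that are not present in the given directory."""
    return [name for name in names if not (directory / name).exists()]
```

For the tuna pair, `expected_outputs("GCF_914725855.1", "fThuAlb1.1", "GCF_910596095.1", "fThuMac1.1")` yields the thirteen names shown in the list, and `missing_outputs(Path("."), ...)` reports any that are absent from the working directory.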

Subsequent analysis and plotting are done with several R scripts.

Full citation

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.