- Prepare data
- Installation
- Run OrthoFinder for orthology
- Run WGDI for synteny
- Run SOI for orthologous synteny and species tree
Download and pre-process the example data:
# git-lfs has been installed
git lfs clone https://github.com/zhangrengang/evolution_example
cd evolution_example
chmod +x src/*
find . -name "*gz" | xargs gunzip
cat OrthoFinder/*fasta > pep.faa
cat CDS/* > cds.fa
cat wgdi/*gff > all_species_gene.gff
If it is hard to download from GitHub, altenatively you can try to download from BaiduYun. Now, the directory structure like this:
$ tree
├── all_species_gene.gff # all GFF
├── cds.fa # all CDS sequences
├── chr.list $ all chromosome information
├── pep.faa # all protein sequences
├── species.design # speceis list
├── OrthoFinder # input for OrthoFinder
│ ├── Angelica_sinensis.fasta
│ ├── Apium_graveolens.fasta
│ ├── ......
└── wgdi # input for WGDI
├── Angelica_sinensis-Angelica_sinensis.blast
├── Angelica_sinensis-Angelica_sinensis.conf
├── Angelica_sinensis-Angelica_sinensis.ctl # for `soi dotplot`
├── Angelica_sinensis.gff
├── Angelica_sinensis.lens
├── ......
......
Note: the GENE ID is needed to label with SPECIES ID (e.g. Angelica_sinensis|AS01G00001
) for
uniqueness and compatibility (legacy from OrthoMCL).
The CHROMosome ID should also be unique (e.g. As1
, As2
) to avoid conflicts (legacy from MCscanX).
The IDs can be easily labeled for your own data; for example:
SP=Angelica_sinensis
# using sed to label fasta files:
sed 's/>/>'$SP'|/' $SP.pep > $SP.fasta # always using the separator '|'
SP2=As
# using awk and perl to label MCscanX/WGDI gff files or JCVI bed files:
cat $SP.gff0 | awk -v sp=$SP -v OFS="\t" '{$2=sp"|"$2;print $0}' | perl -pe 's/^\D+/'$SP2'/' > $SP.gff
cat $SP.bed0 | awk -v sp=$SP -v OFS="\t" '{$4=sp"|"$4;print $0}' | perl -pe 's/^\D+/'$SP2'/' > $SP.bed
SP
and SP2
can be the same. It will be convenient to use a for/while
loop to process all the species.
If you have installed OrthoIndex and activated the conda environment, all the commands used in this pipeline should have been installed and can be executed in the shell.
To infer 'orthology' using OrthoFinder2:
orthofinder -f OrthoFinder/ -M msa -t 60
orthofinder
can by replaced by Proteinortho6,
Broccoli and
SonicParanoid2.
This repo has provided all input files (*.lens
, *.gff
, *.blast
, *.conf
, *.ctl
) for the downstream pipeline.
But for your own data, these input files can be generated by the following commands:
indir=.
outdir=wgdi
sdesign=species.design
# to generate *.lens and *.gff files for `wgdi`, and *.ctl files for `soi dotplot`
soi-syn to_wgdi indir=$indir outdir=$outdir species=$sdesign gff=all_species_gene.gff chrLst=chr.list min_genes=100
# *.blast files for `wgdi`
soi-orth to_wgdi $indir/OrthoFinder/OrthoFinder/Results_* species=$sdesign outdir=$outdir
cd $outdir
chrl=../chr.list
../src/comb2 `cat ../species.design` | while read LINE
do
arr=($LINE)
SP1=${arr[0]} # y
SP2=${arr[1]} # x
conf=$SP1-$SP2.conf
ctl=$SP1-$SP2.ctl
# *.conf file for `wgdi`
sh ../src/wgdi-conf.sh $SP1 $SP2 > $conf
# # ctl file for `soi dotplot`
# soi-ctl by_genes $chrl $SP1 $SP2 > $ctl
done
cd -
To detect 'synteny' by WGDI, with visualization by SOI
:
cd wgdi
../src/comb2 `cat ../species.design` | while read LINE
do
arr=($LINE)
SP1=${arr[0]}
SP2=${arr[1]}
prefix=$SP1-$SP2
conf=$prefix.conf
# blast (can be skipped)
# diamond blastp -q ../OrthoFinder/$SP1.fasta -d ../OrthoFinder/$SP2.fasta -o $prefix.blast --more-sensitive -p 10 --quiet -e 0.001
# call synteny
wgdi -icl $conf
# dot plot colored by Orthology Index
soi dotplot -s $prefix.collinearity \
-g ../all_species_gene.gff -c $prefix.ctl \
--xlabel $SP1 --ylabel $SP2 \
--ks-hist --max-ks 1 -o $prefix.io \
--plot-ploidy --gene-axis --number-plots \
--ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
--of-color
# to show only orthology
soi dotplot -s $prefix.collinearity \
-g ../all_species_gene.gff -c $prefix.ctl \
--xlabel $SP1 --ylabel $SP2 \
--ks-hist --max-ks 1 -o $prefix.io \
--plot-ploidy --gene-axis --number-plots \
--ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
--of-color --of-ratio 0.6
done
cd ..
If you also need Ks-based visualization:
cd wgdi
../src/comb2 `cat ../species.design` | while read LINE
do
arr=($LINE)
SP1=${arr[0]}
SP2=${arr[1]}
prefix=$SP1-$SP2
conf=$prefix.conf
# calculate Ks
wgdi -ks $conf
# dot plot colored by Ks
soi dotplot -s $prefix.collinearity \
-g ../all_species_gene.gff -c $prefix.ctl \
--kaks $prefix.collinearity.ks \
--xlabel $SP1 --ylabel $SP2 \
--ks-hist --max-ks 1.5 -o $prefix \
--plot-ploidy --gene-axis --number-plots
# to show only orthology
soi dotplot -s $prefix.collinearity \
-g ../all_species_gene.gff -c $prefix.ctl \
--kaks $prefix.collinearity.ks \
--xlabel $SP1 --ylabel $SP2 \
--ks-hist --max-ks 1.5 -o $prefix \
--plot-ploidy --gene-axis --number-plots \
--ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
--of-ratio 0.6 # filtering by OI
done
cd ..
wgdi
can be replaced by JCVI and
MCscanX.
To cluster syntenic orthogroups (SOGs) and construct phylogenomic analyses:
mkdir -p phylogenomics && cd phylogenomics
# to filter collinearity
soi filter -s ../wgdi/*.collinearity -o ../OrthoFinder/OrthoFinder/Results_*/ -c 0.6 -stat collinearity.ortho.stats > collinearity.ortho
# to cluster SOGs excluding outgroups that do not share the lineage-specific WGD
soi cluster -s collinearity.ortho -outgroup Lonicera_japonica Ilex_polyneura Vitis_vinifera -prefix cluster
# to add outgroups
soi outgroup -s collinearity.ortho -og cluster.mcl -outgroup Lonicera_japonica Ilex_polyneura Vitis_vinifera > cluster.mcl.plus
# to build multi-copy or single-copy gene trees
soi phylo -og cluster.mcl.plus -pep ../pep.faa -cds ../cds.fa -both -pre sog -fast -mm 0.4 -p 80 -tmp tmp.mc.0.4
soi phylo -og cluster.mcl.plus -pep ../pep.faa -cds ../cds.fa -both -pre sog -fast -mm 0.2 -p 80 -tmp tmp.sc.0.2 -sc -concat -trimal_opts " -gappyout" -iqtree_opts " -B 1000"
# to infer coalescent‐based species tree
astral-pro --root Vitis_vinifera sog.mc.cds.mm0.4.genetrees > sog.mc.cds.mm0.4.genetrees.astral
astral-hybrid --root Vitis_vinifera sog.sc.cds.mm0.2.genetrees > sog.sc.cds.mm0.2.genetrees.astral
# to infer concatenation‐based species tree
iqtree2 -s sog.sc.cds.mm0.2.concat.aln -T 60 -B 1000 -mset GTR -o Vitis_vinifera
Note: although we set a unified cutoff (0.6) in the pipeline, users should manually check the resulted dot plots for confirmation, and some extremely complex cases showing unexpected patterns need to be investigated on a case-by-case basis.
If all species share the WGD(s), -outgroup
should not be used, and just run soi cluster
and skip soi outgroup
:
soi cluster -s collinearity.ortho -prefix cluster