Skip to content

zhangrengang/evolution_example

Repository files navigation

Table of Contents

Prepare data

Download and pre-process the example data:

# git-lfs has been installed
git lfs clone https://github.com/zhangrengang/evolution_example
cd evolution_example

chmod +x src/*
find . -name "*gz" | xargs gunzip
cat OrthoFinder/*fasta > pep.faa
cat CDS/* > cds.fa
cat wgdi/*gff > all_species_gene.gff

If it is hard to download from GitHub, altenatively you can try to download from BaiduYun. Now, the directory structure like this:

$ tree
├── all_species_gene.gff        # all GFF
├── cds.fa                      # all CDS sequences
├── chr.list					$ all chromosome information
├── pep.faa                     # all protein sequences
├── species.design              # speceis list
├── OrthoFinder                 # input for OrthoFinder
│   ├── Angelica_sinensis.fasta
│   ├── Apium_graveolens.fasta
│   ├── ......
└── wgdi                        # input for WGDI
    ├── Angelica_sinensis-Angelica_sinensis.blast
    ├── Angelica_sinensis-Angelica_sinensis.conf
    ├── Angelica_sinensis-Angelica_sinensis.ctl   # for `soi dotplot`
    ├── Angelica_sinensis.gff
    ├── Angelica_sinensis.lens
    ├── ......
 ......

Note: the GENE ID is needed to label with SPECIES ID (e.g. Angelica_sinensis|AS01G00001) for uniqueness and compatibility (legacy from OrthoMCL). The CHROMosome ID should also be unique (e.g. As1, As2) to avoid conflicts (legacy from MCscanX). The IDs can be easily labeled for your own data; for example:

SP=Angelica_sinensis
# using sed to label fasta files:
sed 's/>/>'$SP'|/' $SP.pep > $SP.fasta  # always using the separator '|'

SP2=As
# using awk and perl to label MCscanX/WGDI gff files or JCVI bed files:
cat $SP.gff0 | awk -v sp=$SP -v OFS="\t" '{$2=sp"|"$2;print $0}' | perl -pe 's/^\D+/'$SP2'/' > $SP.gff
cat $SP.bed0 | awk -v sp=$SP -v OFS="\t" '{$4=sp"|"$4;print $0}' | perl -pe 's/^\D+/'$SP2'/' > $SP.bed

SP and SP2 can be the same. It will be convenient to use a for/while loop to process all the species.

Installation

If you have installed OrthoIndex and activated the conda environment, all the commands used in this pipeline should have been installed and can be executed in the shell.

Run OrthoFinder for orthology

To infer 'orthology' using OrthoFinder2:

orthofinder -f OrthoFinder/ -M msa -t 60

orthofinder can by replaced by Proteinortho6, Broccoli and SonicParanoid2.

Run WGDI for synteny

This repo has provided all input files (*.lens, *.gff, *.blast, *.conf, *.ctl) for the downstream pipeline. But for your own data, these input files can be generated by the following commands:

indir=.
outdir=wgdi
sdesign=species.design

# to generate *.lens and *.gff files for `wgdi`, and *.ctl files for `soi dotplot`
soi-syn to_wgdi indir=$indir outdir=$outdir species=$sdesign gff=all_species_gene.gff chrLst=chr.list min_genes=100
# *.blast files for `wgdi`
soi-orth to_wgdi $indir/OrthoFinder/OrthoFinder/Results_* species=$sdesign outdir=$outdir

cd $outdir
chrl=../chr.list
../src/comb2 `cat ../species.design` | while read LINE
do
	arr=($LINE)
	SP1=${arr[0]}	# y
	SP2=${arr[1]}	# x
	conf=$SP1-$SP2.conf
	ctl=$SP1-$SP2.ctl
	# *.conf file for `wgdi`
	sh ../src/wgdi-conf.sh $SP1 $SP2 > $conf
#	# ctl file for `soi dotplot`
#	soi-ctl by_genes $chrl $SP1 $SP2 > $ctl
done
cd -

To detect 'synteny' by WGDI, with visualization by SOI :

cd wgdi

../src/comb2 `cat ../species.design` | while read LINE
do
    arr=($LINE)
    SP1=${arr[0]}
    SP2=${arr[1]}
    prefix=$SP1-$SP2
    conf=$prefix.conf

    # blast (can be skipped)
#    diamond blastp -q ../OrthoFinder/$SP1.fasta -d ../OrthoFinder/$SP2.fasta -o $prefix.blast --more-sensitive -p 10 --quiet -e 0.001

    # call synteny
    wgdi -icl $conf

    # dot plot colored by Orthology Index
    soi dotplot -s $prefix.collinearity \
        -g ../all_species_gene.gff -c $prefix.ctl  \
        --xlabel $SP1 --ylabel $SP2 \
        --ks-hist --max-ks 1 -o $prefix.io    \
        --plot-ploidy --gene-axis --number-plots \
        --ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
        --of-color

    # to show only orthology
    soi dotplot -s $prefix.collinearity \
        -g ../all_species_gene.gff -c $prefix.ctl  \
        --xlabel $SP1 --ylabel $SP2 \
        --ks-hist --max-ks 1 -o $prefix.io    \
        --plot-ploidy --gene-axis --number-plots \
        --ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
        --of-color --of-ratio 0.6

done

cd ..

If you also need Ks-based visualization:

cd wgdi

../src/comb2 `cat ../species.design` | while read LINE
do
    arr=($LINE)
    SP1=${arr[0]}
    SP2=${arr[1]}
    prefix=$SP1-$SP2
    conf=$prefix.conf

    # calculate Ks
    wgdi -ks $conf

    # dot plot colored by Ks
    soi dotplot -s $prefix.collinearity \
        -g ../all_species_gene.gff -c $prefix.ctl  \
        --kaks $prefix.collinearity.ks \
        --xlabel $SP1 --ylabel $SP2 \
        --ks-hist --max-ks 1.5 -o $prefix     \
        --plot-ploidy --gene-axis --number-plots

    # to show only orthology
    soi dotplot -s $prefix.collinearity \
        -g ../all_species_gene.gff -c $prefix.ctl  \
        --kaks $prefix.collinearity.ks \
        --xlabel $SP1 --ylabel $SP2 \
        --ks-hist --max-ks 1.5 -o $prefix     \
        --plot-ploidy --gene-axis --number-plots \
        --ofdir ../OrthoFinder/OrthoFinder/Results_*/ \
        --of-ratio 0.6       # filtering by OI

done

cd ..

wgdi can be replaced by JCVI and MCscanX.

Run SOI for orthologous synteny and species tree

To cluster syntenic orthogroups (SOGs) and construct phylogenomic analyses:

mkdir -p phylogenomics && cd phylogenomics

# to filter collinearity
soi filter -s ../wgdi/*.collinearity -o ../OrthoFinder/OrthoFinder/Results_*/ -c 0.6 -stat collinearity.ortho.stats > collinearity.ortho

# to cluster SOGs excluding outgroups that do not share the lineage-specific WGD
soi cluster -s collinearity.ortho -outgroup Lonicera_japonica Ilex_polyneura Vitis_vinifera -prefix cluster
# to add outgroups
soi outgroup -s collinearity.ortho -og cluster.mcl -outgroup Lonicera_japonica Ilex_polyneura Vitis_vinifera > cluster.mcl.plus

# to build multi-copy or single-copy gene trees
soi phylo -og cluster.mcl.plus -pep ../pep.faa -cds ../cds.fa -both -pre sog -fast -mm 0.4 -p 80 -tmp tmp.mc.0.4
soi phylo -og cluster.mcl.plus -pep ../pep.faa -cds ../cds.fa -both -pre sog -fast -mm 0.2 -p 80 -tmp tmp.sc.0.2 -sc -concat -trimal_opts " -gappyout" -iqtree_opts " -B 1000"

# to infer coalescent‐based species tree
astral-pro --root Vitis_vinifera sog.mc.cds.mm0.4.genetrees > sog.mc.cds.mm0.4.genetrees.astral
astral-hybrid --root Vitis_vinifera sog.sc.cds.mm0.2.genetrees > sog.sc.cds.mm0.2.genetrees.astral

# to infer concatenation‐based species tree
iqtree2 -s sog.sc.cds.mm0.2.concat.aln -T 60 -B 1000 -mset GTR -o Vitis_vinifera

Note: although we set a unified cutoff (0.6) in the pipeline, users should manually check the resulted dot plots for confirmation, and some extremely complex cases showing unexpected patterns need to be investigated on a case-by-case basis.

If all species share the WGD(s), -outgroup should not be used, and just run soi cluster and skip soi outgroup:

soi cluster -s collinearity.ortho -prefix cluster

About

Example for evlutionary analyses with Orthology Index

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages