Skip to content

homocomp

Sanzhen Liu edited this page Sep 3, 2024 · 22 revisions

The script homocomp implements homologous searching and visualization of the comparison. With a query gene sequence, homologous sequences can be identified in a genome. The homologous gene could be found if a related GTF is supplied. The alignment between two homologous regions are plotted.

required data

  1. a query sequence
  2. a blast database

Notes
The blast database can be produced by the following script.

makeblastdb -in <seq> --dbtype nucl

run 1

If the query sequence is from a gene, geneseq can help retrieve data. In this case, the Zm00001eb001720 sequence extracted is used to find a matched region in A188.

scripts of run 1

query=wikiexample/2_geneseq/1o_geneseq/1o_geneseq.1.Zm00001eb001720.fasta
ref_db=wikiexample/1_data/blastDB/A188.fasta
prefix=1o_homocomp

perl ../../homocomp \
	--prefix $prefix \
	--query $query \
	--db $ref_db

major output files from run 1

1o_homocomp.5b.delta.txt
1o_homocomp.5c.filt.delta.txt
1o_homocomp.5f.uniq.filt.delta.txt
1o_homocomp.o1.alnplot.pdf
1o_homocomp.o2.filt.alnplot.pdf
1o_homocomp.o3.uniq.filt.alnplot.pdf
1o_homocomp.o4.dotplot.pdf
1o_homocomp.o5.filt.dotplot.pdf
1o_homocomp.o6.uniq.filt.dotplot.pdf

Files suffixed with txt are outputs of NUCmer alignments between two sequences. xxx.5b.delta.txt includes raw alignments; xxx.5c.filt.delta.txt includes alignments after filtering based on the input criteria (identity and coverage); xxx.5f.uniq.filt.delta.txt includes unique alignments after filtering based on the input criteria. These alignments are used to produced alnplots and dotplots.

Here is the alnplot of 1o_homocomp.o2.filt.alnplot.pdf:
alnplot_1

run 2

scripts of run 2

We can add a BED file to highlight the query regions by using --qryadd. The parameter --dbacc can be used to add the accession/genotype of the reference/target sequence.
We can supply a GTF gene annotation file by using --tgtf to allow to find the best matched gene in ref/target.

query=wikiexample/2_geneseq/1o_geneseq/1o_geneseq.1.Zm00001eb001720.fasta
query_bed=wikiexample/2_geneseq/1o_geneseq/1o_geneseq.4.pos.adjusted.gtf.bed/Zm00001eb001720_T001.adjusted.bed
ref_db=wikiexample/1_data/blastDB/A188.fasta
ref_name=A188
target_gtf=wikiexample/1_data/gtf/A188.gtf
prefix=2o_homocomp

perl ../../homocomp \
	--prefix $prefix \
	--query $query \
	--qryadd $query_bed \
	--db $ref_db \
	--dbacc $ref_name \
	--tgtf $target_gtf

outputs from run 2 are similar to outputs from run 1

Here is the alnplot of 2o_homocomp.o2.filt.alnplot.pdf:
alnplot_2

full usage

Usage: homocomp --query <fasta> --db <blast_db> [options]
[Options]
    --query <file>   fasta file containing a sequence as the query; required
    --qryadd <file>  bed file to highlight regions in query; optional 
    [format, 7 columns separated by Tab]:
    chr start(0-based) end(1-based) label height(0.01-0.1) strand(+/-) color(R compatible)
    --db <db_name>   blast database; required
    --dbacc <str>    accession name of reference database; optional
    --ref <file>     fasta file of the reference; optional
                     if not supplied, reference sequences will be extracted from --db
    --tchr <str>     targeted chromosome or contig; optional
    --tstart <num>   bp position for the region start; optional
                     if specified, the value will be used as the left end
    --tend <num>     bp position for the region end; optional
                     if specified, the value will be used as the right end
    --tgtf <file>    the gene annotation GTF file of the target; optional
    --strand <str>   plus (or +) or minus (or -) (plus)
    --evalue <str>   maximal E-value (1e-10)
    --identity <num> minimal percentage of identity from 0 to 100 (80)
    --match <num>    minimal bp match of an alignment (100)
    --coverage <num> minimal coverage of the query (0)
    --repeatdust     remove repetitive blastn alignments if specified
    --lext <num>     extended bp from the left side (0);
    --rext <num>     extended bp from the right side (0);
    --expand <num>   expand scale relative to the query length (10)
                     e.g., 1kb query, 1kb x 10 extened from both sides will be scanned for the hit range
    --prefix <str>   the output directory and the prefix for output files (hpout)
    --threads <num>  number of cpus (1)
    --noblastn       skip blastn if specified
    --bandcol <str>  a valid R color name (bisque3)
    --version        version information
    --help:          help information.

Output from homocomp

Three alignment plots and two dotplots between two sequences are output.

Alignment plot without alignment filtering   : <prefix>.o1.alnplot.pdf
Alignment plot with alignment filtering      : <prefix>.o2.filt.alnplot.pdf
Alignment plot with filtered unique alignment: <prefix>.o2.filt.uniq.alnplot.pdf
Dotplot without alignment filtering          : <prefix>.o3.dotplot.pdf
Dotplot with alignment filtering             : <prefix>.o4.filt.dotplot.pdf
Clone this wiki locally