homograph

The script homograph visualizes alignments of homologous sequences, determines number of haplotypes, and views the haplotype alignment graph.

required data

a FASTA reference sequence
a BED file annotating the reference sequence with 7 columns separated by tabs
a directory containing FASTA sequences homologous with the reference sequence

procedure

the input FASTA sequences, the reference sequence and all sequences in the input FASTA directory, are clustered with cd-hit.
Multiple sequence alignment (MSA) is performed, which may take a while to complete.
Determine alignment segments each of which is not split by a gap in each sequence (taxon).
Split segments to sequence blocks by considering all segments in all taxa. Each block does not span two alignment segments in any taxon.
Identify substitution polymorphisms and determine a consensus sequence at each polymorphic site.
Calculate number of sites polymorphic with consensus sequences in per block in each taxon. The ratio of this number to the total number of polymorphic sites represents the divergence level of the taxon from the consensus at the block, which is referred to as blocktyping.
Collect all data and plot in R.
Out put sequences of haplotype blocks to a FASTA file.
Determine genotyping results of each haplotype block.

run

datadir=wikiexample/1_data/homograph
perl ../../homograph \
	--ref $datadir/0_ref/B73.fasta \
	--genebed $datadir/0_ref/B73.bed \
	--fasdir $datadir/1_fas

major output - haplotype alignment graph

The figure (*.6.msa.pdf) integrates data from multiple sources. From the top to the bottom, three areas are in the figure.

Rectangles stand for features (e.g., exon and CDS) annotated in the BED input based on the alignment coordinates. Colors are defined in the input BED. The underlying orange lines indicated the range of each feature, which could be split by a gap in the alignment.
Counts of polymorphic sites per alignment block that do not include gaps, which means that only the substitution is considered. The Y-axis unit is at the scale of log10.
Plotting of multiple sequence alignment. Each rectangle represents an alignment block without a gap. Light blue represents conserved regions with no substitution polymorphisms. The gradient colors from light yellow to orange indicate the divergent levels from the consensus sequences, from low divergence to high divergence, respectively. Names of taxa are prefixed with cluster groups (c_number).

other outputs - haplotype sequences and a genotyping table

A FASTA file (*.7.block.haplotype.fasta) contains sequences of all haplotype blocks.

A table file (*.8.block.genotype.txt) includes individual genotypes of each haplotype block. The genotype of "-" represents the absence of sequences. Other genotype names match sequence names in the FASTA file. For example, the sequence of the B0_1 genotype in the following table can be extracted from the FASTA file. "Num_haplotype" indicates the number of haplotypes, including the absence (-) haplotype, in the examined individuals. A block with the Num_haplotype of 1 represents a conserved region with no polymorphisms.

full usage

homograph

Usage

Usage: perl homograph --dir <dir_containing_fasta_files> [options]
    [Options]
    --fasdir <path>   path to a collection of non-ref fasta files with suffix of .fa, .fas, or .fasta; required
    --ref <file>      fasta file of the reference sequence; required
    --genebed <file>  BED file of gene structure using coordinates on --ref; optional
                      the BED file must have 7 columns separated by TAB:
                        1. seqname 2. start(0-based) 3. end(1-based) 4. label
                        5. height(0.01-0.1) 6. strand(+/-) 7. color(R compatible)
    --cdhitpara <str> parameters for cd-hit-est (-g 1 -s 0.8 -c 0.8 -r 0)
    --msatool <str>   software tool for multiple sequencee alignment: clustalo, muscle, or mafft (clustalo)
    --threads <num>   number of threads (1)
    --prefix <str>    prefix of outputs (hgout)
    --title <str>     label to add as the title of the output haplotype figure (Alignmens of multiple sequences)
    --version         version information
    --help:           help information

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

homograph

required data

procedure

run

major output - haplotype alignment graph

other outputs - haplotype sequences and a genotyping table

full usage

homograph

Usage

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally