-
Notifications
You must be signed in to change notification settings - Fork 4
homograph
The script homograph visualizes alignments of homologous sequences, determines number of haplotypes, and views the haplotype alignment graph.
- a FASTA reference sequence
- a BED file annotating the reference sequence with 7 columns separated by tabs
- a directory containing FASTA sequences homologous with the reference sequence
- the input FASTA sequences, the reference sequence and all sequences in the input FASTA directory, are clustered with cd-hit.
- Multiple sequence alignment (MSA) is performed, which may take a while to complete.
- Determine alignment segments each of which is not split by a gap in each sequence (taxon).
- Split segments to sequence blocks by considering all segments in all taxa. Each block does not span two alignment segments in any taxon.
- Identify substitution polymorphisms and determine a consensus sequence at each polymorphic site.
- Calculate number of sites polymorphic with consensus sequences in per block in each taxon. The ratio of this number to the total number of polymorphic sites represents the divergence level of the taxon from the consensus at the block, which is referred to as blocktyping.
- Collect all data and plot in R.
- Out put sequences of haplotype blocks to a FASTA file.
- Determine genotyping results of each haplotype block.
datadir=wikiexample/1_data/homograph
perl ../../homograph \
--ref $datadir/0_ref/B73.fasta \
--genebed $datadir/0_ref/B73.bed \
--fasdir $datadir/1_fas
The figure (*.6.msa.pdf) integrates data from multiple sources. From the top to the bottom, three areas are in the figure.
-
Rectangles stand for features (e.g., exon and CDS) annotated in the BED input based on the alignment coordinates. Colors are defined in the input BED. The underlying orange lines indicated the range of each feature, which could be split by a gap in the alignment.
-
Counts of polymorphic sites per alignment block that do not include gaps, which means that only the substitution is considered. The Y-axis unit is at the scale of log10.
-
Plotting of multiple sequence alignment. Each rectangle represents an alignment block without a gap. Light blue represents conserved regions with no substitution polymorphisms. The gradient colors from light yellow to orange indicate the divergent levels from the consensus sequences, from low divergence to high divergence, respectively. Names of taxa are prefixed with cluster groups (c_number).
- A FASTA file (*.7.block.haplotype.fasta) contains sequences of all haplotype blocks.
- A table file (*.8.block.genotype.txt) includes individual genotypes of each haplotype block. The genotype of "-" represents the absence of sequences. Other genotype names match sequence names in the FASTA file. For example, the sequence of the B0_1 genotype in the following table can be extracted from the FASTA file. "Num_haplotype" indicates the number of haplotypes, including the absence (-) haplotype, in the examined individuals. A block with the Num_haplotype of 1 represents a conserved region with no polymorphisms.
Usage: perl homograph --dir <dir_containing_fasta_files> [options]
[Options]
--fasdir <path> path to a collection of non-ref fasta files with suffix of .fa, .fas, or .fasta; required
--ref <file> fasta file of the reference sequence; required
--genebed <file> BED file of gene structure using coordinates on --ref; optional
the BED file must have 7 columns separated by TAB:
1. seqname 2. start(0-based) 3. end(1-based) 4. label
5. height(0.01-0.1) 6. strand(+/-) 7. color(R compatible)
--cdhitpara <str> parameters for cd-hit-est (-g 1 -s 0.8 -c 0.8 -r 0)
--msatool <str> software tool for multiple sequencee alignment: clustalo, muscle, or mafft (clustalo)
--threads <num> number of threads (1)
--prefix <str> prefix of outputs (hgout)
--title <str> label to add as the title of the output haplotype figure (Alignmens of multiple sequences)
--version version information
--help: help information