Chimeric alignments spanning multiple adjacent genes #35

Open
@vkkodali

When aligning proteins with paralogs located in a cluster, miniprot produces alignments encompassing two or more distinct genes.

For example, aligning mouse protein NP_001036176.1 to human chromosome 1 (NC_000001.11) returns 3 alignments, one of which spans 3 distinct genes.

$ miniprot --version
0.8-r220
$ efetch -db nucleotide -id NC_000001.11 -format fasta > genome.fa
$ efetch -db protein -id NP_001036176.1 -format fasta > prot.fa
## PAF output truncated for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103625743  1305  1533  0  AS:i:2173  ms:i:2468  np:i:470  da:i:0  do:i:0  cg:Z:56M345N49M810N51M3D12M445N77M766N44M895V40M97V33M1972N39M114V41M1338V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103571602  103579497  1299  1533  0  AS:i:2159  ms:i:2454  np:i:467  da:i:0  do:i:0  cg:Z:56M339N49M806N51M3D12M447N77M321N44M832V40M98V33M1949N39M114V41M1468V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103719868  1302  1533  0  AS:i:2155  ms:i:2450  np:i:468  da:i:0  do:i:0  cg:Z:56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V33M1972N39M114V41M1338V62M

These alignments can be seen in the following screenshot. The number next to each alignment is its rank by alignment score. Appropriately, the chimeric alignment spanning multiple genes is ranked lowest.
image
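
For readers without the screenshot, the same picture can be recovered from the cg tags. The following is a minimal sketch, assuming that miniprot encodes introns as N, U and V operations (phase 0, 1 and 2) whose length is in nucleotides; the function names are my own and not part of miniprot.

```python
import re

# Assumption: miniprot writes introns as N/U/V (phase 0/1/2), length in nucleotides.
INTRON_OPS = {"N", "U", "V"}
CIGAR_RE = re.compile(r"(\d+)([A-Z])")


def intron_lengths(cigar):
    """Return the length of every intron operation in a cg string."""
    return [int(n) for n, op in CIGAR_RE.findall(cigar) if op in INTRON_OPS]


def paf_cigar(line):
    """Pull the cg:Z: tag out of one PAF line (optional tags start at column 13)."""
    for field in line.split()[12:]:
        if field.startswith("cg:Z:"):
            return field[len("cg:Z:"):]
    return None


if __name__ == "__main__":
    # cg strings copied from the three PAF lines above, in the same order.
    cigars = [
        "56M345N49M810N51M3D12M445N77M766N44M895V40M97V33M1972N39M114V41M1338V62M",
        "56M339N49M806N51M3D12M447N77M321N44M832V40M98V33M1949N39M114V41M1468V62M",
        "56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V33M1972N39M114V41M1338V62M",
    ]
    for cigar in cigars:
        print(max(intron_lengths(cigar)), sorted(intron_lengths(cigar), reverse=True)[:2])
```

In the first two alignments no intron exceeds 2 kb, while the third contains two introns of 39,235 and 55,731 nt, which is where it jumps between genes.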

While the chimeric alignment in the previous case was ranked lowest and was secondary to the best alignment, this is not always the case. As an example, I have aligned 3 proteins (CAD87763.1, AAZ99031.1, AUO15579.1) to the genome sequence NC_069144.1, and all 3 alignments returned by miniprot are chimeric, as shown in the following screenshot:
image
The genes in blue boxes on the left- and right-hand sides are both "phenoloxidase-8-like" genes and encode paralogous proteins. For all 3 sequences, one portion of the protein aligns to one paralog and the rest aligns to the other paralog.

The alignments are riddled with mismatches and indels, which is understandable: the more distant the protein, the harder it is to arrive at a perfect alignment. So, what's an aligner to do? Perhaps an appropriate outcome in this scenario would be to generate two (subpar) alignments for each protein, each with unaligned tails, and mark the better one as primary. Splign and ProSplign offer some inspiration: they use "compart" logic that avoids generating alignments spanning multiple distant locations. Could a similar approach be applied to miniprot?
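
To make the "two alignments with unaligned tails" idea concrete, here is a rough post-processing sketch that cuts a plus-strand alignment at every intron longer than a threshold and reports the target and query interval of each piece. This is not Splign's compart algorithm and not anything miniprot does; `split_on_long_introns` and the `max_intron` threshold are hypothetical. The per-operation lengths for M, D, N and V reproduce the target and query spans of the PAF lines above; those for I, F, G and U are assumptions.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([A-Z])")


def consumed(n, op):
    """Return (target nucleotides, query residues) used by one CIGAR operation."""
    if op == "M":
        return 3 * n, n
    if op == "I":          # assumed: insertion of n residues
        return 0, n
    if op == "D":
        return 3 * n, 0
    if op == "F":          # assumed: frameshift, n nucleotides
        return n, 0
    if op == "G":          # assumed: frameshift match, n nucleotides + 1 residue
        return n, 1
    if op == "N":
        return n, 0
    if op in ("U", "V"):   # intron interrupting a codon: n nucleotides + 1 residue
        return n, 1
    raise ValueError(f"unexpected CIGAR operation: {op}")


def split_on_long_introns(cigar, t_start, max_intron, q_start=0):
    """Cut a plus-strand alignment at introns longer than max_intron.

    Returns a list of (t_start, t_end, q_start, q_end) tuples, one per piece.
    When the cut falls on a U/V intron, the residue split by that intron is
    left out of both neighbouring pieces.
    """
    segments = []
    t = seg_t = t_start
    q = seg_q = q_start
    for n_str, op in CIGAR_RE.findall(cigar):
        n = int(n_str)
        dt, dq = consumed(n, op)
        is_cut = op in ("N", "U", "V") and n > max_intron
        if is_cut and t > seg_t:
            segments.append((seg_t, t, seg_q, q))
        t += dt
        q += dq
        if is_cut:
            seg_t, seg_q = t, q
    if t > seg_t:
        segments.append((seg_t, t, seg_q, q))
    return segments


if __name__ == "__main__":
    chimeric = ("56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V"
                "33M1972N39M114V41M1338V62M")
    for seg in split_on_long_introns(chimeric, 103617440, max_intron=20000):
        print(seg)
```

With a 20 kb threshold, the third alignment from the first example falls into three pieces covering residues 0-56, 56-168 and 168-508, each of which would need to be extended or realigned locally to become a usable per-gene alignment.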

I have a dataset of more than 165k proteins aligned to the Anopheles cruzii genome, and I see chimeric alignments like the ones described above quite often. After trying the -J and -E parameters individually and in combination, I have made little progress in systematically avoiding them.
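
For a dataset of this size, one way to at least count the suspicious cases is to compare each alignment's genomic span with the coding length it accounts for. The sketch below is only a screening heuristic under my own assumptions: `flag_stretched` and the `max_ratio` cutoff are hypothetical, and a sensible cutoff depends on typical intron sizes in the target genome.

```python
def expansion_ratio(fields):
    """Genomic span divided by the coding length (3 nt per aligned residue)."""
    q_start, q_end = int(fields[2]), int(fields[3])
    t_start, t_end = int(fields[7]), int(fields[8])
    return (t_end - t_start) / (3 * (q_end - q_start))


def flag_stretched(paf_lines, max_ratio):
    """Yield (query, target, t_start, t_end, ratio) for unusually stretched alignments.

    paf_lines is any iterable of PAF lines, e.g. an open file handle.
    max_ratio is a hypothetical, genome-specific cutoff.
    """
    for line in paf_lines:
        fields = line.split()
        if len(fields) < 12 or int(fields[3]) <= int(fields[2]):
            continue  # skip blank, malformed or empty-range lines
        ratio = expansion_ratio(fields)
        if ratio > max_ratio:
            yield fields[0], fields[5], int(fields[7]), int(fields[8]), ratio


if __name__ == "__main__":
    # First twelve PAF columns of the three alignments from the first example.
    example = [
        "NP_001036176.1 508 0 508 + NC_000001.11 248956422 103617440 103625743 1305 1533 0",
        "NP_001036176.1 508 0 508 + NC_000001.11 248956422 103571602 103579497 1299 1533 0",
        "NP_001036176.1 508 0 508 + NC_000001.11 248956422 103617440 103719868 1302 1533 0",
    ]
    for hit in flag_stretched(example, max_ratio=20):
        print(hit)
```

On the three alignments from the first example, the two single-gene alignments have a ratio of about 5 and the chimeric one about 67, so only the chimeric one is reported at a cutoff of 20. Unlike a comparison against the best hit, this also works when every alignment of a protein is chimeric, but it will also flag genuine genes with very long introns.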

Labels: enhancement (New feature or request)
