Chimeric alignments spanning multiple adjacent genes

When a protein has paralogs clustered together in the genome, `miniprot` can produce a single alignment that spans two or more distinct genes.

For example, aligning mouse protein NP_001036176.1 to human chromosome 1 (NC_000001.11) returns 3 alignments, one of which spans 3 distinct genes:
```bash
$ miniprot --version
0.8-r220
$ efetch -db nucleotide -id NC_000001.11 -format fasta > genome.fa
$ efetch -db protein -id NP_001036176.1 -format fasta > prot.fa
## PAF output truncated for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103625743  1305  1533  0  AS:i:2173  ms:i:2468  np:i:470  da:i:0  do:i:0  cg:Z:56M345N49M810N51M3D12M445N77M766N44M895V40M97V33M1972N39M114V41M1338V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103571602  103579497  1299  1533  0  AS:i:2159  ms:i:2454  np:i:467  da:i:0  do:i:0  cg:Z:56M339N49M806N51M3D12M447N77M321N44M832V40M98V33M1949N39M114V41M1468V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103719868  1302  1533  0  AS:i:2155  ms:i:2450  np:i:468  da:i:0  do:i:0  cg:Z:56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V33M1972N39M114V41M1338V62M
```
These alignments are shown in the following screenshot. The number next to each alignment is its rank by alignment score. Appropriately, the chimeric alignment spanning multiple genes is ranked lowest.
![image](https://user-images.githubusercontent.com/19337104/224862275-b0e4a814-3158-431d-94b3-8b57e834cc7a.png)
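The chimeric alignment also stands out in the `cg` string itself: the third record contains introns of 39235 and 55731 bp, while no intron in the other two records exceeds 1972 bp. Below is a minimal sketch of flagging such records in PAF output. It assumes that `N`, `U` and `V` in miniprot's `cg` tag all denote introns (phase 0, 1 and 2), and the 20 kb cutoff and the function names are arbitrary choices for illustration, not anything `miniprot` provides.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([A-Z])")
INTRON_OPS = {"N", "U", "V"}  # assumed: phase-0, phase-1 and phase-2 introns


def intron_lengths(cg):
    """Return the lengths (in bp) of all intron operations in a cg string."""
    return [int(n) for n, op in CIGAR_OP.findall(cg) if op in INTRON_OPS]


def flag_long_introns(paf_lines, max_intron=20000):
    """Yield (protein, target_start, target_end, longest_intron, is_suspect).

    max_intron is a hypothetical cutoff; tune it to the intron sizes
    expected for the genome at hand.
    """
    for line in paf_lines:
        fields = line.split()
        cg = next((f[5:] for f in fields if f.startswith("cg:Z:")), None)
        if cg is None:
            continue  # record without a cg tag: nothing to inspect
        longest = max(intron_lengths(cg), default=0)
        yield fields[0], int(fields[7]), int(fields[8]), longest, longest > max_intron


# The third record from the output above, with score tags omitted.
record = (
    "NP_001036176.1 508 0 508 + NC_000001.11 248956422 103617440 103719868 "
    "1302 1533 0 cg:Z:56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V"
    "33M1972N39M114V41M1338V62M"
)
for hit in flag_long_introns([record]):
    print(hit)
```

A length cutoff only catches chimeras that jump across a large gap, so it is a screening aid rather than a fix: tightly packed paralogs could be joined by an intron of ordinary size.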

In the previous case the chimeric alignment was ranked lowest and was secondary to the best alignment, but that is not always so. For example, I aligned 3 proteins (CAD87763.1, AAZ99031.1, AUO15579.1) to the genome sequence NC_069144.1, and all 3 alignments returned by `miniprot` are chimeric, as shown in the following screenshot:
![image](https://user-images.githubusercontent.com/19337104/224863588-0410065a-4bee-490e-9c6d-4decfb82fd47.png)
The genes in the blue boxes on the left- and right-hand sides are both "phenoloxidase-8-like" genes and encode paralogous proteins. Each of the 3 proteins aligns partly to one paralog and partly to the other.

The alignments are riddled with mismatches and indels, which is understandable: the more distant the protein is from the genome, the harder it becomes to arrive at a clean alignment. So, what's an aligner to do? Perhaps a more appropriate outcome here would be to generate two (subpar) alignments for each protein, each with an unaligned tail, and to mark the better one as primary. [Splign](https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi) and [ProSplign](https://www.ncbi.nlm.nih.gov/sutils/static/prosplign/prosplign.html) offer some inspiration: their "compart" logic avoids generating alignments that span multiple distant locations. Could a similar approach be applied to `miniprot`?
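To make the "two alignments with unaligned tails" idea concrete, here is a rough post-processing sketch that cuts one plus-strand alignment into partial alignments at every intron longer than a cutoff. It is not the compart algorithm and not something `miniprot` does. The per-operation bookkeeping reflects my reading of the `cg` tag (`M` consumes n residues and 3n bases, `I` n residues, `D` 3n bases, `F` n bases, `N` n bases, and `G`/`U`/`V` n bases plus 1 residue) and should be treated as an assumption, as should the 20 kb cutoff and the function names.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([A-Z])")
INTRON_OPS = {"N", "U", "V"}


def consumed(n, op):
    """Return (residues, bases) consumed by one cg operation (assumed semantics)."""
    if op == "M":
        return n, 3 * n
    if op == "I":
        return n, 0
    if op == "D":
        return 0, 3 * n
    if op in ("F", "N"):
        return 0, n
    if op in ("G", "U", "V"):
        return 1, n
    raise ValueError(f"unexpected cg operation: {op}")


def split_at_long_introns(qstart, tstart, cg, max_intron=20000):
    """Split a plus-strand alignment at introns longer than max_intron.

    Returns a list of (q_start, q_end, t_start, t_end) tuples, one per piece.
    A residue split across a long phase-1/2 intron is assigned to neither piece.
    """
    segments = []
    q = seg_q = qstart
    t = seg_t = tstart
    for n_str, op in CIGAR_OP.findall(cg):
        n = int(n_str)
        aa, nt = consumed(n, op)
        if op in INTRON_OPS and n > max_intron:
            segments.append((seg_q, q, seg_t, t))  # close the piece before the gap
            q, t = q + aa, t + nt
            seg_q, seg_t = q, t  # open a new piece after the gap
        else:
            q, t = q + aa, t + nt
    segments.append((seg_q, q, seg_t, t))
    return segments


# Third record of the NP_001036176.1 example (query start 0, target start 103617440).
cg = ("56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V"
      "33M1972N39M114V41M1338V62M")
for seg in split_at_long_introns(0, 103617440, cg):
    print(seg)
```

Under these assumptions the record breaks into three pieces covering protein residues 0-56, 56-168 and 168-508. Each piece would still have to be extended or re-scored to become a real alternative alignment, which is why this is better handled inside the aligner.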

I have a dataset of more than 165k proteins aligned to the Anopheles cruzii genome, and I see chimeric alignments like the ones described above quite often. After experimenting with the `-J` and `-E` parameters, individually and in combination, I have made little progress in systematically avoiding them.
