Add documentation page 'evaluation metrics - sequence identity'

Kuanhao-Chao · Apr 25, 2024 · eaadebd · eaadebd
1 parent a0272f6
commit eaadebd
Show file tree

Hide file tree

Showing 10 changed files with 270 additions and 502 deletions.
diff --git a/docs/source/_images/sequence_identity_DNA.png b/docs/source/_images/sequence_identity_DNA.png
diff --git a/docs/source/_images/sequence_identity_protein.png b/docs/source/_images/sequence_identity_protein.png
diff --git a/docs/source/content/behind_scenes.rst b/docs/source/content/behind_scenes.rst
@@ -8,10 +8,8 @@ Behind the scenes
 
 
 
-LiftOn is specifically designed for lifting genes, transcripts, and exons, with the capability to handle any feature or group of hierarchical features in a GFF or GTF file. It utilizes information from , and exons, with the capability to handle any feature or group of hierarchical features in a GFF or GTF file. It takes `Liftoff <https://academic.oup.com/bioinformatics/article/37/12/1639/6035128?login=true>`_  :cite:p:`shumate2021liftoff` and  `miniprot <https://academic.oup.com/bioinformatics/article/39/1/btad014/6989621>`_ :cite:p:`li2023protein` to improve protein-coding gene annotation. This section provides a more detailed explanation of how the algorithms work.
+LiftOn is designed for lifting genes, transcripts and exons, with the capability to handle any feature or group of hierarchical features in a GFF or GTF file. It takes `Liftoff <https://academic.oup.com/bioinformatics/article/37/12/1639/6035128?login=true>`_  :cite:p:`shumate2021liftoff` and  `miniprot <https://academic.oup.com/bioinformatics/article/39/1/btad014/6989621>`_ :cite:p:`li2023protein` to improve protein-coding gene annotation. This section provides a more detailed explanation of how LiftOn algorithm works.
 
-
-|
 |
 
 
@@ -39,19 +37,9 @@ LiftOn uses gene loci lifted by Liftoff as anchors to identify corresponding min
 **(2)** miniprot lacks the ability to reconcile overlapping gene loci. Additionally, given that certain gene families comprise numerous genes, a significant drawback of miniprot is its tendency to map all proteins in these genes to every gene.
 
 
-In most cases, miniprot identifies a single transcript per protein, facilitating LiftOn's matching with Liftoff. In other scenarios with multiple miniprot copies overlapping Liftoff, we select the overlapped miniprot locus. If multiple transcripts share the same ID, we eliminate any spanning multiple loci to avoid gene fusion annotations. If multiple transcripts persist, we choose the one with a higher protein sequence identity score. Additionally, if miniprot annotations do not overlap with Liftoff, we prioritize Liftoff annotations.
-
-
-Using lift-over RefSeq v110 annotations from GRCh38 to T2T-CHM13 V2.0 as an example, the left circos plot (:numref:`liftoff-miniprot-map-circos` A) aligns 129,487 miniprot-identified protein-coding gene loci with Liftoff gene loci. Conversely, the right circos plot (:numref:`liftoff-miniprot-map-circos` B) shows 8,239 miniprot-identified loci, including extra copies, without overlap with Liftoff gene loci. Once a one-to-one mapping is established, LiftOn considers both Liftoff and miniprot CDS chains, initiating the protein-maximization (PM) algorithm.
-
-
-.. _liftoff-miniprot-map-circos:
-.. figure::  ../_images/circous_plots_liftoff_miniprot_cmp.png
-    :align:   center
-    :scale:   28 %
-
-    Circos plots compare protein-coding transcript coordinates between miniprot and Liftoff annotations for RefSeq v110 from GRCh38 to T2T-CHM13. (A) aligns proteins to the same gene loci, while (B) aligns proteins to different gene loci. Left circle\: miniprot coordinates on T2T-CHM13; right circle\: Liftoff coordinates on T2T-CHM13.
+In most cases, miniprot identifies a single transcript per protein and aligns with Liftoff annotations. However, there are instances where miniprot detects multiple transcript copies for a protein, with more than one copy overlapping Liftoff annotations. To avoid gene fusion annotations, we first eliminate any transcripts spanning multiple loci. If multiple transcripts still persist, we select the one with the highest protein sequence identity score. Furthermore, if miniprot annotations do not overlap with those from Liftoff, we give precedence to Liftoff annotations.
 
+Check out :ref:`'Q & A How do the annotations generated by Liftoff differ from those produced by Miniprot?' <liftoff_miniprot_cmp_qa>` for the analysis of consensus and differences between Liftoff and miniprot annotations. Once a one-to-one mapping is established, LiftOn considers both Liftoff and miniprot CDS chains, initiating the protein maximization (PM) algorithm.
 
 |
 |
@@ -67,7 +55,7 @@ Step 1: Chaining algorithm
 
 The chaining algorithm (:numref:`lifton-chaining` A-E) starts by pairing up miniprot alignments with transcripts lifted over by Liftoff. After two transcripts are paired up, the protein sequences from the Liftoff and miniprot annotations are then aligned to the full-length reference protein, as illustrated in :numref:`lifton-chaining` B. Subsequently, LiftOn maps the CDS boundaries from both the Liftoff and miniprot annotations onto the protein alignment (:numref:`lifton-chaining` C). 
 
-The CDSs within the Liftoff and miniprot annotations are grouped from the 5’ to 3’ end direction. The CDSs group in Liftoff is represented as :math:`G_{L_i}`, while in miniprot, they are represented as :math:`G_{M_i}`. Here, :math:`i` denotes the i^th group in that annotation. 
+The CDSs within the Liftoff and miniprot annotations are grouped from the 5’ to 3’ end direction. The CDSs group in Liftoff is represented as :math:`G_{L_i}`, while in miniprot, they are represented as :math:`G_{M_i}`. Here, :math:`i` denotes the :math:`i^{th}` group in that annotation. 
 
 The grouping process begins with the first CDS in each annotation and continues until reaching the endpoints of the downstream CDSs in Liftoff and miniprot, where the number of aligned amino acids from the reference protein is equal. This forms the first CDSs group in Liftoff, denoted as :math:`G_{L_1}`, and the first CDSs group in miniprot, denoted as :math:`G_{M_1}`. Subsequent groups start from the previous endpoint in both Liftoff and miniprot, extending until the number of aligned amino acids from the reference protein matches for both annotations again. These subsequent groups are represented as :math:`G_{L_2}` and :math:`G_{M_2}`, respectively. The grouping process concludes upon reaching the last CDSs in both annotations.
 
@@ -83,18 +71,22 @@ Within each group, :math:`G_{L_i}` or :math:`G_{M_i}`, we calculate the partial
 
 |
 
-Step 2: Open-reading-frame search
+Step 2: Open reading frame search
 ----------------------------------
 
-Frameshift mutations, corrected by aligning annotated coding sequences with the reference protein, alter mRNA reading frames (:numref:`lifton-orf-search-alg` A). Stop codon gain due to point mutations is depicted in :numref:`lifton-orf-search-alg` B and C, where LiftOn searches for the longest open reading frame. :numref:`lifton-orf-search-alg` D highlights stop codon loss, resulting in a longer protein. :numref:`lifton-orf-search-alg` E and F illustrate start codon loss, with LiftOn searching for a new start codon based on sequence identity, selecting the one with the higher score.
+Following the chaining algorithm, LiftOn performs an open-reading frame search algorithm on the protein-coding regions of the mapped transcripts that have mutations likely to be more deleterious, such as “frameshift”, “stop codon gain”, “stop codon loss”, and “start codon loss” mutations. The objective is to generate the longest valid protein sequences that align with the full-length reference proteins.
+
 
+It searches the ORF translations of protein-coding transcripts and adjusts CDS boundaries to avoid early stop codons (:numref:`lifton-orf-search-alg` A-B), choose better translation start sites (:numref:`lifton-orf-search-alg` C, E, F), or extends proteins with stop codon loss (:numref:`lifton-orf-search-alg` D), in order to produce the longest valid protein that match the reference protein. 
 
 .. _lifton-orf-search-alg:
 .. figure::  ../_images/figure_LiftOn_ORF_search.png
     :align:   center
     :scale:   9 %
 
-    Schematic diagram illustrating sequence mutations pre-LiftOn ORF search, altering gene annotation in translated and untranslated regions. (A) Frameshift mutation introduces early translation start. (B) Point mutations introduce early stop codons; LiftOn selects the longer part as proteins. (C) Point M: Methionine, the first amino acid; INDEL gap: DNA sequence insertion/deletion; UTR: Untranslated region; CDS: Coding sequence.
+    Schematic diagram illustrating sequence mutations pre-LiftOn ORF search, altering gene annotation in translated and untranslated regions. (A) Frameshift mutation introduces early translation start. (B) Point mutations introduce early stop codons; LiftOn selects the longer part as proteins. (C) Point mutation introduces a premature stop codon. (D) Stop codon loss extends the protein. (E-F) Point mutation introduces a loss of the start site, and the LiftOn ORF search algorithm finds a downstream or upstream start site.
+
+    M: Methionine, the first amino acid; INDEL gap: DNA sequence insertion/deletion; UTR: Untranslated region; CDS: Coding sequence.
 
 | 
 |
@@ -104,30 +96,8 @@ Frameshift mutations, corrected by aligning annotated coding sequences with the
 Mutation report
 +++++++++++++++++++++++++++++++++++
 
-LiftOn identifies biological differences between reference and target genomes by aligning DNA and protein sequences. It classifies protein-coding transcripts as "identical" or provides detailed reports for mutations, including "synonymous," "non-synonymous," "in-frame insertion," and "in-frame deletion." For severe mutations, it reports "frameshift," "start codon loss," "stop codon gain," and "stop codon loss," conducting an open reading frame search.
-
-|
-|
-
-.. _lifton_sequence_identity:
-DNA & protein transcript sequence identity score calculation
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-In order to evaluate and compare annotation files generated by LiftOn, Liftoff, and miniprot, we calculated DNA and protein sequence identity scores for each protein-coding transcript within their respective annotation files. To compute DNA sequence identity scores, LiftOn extracted transcript sequences by concatenating exon regions. Subsequently, pairwise alignments were carried out between each compiled transcript sequence and the corresponding sequence extracted from the reference genome LiftOn adopted the BLAST definition of percent identity :cite:p:`altschul1990basic`, defined as the number of matching bases in the two sequences over the number of alignment columns.
-
-In terms of protein sequence identity scores, LiftOn initially generated protein sequences by translating the concatenated CDS regions. Subsequently, pairwise alignments were conducted between each extracted protein sequence and the corresponding full-length protein, with the sequence identity score calculated up to the first encountered stop codon in the proteins. Differing slightly from the BLAST-style metric employed for DNA sequence identity, LiftOn compresses gaps in the reference alignment :cite:p:`hengleeseq_identity`, treating them as a single difference. This adjustment serves two purposes: (1) to accommodate potential repeat regions that might result in a longer protein in the target genome, thereby preventing the introduction of false positive splice junctions caused by insertions in the target sequence or deletions in the reference sequence; and (2) to avoid over-penalization during the annotation of the start codon in the upstream 5' untranslated region (UTR) or the stop codon placement in the downstream 3' UTR.
-
-In the chaining algorithm, the calculation of partial protein sequence identity is employed to determine the optimal matching CDS(s). This process is akin to computing the identity of the full-length protein sequence, with the key distinction being that it focuses on evaluating substrings of the protein.
-
-.. To evaluate and compare annotation files generated by LiftOn, Liftoff, and miniprot, we conducted an analysis of DNA and protein sequence identity scores for each protein-coding transcript within the respective annotation files.
-
-.. For calculating DNA sequence identity scores, LiftOn extracted transcript sequences by concatenating exon regions. Subsequently, pairwise alignments were performed between each compiled transcript sequence and the corresponding sequence extracted from the reference genome. LiftOn adopted the BLAST-style identity, defined as the number of matching bases over the number of alignment columns.
-
-.. Regarding protein sequence identity scores, LiftOn initially extracted protein sequences by translating the concatenation of coding sequence (CDS) regions. Then, pairwise alignments were conducted between each extracted protein sequence and the corresponding full-length protein, with the sequence identity score calculated up to the first encountered stop codon in the proteins. 
-
-.. Differing slightly from the BLAST-style metric employed for DNA sequence identity, LiftOn compresses consecutive leading and trailing gaps in the reference alignment, treating them as one difference. The adjustment is made because LiftOn conducts open reading frame searches for truncated genes (e.g., "frameshift," "stop codon gain," "stop codon missing," and "start codon lost") in order to prevent the over-penalization of annotating the start codon in the upstream 5' untranslated region (UTR) or the stop codon placement in the downstream 3' UTR. :cite:p:`hengleeseq_identity`
+LiftOn identifies biological differences between reference and target genomes by aligning DNA and protein sequences. It classifies protein-coding transcripts as "identical" or provides detailed reports on mutations, including "synonymous", "non-synonymous", "in-frame insertion", "in-frame deletion", "frameshift", "start codon loss", "stop codon gain", and "stop codon loss".
 
-.. It is noteworthy that miniprot lacks the capability to resolve overlapping loci and has the potential to map a single protein-coding transcript to multiple loci. To mitigate the potential bias arising from miniprot exhibiting a higher protein sequence identity score but originating from an incorrect gene locus, we predominantly relied on the Liftoff coordinates. We identified the corresponding miniprot annotation that exhibited overlap and shared the same transcript ID for the purpose of comparison. In scenarios where two miniprot annotated transcripts with identical IDs overlapped with the Liftoff protein, the selection criterion favored the transcript with the higher protein sequence identity score, thereby representing that specific protein-coding transcript.
 
 |
 |