Score Calculation Algorithm
The application converts DNA codon (triplet) counts to ratios and compares those ratios between pairs of organism sequences to generate a similarity score. Normally, the user submits a single sequence and the application compares it against many others (typically all of RefSeq). Each such request is a single job. The rest of this page describes the algorithm used to generate the results.
When passed a sequence, the application first counts its codons. This produces a JSON object that maps every possible codon to the number of times it appears in the sequence. These counts are stored in the input table, which holds the counts for user-input sequences.
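As an illustration of the counting step, here is a minimal Python sketch. It is not the application's actual code: the function name `count_codons` is hypothetical, and the sketch assumes the reading frame starts at the first base, that trailing bases which do not fill a triplet are ignored, and that triplets containing anything other than A, C, G, or T are skipped.

```python
import json
from itertools import product

BASES = "ACGT"


def count_codons(sequence: str) -> dict[str, int]:
    """Count non-overlapping triplets, reading in frame from the first base."""
    # Start every one of the 64 possible codons at zero so the result
    # always has the same keys, whatever the sequence contains.
    counts = {"".join(triplet): 0 for triplet in product(BASES, repeat=3)}
    seq = sequence.upper()
    # Step three bases at a time; a trailing partial triplet is ignored.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # assumption: skip triplets with ambiguous bases
            counts[codon] += 1
    return counts


counts = count_codons("ATGGCCGCCTAA")
# Serialize only the non-zero entries to keep the printed JSON short.
print(json.dumps({codon: n for codon, n in counts.items() if n}))
```

The full `counts` dictionary has 64 keys, which matches the description of an object that maps all possible codons to their counts.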
Each codon in the sequence encodes a particular amino acid, and in almost all cases multiple codons encode the same amino acid. The table that maps codons to amino acids is called a "codon table". Each ratio used in the algorithm is a codon's count divided by the total count of all codons that encode the same amino acid. The application calculates these ratios for the two sequences being compared and uses them to calculate the similarity score.
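The ratio calculation can be sketched as follows. This is an illustrative example, not the application's implementation. It assumes the standard genetic code (NCBI translation table 1), since the text does not say which codon table is used, and it assumes a ratio of 0.0 when an amino acid does not occur at all. The name `codon_ratios` is hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), in TCAG order.
# "*" marks the stop codons. Which table the application uses is an assumption.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}


def codon_ratios(counts: dict[str, int], table: dict[str, str] = CODON_TABLE) -> dict[str, float]:
    """Return count(codon) / total count of codons for the same amino acid."""
    # First pass: total count per amino acid.
    totals: dict[str, int] = {}
    for codon, aa in table.items():
        totals[aa] = totals.get(aa, 0) + counts.get(codon, 0)
    # Second pass: each codon's share of its amino acid's total.
    ratios: dict[str, float] = {}
    for codon, aa in table.items():
        total = totals[aa]
        # Assumption: an amino acid that never appears gives a ratio of 0.0.
        ratios[codon] = counts.get(codon, 0) / total if total else 0.0
    return ratios


example_counts = {"GCC": 3, "GCT": 1, "ATG": 2}
ratios = codon_ratios(example_counts)
print(ratios["GCC"], ratios["GCT"], ratios["ATG"])
```

In this example GCC and GCT both encode alanine, so their ratios are 3/4 and 1/4. ATG is the only codon for methionine, so its ratio is 1.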
In a normal comparison operation against RefSeq, the application first collects the ratios for the input organism from the input table. Then it runs through each row of the refseq table, which contains the codon counts for each RefSeq sequence, and calculates the ratios for each. Note that these codon ratios depend only on the individual sequence and the codon table.
Once the ratios are in hand, the application sums the differences between the two sequences' ratios to get the similarity score. Sequence pairs with very similar codon usage (and thus similar codon ratios) will therefore have a score close to zero, while pairs with very different codon usage will have a higher score. The results table holds one row for each RefSeq sequence the input sequence was compared against, recording the resulting similarity score and shuffle score.
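A minimal sketch of the scoring step is shown below. The text says only "sum of the differences", so taking the absolute value of each difference is an assumption here, made because the score is described as being close to zero for similar sequences and higher otherwise. The function name `similarity_score` and the reference names in the usage loop are hypothetical.

```python
def similarity_score(ratios_a: dict[str, float], ratios_b: dict[str, float]) -> float:
    """Sum of per-codon absolute ratio differences (absolute value is assumed)."""
    # Use the union of keys so a codon missing from one side counts as 0.0.
    codons = set(ratios_a) | set(ratios_b)
    return sum(abs(ratios_a.get(c, 0.0) - ratios_b.get(c, 0.0)) for c in codons)


input_ratios = {"GCC": 0.75, "GCT": 0.25}

# Hypothetical stand-in for rows of the refseq table, already converted to ratios.
reference_ratios = {
    "ref_identical": {"GCC": 0.75, "GCT": 0.25},
    "ref_different": {"GCC": 0.5, "GCT": 0.5},
}

# One score per reference sequence, lowest (most similar) first.
scores = {name: similarity_score(input_ratios, r) for name, r in reference_ratios.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, score)
```

Under this assumption, identical codon usage gives a score of exactly 0, and the score grows as the ratios diverge.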
The shuffle score option introduces a third sequence, the shuffle sequence, which is also compared against the second sequence. The shuffle sequence is generated by performing a cyclic shift of one base pair to the right and then randomly shuffling the result. The score from this comparison is reported as the shuffle score.