Hi, I was just doing some benchmark comparisons against HIVdb and found some mismatches. It seems the Stanford API alphabetizes the amino acids in the mutation text, whereas sierra-local outputs them in encounter order.
For example (in the "text" field):
// sierra-local
{"position": 215, "AAs": "SY", "text": "T215YS"}
// Stanford API
{"position": 215, "AAs": "SY", "text": "T215SY"}
The AAs field is alphabetical in both (comes from the aligner), but the text field differs.
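In the meantime, for anyone else benchmarking, the text fields can be normalized before comparing. This is only a rough sketch: `normalize_mutation_text` is a hypothetical helper (not part of sierra-local or the Stanford API), and it assumes the text has the simple form reference letter, position, amino acid letters. Anything else (insertions, deletions, stop codons) is passed through untouched.

```python
import re

# Matches simple substitution text such as "T215YS": one reference
# letter, a position, then one or more amino acid letters.
_SIMPLE_MUT = re.compile(r"^([A-Z])(\d+)([A-Z]+)$")


def normalize_mutation_text(text):
    """Return the mutation text with its amino acids sorted alphabetically.

    Hypothetical helper for benchmark comparisons only. Text that does not
    look like a simple substitution is returned unchanged.
    """
    match = _SIMPLE_MUT.match(text)
    if match is None:
        return text
    ref, pos, aas = match.groups()
    return f"{ref}{pos}{''.join(sorted(aas))}"


# Both outputs from the example above compare equal after normalizing.
print(normalize_mutation_text("T215YS") == normalize_mutation_text("T215SY"))
```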
I think this is because in nucaminohook.py (lines 479-483) the code uses two different sources:
- mut['AminoAcidText'] → from the aligner (alphabetical) → becomes the AAs field
- translate_na_triplet(codon) → re-translated locally (encounter order) → becomes the text field
The translate_na_triplet function joins amino acids in the order they are encountered during codon enumeration rather than sorting them alphabetically. A simple fix could be to use the aligner's output for both fields instead of re-translating:
gene_muts.update(
    {position - left: (mut['ReferenceText'],
                       mut['AminoAcidText'],
                       mut['AminoAcidText']  # Instead of translate_na_triplet(codon)
                       )}
)
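If re-translating locally needs to stay, an alternative would be to sort the translated amino acids before joining them. The sketch below is a standalone illustration of the idea, not the actual sierra-local code: `translate_encounter_order` and `translate_sorted` are made-up names, and I'm assuming translate_na_triplet expands ambiguous bases in roughly this way. The codon TMC is just an example mixture that yields Y and S.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_encounter_order(codon):
    """Translate an ambiguous codon, keeping amino acids in the order found."""
    seen = []
    for bases in product(*(IUPAC[b] for b in codon)):
        aa = CODON_TABLE["".join(bases)]
        if aa not in seen:
            seen.append(aa)
    return "".join(seen)


def translate_sorted(codon):
    """Same translation, but with the amino acids sorted alphabetically."""
    return "".join(sorted(translate_encounter_order(codon)))


print(translate_encounter_order("TMC"))  # TAC then TCC
print(translate_sorted("TMC"))
```

Using the aligner's output for both fields is still the smaller change; sorting only matters if the local translation has to be kept for some other reason.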
According to https://hivdb.stanford.edu/_wrapper/pages/documentPage/user_guide.pdf, "The order of the mutations is not relevant," so the two outputs should still mean the same thing. But for consistency with other packages that build on sierra-local, following Stanford's alphabetical convention would make things easier all around.