Home
The ANARCII package is a suite of transformer language models that take antigen receptor sequences, such as antibodies or TCRs, and predict their IMGT (or alternative scheme) numbering through auto-regressive inference.
from anarcii import Anarcii
# Select the type of sequence (antibody, tcr, shark or unknown) and instantiate the model.
model = Anarcii(seq_type="antibody")
seqs = ["DVVMTQTPLSLPVSLGDQASISCRSSQSLVHSNGNTYLNWYLQKAGQSPKLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDLGIYFCSQTTHVPPTFGGGTKLEIKR",
"LPGARCAYDMTQTPASVEVAVGGTVTIKCQASQSISTYLSWYQQKPGQRPKLLIYRASTLASGVSSRFKGSGSGTEFTLTISGVECADAATYYCQQGWSSSNVENVFGGGTEVVVKG"]
# Call the number method on a list of sequences, or on a path to a FASTA or PDB file.
results = model.number(seqs)
# Returns a dictionary with one entry per sequence; each value is itself a dictionary.
# The inner dictionary contains the numbering and other information.
for key, value in results.items():
    print(key)
    print(value.keys())
    print(value["numbering"])
    print("")

Results look like:
Using device CUDA with 12 CPUs
Sequence 1
dict_keys(['numbering', 'chain_type', 'score', 'query_start', 'query_end', 'error', 'scheme'])
[((1, ' '), 'D'), ((2, ' '), 'V'), ((3, ' '), 'V'), ((4, ' '), 'M'), ((5, ' '), 'T'), ((6, ' '), 'Q'), ((7, ' '), 'T'), ((8, ' '), 'P'), ((9, ' '), 'L'), ((10, ' '), 'S'), ((11, ' '), 'L'), ((12, ' '), 'P'), ((13, ' '), 'V'), ((14, ' '), 'S'), ((15, ' '), 'L'), ((16, ' '), 'G'), ((17, ' '), 'D'), ((18, ' '), 'Q'), ((19, ' '), 'A'), ((20, ' '), 'S'), ((21, ' '), 'I'), ((22, ' '), 'S'), ((23, ' '), 'C'), ((24, ' '), 'R'), ((25, ' '), 'S'), ((26, ' '), 'S'), ((27, ' '), 'Q'), ((28, ' '), 'S'), ((29, ' '), 'L'), ((30, ' '), 'V'), ((31, ' '), 'H'), ((32, ' '), 'S'), ((33, ' '), '-'), ((34, ' '), 'N'), ((35, ' '), 'G'), ((36, ' '), 'N'), ((37, ' '), 'T'), ((38, ' '), 'Y'), ((39, ' '), 'L'), ((40, ' '), 'N'), ((41, ' '), 'W'), ((42, ' '), 'Y'), ((43, ' '), 'L'), ((44, ' '), 'Q'), ((45, ' '), 'K'), ((46, ' '), 'A'), ((47, ' '), 'G'), ((48, ' '), 'Q'), ((49, ' '), 'S'), ((50, ' '), 'P'), ((51, ' '), 'K'), ((52, ' '), 'L'), ((53, ' '), 'L'), ((54, ' '), 'I'), ((55, ' '), 'Y'), ((56, ' '), 'K'), ((57, ' '), 'V'), ((58, ' '), '-'), ((59, ' '), '-'), ((60, ' '), '-'), ((61, ' '), '-'), ((62, ' '), '-'), ((63, ' '), '-'), ((64, ' '), '-'), ((65, ' '), 'S'), ((66, ' '), 'N'), ((67, ' '), 'R'), ((68, ' '), 'F'), ((69, ' '), 'S'), ((70, ' '), 'G'), ((71, ' '), 'V'), ((72, ' '), 'P'), ((73, ' '), '-'), ((74, ' '), 'D'), ((75, ' '), 'R'), ((76, ' '), 'F'), ((77, ' '), 'S'), ((78, ' '), 'G'), ((79, ' '), 'S'), ((80, ' '), 'G'), ((81, ' '), '-'), ((82, ' '), '-'), ((83, ' '), 'S'), ((84, ' '), 'G'), ((85, ' '), 'T'), ((86, ' '), 'D'), ((87, ' '), 'F'), ((88, ' '), 'T'), ((89, ' '), 'L'), ((90, ' '), 'K'), ((91, ' '), 'I'), ((92, ' '), 'S'), ((93, ' '), 'R'), ((94, ' '), 'V'), ((95, ' '), 'E'), ((96, ' '), 'A'), ((97, ' '), 'E'), ((98, ' '), 'D'), ((99, ' '), 'L'), ((100, ' '), 'G'), ((101, ' '), 'I'), ((102, ' '), 'Y'), ((103, ' '), 'F'), ((104, ' '), 'C'), ((105, ' '), 'S'), ((106, ' '), 'Q'), ((107, ' '), 'T'), ((108, ' '), 'T'), ((109, ' '), 'H'), ((110, ' '), '-'), ((111, ' '), 
'-'), ((112, ' '), '-'), ((113, ' '), '-'), ((114, ' '), 'V'), ((115, ' '), 'P'), ((116, ' '), 'P'), ((117, ' '), 'T'), ((118, ' '), 'F'), ((119, ' '), 'G'), ((120, ' '), 'G'), ((121, ' '), 'G'), ((122, ' '), 'T'), ((123, ' '), 'K'), ((124, ' '), 'L'), ((125, ' '), 'E'), ((126, ' '), 'I'), ((127, ' '), 'K'), ((128, ' '), '-')]
Sequence 2
dict_keys(['numbering', 'chain_type', 'score', 'query_start', 'query_end', 'error', 'scheme'])
[((1, ' '), 'A'), ((2, ' '), 'Y'), ((3, ' '), 'D'), ((4, ' '), 'M'), ((5, ' '), 'T'), ((6, ' '), 'Q'), ((7, ' '), 'T'), ((8, ' '), 'P'), ((9, ' '), 'A'), ((10, ' '), 'S'), ((11, ' '), 'V'), ((12, ' '), 'E'), ((13, ' '), 'V'), ((14, ' '), 'A'), ((15, ' '), 'V'), ((16, ' '), 'G'), ((17, ' '), 'G'), ((18, ' '), 'T'), ((19, ' '), 'V'), ((20, ' '), 'T'), ((21, ' '), 'I'), ((22, ' '), 'K'), ((23, ' '), 'C'), ((24, ' '), 'Q'), ((25, ' '), 'A'), ((26, ' '), 'S'), ((27, ' '), 'Q'), ((28, ' '), 'S'), ((29, ' '), 'I'), ((30, ' '), '-'), ((31, ' '), '-'), ((32, ' '), '-'), ((33, ' '), '-'), ((34, ' '), '-'), ((35, ' '), '-'), ((36, ' '), 'S'), ((37, ' '), 'T'), ((38, ' '), 'Y'), ((39, ' '), 'L'), ((40, ' '), 'S'), ((41, ' '), 'W'), ((42, ' '), 'Y'), ((43, ' '), 'Q'), ((44, ' '), 'Q'), ((45, ' '), 'K'), ((46, ' '), 'P'), ((47, ' '), 'G'), ((48, ' '), 'Q'), ((49, ' '), 'R'), ((50, ' '), 'P'), ((51, ' '), 'K'), ((52, ' '), 'L'), ((53, ' '), 'L'), ((54, ' '), 'I'), ((55, ' '), 'Y'), ((56, ' '), 'R'), ((57, ' '), 'A'), ((58, ' '), '-'), ((59, ' '), '-'), ((60, ' '), '-'), ((61, ' '), '-'), ((62, ' '), '-'), ((63, ' '), '-'), ((64, ' '), '-'), ((65, ' '), 'S'), ((66, ' '), 'T'), ((67, ' '), 'L'), ((68, ' '), 'A'), ((69, ' '), 'S'), ((70, ' '), 'G'), ((71, ' '), 'V'), ((72, ' '), 'S'), ((73, ' '), '-'), ((74, ' '), 'S'), ((75, ' '), 'R'), ((76, ' '), 'F'), ((77, ' '), 'K'), ((78, ' '), 'G'), ((79, ' '), 'S'), ((80, ' '), 'G'), ((81, ' '), '-'), ((82, ' '), '-'), ((83, ' '), 'S'), ((84, ' '), 'G'), ((85, ' '), 'T'), ((86, ' '), 'E'), ((87, ' '), 'F'), ((88, ' '), 'T'), ((89, ' '), 'L'), ((90, ' '), 'T'), ((91, ' '), 'I'), ((92, ' '), 'S'), ((93, ' '), 'G'), ((94, ' '), 'V'), ((95, ' '), 'E'), ((96, ' '), 'C'), ((97, ' '), 'A'), ((98, ' '), 'D'), ((99, ' '), 'A'), ((100, ' '), 'A'), ((101, ' '), 'T'), ((102, ' '), 'Y'), ((103, ' '), 'Y'), ((104, ' '), 'C'), ((105, ' '), 'Q'), ((106, ' '), 'Q'), ((107, ' '), 'G'), ((108, ' '), 'W'), ((109, ' '), 'S'), ((110, ' '), 'S'), ((111, ' '), 
'-'), ((112, ' '), 'S'), ((113, ' '), 'N'), ((114, ' '), 'V'), ((115, ' '), 'E'), ((116, ' '), 'N'), ((117, ' '), 'V'), ((118, ' '), 'F'), ((119, ' '), 'G'), ((120, ' '), 'G'), ((121, ' '), 'G'), ((122, ' '), 'T'), ((123, ' '), 'E'), ((124, ' '), 'V'), ((125, ' '), 'V'), ((126, ' '), 'V'), ((127, ' '), 'K'), ((128, ' '), '-')]

The suite contains 5 models:
- Antibody-accuracy - predicts chain H, K or L and numbering; speed is ~50,000 sequences per minute on an A100.
- Antibody-speed - predicts chain H, K or L and numbering; speed is ~90,000 sequences per minute on an A100.
- TCR-accuracy - the antibody accuracy model with expanded weights (vocabulary expansion), fine-tuned on TCR sequences to predict chain A, B, G or D and numbering (does not predict H, K or L).
- TCR-speed - the antibody speed model with expanded weights (vocabulary expansion), fine-tuned on TCR sequences to predict chain A, B, G or D and numbering (does not predict H, K or L).
- Shark/VNAR (accuracy model only) - the antibody accuracy model fine-tuned on a small dataset of shark VNAR sequences from PLAbDab-Nano (https://opig.stats.ox.ac.uk/webapps/plabdab-nano/). These sequences were numbered using the antibody accuracy model, with conditioning to correct the outputs so that they conform fully to IMGT definitions.
The accuracy models show slightly better agreement with the original ANARCI numbering predictions on held-out test sets. They are also more consistent at identifying conserved residues in rare sequence types not seen in training. We therefore recommend the accuracy models when compute and time are not limiting, or when working with rare or novel sequence types and formats that are poorly represented in antibody databases such as OAS, which formed the training data for these models (https://opig.stats.ox.ac.uk/webapps/oas).
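To judge whether compute or time is a constraint, a rough runtime estimate can be made from the throughput figures quoted for the antibody models. The sketch below is only back-of-the-envelope arithmetic: the function and dictionary names are illustrative and not part of ANARCII, the figures apply to an A100, and model loading and file I/O are ignored.

```python
# Approximate throughput quoted for the antibody models on an A100
# (sequences per minute). Other hardware will differ.
THROUGHPUT_PER_MIN = {"accuracy": 50_000, "speed": 90_000}


def estimated_minutes(n_sequences, mode):
    """Rough wall-clock estimate in minutes for numbering n_sequences."""
    return n_sequences / THROUGHPUT_PER_MIN[mode]


n = 1_000_000
for mode in ("accuracy", "speed"):
    print(f"{mode}: ~{estimated_minutes(n, mode):.1f} min for {n:,} sequences")
# accuracy: ~20.0 min for 1,000,000 sequences
# speed: ~11.1 min for 1,000,000 sequences
```

For a million antibody sequences the difference is on the order of ten minutes on this hardware, so the speed models mainly pay off for much larger datasets.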
The antibody models can identify conserved residues and CDR regions in TCR sequences. However, their chain calls (H, K or L) do not correspond to TCR chain types, and the numbering will be much less accurate than that of the TCR-specific models.
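One way to spot a mismatch between the selected model and the input is to tally the `chain_type` field of the results. The following is a minimal sketch, not ANARCII functionality: it assumes `chain_type` holds the single-letter codes listed for each model, the helper name is hypothetical, and the `mock_results` values are invented purely for illustration.

```python
from collections import Counter

# Chain types each model family is described as predicting.
EXPECTED_CHAINS = {
    "antibody": {"H", "K", "L"},
    "tcr": {"A", "B", "G", "D"},
}


def tally_chain_types(results, seq_type):
    """Count chain calls and report any outside the expected set."""
    counts = Counter(entry["chain_type"] for entry in results.values())
    expected = EXPECTED_CHAINS[seq_type]
    unexpected = {c: n for c, n in counts.items() if c not in expected}
    return counts, unexpected


# Hypothetical results with the same shape as the output of model.number();
# the chain_type values here are made up.
mock_results = {
    "Sequence 1": {"chain_type": "K"},
    "Sequence 2": {"chain_type": "K"},
    "Sequence 3": {"chain_type": "B"},
}

counts, unexpected = tally_chain_types(mock_results, "antibody")
print(unexpected)  # {'B': 1}
```

A non-empty `unexpected` dictionary suggests that some inputs may be better suited to a different model.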
The Shark/VNAR model serves as an example of how conditioning can generate training data for fine-tuning the models. This model successfully identifies the large CDR2 gap characteristic of VNAR formats, correctly calls the CDR3 regions and identifies the starting residues.
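The CDR regions mentioned above can be read directly off the `numbering` list once the scheme is known. The sketch below assumes IMGT numbering and uses the standard IMGT CDR boundaries (CDR1 27-38, CDR2 56-65, CDR3 105-117). `extract_cdrs` is an illustrative helper, not part of ANARCII, and the numbering is rebuilt from the "Sequence 1" output shown earlier, which has no insertion codes.

```python
# Standard IMGT CDR boundaries (inclusive). Other schemes use different ranges.
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def extract_cdrs(numbering):
    """Return the ungapped CDR sequences from an IMGT numbering list.

    numbering: list of ((position, insertion_code), residue) tuples,
    where '-' marks an unoccupied position.
    """
    cdrs = {}
    for name, (start, end) in IMGT_CDRS.items():
        cdrs[name] = "".join(
            aa
            for (pos, _ins), aa in numbering
            if start <= pos <= end and aa != "-"
        )
    return cdrs


# Gapped residues for IMGT positions 1-128 of "Sequence 1" above.
GAPPED = (
    "DVVMTQTPLSLPVSLGDQASISCRSSQSLVHS"  # 1-32
    "-"                                 # 33
    "NGNTYLNWYLQKAGQSPKLLIYKV"          # 34-57
    "-------"                           # 58-64
    "SNRFSGVP"                          # 65-72
    "-"                                 # 73
    "DRFSGSG"                           # 74-80
    "--"                                # 81-82
    "SGTDFTLKISRVEAEDLGIYFCSQTTH"       # 83-109
    "----"                              # 110-113
    "VPPTFGGGTKLEIK"                    # 114-127
    "-"                                 # 128
)
numbering = [((i, " "), aa) for i, aa in enumerate(GAPPED, start=1)]

cdrs = extract_cdrs(numbering)
print(cdrs)
# {'CDR1': 'QSLVHSNGNTY', 'CDR2': 'KVS', 'CDR3': 'SQTTHVPPT'}
```

With real output, pass `value["numbering"]` from the results dictionary in place of the reconstructed list, after checking that `value["scheme"]` indicates IMGT.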