FAQs
Increasing the batch size only increases inference speed when all sequences are of a similar length.
- If most sequences are of a similar length, increase the batch size. For example, on an A100 we have observed inference speed increasing with batch sizes between 1024 and 8192 (for very large numbers of sequences of similar length).
- When sequence lengths are highly diverse, inference may be faster at a reduced batch size of 32-512, as the autoregressive inference loop does not need to run for as many iterations on the shorter sequences. Reducing the batch size further, to 8-128, ensures the loop only runs for as long as it needs to.
- In general, the greater the diversity of sequence lengths, the better a lower batch size performs.
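The guidance above can be sketched as a simple heuristic: measure how spread out the sequence lengths are and pick a batch size accordingly. This is an illustrative sketch only; the function name and thresholds are not part of ANARCII.

```python
import statistics

def suggest_batch_size(seq_lengths, diverse=128, homogeneous=4096):
    """Illustrative heuristic: use a large batch when lengths are
    homogeneous, a small batch when they are diverse.

    The 0.1 coefficient-of-variation cutoff and the default batch
    sizes are assumptions for demonstration, not tuned values.
    """
    mean = statistics.mean(seq_lengths)
    spread = statistics.pstdev(seq_lengths) / mean  # coefficient of variation
    return homogeneous if spread < 0.1 else diverse
```

In practice you would benchmark a few batch sizes on your own hardware; the point is simply that the optimal value shrinks as length diversity grows.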
Sequence scores are calculated by summing the logits of the integer tokens in a numbered sequence (not the logit values of residues predicted to be insertions). These scores were shown to differentiate between antibody sequences, TCRs and non-antibody sequences in our tests (see paper Figure 4B). However, we have observed minor variation for a small number of sequences across different architectures and versions of Python/PyTorch. These differences are minimal.
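The scoring rule described above amounts to summing per-position logits while skipping positions predicted to be insertions. A minimal sketch, assuming per-position logits and an insertion mask are available (the function name and inputs are illustrative, not the ANARCII API):

```python
def sequence_score(position_logits, is_insertion):
    """Sum the logit of each numbered (integer-token) position,
    skipping positions the model predicts to be insertions.

    position_logits: one float per numbered position.
    is_insertion: parallel booleans, True where the position is a
    predicted insertion. Both names are assumptions for illustration.
    """
    return sum(
        logit
        for logit, insertion in zip(position_logits, is_insertion)
        if not insertion
    )
```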
Another factor that may cause slight differences in sequence scores is the surrounding non-antibody content. If a sequence is passed to the model in isolation, versus being contained within a 200-residue block, the scores will again differ slightly due to the impact of the alternate context. Again, these differences are small.
Unknown mode first passes all sequences to a classifier model, trained on the ANARCII training data, that rapidly determines whether each sequence is an antibody or a TCR. All sequences pass through this model and are dropped into separate buckets. Each bucket is then passed as normal to the relevant antibody or TCR model, and the results are returned to the user.
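The classify-then-bucket-then-dispatch flow above can be sketched as follows. All function names here are illustrative placeholders, not the ANARCII API; the sketch also shows one way to return results in the original input order after bucketing.

```python
def run_unknown_mode(sequences, classify, antibody_model, tcr_model):
    """Illustrative sketch of unknown-mode routing.

    classify(seq) -> "antibody" or "tcr" stands in for the fast
    classifier; antibody_model / tcr_model stand in for the two
    numbering models, each taking a list of sequences.
    """
    buckets = {"antibody": [], "tcr": []}
    placement = []  # (bucket label, index within bucket) per input
    for seq in sequences:
        label = classify(seq)
        placement.append((label, len(buckets[label])))
        buckets[label].append(seq)

    # Number each bucket with the matching model.
    results = {
        "antibody": antibody_model(buckets["antibody"]),
        "tcr": tcr_model(buckets["tcr"]),
    }

    # Reassemble results in the original input order.
    return [results[label][idx] for label, idx in placement]
```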