AlexGW edited this page Mar 18, 2025 · 2 revisions

Recommended batch sizes

Increasing the batch size does not always increase inference speed; it helps only when all sequences are of a similar length.

When working with large numbers of sequences (10,000+):

  • If most sequences are of a similar length, increase batch size.
  • For example, on an A100 we have observed inference speeds increasing with batch sizes between 1024 and 8192 (for very large numbers of sequences of similar length).
  • When sequence lengths are highly diverse, inference may be faster at a reduced batch size of 32-512, because the autoregressive inference loop does not need to run for as many iterations on the shorter sequences.
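The effect of mixed lengths can be illustrated with a simplified cost model. This is only a sketch: it assumes each batch is decoded until its longest member finishes, and it ignores the GPU parallelism that makes large batches attractive in the first place. The function name `padded_token_steps` is illustrative and is not part of ANARCII.

```python
def padded_token_steps(lengths, batch_size):
    """Token-steps spent if every batch runs until its longest sequence finishes.

    Assumption: sequences are batched in the order given, and each batch
    costs (number of sequences) x (longest length in that batch).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Seven sequences of similar length plus one long outlier.
lengths = [120, 118, 122, 450, 119, 121, 117, 123]
useful = sum(lengths)  # token-steps that would be needed with no padding

for bs in (8, 4, 2):
    steps = padded_token_steps(lengths, bs)
    print(f"batch_size={bs}: {steps} token-steps, {steps - useful} spent on padding")
```

In this toy case the single long sequence forces every other member of its batch to wait, so smaller batches confine that cost to fewer sequences.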

When working with small numbers of sequences (100-10,000):

  • Reduce the batch size to 8-128 to ensure that the inference loop only runs as long as it needs to.
  • The greater the diversity of sequence lengths, the lower the batch size should be.
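The recommendations above could be folded into a small helper along these lines. This is a hedged sketch, not ANARCII code: the function name, the use of the coefficient of variation as a measure of length diversity, and the `cv_threshold` default are all assumptions made for illustration. Only the returned ranges come from the guidance above.

```python
from statistics import mean, pstdev


def suggest_batch_size_range(lengths, cv_threshold=0.25):
    """Return a (low, high) batch size range based on the guidance above.

    Assumptions (not from ANARCII): length diversity is measured by the
    coefficient of variation, and cv_threshold is an arbitrary cut-off.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    cv = pstdev(lengths) / mean(lengths)
    similar = cv < cv_threshold
    if len(lengths) >= 10_000:
        # Large runs: big batches only pay off when lengths are similar.
        return (1024, 8192) if similar else (32, 512)
    # Smaller runs: stay in the 8-128 range; prefer the low end when diverse.
    return (8, 128)


print(suggest_batch_size_range([120] * 10_000))
print(suggest_batch_size_range([100, 400] * 5_000))
print(suggest_batch_size_range([120] * 500))
```

The ranges are starting points; the best value within a range still depends on the hardware and on how diverse the sequence lengths are.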

Understanding sequence scores

Sequence scores are calculated by summing the logits of the integer tokens in a numbered sequence (the logit values of residues predicted to be insertions are not included). These scores were shown to differentiate between antibody sequences, TCRs and non-antibody sequences in our tests (see paper Figure 4B). However, for a small number of sequences we have observed minor variation in scores across different architectures and versions of Python/PyTorch. These differences are minimal.
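The calculation described above can be sketched as follows. The data layout is hypothetical: it assumes the predicted tokens are available as strings, that position tokens are integers written as digits, and that insertions are marked with a non-numeric token (shown here as "X"). ANARCII's internal representation may differ.

```python
def sequence_score(tokens, logits):
    """Sum the logits of integer (position) tokens, skipping insertions.

    Assumption: tokens[i] is the predicted token for residue i and
    logits[i] is the logit the model assigned to that token.
    """
    if len(tokens) != len(logits):
        raise ValueError("tokens and logits must have the same length")
    return sum(logit for token, logit in zip(tokens, logits) if token.isdigit())


# Toy example: the insertion token "X" carries a large logit but is ignored.
tokens = ["1", "2", "X", "3"]
logits = [2.0, 1.5, 9.0, 0.5]
print(sequence_score(tokens, logits))
```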

Another factor that may result in slight differences in sequence scores is the surrounding non-antibody content. If a sequence is passed to the model in isolation, rather than contained within a 200-residue block, the scores will again differ slightly due to the different context. Again, these differences are small.


Unknown mode

Unknown mode first passes all sequences through a classifier model, trained on the ANARCII training data, that rapidly classifies each sequence as an antibody or a TCR. The sequences are sorted into separate buckets accordingly. Each bucket is then passed as normal to the relevant antibody or TCR model, and the results are returned to the user.
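The routing described above can be sketched with stand-in functions. Everything here is hypothetical: `run_unknown_mode`, `classify` and `number_by_type` are illustrative names, and the toy classifier and numbering functions merely stand in for the real models.

```python
def run_unknown_mode(sequences, classify, number_by_type):
    """Classify each sequence, bucket by type, then run the matching model.

    sequences: list of (name, sequence) pairs.
    classify: callable mapping a sequence to a type label (hypothetical).
    number_by_type: dict mapping a type label to a callable that takes a
        bucket of (name, sequence) pairs and returns {name: result}.
    """
    buckets = {}
    for name, seq in sequences:
        buckets.setdefault(classify(seq), []).append((name, seq))

    results = {}
    for seq_type, bucket in buckets.items():
        results.update(number_by_type[seq_type](bucket))
    return results


# Toy stand-ins for the classifier and the two numbering models.
def toy_classify(seq):
    return "tcr" if seq.startswith("T") else "antibody"


def make_toy_model(label):
    return lambda bucket: {name: (label, len(seq)) for name, seq in bucket}


models = {"antibody": make_toy_model("antibody"), "tcr": make_toy_model("tcr")}
seqs = [("a", "QVQLV"), ("b", "TCRAB"), ("c", "EVQL")]
print(run_unknown_mode(seqs, toy_classify, models))
```

Bucketing first means each model sees only the sequences it was trained to number, and each bucket can be batched independently.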
