Proteins are fundamental to nearly all biological processes, and understanding the relationship between their amino acid sequences, structures, and functions is a key challenge in computational biology. Direct Coupling Analysis (DCA) has emerged as a powerful statistical approach for inferring residue–residue interactions from the vast archives of protein data, enabling contact prediction and generative modeling of protein families. In this work, we present a Python reimplementation of arDCA, an autoregressive DCA model that factorizes the joint sequence probability distribution into a product of conditionals. This factorization allows exact likelihood maximization and efficient parameter inference without requiring Monte Carlo sampling. Our implementation introduces a block-sparse formulation for computing autoregressive logits, which reduces computational overhead and memory usage. We evaluate the model on protein families from Pfam, demonstrating smooth convergence of the negative log-likelihood (NLL) and high structural plausibility of generated sequences as measured by AlphaFold’s pLDDT scores. While our implementation remains slower than the original Julia version, it provides an easily extensible and practical framework for protein sequence generation.
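To make the autoregressive factorization concrete, here is a minimal NumPy sketch of how a sequence's exact log-likelihood can be computed position by position. It is illustrative only and is not taken from `ardca.py`: the function name `ar_log_likelihood`, the parameter names `h` and `J`, and their dense array layouts are assumptions, and the block-sparse formulation mentioned above is not reproduced here.

```python
import numpy as np

def ar_log_likelihood(seq, h, J):
    """Exact log-probability of one integer-encoded sequence.

    Hypothetical layout (an assumption of this sketch):
      seq: (L,) integers in [0, q)
      h:   (L, q) per-position fields
      J:   (L, L, q, q) couplings; only entries with j < i are read
    """
    L, q = h.shape
    total = 0.0
    for i in range(L):
        # Logits of position i depend only on the preceding residues.
        logits = h[i].copy()
        for j in range(i):
            logits += J[i, j, :, seq[j]]
        # Numerically stable log-softmax over the q possible symbols.
        logits -= logits.max()
        log_probs = logits - np.log(np.exp(logits).sum())
        total += log_probs[seq[i]]
    return total
```

Because each conditional is normalized over only `q` symbols, the joint probability is normalized by construction, so the likelihood can be evaluated and maximized exactly.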
- LF3199684.typ: typst source file
- LF3199684.pdf: compiled pdf file
- references.bib: bibliography for typst

In the code directory:
- out/: contains all the produced images and plots
- ardca.py: model, training and evaluation of arDCA
- classes.py: dataclass definitions
- utils.py: utilities for preprocessing, training, and analyzing results
- training.ipynb: training notebook for the model and experiments
- analyzing_results.ipynb: notebook used to analyze the model-generated samples
- requirements.txt: all the python dependencies

In addition, ColabFold was used to generate images, plots, and pLDDT values.