- Install EvoDiff according to its instructions. The EvoDiff authors recommend Python 3.8.
- To get clustering to work with `protclust`, you need `mmseqs2` installed and available on the command line. If you are using conda or micromamba, this is as simple as running `conda install -c bioconda mmseqs2`. You can visit the mmseqs2 GitHub for more installation options.
- Begin with the scripts in the `preprocessing` directory, starting with `download_data.py`. By altering the `main` function, you can change which EC class you download. If you are just starting out, I recommend EC 5 since it is the smallest.
- Next, open `filter.ipynb`. Modify the `pandas` code at the beginning to import your data from wherever you stored it. From there, the notebook will combine the data, assign it labels, and cluster it using `protclust`.
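
Before running the clustering step in the notebook, it can help to confirm that the MMseqs2 executable is actually visible from your environment. The snippet below is a minimal sketch, not part of this repository: `find_tool` is a hypothetical helper name, and it assumes the MMseqs2 binary is installed under the name `mmseqs`, which is what the bioconda package provides.

```python
import shutil


def find_tool(name: str = "mmseqs"):
    """Return the full path of `name` if it is on PATH, otherwise None.

    The default name "mmseqs" is an assumption about how MMseqs2 was installed.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        # Same install command as suggested in the list above.
        print("mmseqs not found on PATH; try: conda install -c bioconda mmseqs2")
    else:
        print(f"Found mmseqs at {path}")
```

If the check fails inside Jupyter but works in your shell, the notebook kernel is probably running in a different environment from the one where you installed mmseqs2.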
At this point, you are ready to run. Try running `train.py`, making sure that the data directory correctly points to your stored data. `train_full.py` implements several useful features such as autocast, learning rate warmup and scheduling, and advanced logging.
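
To give a feel for what learning rate warmup and scheduling mean in practice, here is a small standalone sketch. It is not taken from `train_full.py`: the function name `lr_at_step`, the choice of linear warmup followed by cosine decay, and all default values are assumptions made for illustration, so check the script itself for the schedule it actually uses.

```python
import math


def lr_at_step(step, base_lr=1e-4, warmup_steps=1000, total_steps=10000):
    """Learning rate at a given optimizer step (illustrative values only).

    Linear warmup from base_lr / warmup_steps up to base_lr, then cosine
    decay from base_lr down to 0 at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up linearly so early updates stay small.
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: fraction of the post-warmup schedule completed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 500, 999, 1000, 5500, 10000):
        print(s, lr_at_step(s))
```

Warmup keeps the first updates small while optimizer statistics are still noisy, and the decay lowers the step size as training converges. A function like this could be wrapped in a scheduler such as PyTorch's `LambdaLR` by returning the multiplier relative to `base_lr` rather than the absolute rate.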