SCARAP v1.0.0

@SWittouck SWittouck released this 15 Nov 15:58

After having been in development for a long time, SCARAP has now reached the first version that I consider mature! To celebrate this milestone, I have put this version of SCARAP on PyPI: it can now be installed with pip install scarap 🎉

A special thank you to @TheOafidian for supplying some code structure updates, bug fixes, a test suite via GitHub Actions and a very nice logo!

Features:

  • The pan module delimits orthogroups in a binary splitting process of initial gene clusters. Previously, each splitting step involved the selection of 50 representative sequences, which were aligned and hierarchically clustered. Three parameters were added to the module to give the user control over this process:
    • --max-align (default: 512): Only sequence clusters with more sequences than this number will go through representative sequence selection; otherwise, all sequences will be considered representatives.
    • --max-reps (default: 40): The maximum number of representatives to use for splitting large clusters. Initially, this number of representatives will be selected.
    • --min-reps (default: 32): The minimum number of representatives to use for splitting large clusters. Subclusters inherit representative sequences from their parent cluster if they are available. If the number of inherited representatives is larger than MIN-REPS, they will be re-used.
  • The values of the --max-align and --min-reps parameters were optimized, leading to a speedup of the pan and core modules in many situations.
  • The search module received a slight speedup (and therefore indirectly the core module as well).
  • If supplied fasta filenames are not unique, the problematic names will now be listed in the log.
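
The interplay of the three new pan module parameters can be illustrated with a small sketch. This is not SCARAP's actual code: the function name plan_representatives and its return values are hypothetical, and the fallback branch (selecting a fresh set of MAX-REPS representatives when too few are inherited) is an assumption, since the notes above only describe the re-use case.

```python
def plan_representatives(n_seqs, n_inherited=0,
                         max_align=512, max_reps=40, min_reps=32):
    """Decide how representatives might be chosen for one splitting step.

    Hypothetical helper, not part of SCARAP. Returns a (strategy, n_reps)
    tuple where strategy is "all", "reuse" or "select".
    """
    # Small clusters skip representative selection entirely:
    # every sequence is treated as a representative.
    if n_seqs <= max_align:
        return ("all", n_seqs)
    # Large clusters re-use inherited representatives when there are
    # more of them than --min-reps.
    if n_inherited > min_reps:
        return ("reuse", n_inherited)
    # Assumption: otherwise a fresh set of up to --max-reps is selected.
    return ("select", min(max_reps, n_seqs))


# A cluster of 300 sequences is below the default --max-align of 512.
print(plan_representatives(300))
# A subcluster of 2000 sequences that inherited 35 representatives.
print(plan_representatives(2000, n_inherited=35))
# A subcluster of 2000 sequences that inherited only 10 representatives.
print(plan_representatives(2000, n_inherited=10))
```

With the defaults, only clusters of 513 or more sequences reach the selection step, and inherited representatives are re-used once there are at least 33 of them.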

Bug fixes:

  • The presence of the ">" character in a fasta header no longer leads to an error.
  • In the build module, the core filter, rather than the core prefilter, is now applied before score cutoff training (as intended).
  • MAFFT would sometimes mistakenly identify amino acid sequences as nucleotide sequences; this was fixed.
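
As an illustration of the first bug fix, a fasta reader can tolerate ">" inside a header by treating the character as a record delimiter only when it is the first character of a line. The sketch below is a generic example of that idea, not the change that was made in SCARAP; the function name read_fasta is hypothetical.

```python
def read_fasta(lines):
    """Parse fasta text into a list of (header, sequence) tuples.

    Hypothetical helper, not part of SCARAP. A new record starts only
    when ">" is the first character of a line, so a ">" that occurs
    later in the header is kept as part of the header.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines are concatenated without whitespace.
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">gene1 A>B transporter\n", "MKV\n", "LLA\n", ">gene2\n", "MST\n"]
for header, seq in read_fasta(example):
    print(header, seq)
```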