
Code Quality Benchmark

angtft edited this page Apr 28, 2022 · 54 revisions

To generate a benchmark, we ran softwipe on a collection of programs, most of which are bioinformatics tools from evolutionary biology. Some of the tools listed below (genesis, raxml-ng, repeatscounter, hyperphylo) were developed in our lab. The table below contains the code quality scores. Note that the scores are subject to change as we refine our scoring criteria and include more tools.

Softwipe scores for each category are assigned such that the "best" program in each category that is not an outlier obtains a 10 out of 10 score, and the "worst" program in each category that is not an outlier is assigned a 0 out of 10 score. An outlier is defined to be a value that lies outside of Tukey's fences.
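As a rough illustration of the scoring rule above, the following sketch computes Tukey's fences and maps per-category rates to a 0 to 10 scale. The fence multiplier k = 1.5, the linear interpolation between the best and worst non-outlier, and the clamping of outliers to 0 or 10 are assumptions made for this example; the function names are hypothetical, and softwipe's own scoring functions may differ.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences. k=1.5 is the conventional choice (assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def relative_scores(rates, higher_is_better=False):
    """Map rates to 0..10 so that the best non-outlier gets 10 and the worst gets 0.

    Assumptions: values in between are interpolated linearly, and outliers
    are clamped to the nearest end of the scale.
    """
    lower, upper = tukey_fences(rates)
    inliers = [r for r in rates if lower <= r <= upper]
    best = max(inliers) if higher_is_better else min(inliers)
    worst = min(inliers) if higher_is_better else max(inliers)
    if best == worst:
        return [10.0 for _ in rates]
    scores = []
    for r in rates:
        raw = 10.0 * (r - worst) / (best - worst)
        scores.append(min(10.0, max(0.0, raw)))
    return scores


# Hypothetical warnings-per-LOC rates; the last value is an outlier.
example_rates = [0.0, 0.01, 0.02, 0.03, 0.04, 1.0]
example_scores = relative_scores(example_rates)
print(example_scores)
```

In this example the outlier at 1.0 lies above the upper fence, so it does not stretch the scale and is simply clamped to 0.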

All code quality categories use relative scores. For instance, we calculate the number of compiler warnings per total Lines Of Code (LOC). Hence, we can use those relative scores to compare and rank the different programs in our benchmark. The overall score that is used for our ranking is simply the average over all score categories. You can find a detailed description of the scoring categories and the tools included in our benchmark below.

program overall relative_score compiler_and_sanitizer assertions cppcheck clang_tidy cyclomatic_complexity lizard_warnings unique kwstyle infer test_count
genesis-0.24.0 9.0 9.1 9.9 8.7 8.4 9.2 9.0 9.4 8.2 8.2 N/A 10.0
fastspar 8.3 8.6 9.6 2.0 9.9 9.9 8.8 7.9 8.8 6.4 9.7 10.0
axe-0.3.3 7.6 7.6 9.4 1.2 6.6 9.3 6.2 7.6 8.4 9.8 N/A 10.0
pstl 7.5 7.1 10.0 0.4 8.0 5.6 9.3 9.9 6.3 8.4 N/A 10.0
raxml-ng_v1.0.1 7.5 7.8 9.9 4.2 6.6 9.0 7.9 6.6 4.0 9.2 N/A 10.0
kahypar 7.3 7.6 6.7 2.4 8.0 N/A 9.2 9.6 3.3 9.1 N/A 10.0
bindash-1.0 7.2 6.9 8.3 8.8 5.8 7.1 8.7 9.5 8.2 8.5 N/A 0.0
ExpansionHunter-4.0.2 7.2 7.3 8.7 1.8 8.6 9.4 8.9 9.1 0.4 7.9 N/A 10.0
ripser-1.2.1 6.9 6.7 10.0 6.3 6.4 2.4 8.9 9.1 8.6 9.9 7.1 0.0
naf-1.1.0/unnaf 6.8 7.3 9.9 4.0 9.8 10.0 6.9 7.5 7.2 3.3 9.5 0.0
virulign-1.0.1 6.8 7.0 9.1 3.4 9.4 9.0 7.3 5.8 7.5 9.3 N/A 0.0
naf-1.1.0/ennaf 6.8 6.8 9.9 10.0 9.4 10.0 7.2 6.7 0.0 5.2 9.0 0.0
glucose-3-drup 6.7 6.7 8.6 10.0 5.2 9.4 8.7 8.4 8.5 1.4 N/A 0.0
Treerecs-v1.2 6.7 6.6 5.8 1.8 6.7 8.6 9.0 9.0 1.6 7.5 N/A 10.0
dawg-1.2 6.6 6.6 10.0 0.0 6.3 10.0 8.4 8.1 7.9 9.1 N/A 0.0
RepeatsCounter 6.6 6.1 7.7 0.0 7.0 6.8 9.0 10.0 9.3 9.5 N/A 0.0
samtools-1.11 6.5 6.6 8.6 1.2 7.4 9.1 3.8 2.2 8.2 6.3 8.1 9.9
bpp-4.3.8 6.4 6.4 9.8 9.3 7.1 8.9 2.8 2.0 6.6 9.3 7.9 0.0
swarm-3.0.0 6.3 6.1 10.0 0.3 9.3 3.8 8.0 7.7 4.3 9.9 10.0 0.0
usher-0.3.2 6.3 6.4 8.9 2.1 7.4 9.3 7.5 7.5 7.5 6.4 N/A 0.0
ntEdit-1.2.3 6.1 6.0 8.4 0.0 7.1 9.7 7.9 6.7 3.8 7.7 9.4 0.0
prank-msa 5.9 6.2 5.3 5.1 9.9 9.0 7.0 6.6 1.4 5.8 9.0 0.0
IQ-TREE-2.0.6 5.9 5.5 2.3 2.5 4.7 7.8 8.2 7.7 5.3 6.6 N/A 7.7
emeraLD 5.7 5.5 4.2 0.0 9.4 8.4 6.3 5.3 9.0 8.6 N/A 0.0
dna-nn-0.1 5.6 5.4 7.9 4.1 6.8 6.0 6.7 5.0 6.1 7.8 N/A 0.0
openmp 5.5 5.4 5.8 0.9 0.2 1.5 8.1 7.3 7.6 8.3 N/A 10.0
HLA-LA 5.5 5.5 7.9 10.0 4.1 9.5 5.0 4.1 2.9 3.1 8.0 0.0
BGSA-1.0 5.4 5.0 7.3 0.0 0.2 10.0 7.5 6.8 8.2 9.4 5.1 0.0
minimap2-2.17 5.3 4.9 6.8 2.6 5.2 6.6 6.1 5.2 8.0 5.1 7.6 0.0
ngsTools/ngsLD 5.3 4.9 9.0 0.0 7.3 6.1 5.0 3.9 8.3 7.9 N/A 0.0
Seq-Gen-1.3.4 5.3 5.0 8.9 0.0 6.8 8.3 5.7 5.2 8.9 2.5 6.3 0.0
defor 5.3 5.2 0.1 0.0 6.1 9.4 6.9 6.4 9.0 9.4 N/A 0.0
copmem-0.2 5.2 5.2 10.0 0.2 7.6 8.6 8.5 7.8 4.2 4.5 0.3 0.0
phyml-3.3.20200621 5.2 5.3 9.6 5.5 5.0 8.1 4.3 2.7 5.9 3.7 6.8 0.0
dr_sasa_n 4.8 5.1 0.4 0.0 9.8 10.0 2.3 1.6 9.2 9.9 N/A 0.0
SF2 4.8 4.9 10.0 1.3 4.6 7.9 3.0 0.8 3.3 6.9 10.0 0.0
vsearch-2.15.1 4.7 4.4 7.1 0.0 8.2 1.1 5.0 3.9 5.6 9.7 6.6 0.0
clustal-omega-1.2.4 4.7 5.1 7.4 3.1 6.9 8.8 3.9 2.5 5.3 3.9 N/A 0.2
cellcoal-1.0.0 4.6 4.1 9.7 0.0 6.2 7.5 0.8 0.1 7.2 6.9 8.1 0.0
ms 4.6 4.6 8.4 0.0 0.0 10.0 6.2 5.3 6.4 0.0 9.6 0.0
MrBayes-3.2.7a 4.3 4.0 9.6 1.4 8.2 7.1 0.0 0.1 3.8 4.5 8.1 0.0
Gadget-2.0.7 4.2 4.2 10.0 0.0 0.0 10.0 0.4 0.1 5.4 9.1 N/A 3.0
prequal 4.1 4.6 2.4 5.9 0.3 9.9 6.0 4.0 1.0 2.8 8.8 0.0
crisflash 4.1 4.1 5.9 0.0 3.9 10.0 5.4 4.1 6.2 4.9 0.5 0.0
cryfa-18.06 3.9 4.1 6.2 2.0 0.0 9.7 5.9 5.5 6.0 0.0 N/A 0.0
athena-public-version-21.0 3.9 3.4 3.1 0.0 1.7 8.2 4.5 2.5 0.6 9.1 8.7 0.3
sumo 3.8 3.8 0.0 1.2 6.6 9.4 8.0 7.4 0.0 0.5 N/A 0.7
PopLDdecay 3.8 3.6 9.2 0.0 9.6 10.0 0.1 0.0 0.0 0.0 8.6 0.0
gargammel 3.8 3.4 10.0 0.0 8.4 6.4 0.0 0.1 0.9 3.4 9.1 0.0
mafft-7.475 3.7 3.0 9.3 0.0 6.4 7.8 0.3 0.4 0.7 6.5 4.6 0.8
covid-sim-0.13.0 2.8 2.6 7.5 0.0 5.2 0.0 0.0 0.0 7.3 0.3 N/A 4.9
INDELibleV1.03 2.5 2.3 6.1 0.0 0.7 9.3 0.7 0.8 6.7 0.0 0.5 0.0
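The overall column can be reproduced from the category columns of the table above. The sketch below averages the ten category scores of a row and skips N/A entries; skipping N/A is an assumption, although it is consistent with the genesis and fastspar rows. The function name is hypothetical.

```python
def overall_score(category_scores):
    """Average the category scores, ignoring categories marked N/A (None here)."""
    present = [s for s in category_scores if s is not None]
    return round(sum(present) / len(present), 1)


# Category scores copied from the table above (infer is N/A for genesis).
genesis = [9.9, 8.7, 8.4, 9.2, 9.0, 9.4, 8.2, 8.2, None, 10.0]
fastspar = [9.6, 2.0, 9.9, 9.9, 8.8, 7.9, 8.8, 6.4, 9.7, 10.0]

print(overall_score(genesis))
print(overall_score(fastspar))
```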

Tools included

Bioinformatics-related tools:

  • indelible 1.03: simulates sequence data on phylogenetic trees (paper)
  • ms: population genetics simulations (paper)
  • mafft 7.429: multiple sequence alignment (paper)
  • mrbayes 3.2.6: Bayesian phylogenetic inference (paper)
  • bpp 3.4: multispecies coalescent analyses (paper)
  • tcoffee: multiple sequence alignment (paper)
  • prank 0.170427: multiple sequence alignment (paper)
  • sf (SweepFinder): population genetics (paper)
  • seq-gen 1.3.4: phylogenetic sequence evolution simulation (paper)
  • dawg 1.2: phylogenetic sequence evolution simulation (github)
  • repeatscounter: evaluates quality of a data distribution for phylogenetic inference (github)
  • raxml-ng 0.8.1: phylogenetic inference (paper)
  • genesis 0.22.1: phylogeny library (github)
  • minimap 2.17-r943: pairwise sequence alignment (paper)
  • Clustal Omega 1.2.4: multiple sequence alignment (paper)
  • samtools 1.9: utilities for processing SAM (Sequence Alignment/Map) files (paper)
  • vsearch 2.13.4: metagenomics functions (paper, github)
  • swarm 3.0.0: amplicon clustering (paper, github)
  • phyml 3.3.20190321: phylogenetic inference (paper)
  • IQ-TREE 1.6.10: phylogenetic inference (paper)
  • cellcoal 1.0.0: coalescent simulation of single-cell NGS genotypes (github)
  • treerecs 1.0: species- and gene-tree reconciliation (gitlab)
  • HyperPhylo: judicious hypergraph partitioning, for creating a data distribution for phylogenetic inference (paper)
  • HLA*LA: HLA (human leukocyte antigen) typing from linearly projected graph alignments (paper)
  • Dna-nn 0.1: implements a proof-of-concept deep-learning model to learn relatively simple features on DNA sequences (paper)
  • ntEdit 1.2.3: scalable genome sequence polishing (paper)
  • lemon: framework for rapidly mining structural information from Protein Data Bank (paper)
  • DEFOR: depth- and frequency-based somatic copy number alteration detector (paper)
  • naf 1.1.0: Nucleotide Archival Format for lossless reference-free compression of DNA sequences (paper)
  • ngsLD: evaluating linkage disequilibrium using genotype likelihoods (paper)
  • dr_sasa 0.4b: calculation of accurate interatomic contact surface areas for quantitative analysis of non-bonded molecular interactions (paper)
  • Crisflash: software to generate CRISPR guide RNAs against genomes annotated with individual variation (paper)
  • BGSA 1.0: global sequence alignment toolkit (paper)
  • virulign 1.0.1: codon-correct alignment and annotation of viral genomes (paper)
  • PopLDdecay 3.40: tool for linkage disequilibrium decay analysis (paper)
  • fastspar 0.0.10: rapid and scalable correlation estimation for compositional data (paper)
  • ExpansionHunter 3.1.2: tool to analyze variation in short tandem repeat regions (paper)
  • bindash 1.0: fast genome distance estimation (paper)
  • copMEM 0.2: finding maximal exact matches via sampling both genomes (paper)
  • cryfa 18.06: secure encryption tool for genomic data (paper)
  • emeraLD: rapid linkage disequilibrium estimation with massive datasets (paper)
  • axe 0.3.3: rapid sequence read demultiplexing (paper)
  • prequal: detecting non-homologous characters in sets of unaligned homologous sequences (paper)
  • SCIPhI-0.1.7: mutation detection in tumor cells (github)
  • UShER: a program that rapidly places new samples onto an existing phylogeny using maximum parsimony (github)
  • gargammel: "a set of programs aimed at simulating ancient DNA fragments" (github)

Other tools:

  • KaHyPar: hypergraph partitioning tool (website)
  • Athena++: magnetohydrodynamics (paper)
  • Gadget 2: simulations of cosmological structure formations (paper)
  • Candy Kingdom: modular collection of SAT solvers and tools for structure analysis in SAT problems (github)
  • glucose-3-drup: Glucose 3.0 (a SAT solver) with online DRUP proofs and proof traversal (github)
  • CovidSim 0.13.0: COVID-19 microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London (github)
  • Eclipse SUMO: traffic simulation package (github)
  • LLVM OpenMP (github)
  • LLVM Parallel-STL (github)
  • Ripser 1.2.1: "code for the computation of Vietoris–Rips persistence barcodes" (github)

Scoring categories

  • compiler and sanitizer: Here, we compile each benchmark tool using the clang compiler and count the number of warnings. We activate almost all warnings for this. We have weighted the warnings, such that each warning has a weight of 1, 2, or 3, where 3 is most dangerous (for instance, implicit type conversions that might result in precision loss are level 3 warnings). We calculate a weighted sum of clang warnings, where each warning that occurs in the compilation adds its level (1, 2, or 3) to the weighted sum. Additionally, we execute the tool with clang sanitizers (ASan and UBSan) and if the sanitizers find warnings, we add them to the weighted sum. Sanitizer warnings default to level 3. The compiler and sanitizer score is calculated from the weighted sum of warnings per total LOC.
  • assertions: The count of assertions (C-Style assert(), static_assert(), or custom assert macros, if defined) per total LOC.
  • cppcheck: The count of warnings found by the static code analyzer cppcheck per total LOC. Cppcheck categorizes its warnings; we have assigned each category a weight, similarly to the compiler warnings.
  • clang-tidy: The count of warnings found by the static code analyzer clang-tidy per total LOC. Clang-tidy categorizes its warnings; we have assigned each category a weight, similarly to the cppcheck and compiler warnings.
  • cyclomatic complexity: The cyclomatic complexity is a software metric that quantifies the complexity/modularity of a program; see the Wikipedia article on cyclomatic complexity for details. We use the lizard tool to assess the cyclomatic complexity of our benchmark tools. Keep in mind that the above table does not contain the real cyclomatic complexity values, but the scores, which rate all tools relative to each other regarding their cyclomatic complexity.
  • lizard warnings: The number of functions that are considered too complex, relative to the total number of functions. Lizard counts a function as "too complex" if its cyclomatic complexity, its length, or its parameter count exceeds a certain threshold value.
  • unique rate: The amount of unique code; a higher amount of code duplication decreases this value. The unique rate is obtained using lizard.
  • kwstyle: The count of warnings found by the static code style analyzer KWStyle per total LOC. We configure KWStyle using the KWStyle.xml file that is delivered with softwipe.
  • infer: We weight the warnings found by the static analyzer Infer and use the weighted warnings per LOC rate to calculate a score.
  • test count: We relate the amount of unit test LOC to the overall LOC count and compute the rate as test_code_loc/overall_loc. At the moment we keep the detection of unit test LOC simple and treat every file that has the keyword "test" in its path as a test code file.
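The per-LOC rates described in the list above can be sketched as follows. This is a minimal illustration, not softwipe's code: the function names and the example inputs are hypothetical, and the case-sensitive substring match on "test" is an assumption about how the keyword check works.

```python
def weighted_warning_rate(warning_levels, total_loc):
    """Weighted sum of warnings per LOC; each warning contributes its level (1, 2, or 3)."""
    return sum(warning_levels) / total_loc


def unit_test_rate(loc_by_file):
    """Compute test_code_loc / overall_loc.

    A file counts as test code if its path contains the keyword "test"
    (case-sensitive substring match, assumed for this sketch).
    """
    overall_loc = sum(loc_by_file.values())
    test_code_loc = sum(loc for path, loc in loc_by_file.items() if "test" in path)
    return test_code_loc / overall_loc


# Hypothetical inputs: four warnings with levels 1, 2, 3, 3 in a 1000-LOC program,
# and a two-file project in which one file lives under a test directory.
warning_rate = weighted_warning_rate([1, 2, 3, 3], 1000)
test_rate_value = unit_test_rate({"src/main.cpp": 300, "test/test_main.cpp": 100})
print(warning_rate, test_rate_value)
```

Because only the path is inspected, a production file such as a hypothetical src/latest.cpp would also be counted as test code under this rule, which is one reason the detection is described as simple.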

Analysis tool versions

For the benchmark we used the following analysis tool versions:

  • clang 11.0.0
  • clang-tidy 5.0.1
  • cppcheck 2.1
  • lizard 1.17.7
  • kwstyle latest git version (25.02.2021)
  • infer 0.17.0

Absolute values

For comparability, we provide the absolute values from which the above table is derived. The following table contains each program's total lines of pure code (by which it is sorted), total number of functions, and the absolute results for each scoring category. Note that these are already weighted results; for example, a level 3 compiler warning is counted as 3 warnings here.

program loc functions compiler sanitizer assertions cppcheck clang_tidy cyclomatic_complexity lizard_warnings unique kwstyle infer test_count
sumo 514811 23788 1664573 0 995 9585 9057 4.6 1285 0.7079 43493 N/A 2563
IQ-TREE-2.0.6 220709 10930 93864 0 852 6368 13767 4.2 527 0.9098 5827 N/A 10994
Treerecs-v1.2 171121 10189 40533 0 483 3140 6861 2.4 210 0.845 3314 N/A 64810
dr_sasa_n 146963 86 97039 0 0 182 22 11.5 15 0.9968 94 N/A 0
kahypar 109786 9732 20475 0 417 1207 N/A 1.7 72 0.8796 751 N/A 72051
MrBayes-3.2.7a 95597 962 1959 4 205 964 7969 22.6 287 0.8872 3984 242 0
openmp 91040 3530 21406 0 127 6600 22104 4.4 196 0.9479 1157 N/A 16895
raxml-ng_v1.0.1 87135 2545 645 0 572 1633 2602 4.7 181 0.8903 531 N/A 13049
samtools-1.11 78959 2321 6414 0 151 1125 2038 9.5 364 0.9626 2264 200 7626
mafft-7.475 77251 932 3275 0 0 1558 4849 17.8 226 0.81 2098 534 405
ExpansionHunter-4.0.2 72944 3945 5275 5 208 584 1296 2.8 74 0.7912 1157 N/A 18758
phyml-3.3.20200621 70845 1609 1800 1 596 1942 3762 9.0 235 0.9188 3301 298 0
athena-public-version-21.0 65302 1509 24518 1 3 2990 3325 8.8 229 0.8005 463 111 131
genesis-0.24.0 62886 3855 266 0 859 567 1472 2.4 49 0.9608 885 N/A 7658
bpp-4.3.8 41109 793 499 0 646 668 1314 10.8 129 0.9305 210 110 0
clustal-omega-1.2.4 34160 883 4970 0 162 576 1133 9.4 133 0.9106 1557 N/A 42
vsearch-2.15.1 24384 506 4039 0 0 242 6409 8.2 62 0.9142 65 107 0
prank-msa 24023 756 6334 0 188 12 660 6.0 54 0.8378 773 31 0
HLA-LA 23811 462 2817 0 1653 753 337 8.3 55 0.872 1217 62 0
usher-0.3.2 22140 849 1405 1 72 319 456 5.3 45 0.9463 610 N/A 0
covid-sim-0.13.0 13200 124 1857 0 0 350 9280 32.5 42 0.9434 1255 N/A 433
Gadget-2.0.7 12589 148 0 0 0 1534 4 16.9 47 0.9117 83 N/A 257
fastspar 11346 90 226 0 35 4 23 3.1 4 0.9779 310 4 9933
cellcoal-1.0.0 11000 66 189 0 0 229 793 14.7 21 0.9406 264 27 0
pstl 10380 1162 0 0 7 115 1303 1.6 2 0.9247 128 N/A 6617
INDELibleV1.03 9697 216 2150 0 0 543 199 14.9 45 0.9321 4252 139 0
minimap2-2.17 8841 339 1599 0 35 236 859 7.1 34 0.9569 334 28 0
swarm-3.0.0 7092 212 0 0 3 26 1204 4.6 10 0.8945 7 0 0
dawg-1.2 7058 256 0 0 0 146 0 3.9 10 0.9539 47 N/A 0
PopLDdecay 6557 57 292 0 0 16 3 19.5 20 0.4369 1418 12 0
SF2 5337 121 0 0 11 158 312 10.5 25 0.8789 129 0 0
glucose-3-drup 4772 479 390 0 149 126 78 3.3 16 0.9705 318 N/A 0
dna-nn-0.1 4768 210 574 1 30 85 541 6.4 22 0.923 82 N/A 0
ngsTools/ngsLD 4373 113 236 0 0 65 487 8.3 14 0.9643 69 N/A 0
Seq-Gen-1.3.4 3980 120 237 0 0 70 195 7.5 12 0.9828 222 19 0
gargammel 3444 17 0 0 0 31 357 48.7 5 0.8183 169 4 0
crisflash 3279 84 763 0 0 107 0 7.8 10 0.9238 128 47 0
copmem-0.2 3026 133 4 0 1 40 123 3.7 6 0.8939 125 48 0
axe-0.3.3 2781 60 94 0 5 53 54 7.0 3 0.9677 4 N/A 802
prequal 2600 99 1083 0 23 179 4 7.2 12 0.8228 139 4 0
ntEdit-1.2.3 2365 87 213 0 0 38 23 4.8 6 0.8867 42 2 0
cryfa-18.06 2216 74 473 5 7 370 20 7.3 7 0.9213 372 N/A 0
ms 2182 71 193 1 0 201 0 7.0 7 0.9263 641 1 0
emeraLD 1642 51 524 0 0 5 74 6.8 5 0.988 18 N/A 0
bindash-1.0 1622 88 152 0 23 38 133 3.2 1 0.963 19 N/A 0
naf-1.1.0/unnaf 1620 77 4 2 10 2 0 6.1 4 0.9415 80 1 0
naf-1.1.0/ennaf 1615 73 7 1 78 5 0 5.7 5 0.6041 60 2 0
BGSA-1.0 1405 30 216 0 0 100 0 5.3 2 0.9621 7 9 0
virulign-1.0.1 1149 46 56 0 6 4 33 5.6 4 0.9464 6 N/A 0
ripser-1.2.1 1053 105 0 0 10 21 221 2.8 2 0.9742 1 4 0
defor 695 27 602 0 0 15 11 6.2 2 0.9876 3 N/A 0
RepeatsCounter 243 19 30 2 0 4 22 2.4 0 1.0 1 N/A 0

How to create the benchmark

To calculate this benchmark, the results of all softwipe runs must be saved into a results directory that has one subdirectory for each tool that should be included in the benchmark. Most importantly, for each tool, the output of softwipe must be saved into a file called "softwipe_output.txt", which has to lie in the corresponding subdirectory for that tool. For example, the directory structure has to look like this:

results/
results/tool1/
results/tool1/softwipe_output.txt
results/tool2/
results/tool2/softwipe_output.txt
...

Then, the script calculate_score_table.py can be used to parse all the softwipe output files and generate a CSV file that contains all scores. The script requires the path to the results directory (results/ in our example). The script contains a list called FOLDERS that contains the names of all subdirectories that will be included in the benchmark (tool1, tool2, etc. in our example). To add or remove a tool to/from the benchmark, edit this list.
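Before running the script, it can help to check that every entry in FOLDERS has its output file in place. The following sketch is not part of softwipe; the helper name is hypothetical, and the FOLDERS list here is a stand-in for the one in calculate_score_table.py. It builds a temporary results directory only to demonstrate the check.

```python
import tempfile
from pathlib import Path

# Stand-in for the FOLDERS list; tool3 is deliberately left without output.
FOLDERS = ["tool1", "tool2", "tool3"]


def find_softwipe_outputs(results_dir, folders):
    """Return (found, missing): output paths by tool name, and tools without an output file."""
    found, missing = {}, []
    for name in folders:
        path = Path(results_dir) / name / "softwipe_output.txt"
        if path.is_file():
            found[name] = path
        else:
            missing.append(name)
    return found, missing


# Demonstration on a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tool1", "tool2"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "softwipe_output.txt").write_text("placeholder\n")
    found, missing = find_softwipe_outputs(tmp, FOLDERS)

print(sorted(found))
print(missing)
```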

The script recalculates all scores from the rates, rather than parsing the scores directly. This is done so that softwipe doesn't need to be rerun for all tools if the scoring functions get changed. The script simply uses softwipe's scoring functions from scoring.py. These scoring functions use the values calculated by the compare_results.py script, which are the best/worst values that are not outliers, as mentioned above.
